Posts on Miguel Fernandez Arce

Making your own array

Tue, 01 Aug 2023 00:00:00 +0000

Just use std::vector.

Which is what I did until a few weeks ago, when I decided enough was enough!

It was about time I made an array type for the needs of my library. In this post, I will go through the design decisions taken while doing so: Creating a custom array container.

Why a custom array?

Until now, I used a wrapper around std::vector, which was okay… No, really. But:

It makes solutions to simple problems unnecessarily complex.
Its API is almost completely built using iterators.
It has an allocator type in its template parameters.
There is no built-in (or easy) way to have inline memory (try with allocators if you want to sacrifice 500 lines of code to the gods and obtain shitty syntax in return).
It has an extensive & rigid API with years of features that I don’t want or need to maintain.
std::vector<bool>? Really? Burn it.

And many others really, but most importantly:

It’s fun to do your own stuff sometimes, not going to lie.

These points are not necessarily the wrong choice for the standard library considering its scope, but for me, they very much are.

We, humans, should understand how the tools we use work. Otherwise, we could be using them the wrong way or the wrong tool. And containers are a tool like any other. If you ever read code inside std::vector, no matter which std implementation it was, I wouldn’t be surprised if you chose to not stick around.

Std implementations are often unintelligible, in good part, because the design they are built on top of has a long list of requirements that adds up.

Some honorable mentions from the previous points:

The iterator-based API forces functions to be templates themselves, where parameters could be iterators of any type, and many extra checks need to run. The abstraction layer it adds over simply using indexes is not free either.
Allocators make compatibility across otherwise equivalent vectors a nightmare. They try to solve memory allocation, yet fail to be of real use in real scenarios, and they multiply the number of compiled class variations (which slows down compilation). On top of that, they guarantee a complex implementation.

About Pipe’s Requirements

Pipe, the library that will contain these shiny new arrays, is the foundational library I use on most of my C++ projects. It has many great experimental features that I have repeatedly failed to share with others like they deserve.

I have used this library for more than 9 years, and overcoming the limitations of std::vector was increasingly frustrating, especially when I needed to squeeze out extra performance with features like inline memory.

By “inline memory” I mean having N items contained directly inside the array’s instance.

I needed an Array type that:

Natively supports inline memory, without sacrificing the syntax or user experience.
Integrates with arenas to control the memory it allocates.
Has a combined index and iterator based API with an extensive list of helpers.
Its implementation MUST be simple.

The Design

Let’s see how we can achieve reasonable simplicity for arrays.

In Pipe, any container with a contiguous list of elements, whether it owns it or not, inherits from IArray (I welcome better name suggestions).

This class is not intended for the user to use directly, but it provides shared functionality for finding, checking, sorting, swapping and iterating the elements in the list.

Two classes use IArray (and some aliases too):

View: Points to one or more contiguous elements that it does not own. These elements can be literals, arrays, or any other pointer with a size. Equivalent to std::span, or what is sometimes called an ArrayView.
InlineArray: Owns a contiguous, mutable list of elements. It can use an optional inline buffer for performance. Because of this, it does not use allocators. Somewhat equivalent to std::vector or other array implementations.
Array: An alias for InlineArray with an inline buffer size of 0, meaning it uses exclusively allocated memory.

There can be other aliases like SmallArray that use different combinations of the inline buffer, but the point is that there is a single implementation class for arrays.

Allocation

Let’s go back to “does not use allocators”:

Over the years, I have seen and used many implementations of arrays. Like everything, they have advantages and disadvantages. It is a balance. However, those that used templated allocators were particularly rigid, verbose or complex (or all three).

Usually, you want to solve two problems with allocators:

Control how and where the container’s memory is allocated.
Inject and use inline elements in the container. Optionally, you may want to share these allocators with different containers.

Sharing allocators sounds ideal, but is very problematic when you also want to achieve the other points. Different containers allocate differently. If an allocator is used in an array, you know it only needs to maintain a single block of memory. However, maps, sets, or page buffers don’t work this way, and can allocate many blocks. They have requirements that can be incompatible with each other.

Most allocators also need to know the type the container holds, so they need to be templates. They have a dependency between the memory and the type since many times they are the ones doing the copying of elements, among other operations.

Okay, but they surely must have many uses… right?

I think it is pretty rare, I would even dare to say extremely rare, to come across a container allocator in everyday code that is not there for inline memory or for a very specific use.

If we imagine we had an “inline allocator” in different APIs, it could look like:

std::vector<String, InlineAllocator<String, 5>> values; // standard library
TArray<String, InlineAllocator<5>> values; // unreal engine

In Pipe, this would look a bit different:

InlineArray<String, 5> values;

I chose to split the problem of allocation:

Inline memory is handled by the array itself.
Allocated memory is handled by arenas.

Inline Memory

It is handled by the array itself. When we use, for example, InlineArray<String, 5>, the array can hold up to 5 elements inline. If we exceed this capacity, it switches to allocated memory. Similarly, if the elements fit again, it moves back from allocated memory to the inline buffer. Of course, you can use an inline buffer of size 0; this is actually very common.

The user does not need to remember how to use inline memory since it is always available on the container.
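To make the switch between inline and allocated memory more concrete, here is a minimal sketch. It is not Pipe's InlineArray: the name SmallBuffer, its functions, the growth factor and the restriction to trivially copyable types are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical sketch, not Pipe's InlineArray. Only trivially copyable types.
template<typename T, std::size_t InlineCapacity>
class SmallBuffer
{
	static_assert(std::is_trivially_copyable_v<T>, "sketch only supports trivially copyable types");

	// Inline storage lives inside the object (1 slot minimum so the array is never empty).
	alignas(T) unsigned char inlineData[sizeof(T) * (InlineCapacity > 0 ? InlineCapacity : 1)];
	T* data = reinterpret_cast<T*>(inlineData);
	std::size_t size = 0;
	std::size_t capacity = InlineCapacity;

	// Moves the elements into a bigger heap block.
	void Grow(std::size_t newCapacity)
	{
		T* newData = static_cast<T*>(std::malloc(sizeof(T) * newCapacity));
		assert(newData != nullptr);
		std::memcpy(newData, data, sizeof(T) * size);
		if (!IsInline())
		{
			std::free(data);
		}
		data = newData;
		capacity = newCapacity;
	}

public:
	SmallBuffer() = default;
	SmallBuffer(const SmallBuffer&) = delete;
	SmallBuffer& operator=(const SmallBuffer&) = delete;
	~SmallBuffer()
	{
		if (!IsInline())
		{
			std::free(data);
		}
	}

	bool IsInline() const { return data == reinterpret_cast<const T*>(inlineData); }
	std::size_t Size() const { return size; }
	const T& operator[](std::size_t index) const
	{
		assert(index < size);
		return data[index];
	}

	void Add(const T& value)
	{
		if (size == capacity)
		{
			Grow(capacity > 0 ? capacity * 2 : 4);  // exceeded capacity: switch to (or grow) heap memory
		}
		new (data + size) T(value);
		++size;
	}

	void RemoveLast()
	{
		assert(size > 0);
		--size;
	}

	// Moves the elements back into the inline buffer if they fit again.
	void Shrink()
	{
		if (!IsInline() && size <= InlineCapacity)
		{
			T* heapData = data;
			std::memcpy(inlineData, heapData, sizeof(T) * size);
			std::free(heapData);
			data = reinterpret_cast<T*>(inlineData);
			capacity = InlineCapacity;
		}
	}
};
```

With SmallBuffer<int, 2>, the first two Add calls stay inline, the third one moves everything to the heap, and removing one element followed by Shrink brings the data back inline.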

Allocated Memory

It is handled exclusively by arenas. Arenas handle allocation following a particular algorithm. They are non-templated, and completely independent of the container itself.

To give you an example: I use them for reflection, where a single linear arena is assigned to all containers allocating reflection data. This means reflection has great data locality and reduces cache misses. It makes operations like checking inheritance much faster, since we access memory that is very close.

An array can be assigned an arena during its construction.

MultiLinearArena arena;
Array<String> names{arena};

When no arena is provided, a global or scope arena is used. I should write another post about arenas…

With this design, an array can be copied or moved to another array with a different arena seamlessly, just like it does with the inline buffer. No extra code is needed to support arenas or inline memory, and if we need to control allocation, we can use an arena of our choice.
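As a rough illustration of a non-templated arena that is independent of the container, here is a sketch. The Arena interface, LinearArena and ArenaBuffer below are hypothetical names and signatures, not Pipe's actual API. The point is that the container only stores an Arena pointer, and the element type never leaks into the arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical, non-templated arena interface. Names are illustrative only.
class Arena
{
public:
	virtual ~Arena() = default;
	virtual void* Alloc(std::size_t size, std::size_t alignment) = 0;
	virtual void Free(void* ptr, std::size_t size) = 0;
};

// Linear arena: bumps an offset inside a single block, so allocations end up next to each other.
class LinearArena : public Arena
{
	unsigned char* block;
	std::size_t capacity;
	std::size_t offset = 0;

public:
	explicit LinearArena(std::size_t bytes)
	    : block(static_cast<unsigned char*>(std::malloc(bytes))), capacity(bytes)
	{}
	LinearArena(const LinearArena&) = delete;
	LinearArena& operator=(const LinearArena&) = delete;
	~LinearArena() override { std::free(block); }

	void* Alloc(std::size_t size, std::size_t alignment) override
	{
		// Round the current address up to the requested alignment (assumed to be a power of two).
		const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(block);
		const std::uintptr_t aligned =
		    (base + offset + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
		const std::size_t newOffset = static_cast<std::size_t>(aligned - base) + size;
		if (newOffset > capacity)
		{
			return nullptr;
		}
		offset = newOffset;
		return reinterpret_cast<void*>(aligned);
	}

	void Free(void*, std::size_t) override {}  // individual frees are no-ops in a linear arena

	std::size_t Used() const { return offset; }
};

// A container-like buffer that only knows about the Arena interface.
template<typename T>
struct ArenaBuffer
{
	Arena* arena;
	T* data = nullptr;
	std::size_t capacity = 0;

	explicit ArenaBuffer(Arena& inArena) : arena(&inArena) {}
	ArenaBuffer(const ArenaBuffer&) = delete;
	ArenaBuffer& operator=(const ArenaBuffer&) = delete;
	~ArenaBuffer()
	{
		if (data)
		{
			arena->Free(data, sizeof(T) * capacity);
		}
	}

	void Reserve(std::size_t count)
	{
		assert(data == nullptr);
		data = static_cast<T*>(arena->Alloc(sizeof(T) * count, alignof(T)));
		capacity = data ? count : 0;
	}
};
```

Two buffers of different element types can share the same LinearArena, and their memory lands in the same block, which is the kind of locality described for reflection data above.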

Indexes

On the topic of indexes, there is not that much to mention.

Simply put, most of the functions in the array prefer indexes or counts over iterators. This makes their use and implementation easier.

void Insert(i32 atIndex, const Type& value);
bool RemoveAt(i32 index, const bool shouldShrink = true);
Type* At(i32 index) const;

Of course, iterators are still supported to allow range-for or iterator algorithms, but the API prefers the use of indexes for simplicity.

Unsafe

Sometimes when we work with arrays, we might know the inputs we provide are safe. For that reason, many functions in Pipe have an unsafe version which skips some safety checks. Use them at your own risk.

This can help gain back some performance in the large scale of things.

Very often, the safe versions simply call the unsafe version after running those checks:

bool RemoveAt(i32 index, const bool shouldShrink)
{
	if (IsValidIndex(index))
	{
		RemoveAtUnsafe(index, shouldShrink);
		return true;
	}
	return false;
}

void Swap(i32 firstIdx, i32 secondIdx)
{
	if (IsValidIndex(firstIdx) && IsValidIndex(secondIdx) && firstIdx != secondIdx)
	{
		SwapUnsafe(firstIdx, secondIdx);
	}
}

Their names always end with “Unsafe”. This makes it likely that the safe versions show up first while coding, and it is a constant reminder of the risk to the user.

Plurals

It is always better to do an operation once for N items than N times for one item each. This is why, in this design, many operations the array does (like adding or removing) can be performed in bulk.

You can add, remove, swap or sort many at once. This can provide a substantial performance benefit, while also simplifying the user code.

This can be done by providing another span or array to the function, or a range of indexes or iterators:

// Some examples of bulk operations in InlineArray:
void Append(const IArray<Type>& values);
void Assign(const IArray<Type>& values);
i32 Remove(const IArray<const Type>& items, bool shouldShrink = true);
i32 Remove(const IArray<i32>& indices, bool shouldShrink = true);
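To show why a bulk operation can beat N single calls, here is a sketch of removing many indices in a single pass. RemoveAtIndices is a hypothetical helper written on top of std::vector only to keep the example self-contained; it is not how Pipe implements Remove. Calling a single-element removal N times shifts the tail of the array N times, while this version moves every surviving element at most once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: removes all elements at the given indices in one pass.
// Returns the number of elements removed. Invalid indices are ignored.
template<typename T>
int RemoveAtIndices(std::vector<T>& values, std::vector<int> indices)
{
	// Sort and deduplicate so every index is visited once, in order.
	std::sort(indices.begin(), indices.end());
	indices.erase(std::unique(indices.begin(), indices.end()), indices.end());

	std::size_t next = 0;  // position of the next index to remove
	while (next < indices.size() && indices[next] < 0)
	{
		++next;  // skip negative (invalid) indices
	}

	std::size_t write = 0;  // where the next kept element goes
	for (std::size_t read = 0; read < values.size(); ++read)
	{
		if (next < indices.size() && static_cast<std::size_t>(indices[next]) == read)
		{
			++next;  // this element is dropped
			continue;
		}
		if (write != read)
		{
			values[write] = std::move(values[read]);
		}
		++write;
	}

	const int removed = static_cast<int>(values.size() - write);
	values.erase(values.begin() + static_cast<std::ptrdiff_t>(write), values.end());
	return removed;
}
```

Duplicated and out-of-range indices are tolerated here, which is a design choice of this sketch and not something the post specifies.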

Final Notes

For anyone interested in taking a look at the full implementation, you can find it here (PipeArrays.h) along with the library (Pipe).

I am sure I also forgot important details or didn’t explain something correctly, so feel free to leave a comment and feedback, and if you happened to like it, let me know! I don’t write often, but your encouragement will help :)

Finally, I am aware that topics like this have such a wide range of uses that my described solution (which works for my needs) will be as good for some as it will be bad for others. Let’s keep it a constructive conversation anyway.

Until next time, Muit.

A new approach to ECS APIs

Thu, 10 Feb 2022 00:00:00 +0000

Let’s talk about a different approach to ECS I have been rumbling about lately. Well, specifically, about how we query entities, manage dependencies and access/modify data.

What is ECS you ask?

Fair question! ECS (Entity-Component-System) is an architectural pattern based on DOD (data-oriented design), where you have three main elements:

Entities: They are just an identifier and don’t hold any data.
Components: Structs of data associated with a single entity (1 entity can have 1 component of each type). They don’t have any code/logic.
Systems: Functions that operate on entities and components.

I could explain ECS in greater detail, but there are plenty of resources online already that will do a better job than me. This talk is a good start, and for more resources, you can also read this.

I personally also like to consider Utilities the fourth, secret child of ECS. Utilities are functions that can be reused between systems. Any code that is not part of a system is a utility. One example is hierarchy, where multiple systems need to add, remove, or transfer children between entities.

Current approach to ECS APIs

Now that we know what ECS is and the basics of how it works, let’s talk about how we could improve it.

In most ECS libraries I have used so far, there is always the concept of a view, or a filter. This is a tool that allows fast iteration of entities following a set of conditions. You can say, for example, “iterate all entities with ‘Player’ and ‘Movement’ components, but ignore those with ‘Frozen’ component”.

Implementation details may differ, but I will be using the popular library entt as an example (it’s great, check it out). In this library, a “view” caches pools from the world when it is created, and uses them to check for entities matching some included and excluded components.

So let’s make an example with entt where we move agents (a system):

void MoveAgents(entt::registry& registry, float deltaTime)
{
	// We create a view matching all agents with movement and transform components
	auto view = registry.view<const Agent, const Movement, Transform>();
	// We iterate all entities in the view
	for(Id entity : view)
	{
		// We get components and apply position based on velocity
		const auto& movement = view.get<const Movement>(entity);
		auto& transform = view.get<Transform>(entity);
		transform.position += movement.velocity * deltaTime;
	}
}

Okay, so far, we are just fine. But what if we have props that can move? But only when they are enabled.

void MoveProps(entt::registry& registry, float deltaTime)
{
	auto view = registry.view<const Prop, const Movement, Transform>();
	for(Id entity : view)
	{
		const auto& prop = view.get<const Prop>(entity);
		if (prop.isEnabled)
		{
			// Can we reuse this?
			const auto& movement = view.get<const Movement>(entity);
			auto& transform = view.get<Transform>(entity);
			transform.position += movement.velocity * deltaTime;
		}
	}
}

Well, we get some duplicated code. We could extract it into a utility, but how?

Problems sharing code

If we wanted to share code as utilities, we would be extremely limited, especially if we want to track which data we are reading and writing, which is crucial for scheduling (more on that later).

// We could use references, but it's not very practical since we need to get the components outside anyway
void ApplyMovement(const Movement& movement, Transform& transform, float deltaTime)
{
	transform.position += movement.velocity * deltaTime;
}

// We could pass the registry, but then we lose the fast access to pools from views.
// Also, we do not know from outside which components we are reading and writing
void ApplyMovement(entt::registry& registry, entt::entity entity, float deltaTime)
{
	const auto& movement = registry.get<const Movement>(entity); // Accessing component directly through world is slow
	auto& transform = registry.get<Transform>(entity);
	transform.position += movement.velocity * deltaTime;
}

// We could pass the view as a template parameter.
// But templates need to be defined where they are used, meaning all shared functions will most likely need to live in a header.
// Add a different view type for each use of the same function, and you get slower compile times.
// Templates aside, views are also not intended to control access, and they can not do all the things you can do with the world.
template<typename View>
void ApplyMovement(View view, entt::entity entity, float deltaTime)
{
	// "template" is required here because the type of view depends on a template parameter
	const auto& movement = view.template get<const Movement>(entity);
	auto& transform = view.template get<Transform>(entity);
	transform.position += movement.velocity * deltaTime;
}

Along with the problems sharing code (utilities) between systems, you will also have a really hard time tracking dependencies as your project grows if you want to do any sort of scheduling.

Problems scheduling

As I mentioned in the previous section, scheduling is a huge problem, and we should simplify it.

Scheduling helps us organize hundreds of system functions so they execute safely across multiple threads. To achieve that, we need to know where we read and modify components:

We can safely read components of the same type from many threads at the same time.
We can’t safely read components of the same type while any other thread is writing them.

We can, of course, schedule by hand, but this quickly becomes unmaintainable. That’s why there are many ways to automate it. But, as I said, you need to be able to know what you are doing inside a function from outside, or this won’t be possible.

// If we pass around the registry, we don't know our dependencies
// We don't know which components this function is accessing and modifying
void MoveProps(entt::registry& registry, float deltaTime) {}
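
Once every system declares what it reads and writes, the two rules above become a simple check. The sketch below is only an illustration: SystemAccess, CanRunInParallel and the use of strings as component names are assumptions, not part of entt or Rift.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical description of what a system touches. Component types are named by string for brevity.
struct SystemAccess
{
	std::string name;
	std::set<std::string> reads;
	std::set<std::string> writes;
};

// True if both sets share at least one component.
bool Intersects(const std::set<std::string>& a, const std::set<std::string>& b)
{
	for (const auto& item : a)
	{
		if (b.count(item) > 0)
		{
			return true;
		}
	}
	return false;
}

// Two systems can run at the same time only if neither writes a component the other reads or writes.
// Reading the same component from both is fine.
bool CanRunInParallel(const SystemAccess& a, const SystemAccess& b)
{
	return !Intersects(a.writes, b.writes)
	    && !Intersects(a.writes, b.reads)
	    && !Intersects(a.reads, b.writes);
}
```

This check is conservative: two systems that write the same component type are treated as conflicting, even if they would end up touching different entities.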

Problems controlling data-flow

One of the points of DOD is that all code serves a single purpose: It converts data (input) into other data (output). “It’s all about the data.”

A view that we can mostly only iterate limits us when we want to write proper algorithms that use multiple steps to operate on data efficiently.

Fixing the problems

Let’s see what we need:

We need to be able to easily share code
We need to express dependencies when reading and writing components, allowing us to schedule
We need to be able to apply complex data flows, allowing more cache and cpu friendly code
It has to be blazing fast
Errors must be simple and straightforward …proceeds to look at templates with disapproval

I experimented with a solution to this for a while and ended up implementing it in Rift. It solves all the points above, so let’s rebuild the previous examples with it:

// We pass an Access with the types we can write, and those we can only read (const)
void MoveProps(TAccess<const Prop, const Movement, Transform> access, float deltaTime)
{
	for(Id entity : ListAll<Prop, Movement, Transform>(access))
	{
		const auto& prop = access.Get<const Prop>(entity);
		if (prop.isEnabled)
		{
			// Can we reuse this? Yes
			//const auto& movement = access.Get<const Movement>(entity);
			//auto& transform = access.Get<Transform>(entity);
			//transform.position += movement.velocity * deltaTime;

			// So, let's reuse it
			ApplyMovement(access, entity, deltaTime);
		}
	}
}

// The parent access (MoveProps) must have these components.
// If it doesn't, we will get proper errors telling us what's missing.
void ApplyMovement(TAccess<const Movement, Transform> access, Id entity, float deltaTime)
{
	const auto& movement = access.Get<const Movement>(entity);
	auto& transform = access.Get<Transform>(entity);
	transform.position += movement.velocity * deltaTime;
}

Access

An access represents a set of components for efficient access and dependency tracking. By design, it can’t be iterated directly. We have other tools for that.

It is very cheap to copy (only a pool pointer copy for each component type).
It provides instant access into component pools.
It is far simpler and less template-heavy than views.
It can be constructed implicitly from the ECS world or from other, bigger accesses.

Accesses come in two flavors: the compile-time assisted TAccess and the runtime-based Access.

It also makes sense to pass them to functions as const references. They are cheap to copy, yes, but we might not need to copy them at all. That’s why I added an alias, TAccessRef, which is essentially the same as const TAccess&. It’s just easier to write.
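
To give an idea of how a typed access could work, here is a heavily simplified sketch. It is not Rift's TAccess: the names TAccessSketch, Pool, GetPool and Allows, the unordered_map-based pools and the float-only components are assumptions made for illustration. It shows the two properties described above: a narrower access is built from a bigger one by copying pool pointers, and asking for a component that the access does not hold (or only holds as const) fails at compile time with a readable message.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>
#include <unordered_map>

using Id = unsigned;

// Hypothetical pool: component storage keyed by entity id.
template<typename T>
using Pool = std::unordered_map<Id, std::remove_const_t<T>>;

// Hypothetical sketch of a typed access: one pool pointer per component type.
template<typename... Ts>
class TAccessSketch
{
	std::tuple<Pool<Ts>*...> pools;

	template<typename T>
	static constexpr bool Allows()
	{
		using Raw = std::remove_const_t<T>;
		// Exact match, or a const request for a component we hold as writable.
		return (std::is_same_v<T, Ts> || ...)
		    || (std::is_const_v<T> && (std::is_same_v<Raw, Ts> || ...));
	}

public:
	explicit TAccessSketch(Pool<Ts>&... inPools) : pools{&inPools...} {}

	// Build a narrower access from a bigger one: just copy the relevant pool pointers.
	template<typename... Us>
	TAccessSketch(const TAccessSketch<Us...>& other) : pools{other.template GetPool<Ts>()...}
	{}

	template<typename T>
	Pool<T>* GetPool() const
	{
		static_assert(Allows<T>(), "Access is missing this component, or only has it as const");
		return std::get<Pool<T>*>(pools);
	}

	template<typename T>
	T& Get(Id id) const
	{
		return GetPool<T>()->at(id);
	}
};

// Simplified components for the example.
struct Prop { bool isEnabled; };
struct Movement { float velocity; };
struct Transform { float position; };

void ApplyMovement(TAccessSketch<const Movement, Transform> access, Id entity, float deltaTime)
{
	const auto& movement = access.Get<const Movement>(entity);
	auto& transform = access.Get<Transform>(entity);
	transform.position += movement.velocity * deltaTime;
}
```

A TAccessSketch<const Prop, const Movement, Transform> can be passed directly to ApplyMovement. If the parent only had const Transform, the conversion would trigger the static_assert instead of a wall of template errors.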

Filtering entities

If an access can’t iterate on its own, how do we do it?

Iteration is done by creating and modifying lists of ids:

ListAll(access): Returns all entity ids containing all the provided components.
ListAny(access): Returns all entity ids containing at least one of the provided components

Then we can apply further filters, such as excluding entities by component:

RemoveIf(access, ids): Exclude entities having a component
RemoveIfNot(access, ids): Exclude entities not having a component

It should be mentioned that, for performance, these functions don’t preserve the order of the ids by default, but we can use their stable counterparts for that:

RemoveIfStable(access, ids): Exclude entities having a component
RemoveIfNotStable(access, ids): Exclude entities not having a component
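
The difference between an order-preserving filter and one that does not keep the order can be sketched as follows. These are hypothetical helpers over a plain id list and a hash set standing in for a component pool, not Rift's implementation. Both drop the ids that are missing from the pool; the first one swaps the last id into the freed position, the second one keeps the relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using Id = unsigned;
using PoolIds = std::unordered_set<Id>;  // hypothetical stand-in for "entities that own component C"

// Unstable: the removed id is replaced by the last one, so the order is not preserved.
void RemoveIdsWithout(const PoolIds& pool, std::vector<Id>& ids)
{
	for (std::size_t i = 0; i < ids.size();)
	{
		if (pool.count(ids[i]) == 0)
		{
			ids[i] = ids.back();
			ids.pop_back();
		}
		else
		{
			++i;
		}
	}
}

// Stable: surviving ids keep their relative order.
void RemoveIdsWithoutStable(const PoolIds& pool, std::vector<Id>& ids)
{
	ids.erase(std::remove_if(ids.begin(), ids.end(),
	              [&pool](Id id) { return pool.count(id) == 0; }),
	    ids.end());
}
```

Both versions leave the same set of ids. Only the stable one guarantees they come out in the order they went in.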

The strength of this approach is that we are just operating on a list of ids, so we are not limited to the functions above. It’s just “filtering” lists of ids.

One example could be in Rift, where the compiler precaches two lists, one for classes and one for structs:

TArray<AST::Id> classes, structs;
AST::Hierarchy::GetChildren(ast, moduleId, classes);
AST::RemoveIfNot<CType>(ast, classes);
structs = classes;

AST::RemoveIfNot<CClassDecl>(ast, classes);
AST::RemoveIfNot<CStructDecl>(ast, structs);

As you can see, it is filtering different components to finish with those two lists of types.

It also shows how filtering can be done directly from the world (ast in the example) without an access. You won’t get the benefit of cached pools, but it will still be really fast to iterate:

ListAll(world)
RemoveIf(world, ids)
RemoveIfNot(world, ids)

Performance

I mentioned many reasons why this style of API is attractive, but there is another one. It is fast.

When I implemented accesses for Rift, I already had filters (very similar to entt’s views). So I took the chance to do a one to one comparison with the following results:

In debug, access filtering iterates up to 3 times faster than views.

In release the difference is tighter: between 35% and 50% faster in most runs.

It should be noted that this benchmark runs an empty iteration loop. For views, this means their pool checks run very close together. In other words, it is their ideal scenario, unrealistically in their favor. Yet they still run slower. Why is that?

Why is it faster?

Unlike views, accesses don’t need to find their pools again and again every time they get created. Most of the time, an access is created from another one, which is literally just copying the relevant pool pointers.

However, this is not where most of the performance benefit comes from.

It comes from the fact that views check each entity against all the pools at once, while ListAll checks all ids against one pool after another:

Views

Iterate all ids in smallest pool
- Check that the id has components A, B, C

Access Filtering

Get all ids from smallest pool
Remove those that don’t have component A
Remove those that don’t have component B
Remove those that don’t have component C

This uses a single pool and its hash-set at a time, making it more cache-friendly.
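
The two strategies can be put side by side in a small sketch. The hash sets below are a hypothetical stand-in for component pools, and the function names are made up for this example. Both functions return the same ids; they only differ in the order in which pools are touched.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

using Id = unsigned;
using PoolIds = std::unordered_set<Id>;  // hypothetical stand-in for a component pool

const PoolIds* Smallest(const std::vector<const PoolIds*>& pools)
{
	return *std::min_element(pools.begin(), pools.end(),
	    [](const PoolIds* a, const PoolIds* b) { return a->size() < b->size(); });
}

// View style: every id is checked against all pools before moving on to the next id.
std::vector<Id> ListAllPerEntity(const std::vector<const PoolIds*>& pools)
{
	assert(!pools.empty());
	std::vector<Id> result;
	for (Id id : *Smallest(pools))
	{
		const bool inAll = std::all_of(pools.begin(), pools.end(),
		    [id](const PoolIds* pool) { return pool->count(id) > 0; });
		if (inAll)
		{
			result.push_back(id);
		}
	}
	return result;
}

// Filtering style: all ids are checked against one pool, then the survivors against the next.
std::vector<Id> ListAllPoolByPool(const std::vector<const PoolIds*>& pools)
{
	assert(!pools.empty());
	const PoolIds* smallest = Smallest(pools);
	std::vector<Id> ids(smallest->begin(), smallest->end());
	for (const PoolIds* pool : pools)
	{
		if (pool == smallest)
		{
			continue;  // every id came from this pool already
		}
		ids.erase(std::remove_if(ids.begin(), ids.end(),
		              [pool](Id id) { return pool->count(id) == 0; }),
		    ids.end());
	}
	return ids;
}
```

In the second function, the inner loop only ever touches a single pool, and the list of candidates shrinks after each pass.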

I hope this post was not too dense. It is quite a specific topic, after all.

Consider having a look at Rift. It would be incredibly helpful to get your ideas, feedback and/or code contributions!

Implementing a general-use arena

Thu, 03 Feb 2022 00:00:00 +0000

Now that we have learned about arenas and allocators, we can get our hands dirty with an implementation of an arena.

Best Fit Arena

You see, for the last couple of months, I’ve been updating RiftCore with new features. RiftCore is a cross-platform framework I use for C++ projects, and it lacked some memory management.

So the time came to design a general-purpose arena!

This article will describe the design and implementation of a Best Fit Arena. Feel free to come up with a better name though (and put it in the comments below!)

General-purpose?

A general-purpose allocator (or arena) must be able to work in all scenarios without any big limitations. As such, it has to be able to:

Allocate in any order and any size
Deallocate in any order
Use (and reuse) all space available
Minimize fragmentation

In RiftCore, arenas always receive the size of the allocation in their Free() function. This opens the door to some optimizations, but don’t worry, the BestFitArena can be adapted to avoid this pattern.

Implementation

A BestFitArena works by tracking all unused spaces, called free slots.

Let’s go through what we see in this picture:

Like most allocators, we have one or multiple memory blocks of pre-allocated memory.
We also keep a list of FreeSlots, sorted by size. Bigger first.
We don’t track allocations in any way. No headers, no offsets and no sizes.

Seen in more detail, each slot stores a pointer to the start of its memory and its size.

This algorithm has zero overhead when fragmentation is low. The less fragmentation, the more performant it is. It is also designed to minimize fragmentation and, as you will see later, even in a scenario with a lot of fragmentation, performance is still excellent.

Allocation

Allocation always picks the smallest free slot that fits and takes the pointer from it. Then the slot is shrunk by removing the used space from it.

Find Smallest Slot

Before anything else, we check if the arena is marked as pending sort. This is an optimization that prevents unnecessary sorts on consecutive Free calls. At this point we also shrink the slots if necessary.

Once we know all slots are sorted, we perform a binary search by size, which has a complexity of O(log N).
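
The search step can be sketched like this. It is not RiftCore's code: FreeSlot, FindSmallestSlot and AllocateFrom are hypothetical names, alignment is ignored, and the list is re-sorted eagerly instead of being marked as pending sort. Since slots are sorted with the biggest first, all slots that fit form a prefix of the list, and the smallest one that fits is the last element of that prefix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free slot: where the unused memory starts and how big it is.
struct FreeSlot
{
	unsigned char* start;
	std::size_t size;
};

// Returns the index of the smallest slot with at least "size" bytes, or -1 if none fits.
// Expects slots sorted by size, bigger first.
int FindSmallestSlot(const std::vector<FreeSlot>& slots, std::size_t size)
{
	// Binary search for the first slot that is too small.
	const auto firstTooSmall = std::partition_point(slots.begin(), slots.end(),
	    [size](const FreeSlot& slot) { return slot.size >= size; });
	if (firstTooSmall == slots.begin())
	{
		return -1;
	}
	return static_cast<int>(firstTooSmall - slots.begin()) - 1;
}

// Takes the pointer from the chosen slot and shrinks the slot by the used space.
void* AllocateFrom(std::vector<FreeSlot>& slots, std::size_t size)
{
	const int index = FindSmallestSlot(slots, size);
	if (index < 0)
	{
		return nullptr;
	}
	FreeSlot& slot = slots[static_cast<std::size_t>(index)];
	void* ptr = slot.start;
	slot.start += size;
	slot.size -= size;
	if (slot.size == 0)
	{
		slots.erase(slots.begin() + index);  // nothing left in this slot
	}
	// Keep the list sorted for the next search (this sketch sorts eagerly).
	std::sort(slots.begin(), slots.end(),
	    [](const FreeSlot& a, const FreeSlot& b) { return a.size > b.size; });
	return ptr;
}
```

With slots of 128, 64 and 16 bytes, a request for 32 bytes lands in the 64-byte slot, which then shrinks to 32 bytes.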

Free

Free expands the free slots that “touch” the freed memory: they absorb it and grow.

We know the size of the allocation because it is contained in the free slots list, which we check anyway.
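
As an illustration of slots absorbing freed memory, here is a sketch under a few assumptions: FreeBlock is a hypothetical function, the size of the freed block is passed in by the caller, the slot list is scanned linearly, and re-sorting by size is left to the next allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free slot: where the unused memory starts and how big it is.
struct FreeSlot
{
	unsigned char* start;
	std::size_t size;
};

// Returns a block to the free list, merging it with every slot that touches it.
void FreeBlock(std::vector<FreeSlot>& slots, unsigned char* ptr, std::size_t size)
{
	FreeSlot merged{ptr, size};
	for (std::size_t i = 0; i < slots.size();)
	{
		const FreeSlot slot = slots[i];
		const bool touchesBefore = slot.start + slot.size == merged.start;  // slot ends where we start
		const bool touchesAfter = merged.start + merged.size == slot.start;  // slot starts where we end
		if (touchesBefore || touchesAfter)
		{
			if (touchesBefore)
			{
				merged.start = slot.start;
			}
			merged.size += slot.size;
			// The old slot is absorbed: replace it with the last one and re-check this position.
			slots[i] = slots.back();
			slots.pop_back();
		}
		else
		{
			++i;
		}
	}
	// The list is no longer sorted by size here; it would be marked as pending sort.
	slots.push_back(merged);
}
```

Freeing the block that sits between two free slots leaves a single slot covering all three ranges.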

PS: This is a post I never published when I wrote it. So some details might be missing but feel free to ask any questions :)

Introduction to allocators and arenas

Tue, 30 Mar 2021 00:00:00 +0000

Lately, I have been playing around with the implementation of custom allocators and arenas to replace native allocations on my C++ projects.

Wow! Stop right there, Miguel. This line already deserves some introductions! Let’s talk about allocators.

Crash course on allocations

To keep this brief, I will assume that we have some experience with C++ and heap allocation (malloc and new).

An allocation is when we request a pointer to a block of memory of a specified size. When we use malloc or new we are getting this block of memory from the heap.

When we deallocate a pointer (calling free or delete), its block of memory is no longer needed by us and becomes available once again.

What are Allocators and Arenas

The definition of an allocator is somewhat flexible. It involves the encapsulation of allocation and deallocation of memory.

The allocators provided by the STD (the C++ standard library) are templated objects bound to a type. For example, std::vector can have different allocators.

In game development, we also use allocators as global memory managers. Using them, we can optimize allocations for specific parts of a game engine. For example, we can have an allocator that contains one render frame of data and gets cleared when a new frame starts.

But… Isn’t it confusing to call everything an allocator? I believe it is, and I don’t seem to be the only one because some engines call the global memory allocators arenas.

Therefore, let’s stick with the following terminology:

Allocators are objects that encapsulate allocation and deallocation of memory

Arenas are independent (often global) allocators

Container Allocators are stateful allocators that manage the memory used by a container

Why are they necessary?

Native allocation needs to work in all scenarios. It behaves like a general-purpose arena, meaning it can’t have limitations, and it must be good enough at doing everything. All this, while lacking any context about our particular use case.

Knowing this, I can think of three performance benefits from allocators: Allocation/free cost, memory locality and fragmentation.

Allocation/Free Cost

malloc acts as the intermediary between the program and the OS. For example, sometimes it will need to request more memory from the kernel, and that is very slow.

Memory Locality

Very briefly speaking, modern CPUs have cache lines, caches and RAM. Since data is brought into the caches in blocks, if the data we need sits close together in memory, it’s much more likely that it will already be cached. Accessing RAM instead of the CPU cache can be hundreds of times slower.

Since malloc and new don’t have context about our memory use-cases, the pointers allocated can be anywhere. However, allocators can give us much better memory locality.

Fragmentation

Fragmentation occurs when we have allocated and freed multiple times, leaving gaps that are not big enough to fit new allocations.

This means we will need to request more memory. Some allocator algorithms don’t have fragmentation at all. Others have the information to reduce it further than malloc can.

From a technical design standpoint, we will also simplify code, visualizing where memory is held at all times and under which rules. We can use the arena that fits our problem and change it if needed.

Types of Allocators

There are many types of allocators based on their algorithms. Each of them brings benefits as well as limitations.

There is no way I could explain all of them, but let me give you a quick rundown of the simplest ones.

Linear

A Linear allocator reserves a big block of memory and then moves an offset to the next available position when allocating. Since it doesn’t keep track of previous allocations, a linear allocator can’t free them individually.

This algorithm is by far the most performant due to its simplicity. But it also has the most limitations, so its use in the real world is very specific.

Stack

Stack is one step more advanced than Linear. It knows the size of all allocations, allowing us to free the last allocation.
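
A stack allocator can be sketched by adding size tracking on top of the moving offset of a linear allocator. The class below is a hypothetical example: it keeps the sizes in a side list and ignores alignment, while real implementations often store a small header next to each allocation instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stack allocator over a fixed block. Alignment is ignored to keep it short.
class StackAllocator
{
	std::vector<unsigned char> block;
	std::vector<std::size_t> sizes;  // size of every live allocation, oldest first
	std::size_t offset = 0;

public:
	explicit StackAllocator(std::size_t capacity) : block(capacity) {}

	void* Alloc(std::size_t size)
	{
		if (size > block.size() - offset)
		{
			return nullptr;  // not enough space left
		}
		void* ptr = block.data() + offset;
		offset += size;
		sizes.push_back(size);
		return ptr;
	}

	// Only the most recent allocation can be freed. Returns false otherwise.
	bool FreeLast(void* ptr)
	{
		if (sizes.empty())
		{
			return false;
		}
		const std::size_t size = sizes.back();
		if (ptr != block.data() + (offset - size))
		{
			return false;
		}
		offset -= size;
		sizes.pop_back();
		return true;
	}

	std::size_t Used() const { return offset; }
};
```

Allocations have to be released in the reverse order they were made, which is the limitation that buys the simplicity.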

Pool

A Pool allocator contains a list of same-size slots. Each allocation must fit in one slot.

To track which slots are available, we can use a bitset. They are very performant and compact containers where 1 bit represents one occupied slot.

Some implementations keep track of allocations using a linked list. However, this means we need to iterate over the entire memory block. It also introduces 8 extra bytes for each allocation.
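
The bitset variant can be sketched as follows. PoolAllocator and its functions are hypothetical names, and the free slot is found with a simple linear scan over the bits. Slots are only properly aligned for any type if SlotSize is a multiple of the required alignment.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical pool allocator: SlotCount slots of SlotSize bytes each.
template<std::size_t SlotSize, std::size_t SlotCount>
class PoolAllocator
{
	alignas(std::max_align_t) unsigned char block[SlotSize * SlotCount];
	std::bitset<SlotCount> used;  // 1 bit per slot: set means occupied

public:
	void* Alloc(std::size_t size)
	{
		if (size > SlotSize)
		{
			return nullptr;  // allocations must fit in one slot
		}
		for (std::size_t i = 0; i < SlotCount; ++i)
		{
			if (!used.test(i))
			{
				used.set(i);
				return block + i * SlotSize;
			}
		}
		return nullptr;  // pool is full
	}

	void Free(void* ptr)
	{
		const auto offset = static_cast<std::size_t>(static_cast<unsigned char*>(ptr) - block);
		assert(offset % SlotSize == 0 && offset / SlotSize < SlotCount);
		used.reset(offset / SlotSize);
	}

	std::size_t UsedSlots() const { return used.count(); }
};
```

Freeing a slot only clears one bit, and the next allocation reuses the first free slot it finds.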

General

A general allocator can be used for all use-cases and doesn’t have any big limitation. I will soon publish how I implemented a general arena that is up to 130x faster than malloc.

Many more!

Those were not all allocators that exist. There are many more. Each algorithm has advantages and disadvantages, and it’s up to us to choose the best one for the job.

Some I didn’t mention:

Native allocation replacements

Some libraries just provide an extra layer between us and malloc but not necessarily using the concepts we described before. They still lack context about our use-case and need to solve every problem just like malloc. However, they manage to be considerably faster than the default solution.

Depending on what you do, these libraries might be enough. However, setup is not always as intuitive and straightforward as it should be.

One example is microsoft/mimalloc.

Resources

Writing a Game Engine from Scratch - Part 2: Memory
CppCon 2014: Mike Acton “Data-Oriented Design and C++”
Custom Vector Allocation
Some allocator implementation examples: mtrebi/memory-allocators
