Custom Memory Allocator
This custom memory allocator is optimized for the allocation of small objects. It maintains chains of blocks that are the same size; the number of chains and maximum size of allocated object can be customized. By default, the allocator uses 5 chains with the block size 32, 48, 64, 128 and 256 bytes. Objects with a size greater than 128 bytes are allocated using standard malloc.
When this allocator is requested to allocate memory, it rounds the requested object size to nearest block size. If the chain of blocks of this size is not empty, the object is allocated by just unlinking a block from the chain. If the chain is empty, then the allocator allocates a new page (default size is 4Kb), splits it into blocks of the requested size, and adds all these blocks to the chain. The page itself is also allocated using standard C runtime malloc(). When an object is de-allocated, it is linked back into the chain from which it was allocated.
Usually, allocate and free operations require no synchronization – they are just implemented as L1 list link/unlink. So in cases when thread works locally (objects are de-allocated from the same thread that allocates them), performance is very high. In this scenario, adding extra processors can only speed up an application.
However, sometimes threads need to exchange data (so objects created in one thread are sent to another thread where they are ultimately deallocated). In such cases, synchronization can’t be avoided—but its overhead can be reduced. Instead of using one centralized mutex, this algorithm creates one mutex per thread. When the free method finds that an object was allocated by some other thread, it locks the mutex of this thread and adds this block to the pending requests list. Each thread checks this list at each allocate operation. When number of pending requests exceeds some specified threshold, the thread locks the mutex, stores the head of the list, resets the list, unlocks the mutex and performs the deallocation. This approach to deallocation minimizes the mutex’s holding time, significantly reduces the number of locks, and ensures that locking can block only one thread (instead of a centralized lock which can block all running threads).
When a thread is terminated, the algorithm checks counters of live objects at each page and deallocates pages with a zero counter of used objects. Other pages will be deallocated later by other threads that will deallocate these living objects.
How to use
This package provides three C functions:
void* thread_alloc(size_t size);
void* thread_realloc(void* addr, size_t size);
void thread_free(void* addr);
These can be used instead of the standard malloc, free and realloc in a C application. Just include the “threadalloc.h” header file and “threadalloc.obj” file in the linker list for your project.
Applications written in C++ could redefine the default new and delete operators to use the custom allocator by defining the REDEFINE_DEFAULT_NEW_OPERATOR macros in the allocator’s header file. This allocator can be built using the make utility. Makefiles for Microsoft Visual C++ (makefile.mvc) and GCC (makefile) are provided. The command file make.bat can be used to invoke the Microsoft nmake utility for makefile.mvc.
Make will the build allocator object file and four tests: stdalloc_test1, stdalloc_test2, threadalloc_test1, threadalloc_test2. Test 1 implements a model in which each thread works in isolation from the others (allocates and deallocates objects locally). In this case, the performance benefits compared to using the standard allocator are most meaningful. Test 2 implements a model of communication between two threads: objects are created in one thread (producer) and deleted in another thread (consumer). In our results – published in the article – the custom allocator is 2-3 times faster than the standard C-runtime allocator when running on a single CPU machine. (The tests were performed under Scientific Linux 184.108.40.206 for x86_64). However, in a multi-core environment, the benefits of such an allocator truly shine through, with a performance gain of up to 10 times. Certainly there are many other factors, other than the memory allocator, that determine the scalability and performance of embedded systems applications, but hopefully using this allocator, or one like it, can contribute significantly to improving performance in your application.
Copyright © 2003 – 2006 by Konstantin Knizhnik
Copyright © 2006 – 2020 McObject LLC
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of the copyright holders not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. The copyright holders makes no representations about the suitability of this software for any purpose. It is provided “as is” without express or implied warranty
THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION BASED UPON CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
At McObject our only focus is database management systems. We are a dedicated group of specialists and invite you to listen to our on-demand Webinars.