malloc performance

by syoyo


I’m now trying to write a high-performance BVH (Bounding Volume Hierarchy) raytracer combined with Intel’s recently proposed approximate SAH construction method [1].
I’ve started by writing an efficient memory allocator.

The system’s default malloc() does not guarantee the 16-byte alignment needed for SSE data manipulation (glibc: 8 bytes, darwin: 16 bytes).
Moreover, we might need 32-byte or 64-byte aligned allocations for efficient L1 cache behavior.

posix_memalign(), available on Linux, can allocate memory at any power-of-two alignment that is a multiple of sizeof(void *), but it is not cross-platform and it is slow (see below).
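As a minimal sketch of how posix_memalign() is called, here is a small wrapper plus an alignment check. The helper names aligned_alloc_posix() and is_aligned() are made up for this illustration and are not part of any library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign() in <stdlib.h> */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns `size` bytes aligned to `align`, or NULL.
 * posix_memalign() requires `align` to be a power of two and a multiple
 * of sizeof(void *); it returns a nonzero error code otherwise. */
static void *aligned_alloc_posix(size_t align, size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, align, size) != 0) {
        return NULL;
    }
    return p; /* release with free() */
}

/* Hypothetical helper: returns 1 if `p` is aligned to `align`, else 0.
 * `align` must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Memory obtained this way is released with the ordinary free().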

Thus, if we need high-performance computation (SSE and better L1 cache behavior), we must write our own custom memory allocator.

Kilauea style pooled memory allocator

I wrote a Kilauea style pooled memory allocator [2], which is simple and efficient for small memory allocations. It is well suited to constructing spatial data structures.
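To give a rough idea of what a pooled allocator looks like, here is a minimal sketch. It is not the Kilauea implementation and not my actual code: the struct layout, the 1 MiB block size, and the function signatures (including the one used here for mem_pool_alloc()) are assumptions made for illustration. The idea is to grab large blocks with malloc(), hand out aligned slices by bumping an offset, and free everything at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_BLOCK_SIZE ((size_t)1024 * 1024) /* assumed block size: 1 MiB */

typedef struct pool_block {
    struct pool_block *next;     /* older, already filled block */
    unsigned char     *base;     /* raw memory from malloc() */
    size_t             used;     /* bytes consumed from base */
    size_t             capacity; /* total bytes available at base */
} pool_block;

typedef struct {
    pool_block *head;  /* block currently being filled */
    size_t      align; /* alignment of every allocation (power of two) */
} mem_pool;

static void mem_pool_init(mem_pool *pool, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    pool->head = NULL;
    pool->align = align;
}

/* Carve `size` aligned bytes out of block `b`, or return NULL if full. */
static void *block_take(pool_block *b, size_t size, size_t align)
{
    uintptr_t start = (uintptr_t)b->base;
    uintptr_t p = (start + b->used + align - 1) & ~(uintptr_t)(align - 1);
    size_t offset = (size_t)(p - start);
    if (offset > b->capacity || size > b->capacity - offset) {
        return NULL;
    }
    b->used = offset + size;
    return (void *)p;
}

static void *mem_pool_alloc(mem_pool *pool, size_t size)
{
    void *p = pool->head ? block_take(pool->head, size, pool->align) : NULL;
    if (p) {
        return p; /* fast path: just a pointer bump */
    }
    if (size > SIZE_MAX - pool->align) {
        return NULL; /* size + align would overflow */
    }
    /* Slow path: start a new block, large enough even after alignment. */
    size_t need = size + pool->align;
    pool_block *b = malloc(sizeof *b);
    if (!b) {
        return NULL;
    }
    b->capacity = need > POOL_BLOCK_SIZE ? need : POOL_BLOCK_SIZE;
    b->base = malloc(b->capacity);
    if (!b->base) {
        free(b);
        return NULL;
    }
    b->used = 0;
    b->next = pool->head;
    pool->head = b;
    return block_take(b, size, pool->align);
}

/* There is no per-object free: the whole pool is released at once. */
static void mem_pool_destroy(mem_pool *pool)
{
    pool_block *b = pool->head;
    while (b) {
        pool_block *next = b->next;
        free(b->base);
        free(b);
        b = next;
    }
    pool->head = NULL;
}
```

The fast path is a few integer operations with no bookkeeping per object, which suits a tree builder that allocates many small nodes and throws them all away together.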

Here is the performance of malloc(), posix_memalign(), and the Kilauea style pooled memory allocator.

malloc() uniform: 0.210081 (sec)
posix_memalign() uniform: 0.533821 (sec)
mem_pool_alloc() uniform: 0.064749 (sec)

The test sequentially allocates 32 bytes of data 1048576 (1024*1024) times; there are no free() operations in this test.
(Alignment is 32 bytes, except for malloc().)
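A timing loop for this kind of test could look like the sketch below. This is not the exact harness I used: the function names are hypothetical, timing uses clock() from the C standard library, and the pointers are stored in an array so that they can be released after the clock has stopped (so, as in the test above, no free() is timed).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign() in <stdlib.h> */
#endif
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical benchmark: time `count` malloc() calls of `size` bytes.
 * Returns elapsed CPU seconds, or -1.0 if the pointer table cannot be
 * allocated. */
static double bench_malloc(size_t count, size_t size)
{
    assert(count > 0);
    void **ptrs = malloc(count * sizeof *ptrs);
    if (!ptrs) {
        return -1.0;
    }
    clock_t t0 = clock();
    for (size_t i = 0; i < count; i++) {
        ptrs[i] = malloc(size);
    }
    clock_t t1 = clock();
    /* Cleanup happens outside the timed region. */
    for (size_t i = 0; i < count; i++) {
        free(ptrs[i]);
    }
    free(ptrs);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Same loop for posix_memalign() with the given alignment. */
static double bench_posix_memalign(size_t count, size_t size, size_t align)
{
    assert(count > 0);
    void **ptrs = malloc(count * sizeof *ptrs);
    if (!ptrs) {
        return -1.0;
    }
    clock_t t0 = clock();
    for (size_t i = 0; i < count; i++) {
        if (posix_memalign(&ptrs[i], align, size) != 0) {
            ptrs[i] = NULL; /* free(NULL) is a no-op */
        }
    }
    clock_t t1 = clock();
    for (size_t i = 0; i < count; i++) {
        free(ptrs[i]);
    }
    free(ptrs);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling bench_malloc(1024 * 1024, 32) and bench_posix_memalign(1024 * 1024, 32, 32) would reproduce the shape of the test. Absolute numbers depend on the machine and the libc, and storing each pointer adds a small, equal cost to both loops.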

posix_memalign() is about 2.5x slower than malloc().
mem_pool_alloc() is the fastest, about 3.2x faster than malloc(), but that is to be expected, since malloc() and posix_memalign() are general-purpose memory allocators.

[1] Highly Parallel Fast KD-tree Construction for Interactive Ray Tracing of Dynamic Scenes

[2] Practical Parallel Rendering

Wow, we can see the entire content of the book!