Optimizing noise function with MUDA

by syoyo

http://lucille.svn.sourceforge.net/svnroot/lucille/
angelina/haskellmuda/examples/noise/

mnoise2.jpg
(Turbulence by modified noise)

pnoise2.jpg
(Turbulance by Perlin noise)

After I wrote my first 4k raycasting demo(link), I found it is important to use procedural graphics technique, especially (Perlin) noise, for this area, so I decided to implement noise function in MUDA and optimize it.

Noise would be also important for shaders in offline renderer.

MUDAize noise function

Original Perlin noise is not suited for SIMD evaluation since it requires lookup of the permutation table.

Olano proposed fast, HW friendly, no-looup-table-is-required version of noise called modified noise [1].

My first attempt for optimizing noise function was SIMDizing this modified noise function by using MUDA language.

Memoizing

I don’t satisfy just implementing Olano’s modified noise function in MUDA. I am performance eager!

During I’ve been implementing modified noise in MUDA, I found one possible direction for further optimizing noise function is exploiting coherency.

Fortunatelly, I found there’s high probability that subsequent call of noise() has same integer part of input coordinate and most time consuming part is to getting gradient vector by hasing integer part of input coordiante. Thus getting rid of this coherency is good idea for further performance optimization.

Strategy is very simple. Storing previous result and reusing it unless integer part of input coordinate changes. This technique is called memoization in language community but it might be better to denote it cache which is well known word in graphics community.

Performance

I’ve measured performance for 4 versions of 2D noise function: Perlin noise(reference), modified noise(reference), modified noise(MUDA) and modified noise(MUDA + memoization) in 64bit environment in my Core2 Intel Mac(64bit binary is always faster than 32bit binary). Compiler used is llvm-gcc from LLVM 2.3.

mnoise2_perf.jpg

MUDA version of modified noise is about 2x fasther than reference non-SIMD vesion of modified noise. And adding memoization acheves about 2x performance boost further. In total, MUDA + memoization version is about 5x faster than reference version.

I’ve also compiled and measured previous work of SIMDized 2D noise function,

http://www.flipcode.com/archives/Perlin_Noise_Functions_SIMD.shtml

but unfortunately it is not so fast. SIMD version is as fast as it’s scalar version.
(33 M noise2/sec in my Core2 2.16 GHz single core, 64bit)

Think theoretically

MUDA version of modified noise in 2D case, mnoise2(), consists of 120 instructions per 4 noises(30 instructions/noise in average). If we assume one instruction is executed in one cycle, MUDA version of mnoise2() does not finish less than 120 cycles.

Under this assumption, let me calculate theoretical peak performance of noise function in my Core2 2.16 GHz machine.

2.16(GHz) / (120/4) = 72 M mnoise2/sec

MUDA + memoized version consists of 68 instructions, so in this case.

2.16(GHz/) / (68/4) = 127 M mnoise2/sec

Thus, compared with measured performance(51 M and 107 M respectively), performance of MUDA and MUDA + memoized version is not so different from theoretical peak performance.

To be honest, in x86 Core2, logical instruction(e.g. and, or) is executed with 0.5 cycles, so theoretical peak execution cycle is a bit lower.

Further optimization

Another direction of optimizing noise function is dynamically generate optimized noise function with Just-in-time compiling. This strategy would take effect when we optimize octaves of noise functions(e.g. turbulence).

This direction of optimization is TODO.

[Ja]

4k デモでは、どうしてもプロシージャルグラフィックスが必要になるので noise 関数を MUDA で最適化してみました. それにこれを最適化しておけば、オフラインレンダラのシェーダにも使えますし.

100 M noise2/sec 出るとはいえ…

MUDA + memo で 100 M noise2/sec 出せるといっても、1024×1024 画面全体で 4 octaves の turbulence を生成するだけけで 25 M noise2/sec = 25 M frames/sec となって、30 fps 出せないんだよね.

デモシーンとかでノイズをインタラクティブにレンダリングするなら、コアをさらに使うか、もっと別のアルゴリズムによる高速化手段を考えないといけないです.

あと、modified noise は perlin noise ほどランダム性がないですね. 結構周波数を高くすると repeatability が見えます. 周期を長くしてもダメみたい.

References

[1] Marc Olano. “Modified Noise for Evaluation on Graphics Hardware”. Graphics Hardware 2005.
http://www.csee.umbc.edu/~olano/papers/

MUDA 版 mnoise2 の objdump

大体 200 命令くらい(1 noise2 あたり 50 命令).

MUDA + memoize 版 mnoise2 の objdump

memo にヒットする場合は 100 命令くらいで実行できる(1 noise2 あたり 25 命令).

Advertisements