Syoyo Fujita's Blog

raytracing monte carlo

Month: December, 2007

Houdini 9 apprentice

Houdini apprentice … Free
Scripting … OK
HDK(dev sdk) … OK

Hmm… would it be possible to model a scene with this Houdini Apprentice,
dump the scene data with a Python script or a File Out CHOP,
and then render it with my lucille renderer?


Python scripting is available,
and the dev SDK can be used too…

Does that mean that if I write a RIB exporter myself,
I can have lucille do the rendering?

Is the File Out CHOP also freely usable?

If so, I could use Houdini to build test scene data,
which would also mean the OP network editor (for shader authoring and so on) would be available…

Automatic SIMD optimization

I’ve started carefully reading “Efficient Utilization of SIMD Extensions” [1].

This paper is a good starting point to survey automatic SIMD vectorization.

According to this paper, automatic SIMD vectorization roughly divides into three layers.

1. Symbolic vectorization
Vectorize at the language or application-context level.

2. Straight-line code vectorization
Find coherent patterns in scalar code paths and vectorize them.

3. Special-purpose compiler for vectorized code
Custom instruction scheduling and register allocation for vector instructions, for fast binary code generation.
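As a rough mental model of layer (2) — my own sketch, not code from the paper, and the function names are hypothetical — the vectorizer takes independent scalar statements in straight-line code and packs them into one vector operation:

```python
# Illustrative model of straight-line code vectorization: four
# independent scalar adds get packed into one 4-wide "SIMD" op.

def scalar_version(a, b):
    # Straight-line scalar code: four independent additions.
    c0 = a[0] + b[0]
    c1 = a[1] + b[1]
    c2 = a[2] + b[2]
    c3 = a[3] + b[3]
    return [c0, c1, c2, c3]

def vec_add4(a, b):
    # What the vectorizer conceptually emits: one packed add
    # (think SSE's addps) replacing the four scalar adds above.
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
assert scalar_version(a, b) == vec_add4(a, b)  # same result, one op instead of four
```

The hard part the paper addresses is, of course, proving the scalar statements really are independent and coherent enough to pack.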

So far, MUDA doesn’t do any optimizations shown above.

I’d like to find a way to incorporate layer (1)-style optimization into MUDA.
For layer (2), I think there is nothing to do for MUDA, since MUDA is already a vector language.
Layer (3) is very low-level and seems to be a machine-dependent technique, so I’m leaving it for now.

Much more information on the automatic SIMD vectorization this paper discusses, especially (2) and (3), can be found in [2].


… I’ve started reading [1] properly, and… both [1] and [2] are just amazing…

I keep fumbling around and reinventing the wheel, only to find, when I actually look into it, that it was all done n years ago…

MUDA is finally starting to take shape, so with it in hand, …

[1] Efficient Utilization of SIMD Extensions
Franz Franchetti, Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber. Invited paper.

[2] Automatic SIMD Vectorization
Juergen Lorenz, Ph.D. thesis

Faust, Signal Processing Language

The name FAUST stands for Functional AUdio STream. Its programming model combines two approaches: functional programming and block diagram composition. You can think of FAUST as a structured block diagram language with a textual syntax.

– Functional programming
– block diagram composition

OMG, this is exactly what I’ve been thinking of as the core feature set for the coming GI era’s GI language (shader + raytracing + MC sampler).

Faust has already done it for their DSP application… the world is wide…
I must investigate their language design to rethink my GI language (possibly MUDA-based) idea.

It seems that Faust supports SSE and AltiVec code output and, moreover, they
are trying to implement an LLVM backend. Cool!

Apple shares hit $200




If only I had come across a book like this around 2002…
I regret my lack of financial literacy up to now…

In 2008, along with lucille 0.2, …

Performance Counter Super-Resolution



This idea seems good.
Commodity CPUs have precise HW performance counter facilities,
but using them has a side effect on the measured program.

Increasing the sampling rate also increases
system bus accesses, memory accesses, etc. to transfer the sampled data,
which affects the behavior of the running (measured) application,
resulting in poor, inaccurate profiling.

Intel’s VTune has a recommended lower bound of one millisecond as the minimum interval between counter measurements, to constrain the impact of these two types of error.

I had wondered why VTune’s sampling interval is so sparse (millisecond order);
at such a coarse sampling rate, short functions are easily missed in the resulting sample data.
But now I understand why it has to be so, according to the quote and the linked info.

Usually we (the performance-eager) want to profile a program’s behavior at
100–1000 cycle accuracy for this kind of profiling.

The idea of using super-resolution techniques may solve this problem.

Super-resolution profiling,
i.e. running your app (the measured function) multiple times at a low sampling frequency but with unique jitter per run,
then assembling the runs into one high-frequency profiling result,
may give more accurate sampling with the HW performance monitoring facility.

I’m considering supporting such HW sampling techniques in the MUDA optimization platform.
For example, running the oprofile profiler multiple times with different start-time jitter.
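Here is a minimal sketch of how I understand the idea — my own toy model, not the paper’s code, and it assumes a perfectly repeatable workload: several coarse runs with different start offsets merge into one dense profile.

```python
# Super-resolution profiling sketch: sample at a coarse interval N,
# but with a different start offset (jitter) per run, then merge the
# runs by timestamp to recover an effectively full-resolution trace.

def workload_counter(t):
    # Stand-in for reading a HW performance counter at "time" t
    # of a perfectly repeatable program run.
    return (t * t) % 97

def coarse_run(offset, interval, length):
    # One profiling run: sample every `interval` ticks, starting at `offset`.
    return [(t, workload_counter(t)) for t in range(offset, length, interval)]

interval, length = 4, 16
runs = [coarse_run(off, interval, length) for off in range(interval)]

# Merge: interleave all runs by timestamp -> one dense profile.
merged = sorted(s for run in runs for s in run)

full = [(t, workload_counter(t)) for t in range(length)]
assert merged == full  # jittered coarse runs reconstruct the dense profile
```

In reality the workload is never perfectly repeatable, so the assembly step would need to align and average runs rather than simply interleave them; that is where the actual paper does the heavy lifting.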



lucille 0.2 plan

Here I wrote a plan for lucille 0.2, the next version of lucille.


In 2007, I thought really a lot, and a lot, about lucille:
how to make lucille faster, more robust, and more beautiful.

I’ve been collecting building blocks for it,
and now I’m confident that I have all the building blocks I need to realize
my new idea, as presented in the lucille 0.2 plan.

Come 2008, I’ll turn the plan into action, one step at a time, steadily.

New MUDA developer!

I’m so glad to introduce new MUDA developer, Luca Barbato.

He is a cool SIMD-ist on PPC/AltiVec and has made many SIMD optimization contributions to other projects.

Now he is contributing a VMX (AltiVec) backend for the MUDA language, and will contribute a Cell/SPE backend as well.

Thanks a lot, Luca!

About MUDA

MUDA is a vector language for CPU.

MUDA site is here

For past posts on MUDA, see the “Past posts on MUDA” list below.

MUDA project on Launchpad

I’ve also launched MUDA project on Launchpad.

Launchpad provides some facilities missing on SourceForge,
for example managing translations and Q&A.

Launchpad also provides the cutting-edge Bazaar VCS (version control system) for hosting codebases.

I’m considering hosting the MUDA codebase on this Bazaar VCS instead of SourceForge’s svn.

How To Write Fast Numerical Code: A Small Introduction

How To Write Fast Numerical Code: A Small Introduction
Srinivas Chellappa, Franz Franchetti and Markus Püschel
to appear in Proc. Summer School on Generative and Transformational Techniques in Software Engineering, Lecture Notes in Computer Science, Springer, 2008

I’ve been building a SIMD language called MUDA, but
as this article also shows, the most important things in optimization, ranked from the top, are:

– an efficient algorithm
– memory access (cache behavior)
– and last of all, SIMDization

That is the order. In the end, superficial SIMDization actually has little effect.
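A toy Python measurement — my own illustration, not from the article — of that priority order: switching the data structure behind a membership query beats any constant-factor speedup that even a perfect 4x SIMDization of the scan could give.

```python
# Algorithm/data structure first: the same membership query, first as
# an O(n) linear scan over a list, then as an O(1) hash-set lookup.
# A 4x SIMD speedup of the scan could never close this gap for large n.
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)
probe = n - 1  # worst case for the linear scan

t_list = timeit.timeit(lambda: probe in data_list, number=100)
t_set = timeit.timeit(lambda: probe in data_set, number=100)

assert (probe in data_list) and (probe in data_set)  # same answer either way
assert t_set < t_list  # the algorithmic change wins by orders of magnitude
```

Only after a change like this is in place does it make sense to spend effort on memory layout, and then on SIMD.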


And if you then further optimize an efficient algorithm with SIMDization and the like, …



On recent CPUs, a cache miss easily stalls for 10–100x or more cycles
compared to a register access, so
next to the 4x compute throughput from SIMDization, optimizing memory access pays off far more.

# By the way, this is the story for general-purpose CPUs such as x86.


SIMDization comes last.

Yes, strictly speaking, SIMDization should be done last.

But in many cases, don’t most people start with SIMDization first?

Compared with algorithms and memory access, SIMD optimization is (or at least feels) intuitively easy to understand, …

However, it is highly CPU-dependent, and therefore not portable.
And SIMDization rarely pays off unless you change your data structures from the ground up.
(I believe SIMD units will disappear from future CPUs;
the reason is that scalar code plus multicore is easier to handle.
So I think it is better, in the long run, not to obsess over SIMDization.)

Besides, SIMDization is something even a monkey can do once you pick up a few tricks.


So with MUDA, I want to reduce that kind of unproductive coding time, …


In any case, that is why MUDA aims not merely to do SIMDization,

but to build a framework that can automate the whole process.

As a first step, it would probably help to study how systems like SPIRAL and ATLAS work.

MUDA site opens!

I’ve launched MUDA project page! Check it.

MUDA is a vector language for CPU.
Yeah, not for GPU or (dead and gone) GPGPU 😉

I’m planning to use MUDA to code the core computation parts of my lucille renderer.



Past posts on MUDA

[1] Idea: MUDA, MUltiple Data Accelerator language for high performance computing

[2] Road to MUDA. SIMD code generation, domain specific language, automatic optimization, functional programming, etc.

[3] Work in progress: Initial Haskell version of MUDA