Course notes on Beyond Programmable Shading

by syoyo


SIGGRAPH 2008: Beyond Programmable Shading

Extra info on Larrabee and next-gen graphics technologies (parallel computing frameworks, raycasting, etc.)

Here are my impressions of these slides.

Interactive Cinematic Lighting

It's just a review of previous work. There is nothing impressive here.
I know the existing research has a serious drawback (to be solved in future work), but Fabio doesn't talk about it.

Next Generation Parallelism in Games.

Raycasting seems interesting. I am also thinking about efficient ray traversal using an octree.
I believe raytracing (ray traversal) could be made much faster with a brilliant new data structure, compared to the current hierarchy-based traversal algorithms (e.g. kd-tree, BVH).

For example, let's review existing BVH ray traversal.
One BVH traversal step for one ray packet requires 100~300 cycles on a CPU.
And a moderately complex scene requires 50-100 traversal steps for each ray (or ray packet).
That multiplies out to 5,000 ~ 30,000 cycles per ray (or ray packet). That is too much, isn't it?
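
The estimate above is just the product of the two ranges. A minimal sketch of that arithmetic, where the function name `traversal_cycles` is made up for illustration and the inputs are the rough figures quoted above, not measurements:

```python
def traversal_cycles(cycles_per_step, steps_per_ray):
    """Return the (low, high) total cycles per ray (or ray packet).

    Both arguments are (low, high) ranges taken from the rough
    estimates in the text.
    """
    low = cycles_per_step[0] * steps_per_ray[0]
    high = cycles_per_step[1] * steps_per_ray[1]
    return low, high


# 100~300 cycles per traversal step, 50-100 steps per ray (or packet).
low, high = traversal_cycles((100, 300), (50, 100))
print(low, high)  # 5000 30000
```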

Larrabee Graphics Architecture: Software is the New Hardware

Main memory is thousands of clocks away

Thousands of clocks!!!

If we assume a texture miss rate of 5% (a 95% hit rate) and take a miss to cost about 1,000 clocks, texturing requires at least 50 cycles on average (0.05 x 1,000).
The miss rate rises significantly when multi-texturing is used (10 ~ 30% miss rate, I guess).
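
The 50-cycle figure can be reproduced as miss rate times miss penalty. The sketch below assumes a miss penalty of 1,000 clocks, the low end of "thousands of clocks", which is the value that makes the 5% case come out at 50 cycles. The helper name `average_miss_cycles` is made up, and the 10% and 30% rows are only the guessed multi-texturing miss rates from above.

```python
def average_miss_cycles(miss_percent, miss_penalty_cycles):
    """Average cycles lost to texture cache misses per texture fetch."""
    return miss_percent * miss_penalty_cycles / 100


# Assumption: one trip to main memory costs about 1,000 clocks.
MISS_PENALTY_CYCLES = 1000

for miss_percent in (5, 10, 30):
    cycles = average_miss_cycles(miss_percent, MISS_PENALTY_CYCLES)
    print(f"{miss_percent}% miss rate: about {cycles:.0f} cycles per fetch")
```

With the same assumed penalty, the guessed 10 ~ 30% miss rate corresponds to roughly 100 ~ 300 cycles per fetch.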

Fibers can hide the latency of texturing, but the number of fibers is limited by the size of the L2 cache.
So I don't think a Larrabee core can hold enough fibers to completely hide the latency of texturing.
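
To make the argument concrete, here is a rough sketch of the two quantities being compared: how many fibers are needed to cover a given latency, and how many fit in the cache. Every number in the usage part is a placeholder chosen only to show the shape of the calculation, not a measurement or specification of Larrabee, and the function names are made up.

```python
import math


def fibers_needed(latency_cycles, compute_cycles_per_fiber):
    """Fibers required so that a waiting fiber's latency is fully covered.

    While one fiber waits, the other (N - 1) fibers must supply at least
    latency_cycles of useful work, so N = ceil(latency / compute) + 1.
    """
    return math.ceil(latency_cycles / compute_cycles_per_fiber) + 1


def fibers_that_fit(cache_bytes, bytes_per_fiber):
    """Fibers whose working state fits in the cache at the same time."""
    return cache_bytes // bytes_per_fiber


# Placeholder values (assumptions, not Larrabee data):
needed = fibers_needed(latency_cycles=1000, compute_cycles_per_fiber=50)
fit = fibers_that_fit(cache_bytes=256 * 1024, bytes_per_fiber=16 * 1024)
print(needed, fit)  # 21 16
print("latency fully hidden:", fit >= needed)
```

With these placeholder numbers the cache holds fewer fibers than would be needed, which is the situation worried about above. Different assumptions can of course flip the result.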

Beyond Data Parallel: Advanced Rendering on Larrabee

I am a bit disappointed with this course note. There is no new insight into the future (beyond data-parallel). Pharr just reviews the current problems and situation.

Compiler tech becomes more important beyond programmable shading

I think compiler optimization techniques are the most important thing for the future of graphics.

Thus I am now intensively researching compiler technology.
My research focuses on automatic parallelization and automatically optimized use of the memory hierarchy, and I have found some quite impressive previous work.

I'd like to list some previous work that has not yet been imported into or unveiled to the graphics community, but is very valuable.

Global Multi-Threaded Instruction Scheduling
GREMIO, the method proposed in this paper, automatically partitions a sequential program and executes it on multiple cores or threads.
You don't need to write parallel code!

Efficient dynamic heap allocation of scratch-pad memory
Memory Allocation for Embedded Systems with a Compile-Time-Unknown Scratch-Pad Size

These methods use scratchpad memory effectively, while programmers can write their code without knowing the memory hierarchy explicitly. This will greatly reduce the programmer's effort.

PLuTo – An automatic parallelizer and locality optimizer for multicores

The kind of code PLuTo is applicable to (nested loops) might not be so helpful for graphics apps, but with PLuTo, you no longer need to block (tile) your data accesses by hand to improve memory access. PLuTo does it automatically!
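
For readers who have not done blocking by hand, here is a minimal sketch of the kind of rewrite meant above, using a matrix transpose as a stand-in loop nest. This is a hand-written illustration, not PLuTo output. The function names and the tile size are made up, and Python itself will not show the cache benefit; only the loop structure matters.

```python
def transpose_naive(a, n):
    """Transpose an n x n matrix stored row-major in a flat list."""
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]
    return out


def transpose_tiled(a, n, tile=4):
    """Same result, but visits the matrix one tile x tile block at a time.

    Working on a small block keeps both the rows being read and the
    rows being written close together in memory.
    """
    out = [0] * (n * n)
    for ii in range(0, n, tile):          # block row
        for jj in range(0, n, tile):      # block column
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out


n = 6
matrix = list(range(n * n))
print(transpose_tiled(matrix, n) == transpose_naive(matrix, n))  # True
```

The tiled version has two extra loops and boundary handling, which is exactly the bookkeeping an automatic tool takes off your hands.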

Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

This is good for optimizing tree data structures, for example for ray traversal.
With this method, you no longer need to think about optimal packing of BVH and kd-tree nodes for optimal cache usage!
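
As a reminder of what that manual packing looks like, here is a small sketch that flattens a pointer-based binary tree into one contiguous list in depth-first order, so a node's left child sits right next to it. This is a hand-written layout trick shown only for contrast, not the pool allocation algorithm from the paper. The `Node` class and `flatten_depth_first` are hypothetical names.

```python
class Node:
    """A pointer-based binary tree node (stand-in for a BVH/kd-tree node)."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right


def flatten_depth_first(root):
    """Pack the tree into a list of (name, left_index, right_index).

    Nodes are stored in depth-first order, so an existing left child is
    always at index + 1. A missing child is recorded as -1.
    """
    flat = []

    def visit(node):
        index = len(flat)
        flat.append(None)  # reserve this node's slot before its children
        left_index = visit(node.left) if node.left is not None else -1
        right_index = visit(node.right) if node.right is not None else -1
        flat[index] = (node.name, left_index, right_index)
        return index

    visit(root)
    return flat


tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
for entry in flatten_depth_first(tree):
    print(entry)
```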