Thursday, 2 February 2012

Peak Abstraction


An interesting phenomenon I've observed over my years as a programmer is the occurrence of "Peak Abstraction" within a young programmer - and their subsequent improvement.

There is a common pattern. It typically occurs after 3-4 years of professional practice. A programmer is growing in confidence. They begin to tackle more complex, challenging problems. And they mirror those problems with complex, challenging data structures. They have not yet learnt the fine art of simplicity.

Every member of every data structure is a pointer or a reference. There are no simple data types. Everything becomes a "value dictionary", a "graph-based visitor" or a "multi-dimensional ring buffer". Disassembly of their code reveals it to be little more than an infinite sequence of lwz, lwz, lwz, lwz and more lwz. Perhaps they develop a hobbyist interest in Java. They may develop an editor that connects boxes with lines - such editors are great for filling with lots and lots of abstraction! Often, they favour Boost; perhaps they even become your office's Boost Guy. Even simple plain old data types such as integers are suddenly held within a smart-pointer container and reference counted - with a pimpl to retrieve their value.

Those around them become increasingly desperate. They can see the code becoming inflated, inefficient, unmaintainable. Spending time reading this individual's code is like spending time exposed to a radioactive isotope. You can only tolerate a few moments before exhibiting signs of sickness. You attempt to modify their code. 478 pages of template errors dissuade you from considering this foolish course of action again.

And then it happens. The programmer realises the error of their ways. They realise that they are adding neither design value nor computational value. They realise they are adding overhead and maintenance trouble. Perhaps they undergo their first encounter with profiling software. Said software may disclose the full terror of the D$ dirty bomb they have unleashed upon the code. Perhaps they are "introduced" to the profiler by an irate lead in some sort of shotgun-wedding affair. Whatever the route, the programmer realises that solving complex problems does not require needlessly complex data structures. Such structures impress nobody. People are impressed by simple, elegant solutions. And thus the recovery begins. Data structures become simple, expressive, and maintainable. Data cache misses dip under 20ms/frame for the first time. Co-workers enjoy modifying their code. The compiler begins to emit instructions such as "add" or "mul". Sadly the compiler may be too fond of outputting "jump" instructions at this stage... The programmer themselves becomes wary of others who are pre-Peak Abstraction.

One even wonders if, like its namesake "Peak Oil", there may be a global Peak Abstraction. Perhaps the practising body of programmers will all undergo Peak Abstraction in the next 5-10 years, and we will all be left never having to refactor some template monstrosity again.

But then, where's the fun in that? Don't programmers just love to be indignant about others' code?

Tuesday, 13 September 2011

Haskell Photon Mapper Released to Hackage

I've now uploaded my Haskell photon mapper and raytracer to Hackage. It is listed as a package called "crocodile" (named after my son's current obsession with crocodiles!).

It's all working on the Cornell box. It's pretty memory hungry, and has a few self-declared things to fix (e.g. irradiance gradients), but it's out there to play with.

Enjoy.


Thursday, 25 August 2011

This is what happens

When universities teach Java!

(some random bit of disassembled code found on my travels)

00A5CA74 EC7E1824 fdivs f3,f30,f3                          56 PIPE
00A5CA78 FC406850 fneg f2,f13
00A5CA7C D0660010 stfs f3,0x10(r6)
00A5CA80 D0C90010 stfs f6,0x10(r9)
00A5CA84 D981FF60 stfd f12,-0xA0(r1)                        PIPE
00A5CA88 E881FF60 ld r4,-0xA0(r1)                          70 (00A5CA84) LHS[01]
00A5CA8C D961FF60 stfd f11,-0xA0(r1)                        PIPE
00A5CA90 EBA1FF60 ld r29,-0xA0(r1)                         70 (00A5CA8C) LHS[01]
00A5CA94 D941FF60 stfd f10,-0xA0(r1)                        PIPE
00A5CA98 EB81FF60 ld r28,-0xA0(r1)                         70 (00A5CA94) LHS[01]
00A5CA9C D921FF60 stfd f9,-0xA0(r1)                         PIPE
00A5CAA0 EB61FF60 ld r27,-0xA0(r1)                         70 (00A5CA9C) LHS[01]
00A5CAA4 D901FF60 stfd f8,-0xA0(r1)                         PIPE
00A5CAA8 EB41FF60 ld r26,-0xA0(r1)                         70 (00A5CAA4) LHS[01]
00A5CAAC D8E1FF60 stfd f7,-0xA0(r1)                         PIPE
00A5CAB0 EB21FF60 ld r25,-0xA0(r1)                         70 (00A5CAAC) LHS[01]
00A5CAB4 D1A6000C stfs f13,0xC(r6)                          PIPE
00A5CAB8 D029000C stfs f1,0xC(r9)
00A5CABC D801FF60 stfd f0,-0xA0(r1)                         PIPE
00A5CAC0 E801FF60 ld r0,-0xA0(r1)                          70 (00A5CABC) LHS[01]

Wednesday, 24 August 2011

Pattern matching as an optimisation method, and a means of insight

One feature of Haskell that I particularly like is its pattern-matching scheme. To route code down the correct codepath, Haskell uses pattern matching against data types and values, instead of extensive flow-control statements. This makes the code much simpler to read: your code concerns itself with calling the correct functions for the correct cases, and the underlying Haskell compiler deals with the flow-control minutiae.
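
To make this concrete, here is a small illustrative sketch (a hypothetical type, not from any particular codebase) of routing on data constructors rather than through explicit conditionals:

-- A hypothetical shape type, purely for illustration
data Shape = Circle Float
           | Rectangle Float Float
           | Triangle Float Float

-- Each case is routed by matching on the constructor; no explicit
-- if/else flow control is needed
area :: Shape -> Float
area (Circle r) = pi * r * r
area (Rectangle w h) = w * h
area (Triangle base height) = 0.5 * base * height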

I've increasingly found myself using a similar approach in C++ and HLSL using features such as polymorphic functions. A good example comes from optimisation: frequently, you find cases in the code where someone has written a "kitchen sink" function. This function internally handles many cases, and accordingly takes many parameters.

Frequently, many of these parameters are called with specific literal values or constants. For example, many parameters may be zero, which disables entire codepaths inside the function. For me, this is a bad code smell. But why? Many people would argue that since the code is branched away, what's the big deal? Especially if it means we can have just one definition of a function. Well, branches are very expensive on modern processors. On contemporary consoles, a floating-point branch can yield a 27-cycle penalty. Not to mention all the preamble and branch set-up code.

I've therefore found myself increasingly providing multiple specialised versions of these functions, omitting the parameters that are permanently called with 0.0f or 1.0f and tailoring the code accordingly. Obviously this saves a lot of needless computation, branching and I$ misses.
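
As a sketch of the idea in Haskell terms (hypothetical function names and formulas, not from real code): a general shading function whose rim-lighting weight is zero at most call sites can be replaced, at those sites, by a specialised variant that omits both the parameter and the work it guards.

-- A hypothetical "kitchen sink" lighting function; at most call sites
-- rimWeight is passed as 0.0, disabling the rim term entirely
shadeGeneral :: Float -> Float -> Float -> Float -> Float
shadeGeneral nDotL nDotV specular rimWeight =
    max 0 nDotL + specular + rimWeight * rim
  where
    rim = (1 - max 0 nDotV) ** 4

-- A specialised variant for the common rimWeight == 0.0 case: the
-- parameter, the rim computation and any branching on it all disappear
shadeNoRim :: Float -> Float -> Float
shadeNoRim nDotL specular = max 0 nDotL + specular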

What's interesting about this approach is that once you've performed a few of these refactorings, you start to expose patterns in the code and its structure that you previously were unaware of. You may find that your major axis of variation lies not in your chosen object hierarchy composition, but elsewhere. You may have a multitude of objects that add little value at a higher level, but now a large set of lower level functions that process the objects differently. Sometimes it makes more sense to structure and divide your code along the lines of what processing it does, not what you think the object "is", as that is the greater source of variation.

Tuesday, 23 August 2011

On the limitations of caching

I've been spending a little time thinking about caching lately. Irradiance caching, optimising some caching in production code, and hardware caches. Here's a little brain-dump.

I've come to the conclusion that if you ever need anything more than a trivially small cache, your algorithm is wrong. This is particularly so for parallel programming, as caches are often a difficult-to-handle piece of interdependent common state.

My reasoning goes a little something like this. If you are relying on a cache for performance, it is usually to re-use the results of some expensive transformation a -> b. Crucially, there needs to be one a mapping to one or more b (otherwise there is no cache reuse!). For an efficient cache, you need:

  • To be caching the results of a fairly complex computation,
  • To have an efficient means of storing the results of the computation,
  • To have a well-behaved cache indexing function.

Now, what often ends up happening is that you iterate over the many objects 'b', and populate the cache on-demand from a common ancestor piece of data, 'a'. And here's the first problem. You are primarily iterating over the wrong dataset, and filtering and culling parts of the other dataset by proxy. You have an implicit data filtering or reduction operation that would be better handled explicitly.

Ok, so, we switch our iteration round to iterate over the source objects 'a'. We iterate over each 'a' object, and produce the resulting one or more 'b' objects. At this point, do we really have a cache? We've turned our cache into a data transformation kernel. All of a sudden our cache is no longer some intermediate state used in a computation, but simply a piece of input data. Since we now have a piece of code transforming [a] -> [b], we have a piece of code that is very amenable to data parallelism. Since we've decoupled the code from a cache, it's also much easier to parallelise.
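
As a minimal sketch of that restructuring (generic names, assuming nothing about the real objects): once the cache body becomes an explicit a -> [b] expansion, the whole transformation is an ordinary data-parallel map.

import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)

-- Sketch: the former cache-fill code becomes an explicit per-source
-- expansion, and the [a] -> [b] transformation is a data-parallel map
transformAll :: NFData b => (a -> [b]) -> [a] -> [b]
transformAll expand sources = concat (parMap rdeepseq expand sources)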

I think ultimately an extensive caching scheme is evidence that the data domain hasn't been adequately structured and managed. If you need a cache, you probably need more data conditioning, filtering, and reduction.

A good counter-example is the vertex stream processing unit of a modern GPU. These units read huge streams of data, transform them, and maintain minimal cached state. Vertex caches are typically minuscule in comparison to the volume of data they process. Yet a post-transform cache can often maintain a very high level of efficiency. The key to this cache's success is the pre-processing (i.e. structuring) of the data fed into it. It is not random. It is well conditioned, which yields efficiency at a number of levels, and enables a vertex cache to be used only as a relatively modest addition to the pipeline. The cache is the icing on the cake, not a fundamental component of the algorithm.

I think this is an important lesson when transitioning from writing single-threaded to parallel code. Don't dream up increasingly elaborate caching schemes. Instead think up separable, well-structured data-parallel transformations. When you feel the need for a cache, you probably need to carefully re-examine your data flow structure.

Tuesday, 5 July 2011

Where'd all the Haskell go?

It's not gone. It'll be back.

I've recently moved house, and obviously that consumes a lot of time both before and after the move. In order to write a Haskell article, I need to develop supporting code, and I don't have much coding time right now.

On the other hand, it's pretty easy to knock out an article based on the knowledge gained during my Master's, and through my industry experience, so I'm writing on those subjects for a while.

Haskell will be back!

Thursday, 2 June 2011

Photon Mapping in Haskell

For a while I have been working on extending my Haskell raytracer to include a global illumination model. I chose to implement photon mapping, as it seemed both to offer a number of challenges and to suit a functional programming approach well.

Photon Tracing

One of the main challenges of implementing an algorithm like photon mapping is that it relies on a small amount of mutable state due to its Monte Carlo integration. As each photon is traced through the scene, a random number is required to determine the fate of the photon at each photon-to-material interaction. A random number generator typically requires state to iterate from one number to the next. Such state is easy to incorporate into imperative code, but it presents a challenge to the Haskell programmer.

There are two main challenges with state: management and propagation of the state, and selecting what code should share a given state context. For example, in the photon mapper I could have introduced a single, global state for the random number generator. Or, I could have given each light source its own state. Or each photon. The greater the scope of the state, the greater the coupling in the code and the greater the difficulty in parallelisation.

I chose to give each photon traced its own random number generator state - and each photon had a different initial seed.

-- Realistic Image Synthesis Using Photon Mapping p60
tracePhoton :: [Photon] -> Photon -> SceneGraph -> StdGen -> (Int, Int) -> [Photon]
tracePhoton !currentPhotons (Photon !photonPower !photonPosDir) sceneGraph !rndState !(bounce, maxBounces) = 
    -- See if the photon intersects any surfaces
    case findNearestIntersection sceneGraph ray of
      Nothing -> currentPhotons
      Just (obj, t, subId) -> case photonFate of
                                -- Diffuse reflection. Here, we store the photon that got reflected, and trace a new photon - but only if it's bright enough to be worthwhile
                                DiffuseReflect -> if Colour.magnitude newPhotonPower > brightnessEpsilon && (bounce + 1) <= maxBounces
                                                  then tracePhoton (storedPhoton : currentPhotons) reflectedPhoton sceneGraph rndState'' (bounce + 1, maxBounces)
                                                  else storedPhoton : currentPhotons
                                    where
                                      !reflectedPhoton = Photon newPhotonPower (surfacePos, reflectedDir)
                                      !(reflectedDir, rndState'') = diffuseReflectionDirection rndState' tanSpace


                                -- Specular reflection. Here, we reflect the photon in the fashion that the surface would reflect towards the viewer and
                                -- aim to absorb it somewhere else in the photon map
                                SpecularReflect -> if Colour.magnitude newPhotonPower > brightnessEpsilon && (bounce + 1) <= maxBounces
                                                   then tracePhoton currentPhotons reflectedPhoton sceneGraph rndState' (bounce + 1, maxBounces)
                                                   else currentPhotons
                                    where
                                      !reflectedPhoton = Photon newPhotonPower (surfacePos, reflectedDir)
                                      !reflectedDir = Vector.negate (snd photonPosDir) `reflect` normal


                                -- Absorb. The photon simply gets absorbed into the map
                                Absorb -> storedPhoton : currentPhotons
          where
            !(photonFate, rndState') = runState (choosePhotonFate coefficients) rndState
            !coefficients = russianRouletteCoefficients (material obj)
            !newPhotonPower = computeNewPhotonPower photonFate coefficients photonPower (material obj)
            !tanSpace = primitiveTangentSpace (primitive obj) subId hitPosition obj
            !normal = thr tanSpace
            !hitPosition = pointAlongRay ray t
            !surfacePos = hitPosition + (normal Vector.<*> surfaceEpsilon)
            !brightnessEpsilon = 0.1
            !storedPhoton = Photon photonPower (surfacePos, snd photonPosDir)
    where
      !ray = rayWithPosDir photonPosDir 10000
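
For illustration, a minimal driver along these lines (this sketch is not the actual repository code; tracePhotons and emittedPhotons are assumed names) builds a uniquely seeded generator for each photon:

import System.Random (mkStdGen)

-- Illustrative driver: every photon is traced with its own StdGen,
-- built from a unique per-photon seed, so no state is shared
tracePhotons :: SceneGraph -> Int -> [Photon] -> [Photon]
tracePhotons sceneGraph maxBounces emittedPhotons =
    concat (zipWith traceOne [1 ..] emittedPhotons)
  where
    traceOne seed photon =
        tracePhoton [] photon sceneGraph (mkStdGen seed) (0, maxBounces)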

KD-Tree Construction

The resulting list of photons is inserted into a KD tree in order to efficiently locate the set of photons near to a point in space. This is a simple recursive operation:

buildKDTree :: [Photon] -> PhotonMapTree
buildKDTree photons
    | length photons == 1 = PhotonMapLeaf (head photons)
    | length photons >= 2 = let (boxMin, boxMax) = photonsBoundingBox photons
                                !axis = largestAxis (boxMax - boxMin)
                                !photonsMedian = foldr ((+) . fst . posDir) zeroVector photons Vector.<*> (1 / fromIntegral (length photons)) -- mean photon position, used as the split value
                                !value = component photonsMedian axis
                                photonsGT = Prelude.filter (\x -> component ((fst . posDir) x) axis > value) photons
                                photonsLE = Prelude.filter (\x -> component ((fst . posDir) x) axis <= value) photons
                            in if length photonsGT > 0 && length photonsLE > 0
                               then PhotonMapNode axis value (buildKDTree photonsGT) (buildKDTree photonsLE)
                               else let (photons0', photons1') = trace "Using degenerate case" $ degenerateSplitList photons in PhotonMapNode axis value (buildKDTree photons0') (buildKDTree photons1')
    | otherwise = error ("Invalid case, length of array is " ++ show (length photons) ++ "\n")

KD-Tree Query

The photon mapping algorithm uses an interesting combination of a kd-tree and a max-heap to locate only the closest N photons to a point. The max-heap is sorted on the squared distance to a photon. This means the most distant photons are stored towards the top of the max-heap, making them easy to discard.

Traversal of the tree commences with a point of interest and a radius. Any photons found within that radius are added to the max-heap. If the max-heap exceeds its specified size then excess, distant photons are dropped from the top of the heap. Since the max-heap stores the most distant photon at the top of the heap, it is easy to monitor the current-furthest photon and tighten our search radius during traversal.


-- Use a max heap to make it easy to eliminate distant photons
data GatheredPhoton = GatheredPhoton Float Photon deriving (Show)
type PhotonHeap = MaxHeap GatheredPhoton


instance Ord GatheredPhoton where
    compare (GatheredPhoton dist1 _) (GatheredPhoton dist2 _)
        | dist1 == dist2 = EQ
        | dist1 <= dist2 = LT
        | otherwise = GT


instance Eq GatheredPhoton where
    (GatheredPhoton dist1 _) == (GatheredPhoton dist2 _) = dist1 == dist2


minimalSearchRadius !rSq !photonHeap = case viewHead photonHeap of
                                         Nothing -> rSq
                                         Just (GatheredPhoton dSq _) -> Prelude.min rSq dSq


-- Gather photons for irradiance computations
-- Algorithm adapted from Realistic Image Synthesis Using Photon Mapping p73
gatherPhotons :: PhotonMapTree -> Position -> Float -> PhotonHeap -> Int -> PhotonHeap
gatherPhotons (PhotonMapNode !axis !value gtChild leChild) !pos !rSq !photonHeap !maxPhotons
    -- In this case, the split plane bisects the search sphere - search both halves of tree
    | (value - posComponent) ** 2 <= rSq = let heap1 = gatherPhotons gtChild pos rSq' photonHeap maxPhotons
                                               rSq'' = minimalSearchRadius rSq' heap1
                                               heap2 = gatherPhotons leChild pos rSq'' photonHeap maxPhotons
                                               newHeap = union heap1 heap2
                                           in Data.Heap.drop (size newHeap - maxPhotons) newHeap


    -- One side of the tree...
    | posComponent > value = gatherPhotons gtChild pos rSq' photonHeap maxPhotons


    -- ... or the other
    | posComponent <= value = gatherPhotons leChild pos rSq' photonHeap maxPhotons


    -- Prolapse
    | otherwise = error "gatherPhotons: unexplained/unexpected case here"
    where
      !posComponent = component pos axis
      !rSq' = minimalSearchRadius rSq photonHeap -- Refine search radius as we go down tree to search no further than closest allowed photon
gatherPhotons (PhotonMapLeaf !p) !pos !rSq !photonHeap !maxPhotons
    | distSq < rSq = let newHeap = insert (GatheredPhoton distSq p) photonHeap
                     in Data.Heap.drop (size newHeap - maxPhotons) newHeap -- Discard any excess photons - we get rid of the furthest ones
    | otherwise = photonHeap
    where !distSq = pos `distanceSq` (fst . posDir) p


Parallelisation

The photon tracing stage is trivially easy to parallelise, provided that you have carefully treated the shared state required by the random number generator. Since I have provided each traced photon with its own state, which is uniquely initialised, all photons are data independent and therefore easy to trace in parallel.
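
A sketch of how the illustrative driver shown earlier might be parallelised with a strategy (assuming an NFData instance for Photon; the chunk size of 256 is an arbitrary choice):

import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import System.Random (mkStdGen)

-- Sketch: each photon's trace is an independent pure computation, so the
-- list of per-photon results can be evaluated in parallel chunks
tracePhotonsPar :: SceneGraph -> Int -> [Photon] -> [Photon]
tracePhotonsPar sceneGraph maxBounces emittedPhotons =
    concat (withStrategy (parListChunk 256 rdeepseq) traces)
  where
    traces = zipWith traceOne [1 ..] emittedPhotons
    traceOne seed photon =
        tracePhoton [] photon sceneGraph (mkStdGen seed) (0, maxBounces)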

I have not yet parallelised the photon gathering stage, due to some issues with the current implementation's efficiency. Two options present themselves for parallelisation. The first is to parallelise the traversal of high-level sub-branches of the KD tree. This naive approach would yield some speedup, but the work of each thread is likely to be highly unbalanced.

Speculatively, a work-stealing-queue type of approach may help here. The high-level nodes to be traversed could be initially inserted into a queue. When a worker thread pulls data, it could compare how many nodes remain to be traversed against the total number of workers. If the ratio of nodes to workers becomes very low, nodes could be repeatedly removed from the queue and replaced with their child nodes, until the ratio becomes favourable. This would populate the queue with a fairly heterogeneous mixture of large and small amounts of work, which would help fill up and balance processor time across the threads.
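
A sketch of just that splitting policy (hypothetical; the queue and the worker threads themselves are omitted): nodes are replaced by their children until there is enough work to go around, or only leaves remain.

-- Sketch: grow the work list by splitting nodes into their children until
-- there are comfortably more work items than workers (the 2x threshold is
-- arbitrary), or nothing is left to split
splitForWorkers :: Int -> [PhotonMapTree] -> [PhotonMapTree]
splitForWorkers numWorkers queue
    | length queue >= 2 * numWorkers || all isLeaf queue = queue
    | otherwise = splitForWorkers numWorkers (concatMap children queue)
  where
    isLeaf (PhotonMapLeaf _) = True
    isLeaf _ = False
    children (PhotonMapNode _ _ gtChild leChild) = [gtChild, leChild]
    children leaf = [leaf]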

Results

Here is the current output image, with all options cranked up:



The code requires refinement and tuning to eliminate the bias, improve the colour bleeding and fix a few minor artefacts.

The photon mapper is currently very slow. It can take hours on my MacBook Pro to trace a scene containing 100,000 photons, and that is a very small number of photons.

Most of the time is spent in the gathering phase. The current implementation is very space intensive and requires further serious attention. The photon emission and tracing step, however, is extremely quick, requiring only a few seconds to trace 100,000 photons.

Code

Full source code is available on github:

https://github.com/TomHammersley/HaskellRenderer

Future Work

I am now working to improve the efficiency of the gathering phase, which will speed up both rendering and development iteration. I am also implementing an irradiance cache to further optimise execution speed. Clearly, various minor bugs also need addressing.

References

Further implementation details can be found primarily in Henrik Wann Jensen's photon mapping book, supplemented by various SIGGRAPH course materials.