Tuesday, 5 July 2011

Where'd all the Haskell go?

It's not gone. It'll be back.

I've recently moved house, and obviously that consumes a lot of time both before, and after the move. In order to write a Haskell article, I need to develop supporting code. I don't have much coding time right now.

On the other hand, it's pretty easy to knock out an article based on the knowledge gained during my Master's, and through my industry experience, so I'm writing on those subjects for a while.

Haskell will be back!

Thursday, 2 June 2011

Photon Mapping in Haskell

For a while I have been working on extending my Haskell raytracer to include a global illumination model. I chose to implement photon mapping, as it seemed to offer both a number of challenges, and it quite suited a functional programming approach.

Photon Tracing

One of the main challenges of implementing an algorithm like photon mapping is that it relies on a small amount of mutable state due to its monte-carlo integration. As each photon is traced through the scene, a random number is required to determine the fate of a photon for a given photon-to-material interaction. A random number generator typically requires state to iterate from one number to the next. Such state is easy to incorporate into imperative code, but it presents a challenge to the Haskell programmer.

There are two main challenges with state: management and propagation of state, and selecting what code should share a given state context. For example, in the photon mapper I could have introduced a single, global state for the random number generator. Or, I could have given each light source its own state. Or each photon. The greater scope that the state has, the greater the coupling in the code and the greater the difficulty in parallelisation.

I chose to give each photon traced its own random number generator state - and each photon had a different initial seed.

-- Realistic Image Synthesis Using Photon Mapping p60
tracePhoton :: [Photon] -> Photon -> SceneGraph -> StdGen -> (Int, Int) -> [Photon]
tracePhoton !currentPhotons (Photon !photonPower !photonPosDir) sceneGraph !rndState !(bounce, maxBounces) = 
    -- See if the photon intersects any surfaces
    case findNearestIntersection sceneGraph ray of
      Nothing -> currentPhotons
      Just (obj, t, subId) -> case photonFate of
                                -- Diffuse reflection. Here, we store the photon that got reflected, and trace a new photon - but only if it's bright enough to be worthwhile
                                DiffuseReflect -> if Colour.magnitude newPhotonPower > brightnessEpsilon && (bounce + 1) <= maxBounces
                                                  then tracePhoton (storedPhoton : currentPhotons) reflectedPhoton sceneGraph rndState'' (bounce + 1, maxBounces)
                                                  else storedPhoton : currentPhotons
                                    where
                                      !reflectedPhoton = Photon newPhotonPower (surfacePos, reflectedDir)
                                      !(reflectedDir, rndState'') = diffuseReflectionDirection rndState' tanSpace


                                -- Specular reflection. Here, we reflect the photon in the fashion that the surface would reflect towards the viewer and
                                -- aim to absorb it somewhere else in the photon map
                                SpecularReflect -> if Colour.magnitude newPhotonPower > brightnessEpsilon && (bounce + 1) <= maxBounces
                                                   then tracePhoton currentPhotons reflectedPhoton sceneGraph rndState' (bounce + 1, maxBounces)
                                                   else currentPhotons
                                    where
                                      !reflectedPhoton = Photon newPhotonPower (surfacePos, reflectedDir)
                                      !reflectedDir = Vector.negate (snd photonPosDir) `reflect` normal


                                -- Absorb. The photon simply gets absorbed into the map
                                Absorb -> storedPhoton : currentPhotons
          where
            !(photonFate, rndState') = runState (choosePhotonFate coefficients) rndState
            !coefficients = russianRouletteCoefficients (material obj)
            !newPhotonPower = computeNewPhotonPower photonFate coefficients photonPower (material obj)
            !tanSpace = primitiveTangentSpace (primitive obj) subId hitPosition obj
            !normal = thr tanSpace
            !hitPosition = pointAlongRay ray t
            !surfacePos = hitPosition + (normal Vector.<*> surfaceEpsilon)
            !brightnessEpsilon = 0.1
            !storedPhoton = Photon photonPower (surfacePos, snd photonPosDir)
    where
      !ray = rayWithPosDir photonPosDir 10000

KD-Tree Construction

The resulting list of photons is inserted into a KD tree in order to efficiently locate the set of photons near to a point in space. This is a simple recursive operation:

buildKDTree :: [Photon] -> PhotonMapTree
buildKDTree photons
    | length photons == 1 = PhotonMapLeaf (head photons)
    | length photons >= 2 = let (boxMin, boxMax) = photonsBoundingBox photons
                                !axis = largestAxis (boxMax - boxMin)
                                !photonsMedian = foldr ((+) . fst . posDir) zeroVector photons Vector. fromIntegral (length photons)
                                !value = component photonsMedian axis
                                photonsGT = Prelude.filter (\x -> component ((fst . posDir) x) axis > value) photons
                                photonsLE = Prelude.filter (\x -> component ((fst . posDir) x) axis <= value) photons
                            in if length photonsGT > 0 && length photonsLE > 0
                               then PhotonMapNode axis value (buildKDTree photonsGT) (buildKDTree photonsLE)
                               else let (photons0', photons1') = trace "Using degenerate case" $ degenerateSplitList photons in PhotonMapNode axis value (buildKDTree photons0') (buildKDTree photons1')
    | otherwise = error ("Invalid case, length of array is " ++ show (length photons) ++ "\n")

KD-Tree Query

The photon mapping algorithm uses an interesting combination of a kd tree and max-heap to locate only the closest N photons to a point. The max-heap is sorted on the squared distance to a photon. This means the most distant photons are stored towards the top of the max-heap, making them easy to discard.

Traversal of the tree commences with a point of interest and a radius. Any photons found within that radius are added to the max-heap. If the max-heap exceeds its specified size then excess, distant photons are dropped from the top of the heap. Since the max-heap stores the most distant photon at the top of the heap, it is easy to monitor the current-furthest photon and tighten our search radius during traversal.


-- Use a max heap to make it easy to eliminate distant photons
data GatheredPhoton = GatheredPhoton Float Photon deriving (Show)
type PhotonHeap = MaxHeap GatheredPhoton


instance Ord GatheredPhoton where
    compare (GatheredPhoton dist1 _) (GatheredPhoton dist2 _)
        | dist1 == dist2 = EQ
        | dist1 <= dist2 = LT
        | otherwise = GT


instance Eq GatheredPhoton where
    (GatheredPhoton dist1 _) == (GatheredPhoton dist2 _) = dist1 == dist2


minimalSearchRadius !rSq !photonHeap = case viewHead photonHeap of
                                         Nothing -> rSq
                                         Just (GatheredPhoton dSq _) -> Prelude.min rSq dSq


-- Gather photons for irradiance computations
-- Algorithm adapted from Realistic Image Synthesis Using Photon Mapping p73
gatherPhotons :: PhotonMapTree -> Position -> Float -> PhotonHeap -> Int -> PhotonHeap
gatherPhotons (PhotonMapNode !axis !value gtChild leChild) !pos !rSq !photonHeap !maxPhotons
    -- In this case, the split plane bisects the search sphere - search both halves of tree
    | (value - posComponent) ** 2 <= rSq = let heap1 = gatherPhotons gtChild pos rSq' photonHeap maxPhotons
                                               rSq'' = minimalSearchRadius rSq' heap1
                                               heap2 = gatherPhotons leChild pos rSq'' photonHeap maxPhotons
                                               newHeap = union heap1 heap2
                                           in Data.Heap.drop (size newHeap - maxPhotons) newHeap


    -- One side of the tree...
    | posComponent > value = gatherPhotons gtChild pos rSq' photonHeap maxPhotons


    -- ... or the other
    | posComponent <= value = gatherPhotons leChild pos rSq' photonHeap maxPhotons


    -- Prolapse
    | otherwise = error "gatherPhotons: unexplained/unexpected case here"
    where
      !posComponent = component pos axis
      !rSq' = minimalSearchRadius rSq photonHeap -- Refine search radius as we go down tree to search no further than closest allowed photon
gatherPhotons (PhotonMapLeaf !p) !pos !rSq !photonHeap !maxPhotons
    | distSq < rSq = let newHeap = insert (GatheredPhoton distSq p) photonHeap
                     in Data.Heap.drop (size newHeap - maxPhotons) newHeap -- Discard any excess photons - we get rid of the furthest ones
    | otherwise = photonHeap
    where !distSq = pos `distanceSq` (fst . posDir) p


Parallelisation

The photon tracing stage is trivially easy to parallelise, provided that you have carefully treated the shared state required by the random number generator. Since I have provided each traced photon with it's own state, which is uniquely initialised, all photons are data independent and therefore easy to trace in parallel.

I have not yet parallelised the photon gathering stage, due to some issues with the current implementation's efficiency. Two options present themselves for parallelisation. The first is to parallelise the traversal of high-level sub-branches of the KD tree. This naive approach would yield some speedup, but the work of each thread is likely to be highly unbalanced.

Speculatively, a work-stealing queue type approach may help here. The high-level nodes to be traversed could be initially inserted into a queue. When a worker thread pulls data, it could compare how many nodes remain to be traversed against the total number of workers. If the ratio of nodes:workers becomes very low, nodes could be repeatedly removed from the queue and replaced with their child nodes, until the ratio becomes favourable. This would populate the queue with a fairly heterogenous mixture of large and small amounts of work to be done, which would help fill up and balance processor time across the threads.

Results

Here is the current output image, with all options cranked up:



The code requires refinement and tuning to eliminate the bias, improve the colour bleeding and fix a few minor artefacts.

The photon mapper is currently very slow. It can take hours on my MacBook Pro to trace a scene containing 100,000 photons, and that is a very small number of photons.

Most of the time is spent in the gathering phase. The current implementation is very space intensive and requires further serious attention. The photon emission and tracing step, however, is extremely quick, requiring only a few seconds to trace 100,000 photons.

Code

Full source code is available on github:

https://github.com/TomHammersley/HaskellRenderer

Future Work

I am now working to improve the efficiency of the gathering phase which will improve efficiency and development speed. I am also implementing an irradiance cache to further optimise execution speed. Clearly various minor bugs also need addressing.

References

Further implementation details can be found primarily in Henrik Wann Jensen's photon mapping book, supplemented by various SIGGRAPH course materials.



Sunday, 15 May 2011

Haskell Renderer Source Code on Github

Just a quick note to let anyone interested that the source code to my work-in-progress Haskell raytracer and photon mapper is now available at:

https://github.com/TomHammersley/HaskellRenderer

I'm currently mid way through debugging and optimising the photon map solution. It works, though it takes 10 minutes or so on my MacBook Pro.

When run it will output the following image:


To build the program you just need the latest Haskell Platform and the bmp package from cabal.

I'm still a beginner at Haskell so some of the code may not represent the most idiomatic or best practice Haskell. If anyone has any advice, comments or constructive criticism, I'm keen to hear it at tomhammersley@gmail.com.

(I plan to make the package Cabal compatible soon).

Cheers.

Monday, 11 April 2011

Designs That Leak State: Using Haskell Notation to Analyse C++ Design

I came across an interesting design failure today. In short, a C++ object was internally storing the state specific to a piece of multi-threaded client code. The client code was leaking its state data into the C++ object. In generalised, abstracted terms:

class Object
{
    Real member variables
...
    unsigned char m_state1[MaxContexts];
    float m_state2[MaxContexts];
    bool m_state3[MaxContexts];
};

Each of the threads would read from 'Object', process it, and store the results in one of the elements of m_state.

It's a bad pattern, but I daresay a common pattern.

What is wrong here?

The first thing to notice is the series of member variables, all arrays, sized by MaxContexts. A series of variables like this is a dead giveaway that really there is semantically some concept in the system that is not explicitly represented in the code. Given that we are trying to represent an object in a series of "contexts", with different values for each, it suggests that these variables are not properties of the object, but actually variables of some other, auxiliary state. If there is logically one of something, it's a member variable of the class in question. If an object embeds a number of them, it's probably someone else's variable.

Fundamentally, my problem with this is that we're embedding some client's data within the object itself. When the client code operates on the object, it is no longer a pure transformation of one object into another, it is now a conceptually more complex, confusing transformation of one object, to the same object with new state.

This additional state presents serious problems to multi-threading a piece of code. When foreign state is embedded unnecesarily in an object like this it makes it much more difficult to manage the ownership, lifetime and correctness of data. All the other clients of the Object are burdened with helping manage the state of some other system. Appropriate ownership, coupling, and lifetime of data is all-important in parallel programming.

Our problem is that we have inverted the "has-a" relationship. The object does not "have a" state in a number of contexts. In reality, the contexts have the state and they "have", or refer to the object.

Haskell Function Signature Analysis

I increasingly find that Haskell's function signature notation is a lucid yet lightweight means of analysing data flow. There are many other alternative, successful methods, yet I find the Haskell analysis direct and natural. These days I find myself thinking directly in these terms to understand data transformations.

In Haskell terms, we have:


f :: Object -> Object


Or possible even:


f :: Object -> (Object, State)


We're not sure! We would prefer:

f :: Object -> State


Usage of State

The next question is, how is this state really used? It is not really part of an object's update function, as the multiple contexts suggest the data is not part of the object. So, what is the client code of the object doing with the state? It doesn't really matter. Whatever it's doing, we can infer that:
a) The state is a product of the object. If it were not, it would not be in the object.
b) We have multiple contexts, implying multiple passes over the data.

When a piece of state is the product of an object, it is likely that it is intermediate data. Given that we expect multiple passes over the data, we're going to be caching a lot of pieces of state data that have nothing to do with the context in question. We could switch the state representation from SoA to AoS, but it's still data in the wrong place.

Let's say we have 10 'contexts' and 100 objects. Let's say the object is 20 bytes, and the 'state' is 6 byte. We have a total data volume of 100 * (20 + 6 * 10) bytes, of which 100 * (20 + 6) bytes is used. Out of the total data volume, we're therefore using only 33% of the total data volume.

So, if we switch from the:

[Object] -> [Object]

implementation to a

[Object] -> [State]

implementation, we now have 100 * 20 + 100 * 6 bytes total data volume and 100% of the data is used.

Where is the data going?

The previous transformation is essentially a map operation. We're applying a function a list of Objects, and returning a list of State objects. A state is not an endpoint of a data processing; it implies that further processing is going to use the data.

Returning to Haskell, let's say our original state-producing function is f:

state = map f objects

Something else, g, is going to use that data:

output = map g (map f objects)



Or in terms of function application:

g . f

Here's our first big clue. All that data that originally lived inside the object, most likely on the heap, is really just intermediate data polluting the cache. It's just input to a succeeding function. We can most likely eliminate that "state" data altogether and pass the output directly through to the succeeding function. Or, we can store the state in some piece of intermediate thread-local storage. Either way, we've now got a clearer understanding of the real data flow in the program.

Eliminating the containing Object

The next question becomes, which bits of the Object really contribute to the new output State?

As the State is the product of an Object, and Object is non-trivial it is unlikely that the State is a product of the entire Object. It is probably some subset of member variables.

To use something of a straw man example, suppose the code in question is a visibility operation. Suppose that Object holds a bounding volume, some geometry, and the State holds various visibility test results, distances, LODs and the likely. Our transformation is therefore more likely to be:

BoundingVolume -> LodDescription -> State

Most of the Object's data is not used here. We've passed a subset of the Object's data to some function. This is a naive pattern in C++ code. Useful, general, common patterns of code are often hidden behind a facade of encapsulation. Unrelated pieces of data (in a given context) are brought in unnecessarily. When we develop a semi-abstract purely functional description of the operation, we can build a much clearer understanding of the data flow and dependencies.

Now imagine rewriting and structuring this operation in C++ based on this understanding of the data. Arguably an experienced programmer can arrive directly at the same conclusion, but I find the Haskell notation a useful intermediate tool for reasoning about data flows.

Thursday, 7 April 2011

Ray-Triangle Intersection in Haskell

Recently, I've been experimenting with different ray-triangle intersection algorithms. There are many alternatives out there, and many more optimisations and special cases, but I was looking for an version that ran quickly, and "suited" functional programming.

I settled on the half-planes solution. This algorithm defines a plane at each edge of the triangle by crossing the triangle's normal and the edge direction. To check if a triangle is inside the triangle, you simply check to see if a point (on the plane of the triangle) is on the same side of all of those planes.

This costs a little extra storage space, but it's a very fast test.

So. How does the code pan out? Well, pointInsideTriangle is simple enough:



distanceToPlane :: Shape -> Vector -> Float
distanceToPlane (Plane !norm !dist) !pos = pos `dot3` norm + dist
distanceToPlane _ _ = undefined



pointInsideTriangle :: Triangle -> Position -> Bool
pointInsideTriangle !tri !point = foldr (&&) True $ map (\pln -> (distanceToPlane pln point) >= 0) (halfPlanes tri)

This is best read right-to-left. The map applies a lambda function to each of the half planes. This function simply returns a bool, telling us if a point is on the correct side of a triangle. This yields a list of bools. We need to turn that into a single result; this is done using a fold to reduce many values to a single Bool. I've deliberately used a right-fold so that the fold lazily consumes only the necessary parts of the list. It can early-out.

So there we have it. One line of Haskell (with a couple of helper functions) to do the test.

The rest of the Haskell code for a ray-triangle intersection is pretty straightforward:

intersectRayTriangle :: Ray -> Object -> Triangle -> Bool -> Maybe (Float, Triangle)
intersectRayTriangle !ray !obj !triangle !doubleSided
    | doubleSided == False && (direction ray) `dot3` (normal $ plane triangle) > 0 = Nothing
    | otherwise = case shapeClosestIntersect (plane triangle) ray obj of
                    Nothing -> Nothing
                    Just (dist', _) -> if pointInsideTriangle triangle (pointAlongRay ray dist')
                                       then Just (dist', triangle)
                                       else Nothing

Here we have a simple guard condition to permit backface culling. If the triangle is not double-sided, we do a backface culling test and possibly reject it. If not, we fall through to the default case and intersect the ray against the plane, and then test that intersection point for containment in the triangle.

Intersecting against a list is an interesting case. A simple version would just map against the list, and then attempt to find the closest intersection out of the resulting list. However, if we use tail recursion, we can add a parameter to hold our current "state" and therefore maintain a current-closest intersection. This eliminates a second search-the-list step, and also permits an early-reject optimisation:

intersectRayTriangleList :: [Triangle] -> Int -> Maybe (Float, Int) -> Ray -> Object -> Maybe (Float, Int)
intersectRayTriangleList !(x:xs) !index !currentResult !currentRay !obj = intersectRayTriangleList xs (index + 1) newResult newRay obj
    where
      (newRay, newResult) = case intersectRayTriangle currentRay obj x False of
                              Nothing -> (currentRay, currentResult)
                              Just (dist, _) -> (shortenRay currentRay dist, Just (dist, index))
intersectRayTriangleList [] _ !currentResult _ _ = currentResult

The pattern of "searching a list of x and return the closest" occurs frequently in raytracing. This is a potential candidate to be factored out into a re-usable function, or possibly even a monad.

The resulting code is also quite efficient. I'm now running at less than a minute for my test scene of the Cornell Box with 8x8 distributed raytracing. This is 30% quicker than my previously Möller-Trumbore ray-triangle intersection test.

Sunday, 3 April 2011

How to Parallelise A Haskell Raytracer

One of the big attractions of functional languages is that they're theoretically very easily to parallelise as they have no problematic side-effects You only describe how to perform a calculation, and never touch on the details of moving data around the machine. The run-time is then free to process data, given your functional expressions, in parallel.

I've been experimenting with Haskell's basic parallelism primitives, par, and pseq. These primitives trigger ("spark") evaluation of expressions in parallel - but only if the runtime deems it necessary.

After many failed or mildly successful experiments, I've finally managed to get a speedup. In short, I stopped trying to reinvent the wheel myself. I simply used parListChunk:


rayTraceImage :: Camera -> Int -> Int -> SceneGraph -> [Light] -> [Colour]
rayTraceImage camera renderWidth renderHeight sceneGraph lights = map (clamp . tracePixel eyePosition sceneGraph lights) rayDirections `using` parListChunk 256 rseq
    where !rayDirections = [makeRayDirection renderWidth renderHeight camera (x, y) | y <- [0..(renderHeight - 1)], x <- [0..(renderWidth - 1)]]
          !eyePosition = Camera.position camera

Here, we break the list of rays to be traced into data parallel chunks of 256, each free to be evaluated by a different thread. Each of these chunks is processed using map. Map applies the raytracing function to convert a ray direction into a colour, and we clamp the result.

And that's it.

Running it on two threads on my dual-core MacBook Pro halves the execution time.

What's nice is that the core 'map' expression is separated from the details of parallelisation. We've separated the concepts of raytracing and its parallel decomposition into two separate concerns. We are therefore free to vary either without intertwining their concerns.

Not a critical section, lock-free queue or atomic operation in sight!

Clearly, the nature of the functional approach surrenders a lot of low-level implementation details to your platform's runtime. However, it is more frequently the case that when you require such low-level intervention, you are working around deficiencies in your run-time platform, rather than dealing with fundamental algorithm details. Parallelising many algorithms follows established patterns.

It is also worth noting that the raytracer code is currently pretty straightforward code. I've not implemented any particular tricky optimisations, and certainly none of the ideas in work such as Ingo Wald's parallel raytracing PhD thesis. Not yet, anyway...

Thursday, 24 March 2011

Laziness vs Strictness in Haskell

I've just taken over a minute of execution time off my raytracer.

One of Haskell's biggest language features is the lazy evaluation system. By default, Haskell evaluates nearly all expressions lazily, as needed. This is one of the features that drew me to Haskell, and indeed it seems a reasonably apt way of writing a raytracer. Raytracing often consists of selectively traversing different parts of large databases for different parts of the scene. Many acceleration schemes involve simple mechanisms to lazily evaluate results.

Laziness, however, does pose an overhead. Haskell introduces a "thunk" for every lazy expression. A thunk describes how to calculate a deferred expression when required. For some expressions, a thunk is a potential time saver: you can defer work until needed, calculate the required value, then re-use it. For other things, such as simple arithmetic operations like vector addition, a thunk is an overhead often comparable to the original operation.

Haskell offers a couple of ways to avoid thunks and therefore lazy evalation. The main two I use are the BangPatterns extension to explicitly mark something as strictly (ie, non-lazily) evaluated, and seq.

The BangPatterns extension allows you to mark expressions that you want to be explicitly evaluated with a !. So, for a Vector, you would write Vector !Float !Float !Float !Float. This way, you force anything you assign to a vector to be strictly evaluated and stored in the vector.

Seq is used to strictly force the evaluation of a function. Seq has the signature of a -> b -> a. It therefore takes two parameters, and returns the second.

An interesting example of where lazy evaluation proven to be an overheard, rather than a win was my light accumulation function. Given a point in space to be shaded, this accumulates the contribution of every light source. Now, the inputs to this function are going to be different every time this function is called as every ray intersection is going to be unique. Thunks will therefore present an additional overhead.

I rewrote my function to use seq. I use seq to strictly evaluate the application of this function, then process the remainder of the list:


accumulateLight :: [Light] -> Colour -> SceneGraph -> Vector -> Vector -> Material -> Vector -> Colour
accumulateLight (x:xs) acc sceneGraph shadePos normal objMaterial viewDirection = let result = acc + (applyLight sceneGraph shadePos normal objMaterial viewDirection x)
                                                                                  in seq result (accumulateLight xs result sceneGraph shadePos normal objMaterial viewDirection)
accumulateLight [] acc _ _ _ _ _ = acc

This change alone saved over a minute of execution time! Since a raytracer generally consists of many similar kinds of loops, I am now searching for wider application of this technique. It's a very difficult call to make with a limited data set. Some functions may be faster if strictly evaluated with a small dataset, but may perform better if evaluated lazily with a larger dataset.

Profiling will guide me.