When universities teach Java!
(some random bit of disassembled code found on my travels)
00A5CA74 EC7E1824 fdivs f3,f30,f3 56 PIPE
00A5CA78 FC406850 fneg f2,f13
00A5CA7C D0660010 stfs f3,0x10(r6)
00A5CA80 D0C90010 stfs f6,0x10(r9)
00A5CA84 D981FF60 stfd f12,-0xA0(r1) PIPE
00A5CA88 E881FF60 ld r4,-0xA0(r1) 70 (00A5CA84) LHS[01]
00A5CA8C D961FF60 stfd f11,-0xA0(r1) PIPE
00A5CA90 EBA1FF60 ld r29,-0xA0(r1) 70 (00A5CA8C) LHS[01]
00A5CA94 D941FF60 stfd f10,-0xA0(r1) PIPE
00A5CA98 EB81FF60 ld r28,-0xA0(r1) 70 (00A5CA94) LHS[01]
00A5CA9C D921FF60 stfd f9,-0xA0(r1) PIPE
00A5CAA0 EB61FF60 ld r27,-0xA0(r1) 70 (00A5CA9C) LHS[01]
00A5CAA4 D901FF60 stfd f8,-0xA0(r1) PIPE
00A5CAA8 EB41FF60 ld r26,-0xA0(r1) 70 (00A5CAA4) LHS[01]
00A5CAAC D8E1FF60 stfd f7,-0xA0(r1) PIPE
00A5CAB0 EB21FF60 ld r25,-0xA0(r1) 70 (00A5CAAC) LHS[01]
00A5CAB4 D1A6000C stfs f13,0xC(r6) PIPE
00A5CAB8 D029000C stfs f1,0xC(r9)
00A5CABC D801FF60 stfd f0,-0xA0(r1) PIPE
00A5CAC0 E801FF60 ld r0,-0xA0(r1) 70 (00A5CABC) LHS[01]
Wednesday, 24 August 2011
Pattern matching as an optimisation method, and a means of insight
One feature of Haskell that I particularly like is its pattern-matching scheme. To route code down the correct codepath, Haskell uses pattern matching against data constructors and values, instead of extensive flow-control statements. This makes the code much simpler to read: your code merely concerns itself with supplying the correct handler for each case, and the underlying Haskell compiler deals with the flow-control minutiae.
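As a small, hypothetical sketch (the Shape type and surfaceArea function below are invented purely for illustration), this is what dispatch-by-pattern-matching looks like: each constructor gets its own equation, and the compiler handles the routing.

-- Hypothetical example: one equation per constructor, no if/else chains in sight.
data Shape
    = Sphere Float            -- radius
    | Box Float Float Float   -- width, height, depth

surfaceArea :: Shape -> Float
surfaceArea (Sphere r)  = 4 * pi * r * r
surfaceArea (Box w h d) = 2 * (w * h + w * d + h * d)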
I've increasingly found myself using a similar approach in C++ and HLSL using features such as polymorphic functions. A good example comes from optimisation: frequently, you find cases in the code where someone has written a "kitchen sink" function. This function internally handles many cases, and accordingly takes many parameters.
Frequently, many of these parameters are called with specific literal values or constants. For example, many parameters may be zero, which disables entire codepaths inside the function. For me, this is a bad code smell. But why? Many people would argue that since the code is branched away, what's the big deal? Especially if it means we can have just one definition of a function. Well, branches are very expensive on modern processors. On contemporary consoles, a floating point branch can yield a 27-cycle penalty, not to mention all the preamble and branch set-up code.
I've therefore found myself increasingly providing multiple specialised versions of these functions, which omit the parameters that are permanently passed 0.0f or 1.0f and tailor the code accordingly. Obviously this saves a lot of needless computation, branching and I$ misses.
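To make that concrete, here is a hedged sketch (the function and parameter names are hypothetical, not from any real codebase): a "kitchen sink" helper branches on a rim-lighting term that most callers pass as 0.0f, and a specialised overload removes both the parameter and the branch.

// Hypothetical "kitchen sink" helper: most call sites pass 0.0f for
// rimIntensity, yet every call still pays for the test and the branch.
float ComputeLighting(float diffuse, float specular, float rimIntensity)
{
    float result = diffuse + specular;
    if (rimIntensity > 0.0f)
        result += rimIntensity * 0.5f;
    return result;
}

// Specialised overload for the common case: the dead parameter and its
// branch disappear, so callers pay only for the work they actually need.
float ComputeLighting(float diffuse, float specular)
{
    return diffuse + specular;
}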
What's interesting about this approach is that once you've performed a few of these refactorings, you start to expose patterns in the code and its structure that you previously were unaware of. You may find that your major axis of variation lies not in your chosen object hierarchy composition, but elsewhere. You may have a multitude of objects that add little value at a higher level, but now a large set of lower level functions that process the objects differently. Sometimes it makes more sense to structure and divide your code along the lines of what processing it does, not what you think the object "is", as that is the greater source of variation.
Tuesday, 23 August 2011
On the limitations of caching
I've been spending a little time thinking about caching lately. Irradiance caching, optimising some caching in production code, and hardware caches. Here's a little brain-dump.
I've come to the conclusion that if you ever need anything more than a trivially small cache, your algorithm is wrong. This is particularly so for parallel programming, as a cache is often a difficult-to-handle piece of interdependent, shared state.
My reasoning goes a little something like this. If you are relying on a cache for performance, it is usually to re-use the results of some expensive transformation a -> b. Crucially, there needs to be one 'a' to one or more 'b's (otherwise there is no cache reuse!). For an efficient cache, you need:
- To be caching the results of a fairly complex computation,
- To have an efficient means of storing the results of the computation,
- To have a well-behaved cache indexing function.
Now, what often ends up happening is that you iterate over the many 'b' objects, and populate the cache on demand from a common ancestor piece of data, 'a'. And here's the first problem: you are primarily iterating over the wrong dataset, and filtering and culling parts of the other dataset by proxy. You have an implicit data filtering or reduction operation that would be better handled explicitly.
Ok, so, we switch our iteration round to iterate over the source objects 'a'. We iterate over each 'a' object, and produce the resulting one or more 'b' objects. At this point, do we really have a cache? We've turned our cache into a data transformation kernel. All of a sudden our cache is no longer some intermediate state used in a computation; it is now simply a piece of input data. Since we now have a piece of code transforming [a] -> [b], we have code that is very amenable to data parallelism, and since it is decoupled from a cache, it's also much easier to parallelise.
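A minimal sketch of that restructuring, with invented SourceObject/DerivedObject types standing in for 'a' and 'b': rather than filling a cache on demand while walking the 'b' side, we walk the sources once and emit the derived data directly.

#include <vector>

// Hypothetical stand-ins for 'a' and 'b'.
struct SourceObject  { float value; };
struct DerivedObject { float result; };

// Stand-in for the expensive a -> [b] computation the cache used to memoise.
std::vector<DerivedObject> Transform(const SourceObject& a)
{
    return { DerivedObject{ a.value * 2.0f } };
}

// The "cache" becomes a plain data transformation kernel, [a] -> [b].
// Each iteration is independent, so the loop parallelises trivially.
std::vector<DerivedObject> TransformAll(const std::vector<SourceObject>& sources)
{
    std::vector<DerivedObject> results;
    for (const SourceObject& a : sources)
    {
        std::vector<DerivedObject> bs = Transform(a);
        results.insert(results.end(), bs.begin(), bs.end());
    }
    return results;
}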
I think ultimately an extensive caching scheme is evidence that the data domain hasn't been adequately structured and managed. If you need a cache, you probably need more data conditioning, filtering, and reduction.
A good counter-example is the vertex stream processing unit of a modern GPU. These units read huge streams of data, transform them, and maintain minimal cached state. Vertex caches are typically minuscule in comparison to the volume of data they process. Yet a post-transform cache can often maintain a very high level of efficiency. The key to this cache's success is the pre-processing (ie structuring) of the data fed into it. It is not random. It is well conditioned, which yields efficiency at a number of levels, and enables a vertex cache to be used only as a relatively modest addition to the pipeline. The cache is the icing on the cake, not a fundamental component of the algorithm.
I think this is an important lesson when transitioning from writing single-threaded to parallel code. Don't dream up increasingly elaborate caching schemes. Instead think up separable, well-structured data-parallel transformations. When you feel the need for a cache, you probably need to carefully re-examine your data flow structure.
Thursday, 24 March 2011
Laziness vs Strictness in Haskell
I've just taken over a minute of execution time off my raytracer.
One of Haskell's biggest language features is the lazy evaluation system. By default, Haskell evaluates nearly all expressions lazily, as needed. This is one of the features that drew me to Haskell, and indeed it seems a reasonably apt way of writing a raytracer. Raytracing often consists of selectively traversing different parts of large databases for different parts of the scene. Many acceleration schemes involve simple mechanisms to lazily evaluate results.
Laziness, however, does impose an overhead. Haskell introduces a "thunk" for every lazy expression. A thunk describes how to calculate a deferred expression when it is required. For some expressions, a thunk is a potential time saver: you can defer work until needed, calculate the required value, then re-use it. For others, such as simple arithmetic operations like vector addition, a thunk is an overhead often comparable in cost to the original operation.
Haskell offers a couple of ways to avoid thunks and therefore lazy evaluation. The main two I use are the BangPatterns extension, to explicitly mark something as strictly (ie, non-lazily) evaluated, and seq.
The BangPatterns extension allows you to mark, with a !, the things you want strictly evaluated. So, for a Vector, you would write Vector !Float !Float !Float !Float. This way, anything you store in a vector is forced to be evaluated strictly rather than left as a thunk.
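As a hedged sketch of both forms (the magnitudeSq function is invented for illustration): the bangs on the data fields force the components when a Vector is built, and a bang pattern on a local binding forces an intermediate result on the spot.

{-# LANGUAGE BangPatterns #-}

-- Strict fields: each component is evaluated when the Vector is constructed,
-- so no thunks accumulate behind the co-ordinates.
data Vector = Vector !Float !Float !Float !Float

-- A bang pattern on a local binding works the same way: mag is computed
-- immediately rather than deferred.
magnitudeSq :: Vector -> Float
magnitudeSq (Vector x y z w) =
    let !mag = x * x + y * y + z * z + w * w
    in mag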
seq is used to force the evaluation of an expression. It has the signature a -> b -> b: it takes two arguments, evaluates the first (to weak head normal form), and returns the second.
An interesting example of where lazy evaluation proved to be an overhead, rather than a win, was my light accumulation function. Given a point in space to be shaded, this accumulates the contribution of every light source. Now, the inputs to this function are going to be different every time it is called, as every ray intersection is unique, so thunks only present an additional overhead.
I rewrote my function to use seq, forcing the evaluation of each accumulation step before processing the remainder of the list:
accumulateLight :: [Light] -> Colour -> SceneGraph -> Vector -> Vector -> Material -> Vector -> Colour
accumulateLight (x:xs) acc sceneGraph shadePos normal objMaterial viewDirection =
    let result = acc + applyLight sceneGraph shadePos normal objMaterial viewDirection x
    in seq result (accumulateLight xs result sceneGraph shadePos normal objMaterial viewDirection)
accumulateLight [] acc _ _ _ _ _ = acc
This change alone saved over a minute of execution time! Since a raytracer generally consists of many similar kinds of loops, I am now searching for wider application of this technique. It's a very difficult call to make with a limited data set. Some functions may be faster if strictly evaluated with a small dataset, but may perform better if evaluated lazily with a larger dataset.
Profiling will guide me.
Sunday, 27 February 2011
Few more optimisations
Well, I've managed to knock off a further half minute from the execution time. The execution time now stands at 3.5 minutes down from just over 4.
I've achieved this with two main methods: strictness, and rewriting loops to be tail recursive.
By default, Haskell is a lazy language: it defers evaluation of an expression until it is needed by creating a thunk. Whilst thunks may save work, they are not free, and for very small pieces of work or data they provide no net benefit.
Thunks can be eliminated by marking expressions as strictly evaluated. This technique forces an expression to be evaluated when it is encountered, rather than deferred. The Haskell compiler automatically treats some expressions as strict, but a programmer can force an expression to be strict using the BangPatterns language extension. You simply prefix the variables that you wish to be strictly evaluated using a pling (!).
I went through and marked up all of the trivial expressions and fields in the raytracer code. Things like the xyzw fields of a vector, the rgba fields of a colour, fields of view, radii, that sort of thing. Eliminating those thunks provided around half of the total savings.
More information on strictness here: http://www.haskell.org/haskellwiki/Performance/Strictness
Ensuring that loops were properly tail recursive provided the second half of the savings. Haskell does not provide traditional for() loops to iterate over arrays and lists; instead, recursion must be used. This may introduce extra space and time overhead due to the state accumulated at each function call.
Tail recursive functions eliminate this overhead, as they allow the compiler to use the equivalent of a goto, rather than a function call, to move on to the next item of a list. This essentially converts the Haskell code into something resembling a traditional for loop.
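For illustration (a hypothetical sumList, not taken from the raytracer), the two shapes look like this: the first version leaves an addition pending at every level of recursion, while the accumulator version makes the recursive call the last thing the function does.

{-# LANGUAGE BangPatterns #-}

-- Not tail recursive: each call must return before the addition can happen,
-- so the pending additions pile up on the stack.
sumList :: [Float] -> Float
sumList []     = 0
sumList (x:xs) = x + sumList xs

-- Tail recursive: the recursive call is in tail position, so the compiler can
-- reuse the stack frame, much like a for loop. The bang keeps the accumulator
-- from building a chain of thunks instead.
sumList' :: [Float] -> Float
sumList' = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs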
More information here:
http://www.haskell.org/haskellwiki/Tail_recursion
http://book.realworldhaskell.org/read/profiling-and-optimization.html
I applied this optimisation to the light accumulation, object intersection and ray tracing loops. I also took the opportunity to add a few simple optimisations to the main object intersection loop.
So, all in all, a few modest gains. What's becoming very apparent is that to seriously optimise Haskell code I need to learn Core, GHC's intermediate language. 3.5 minutes is still a long time to ray-trace a very simple scene, and I'm sure there must be a lot of win left in there - I could brute-force it in other languages in far less! Hopefully the Core will show some major overhead in the low level vector, colour, lighting and intersection code.
Thursday, 24 February 2011
Baby Steps in Haskell Optimisation
So, now I'm doing depth of field with 64 rays per pixel, my runtime has shot up from a handful of seconds to 5 minutes or so. Time for some Haskell optimisation.
I've been working through the often-cited Haskell optimisation strategies.
"Use Doubles instead of Floats". This one seemed a little unintuitive, but reasonable. On the one hand, many modern processors natively evaluate floating point expressions as doubles, and the Haskell compiler writes claim the Double path is a faster path. On the other hand, it's a larger data footprint. I switched over all of my floats to doubles, and that doubled the execution time. So that one turned out to be an anti-optimisation, in practice.
"Add missing type signatures". The rationale here is that if you give a type signature, you offer the compiler the opportunity to write more specialised code. This one was worth a good 20 seconds of execution time.
"Explictly export only certain symbols". Instead of making everything in your module publically visible, only export certain things. This allows the Haskell compiler the opportunity and freedom to implement optimisations to the now-hidden code. This one turned out to be quite a win, a good 30-40 seconds.
So, with this set of optimisations, I've taken it from 5 minutes down to 4 minutes. That's still a hell of a lot for a compiled program rendering a 640x480 image of a few spheres and planes. I get the (admittedly speculative) gut feeling that the code is spending a lot more time in the Haskell runtime than executing ray tracing code. So, my next set of optimisation avenues are:
* Unboxing
* Strictness
* LLVM
* Parallelism
* Manual CSE
Friday, 18 February 2011
Polymorphism
Polymorphism is one of the object-oriented paradigms that lead to pain when optimising.
Why?
Virtual.
Polymorphism in itself isn't a bad thing. It's a useful design mechanism and has its place. The virtual function call adds a level of abstraction that makes it difficult to port to SPU or GPU architectures and adds a barrier to many optimisation techniques.
A common example is using polymorphism and virtual functions to process a list of heterogeneous types with a common base class.
This is bad because:
(a) You're adding the cost of virtual function calls to every object.
(b) You're dispatching to different pieces of code, in a potentially incoherent order, causing I$ thrash.
(c) You often end up with oversized objects that carry dead code and data around, again hurting cache efficiency.
(d) It's harder to port the code to the SPU (or GPU).
I've been exploring an alternative way of structuring this. I've retained a polymorphic external interface for registering objects of different types in a list. However, rather than have a single add() function that takes the base type, I've explicitly made several overloaded functions that take the derived types. Each of these functions adds the object to a list of objects of the same specific derived type. So, instead of having one big heterogeneous list, internally I have several type-specific homogeneous lists.
This way, I eliminate all virtual calls in my processing-side code. I can now process all of the objects of type X, Y and Z in turn, explicitly calling X::process() over one list, then Y::process(), then Z::process(), and so on. This makes the code much easier to optimise and much easier to port to the SPU, because you no longer need to know about the vtable. I also get much better I$ coherency.
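Here's a hedged sketch of the idea, with hypothetical Particle and Light types standing in for the derived classes: the overloaded add() keeps the caller-facing interface, while processing walks each homogeneous list in one batch with no virtual calls.

#include <vector>

// Hypothetical concrete types; in the original design these would share a
// virtual base class and sit together in one heterogeneous list.
struct Particle { float life;  void process() { life  -= 0.016f; } };
struct Light    { float power; void process() { power *= 0.99f;  } };

class UpdateList
{
public:
    // Overloaded add(): callers still just write list.add(obj), but each type
    // lands in its own homogeneous container.
    void add(const Particle& p) { m_particles.push_back(p); }
    void add(const Light& l)    { m_lights.push_back(l); }

    // Batch processing: no vtable lookups, and the same code runs
    // back-to-back, keeping the instruction cache warm.
    void processAll()
    {
        for (Particle& p : m_particles) p.process();
        for (Light& l : m_lights)       l.process();
    }

private:
    std::vector<Particle> m_particles;
    std::vector<Light>    m_lights;
};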
Who said we must have one list, after all? Why does the contained object have to pay the cost of the abstraction mechanism? Doesn't it make more sense for the containing object to pay the cost? And a cheaper cost at that! Provided the set of types you are dealing with remains reasonable, maintaining multiple separate lists can be an easy way to enhance efficiency and make your code easier to port to SPU.