So, now I'm doing depth of field with 64 rays per pixel, my runtime has shot up from a handful of seconds to 5 minutes or so. Time for some Haskell optimisation.
I've been working through the often-cited Haskell optimisation strategies.
"Use Doubles instead of Floats". This one seemed a little unintuitive, but reasonable. On the one hand, many modern processors natively evaluate floating point expressions as doubles, and the Haskell compiler writes claim the Double path is a faster path. On the other hand, it's a larger data footprint. I switched over all of my floats to doubles, and that doubled the execution time. So that one turned out to be an anti-optimisation, in practice.
"Add missing type signatures". The rationale here is that if you give a type signature, you offer the compiler the opportunity to write more specialised code. This one was worth a good 20 seconds of execution time.
"Explictly export only certain symbols". Instead of making everything in your module publically visible, only export certain things. This allows the Haskell compiler the opportunity and freedom to implement optimisations to the now-hidden code. This one turned out to be quite a win, a good 30-40 seconds.
So, with this set of optimisations, I've taken it from 5 minutes down to 4 minutes. That's still a hell of a lot for a compiled program rendering a 640x480 image of a few spheres and planes. I get the (admittedly speculative) gut feeling that the code is spending a lot more time in the Haskell runtime than executing ray tracing code. So, my next set of optimisation avenues are:
* Unboxing
* Strictness
* LLVM
* Parallelism
* Manual CSE
Very interesting series of posts! Thanks for publishing these notes for the rest of us to benefit from.
ReplyDeleteIn theory, single-precision floats should be able to be processed in SSE2 registers for a roughly four-fold increase in throughput. I wonder if the Haskell compiler has any facility for this kind of optimization?
Perhaps if the numbers were recast as quads of m128d values, the routines could be recompiled to take SSE2 into account. I wonder if this would flip the pendulum back around to single precision being faster? :-)
Of course, it depends on how much resolution you need in your results.
I suspect for most graphical applications single precision is plenty...
Hi Brent,
ReplyDeleteGlad you're enjoying the posts :-)
I noticed that the compiler has a -fsse flag. I haven't got around to checking it yet, but it certainly sounds very promising for this kind of application.
I agree that single-precision floats are enough for this kind of application. I find it strange that the Haskell literature asserts that Doubles are preferable for speed. I mean, what optimisations can be applied to a Double that are not also a win (or at least no penalty) for a Float, too?
I've now got the tracer down to just over a minute, but looking at the .hc files there is still a lot of bloat even in the vector code. I may search for a pre-authored high performance 3d Vector library. Surely someone has solved this problem before!
If not, a high-performance, SSE-enabled 3D vector library would be a big win for Haskell.