Polymorphism is one of the object oriented paradigms that leads to pain when optimising.
Why?
Virtual.
Polymorphism in itself isn't a bad thing. It's a useful design mechanism and has its place. The virtual function call adds a level of abstraction that makes it difficult to port to SPU or GPU architectures and adds a barrier to many optimisation techniques.
A common example is using polymorphism and virtual functions to process a list of heterogenous types with a common base class.
This is bad because:
(a) You're adding the cost of virtual function calls to every object
(b) You're dispatching to different pieces of code, in a potentially incoherent order, causing I$ thrash.
(c) You often end up with oversized objects that carry dead code and data around, again hurting cache efficiency
(d) It's harder to port the code to the SPU (or GPU)
I've been exploring an alternative way of structuring this approach. I've retained a polymorphic external interface for registering objects of different types in a list. However, rather than have a single add() function that takes the base type, I've explicitly made several overloaded functions that take the derived types. Each of these functions adds that object to a list of objects of the same specific derived type. So, instead of having one big heterogenous list, internally I have several type-specific homogenous lists.
This way, I eliminate all virtual calls in my processing-side code. I can now process all of the objects of type X, Y and Z. I can explicitly call X::process() in a list, Y::process(), Z::process() and so on. This makes the code much easier to optimise and much easier to port to the SPU because you no longer need to know about the vtable. I now have much better I$ coherency.
Who said we must have one list, after all? Why does the contained object have to pay the cost of the abstraction mechanism? Doesn't it make more sense for the containing object to pay the cost? And a cheaper cost at that! Provided the set of types you are dealing with remains reasonable, maintaining multiple separate lists can be an easy way to enhance efficiency and make your code easier to port to SPU.