Monday, 11 April 2011

Designs That Leak State: Using Haskell Notation to Analyse C++ Design

I came across an interesting design failure today. In short, a C++ object was internally storing the state specific to a piece of multi-threaded client code. The client code was leaking its state data into the C++ object. In generalised, abstracted terms:

class Object
    Real member variables
    unsigned char m_state1[MaxContexts];
    float m_state2[MaxContexts];
    bool m_state3[MaxContexts];

Each of the threads would read from 'Object', process it, and store the results in one of the elements of m_state.

It's a bad pattern, but I daresay a common pattern.

What is wrong here?

The first thing to notice is the series of member variables, all arrays, sized by MaxContexts. A series of variables like this is a dead giveaway that really there is semantically some concept in the system that is not explicitly represented in the code. Given that we are trying to represent an object in a series of "contexts", with different values for each, it suggests that these variables are not properties of the object, but actually variables of some other, auxiliary state. If there is logically one of something, it's a member variable of the class in question. If an object embeds a number of them, it's probably someone else's variable.

Fundamentally, my problem with this is that we're embedding some client's data within the object itself. When the client code operates on the object, it is no longer a pure transformation of one object into another, it is now a conceptually more complex, confusing transformation of one object, to the same object with new state.

This additional state presents serious problems to multi-threading a piece of code. When foreign state is embedded unnecesarily in an object like this it makes it much more difficult to manage the ownership, lifetime and correctness of data. All the other clients of the Object are burdened with helping manage the state of some other system. Appropriate ownership, coupling, and lifetime of data is all-important in parallel programming.

Our problem is that we have inverted the "has-a" relationship. The object does not "have a" state in a number of contexts. In reality, the contexts have the state and they "have", or refer to the object.

Haskell Function Signature Analysis

I increasingly find that Haskell's function signature notation is a lucid yet lightweight means of analysing data flow. There are many other alternative, successful methods, yet I find the Haskell analysis direct and natural. These days I find myself thinking directly in these terms to understand data transformations.

In Haskell terms, we have:

f :: Object -> Object

Or possible even:

f :: Object -> (Object, State)

We're not sure! We would prefer:

f :: Object -> State

Usage of State

The next question is, how is this state really used? It is not really part of an object's update function, as the multiple contexts suggest the data is not part of the object. So, what is the client code of the object doing with the state? It doesn't really matter. Whatever it's doing, we can infer that:
a) The state is a product of the object. If it were not, it would not be in the object.
b) We have multiple contexts, implying multiple passes over the data.

When a piece of state is the product of an object, it is likely that it is intermediate data. Given that we expect multiple passes over the data, we're going to be caching a lot of pieces of state data that have nothing to do with the context in question. We could switch the state representation from SoA to AoS, but it's still data in the wrong place.

Let's say we have 10 'contexts' and 100 objects. Let's say the object is 20 bytes, and the 'state' is 6 byte. We have a total data volume of 100 * (20 + 6 * 10) bytes, of which 100 * (20 + 6) bytes is used. Out of the total data volume, we're therefore using only 33% of the total data volume.

So, if we switch from the:

[Object] -> [Object]

implementation to a

[Object] -> [State]

implementation, we now have 100 * 20 + 100 * 6 bytes total data volume and 100% of the data is used.

Where is the data going?

The previous transformation is essentially a map operation. We're applying a function a list of Objects, and returning a list of State objects. A state is not an endpoint of a data processing; it implies that further processing is going to use the data.

Returning to Haskell, let's say our original state-producing function is f:

state = map f objects

Something else, g, is going to use that data:

output = map g (map f objects)

Or in terms of function application:

g . f

Here's our first big clue. All that data that originally lived inside the object, most likely on the heap, is really just intermediate data polluting the cache. It's just input to a succeeding function. We can most likely eliminate that "state" data altogether and pass the output directly through to the succeeding function. Or, we can store the state in some piece of intermediate thread-local storage. Either way, we've now got a clearer understanding of the real data flow in the program.

Eliminating the containing Object

The next question becomes, which bits of the Object really contribute to the new output State?

As the State is the product of an Object, and Object is non-trivial it is unlikely that the State is a product of the entire Object. It is probably some subset of member variables.

To use something of a straw man example, suppose the code in question is a visibility operation. Suppose that Object holds a bounding volume, some geometry, and the State holds various visibility test results, distances, LODs and the likely. Our transformation is therefore more likely to be:

BoundingVolume -> LodDescription -> State

Most of the Object's data is not used here. We've passed a subset of the Object's data to some function. This is a naive pattern in C++ code. Useful, general, common patterns of code are often hidden behind a facade of encapsulation. Unrelated pieces of data (in a given context) are brought in unnecessarily. When we develop a semi-abstract purely functional description of the operation, we can build a much clearer understanding of the data flow and dependencies.

Now imagine rewriting and structuring this operation in C++ based on this understanding of the data. Arguably an experienced programmer can arrive directly at the same conclusion, but I find the Haskell notation a useful intermediate tool for reasoning about data flows.

1 comment:

  1. Curry-Howard strikes again :)

    Haskell types really help me think about problems too. The last time I needed to modify a python or C program I wrote Haskell types for all my functions just to get the design clear in my head. The time after that I saved a step and just used Haskell :)