Friday 17 May 2013

Image Histogram Construction In DX11


Recently, I have been researching some next-gen rendering techniques. There's some great presentations out there by the usual suspects - Epic, Crytek, DICE, Guerilla and so on. Many of the vendors also have good sample code to explore. There's a lot of buzz about certain new techniques... but there's sometimes a little bit of a gap between the academic literature, the presentations and sample code. Sometimes its a little hard for the time-pressed practioner to google an example and jump straight in.

So, I thought I'd show a little example code to help people get started.


Tone mapping using a histogram is becoming a trend in next-generation graphics. People are starting to experiment with moving away from the DX9/360/PS3 style "compute average luminance, apply exposure, apply film response curve" algorithms towards trying something different.

Histogram based methods aren't new. An early, commonly cited example is Ward et al '97. More recent examples are things like Duan et al 2010. The basic idea is that you take an input high dynamic range image, compute a histogram measuring the frequency of different luminances in that scene, and use that to build a response curve. The response curve maps from the scene's luminance range to your output display lumiance range.

So, in one fell swoop, you can take your HDR image and convert it to an LDR image. No repeated downsampling and cache thrashing, no incorrect half-pixel or half-texel offsets, no frame-by-frame adaptation. A simple algorithm that "just works".

This algorithm fits very naturally with compute shaders in DX11. However, not everyone out there is a DX11 expert; there are many practictioners busy shipping titles on DX9 class hardware who have not had the opportunity to experiment with DX11 yet. I shall detail the implementation of a very basic histogram based tone mapping operator that may serve as the basis for future expansion.

Running a Compute Shader Over an Image

Compute Shader Introduction

Compute shaders off a great deal of general, flexible power on modern DX11 GPUs. They are very flexible and offer immense general computation power. They are not, however, a direct replacement for C++ code; they have a specific programming model that you have to follow. Whilst you can ignore some details of this model, you will not achieve maximum performance. This programming model therefore largely dictates how you must structure your algorithm.

An individual compute shader contains the inner most operation of some loop. However, the compute shader is not executed as some kind of iterative loop. Instead, many separate invocations of the compute shader code are scheduled, dispatched and retired. Each invocation of the compute shader is known as a thread, and each thread receives unique addressing information. And these threads are executed in parallel. Threads are grouped into thread groups. Each thread is therefore separately executing some part of the desired operation. It is very akin to a parallel map operation in a functional language.

A compute shader is therefore unlike a general CPU process or job. It is unlike the jobs or task programming model of the Cell processor. It is very similar to shading a pixel. In contrast to pixel shading, compute shader threads can communicate with other concurrent threads. A thread may communicate with other threads in the same group using some on-chip fast memory known as group shared memory.

The conclusion of this is that when processing an image with a compute shader, you do not sit in a big x * y loop, fetching pixels and processing them. You do not DMA in parts of an image, process them, and DMA them out again. You write code that processes each pixel independently and dispatch that work to be processed as a whole in parallel. This requires a certain structure to your code, especially if you want to obtain maximum parallelism.


A common paradigm when executing a compute shader over an image is to break the image up into tiles. Each thread typically corresponds to an input pixel.A thread group maps to a tile. Each thread is executed and aggregates data to the local, on-chip memory known as group shared memory. When all the threads in a group have completed, the results are exported from the transient group shared memory to some persistent off-chip memory.

But what is the reasoning behind this model?

If we liked, we could dispatch a compute shader with a thread group size of 1 x 1 x 1 thread. Inside this one thread of one shader, we would iterate through every single pixel in an image and perform our computation. But, this model offers very little parallelism. We would only enjoy the SIMD instruction level parallelism, and would have no thread-level parallelism. GPUs are built to be very, very wide. We would only be occupying a fraction of the GPU.

Ideally. we want to launch as many threads as possible. This undermines any kind of approach where one thread iterates across many pixels. We want a thread per pixel.

And this is what gives rise to the image tiling model. Here, we break the image into tiles of 16x16, 32x32 or whatever size yields the maximum threads and therefore parallelism possible. We exploit group shared memory rather than work directly in off-chip memory for performance reasons. A modern GPU will not simply sit and churn through every every thread of your compute shader until completion. It will execute wavefronts (groups of threads) from multiple different shaders simultaneously across the GPU. If we can read from off-chip memory but write to on-chip memory, we minimise our bandwidth consumption and play better with other shaders. Ultimately we will need to export the results, but we can do that once in a burst rather than constantly causing memory hazards.

We may need to execute subsequent passes that combine together this intermediate metadata to yield a final result. This arrangement is akin to the common map-reduce parallel programming model.

I will now dive straight into the HLSL compute shader code. For clarity and brevity, I won't present the detailed DX11 CPU-side code; I shall just detail what it needs to do.

Computing the Per-Tile Histogram.

Our first task is to compute a histogram for each tile of the image. These will later be combined to form a histogram for the whole image. Each histogram will map the scene's luminance range across N bins. Each bin holds a uint which represents the number of pixels in that luminance band.

To perform this, we need a few things on the DX11 side:
  • A shader resource view that allows us to read the source texture.
  • A shader resource view for our constant buffer.
  • A large buffer to store all of the per-tile histograms. Its size will be numTiles * numHistogramBins.
  • An unordered access view for the per-tile histogram buffer.
We can then dispatch our compute shader. Note, that we have to round up the image size to be a multiple of our compute size; we must then be careful to only sample the real pixels. Here's the code:


Texture2DMS sourceTextureSRV;

cbuffer toneMapGammaConstants
    uint outputWidth;
    uint outputHeight;

    float inputLuminanceMin;
    float inputLuminanceMax;

    uint numPerTileHistograms;

    float outputLuminanceMin;
    float outputLuminanceMax;

// RGB to luminance
float rgbToLuminance(float3 colour)
     // Use this equation:
     return dot(float3(0.2126f, 0.7152f, 0.0722f), colour);

// RGB to log-luminance
float rgbToLogLuminance(float3 colour)
     return log(rgbToLuminance(colour));

// Convert a luminance to a histogram bin index
uint rgbToHistogramBin(float3 colour)
    // Convert to luminance
    float luminance = rgbToLogLuminance(colour);
    // Work out response curve response
    return uint(float(NUM_HISTOGRAM_BINS) * saturate((luminance - inputLuminanceMin) / (inputLuminanceMax - inputLuminanceMin)));

// Thread groups correspond to a tile.
// Threads correspond to a pixel

groupshared uint histogram[NUM_HISTOGRAM_BINS];

Buffer perTileHistogramSRV;
Buffer mergedHistogramSRV;
RWBuffer perTileHistogramUAV;
RWBuffer mergedHistogramUAV;

void compute_per_tile_histogram_cs(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID)
    uint localIdxFlattened = localIdx.x + localIdx.y * HISTOGRAM_TILE_SIZE_X; // Which pixel within a tile
    uint tileIdxFlattened = groupIdx.x + groupIdx.y * GetNumTilesX(); // Which tile from the screen

    // Initialise the contents of the group shared memory. Only one  thread does it to avoid bank conflict overhead

    if(localIdxFlattened == 0)
        for(uint index = 0; index < NUM_HISTOGRAM_BINS; ++index)
            histogram[index] = 0;


    // For each thread, update the histogram

    // It is possible that we emit more threads than we have pixels. This is caused due to rounding up an image to a multiple of the tile size
    if(globalIdx.x < outputWidth && globalIdx.y < outputHeight)
        // We just use sample 0 for efficiency reasons. It is unlikely that omitting the extra samples will make a significant difference to the histogram
        float3 pixelValue = sourceTextureSRV.Load(int2(globalIdx.xy), 0).rgb;
        uint bin = rgbToHistogramBin(pixelValue);
        InterlockedAdd(histogram[bin], 1u);

    // Thread 0 outputs this thread group's histogram to the buffer
    if(localIdxFlattened == 0)
        // This could write uint4s to the UAV as an optimisation
        uint outputHistogramIndex = NUM_HISTOGRAM_BINS * tileIdxFlattened;
        for(uint index = 0; index < NUM_HISTOGRAM_BINS; ++index)
            uint outputIndex = index + outputHistogramIndex;
            perTileHistogramUAV[outputIndex] = histogram[index];

We now need to take this buffer and combine it to form a single histogram for the whole image.

Merging the Per-Tile Histograms

This is a relatively simple compute shader. In this case, each thread maps to one of the tiles just computed, and each threadgroup therefore represents a group of tiles. This shader atomically adds each tile's histogram to the global histogram. In addition to the previous resources, we also now need:
  • A buffer to hold the histogram for the whole image.
  • An unordered access view to read said buffer.
  • A shader resource view for the per-tile histogram buffer.

// A thread per tile
#define NUM_TILES_PER_THREAD_GROUP 768 // DX10 maximum
[numthreads(NUM_TILES_PER_THREAD_GROUP, 1, 1)]
void merge_histograms_cs(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID)
    // Each thread has the job of adding in the contents of the per-tile histogram to the overall histogram stored in GSM
    // This is an atomic operation, because many, many threads are trying to hit that memory simultaneously
    uint tileIndex = globalIdx.x;
    if(tileIndex < numPerTileHistograms)
        for(uint binIndex = 0; binIndex < NUM_HISTOGRAM_BINS; ++binIndex)
            InterlockedAdd(mergedHistogramUAV[binIndex], perTileHistogramSRV[tileIndex * NUM_HISTOGRAM_BINS + binIndex]);

Note that here we perform atomic operations on the merged histogram. You may do this on both group shared memory and unordered access views; clearly there is extra overhead in the latter. We now have a histogram for the whole image. Our job is now to convert that to a response curve. This is a function that maps from the input HDR image to a new LDR image. Wikipedia offers an easy explanation.

Computing the Response Curve

Again, more resources are needed:
  • A buffer holding a float for each histogram bin.
  • An unordered access view to write to it with.
  • A shader resource view for the whole-image histogram.
There is a small wrinkle in this code. We also include an extra step - the adjustment phase discussed in Page 14 of the Ward '97 paper. This is included only as an example to get you started. I use a 1x1x1 compute shader here as this is such a minimal operation that parallelism would gain little.

RWBuffer responseCurveUAV;

groupshared float frequencyPerBin[NUM_HISTOGRAM_BINS];
groupshared float initialFrequencyPerBin[NUM_HISTOGRAM_BINS];
[numthreads(1, 1, 1)]
void compute_histogram_response_curve_cs(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID)
    // Compute the initial frequency per-bin, and save it
    float T = 0.0f;
    for(uint bin = 0; bin < NUM_HISTOGRAM_BINS; ++bin)
        float frequency = float(mergedHistogramSRV[bin]);
        frequencyPerBin[bin] = frequency;
        initialFrequencyPerBin[bin] = frequency;
        T += frequency;

    // Naive histogram adjustment. There are many, many such histogram modification algorithms you may seek to employ - this is 
    // an example implementation that will no doubt later be changed
    // This is an implementation of page 14 of "A Visibility Matching Tone Reproduction Operator for High Dynamic Range Scenes"
    // There are other, better approaches, like Duan 2010:
    float recipDisplayRange = 1.0f / (log(outputLuminanceMax) - log(outputLuminanceMin));

    // Histogram bin step size - in log(cd/m2)
    float deltaB = (inputLuminanceMax - inputLuminanceMin) / float(NUM_HISTOGRAM_BINS); // Luminance values are already log()d

    float tolerance = T * 0.025f;
    float trimmings = 0.0f;
    uint loops = 0;
        // Work out the new histogram total
        T = 0.0f;
        for(uint bin2 = 0; bin2 < NUM_HISTOGRAM_BINS; ++bin2)
            T += frequencyPerBin[bin2];

        if(T < tolerance)
            // This convergence is wrong - put it back to the original
            T = 0.0f;
            for(uint index = 0; index < NUM_HISTOGRAM_BINS; ++index)
                frequencyPerBin[index] = initialFrequencyPerBin[index];
                T += frequencyPerBin[index];

        // Compute the ceiling
        trimmings = 0.0f;
        float ceiling = T * deltaB * recipDisplayRange;
        for(uint bin3 = 0; bin3 < NUM_HISTOGRAM_BINS; ++bin3)
            if(frequencyPerBin[bin3] > ceiling)
                trimmings += frequencyPerBin[bin3] - ceiling;
                frequencyPerBin[bin3] = ceiling;
        T -= trimmings;

    while(trimmings > tolerance && loops < 10);

    // Compute the cumulative distribution function, per bin
    float recipT = 1.0f / T;
    float sum = 0.0f;
    for(uint bin5 = 0; bin5 < NUM_HISTOGRAM_BINS; ++bin5) // By now you are wondering about bin1, bin2, bin3 etc. It is a shader compiler thing.
        float probability = frequencyPerBin[bin5] * recipT;
        sum += probability;
        responseCurveUAV[bin5] = sum;

And we now have a response curve to apply to our image!

Applying the Response Curve

Again, we need one more new resource: A shader resource view to access the response curve buffer. The code is again quite simple. Here, we're running a compute shader over tiles of the image. I've also included an MSAA resolve and gamma correction step, as these are features you will probably need to implement too in a next-generation linearly correct renderer. Note, that you ideally tone-map before the MSAA resolve. I felt it was simpler to present the code this way around. Secondly, ideally your frame buffer would have separated chrominance from luminance and you would just scale the luminance. Again, I've presented it this way for a simple example.


Buffer responseCurveSRV;
RWTexture2D outputUAV;

void main_cs(uint3 threadID : SV_DispatchThreadID)
    float3 sourceRGB = 0;
    for(uint sampleIndex = 0; sampleIndex < NUM_MSAA_SAMPLES; ++sampleIndex)
        sourceRGB += sourceTextureSRV.Load(int2(threadID.xy), sampleIndex).rgb;
    sourceRGB /= NUM_MSAA_SAMPLES;

    // Work out response curve response
    uint bin = rgbToHistogramBin(sourceRGB);
    float scalar = bin < NUM_HISTOGRAM_BINS ? responseCurveSRV[bin] : 1.0f;
    sourceRGB *= scalar;

    outputUAV[threadID.xy] = float4(saturate(pow(sourceRGB, 1.0f / 2.2f)), 1.0f));}

That's It

And there we have it. A simple, bare-bones implementation to illustrate how these things may be implemented. Something to get the practitioner up and running.