Friday, January 2, 2015

Measuring the Performance of Halide Convolutions

In the previous post I detailed five ways to express a 3x3 Gaussian smoothing filter in Halide.  I was curious whether the choice of algorithm expression has any impact on performance.  After all, I interpret Halide's promise to "write the algorithm once and then search for the optimal schedule" (not a direct quote) as saying that, all things being equal, the particular implementation of the algorithm is not very important as long as it is functionally correct.  So off I went to do some experimenting and data collection.

To perform the tests, I used rgb.png (from the Halide distribution) as the input image.  This image has dimensions 768x1280 and a 24-bit RGB format, which means that each of the three color channels (red, green, and blue) is represented by 8 bits.
My workstation uses an Intel i7-3770 CPU, which supports SSE, SSE2, SSE3, SSE4.1, SSE4.2, and AVX.  On a Linux machine you can learn about your CPU by invoking:
  $ cat /proc/cpuinfo
Interestingly, the cpuid program did not list all of the vectorization features supported by the CPU, so I double-checked here.
I selected a set of schedules for the separable Gaussian implementations (those expressed using two Halide functions) and a different set of schedules for the non-separable implementations.  I ran each schedule 50 times in a loop and calculated the mean after removing the two smallest and two largest time samples.  It takes Halide a couple of rounds to "warm up", which I found a bit strange since I invoked the JIT compiler before starting each 50-run loop; perhaps some code gets pulled into the instruction cache.  I also calculated the standard deviation of each 50-run loop.  Naturally, schedules using parallelization show more jitter.
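For reference, the timing code looks roughly like the sketch below.  This is a simplified version, not my exact harness; the function name and the hard-coded 50-run count are just for illustration.

#include "Halide.h"
#include <algorithm>
#include <chrono>
#include <vector>

// Simplified sketch of the measurement loop: compile once up front, time 50
// realizations, drop the two fastest and two slowest samples, and return the
// mean in milliseconds.
double measure(Halide::Func f, int width, int height, int channels) {
    f.compile_jit();                       // invoke the JIT compiler before timing
    std::vector<double> samples;
    for (int i = 0; i < 50; i++) {
        auto t0 = std::chrono::high_resolution_clock::now();
        f.realize(width, height, channels);
        auto t1 = std::chrono::high_resolution_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    double sum = 0;
    for (size_t i = 2; i + 2 < samples.size(); i++)  // skip the 2 smallest and 2 largest
        sum += samples[i];
    return sum / (samples.size() - 4);
}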

Before I show the results, I want to discuss some interesting things I observed along the way.

Simple update steps impact performance

Update steps are scheduled separately, but I never expected that a simple update such as the division by 16 below could impact performance.  I expected that the default inline schedule would be used and that, since the data is readily available in the cache, the update would be painless.  Below are two implementations of the same filter, the first with an extra update step and the second without it:

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                      input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                      input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1;
    gaussian(x,y,c) /= 16;
    return gaussian;
}

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = (input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                       input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                       input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1) / 16;
    return gaussian;
}

How much does this affect performance?  That really depends on the algorithm and the schedule, in ways that I cannot fully explain.  I ran the second Gaussian function (gaussian_3x3_2) with and without an update step and sometimes got better results, sometimes worse.  This is shown in the second and third result rows of the table below (2-update and 2-no-update).  Among the best-performing schedules of gaussian_3x3_2, the no-update implementation provided the best results.
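For completeness, the update step can also be given an explicit schedule instead of being left at the default.  Something along these lines is possible in Halide (the vector width here is arbitrary):

    // The pure definition and the update definition are scheduled separately;
    // update(0) refers to the first update step, i.e. "gaussian(x,y,c) /= 16".
    gaussian.vectorize(x, 8);
    gaussian.update(0).vectorize(x, 8);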

Casting operations also impact performance

In the previous post I discussed why we need to cast the 8-bit pixel channel inputs before calling the Gaussian filter.  If we want to realize the results of the Gaussian filter to a 24-bit PNG image, then we also need to cast the results back to uint8_t.  I found that I get the best results when I perform this cast as part of the Gaussian filter itself.  But if I insist on casting the results of the filter (i.e. after exiting the filter function), then that cast is a legitimate part of the Halide pipeline and should be scheduled and optimized like all other parts:
    Halide::Var x,y,xi,yi,c;
    Halide::Func padded, padded32, test2;

    padded(x,y,c) = input(clamp(x, 0, input.width()-1), clamp(y, 0, input.height()-1), c);
    padded32(x,y,c) = Halide::cast<int32_t>(padded(x,y,c));
    Halide::Func test = gaussian_3x3_5(padded32, Separable2dConvolutionSched(7));
    test.compute_root();
    test2(x,y,c) = Halide::cast<uint8_t>(test(x,y,c));
    test2.vectorize(x, 4).parallel(y);

The choice of schedule varies widely with the implementation of the algorithm

In the table below you can see seven schedules applied to six different implementations of the Gaussian 3x3 algorithm.  For each {algorithm, schedule} pair I provide the mean and standard deviation.  The first three implementations (1, 2-update, 2-no-update) are single-function implementations, while the latter three use two functions (separable convolution kernels).  That means that the seven schedules of the first three algorithms are different from the schedules of the latter three.  Nonetheless, it is clear that the best choice of schedule varies widely with the implementation of the algorithm, as expected.
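To give a flavor of the two schedule families (these are illustrative examples, not the exact seven schedules from the table): a single-function implementation has only one Func to schedule, while a separable implementation also has to decide where gaussian_x is computed relative to gaussian_y.

    // Single-function implementation: only the one Func gets scheduling directives.
    gaussian.vectorize(x, 8).parallel(y);

    // Two-function (separable) implementation: in addition to scheduling gaussian_y,
    // we choose where gaussian_x is computed and stored relative to it.
    gaussian_x.compute_at(gaussian_y, y).vectorize(x, 8);
    gaussian_y.vectorize(x, 8).parallel(y);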

Separable kernels are faster

Gaussian 3x3 is a separable filter, which means that its kernel can be expressed as the outer product of two vectors.  The number of operations in the non-separated Gaussian is roughly MxNx3x3 when applied to an input image of MxN pixels; for the separated version it is roughly MxNx(3+3).  Fewer computations, so we would expect it to run faster, and indeed the results show that the separated implementation is the fastest (of course, I could be wrong; it is possible that I have not found the optimal schedules for the non-separable versions).  This is expected, but also a bit disheartening: it means that we still need to think about optimizing the algorithm itself, not just the schedule.
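Concretely, the 3x3 kernel used here factors into the outer product of [1 2 1] with itself, and the 1/16 normalization splits into the two 1/4 factors that appear in the separable implementations below:

    [1]               [1 2 1]
    [2] x [1 2 1]  =  [2 4 2]
    [1]               [1 2 1]

Per pixel that is 9 multiply-accumulates for the 2-D kernel versus 3 + 3 = 6 for the two 1-D passes.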

Inline reductions perform best

Functions gaussian_3x3_4 and gaussian_3x3_5 are the same except for how they do the accumulation.  Function gaussian_3x3_4 uses the Halide::sum inline reduction while gaussian_3x3_5 accumulates over the reduction domain using an update step.  The results speak for themselves: using the Halide::sum inline reduction gives almost 14-fold better results than using an update step (compare the best results in the rows for functions 4 and 5 in the table below).


Halide::Func gaussian_3x3_4(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) = sum(input(x+r.x, y, c) * k(r)) / 4;
    gaussian_y(x,y,c) = sum(gaussian_x(x, y+r, c) * k(r)) / 4;
   
    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;
}

Halide::Func gaussian_3x3_5(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) += input(x+r.x, y, c) * k(r) / 4;
    gaussian_y(x,y,c) += gaussian_x(x, y+r, c) * k(r) / 4;

    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;

}


Larger vector sizes perform better

OK, this one was predictable, but it's nice to see it confirmed empirically.  Just make sure your machine actually supports the vector width you're trying out.
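To make the widths concrete (my reading of how they map to registers, given that the pixel data is widened to 32 bits): a vector width of 4 corresponds to a 128-bit SSE register, a width of 8 maps onto the 256-bit AVX registers where the target supports them, and requesting a wider vector than the hardware provides simply makes the code generator split the work across several instructions.

    // Width 4: 4 x 32-bit lanes, one 128-bit SSE register per operation.
    gaussian_y.vectorize(x, 4);

    // Width 8: 8 x 32-bit lanes, mapping onto 256-bit AVX registers where available
    // (otherwise the work is split across narrower instructions).
    // gaussian_y.vectorize(x, 8);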


A sample set consisting of 50 measurements is usually too small

The data I present in the table below consists of 50 samples per test, but I noticed that sometimes there was variation in the results (the mean of the 50 samples) between two test runs (of 50 samples each) that can't be explained by the standard deviation.  When I increased the sample set to 500 samples I got very stable results (I didn't try anything between 50 and 500, out of laziness).

Really bad schedules have a really high sample variance

I need to understand why this happens.  I would have expected the high variance to appear when thread parallelization is used at a fine granularity (frequent context switches), but it seems to work the other way around.  I am missing something.

Choosing tile size is a delicate act

My workstation has 4x32KB 8-way set-associative L1 data caches and 4x256KB 8-way set-associative L2 caches.  Plenty of memory, no doubt.  The largest tile size with which I managed to achieve good performance is 256x32x4 = 32KB.  Remember that each 8-bit pixel channel value is expanded to 32 bits to prevent overflow, which is why I multiplied the tile dimensions by 4 bytes.  This limits both the vector sizes and the tile sizes.  Now, if we also parallelize the tile computation, then Halide allocates several instances of the temporary tile buffer, and I suspect this is why I saw the optimal tile size peak at 256x32.
Finally, pay attention to the relationship between your tile (or split) sizes and the vectorized dimension.  The dimension you choose to vectorize should not be smaller than the vector size you choose.  Once you set a size for that dimension, you can try several values for the second dimension while keeping the size of the entire tile within the cache bounds.
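Putting the two constraints together, the schedules in this family look roughly like the sketch below (illustrative, not necessarily the exact best schedule from the table): a 256x32 tile of 32-bit intermediates is 256x32x4 = 32KB, one core's L1 data cache, and the vectorized dimension (xi, of size 256) is comfortably larger than the vector width of 8.

    // Tile so the 32-bit intermediates of one tile fit in a single L1 data cache,
    // vectorize the innermost x dimension, and parallelize across rows of tiles.
    gaussian_y.tile(x, y, xi, yi, 256, 32)
              .vectorize(xi, 8)
              .parallel(y);
    // Compute the horizontal pass per tile so its output stays cache-resident.
    gaussian_x.compute_at(gaussian_y, x);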


Results of running different Gaussian 3x3 implementations with different schedules
