Friday, January 23, 2015

Revisiting the Active Object Pattern - with C++11 Closures

I have a confession: with the never-ending things going on (you know: life ;-) I missed the (fairly) recent changes in the C++ language.  C++ was my first OO language and it probably remains my favorite.  I can't help but love the mix of high-level abstractions with metal-grinding pointer arithmetic - it's like a cool sports car with a manual transmission. Beauty and power.  You have to be more alert, more involved. That's part of the fun - you are taking control.  But for some time now C++ has felt old, tired, disconnected from the endless stream of new languages.  Until C++11 came along.

C++11  (formerly known as C++0x) is the most recent version of the standard of the C++ programming language. It was approved by ISO on 12 August 2011. C++11 includes several additions to the core language and extends the C++ standard library, incorporating most of the C++ Technical Report 1 (TR1) libraries.
Bjarne Stroustrup wrote that:
Surprisingly, C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
And it does feel like a new language. And this is exciting for geeks like me.  In this blog post I discuss how I implemented Schmidt's Active Object pattern in a novel way using C++11 Closures.

First, another confession: for a long time I've suffered from Node envy.  Node.js envy, to be precise.  Look at this "Hello World" JavaScript code:
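Something along these lines - the canonical Node.js "Hello World" HTTP server, with the request handler passed inline as an anonymous function (a sketch, not necessarily the exact snippet):

var http = require('http');

// The request handler is an anonymous function handed straight to
// createServer - the "callback pattern" at its most compact.
http.createServer(function (request, response) {
    response.writeHead(200, {'Content-Type': 'text/plain'});
    response.end('Hello World\n');
}).listen(8124);

console.log('Server running at http://127.0.0.1:8124/');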



What I "envy" is not the use of asynchronous I/O operations with callbacks ("the callback pattern"), but the compelling beauty of lambda functions.  Lambda functions simplify asynchronous programming because they allow us to write code that is seemingly synchronous.  The code executed by the lambda function is temporally disjoint from the code that precedes it, and yet the two parts are spatially co-located.  The outcome is smaller, tighter code that feels more natural and is easier to read and maintain.  And this can be done in C++11.

I won't discuss C++11 lambda functions because others have done this better than I can.  This article is an example of some of the great coverage you can find on the net (Alex Allain has lots of interesting material to read).  But I do want to touch on the difference between lambda functions and closures, since my implementation below uses closures.  Lambda functions are anonymous functions that don't need to be bound to a name and can be specified as lambda expressions.  A closure is a lambda function that "closes" over the environment in which it was specified, meaning that it can access the variables available in the referencing environment.  Alex Allain's article (which I referenced above) doesn't make a big distinction between lambdas and closures and simply treats closures as lambdas with "variable capture".  Syntactically, C++ lambdas and closures are almost identical, so the distinction is slight, yet I think the semantic difference is important to note.
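A minimal illustration of that slight syntactic difference - the only thing that changes is the capture list:

#include <iostream>

int main() {
    int y = 41;

    // A plain lambda: an anonymous function with no access to the enclosing scope.
    auto lambda  = [](int x) { return x + 1; };

    // A closure: the same syntax plus a capture list, "closing" over y
    // from the referencing environment.
    auto closure = [y](int x) { return x + y; };

    std::cout << lambda(1) << " " << closure(1) << std::endl;   // prints: 2 42
    return 0;
}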

On to Active Object.

Douglas Schmidt describes the Active Object design pattern in Pattern Oriented Software Architecture (Volume 2: Patterns for Concurrent and Networked Objects):
The Active Object design pattern decouples method execution from method invocation to enhance concurrency and simplify synchronized access to objects that reside in their own threads of control. 
Once again, I don't want to paraphrase the work of others, so I assume that you are knowledgeable about the details of the Active Object pattern.  If not, you should probably familiarize yourself with the pattern before reading on.
To illustrate my ideas, I will only concentrate on one variation of the Active Object pattern.  In this variation the Client and Proxy are "folded" into the same object and the Scheduler and ActivationList implement a simple message queue policy (this is reminiscent of Schmidt's original AO paper, which he later expanded on).  I think this is probably the most prevalent variation of the pattern - in which we want to serialize access to an object, and use an in-order queue (FIFO) to "bounce" the method invocation from one thread to another.
Let's look at the example code from the Wikipedia entry on Active Object.  The Wikipedia code is implemented in Java, so I went ahead and reimplemented it using C++11.  The comments in the code explain the logic.
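The shape of the closure-based implementation is roughly this (a minimal sketch; the class and method names here are mine, not the Wikipedia example's).  The object owns a worker thread and a FIFO of std::function<void()> tasks; each public method simply enqueues a lambda that captures its arguments:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class MyActiveObject {
public:
    MyActiveObject() : done(false), worker([this] { run(); }) {}

    ~MyActiveObject() {
        // A "poison pill" closure stops the dispatch loop.
        enqueue([this] { done = true; });
        worker.join();
    }

    // Public method: invocation only. The argument is captured in a
    // closure and enqueued; execution happens later, on the worker thread.
    void doSomething(double v) {
        enqueue([this, v] { val = v; });
    }

private:
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m);
            fifo.push_back(std::move(task));
        }
        cv.notify_one();
    }

    // Scheduler/dispatcher: dequeue closures and execute them, one at a
    // time, on the object's own thread of control.
    void run() {
        while (!done) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return !fifo.empty(); });
                task = std::move(fifo.front());
                fifo.pop_front();
            }
            task();
        }
    }

    double val = 0;
    bool done;
    std::deque<std::function<void()>> fifo;
    std::mutex m;
    std::condition_variable cv;
    std::thread worker;
};

The per-method cost is the one-line lambda in doSomething(): the closure captures the argument at the call site and the worker thread executes it later, in FIFO order.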

The more "traditional" method of implementing ActiveObject in C++ involves defining two sets of interfaces: a public interface and a private interface.  Every method in the public interface also appears in the private interface.  The public interface is used by clients to invoke methods on the object; each public method creates a message describing the request and its parameters and enqueues it.  The private interface is used by the dispatcher, which dequeues messages and invokes the corresponding private method.  This works well enough, but it creates big classes with a lot of extraneous code that exists just to make the mechanics work.  Every change to the interface requires a series of changes (public interface, private interface, message definition).
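A sketch of what that looks like (illustrative names; the dispatcher thread itself is omitted) - note how much of it is pure plumbing:

#include <deque>
#include <mutex>

class TraditionalActiveObject {
public:
    // Public interface: invocation only.
    void doSomething(double v) { enqueue({MsgType::DoSomething, v}); }

private:
    enum class MsgType { DoSomething };
    struct Message { MsgType type; double arg; };

    // Private interface: execution only, called by the dispatcher thread.
    void doSomethingImpl(double v) { val = v; }

    void enqueue(Message m) {
        std::lock_guard<std::mutex> lock(mtx);
        queue.push_back(m);
    }

    // One iteration of the dispatcher loop.
    void dispatchOne() {
        Message m;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (queue.empty()) return;
            m = queue.front();
            queue.pop_front();
        }
        switch (m.type) {                  // the switch that the closure
        case MsgType::DoSomething:         // version makes unnecessary
            doSomethingImpl(m.arg);
            break;
        }
    }

    double val = 0;
    std::deque<Message> queue;
    std::mutex mtx;
};

Adding a method means touching the public interface, the private interface, the Message definition and the switch - exactly the extraneous mechanics described above.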

A somewhat more sophisticated implementation uses functors.  We no longer need the code that switches on the message type when we grab a message from the FIFO and dispatch it.  But the added sophistication probably just adds a layer of obfuscation if you are not familiar with the underlying idiom.  We gain too little for this to be worthwhile.
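For comparison, a sketch of the functor-based variant (again, illustrative names): the public method binds its arguments into a callable, so the dispatcher no longer switches on a message type, but std::bind and the member-function-pointer syntax are the idiom you need to be comfortable with:

#include <deque>
#include <functional>
#include <mutex>

class FunctorActiveObject {
public:
    // Bind the arguments into a callable instead of building a typed Message.
    void doSomething(double v) {
        enqueue(std::bind(&FunctorActiveObject::doSomethingImpl, this, v));
    }

private:
    void doSomethingImpl(double v) { val = v; }

    void enqueue(std::function<void()> f) {
        std::lock_guard<std::mutex> lock(mtx);
        queue.push_back(std::move(f));
    }

    // One iteration of the dispatcher loop: invoke whatever was dequeued.
    void dispatchOne() {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (queue.empty()) return;
            f = std::move(queue.front());
            queue.pop_front();
        }
        f();   // no switch needed
    }

    double val = 0;
    std::deque<std::function<void()>> queue;
    std::mutex mtx;
};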

Now let's come full circle and return to the Closure implementation of ActiveObject and add a few features to it.

Friday, January 2, 2015

Measuring the Performance of Halide Convolutions

In the previous post I detailed five ways to express a 3x3 Gaussian smoothing filter in Halide.  I was curious to find out whether the choice of algorithm expression has any impact on performance.  After all, I interpret Halide's promise to "write the algorithm once and then search for the optimal schedule" (not a direct quote) as telling us that, all things being equal, the algorithm implementation is not very important as long as it is functionally correct.  So off I went to do some experimenting and data collection.

To perform the tests, I used rgb.png (from the Halide distribution) as the input image.  This image has dimensions 768x1280 and a 24-bit RGB format, which means that each of the three color channels (red, green, and blue) is represented by 8 bits.
My workstation uses an Intel i7-3770 CPU, which supports SSE, SSE2, SSE3, SSE4.1, SSE4.2 and AVX.  On a Linux machine you can learn about your CPU by invoking:
  $ cat /proc/cpuinfo
Interestingly, the cpuid program did not list all of the vectorization features supported by the CPU, so I double-checked here.
I selected a set of schedules for the separable Gaussian implementations (those expressed using two Halide functions) and a different set of schedules for the non-separable implementations.  I ran each of the schedules 50 times in a loop and calculated the mean value after removing the two smallest and two largest time samples.  It takes Halide a couple of rounds to "warm up", which I found a bit strange since I invoked the JIT compiler before starting each 50-run loop.  Perhaps some code gets mapped to the instruction cache.  I also calculated the standard deviation of each 50-run loop. Naturally, schedules using parallelization show more jitter.
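A sketch of the measurement loop described above (the trimming and the statistics are the point; the actual harness, image setup and schedule selection are omitted, and the realize call in the comment is just an example):

#include <algorithm>
#include <chrono>
#include <cmath>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Time a callable 'samples' times, drop the two smallest and two largest
// measurements, and return the mean and standard deviation of the rest.
std::pair<double, double> measure(const std::function<void()> &run, int samples = 50) {
    std::vector<double> t;
    for (int i = 0; i < samples; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        run();   // e.g. [&] { gaussian.realize(width, height, channels); }
        auto stop = std::chrono::high_resolution_clock::now();
        t.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(t.begin(), t.end());
    t.erase(t.end() - 2, t.end());        // drop the two largest samples
    t.erase(t.begin(), t.begin() + 2);    // drop the two smallest samples
    double mean = std::accumulate(t.begin(), t.end(), 0.0) / t.size();
    double var = 0.0;
    for (double v : t) var += (v - mean) * (v - mean);
    return std::make_pair(mean, std::sqrt(var / t.size()));
}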

Before I show the results, I want to discuss some interesting things I observed along the way.

Simple update steps impact performance

Update steps are scheduled separately, but I never expected that a simple update such as the division by 16 in the first implementation below could impact performance.  I expected that the default inline schedule would be used and that, since the data is readily available in the cache, the update would be painless.  I pasted below two filter implementations, the first with an extra update step and the second without it:

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                      input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                      input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1;
    gaussian(x,y,c) /= 16;
    return gaussian;
}

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = (input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                       input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                       input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1) / 16;
    return gaussian;
}

How much does this affect performance?  It really depends on the algorithm and the schedule, in ways that I cannot explain.  I ran the second Gaussian function (gaussian_3x3_2) with and without an update and sometimes I got better results, sometimes worse.  This is shown in the second and third result rows of the table below (2-update and 2-no-update).  In the best-performing schedules of gaussian_3x3_2, the no-update implementation provided the best results.

Casting operations also impact performance

In the previous post I discussed why we need to cast the 8-bit pixel channel inputs before calling the Gaussian filter.  If we want to realize the results of the Gaussian filter to a 24-bit PNG image, then we also need to cast the results back to uint8_t.  I found that I get the best results if I perform the cast as part of the Gaussian filter.  But if I insist on performing the cast on the results of the filter (i.e. after exiting the filter function), then that cast is a legitimate part of the Halide schedule and should be optimized like all the other parts:
    Halide::Var x,y,xi,yi,c;
    Halide::Func padded, padded32, test2;

    padded(x,y,c) = input(clamp(x, 0, input.width()-1), clamp(y, 0, input.height()-1), c);
    padded32(x,y,c) = Halide::cast<int32_t>(padded(x,y,c));
    Halide::Func test = gaussian_3x3_5(padded32, Separable2dConvolutionSched(7));
    test.compute_root();
    test2(x,y,c) = Halide::cast<uint8_t>(test(x,y,c));
    test2.vectorize(x, 4).parallel(y);

The choice of schedule varies widely with the implementation of the algorithm

In the table below you can see seven schedules applied to six different implementations of the Gaussian 3x3 algorithm.  For each {algorithm, schedule} pair I provide the mean and standard deviation.  The first three implementations (1, 2-update, 2-no-update) are single-function implementations, while the latter three use two functions (separable convolution kernels).  That means that the seven schedules of the first three algorithms are different from the schedules of the latter three.  Nonetheless, it is clear that the choice of schedule varies widely with the implementation of the algorithm, as expected.

Separable kernels are faster

Gaussian 3x3 is a separable filter, which means that it can be expressed as the outer product of two vectors.  The number of computations in the non-separated Gaussian is roughly MxNx3x3 when applied to an input image of size MxN pixels, while for the separated version it is MxNx(3+3); for the 768x1280 test image that is roughly 8.8 million versus 5.9 million multiply-accumulates per color channel.  Fewer computations, so we would expect it to run faster.  And indeed the results show that the separated implementation is the fastest (of course, I could be wrong.  It is possible that I have not found the optimal schedules).  This is expected, but disheartening.  It means that we should be thinking about optimizing our algorithm, not just the schedule.

Inline reductions perform best

Functions gaussian_3x3_4 and gaussian_3x3_5 are the same except for how they do the accumulation.  Function gaussian_3x3_4 uses the Halide::sum inline reduction, while gaussian_3x3_5 accumulates over the domain using an update step.  The results speak for themselves: using the Halide::sum inline reduction provides almost 14-fold better results compared to using an update step (look at the best results in the rows for functions 4 and 5 in the table below).


Halide::Func gaussian_3x3_4(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) = sum(input(x+r.x, y, c) * k(r)) / 4;
    gaussian_y(x,y,c) = sum(gaussian_x(x, y+r, c) * k(r)) / 4;
   
    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;
}

Halide::Func gaussian_3x3_5(Halide::Func input, const Scheduler &s) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,xi,yi,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) += input(x+r.x, y, c) * k(r) / 4;
    gaussian_y(x,y,c) += gaussian_x(x, y+r, c) * k(r) / 4;

    s.schedule(gaussian_x, gaussian_y, x, y);
    return gaussian_y;
}


Larger vector sizes perform better

OK, this one was predictable, but it's nice to see the empirical results.  And make sure your machine supports the vectorization you're trying out.


A sample set consisting of 50 measurements is usually too small

The data I present in the table below consists of 50 samples per test, but I noticed that sometimes there was variation in the results (the average of the 50 samples) between two test runs (of 50 samples each) that can't be explained by the standard deviation.  When I increased the sample set to 500 samples I got very stable results (I didn't try anything between 50 and 500; laziness).

Really bad schedules have a really high sample variance

I need to understand why this happens.  I would expect the high variance to appear when thread parallelization is used at high granularity (frequent context switches), but the opposite seems to be the case.  I am missing something.

Choosing tile size is a delicate act

My workstation has 4x32KB 8-way set associative L1 data caches and 4x256KB 8-way set associative L2 caches.  Plenty of memory, no doubt.  The largest tile size with which I managed to achieve good performance was 256x32, which occupies 256x32x4 = 32KB.  Remember that each 8-bit pixel channel value is expanded to 32 bits to prevent overflow, which is why I multiplied the tile size by 4.  This limits both the vector sizes and the tile sizes.  Now, if we also parallelize the tile computation, then Halide allocates several instances of the temporary tile buffer, and I suspect this is why I saw the optimal tile size peak at 256x32.
Finally, pay attention to the relationship between your tile (or split) sizes and the vector dimensions.  The dimension you choose to vectorize should not be smaller than the vector size that you choose.  Once you set a size for that dimension, you can try several values for the second dimension, while keeping the size of the entire tile within the cache bounds.
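For example, a schedule in this spirit (illustrative only, not one of the measured schedules) keeps a 256x32 tile of 32-bit intermediates within the 32KB L1 budget, vectorizes the innermost dimension, and parallelizes across rows of tiles:

    gaussian.tile(x, y, xi, yi, 256, 32)   // 256x32x4 bytes = 32KB per tile
            .vectorize(xi, 8)              // xi (extent 256) is much wider than the vector
            .parallel(y);                  // distribute rows of tiles across cores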


Results of running different Gaussian 3x3 implementations with different schedules