Sunday, December 21, 2014

Several Ways to Express a Convolution in Halide

There are at least two ways to express a convolution operation in Halide; more if the kernel is separable.  I'll review these in this post.

Let's examine a simple Gaussian 3x3 lowpass (smoothing) filter (also known as a Gaussian Blur):



\frac{1}{16}
\left[ {\begin{array}{ccc}
 1 & 2 & 1  \\
 2 & 4 & 2  \\
 1 & 2 & 1  \\
 \end{array} } \right]

The straightforward method sums up the pixel neighborhood, using the weights of the convolution kernel.

Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func gaussian;
    Halide::Var x,y,c;

    gaussian(x,y,c) = input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                      input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                      input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1;
    gaussian(x,y,c) /= 16;
    return gaussian;
}

We have to watch out not to overflow the summation of the pixel values.  In the gaussian_3x3_1 function above, the type of the summation (gaussian(x,y,c)) is deduced from the type of input(x,y,c); if this is an 8-bit type, for example, the sum will most likely overflow and the output will be wrong without any error being emitted.  One way to handle this is to make the kernel weights explicitly float, but this will most likely hurt performance because it forces a cast and floating-point arithmetic:


gaussian(x,y,c) = input(x-1, y-1, c) * 1.f + input(x, y-1, c) * 2.f + input(x+1, y-1, c) * 1.f +
                  input(x-1, y, c) * 2.f   + input(x, y, c) * 4.f   + input(x+1, y, c) * 2.f +
                  input(x-1, y+1, c) * 1.f + input(x, y+1, c) * 2.f + input(x+1, y+1, c) * 1.f;


I prefer to cast the input type so that the caller of the function has control over when to cast and when not to cast.  Here's an example that loads a 24-bit RGB image (8 bits per color channel), clamps the access coordinates at the image borders, and converts from uint8_t to int32_t.

Halide::Var x,y,c;
Halide::Image<uint8_t> input = load<uint8_t>("images/rgb.png");
Halide::Func padded, padded32;
padded(x,y,c) = input(clamp(x, 0, input.width()-1), clamp(y, 0, input.height()-1), c);
padded32(x,y,c) = Halide::cast<int32_t>(padded(x,y,c));
Halide::Func gaussian_3x3_fn = gaussian_3x3_1(padded32);


Another method to perform that convolution uses a 2-dimensional reduction domain for the convolution kernel.  We define a 3x3 RDom which spans from -1 to +1 in both width and height. When the RDom in the code below is evaluated, r.x takes the values {-1, 0, 1} and r.y similarly takes the values {-1, 0, 1}. Therefore, x+r.x takes the values {x-1, x, x+1}.  

Halide::Func gaussian_3x3_2(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::RDom r(-1,3,-1,3);
    Halide::Var x,y,c;
    
    k(x,y) = 0;
    k(-1,-1) = 1;    k(0,-1) = 2;    k(1,-1) = 1;
    k(-1, 0) = 2;    k(0, 0) = 4;    k(1, 0) = 2;
    k(-1, 1) = 1;    k(0, 1) = 2;    k(1, 1) = 1;

    gaussian(x,y,c) = sum(input(x+r.x, y+r.y, c) * k(r.x, r.y));
    gaussian(x,y,c) /= 16;

    return gaussian;
}

Because a Gaussian kernel is separable (that is, it can be expressed as the outer product of two vectors), we can express it in yet another way:



\frac{1}{16}
\left[ {\begin{array}{ccc}
 1 & 2 & 1  \\
 2 & 4 & 2  \\
 1 & 2 & 1  \\
 \end{array} } \right]
=
\frac{1}{4}
\left[ {\begin{array}{c}
 1  \\
 2  \\
 1  \\
 \end{array} } \right]
\times
\frac{1}{4}
\left[ {\begin{array}{ccc}
 1 & 2 & 1  \\
 \end{array} } \right]

Halide::Func gaussian_3x3_3(Halide::Func input) {
    Halide::Func gaussian_x, gaussian_y;
    Halide::Var x,y,c;
    
    gaussian_x(x,y,c) = (input(x-1,y,c) + input(x,y,c) * 2 + input(x+1,y,c)) / 4;
    gaussian_y(x,y,c) = (gaussian_x(x,y-1,c)  + gaussian_x(x,y,c) * 2 + gaussian_x(x,y+1,c) ) / 4;

    return gaussian_y;
}

Of course, we can also use a reduction domain here.  In this case we need a 1-dimensional RDom spanning {-1, 0, 1}:

Halide::Func gaussian_3x3_4(Halide::Func input) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;
    gaussian_x(x,y,c) = sum(input(x+r.x, y, c) * k(r)) / 4;
    gaussian_y(x,y,c) = sum(gaussian_x(x, y+r, c) * k(r)) / 4;

    return gaussian_y;
}

And we can also play a bit with the syntax, replacing the Halide::sum function with explicit summation over the reduction domain:

Halide::Func gaussian_3x3_5(Halide::Func input) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x,y,c;
    Halide::RDom r(-1,3);

    k(x) = 0;
    k(-1) = 1;    k(0) = 2;    k(1) = 1;

    // Initialize explicitly, accumulate over the reduction domain, and divide
    // only after the summation: with integer types, dividing each tap by 4
    // inside the update would truncate term by term and give a wrong result.
    gaussian_x(x,y,c) = 0;
    gaussian_x(x,y,c) += input(x+r.x, y, c) * k(r);
    gaussian_x(x,y,c) /= 4;

    gaussian_y(x,y,c) = 0;
    gaussian_y(x,y,c) += gaussian_x(x, y+r, c) * k(r);
    gaussian_y(x,y,c) /= 4;

    return gaussian_y;
}

So does it matter how we specify the algorithm?  The premise of Halide says 'no': write the algorithm once, then experiment with different schedules until you get the best performance.  Intuitively, gaussian_3x3_2 should beat gaussian_3x3_1 because Halide's compiler can optimize the reduction over the RDom, and gaussian_3x3_3 should beat gaussian_3x3_2 because the separable form gives the scheduler another degree of freedom.  But this is just intuition, and what we care about is actual performance measurements.

I haven't measured this yet, so I owe you the results soon...  ;-)

