There are at least two ways to express a convolution operation in
Halide; more if the kernel is separable. I'll review these in this post.
Let's examine a simple 3x3 Gaussian lowpass (smoothing) filter, also known as a Gaussian blur.
The straightforward method sums over the pixel neighborhood, weighting each pixel by the corresponding entry of the convolution kernel:
Halide::Func gaussian_3x3_1(Halide::Func input) {
    Halide::Func gaussian;
    Halide::Var x, y, c;

    gaussian(x, y, c) = input(x-1, y-1, c) * 1 + input(x, y-1, c) * 2 + input(x+1, y-1, c) * 1 +
                        input(x-1, y,   c) * 2 + input(x, y,   c) * 4 + input(x+1, y,   c) * 2 +
                        input(x-1, y+1, c) * 1 + input(x, y+1, c) * 2 + input(x+1, y+1, c) * 1;
    gaussian(x, y, c) /= 16;
    return gaussian;
}
We have to be careful not to overflow when summing the pixel values. In the gaussian_3x3_1 function above, the type of the summation (gaussian(x,y,c)) is deduced from the type of input(x,y,c); if that is an 8-bit type, for example, the sum will most likely overflow and the output will be wrong, without any error being emitted. One way to handle this is to make the kernel weights explicit floats, but this will most likely hurt performance because it forces a cast and floating-point arithmetic:
gaussian(x, y, c) = input(x-1, y-1, c) * 1.f + input(x, y-1, c) * 2.f + input(x+1, y-1, c) * 1.f +
                    input(x-1, y,   c) * 2.f + input(x, y,   c) * 4.f + input(x+1, y,   c) * 2.f +
                    input(x-1, y+1, c) * 1.f + input(x, y+1, c) * 2.f + input(x+1, y+1, c) * 1.f;
I prefer instead to cast the input type, so that the caller of the function has control over when to cast and when not to. Here's an example which loads a 24-bit RGB image (8 bits per color channel), clamps the sampling coordinates to the image boundaries, and converts the values from uint8_t to int32_t:
Halide::Image<uint8_t> input = load<uint8_t>("images/rgb.png");

Halide::Var x, y, c;
Halide::Func padded, padded32;
padded(x, y, c) = input(clamp(x, 0, input.width()-1),
                        clamp(y, 0, input.height()-1), c);
padded32(x, y, c) = Halide::cast<int32_t>(padded(x, y, c));
Halide::Func gaussian_3x3_fn = gaussian_3x3_1(padded32);
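To round the example off, the result can be brought back to 8 bits and written out. This is just a sketch: it assumes the load/save helpers from Halide's image_io.h used above, and the output filename is made up:

Halide::Func out;
// The blur of uint8_t data, normalized by 16, stays within [0, 255],
// so a plain narrowing cast back to uint8_t is safe here.
out(x, y, c) = Halide::cast<uint8_t>(gaussian_3x3_fn(x, y, c));
Halide::Image<uint8_t> result =
    out.realize(input.width(), input.height(), input.channels());
save(result, "images/rgb_blurred.png");  // hypothetical output path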
Another method to perform that convolution uses a 2-dimensional
reduction domain for the convolution kernel. We define a 3x3 RDom which
spans from -1 to +1 in both width and height. When the RDom in the code below
is evaluated, r.x takes the values {-1, 0, 1} and r.y similarly takes the
values {-1, 0, 1}. Therefore, x+r.x takes the values {x-1, x, x+1}.
Halide::Func gaussian_3x3_2(Halide::Func input) {
    Halide::Func k, gaussian;
    Halide::RDom r(-1, 3, -1, 3);
    Halide::Var x, y, c;

    k(x, y) = 0;
    k(-1, -1) = 1;  k(0, -1) = 2;  k(1, -1) = 1;
    k(-1,  0) = 2;  k(0,  0) = 4;  k(1,  0) = 2;
    k(-1,  1) = 1;  k(0,  1) = 2;  k(1,  1) = 1;

    gaussian(x, y, c) = sum(input(x + r.x, y + r.y, c) * k(r.x, r.y));
    gaussian(x, y, c) /= 16;
    return gaussian;
}
Because a Gaussian kernel is separable (that is, it can be
expressed as the outer product of two vectors), we can express it in yet
another way:
Halide::Func gaussian_3x3_3(Halide::Func input) {
    Halide::Func gaussian_x, gaussian_y;
    Halide::Var x, y, c;

    gaussian_x(x, y, c) = (input(x-1, y, c) + input(x, y, c) * 2 + input(x+1, y, c)) / 4;
    gaussian_y(x, y, c) = (gaussian_x(x, y-1, c) + gaussian_x(x, y, c) * 2 + gaussian_x(x, y+1, c)) / 4;
    return gaussian_y;
}
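As a sanity check on the separability claim, here is a tiny standalone program (plain C++, no Halide) showing that the outer product of (1, 2, 1) with itself reproduces the 3x3 kernel we started with; likewise, the two per-pass 1/4 normalizations compose to the original 1/16:

#include <cstdio>

int main() {
    const int v[3] = {1, 2, 1};
    // Print v * v^T; expect the rows 1 2 1 / 2 4 2 / 1 2 1.
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++)
            printf("%d ", v[i] * v[j]);
        printf("\n");
    }
    return 0;
}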
Of course, we can also use a reduction domain here. In this
case we need a 1-dimensional RDom spanning {-1, 0, 1}:
Halide::Func gaussian_3x3_4(Halide::Func input) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x, y, c;
    Halide::RDom r(-1, 3);

    k(x) = 0;
    k(-1) = 1;  k(0) = 2;  k(1) = 1;

    gaussian_x(x, y, c) = sum(input(x + r.x, y, c) * k(r)) / 4;
    gaussian_y(x, y, c) = sum(gaussian_x(x, y + r, c) * k(r)) / 4;
    return gaussian_y;
}
And we can also play a bit with the syntax, replacing the Halide::sum function with explicit summation over the reduction domain:
Halide::Func gaussian_3x3_5(Halide::Func input) {
    Halide::Func k, gaussian_x, gaussian_y;
    Halide::Var x, y, c;
    Halide::RDom r(-1, 3);

    k(x) = 0;
    k(-1) = 1;  k(0) = 2;  k(1) = 1;

    // Divide after accumulating, not inside the sum: with integer types,
    // dividing each term by 4 would truncate and lose precision.
    gaussian_x(x, y, c) += input(x + r.x, y, c) * k(r);
    gaussian_x(x, y, c) /= 4;
    gaussian_y(x, y, c) += gaussian_x(x, y + r, c) * k(r);
    gaussian_y(x, y, c) /= 4;
    return gaussian_y;
}
So does it matter how we specify the algorithm? The premise of Halide says 'no': write the algorithm once, then experiment with different schedules until you get the best performance. Intuitively, gaussian_3x3_2 should beat gaussian_3x3_1 because Halide's compiler can optimize the Halide::RDom, and gaussian_3x3_3 should perform better than gaussian_3x3_2 because the separable formulation adds another degree of freedom when scheduling. But this is only intuition, and what we care about is actual performance measurements.
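For example, once a pipeline is built, a first scheduling experiment could look something like the sketch below. Take it as illustration only: the exact calls depend on the Halide version, the 8-lane vector width is an arbitrary assumption, and parallelizing over rows is just one choice among many:

Halide::Func blur = gaussian_3x3_2(padded32);

// The Vars were declared inside gaussian_3x3_2, so we fetch the pure
// arguments back from the Func itself before scheduling.
std::vector<Halide::Var> args = blur.args();
blur.vectorize(args[0], 8)   // vectorize along x (assumed 8 lanes)
    .parallel(args[1]);      // process rows in parallel

Halide::Image<int32_t> output =
    blur.realize(input.width(), input.height(), input.channels());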
I haven't measured this yet, so I owe you the results soon...
;-)