Caffe’s formulas for calculating the input region of
Convolution and Pooling layers are, surprisingly, not the same. They are only ever so slightly different, but this
difference can cause the output sizes of Convolution and Pooling layers to
differ, even when both are parameterized with the same input size, receptive field,
padding and stride. This unexpected behavior
seems to confuse many people, and I’m the
latest victim.
In this post I’ll explain what’s going on, why, and where
you can see it in the code.
I’m working on a small set of Python scripts to discover
structure in Caffe networks. I won’t go
into the details, since I’m just starting and my ideas are not mature yet, but
in essence I want to look at the “engineering underbelly” of the network:
memory allocation and movement, number and types of layers, and other such odd
information ;-). The scripts parse a given Caffe network
prototxt file, recreate the abstract network structure in memory (using a DAG),
and then analyze it.
I was coding the calculation of the output sizes of Convolution
and Pooling layers when I noticed that I wasn’t getting the correct values. As one of my test inputs I used the original GoogLeNet network and compared the layers' output BLOB sizes I was calculating to those
published in the GoogLeNet paper - and I was getting the wrong results. Why?
Given a Pooling or Convolution layer's receptive-field (F), stride (S) and padding (P) parameters, and an input feature-map of size C*W*H,
you can calculate the size of the output feature-map (OFM width) using this formula:

OFM width = (W - F + 2P) / S + 1

Note that this assumes that the receptive field and stride
are square (the same value for the height and width dimensions), which is true for the
networks I’m aware of, so this assumption is valid.
Here’s a toy example to illustrate this:
- Input W, H (IFM width, height) = 10
- F (receptive field size) = 3
- S (stride height and width) = 2
- P (padding height and width) = 1
Which leads to an output (OFM width, height) size of (10 - 3 + 2) / 2 + 1 = 5.5, which is rounded down to 5 pixels. Note that only padded pixels go unused here - every real input pixel still participates.
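The OFM formula is easy to sanity-check in code. Here's a minimal Python sketch (the function name ofm_size is mine, not Caffe's) that returns the exact, unrounded value:

```python
def ofm_size(w, f, s, p):
    """Exact (unrounded) output feature-map width/height for an input of
    width w, receptive field f, stride s and padding p."""
    return (w - f + 2.0 * p) / s + 1

# The toy example: a 10-pixel-wide IFM, F=3, S=2, P=1.
print(ofm_size(10, 3, 2, 1))  # -> 5.5 (rounded down to 5 output pixels)
```

A non-integer result is exactly the "non-valid" situation we'll look at next.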
In the image below, the green and gray input pixels compose the IFM (10x10 pixels), where the green pixels represent the centers of the receptive fields as the filter window slides across and down the IFM. The zero-padded pixels are yellow.
Andrej Karpathy calls this a “valid” configuration because “the neurons ‘fit’ neatly and
symmetrically across the input.” In other
words, all of the input pixels that we want to pool or convolve can be used, because their receptive fields (3x3) fit entirely within the (padded) input feature-map.
Now let's contrast this with a different configuration:
- Input W,H = 10
- F = 3
- S = 2
- P = 0 (i.e. no padding)
Notice what happens to the pixels on the right and bottom (painted blue):
we want to use them because they are part of the IFM, but the receptive fields of the bottom and right-most pixels extend beyond the IFM borders - and therefore they can't be used :-(
When we plug the parameters into our OFM formula, we find that the size of the output is (10 - 3) / 2 + 1 = 4.5, which is not an integer. Karpathy calls this configuration non-valid, and after looking at the image above it's clear why. The configuration leads to a seemingly impossible situation where the "blue" pixels need to participate in the computations, but simply can't...
Non-valid Convolution and Pooling configurations do appear in real networks, though. For an example, take a look at the first Convolution layer (conv1/7x7_s2) of the GoogLeNet network I mentioned above. It has this configuration:
- Input W,H = 224
- F = 7
- S = 2
- P = 3
and the size of the output is (224 - 7 + 6) / 2 + 1 = 112.5, which is not an integer and therefore not valid. The correct OFM value, as gleaned from the GoogLeNet paper, is 112. So the 112.5 we calculated with the OFM formula has been rounded down (a floor operation).
A bit later in this network, the configuration of a Pooling layer (layer pool1/3x3_s2) is:
- Input W,H = 112
- F = 3
- S = 2
- P = 0
and the size of the output is (112 - 3) / 2 + 1 = 55.5. Here again we see a non-valid configuration. And if you check the GoogLeNet paper, you'll see that it gives a value of 56 pixels - a rounding up (ceiling) of the 55.5 we calculated.
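Both GoogLeNet numbers can be reproduced with nothing more than the OFM formula and two different rounding modes (this is just arithmetic, not Caffe's code):

```python
import math

def ofm_exact(w, f, s, p):
    # Exact output size: (W - F + 2P) / S + 1
    return (w - f + 2.0 * p) / s + 1

# conv1/7x7_s2: floor(112.5) = 112, matching the GoogLeNet paper.
conv1 = math.floor(ofm_exact(224, 7, 2, 3))
# pool1/3x3_s2: ceil(55.5) = 56, also matching the paper.
pool1 = math.ceil(ofm_exact(112, 3, 2, 0))
print(conv1, pool1)  # 112 56
```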
By way of a short digression, here's the Caffe configuration for this layer:
layer {
  name: "pool1/3x3_s2"
  type: "Pooling"
  bottom: "conv1/7x7_s2"
  top: "pool1/3x3_s2"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
You might notice that the padding configuration is missing, but it defaults to zero, as we can see in caffe.proto:
message PoolingParameter {
  enum PoolMethod {
    MAX = 0;
    AVE = 1;
    STOCHASTIC = 2;
  }
  optional PoolMethod pool = 1 [default = MAX]; // The pooling method
  // Pad, kernel size, and stride are all given as a single value for equal
  // dimensions in height and width or as Y, X pairs.
  optional uint32 pad = 4 [default = 0]; // The padding size (equal in Y, X)
  // ...
}
Back to our topic: when we look at the Caffe code for Convolution and Pooling, we can see that Caffe rounds down for Convolution layers and rounds up for Pooling layers. But what does this mean?
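In Python, the two rules look roughly like this - a paraphrase of the shape computations in the C++ sources, not Caffe's actual code:

```python
import math

def conv_output_size(w, k, s, p):
    # Convolution layer: C++ integer division rounds down (floor).
    return (w + 2 * p - k) // s + 1

def pool_output_size(w, k, s, p):
    # Pooling layer: Caffe rounds up (ceiling).
    return int(math.ceil(float(w + 2 * p - k) / s)) + 1

# The same non-valid toy configuration (W=10, F=3, S=2, P=0)
# gives two different answers depending on the layer type:
print(conv_output_size(10, 3, 2, 0))  # 4
print(pool_output_size(10, 3, 2, 0))  # 5
```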
If we turn to the toy example from above, we can see graphically how a non-valid Convolution configuration is treated by Caffe:
floor((10 - 3) / 2 + 1) = 4
The rounding-down (floor) operation essentially eliminates a row and a column of pixels from the input. The smaller the IFM, the more impact this rounding decision has.
Now here's how Caffe treats the same non-valid configuration, but this time for a Pooling layer:
ceiling((10 - 3) / 2 + 1) = 5
As the image above shows, we can't round up unless we add zero-padded pixels on the top and left borders of the IFM (or the right and bottom borders). Caffe implicitly performs this padding. For Max-Pooling at least, this makes sense: we get to pool a few more IFM pixels without affecting the output value, because max(x, 0) = x (assuming non-negative activations, as is the case after a ReLU).
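One way to convince yourself of this implicit-padding view, at least for the toy configuration: rounding up gives the same answer as adding a single zero-padded column (or row) and then rounding down. A quick check (with stride 2 one extra pixel suffices; in general up to S-1 implicit pad pixels may be needed):

```python
import math

w, f, s = 10, 3, 2
ceil_size = math.ceil((w - f) / s) + 1       # Caffe's Pooling rounding: 5
pad_size = math.floor((w + 1 - f) / s) + 1   # floor after one implicit pad pixel: 5
print(ceil_size, pad_size)
```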
By now, I think we've cleared up the confusion about what was happening to my calculations. It helps to be aware of this somewhat odd behavior.
I'm not a Torch user, but according to this suggestion to "Add parameter for pooling layer to specify ceil or floor" in Caffe, Torch has a way to explicitly specify how to handle non-valid configurations (here's the commit code for adding this feature to Caffe). I'm not sure why it hasn't been merged, but maybe it will be added one day and the confusion will end ;-)