Saturday, August 13, 2016

Confused about Caffe’s Pooling layer input region behavior?

Caffe’s formulas for calculating the output size (and hence the input region) of Convolution and Pooling layers are, surprisingly, not the same.  They are only slightly different, but this difference can cause the output sizes of Convolution and Pooling layers to differ even when both are parameterized with the same input size, receptive field, padding and stride.  This unexpected behavior seems to confuse many people, and I’m the latest victim.
In this post I’ll explain what’s going on, why, and where you can see it in the code.

I’m working on a small set of Python scripts to discover structure in Caffe networks.  I won’t go into the details, since I’m just starting and my ideas are not mature yet, but in essence I want to look at the “engineering underbelly” of the network: memory allocation and movement, number and types of layers, and other such odd information ;-).  The scripts parse a given Caffe network prototxt file, recreate the abstract network structure in memory (using a DAG), and then analyze it. 
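
Just to give a flavor of the parsing step, here is a rough sketch of reading a prototxt with Caffe's generated protobuf bindings (the parse_net helper name and the file path are placeholders, not my actual scripts):

    # Load a network prototxt into an in-memory NetParameter message
    # using Caffe's generated protobuf bindings (caffe_pb2).
    from google.protobuf import text_format
    from caffe.proto import caffe_pb2

    def parse_net(prototxt_path):
        net = caffe_pb2.NetParameter()
        with open(prototxt_path) as f:
            text_format.Merge(f.read(), net)
        return net

    net = parse_net('deploy.prototxt')  # placeholder path
    print('parsed %d layers' % len(net.layer))
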
I was coding the calculation of the output sizes of Convolution and Pooling layers when I noticed that I wasn’t getting the correct values.  As one of my test inputs I used the original GoogLeNet network and compared the layers' output BLOB sizes I was calculating to those published in the GoogLeNet paper - and I was getting the wrong results.  Why?

Given a Pooling or Convolution filter with receptive-field (F), stride (S) and padding (P) parameters, and an input feature-map (C*W*H), you can calculate the size of the output feature-map (OFM width) using this formula:

    OFM width = (W - F + 2P) / S + 1

Note that this assumes that the receptive field and stride are square (the same value in the height and width dimensions), which is true for the networks I’m aware of, so the assumption is reasonable.
If W != H, you can calculate the OFM height simply by replacing W with H in the above formula.
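
In Python, the formula is a one-liner (the ofm_size helper name is mine, just for illustration; it returns the exact, un-rounded value):

    def ofm_size(W, F, S, P):
        # Exact (un-rounded) output feature-map width or height.
        return (W - F + 2.0 * P) / S + 1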

Here’s a toy example to illustrate this: 
  • Input W, H (IFM width, height) = 9
  • F (receptive field size) = 3
  • S (stride height and width) = 2
  • P (padding height and width) = 1

    Which leads to an output (OFM width, height) size of (9 - 3 + 2) / 2 + 1 = 5 pixels.

    In the image below, the green and gray input pixels compose the IFM (9x9 pixels), where the green pixels represent the centers of the receptive fields as the filter window slides across and down the IFM. The zero-padded pixels are yellow.



    Andrej Karpathy calls this a “valid” configuration because “the neurons ‘fit’ neatly and symmetrically across the input.”  In other words, all of the input pixels that we want to pool or convolve can be used, because their receptive field (3x3) fits entirely within the (zero-padded) input feature-map.

    Now let's contrast this with a different configuration:
    • Input W,H = 10
    • F = 3
    • S = 2
    • P = 0 (i.e. no padding)

    Notice what happens to the pixels on the right and bottom (painted blue): we want to use them because they are part of the IFM, but the receptive fields of the bottom and right-most pixels extend beyond the IFM borders - and therefore they can't be used :-(
    When we plug the parameters into our OFM formula, we find that the size of the output is (10 - 3) / 2 + 1 = 4.5, which is not an integer.  Karpathy calls this configuration non-valid, and it's clear why once we look at the above image.  The configuration leads to a seemingly impossible situation where the "blue" pixels need to participate in the computations, but simply can't...
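
    A simple way to test whether a configuration is "valid" in this sense is to check that the formula yields an integer.  Here is a small sketch, reusing the ofm_size helper from above (the is_valid name is mine):

    def is_valid(W, F, S, P):
        # The receptive field tiles the padded input exactly when the
        # numerator of the OFM formula divides evenly by the stride.
        return (W - F + 2 * P) % S == 0

    print(is_valid(9, 3, 2, 1))    # True  -> ofm_size(...) == 5.0
    print(is_valid(10, 3, 2, 0))   # False -> ofm_size(...) == 4.5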

    Convolution and Pooling configurations like this might be non-valid, but they do appear in real networks.  For example, take a look at the first Convolution layer (conv1/7x7_s2) of the GoogLeNet network I mentioned above. It has this configuration:
    • Input W,H = 224
    • F = 7
    • S = 2
    • P = 3
    and the size of the output is (224 - 7 + 6) / 2 + 1 = 112.5, which is not an integer and therefore not valid.  The correct OFM value, as gleaned from the GoogLeNet paper, is 112. So the 112.5 result we calculated using the OFM formula has been rounded down (floor operation).
    A bit later in this network, the configuration of a Pooling layer (layer pool1/3x3_s2) is:
    • Input W,H = 112
    • F = 3
    • S = 2
    • P = 0
    and the size of the output is (112 - 3) / 2 + 1 = 55.5.  Here again, we see a non-valid configuration.  And if you check the GoogLeNet paper, you'll see that it gives a value of 56 pixels, which is the 55.5 we calculated, rounded up (a ceiling operation).
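
    Plugging both GoogLeNet layers into the ofm_size helper from earlier makes the problem concrete: neither result is an integer, yet the paper reports 112 and 56.

    print(ofm_size(224, 7, 2, 3))   # 112.5 -> the paper says 112 (rounded down)
    print(ofm_size(112, 3, 2, 0))   # 55.5  -> the paper says 56  (rounded up)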

    By way of a short digression, here's the Caffe configuration for this layer:

    layer {
      name: "pool1/3x3_s2"
      type: "Pooling"
      bottom: "conv1/7x7_s2"
      top: "pool1/3x3_s2"
      pooling_param {
        pool: MAX
        kernel_size: 3
        stride: 2
      }
    }

    You might notice that the padding configuration is missing, but it defaults to zero, as we can see in caffe.proto:

    message PoolingParameter {
      enum PoolMethod {
        MAX = 0;
        AVE = 1;
        STOCHASTIC = 2;
      }
      optional PoolMethod pool = 1 [default = MAX]; // The pooling method
      // Pad, kernel size, and stride are all given as a single value for equal
      // dimensions in height and width or as Y, X pairs.
      optional uint32 pad = 4 [default = 0]; // The padding size (equal in Y, X)

    Back to our topic: when we look at the Caffe code for the Convolution and Pooling layers, we can see that Caffe rounds down for Convolution layers and rounds up for Pooling layers. But what does this mean?
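
    In Python terms, the two calculations look roughly like this (a sketch that mirrors the behavior of Caffe's C++ code; the conv_ofm_size and pool_ofm_size names are mine, not Caffe's):

    import math

    def conv_ofm_size(W, F, S, P):
        # Convolution layers: the fractional size is rounded down (floor).
        return (W - F + 2 * P) // S + 1

    def pool_ofm_size(W, F, S, P):
        # Pooling layers: the fractional size is rounded up (ceiling).
        return int(math.ceil((W - F + 2.0 * P) / S)) + 1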

    If we turn to the toy example from above, we can see graphically how a non-valid Convolution configuration is treated by Caffe:
    floor((10 - 3) / 2 + 1) = 4

    The rounding-down (floor) operation essentially eliminates a row and a column of pixels from the input.  The smaller the IFM, the more impact this rounding decision has.

    Now here's how Caffe treats the same non-valid configuration, but this time for a Pooling layer:

    ceiling((10 - 3) / 2 + 1) = 5

    As the image above shows, we can't round up unless we add zero-padded pixels on the top and left borders of the IFM (or the right and bottom borders). Caffe implicitly performs this padding. For Max-Pooling at least, this makes sense: we get to pool a few more IFM pixels, and (for the non-negative activations that typically feed a pooling layer) we don't affect the output value, because max(x, 0) = x.
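
    With the two helpers sketched above, we can reproduce both the toy example and the published GoogLeNet sizes:

    # Toy non-valid configuration (W=10, F=3, S=2, P=0)
    print(conv_ofm_size(10, 3, 2, 0))    # 4   (floor of 4.5)
    print(pool_ofm_size(10, 3, 2, 0))    # 5   (ceiling of 4.5)

    # GoogLeNet conv1/7x7_s2 and pool1/3x3_s2
    print(conv_ofm_size(224, 7, 2, 3))   # 112 (floor of 112.5)
    print(pool_ofm_size(112, 3, 2, 0))   # 56  (ceiling of 55.5)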

    By now, I think we've cleared up the confusion about what was happening to my calculations.  In any case, it helps to be aware of this somewhat odd behavior.
    I'm not a Torch user, but according to this suggestion to "Add parameter for pooling layer to specify ceil or floor" in Caffe, Torch provides a way to explicitly specify how non-valid configurations should be handled (here's the commit code for adding this feature to Caffe).  I'm not sure why it hasn't been merged, but maybe it will be added one day and the confusion will end ;-)