Two important parameters common to all convolutions are padding and strides. Let's consider the two-dimensional case, but keep in mind that the concepts are always the same. When a kernel (n × m, with n, m > 1) is shifted over an image and reaches the end of a dimension, there are two possibilities. The first one, called valid padding, consists of stopping the shift at the border, even if the resulting image is smaller than the original. In particular, if X is a w × h matrix, the resulting convolution output will have dimensions equal to (w - n + 1) × (h - m + 1). However, there are many cases when it's useful to keep the original dimensions, for example, to be able to sum different outputs. This approach is called same padding, and it's based on the simple idea of adding n - 1 blank columns and m - 1 blank rows (normally zero-filled) so that the kernel can shift over the whole original image, yielding an output whose dimensions are equal to the initial ones. In many implementations, the default value is valid padding.
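To make the effect of the two strategies concrete, here is a minimal sketch comparing the output shapes produced by valid and same padding with a 3 × 3 kernel. It assumes TensorFlow/Keras, which is only one possible framework choice; the layer names and shape conventions below are specific to it:

```python
import tensorflow as tf

# A batch containing a single 32 x 32 RGB image
x = tf.random.normal((1, 32, 32, 3))

# Valid padding: the output shrinks to (w - n + 1) x (h - m + 1) = 30 x 30
valid = tf.keras.layers.Conv2D(8, kernel_size=(3, 3), padding='valid')(x)
print(valid.shape)  # (1, 30, 30, 8)

# Same padding: n - 1 and m - 1 zero-filled columns/rows are added,
# so the output keeps the original 32 x 32 dimensions
same = tf.keras.layers.Conv2D(8, kernel_size=(3, 3), padding='same')(x)
print(same.shape)  # (1, 32, 32, 8)
```

Consistent with the observation above, 'valid' is also the default in Keras, so it must be overridden explicitly whenever the original dimensions have to be preserved.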
The other parameter, called strides, defines the step size (in pixels) of each shift. For example, strides set to (1, 1) correspond to a standard convolution, while strides set to (2, 1) are shown in the following diagram:

In this case, every horizontal shift skips a pixel. Larger strides force a dimensionality reduction when high granularity is not necessary (for example, in the first layers), while strides set to (1, 1) are normally employed in the last layers to capture smaller details. There are no standard rules for finding the optimal value, and testing different configurations is always the best approach. Like any other hyperparameter, many elements should be taken into account when determining whether a choice is acceptable; however, some general knowledge about the dataset (and therefore about the underlying data generating process) can help in making a reasonable initial decision. For example, if we are working with pictures of buildings, whose dominant dimension is vertical, it's possible to start with a value of (1, 2), because we can assume that there's more informative redundancy along the y-axis than along the x-axis. This choice can dramatically speed up the training process, as one dimension of the output (with same padding) is half of the original one. In this way, larger strides produce a partial denoising and can improve the training speed. At the same time, the information loss could have a negative impact on the accuracy. If that happens, it probably means that the feature scale isn't large enough to allow skipping elements without compromising the semantics. For example, an image containing very small faces could be irreversibly damaged by large strides, making it impossible to detect the right features and consequently worsening the classification accuracy.
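As a quick check of how strides affect the output dimensions, the following sketch halves one dimension while preserving the other. It again assumes TensorFlow/Keras; note that Keras expects strides in (rows, columns) order, so a stride of 2 along the vertical axis, written as (1, 2) in the horizontal-first convention used above, becomes strides=(2, 1) here:

```python
import tensorflow as tf

# A batch containing a single 64 x 64 RGB image
x = tf.random.normal((1, 64, 64, 3))

# A vertical stride of 2 with same padding halves the height only
conv = tf.keras.layers.Conv2D(8, kernel_size=(3, 3),
                              strides=(2, 1), padding='same')(x)
print(conv.shape)  # (1, 32, 64, 8): half the rows, all the columns
```

With same padding, the output size along each axis is the ceiling of the input size divided by the stride, which is where the halving comes from.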