Convolutional Neural Networks (CNNs) are a type of neural network used extensively for images. We will later look at some of their other important applications. A CNN contains convolutional layers. A convolutional layer slides a window (a filter) of fixed size across its input in steps of a fixed stride.
Array representation of images
An image is represented as a matrix of rows and columns. Each element of the matrix represents one image pixel. In an 8-bit greyscale image, the values in the matrix range from 0 to 255.
We will first learn the basic components of a typical CNN and then understand how it works.
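As a minimal sketch of this representation, assuming NumPy and a made-up 4×4 greyscale image (the pixel values are hypothetical), an image can be held as a 2D array of unsigned 8-bit integers:

```python
import numpy as np

# A hypothetical 4x4 greyscale image: one unsigned 8-bit value per pixel.
image = np.array(
    [
        [0, 50, 100, 150],
        [25, 75, 125, 175],
        [50, 100, 150, 200],
        [75, 125, 175, 255],
    ],
    dtype=np.uint8,
)

rows, cols = image.shape       # number of rows and columns of pixels
darkest = int(image.min())     # smallest pixel value present
brightest = int(image.max())   # largest pixel value present
print(rows, cols, darkest, brightest)  # prints: 4 4 0 255
```

The `uint8` type is what limits each value to the 0 to 255 range.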
A filter is a small matrix of numeric weights that is applied to a patch of the input pixels of the same size as the filter. Each pixel in the patch is multiplied by the corresponding weight, and the products are summed into a single value, which fills one grid cell (like a pixel) of the output channel, also called a feature map.
A convolutional kernel is a filter in a convolutional layer; the two terms are used interchangeably here. A typical CNN has many filters. Each filter is multiplied element-wise with a patch of the input pixel matrix and the products are summed.
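To make the multiply-and-sum step concrete, here is a small sketch for a single patch. The patch and kernel values are hypothetical, chosen only for illustration:

```python
import numpy as np

# A hypothetical 3x3 patch of input pixels.
patch = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)

# A hypothetical 3x3 kernel that compares the left and right columns.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Multiply each pixel by the matching kernel weight, then sum the products.
value = float(np.sum(patch * kernel))
print(value)  # prints: -6.0
```

This single number becomes one cell of the feature map; the rest of the feature map comes from repeating the step at other positions.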
The kernels are learned: with each iteration over the training set, their values are updated.
The filter is shown in the image as a 3×3 matrix, while the sliding window is shown in black. We slide the filter over the input matrix in steps of the specified stride and obtain a new matrix of features. These filters usually reduce the size of the input while extracting features from it.
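The sliding step can be sketched as two nested loops. This is a simplified version with a stride of 1 and no padding; the helper name `convolve2d_valid`, the 5×5 input, and the averaging kernel are all assumptions made for illustration:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the current window by the kernel and sum the products.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # hypothetical averaging filter
features = convolve2d_valid(image, kernel)
print(features.shape)  # prints: (3, 3)
```

A 5×5 input and a 3×3 filter give a 3×3 feature map, which is the size reduction described above.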
You may have noticed above that pixels at the edges contribute less than pixels in the middle. For example, the pixel at the top-left corner of the input matrix is covered by the sliding window only once, whereas a pixel in the middle is covered many times. As a result, we may miss important features that occur near the corners of the input image. A second problem is that the image shrinks after every convolution and eventually becomes too small to make predictions from.
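We can check the edge effect by counting how many window positions cover each pixel. The sketch below assumes a 5×5 input, a 3×3 window and a stride of 1; the function name `coverage_counts` is hypothetical:

```python
import numpy as np

def coverage_counts(height, width, k):
    """Count how many k x k window positions (stride 1) cover each pixel."""
    counts = np.zeros((height, width), dtype=int)
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            counts[i:i + k, j:j + k] += 1
    return counts

counts = coverage_counts(5, 5, 3)
print(counts)
# prints:
# [[1 2 3 2 1]
#  [2 4 6 4 2]
#  [3 6 9 6 3]
#  [2 4 6 4 2]
#  [1 2 3 2 1]]
```

The corner pixel is seen once, while the centre pixel is seen nine times.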
To overcome these problems we use padding. Padding means adding zeros (or some other values) around the edges of the matrix.
Suppose this is our input matrix of shape 2×2:
We pad it as:
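As a minimal sketch with hypothetical values for the 2×2 matrix, zero padding of one cell on every side can be written with NumPy's `np.pad`:

```python
import numpy as np

# Hypothetical 2x2 input matrix.
matrix = np.array([[1, 2],
                   [3, 4]])

# Add one ring of zeros around the matrix.
padded = np.pad(matrix, pad_width=1, mode="constant", constant_values=0)
print(padded)
# prints:
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```

The 2×2 matrix becomes 4×4, so the original corner pixels now sit away from the border.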
While convolving, we use a stride of fixed size. The stride tells us how many cells to move the filter at each step. A stride of 1 means the filter moves one cell at a time across the matrix.
Suppose this is our input matrix of shape 5×5 with some values. The dark box is our filter of shape 2×2, and the stride is 1.
After one step with a stride of 1, the filter has moved one cell along the matrix.
We go on like this until we reach the end of the matrix.
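The effect of the stride on the output can be sketched as follows. The helpers `output_size` and `convolve2d_strided` are hypothetical names, and the 5×5 input values and all-ones 2×2 filter are chosen only for illustration:

```python
import numpy as np

def output_size(n, k, stride=1, padding=0):
    """Number of filter positions along one axis of length n."""
    return (n + 2 * padding - k) // stride + 1

def convolve2d_strided(image, kernel, stride=1):
    """Slide `kernel` over `image`, moving `stride` cells per step."""
    kh, kw = kernel.shape
    out_h = output_size(image.shape[0], kh, stride)
    out_w = output_size(image.shape[1], kw, stride)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = np.sum(image[top:top + kh, left:left + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 input
kernel = np.ones((2, 2))                          # hypothetical 2x2 filter

out_s1 = convolve2d_strided(image, kernel, stride=1)
out_s2 = convolve2d_strided(image, kernel, stride=2)
print(out_s1.shape, out_s2.shape)  # prints: (4, 4) (2, 2)
```

With a stride of 1 the 2×2 filter fits in four positions per axis; with a stride of 2 it fits in only two, so a larger stride shrinks the output faster.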
Sometimes the input becomes large. To reduce the size of the matrix, which saves memory and speeds up computation, we use a pooling layer. A pooling layer takes each segment of a fixed size and outputs one value for it.
There are multiple types of pooling layers. We will discuss three of them.
Max pooling outputs the maximum value in each segment.
This would give an output matrix like:
Min pooling outputs the minimum value in each segment.
Average pooling outputs the average of all the values in each segment.
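The three pooling types differ only in how each segment is reduced to one value, so they can share one sketch. The helper `pool2d` and the 4×4 matrix are hypothetical, and the segments are assumed not to overlap:

```python
import numpy as np

def pool2d(matrix, size, reducer):
    """Split `matrix` into non-overlapping size x size segments and reduce each."""
    out_h, out_w = matrix.shape[0] // size, matrix.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            segment = matrix[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = reducer(segment)
    return out

matrix = np.array([[1, 3, 2, 4],
                   [5, 7, 6, 8],
                   [9, 2, 1, 0],
                   [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(matrix, 2, np.max)   # [[7, 8], [9, 6]]
min_pooled = pool2d(matrix, 2, np.min)   # [[1, 2], [2, 0]]
avg_pooled = pool2d(matrix, 2, np.mean)  # [[4, 5], [4.5, 3]]
```

In every case the 4×4 matrix is reduced to 2×2, one value per 2×2 segment.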
How does a CNN learn?
- Each CNN layer learns filters.
- The beginning layers learn basic features of an image, such as edges, corners, etc.
- The middle layers learn filters that detect parts of objects.
- The last layers learn to recognize complete objects of different shapes and positions in an image.
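To give a feel for how a filter's values can be updated during training, here is a toy sketch that learns a single 3×3 kernel. It assumes plain gradient descent on a mean squared error, which is our assumption for illustration; real CNNs train many layers at once with backpropagation. The image, the target kernel, the learning rate and the step count are all hypothetical:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))                 # hypothetical training image
true_kernel = np.array([[1.0, 0.0, -1.0],       # hypothetical filter to recover
                        [2.0, 0.0, -2.0],
                        [1.0, 0.0, -1.0]])
target = convolve2d_valid(image, true_kernel)   # desired feature map

kernel = np.zeros((3, 3))                       # start with an untrained filter
learning_rate = 0.05
losses = []
for step in range(500):
    error = convolve2d_valid(image, kernel) - target
    losses.append(float(np.mean(error ** 2)))
    # Gradient of the mean squared error with respect to each kernel weight:
    # sum of error * the input pixel that weight touched, averaged over outputs.
    grad = 2.0 * convolve2d_valid(image, error) / error.size
    kernel -= learning_rate * grad

print(losses[0], losses[-1])
```

The loss shrinks as the kernel values move toward the ones that produced the target feature map.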
How does a CNN work?
- We give a matrix of pixels as input.
- We apply a series of operations to this matrix, such as convolution, pooling, padding and others.
- After these operations the matrix is smaller, and we flatten it into a 1D vector.
- This 1D vector contains high-level features of the input image and is then fed into another model, usually an artificial neural network.
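The steps above can be sketched end to end as a single forward pass. This is a simplified illustration, not a full CNN: it uses one untrained filter, one pooling step, and a single layer of random weights standing in for the final network. All sizes, values and the three output classes are hypothetical:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(matrix, size=2):
    """Max pooling over non-overlapping size x size segments."""
    h, w = matrix.shape[0] // size, matrix.shape[1] // size
    trimmed = matrix[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(float) / 255.0  # input pixels
kernel = rng.normal(size=(3, 3))                                 # untrained filter

padded = np.pad(image, pad_width=1, mode="constant")  # 8x8 -> 10x10
features = convolve2d_valid(padded, kernel)           # 10x10 -> 8x8
pooled = max_pool2d(features, size=2)                 # 8x8 -> 4x4
vector = pooled.flatten()                             # 4x4 -> 16 values

weights = rng.normal(size=(vector.size, 3))  # stand-in for the final network
scores = vector @ weights                    # one score per hypothetical class
print(vector.shape, scores.shape)  # prints: (16,) (3,)
```

Each stage changes only the shape of the data: padding grows it, convolution and pooling shrink it, and flattening turns it into the 1D vector the final model consumes.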