Gradient descent is an optimization algorithm for finding the minimum of a function. To reach the minimum of a function, we take steps proportional to value of the gradient vector at that point (which is always negative). Thus we keep on decreasing the parameter values to reach the minimum. Gradient descent was originally proposed by Cauchy in 1847.

In the case of deep learning, our aim is to reduce the value of the loss function (which depicts by how much neural networks predicted value differs from the actual value), by updating the values of the parameters (weight and biases) through gradient descent algorithm.

Let’s first take a simple example of gradient descent algorithm:

Let us take an equation given by:

The derivative of the function is given by:

The gradient decent algorithm works by iterating a number of times and updating the values. Suppose we would like to predict the value of x for which the given equation has the minimum value. We first take any arbitrary value of x and then update it in each iteration of the gradient descent.

Let’s understand it by an example in python:

This gives us the value of x for which f(x) is minimum.

alpha is a fixed value. This is multiplied with the derivative of function and decides the size of the step taken in every iteration.

The gradient descent in Neural Network works in the same way as explained above.

This is a graph of loss function with respect to weight w.

Now let’s understand how w is updated.

In each iteration of the gradient descent w is updated as:

where α is the learning rate and J(w) is a loss function with weight w as parameter. The term

{dJ(w) \above{1pt} dw}

is the differentiation of the loss function with respect to w.

The variation of the Gradient Descent that we discussed is called the Batch Gradient Descent as it takes the whole batch of the training examples for each iteration of the gradient decent. This would be very computationally very expensive.

Now we will discuss a computationally less expensive variation of the Gradient Descent algorithm called Stochastic Gradient Descent. What this algorithm does is that it randomly selects a given small batch of the training set and in each iteration of the gradient descent it updates the value of the parameters. Obviously this is a much faster algorithm than Batch Gradient Descent as it takes only a small instance of the training set. But the problem with this algorithm is that it does not steadily reduce to minimum but may jump of here and there in its iteration because of its stochastic nature.

$${}$$