Gradient descent is an optimization algorithm for finding the minimum of a function. To reach the minimum, we take steps proportional to the negative of the gradient vector at the current point, so each update moves the parameters in the direction of steepest descent. Gradient descent was originally proposed by Cauchy in 1847.

In the case of deep learning, our aim is to reduce the value of the **loss function** (which measures by how much the neural network's predicted value differs from the actual value) by updating the values of the **parameters** (weights and biases) through the **gradient descent algorithm**.

Let’s first take a simple example of **gradient descent algorithm**:

Let us take an equation given by:

The derivative of the function is given by:

The gradient descent algorithm works by iterating a number of times and updating the value of x. Suppose we would like to find the value of x for which the given equation has its minimum value. We first take an arbitrary value of x and then update it in each iteration of gradient descent.

Let’s understand it by an example in python:
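A minimal sketch of the idea, assuming an illustrative function f(x) = x² − 4x + 5 (the article's original equation is not shown here, so this f, its derivative, and the parameter values are stand-ins):

```python
# Gradient descent on an assumed function f(x) = x**2 - 4*x + 5,
# whose derivative is f'(x) = 2*x - 4 and whose minimum is at x = 2.

def gradient_descent(x, alpha=0.1, iterations=100):
    for _ in range(iterations):
        grad = 2 * x - 4          # derivative of f at the current x
        x = x - alpha * grad      # step against the gradient
    return x

x_min = gradient_descent(x=10.0)
print(x_min)  # converges toward 2.0, the minimizer of f
```

Starting from any arbitrary x (here 10.0), repeated updates shrink the distance to the minimizer on every iteration.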

This gives us the value of x for which **f(x)** is minimum.

alpha (the learning rate) is a fixed value. It is multiplied by the derivative of the function and decides the size of the step taken in each iteration.
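The choice of alpha matters: too small and convergence is slow, too large and the updates overshoot. A quick sketch on the same assumed function f(x) = x² − 4x + 5 (illustrative, not the article's original equation):

```python
# Effect of the step size alpha on f(x) = x**2 - 4*x + 5
# (f'(x) = 2*x - 4, minimum at x = 2).

def run(alpha, x=10.0, iterations=50):
    for _ in range(iterations):
        x = x - alpha * (2 * x - 4)   # gradient step with the given alpha
    return x

print(run(0.1))    # small steps: converges smoothly toward 2
print(run(0.45))   # larger steps: converges even faster here
print(run(1.1))    # too large: overshoots the minimum and diverges
```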

**Gradient descent in Neural Networks**

Gradient descent in neural networks works in the same way as explained above.

[Graph: the **loss function** J(w) plotted with respect to the **weight** w]

Now let’s understand how w is updated.

In each iteration of the gradient descent, w is updated as:

w := w − α · ∂J(w)/∂w

where **α** is the learning rate and **J(w)** is the loss function with weight **w** as its parameter. The term **∂J(w)/∂w** is the derivative of the loss function with respect to w.
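The update rule can be sketched on a tiny model with a single weight. This is an illustrative setup, not the article's exact network: the model is y_hat = w · x and the loss J(w) is the mean squared error over the data.

```python
import numpy as np

# One-weight model y_hat = w * x with loss J(w) = mean((w*x - y)**2).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                     # data generated with true weight 3

w = 0.0                         # arbitrary starting weight
alpha = 0.05                    # learning rate
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)   # dJ/dw over the whole dataset
    w = w - alpha * grad                  # w := w - alpha * dJ/dw

print(w)  # approaches 3.0, the weight that minimizes J
```

Each iteration computes the gradient over the full dataset, which is exactly the batch variant discussed next.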

**Stochastic Gradient Descent**

The variation of **Gradient Descent** that we discussed is called **Batch Gradient Descent**, as it uses the whole batch of training examples in each iteration of gradient descent. This can be very computationally expensive.

Now we will discuss a computationally cheaper variation of the gradient descent algorithm called **Stochastic Gradient Descent**. In each iteration, this algorithm randomly selects a small batch of the training set and updates the parameter values using only that batch. This makes it much faster than **Batch Gradient Descent**, since each step looks at only a small portion of the training set. The drawback is that it does not decrease steadily toward the minimum but may jump around during its iterations because of its stochastic nature.
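A sketch of the mini-batch variant on the same illustrative one-weight model (the dataset, batch size, and iteration count are assumptions, not from the article):

```python
import numpy as np

# Mini-batch stochastic gradient descent: each update uses only a
# randomly sampled batch instead of the full dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x                      # data generated with true weight 3

w = 0.0
alpha = 0.1
batch_size = 32
for _ in range(500):
    idx = rng.integers(0, len(x), size=batch_size)  # random mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)          # gradient on batch only
    w = w - alpha * grad

print(w)  # hovers near 3.0, with some iteration-to-iteration jitter
```

Because each gradient is estimated from a random sample, the individual steps are noisy, which is exactly the jumping-around behavior described above.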
