One of the biggest problems with RNNs is the **vanishing gradient problem**. We can understand this problem with a simple example.

Suppose we have a very long sentence

**The accountant, who is young and resides near the river ………………………………….., is incapable of handling the business.**

Suppose that in this very long sentence we want to decide whether the verb near the end should be singular or plural. For it to be singular, the subject should be **accountant**, while for it to be plural the subject should have been **accountants**.

Since the sentence is extremely long, when we **backpropagate** through a standard RNN,
it becomes difficult for the memory state a^{<t>} to retain
information that occurred much earlier in the sequence. As a result, the
network cannot retain what was most important. We
overcome this problem by modifying our RNN.
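As a rough numeric sketch of why this happens (a toy illustration with made-up numbers, not the article's model): during backpropagation through time, the gradient reaching an early timestep is a product of many per-step factors, and if each factor is below 1 the product shrinks exponentially.

```python
# Toy illustration of the vanishing gradient (hypothetical numbers):
# assume each backward step through time scales the gradient by ~0.9.
per_step_factor = 0.9
gradient = 1.0
for _ in range(100):  # backpropagate across 100 timesteps
    gradient *= per_step_factor

print(gradient)  # ~2.66e-05: the signal from early timesteps has vanished
```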

In this article we discuss a variant of the **RNN** called the **LSTM**, or **Long Short-Term
Memory** network. There are various variants of the LSTM in use; we will use the LSTM
with three gates. The **LSTM** overcomes
the problem of the **vanishing gradient** by
using these gates, which
keep only the information that is required and throw away the rest.

In the LSTM, in addition to the **activation vector a^{<t>}** used in the RNN, we use an additional
vector **c^{<t>}** which stores some extra information. It is called the **cell state vector**.

**Working of LSTM**

To understand the LSTM, we first go through the maths, i.e. the equations it uses.

The cell state in the LSTM, which stores the information from previous timesteps, is
usually denoted by **c^{<t>}**. At each timestep we first compute a candidate cell state:

**ć^{<t>} = tanh(W_{c} [a^{<t-1>}, x^{<t>}] + b_{c})**

This is the candidate cell state vector at the t^{th} timestep. **W_{c}** is the weight matrix, **a^{<t-1>}** is the activation vector of the (t-1)^{th} timestep, **x^{<t>}** is the input vector at the t^{th} timestep, and **b_{c}** is the bias term.

Now we create three gates:

**Update gate:**

Using this gate, we update our cell state vector at every timestep.

**Γ_{u} = sigmoid(W_{u} [a^{<t-1>}, x^{<t>}] + b_{u})**

**Forget gate**

Using this gate, we leave behind the information which is not of much importance

**⌠ _{f } = sigmoid(W_{f} [a^{<t-1>}, x^{<t>}] + b_{f})**

**Output gate:**

Using this gate, we compute the output activation for a timestep.

**Γ_{o} = sigmoid(W_{o} [a^{<t-1>}, x^{<t>}] + b_{o})**

Now, having created all these gates, we need to find the output of an LSTM cell. That is, we need to define the equations for the output, the next cell state, and the next activation.

**c^{<t>} = Γ_{u} * ć^{<t>} + Γ_{f} * c^{<t-1>}**

This equation gives one of the outputs of the LSTM cell: the next cell state. The update gate Γ_{u} decides how much of the candidate **ć^{<t>}** enters the new cell state, while the forget gate Γ_{f} decides how much of the previous cell state **c^{<t-1>}** is kept. This updated cell state is then passed to the next timestep.

**a^{<t>} = Γ_{o} * tanh(c^{<t>})**

In this equation we find the output **activation** of the LSTM cell. This activation vector is passed to the next timestep.
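The equations above can be sketched end-to-end as a single LSTM step in NumPy. This is a minimal illustration with made-up dimensions and random weights (n_a and n_x are hypothetical sizes), not a real trained cell:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_a, n_x = 4, 3  # hypothetical activation and input sizes

# Random weights/biases for the candidate cell state and the three gates;
# each acts on the concatenation [a^{<t-1>}, x^{<t>}].
W_c, b_c = rng.normal(size=(n_a, n_a + n_x)), np.zeros(n_a)
W_u, b_u = rng.normal(size=(n_a, n_a + n_x)), np.zeros(n_a)
W_f, b_f = rng.normal(size=(n_a, n_a + n_x)), np.zeros(n_a)
W_o, b_o = rng.normal(size=(n_a, n_a + n_x)), np.zeros(n_a)

a_prev, c_prev = np.zeros(n_a), np.zeros(n_a)  # previous activation / cell state
x_t = rng.normal(size=n_x)                     # current input

ax = np.concatenate([a_prev, x_t])
c_tilde = np.tanh(W_c @ ax + b_c)    # candidate cell state
gamma_u = sigmoid(W_u @ ax + b_u)    # update gate
gamma_f = sigmoid(W_f @ ax + b_f)    # forget gate
gamma_o = sigmoid(W_o @ ax + b_o)    # output gate

c_t = gamma_u * c_tilde + gamma_f * c_prev  # new cell state
a_t = gamma_o * np.tanh(c_t)                # new activation
```

Because the gates are sigmoids, their entries lie strictly between 0 and 1, and the tanh keeps the activation bounded in (-1, 1).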

This is an example of an LSTM cell as it is usually drawn, with the activation represented by **h_{t}** instead of a^{<t>}. Note that such diagrams can be harder to interpret than the equations.

**LSTM IMPLEMENTATION**

**Problem statement**

We are given monthly counts of airline passengers for 144 months, and we want to predict the number of passengers. This is a typical time-series prediction problem.

Note: You can find the complete code here: https://www.kaggle.com/amritansh22/flight-passenger-prediction-in-keras-and-pytorch. Please upvote it if you find it helpful.

**KERAS**

We start by loading necessary data pre-processing and computation libraries.

import numpy as np

import pandas as pd

Loading the dataset.

dataset = pd.read_csv("/kaggle/input/air-passengers/AirPassengers.csv")

Checking if there are any null values.

dataset.isnull().sum()

Checking the dataset shape.

dataset.shape

Viewing the dataset

dataset.head()

Now we take the number of passengers and store it in an array.

data = np.array(dataset[["#Passengers"]])

Viewing the data

print(data[:5])

Checking the shape of the data.

data.shape

Now we scale the data into the range -1 to 1 using scikit-learn's MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))

data = scaler.fit_transform(data)
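As a quick, self-contained check of what the scaler does (toy numbers, not the dataset): the minimum of the fitted column maps to -1, the maximum to 1, and everything in between falls linearly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[100.0], [250.0], [400.0]])  # made-up passenger counts
toy_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy)
print(toy_scaled.ravel())  # approximately [-1.  0.  1.]
```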

Now we convert the data into sequences, since the LSTM layer only accepts input in this form. We take a window of 60 passenger samples into X_train, and the following sample becomes the corresponding y_train value. We repeat this over the whole series, sliding the window forward, and store the results in the NumPy arrays X_train and y_train.

```
X_train = []
y_train = []
for i in range(60, 144):
    X_train.append(data[i-60:i, 0])
    y_train.append(data[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
```

We check the shape of X_train

X_train.shape

But the LSTM layer does not accept this shape: Keras expects 3D input of the form (samples, timesteps, features). So we reshape the array to make it suitable for the LSTM.

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

So this is the new shape.

X_train.shape

Now we start building our model. We begin by importing the necessary libraries.

from keras.models import Sequential

from keras.layers import LSTM

from keras.layers import Dropout

from keras.layers import Dense

Now we start building our model.

We first create an instance of the **Sequential** class from Keras; a Sequential model is a linear stack of layers.

Then we add an LSTM layer to the model, which we have named regressor. This layer has 50 units (cells) of LSTM. We set **return_sequences=True** to tell the layer to return its output at every timestep, so that the next LSTM layer receives a full sequence as input. We also specify the shape of the input sequences that we will give to the layer.

Then we add a **Dropout** layer in order to prevent overfitting.

We stack up these layers.

regressor = Sequential()

regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))

regressor.add(Dropout(0.2))

regressor.add(LSTM(units = 50, return_sequences = True))

regressor.add(Dropout(0.2))

regressor.add(LSTM(units = 50, return_sequences = True))

regressor.add(Dropout(0.2))

Then we add the last LSTM layer.

regressor.add(LSTM(units = 50))

Since this is the last LSTM layer, we only need its output at the final timestep, so we leave **return_sequences** at its default of **False**.

Next we add a Dropout layer and finally add a Dense Layer to get the output.

regressor.add(Dropout(0.2))

regressor.add(Dense(units = 1))

Now we get the summary of the model that we just built.

regressor.summary()

Now we will compile the model.

regressor.compile(optimizer = 'adam',loss = 'mean_squared_error')

We use the **adam** optimizer, and the loss function is mean_squared_error since this is a regression problem.

Now we train, or **fit**, the model for 100 epochs with a batch size of 32.

regressor.fit(X_train,y_train,epochs = 100, batch_size = 32)

With this our training starts.

First 10 epochs:

The training ends with a loss of 0.0130.
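One detail to keep in mind when using the trained model: it was fit on scaled values, so its predictions live in the [-1, 1] space and must be mapped back to passenger counts with the same scaler's inverse_transform. A self-contained sketch of that round trip (toy counts, not the model's actual predictions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

counts = np.array([[112.0], [118.0], [132.0], [129.0]])  # sample monthly counts

scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(counts)          # what the network sees
recovered = scaler.inverse_transform(scaled)   # back to passenger counts

print(np.allclose(recovered, counts))  # True
```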

**PyTorch**

We will now study the same problem in PyTorch.

We start by loading necessary data pre-processing and computation libraries.

import numpy as np

import pandas as pd

Loading the dataset.

dataset = pd.read_csv("/kaggle/input/air-passengers/AirPassengers.csv")

Checking if there are any null values.

dataset.isnull().sum()

Checking the dataset shape.

dataset.shape

Viewing the dataset

dataset.head()

Now we take the number of passengers and store it in an array.

data = np.array(dataset[["#Passengers"]])

Viewing the data

print(data[:5])

Checking the shape of the data.

data.shape

Now we scale the data into the range -1 to 1 using scikit-learn's MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))

data = scaler.fit_transform(data)

Now we convert the data into sequences, since the LSTM only accepts input in this form.

We take a window of 60 passenger samples into X_train, and the following sample becomes the corresponding y_train value. We repeat this over the whole series and store the results in the NumPy arrays X_train and y_train.

```
X_train = []
y_train = []
for i in range(60, 144):
    X_train.append(data[i-60:i, 0])
    y_train.append(data[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
```

We check the shape of X_train

X_train.shape

But the LSTM in PyTorch does not accept this shape: without batch_first, nn.LSTM expects input of the form (seq_len, batch, input_size). So we reshape the array accordingly.

X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))

So this is the new shape.

X_train.shape

Now we import necessary PyTorch dependencies.

import torch.nn as nn

import torch

from torch.autograd import Variable

Now we create the LSTM class.

```
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTM, self).__init__()
        self.LSTM = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers
        )
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x, h_state):
        r_out, hidden_state = self.LSTM(x, h_state)
        hidden_size = hidden_state[-1].size(-1)
        r_out = r_out.view(-1, hidden_size)
        outs = self.out(r_out)
        return outs, hidden_state
```

In this class we create a network composed of multiple LSTM layers
stacked together; the number of stacked layers is set by **num_layers**. The nn.Linear layer is the
final layer that gives us the output.

In the forward function we perform the forward propagation
of the network. First we store the output and the hidden state of the LSTM in **r_out** and **hidden_state**. Then we read off the **hidden_size**, which is needed for the reshape that follows. To get the
output, we reshape **r_out** so that each timestep's hidden vector becomes one row, and pass it through the Linear layer to obtain one prediction per timestep.
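Before wiring this into training, it can help to verify the tensor shapes that nn.LSTM produces. A standalone shape check with random tensors (the 60/64/2 sizes mirror the constants used here; without batch_first=True, nn.LSTM expects input shaped (seq_len, batch, input_size)):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=60, hidden_size=64, num_layers=2)
linear = nn.Linear(64, 1)

x = torch.randn(84, 1, 60)        # 84 windows, batch of 1, 60 values per window
r_out, (h_n, c_n) = lstm(x)       # hidden state defaults to zeros
out = linear(r_out.view(-1, 64))  # one prediction per "timestep" (window)

print(r_out.shape, h_n.shape, out.shape)
# torch.Size([84, 1, 64]) torch.Size([2, 1, 64]) torch.Size([84, 1])
```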

Now we declare some variables to be used later.

INPUT_SIZE = 60

HIDDEN_SIZE = 64

NUM_LAYERS = 2

OUTPUT_SIZE = 1

Now we create an instance of our LSTM class.

LSTM = LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, OUTPUT_SIZE)

Now we define the optimizer and the loss function. We also initialise the hidden state to **None**.

optimiser = torch.optim.Adam(LSTM.parameters(), lr=0.01)

criterion = nn.MSELoss()

hidden_state = None

Now we will start the training.

```
for epoch in range(100):
    inputs = Variable(torch.from_numpy(X_train).float())
    labels = Variable(torch.from_numpy(y_train).float())
    output, hidden_state = LSTM(inputs, hidden_state)
    loss = criterion(output.view(-1), labels)
    optimiser.zero_grad()
    loss.backward(retain_graph=True)  # back propagation
    optimiser.step()                  # update the parameters
    print('epoch {}, loss {}'.format(epoch, loss.item()))
```

We train the network for 100 epochs.

First we convert **X_train** and **y_train** to PyTorch Variables. Then we compute the output, and find the loss on this output. **zero_grad()** is a PyTorch function that sets our gradients to zero, since PyTorch *accumulates the gradients on subsequent backward passes*. Then, using the **backward()** function, we **backpropagate** through the neural network. The **step()** function updates the parameters using the calculated gradients.

Then we print the loss after every epoch.

Output:

First 10 epochs

After the 100^{th} epoch
