One of the biggest problems with RNNs is the vanishing gradient problem. We can understand it through a simple example.

Suppose we have a very long sentence:

The accountant, who is young and resides near the river …………………………………..   , is incapable of handling the business.

Suppose we want to decide whether the verb near the end of this sentence should be singular ("is") or plural ("are"). That depends on the subject, which appears at the very start: the singular "accountant" calls for "is", while the plural "accountants" would call for "are".

Because the sentence is extremely long, the gradients shrink as we backpropagate through a standard RNN over many timesteps, and it becomes difficult for the memory state a<t> to retain information that occurred much earlier in the sequence. In the end, the network cannot hold on to what mattered most, here the subject at the start of the sentence. We overcome this problem by modifying our RNN.
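To make the shrinking concrete, here is a minimal NumPy sketch. It assumes a toy scalar RNN a<t> = tanh(w * a<t-1> + x<t>) with a hypothetical weight w = 0.5 and random inputs (neither comes from this article). It tracks the gradient of the current state with respect to the initial state, which by the chain rule is a product of one factor per timestep.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.5                 # hypothetical recurrent weight, chosen for illustration
T = 50                  # number of timesteps
x = rng.normal(size=T)  # made-up inputs

a = 0.0
grad = 1.0              # d a<t> / d a<0>, built up by the chain rule
grads = []
for t in range(T):
    a = np.tanh(w * a + x[t])
    # d a<t> / d a<t-1> = w * (1 - tanh(.)^2) = w * (1 - a^2)
    grad *= w * (1.0 - a ** 2)
    grads.append(abs(grad))

print(grads[0], grads[9], grads[-1])
```

Every factor has magnitude at most 0.5 in this toy setup, so the gradient shrinks at least geometrically with the number of timesteps. This is the effect that makes early words hard to learn from.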

In this article we discuss a variant of the RNN called the LSTM, or Long Short-Term Memory network. Several variants of the LSTM are in use; we use the one with three gates. The LSTM mitigates the vanishing gradient problem by using gates, which keep only the information that is required and throw away the rest.

In this LSTM, in addition to the activation vector a<t> used in the RNN, we use an additional vector c<t> that stores extra information. It is called the cell state vector.

Working of LSTM

To understand the LSTM, we first study the equations it uses.

The cell state of the LSTM, which stores information from previous timesteps, is denoted by c<t>. At each timestep we first compute a candidate value ć<t> for it:

ć<t> = tanh(Wc [a<t-1>, x<t>]  + bc )

This is the candidate cell state vector of the t-th timestep. Wc is the weight matrix, a<t-1> is the vector of activations from timestep t-1, x<t> is the input vector at timestep t, and bc is the bias term.

Now we create three gates.

Update gate

This gate controls how much of the candidate ć<t> is written into the cell state at each timestep.

u  = sigmoid(Wu [a<t-1>, x<t>]  + bu)

Forget gate

This gate controls how much of the previous cell state c<t-1> is kept, so that information of little importance is left behind.

f  = sigmoid(Wf [a<t-1>, x<t>]  + bf)

Output gate

This gate controls how much of the cell state is exposed as the activation for the timestep.

o  = sigmoid(Wo [a<t-1>, x<t>]  + bo)

Having created these gates, we need the outputs of the LSTM cell, that is, the equations for the next cell state and the next activation.

c<t> = u * ć<t>  + f * c<t-1>

This equation gives one output of the LSTM cell, the next cell state. Here * denotes element-wise multiplication. Note that ć<t> is the candidate computed at the current timestep: the update gate u decides how much of it is added, the forget gate f decides how much of the previous cell state c<t-1> is kept, and the result is passed to the next timestep.

a<t> = o * tanh(c<t>)

This equation gives the output activation of the LSTM cell. This activation vector is also passed to the next timestep.
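The equations can be put together into a single forward step. The following NumPy sketch is only an illustration: it takes x<t> as the input to the candidate and to every gate, uses the output gate o in the activation equation, and picks hypothetical sizes (4 hidden units, 3 input features) and random weights that are not from this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One LSTM timestep with update, forget and output gates."""
    concat = np.concatenate([a_prev, x_t])          # [a<t-1>, x<t>]
    c_cand = np.tanh(p["Wc"] @ concat + p["bc"])    # candidate cell state
    u = sigmoid(p["Wu"] @ concat + p["bu"])         # update gate
    f = sigmoid(p["Wf"] @ concat + p["bf"])         # forget gate
    o = sigmoid(p["Wo"] @ concat + p["bo"])         # output gate
    c_t = u * c_cand + f * c_prev                   # next cell state
    a_t = o * np.tanh(c_t)                          # next activation
    return a_t, c_t

n_a, n_x = 4, 3   # hypothetical hidden and input sizes
rng = np.random.default_rng(0)
params = {}
for g in "cufo":
    params["W" + g] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    params["b" + g] = np.zeros(n_a)

a_t, c_t = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(5, n_x)):   # run five timesteps on made-up inputs
    a_t, c_t = lstm_cell_forward(x_t, a_t, c_t, params)

print(a_t.shape, c_t.shape)
```

Both a_t and c_t are carried from one timestep to the next, which is the only state the cell keeps between inputs.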

The accompanying diagram shows an example of an LSTM cell. In it, the activation is denoted by ht rather than a<t>. Note that the diagram may be harder to interpret than the equations.

LSTM IMPLEMENTATION

Problem statement

We are given the number of flight passengers per month for 144 months. We want to predict the number of passengers in a month from the counts of the preceding months. This is a typical time-series prediction problem.

Note: You can find the complete code here: https://www.kaggle.com/amritansh22/flight-passenger-prediction-in-keras-and-pytorch Please upvote it if you find it helpful

KERAS

import numpy as np
import pandas as pd

dataset = pd.read_csv("/kaggle/input/air-passengers/AirPassengers.csv")


Checking if there are any null values.

dataset.isnull().sum()


Check dataset shape

dataset.shape


Viewing the dataset

dataset.head()


Now we need to take the number of passengers and store it into an array.

data = np.array(dataset[["#Passengers"]])


Viewing the data

print(data[:5])


Checking the shape of the data.

data.shape


Now we scale the data into a range of -1 and 1 using the scikit-learn library

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
data = scaler.fit_transform(data)
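For intuition, min-max scaling to a range (lo, hi) maps each value x to lo + (x - min) * (hi - lo) / (max - min). The NumPy sketch below applies that formula by hand to a few made-up passenger counts (the numbers are illustrative, not taken from the dataset) and then inverts it to recover the original values.

```python
import numpy as np

values = np.array([[100.0], [150.0], [200.0], [300.0]])  # made-up counts
lo, hi = -1.0, 1.0                                       # target range

v_min, v_max = values.min(axis=0), values.max(axis=0)
scaled = lo + (values - v_min) * (hi - lo) / (v_max - v_min)

# Invert the mapping to get back to the original scale
restored = v_min + (scaled - lo) * (v_max - v_min) / (hi - lo)

print(scaled.ravel())
print(restored.ravel())
```

The smallest value lands on -1 and the largest on 1, with everything else placed linearly in between.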

Now we convert the data into sequences, since the LSTM layer only accepts input in this form. We take 60 consecutive passenger counts as one sample in X_train and store the count that follows them in y_train. We repeat this across the series and collect the results in the NumPy arrays X_train and y_train.

X_train = []
y_train = []
for i in range(60, 144):
    X_train.append(data[i-60:i, 0])
    y_train.append(data[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
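The windowing is easier to follow on a small example. This sketch uses a toy series of ten values and a window of 3 instead of 60; both are assumptions made for readability.

```python
import numpy as np

series = np.arange(10, dtype=float).reshape(-1, 1)  # toy data: 0, 1, ..., 9
window = 3                                          # the article uses 60

X, y = [], []
for i in range(window, len(series)):
    X.append(series[i - window:i, 0])   # the `window` values before index i
    y.append(series[i, 0])              # the value that follows them
X, y = np.array(X), np.array(y)

print(X.shape, y.shape)
print(X[0], y[0])
```

With 144 months and a window of 60, the same loop yields 144 - 60 = 84 training samples.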


We check the shape of X_train

X_train.shape


But the LSTM does not accept this shape. It expects a 3D input of shape (samples, timesteps, features), so we reshape the array accordingly.

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))


So this is the new shape.

X_train.shape
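As a quick check of what the reshape does, the sketch below uses a zero-filled stand-in array with the dimensions this data would have (84 samples of 60 timesteps). It shows that adding a trailing axis gives the same result as the explicit reshape.

```python
import numpy as np

X = np.zeros((84, 60))                            # stand-in: (samples, timesteps)
X3 = np.reshape(X, (X.shape[0], X.shape[1], 1))   # (samples, timesteps, features)
same = X[..., np.newaxis]                         # equivalent: add a trailing axis

print(X3.shape, same.shape)
```

The last axis has size 1 because each timestep carries a single feature, the passenger count.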


Now we start building our model, beginning with the necessary imports.

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Dense

We first create an instance of the Sequential class. In Keras, a Sequential model is a linear stack of layers.

Then we add an LSTM layer to the model, which we name regressor. This layer has 50 LSTM units. We set return_sequences to True so that the layer returns its output at every timestep rather than only the last one, which the next LSTM layer needs as its input sequence. We also specify the shape of the input sequence that we give to the layer.

Then we add a dropout layer to reduce overfitting.

We stack three such LSTM and Dropout pairs.

regressor = Sequential()
regressor.add(LSTM(units = 50,return_sequences = True,input_shape = (X_train.shape[1],1)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50,return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50,return_sequences = True))
regressor.add(Dropout(0.2))

Then we add the last LSTM layer.

regressor.add(LSTM(units = 50))

Since this is the last LSTM layer, we only need its output at the final timestep, so we leave return_sequences at its default of False.
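The effect of return_sequences on the output shape can be illustrated without Keras. The sketch below is a toy tanh recurrent layer written in NumPy; it is neither an LSTM nor the Keras implementation, and the function name and random weights are hypothetical. It only shows the difference between returning every timestep and returning the last one.

```python
import numpy as np

def toy_recurrent_layer(x, units, return_sequences, seed=0):
    """Toy tanh recurrent layer, used only to show output shapes.

    x has shape (samples, timesteps, features).
    """
    rng = np.random.default_rng(seed)
    Wx = rng.normal(scale=0.1, size=(x.shape[2], units))
    Wh = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((x.shape[0], units))
    outputs = []
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ Wx + h @ Wh)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # (samples, timesteps, units)
    return h                               # (samples, units): last timestep only

x = np.random.default_rng(1).normal(size=(84, 60, 1))  # made-up input
seq = toy_recurrent_layer(x, 50, return_sequences=True)
last = toy_recurrent_layer(x, 50, return_sequences=False)

print(seq.shape, last.shape)
```

A stacked recurrent layer needs the 3D output as its input sequence, while a Dense layer at the end only needs the 2D output of the last timestep.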

Next we add a Dropout layer and finally add a Dense Layer to get the output.

regressor.add(Dropout(0.2))
regressor.add(Dense(units = 1))

Now we get the summary of the model that we just built.

regressor.summary()


Now we will compile the model.

regressor.compile(optimizer = 'adam',loss = 'mean_squared_error')


We use the Adam optimizer, and the loss function is mean_squared_error because this is a regression problem.

Now we train, or fit, the model. We train for 100 epochs with a batch size of 32.
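Mean squared error is the average of the squared differences between targets and predictions. As a small worked example with made-up values:

```python
import numpy as np

y_true = np.array([0.2, -0.1, 0.5])   # made-up scaled targets
y_pred = np.array([0.1, 0.0, 0.7])    # made-up predictions

# Differences are 0.1, -0.1, -0.2; their squares are 0.01, 0.01, 0.04
mse = np.mean((y_true - y_pred) ** 2)

print(mse)
```

Because the data was scaled to the range -1 to 1, the loss is measured on that scale and not in passengers.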

regressor.fit(X_train,y_train,epochs = 100, batch_size = 32)


With this our training starts.

First 10 epochs:

Training ends with a loss of 0.0130.

PyTorch

We now solve the same problem in PyTorch.

import numpy as np
import pandas as pd

dataset = pd.read_csv("/kaggle/input/air-passengers/AirPassengers.csv")


Checking if there are any null values.

dataset.isnull().sum()


Check dataset shape

dataset.shape


Viewing the dataset

dataset.head()


Now we need to take the number of passengers and store it into an array.

data = np.array(dataset[["#Passengers"]])


Viewing the data

print(data[:5])


Checking the shape of the data.

data.shape


Now we scale the data into a range of -1 and 1 using the scikit-learn library

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
data = scaler.fit_transform(data)

Now we convert the data into sequences, since the LSTM layer only accepts input in this form.

We take 60 consecutive passenger counts as one sample in X_train and store the count that follows them in y_train. We repeat this across the series and collect the results in the NumPy arrays X_train and y_train.

X_train = []
y_train = []
for i in range(60, 144):
    X_train.append(data[i-60:i, 0])
    y_train.append(data[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

We check the shape of X_train

X_train.shape


But the LSTM in PyTorch does not accept this shape. By default nn.LSTM expects input of shape (sequence length, batch size, input size), so we reshape the array to match.

X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))

So this is the new shape.

X_train.shape


Now we import necessary PyTorch dependencies.

import torch.nn as nn
import torch
from torch.autograd import Variable

Now we create the LSTM class.

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTM, self).__init__()

        self.LSTM = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers
        )
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x, h_state):
        r_out, hidden_state = self.LSTM(x, h_state)

        hidden_size = hidden_state[-1].size(-1)
        r_out = r_out.view(-1, hidden_size)
        outs = self.out(r_out)

        return outs, hidden_state


Here we create a network made of several LSTM layers stacked on top of each other. The number of stacked layers is set by num_layers. The nn.Linear layer is the final layer that gives us the output.

In the forward function we perform the forward propagation of the network. First we store the output and the hidden state of the LSTM in r_out and hidden_state. Then we read hidden_size from the shape of the returned state, as it is needed for the reshape that follows. We flatten r_out to shape (-1, hidden_size), so that each row holds the LSTM output at one timestep, pass it to the Linear layer, and get the output.
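The tensor shapes involved can be checked with a short sketch. It uses a zero-filled stand-in input and the sizes this article works with (input size 60, hidden size 64, 2 layers, 84 sequence steps, batch size 1). The variable names are chosen for the example only.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=60, hidden_size=64, num_layers=2)
x = torch.zeros(84, 1, 60)        # (seq_len, batch, input_size)

with torch.no_grad():
    # Omitting the initial state makes nn.LSTM start from zeros
    r_out, (h_n, c_n) = lstm(x)

flat = r_out.view(-1, 64)         # one row per timestep

print(r_out.shape, h_n.shape, c_n.shape, flat.shape)
```

nn.LSTM returns the output at every timestep together with a tuple holding the final hidden and cell states, one slice per layer.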

Now we declare some variables to be used later.

INPUT_SIZE = 60
HIDDEN_SIZE = 64
NUM_LAYERS = 2
OUTPUT_SIZE = 1

Now we create an instance of the LSTM class.

LSTM = LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, OUTPUT_SIZE)

Now we define the optimizer and the loss function. We also initialise the hidden state to None.

optimiser = torch.optim.Adam(LSTM.parameters(), lr=0.01)
criterion = nn.MSELoss()
hidden_state = None

Now we will start the training.

for epoch in range(100):
    inputs = Variable(torch.from_numpy(X_train).float())
    labels = Variable(torch.from_numpy(y_train).float())

    output, hidden_state = LSTM(inputs, hidden_state)
    # detach the carried-over state so the next epoch does not
    # backpropagate into this epoch's graph
    hidden_state = tuple(h.detach() for h in hidden_state)

    loss = criterion(output.view(-1), labels)
    optimiser.zero_grad()                                # clear old gradients
    loss.backward()                                      # back propagation
    optimiser.step()                                     # update the parameters

    print('epoch {}, loss {}'.format(epoch,loss.item()))

We train the network for 100 epochs.

In each epoch we first convert X_train and y_train to PyTorch Variables. Then we compute the output and the loss on that output. zero_grad is a PyTorch function that sets the gradients to zero, which is needed because PyTorch accumulates gradients across successive backward passes. Then the backward() function backpropagates through the network, and the step() function updates the parameters using the computed gradients.
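The accumulation behaviour is easy to see on a single scalar parameter. This is a minimal sketch with a made-up function 3 * w, whose gradient with respect to w is always 3.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3 * w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added, 3 + 3 = 6
(3 * w).backward()
accumulated = w.grad.item()

# Zero the gradient, then run backward again: back to 3
w.grad.zero_()
(3 * w).backward()
after_zero = w.grad.item()

print(first, accumulated, after_zero)
```

Calling zero_grad on the optimizer does the same zeroing for every parameter it manages, which is why it belongs before each backward pass in a training loop.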

Then we print the loss after every epoch.

Output:

First 10 epochs

After 100th epoch

