
LSTM Architecture & Text Classification Guide

A Complete Guide to LSTM Architecture and its Use in Text Classification
Yugesh Verma
8-10 minutes
Neural networks are gaining ground quickly in modern data science
because they can perform a wide range of tasks quickly and easily.
There are many kinds of neural networks, and each is used for
different tasks. This article focuses on the LSTM model, one
variant of the neural network. In one of our previous articles, we
discussed that LSTM networks perform well on sequential data such
as time series. Here, we treat text as sequential data and fit an
LSTM model to it. The major points to be discussed in this article
are given below.
Table of Contents
Introduction to LSTM
The Architecture of LSTM
8/19/2022, 2:25 AM
1. Forget gate
2. Input gate
3. Cell state
4. Output gate
Why do we use LSTM with text data?
Text classification using LSTM
Introduction to LSTM
An LSTM (Long Short-Term Memory) network is a type of RNN
(Recurrent Neural Network) that is widely used for learning
sequential data prediction problems. Like every other neural
network, an LSTM has layers that help it learn and recognize
patterns for better performance. The basic operation of an LSTM is
to hold on to the information that is required and to discard the
information that is not required or useful.
LSTM networks come in many variants, but we can divide them
roughly into three types.
LSTM forward pass
LSTM backward pass
Bidirectional LSTM or Bi-LSTM
As the names suggest, the forward pass and backward pass LSTMs
are unidirectional: each processes the information in one direction
only, either forward or backward through the sequence. The
bidirectional LSTM processes the data in both directions to persist
the information. All of the LSTM types above share a basic
structure, and the differences between them come from modifying
that structure. Next, we will look at the different
components of a basic LSTM model architecture.
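The difference between the three types can be sketched without any deep learning library. The example below is only an illustration under stated assumptions: `toy_step` is a hypothetical stand-in for an LSTM cell (a simple running blend, not the real gate arithmetic), and `run_direction` and `run_bidirectional` are names made up for this sketch.

```python
def toy_step(state, x):
    """Hypothetical stand-in for an LSTM cell: blend the old state with the new input."""
    return 0.5 * state + 0.5 * x


def run_direction(sequence, reverse=False):
    """Run the toy cell over the sequence in one direction; return one state per position."""
    positions = range(len(sequence))
    order = reversed(positions) if reverse else positions
    states = [0.0] * len(sequence)
    state = 0.0
    for i in order:
        state = toy_step(state, sequence[i])
        states[i] = state
    return states


def run_bidirectional(sequence):
    """Pair the forward and backward states at every position."""
    forward = run_direction(sequence)
    backward = run_direction(sequence, reverse=True)
    return list(zip(forward, backward))


sequence = [1.0, 2.0, 3.0]
forward_states = run_direction(sequence)                 # sees only earlier items
backward_states = run_direction(sequence, reverse=True)  # sees only later items
bidirectional_states = run_bidirectional(sequence)       # sees both sides
print(bidirectional_states)
```

At each position the forward state summarizes what came before and the backward state summarizes what comes after, which is why the bidirectional variant has context from both sides.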
The Architecture of LSTM
A simple LSTM network consists of the following components.
Forget gate
Input gate
Cell state
Output gate
Adding hidden layers and gates to this simple LSTM changes its
type. A Bi-LSTM network, for example, can consist of two LSTMs
passing information in opposite or similar directions.
Let's have an overview of the gates and the state.
Forget Gate
As discussed earlier, one of the main properties of the LSTM is to
memorize and recognize the information coming into the network and
to discard the information that the network does not need in order
to learn the data and make predictions. The forget gate is
responsible for this feature of the LSTM.
It helps decide whether information can pass through the layers of
the network. It expects two inputs: the information from the
previous step (the hidden state) and the information from the
present step (the current input).
In the forget gate circuit, h and x are these two pieces of
information. They pass through the sigmoid function, and the
information whose sigmoid output tends towards zero is eliminated
from the network.
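The forget gate described above is commonly written as f_t = sigmoid(W_f [h_(t-1), x_t] + b_f). The NumPy sketch below shows that formula on random values. The function name `forget_gate`, the sizes, and the random weights are assumptions for illustration; a trained network would have learned weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forget_gate(h_prev, x_t, W_f, b_f):
    """Return one value between 0 and 1 per cell-state entry (0 = forget, 1 = keep)."""
    concat = np.concatenate([h_prev, x_t])  # join previous hidden state and current input
    return sigmoid(W_f @ concat + b_f)


rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3              # illustrative sizes
h_prev = rng.normal(size=hidden_size)       # h: previous hidden state
x_t = rng.normal(size=input_size)           # x: current input
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t)
```

Entries of `f_t` close to zero mark the parts of the memory that will be dropped.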
Input Gate
The input gate helps decide the importance of the information by
updating the cell state. Where the forget gate eliminates
information from the network, the input gate measures how important
the new information is. This helps the forget function eliminate
the unimportant information and helps the other layers learn the
information that is important.
The information goes through the sigmoid and tanh functions. The
sigmoid decides the weight of the information, and the tanh
squashes the values into the range between -1 and 1, which keeps
the network regulated.
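A minimal NumPy sketch of the input gate follows, assuming the usual formulation in which a sigmoid produces the weights and a tanh produces the candidate values. The name `input_gate`, the sizes, and the random weights are assumptions for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def input_gate(h_prev, x_t, W_i, b_i, W_c, b_c):
    """Return the gate weights, the candidate values, and their element-wise product."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)        # weight of each candidate, between 0 and 1
    candidate = np.tanh(W_c @ concat + b_c)  # proposed new values, between -1 and 1
    return i_t, candidate, i_t * candidate


rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3               # illustrative sizes
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

i_t, candidate, gated_update = input_gate(h_prev, x_t, W_i, b_i, W_c, b_c)
print(gated_update)
```

Because the sigmoid output is at most 1, the gated update can never be larger in magnitude than the candidate itself.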
Cell State
The weighted information then passes to the cell state, where
this layer calculates the new cell state. First, the previous cell
state is multiplied element-wise by the output of the forget gate,
so the information that should be dropped is multiplied by
near-zero values.
Then the output of the input gate is added to the result. This
addition updates the cell state with the information that is
relevant to the network.
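The cell state update can be worked through with a few hand-picked numbers. This is a sketch of the standard update c_t = f_t * c_prev + i_t * candidate. All the values below are made up for illustration and are not taken from a trained model.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])     # previous cell state
f_t = np.array([1.0, 0.0, 0.5])         # forget gate output: keep, drop, keep half
i_t = np.array([0.0, 1.0, 0.5])         # input gate output: ignore, accept, accept half
candidate = np.array([0.9, -0.8, 0.4])  # candidate values from the tanh

kept = f_t * c_prev                     # old memory after forgetting
added = i_t * candidate                 # new information after weighting
c_t = kept + added                      # updated cell state
print(c_t)
```

The first entry keeps its old value, the second is fully replaced by the candidate, and the third is a mix of both.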
Output Gate
This is the last gate of the circuit, and it helps decide the next
hidden state of the network. The information first goes through the
sigmoid function. The updated cell state goes through the tanh
function and is then multiplied by the sigmoid output of the output
gate, which helps the hidden state carry the information.
This is the final stage of the circuit which helps the hidden state to
decide which information it should carry.
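The output gate can be sketched in NumPy as o_t = sigmoid(W_o [h_(t-1), x_t] + b_o) followed by h_t = o_t * tanh(c_t). The name `output_gate`, the sizes, and the random values are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def output_gate(h_prev, x_t, c_t, W_o, b_o):
    """Return the gate output and the next hidden state."""
    concat = np.concatenate([h_prev, x_t])
    o_t = sigmoid(W_o @ concat + b_o)  # which parts of the memory to expose
    h_t = o_t * np.tanh(c_t)           # next hidden state
    return o_t, h_t


rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3         # illustrative sizes
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
c_t = rng.normal(size=hidden_size)     # updated cell state (random here)
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t, h_t = output_gate(h_prev, x_t, c_t, W_o, b_o)
print(h_t)
```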
Why do we use LSTM with text data?
In ordinary text modelling, most of the preprocessing and modelling
work focuses on arranging the data as a sequence. Examples of such
tasks are POS tagging, stopword elimination, and sequencing of the
text. These methods try to make the data easier for a model to
understand by following known patterns, and they can give results.
Applying LSTM networks here brings its own advantages.
Earlier in the article, we discussed that an LSTM can memorize the
sequence of the data. It also works on the elimination of unused
information, and as we know, text data always contains a lot of
unused information, which the LSTM can eliminate so that the
calculation time and cost can be reduced.
So the ability to eliminate unused information and to memorize the
sequence of the information makes the LSTM a powerful tool for
text classification and other text-based tasks.
Text classification using LSTM
In this section, I have created an LSTM model for text
classification using the IMDB data set provided by Keras, which
contains reviews of movies written by users of the IMDB site.
You can use the full code to build the model on a similar data set.
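import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense
from keras.preprocessing.sequence import pad_sequences
# fix random seed for reproducibility (the value 7 is an arbitrary choice)
np.random.seed(7)
# load the dataset but only keep the top 6000 words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=6000)
# pad input sequences so that every review has the same length
X_train = pad_sequences(X_train, maxlen=500)
X_test = pad_sequences(X_test, maxlen=500)
# embedding layer followed by an LSTM layer with 100 units
model = Sequential()
model.add(Embedding(6000, 32, input_length=500))
model.add(LSTM(100))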
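model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()
# train for 3 epochs
model.fit(X_train, y_train, epochs=3)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))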
Before building the model, we padded the sequences of the data so
that every review enters the model with the same length.
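What padding does to the reviews can be shown in plain Python. The helper `pad_to_length` below is hypothetical and written only for this sketch. It mimics what I understand to be the default behaviour of Keras `pad_sequences` (zeros added at the front, and over-long sequences cut from the front), so check the Keras documentation before relying on that detail.

```python
def pad_to_length(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)


# two made-up reviews, already encoded as word indices
reviews = [[11, 42, 7], [5, 9, 13, 2, 8, 21]]
padded = [pad_to_length(review, maxlen=5) for review in reviews]
print(padded)
```

The short review gains two leading zeros and the long review loses its first word, so both end up with length 5.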
In the modelling, we are making a sequential model. The first layer
of the model is the embedding layer, which maps each word to a
vector of length 32. The next layer is the LSTM layer, which has
100 neurons that work as the memory unit of the model. After the
LSTM comes the dense layer, an output layer with a sigmoid
function, which helps in providing the labels.
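The size of each layer can be checked with a little arithmetic. The sketch below assumes the standard LSTM layout of four weight sets (forget, input, candidate, output), each with input weights, recurrent weights, and one bias per unit. The variable names are illustrative.

```python
vocab_size = 6000    # words kept when loading the data
embedding_dim = 32   # length of each word vector
lstm_units = 100     # neurons in the LSTM layer

# one 32-length vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim
# four weight sets, each seeing the embedding input, the previous hidden state, and a bias
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# one output neuron connected to every LSTM unit, plus its bias
dense_params = lstm_units + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, which is typical when the vocabulary is much larger than the hidden size.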
In the data set, the reviews are either good or bad, so they can be
classified with the values 0 and 1. The loss function is binary
cross-entropy, and the Adam optimizer is suggested when working
with text classification.
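How the sigmoid output and the binary cross-entropy loss fit together can be sketched in plain Python. The raw scores, the labels, and the 0.5 decision threshold below are assumptions chosen for illustration, and `binary_cross_entropy` is a hand-written helper, not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


raw_scores = [2.0, -1.5, 0.3]  # made-up outputs of the dense layer before the sigmoid
labels = [1, 0, 0]             # made-up true labels

probabilities = [sigmoid(z) for z in raw_scores]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]
loss = binary_cross_entropy(labels, probabilities)
print(predicted, round(loss, 4))
```

The third example is misclassified, and it contributes the largest share of the loss.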
The below image shows the results and summary of the model
which we have created.
In the model, we used only 3 epochs so that the model does not
overfit the small data set. In the image, we can see that the
result from the model is very satisfactory. The accuracy increased
to around 90%, and the final accuracy after all three epochs is 85%.
Final Words
As we have seen in the article, we did no data preprocessing: we
just loaded the data and put it into a simple LSTM model, and the
model gave very satisfactory results. A number of changes to the
data or to the model could help increase the accuracy further.
LSTM is a commonly used network for sequential data such as time
series and audio. There are various tasks we can perform in the
time series analysis domain using LSTM.