Introduction to NLP
Before diving deep into NLP, make sure you are clear about the distinction between data and information.

NLP is one way of understanding data and processing it to extract useful information. In today's world, an enormous amount of data is generated every second from various sources, and this data does not come in a single format, style or language; it has thousands of variations. Data may be text, voice, video, images or something else.
A few years back, only human beings were able to understand all these kinds of data. But we are now in an era where machines can also understand human language and data in many formats.
What makes machines so advanced?
NLP is the area of machine learning that makes a computer capable of understanding, analysing, processing and even generating human language. NLP is used in many areas of real life. Some of the most common applications you come across every day are:
- The most popular is our best friend – Google. Yes! Have you ever thought about how Google identifies anything, can search in any language, or can even translate your text? It is all done with the help of NLP.
- Google Speech Recognition – this is also NLP.
- Amazon Alexa – how does it identify your voice and react accordingly? Here too, NLP plays a significant role.
- Google auto-predict and auto-correct are also popular applications of NLP.
Let's dive deep into NLP in this article.
NLP is a branch of data science for processing textual and speech data. We have a module named nltk (Natural Language Toolkit) in Python for processing natural language.
There are various interesting things in NLP. We will discuss them one by one.
Regular expressions
This is one of the most basic and must-learn topics in NLP. Regular expressions are also known as regex.
A regular expression is a pattern defined using alphanumeric as well as special characters; this pattern can be matched against any given input string to check whether the input string contains the specified pattern or not.
For instance, a regex for an IP address is:
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
Similarly, a regex for a hyperlink is:
(https?://\S+)
Let's see the application of regex in Python. Python has a built-in module, re, for regular expressions. Please refer to the link below to get a basic understanding of regex; a short sketch follows it.
https://github.com/data-stats/NLP/blob/master/Regex.ipynb
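As a quick sketch of how these patterns can be used with Python's built-in re module (the groups are written as non-capturing here so that re.findall returns whole matches, and the IP pattern is compacted with a {3} repetition):

```python
import re

text = "Server 192.168.0.1 responded; docs at https://docs.python.org/3/library/re.html"

# The IP-address pattern from above, with non-capturing groups.
ip_pattern = r"\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}\b"

# The hyperlink pattern from above.
url_pattern = r"https?://\S+"

print(re.findall(ip_pattern, text))   # ['192.168.0.1']
print(re.findall(url_pattern, text))  # ['https://docs.python.org/3/library/re.html']
```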
Web Scraping
Using regex together with Python's standard library, we can scrape the text of any web page and then process it with the famous NLTK library.
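A minimal sketch, assuming only the standard library's urllib plus a tag-stripping regex (a dedicated HTML parser such as BeautifulSoup is more robust in practice; the URL here is just an example):

```python
import re
import urllib.request

# Fetch the raw HTML of a page.
url = "https://www.python.org/"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# Strip tags with a regex and collapse whitespace, leaving plain text
# that can then be tokenized and analysed with nltk.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(text[:200])  # first 200 characters of the extracted text
```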
Text Data Pre-processing
Sentence splitter
It is self-explanatory: it splits a paragraph into sentences. Sentence boundaries are detected at end-of-sentence punctuation such as '.', '?' and '!'. It returns a list of strings, where each element is one sentence from the input paragraph.
Below is the link for reference.
https://github.com/data-stats/NLP/blob/master/sentence_tokenize.ipynb
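A minimal example with nltk's sent_tokenize:

```python
import nltk
nltk.download("punkt")  # one-time download (newer nltk versions may ask for "punkt_tab")

from nltk.tokenize import sent_tokenize

paragraph = "We are learning NLP. It is fun! Shall we begin?"
print(sent_tokenize(paragraph))
# ['We are learning NLP.', 'It is fun!', 'Shall we begin?']
```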
Word Tokenization
In NLP, a token is the smallest unit that a machine can understand. A word tokenizer splits sentences into words. There are many ways to tokenize words, with or without nltk. A few methods are given below:
- The split() method
- word_tokenize
- TreebankWordTokenizer
Below is the reference link with examples of applying all the above methods.
https://github.com/data-stats/NLP/blob/master/Word_tokenize.ipynb
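A short sketch comparing the three methods:

```python
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

sentence = "We're learning NLP from datastats.com."

# 1. Plain Python split() - breaks on whitespace only.
print(sentence.split())

# 2. nltk's word_tokenize - also separates punctuation and contractions.
print(word_tokenize(sentence))

# 3. TreebankWordTokenizer - the Penn Treebank conventions used by word_tokenize.
print(TreebankWordTokenizer().tokenize(sentence))
```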
Remove Noise
In our daily conversations, we speak and write a lot of junk that does not convey any useful information: in any sentence, only a few words carry the core message, and the rest are helper words that complete the sentence according to the rules of English. While doing NLP, we can drop such words with no meaningful information; they are called noise in NLP. Regular expressions are very helpful here. Please refer to the code below to see how regex helps us remove noise.
https://github.com/data-stats/NLP/blob/master/NoiseRemoval.ipynb
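A small illustrative sketch (the patterns chosen here, i.e. URLs, mentions, hashtags and punctuation, are just one common definition of noise):

```python
import re

def remove_noise(text):
    """Strip common noise (URLs, @mentions, hashtags, punctuation) with regexes."""
    text = re.sub(r"https?://\S+", "", text)  # hyperlinks
    text = re.sub(r"[@#]\w+", "", text)       # mentions and hashtags
    text = re.sub(r"[^\w\s]", "", text)       # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(remove_noise("Loving #NLP!! Details at https://example.com @data_stats"))
# 'Loving Details at'
```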
Stemming
Stemming is the process of transforming a word into its root word. It removes a common suffix from the end of the word. For example, words like powerful, powers and powered have the base word power, so stemming uses an algorithm to transform these words to their base word.
Overstemming – when two words are stemmed to the same root word but should not have been.
Understemming – when two words should be stemmed to the same root but are not.
Famous algorithms used in stemming are:
- Porter
- Lancaster
- Snowball
The link below shows the usage of each stemming algorithm.
https://github.com/data-stats/NLP/blob/master/Stemmization.ipynb
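A quick comparison of the three stemmers on the example words above:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for word in ["powerful", "powers", "powered", "power"]:
    print(word, "->", porter.stem(word), lancaster.stem(word), snowball.stem(word))
```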
Lemmatization
It is also the process of reducing a word to its root form, but unlike stemming it works on different inflected forms of words: for example, the lemma of the word 'ate' is 'eat'. It uses context and POS (part of speech) in order to lemmatize. The POS parameter takes one of the following values: 'n' (noun), 'v' (verb), 'a' (adjective), 'r' (adverb) or 's' (satellite adjective).
Please refer to the link below for a lemmatization example. WordNetLemmatizer is one such lemmatizer from nltk.
https://github.com/data-stats/NLP/blob/master/Lemmatization.ipynb
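A minimal example; note that without a POS hint the lemmatizer treats the word as a noun:

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("ate"))              # 'ate'  (default POS is 'n')
print(lemmatizer.lemmatize("ate", pos="v"))     # 'eat'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'
```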
WordNet
In simple words, WordNet is an English lexical database, provided through nltk, where one can find the meanings of words and their synonyms. It basically contains nouns, adjectives, adverbs and verbs.
WordNet stores the synonyms of a word in the form of synsets, where each word in a synset has the same meaning. Please refer to the link below for an example.
https://github.com/data-stats/NLP/blob/master/WordNet.ipynb
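A short sketch of looking up synsets and collecting synonyms (the word 'good' is just an example):

```python
from nltk.corpus import wordnet

# Each synset groups words that share one meaning, with a definition.
for syn in wordnet.synsets("good")[:3]:
    print(syn.name(), "-", syn.definition())

# Collect the synonyms found across all synsets of the word.
synonyms = {lemma.name() for syn in wordnet.synsets("good") for lemma in syn.lemmas()}
print(synonyms)
```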
Stop words removal
Stop words are commonly used words that are present in almost all text, for example: is, am, are, do, did, was, cannot. These words are used as required by the rules of English but do not carry any information, so we should remove them as part of text pre-processing.
NLTK in Python has stop-word lists for 22 different languages. While pre-processing text, we can use these lists to remove useless words from the text.
There is one more concept, rare-word removal, where we remove rare words on the basis of their frequency in the input string. Also, while doing NLP in projects, we need to apply a spell check; for this, we can use the edit_distance method from nltk.
Please refer to the link below, which covers basic examples of stop-word removal, rare-word removal and spell check.
https://github.com/data-stats/NLP/blob/master/StopWordsRemoval.ipynb
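A combined sketch of stop-word removal, rare-word removal and spell check (the sample text and the frequency threshold are illustrative):

```python
import nltk
nltk.download("stopwords")  # one-time download of the stop-word lists

from nltk import FreqDist, edit_distance
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "this text is a small text showing how stop words are removed from text"
tokens = word_tokenize(text.lower())

# Stop-word removal using nltk's English list.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)

# Rare-word removal: keep only words that occur more than once.
freq = FreqDist(filtered)
print([t for t in filtered if freq[t] > 1])

# Spell check: edit_distance counts the edits needed to fix a misspelling.
print(edit_distance("langauge", "language"))  # 2
```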
POS tagging
It involves identifying each word in the given text as a noun, verb, adverb, adjective and so on. There are a total of 36 POS tags in the Penn Treebank tagset.
Please refer to the link below for the full list.
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
NLTK provides the function pos_tag() to get the POS tag for each word in a text. POS tagging is important for syntactic and semantic analysis.
Please refer to the link below for a basic example.
https://github.com/data-stats/NLP/blob/master/POSTagging.ipynb
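A minimal example (exact tags can vary slightly between nltk versions):

```python
import nltk
nltk.download("averaged_perceptron_tagger")  # newer versions: "averaged_perceptron_tagger_eng"

from nltk import pos_tag, word_tokenize

sentence = "We are learning NLP from datastats.com"
print(pos_tag(word_tokenize(sentence)))
# e.g. [('We', 'PRP'), ('are', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ...]
```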
Chunking
Chunking is an extension of POS tagging and is basically the grouping of words into chunks. It adds more structure to the sentence and is also known as shallow parsing. A short sketch is given below.
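A minimal sketch using nltk's RegexpParser with a simple, illustrative noun-phrase grammar:

```python
from nltk import RegexpParser, pos_tag, word_tokenize

sentence = "The little yellow dog barked at the cat"
tagged = pos_tag(word_tokenize(sentence))

# Chunk grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
tree = RegexpParser(grammar).parse(tagged)
print(tree)  # e.g. (S (NP The/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```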
N-grams
An n-gram is a contiguous sequence of n items in a sentence, where n can be any positive integer. For example, consider the text below.
“We are learning NLP from datastats.com.”
The list of bi-grams for the above sentence would be:
["We are", "are learning", "learning NLP", "NLP from", "from datastats.com"]
The n-gram technique is mostly used in text classification use cases.
Please refer to the link below to quickly see how to apply this.
https://github.com/data-stats/NLP/blob/master/NGram.ipynb
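A quick sketch using nltk's ngrams helper on the example sentence above:

```python
from nltk import ngrams

sentence = "We are learning NLP from datastats.com."
tokens = sentence.rstrip(".").split()  # simple whitespace tokens, trailing full stop dropped

# Bi-grams (n = 2); any positive n works the same way.
print([" ".join(gram) for gram in ngrams(tokens, 2)])
# ['We are', 'are learning', 'learning NLP', 'NLP from', 'from datastats.com']
```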
Bag of words (BoW)
For any machine learning algorithm, we need to convert textual information into some numeric format so that the algorithm can be applied over it. BoW is a method of counting the occurrences of words in a document.
A few important concepts in BoW are given below.
Word Embeddings – texts converted into numeric form. Here, a word is mapped to a vector using a dictionary. For example, consider the sentence "We are learning NLP from datastats.com". The dictionary for this sentence assigns each word a position: {We: 1, are: 2, learning: 3, NLP: 4, from: 5, datastats.com: 6}.
The easiest way to represent a word as a vector is one-hot encoding, where the position of the word is set to 1 and the rest of the positions in the sentence are 0. For example, the vector representation of 'learning' would be [0,0,1,0,0,0].
There are two types of Word Embeddings:
- Frequency based Embedding
- Prediction based Embedding
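A plain-Python illustration of one-hot encoding using the example sentence above:

```python
sentence = "We are learning NLP from datastats.com"
vocabulary = sentence.split()

def one_hot(word, vocabulary):
    """Return a vector with 1 at the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot("learning", vocabulary))  # [0, 0, 1, 0, 0, 0]
```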
Term Document Matrix
It is a 2-D matrix where the columns are the document names, the rows are the words present in those documents, and each cell holds the frequency of that word in that document. Consider the example below.
Let's say we have 3 short documents:
D1: "I love NLP"
D2: "NLP is fun"
D3: "I love fun"
The term-document matrix for the above would be:
Term    D1  D2  D3
I        1   0   1
love     1   0   1
NLP      1   1   0
is       0   1   0
fun      0   1   1
TF-IDF (Term Frequency -Inverse Document Frequency)
TF = \frac{\text{No. of times a term appears in a document}}{\text{Total no. of terms in the document}}

DF = \frac{\text{No. of documents containing a given term }(d)}{\text{Size of the collection of documents }(D)}

IDF = \log\frac{\text{Total no. of documents}}{\text{No. of documents with a given term in it}}

TF-IDF is the most important concept in text modelling.
It tells us how relevant a word is to a document within the collection of documents.
The only difference between CountVectorizer and TfidfVectorizer is that CountVectorizer returns integer counts while TfidfVectorizer returns float values. Both CountVectorizer and TfidfVectorizer are frequency-based embeddings.
Please refer to the link below to get hands-on with this.
https://github.com/data-stats/NLP/blob/master/BoW.ipynb
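A minimal sketch with scikit-learn, assuming a recent version (where get_feature_names_out is available); the toy documents match the term-document-matrix example above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I love NLP", "NLP is fun", "I love fun"]

# Frequency-based embedding with raw integer counts.
count_vec = CountVectorizer()
print(count_vec.fit_transform(docs).toarray())
print(count_vec.get_feature_names_out())  # note: single-letter tokens like 'I' are dropped by default

# The same documents with TF-IDF weights (floats instead of integers).
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(docs).toarray())
```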
So far, we have focussed on text pre-processing methods. Next, we will discuss model-building methods in NLP.
Model building
Model building is the process of understanding the relationship between variables. To build a model, we apply machine learning algorithms to the data pre-processed with the techniques above. These NLP pre-processing techniques feed into many kinds of model building:
- Text Clustering
- Text Similarity
- Semantic Analysis
- Sentiment Analysis
- Topic Modelling
- Text Classification
- Word2Vec
Here, we discuss Word2Vec in detail. We will come back with the details of the remaining models in upcoming blogs.
Word2Vec
It is the most widely used model for capturing the context of a word in a document, semantic and syntactic similarity, relations with other words and so on. It is used to learn vector representations of words, and it generates prediction-based word embeddings. It is a combination of the two techniques explained below.
CBOW (Continuous Bag of Words)
This model predicts the current word from a window of surrounding context words: given a set of context words, it predicts the missing word that is likely to appear in that context.
Skip-gram model
This model predicts the surrounding window of context words from the current word: given a single word, it predicts the probability of other words that are likely to appear near it in that context. With a window size of 2, each input (centre) word is paired with every neighbour up to two places away on either side, giving pairs of the input word with each of its nearest neighbour words within the window. The Word2Vec framework is imported from the gensim.models library, as sketched below.

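A minimal sketch with gensim, assuming gensim 4.x (where the embedding-dimension parameter is called vector_size; older versions call it size):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["we", "are", "learning", "nlp"],
    ["nlp", "is", "fun"],
    ["we", "love", "learning"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
# window=2 matches the window size in the illustration above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"][:5])           # first 5 dimensions of the vector for 'nlp'
print(model.wv.most_similar("nlp"))  # nearest words in the learned space
```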
So, in this blog we have covered all the basic concepts of NLP. We will be back with an NLP use case applying all of the above techniques.