Introduction to NLP

Before diving deep into NLP, make sure you are clear on the distinction between data and information.

NLP is one way of understanding data and processing it into useful information. In today's world, an enormous amount of data is generated every second from various sources, and it does not come in one format, style, or language. It has thousands of variations: data may be text, voice, video, images, or something else.
A few years back, only human beings were able to understand all these kinds of data. But we are now in an era where machines can also understand human language and data in many formats.

What makes machines so advanced?
NLP is the area of machine learning that makes a computer capable of understanding, analysing, processing, and even generating human language. NLP is used in many areas of real life. Some of the most common applications you come across every day are:

  • The most popular is our best friend – Google.

Yes! Have you ever wondered how Google identifies anything, searches in any language, or even translates your text? It is all done with the help of NLP.

  • Google Speech Recognition – this is also NLP.
  • Amazon Alexa – how does it identify your voice and react accordingly? Here, too, NLP plays a significant role.
  • Google auto-predict and auto-correct are also popular applications of NLP.

Let’s dive deep into NLP in this article.
NLP is a branch of data science for processing textual and speech data. Python has a module named nltk (Natural Language Toolkit) for processing natural language.
There are various interesting things in NLP. We will discuss them one by one.

Regular expressions

This is one of the most basic and essential topics in NLP. Regular expressions are also known as regex.
A regular expression defines a pattern using alphanumeric and special characters; this pattern can be matched against any input string to identify whether the string contains the specified pattern or not.
For instance, a regex for an IPv4 address is

\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
  (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
  (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
  (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Similarly, a regex for a hyperlink is
(https?://\S+)

Let's see the application of regex in Python.
Python has its own built-in package 're' for regular expressions. Please refer to the link below to get a basic understanding of regex.
https://github.com/data-stats/NLP/blob/master/Regex.ipynb
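
As a quick illustration, here is a minimal sketch of matching both patterns with 're'. The sample string is made up, and the IP pattern is compacted into a non-capturing form so that findall returns whole matches rather than group tuples:

import re

# A made-up sample string to test the two patterns from above.
text = "Server 192.168.1.10 responded; docs at https://docs.python.org/3/library/re.html"

ip_pattern = r"\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}\b"
url_pattern = r"https?://\S+"

print(re.findall(ip_pattern, text))   # ['192.168.1.10']
print(re.findall(url_pattern, text))  # ['https://docs.python.org/3/library/re.html']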

Web Scraping

Using regex along with the famous NLTK library, we can scrape and process the text of any web page.
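
As a rough sketch of the idea (regex-based tag stripping is fragile compared to a real HTML parser, and example.com is only a placeholder URL), fetching a page and extracting its visible text can look like this:

import re
import urllib.request

# Download the raw HTML of a page.
html = urllib.request.urlopen("https://example.com").read().decode("utf-8")

# Drop script/style blocks, then remaining tags, then collapse whitespace.
text = re.sub(r"<script.*?</script>|<style.*?</style>", "", html, flags=re.DOTALL)
text = re.sub(r"<[^>]+>", " ", text)
text = re.sub(r"\s+", " ", text).strip()

print(text[:200])  # first 200 characters of the extracted text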

Text Data Pre-processing

Sentence splitter

As the name suggests, it splits a paragraph into sentences, using end-of-sentence punctuation such as '.' as the separator. It returns a list of strings, where each element is one sentence from the input paragraph. Below is the link for reference.
https://github.com/data-stats/NLP/blob/master/sentence_tokenize.ipynb
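
A minimal example using nltk's sent_tokenize (the sample paragraph is made up):

import nltk
nltk.download("punkt")  # one-time download of the sentence model
from nltk.tokenize import sent_tokenize

paragraph = "NLP is fun. It powers search engines! Does it power Alexa too?"
print(sent_tokenize(paragraph))
# ['NLP is fun.', 'It powers search engines!', 'Does it power Alexa too?']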

Word Tokenization

In NLP, a token is the smallest unit that a machine can understand. A word tokenizer splits sentences into words. There are many ways to tokenize words, with or without nltk. A few methods are given below:

  • The split() method
  • word_tokenize
  • TreebankWordTokenizer

Below is the reference link with examples of all the above methods.
https://github.com/data-stats/NLP/blob/master/Word_tokenize.ipynb
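
A short sketch of all three approaches (the sample sentence is my own):

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

sentence = "We can't stop learning NLP."

print(sentence.split())                            # plain split on whitespace: keeps "can't" whole
print(word_tokenize(sentence))                     # ['We', 'ca', "n't", 'stop', 'learning', 'NLP', '.']
print(TreebankWordTokenizer().tokenize(sentence))  # similar Treebank-style tokens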

Remove Noise

In our daily conversations, we speak and write a lot of filler that carries no useful information: in any sentence, only a few words contain the core message, and the rest are helper words that complete the sentence as per English grammar. While doing NLP, we can drop such words with no meaningful information; they are called noise. Regular expressions are very helpful here. Please refer to the code below to see how regex helps us remove noise in NLP.
https://github.com/data-stats/NLP/blob/master/NoiseRemoval.ipynb
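
As a hedged sketch, here is one way regex can strip common noise; the patterns chosen (hashtags, mentions, URLs) are illustrative choices and should be adapted to your data:

import re

def remove_noise(text, patterns=(r"#\S+", r"@\S+", r"https?://\S+")):
    # Remove each noise pattern, then collapse leftover whitespace.
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("Loving #NLP thanks @data_stats https://datastats.com"))
# 'Loving thanks'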

Stemming

Stemming is the process of transforming a word into its root by chopping off a common part at the end of the word. For example, words like powerful, powers, and powered all have the base word power, so a stemming algorithm reduces each of them to that base.
Overstemming – when two words are stemmed to the same root but should not have been.
Understemming – when two words should be stemmed to the same root but are not.
Famous algorithms used for stemming are:

  • Porter
  • Lancaster
  • Snowball

The link below shows the usage of each stemming algorithm.
https://github.com/data-stats/NLP/blob/master/Stemmization.ipynb
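
A quick sketch comparing the three stemmers on the example words above:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["powerful", "powers", "powered", "power"]

for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])
# Porter and Snowball map these to 'power'; Lancaster tends to stem more aggressively.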

Lemmatization

Lemmatization is also the process of bringing a word to its root form, but it works on the different inflected forms of a word: for example, the lemma of the word 'ate' is 'eat'. It uses context and POS (part of speech) to do this. In nltk's WordNetLemmatizer, the POS parameter takes one of the values 'n' (noun), 'v' (verb), 'a' (adjective), or 'r' (adverb).

Please refer to the link below for a lemmatization example. WordNetLemmatizer is one such lemmatizer from nltk.
https://github.com/data-stats/NLP/blob/master/Lemmatization.ipynb
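
A minimal example with WordNetLemmatizer:

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("ate", pos="v"))     # 'eat'
print(lemmatizer.lemmatize("powers", pos="n"))  # 'power'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'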

WordNet

In simple words, WordNet is an English lexical database provided with nltk where one can find the meanings, synonyms, and antonyms of words. It mainly contains nouns, adjectives, adverbs, and verbs.
WordNet groups synonyms into synsets, where every word in a synset shares the same meaning. Please refer to the link below for an example.
https://github.com/data-stats/NLP/blob/master/WordNet.ipynb
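
A small example of looking up synsets (the word 'good' is just an illustrative query):

import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet

# Each synset has a name, a definition, and the lemmas (synonyms) it contains.
for syn in wordnet.synsets("good")[:3]:
    print(syn.name(), "-", syn.definition())

print(wordnet.synsets("good")[0].lemma_names())  # synonyms in the first synset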

Stop words removal

Stop words are commonly used words that are present in almost all text, for example: is, am, are, do, did, was, cannot, etc. These words are required by English grammar but do not contain any information, so we should remove them as part of text pre-processing.
NLTK in Python has lists of stopwords in 22 different languages. During text pre-processing, we can use such a list to remove uninformative words from the text.
There is one more concept, rare word removal, where we remove rare words on the basis of their frequency in the input text. Also, while doing NLP in projects, we often need a spell check; for this, we can use the 'edit_distance' method from nltk.
Please refer to the link below covering basic examples of stop word removal, rare word removal, and spell check.
https://github.com/data-stats/NLP/blob/master/StopWordsRemoval.ipynb
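
A compact sketch covering all three ideas; the sample text and the "frequency > 1" rule for rare words are illustrative choices:

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk import FreqDist, edit_distance

tokens = "this is a small text and this text is about nlp".split()

# 1. Stop word removal using NLTK's English list.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['small', 'text', 'text', 'nlp']

# 2. Rare word removal: drop words that occur only once.
freq = FreqDist(filtered)
print([t for t in filtered if freq[t] > 1])  # ['text', 'text']

# 3. Spell-check building block: edit distance between two spellings.
print(edit_distance("langauge", "language"))  # 2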

POS tagging

POS tagging involves labelling each word in a given text as a noun, verb, adverb, adjective, and so on. The Penn Treebank tagset has a total of 36 POS tags.
Please refer to the link below for the full tagset.
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
NLTK provides the function pos_tag() to get the POS tag for each word in the text. POS tagging is important for syntactic and semantic analysis.
Please refer to the link below for a basic example.
https://github.com/data-stats/NLP/blob/master/POSTagging.ipynb
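
A minimal example (the tags come from nltk's pretrained tagger, so treat the output in the comment as approximate):

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("We are learning NLP from datastats.com")))
# e.g. [('We', 'PRP'), ('are', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ...]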

Chunking

Chunking is an extension of POS tagging; it groups words into chunks, adding more structure to the sentence. It is also known as shallow parsing. A minimal example follows.
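
This sketch uses nltk's RegexpParser with a deliberately simple noun-phrase grammar (an optional determiner, any adjectives, then a noun); real grammars are richer:

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag, word_tokenize, RegexpParser

# NP: an optional determiner, any number of adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")

tagged = pos_tag(word_tokenize("The little dog barked at the black cat"))
print(chunker.parse(tagged))  # a tree with NP chunks such as (NP the/DT black/JJ cat/NN)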

N-grams

An n-gram is a contiguous sequence of n items in a sentence, where n can be any positive integer. For example, consider the text below.
"We are learning NLP from datastats.com."
The list of bi-grams for the above sentence is
     ["We are", "are learning", "learning NLP", "NLP from", "from datastats.com"].
The n-gram technique is mostly used in text classification use cases.
Please refer to the link below to quickly see how to apply this.
https://github.com/data-stats/NLP/blob/master/NGram.ipynb
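
A quick sketch using nltk's ngrams helper, reproducing the bi-gram list above:

from nltk.util import ngrams

sentence = "We are learning NLP from datastats.com"
tokens = sentence.split()

bigrams = [" ".join(gram) for gram in ngrams(tokens, 2)]
print(bigrams)
# ['We are', 'are learning', 'learning NLP', 'NLP from', 'from datastats.com']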

Bag of words (BoW)

For any machine learning algorithm, we need to convert textual information into some numeric format so that the algorithm can be applied to it. BoW is a method of counting the occurrence of words in a document.
A few important concepts related to BoW are given below.

Word Embeddings

Word embeddings are texts converted into numeric form: each word is mapped to a vector using a dictionary. For example, consider the sentence below.

“We are learning NLP from datastats.com”

The dictionary for the above sentence would be

['We', 'are', 'learning', 'NLP', 'from', 'datastats.com']


The easiest way to represent a word as a vector is one-hot encoding: the word's own position in the dictionary is set to 1 and every other position to 0. For example, the vector representation of 'learning' is [0, 0, 1, 0, 0, 0].
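
A tiny sketch of building such one-hot vectors by hand:

sentence = "We are learning NLP from datastats.com"
dictionary = sentence.split()  # ['We', 'are', 'learning', 'NLP', 'from', 'datastats.com']

def one_hot(word, dictionary):
    # 1 at the word's position in the dictionary, 0 everywhere else.
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("learning", dictionary))  # [0, 0, 1, 0, 0, 0]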
There are two types of word embeddings:

  • Frequency-based embeddings
  • Prediction-based embeddings
Term Document Matrix

It is a 2-D matrix where the columns are the document names, the rows are the words present in those documents, and each cell holds the frequency of that word in that document. Consider, for instance, three small documents: D1 = "I love NLP", D2 = "NLP is fun", D3 = "I love learning".
The term-document matrix for these would be

  Term       D1  D2  D3
  I           1   0   1
  love        1   0   1
  NLP         1   1   0
  is          0   1   0
  fun         0   1   0
  learning    0   0   1

TF-IDF (Term Frequency – Inverse Document Frequency)

TF = \frac{\text{No. of times a term appears in a document}}{\text{Total no. of terms in the document}}

DF = \frac{d\ (\text{no. of documents containing a given term})}{D\ (\text{the size of the collection of documents})}

IDF = \log\frac{\text{Total no. of documents}}{\text{No. of documents with a given term in it}}

TF-IDF is one of the most important concepts in text modelling. It tells us how relevant a word is to a document within a collection of documents.
At the output level, CountVectorizer returns raw term counts (integers), while TfidfVectorizer returns TF-IDF weights (floats). Both, available in scikit-learn, are frequency-based embedding techniques.
Please refer to the link below to get hands-on with this.
https://github.com/data-stats/NLP/blob/master/BoW.ipynb
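
A minimal sketch with scikit-learn's two vectorizers, reusing the three toy documents from the term-document-matrix example:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I love NLP", "NLP is fun", "I love learning"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.toarray())  # integer term counts, one row per document
print(tfidf.toarray())   # float TF-IDF weights, one row per document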
So far, we have focused on text pre-processing methods. Next, we will discuss model-building methods in NLP.

Model building

Model building is the process of understanding the relationship between variables. To build a model, we apply machine learning algorithms to the pre-processed data prepared above. The NLP pre-processing techniques feed into various kinds of model building:

  1. Text Clustering
  2. Text Similarity
  3. Semantic Analysis
  4. Sentiment Analysis
  5. Topic Modelling
  6. Text Classification
  7. Word2Vec

Here, we discuss Word2Vec in detail. We will come back with the details of the remaining models in upcoming blogs.

Word2Vec

Word2Vec is the most widely used model for capturing the context of a word in a document, including semantic and syntactic similarity and relations with other words. It learns vector representations of words and is used to generate prediction-based word embeddings. It is a combination of the two techniques explained below.

CBOW(Continuous Bag OF Words)

This model predicts the current word from a window of surrounding context words: given a set of context words, it predicts the missing word that is likely to appear in that context.

Skip-gram model

This model predicts the surrounding window of context words from the current word: given a single word, it predicts the probability of other words that are likely to appear near it. As an illustration, consider a skip-gram model with a window size of 2.

With a window size of 2, every word in the sentence is treated in turn as the input word, and each of its neighbours up to two positions away falls inside the window; the training data is then the set of (input word, context word) pairs. The Word2Vec framework can be imported from the gensim.models library. So, in this blog we have covered all the basic concepts of NLP. We will be back with an NLP use case applying all the above techniques.
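
To close, here is a minimal Word2Vec sketch with gensim (parameter names assume gensim 4.x; the toy corpus is far too small for meaningful vectors and is only there to show the API):

from gensim.models import Word2Vec

# Tiny toy corpus of pre-tokenized sentences; real models need far more text.
sentences = [
    ["we", "are", "learning", "nlp"],
    ["nlp", "is", "fun"],
    ["we", "love", "learning"],
]

# sg=1 trains skip-gram (sg=0 would train CBOW); window=2 matches the example above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"][:5])           # first few dimensions of the 'nlp' vector
print(model.wv.most_similar("nlp"))  # nearest words by cosine similarity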
