Howdy! In today’s blog, I’ll be guiding you through building a WhatsApp Chat Analyzer. So, let’s first talk about the utility of this project and how we’ll build it.


The basic idea behind this project is to build an API that can store all our predictions in MongoDB.

Overview

Let’s go through a quick overview before starting the project.

Path we’ll follow:

  • Collection of the training dataset
  • Pre-processing
  • Applying the model
  • Setting up the pipeline and saving it
  • Exporting a WhatsApp chat
  • Building the API and storing predictions in MongoDB

Collection of Dataset

The basic idea of this blog is to give you the proper pathway, so I haven’t used a large dataset (it would take more time to train), but you can train with a larger dataset for practical use.

Either manually or by web scraping, we’ll collect interview questions related to ML, Big Data, and ReactJs, then save them to CSV files as ML_interview.csv, BigData_interview.csv, and Reactjs_interview.csv.

I collected the data from the following links:

You can download my dataset from the links below:
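In case you want to build the CSVs yourself, here is a minimal web-scraping sketch using requests and BeautifulSoup. The URL and the h3 selector are placeholders; replace them with whichever page you actually scrape:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical source page; replace the URL and the selector with a real one
url = 'https://example.com/ml-interview-questions'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

questions = [h.get_text(strip=True) for h in soup.select('h3')]
pd.DataFrame({'questions': questions}).to_csv('ML_interview.csv')

Note that to_csv writes the index as an extra column by default, which is exactly the 'Unnamed: 0' column we drop after loading the files below.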

Importing Libraries

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np   # linear algebra
import pandas as pd  # data processing
import os
import re            # regex-based text cleaning
import nltk          # NLP toolkit

Appending Data Frames

Now, we’ll append all the data frames that hold the individual CSV files.

def Prep(label):
    # Each CSV holds 50 questions, so repeat the label 50 times
    return [label] * 50

df1 = pd.read_csv('ML_interview.csv')
df1['Label'] = Prep('ML')
df2 = pd.read_csv("Reactjs_interview.csv")
df2['Label'] = Prep('ReactJs')
df3 = pd.read_csv("BigData_interview.csv")
df3['Label'] = Prep('BigData')

train = pd.concat([df1, df2, df3])  # DataFrame.append is removed in pandas 2.x
train.drop('Unnamed: 0', axis=1, inplace=True)  # drop the stray index column
print(train.Label.unique())  # check the labels
train.Label.replace({'ML': 0, 'ReactJs': 1, 'BigData': 2}, inplace=True)
train = train.fillna(' ')

Now, our data frame is ready for pre-processing.
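Before pre-processing, it’s worth a quick sanity check that the three files were merged and labelled as expected (a minimal sketch; the exact counts depend on your CSVs):

# Quick sanity check on the merged frame
print(train.shape)
print(train.Label.value_counts())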

Pre-processing

Creating Cloud Visual

We’ll start by forming a cloud visual to get familiarized with the dataset.

stopwords = set(STOPWORDS)

def label_words(label):
    # Join all questions for a given label into one lowercase string
    words = ''
    for val in train[train['Label'] == label].questions:
        tokens = [token.lower() for token in val.split()]
        words += " ".join(tokens) + " "
    return words

ml_words = label_words(0)
reactjs_words = label_words(1)
bigdata_words = label_words(2)

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(ml_words)

# plot the WordCloud image for the ML questions
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

When we execute the above code, we get a beautiful image as output, which is our cloud visual for the ML questions.

Output cloud visual

Applying NLP Techniques

Next, we’ll be applying NLP techniques.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')    # needed by nltk.word_tokenize
nltk.download('wordnet')  # needed by WordNetLemmatizer
stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()
for index, row in train.iterrows():
    filter_sentence = ''
    sentence = row['questions']
    sentence = re.sub(r'[^\w\s]', '', sentence)        # cleaning
    words = nltk.word_tokenize(sentence)               # tokenization
    words = [w for w in words if w not in stop_words]  # stopword removal
    for word in words:
        filter_sentence = filter_sentence + ' ' + str(lemmatizer.lemmatize(word)).lower()
    train.loc[index, 'total'] = filter_sentence

# keep the cleaned text (not the raw questions) for training
train = train[['total', 'Label']].rename(columns={'total': 'questions'})

The above lines of code result in a clean dataset.
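To see what the cleaning does, here is a quick check on a made-up sample question (hypothetical input; the exact output depends on your NLTK data):

# Clean a hypothetical sample question the same way as above
sample = "What is overfitting, and how do you prevent it?"
sample = re.sub(r'[^\w\s]', '', sample)
tokens = [w for w in nltk.word_tokenize(sample) if w not in stop_words]
print(' '.join(lemmatizer.lemmatize(w).lower() for w in tokens))
# prints roughly: 'what overfitting prevent'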

Applying the TF-IDF Vectorizer

Next, we’ll prepare a function for the TF-IDF vectorizer. TF-IDF transforms the texts into feature vectors that can be used as input to the estimator.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = train['questions']
Y_train = train['Label']

def tfidf_features(features, max_features=None):
    # Helper: turn raw texts into TF-IDF feature vectors in one step
    vectorizer = TfidfVectorizer(stop_words='english',
                                 decode_error='strict',
                                 analyzer='word',
                                 ngram_range=(1, 2),
                                 max_features=max_features
                                 # max_df=0.5  # used in the ML course under preprocessing
                                 )
    feature_vec = vectorizer.fit_transform(features)
    return feature_vec.toarray()

# Feature extraction using count vectorization and TF-IDF.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)
freq_term_matrix = count_vectorizer.transform(X_train)
tfidf = TfidfTransformer(norm="l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)

Congratulations!! Your text is now stored in the tf_idf_matrix variable and is ready for training the model.
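A quick look at what we produced (a sketch; the dimensions depend on your dataset):

# tf_idf_matrix is a sparse matrix: rows = questions, columns = vocabulary terms
print(tf_idf_matrix.shape)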

Training the Model

In the next step, we’ll be training our model.

Let’s train our data using Logistic Regression, MultinomialNB, and Random Forest Classifier.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tf_idf_matrix, Y_train, test_size=0.3, random_state=0)

Logistic Regression

# Logistic Regression

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

from sklearn.metrics import f1_score,recall_score,precision_score
print(f1_score(y_test,pred,average='micro'))
print(recall_score(y_test,pred,average='micro'))
print(precision_score(y_test,pred,average='micro'))

MultinomialNB

from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
NB.fit(X_train, y_train)
pred = NB.predict(X_test)
print('Accuracy of NB classifier on training set: {:.2f}'
     .format(NB.score(X_train, y_train)))
print('Accuracy of NB classifier on test set: {:.2f}'
     .format(NB.score(X_test, y_test)))

from sklearn.metrics import f1_score,recall_score,precision_score
print(f1_score(y_test,pred,average='micro'))
print(recall_score(y_test,pred,average='micro'))
print(precision_score(y_test,pred,average='micro'))

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=9, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

from sklearn.metrics import f1_score,recall_score,precision_score
print(f1_score(y_test,pred,average='micro'))
print(recall_score(y_test,pred,average='micro'))
print(precision_score(y_test,pred,average='micro'))

Comparing the three models above, we found that Logistic Regression performs better than the other two.

So, for further processing, I’ll choose Logistic Regression. This may vary if you use a larger dataset.
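If you want the comparison side by side instead of reading the printouts above, a small loop over the three fitted models (a minimal sketch) does the trick:

from sklearn.metrics import f1_score

# Compare the three fitted models on the same held-out test set
for name, model in [('LogisticRegression', logreg),
                    ('MultinomialNB', NB),
                    ('RandomForest', clf)]:
    print(name, f1_score(y_test, model.predict(X_test), average='micro'))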

Setting Up the Pipeline and Saving It

Now, we’ll set up the pipeline and save it, so that we can use our model directly without training it again and again.

from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import joblib  # sklearn.externals.joblib is removed in recent scikit-learn versions

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('clf', linear_model.LogisticRegression(C=1e5)),
])

X_train = train['questions']
Y_train = train['Label']
pipeline.fit(X_train, Y_train)

# saving the pipeline
filename = 'pipeline.sav'
joblib.dump(pipeline, filename)

I saved the model as pipeline.sav.
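Before wiring the pipeline into the API, it’s worth loading it back and trying it on a made-up question (a minimal sanity check; the sample text is hypothetical):

# Load the saved pipeline and classify a hypothetical question
loaded = joblib.load('pipeline.sav')
print(loaded.predict(['What is gradient descent?']))  # 0 = ML, 1 = ReactJs, 2 = BigData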

Exporting WhatsApp chat

Our project revolves around exporting WhatsApp data, passing it through our model, and storing the predictions in the MongoDB database.

So, let’s not delay it. Let’s begin with the remaining steps.

Firstly, we need to export the chat (in WhatsApp, open the chat, tap the menu, and choose More > Export chat).

Next, we need to download the exported chat from the email inbox. It should resemble the following (the lines below are hypothetical; the exact format varies by phone and locale):
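12/07/20, 10:31 pm - Alice: @What is normalization in ML?
12/07/20, 10:32 pm - Bob: @How does the virtual DOM work in ReactJs?

Judging by the parsing in final() below, questions are marked with an '@' in the group chat, and only the text after the '@' is kept for prediction.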

We’ll save the above file as file.txt.

Building API and Storing prediction in MongoDB

Before moving further, let’s take a look at MongoDB. To store your data there, you first need an account on Mongo Cloud (MongoDB Atlas). If you are using it for the first time, kindly go through the article for the setup.

By now, you must have successfully set up your account on Mongo Cloud and created a cluster.

Create a database on Mongo Cloud named ‘Projects’, and give ‘python_test’ as the name of the collection.
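To confirm your connection string works before running the full function, here is a minimal check (the URI is a placeholder; copy the real one from your cluster’s Connect dialog):

import pymongo

# Placeholder URI; use the connection string from your own cluster
client = pymongo.MongoClient('mongodb+srv://<user>:<password>@<cluster-url>/')
print(client.list_database_names())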

Now, let’s import the packages.

import pandas as pd
import numpy as np
import pymongo
import joblib

Here we have imported

  • pymongo for creating a connection between Python and MongoDB, and
  • joblib for loading our saved model.

Algorithm:

  • Pre-process the data and store it in df.
  • Apply the saved model and store the predictions in result.
  • Convert result back into the original labels.
  • Store it in MongoDB.

def final(link, feature, filename, path, db_name, coll_name):
    df = pd.read_csv(link, header=None)
    df = df.drop(0, axis=1)                        # drop the date column
    df[feature] = df[1].str.split('@').str.get(1)  # keep the text after the '@'
    df = df.drop(1, axis=1)
    df = df.dropna()

    # applying the model
    loaded_model = joblib.load(filename)
    result = loaded_model.predict(df[feature])
    df['results'] = result
    df['results'].replace({0: 'ML', 1: 'ReactJs', 2: 'BigData'}, inplace=True)

    # storing the result in MongoDB
    client = pymongo.MongoClient(path)
    db = client[db_name]
    collection = db[coll_name]
    df.reset_index(inplace=True)
    data_dict = df.to_dict("records")

    # insert into the collection
    collection.insert_many(data_dict)

Congratulations!!! We are done.

Let’s test it with the exported WhatsApp chat which we had stored in the file.txt file.

final('file.txt','questions', './pipeline.sav', 'your/path/', 'Projects', 'python_test')

Note: The connection string (path) is different for every user, so follow the above-mentioned article to find your own value.
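To verify that the predictions actually landed in MongoDB, you can read a document back (a sketch reusing the same placeholder connection string):

# Count and peek at the stored predictions
client = pymongo.MongoClient('your/path/')  # same connection string as in final()
collection = client['Projects']['python_test']
print(collection.count_documents({}))
print(collection.find_one())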

Conclusion

In this tutorial, we covered a simple technique for analysing WhatsApp chats.

We prepared a proper pathway: we started by training our model on the dataset we collected manually. I tried three models and chose the best among them by considering their F1-scores. We extracted the WhatsApp chat data and passed it through our model to predict the labels using ML. Finally, we saved our predicted labels in MongoDB.

You can access the source code on GitHub.

Adios!!

-Modabbir Tarique

