Howdy! In today’s blog, I’ll be guiding you on how to create a WhatsApp Chat Analyzer. So, lets first talk about the utility of this project and how will we do that. The basic idea behind this project is to generate API which can store all our predictions in MongoDB.

Overview
Let’s go through the quick overview before starting the project.
Path we’ll follow:
- Collection of training dataset
- Pre-processing
- Applying Model
- Setup pipeline and save it
- Export WhatsApp chat
- Building API and Storing prediction in MongoDB

Collection of Dataset
The basic idea of this blog is to give you the proper pathway so I had not taken large data as it will consume more time for training, but you can train it with larger data set for practical use.
Either manually or by Web Scrapping, we’ll collect interview questions related to ML, Bigdata, and ReactJs then save it to csv files as ML_interview.csv, BigData_interview.csv and Reactjs_interview.csv.
I had collected data from the following links:
You can download my dataset from below links:
Importing Libraries
from wordcloud import WordCloud, STOPWORDS import matplotlib.pyplot as plt from sklearn import metrics from sklearn.metrics import confusion_matrix import seaborn as sns import numpy as np # linear algebra import pandas as pd #data processing import os import re import nltk
Appending Data Frames
Now, we’ll append the all the data frames that are holding the individual csv files.
def Prep(label): x=[] for i in range(50): x.append(label) return x df1 = pd.read_csv('ML_interview.csv') df1['Label']=Prep('ML') df2 = pd.read_csv("Reactjs_interview.csv") df2['Label']=Prep('ReactJs') df3 = pd.read_csv("BigData_interview.csv") df3['Label']=Prep('BigData') train=df1.append(df2).append(df3) train.drop('Unnamed: 0',1,inplace=True) train.Label.unique() train.Label.replace({'ML':0,'ReactJs':1,'BigData':2},inplace=True) train=train.fillna(' ')
Now, our data frame is ready for pre-processing.
Pre-processing
Creating Cloud Visual
We’ll start by forming a cloud visual to get familiarized with dataset.
real_words = '' fake_words = '' stopwords = set(STOPWORDS) # iterate through the csv file for val in train[train['Label']==0].questions: # split the value tokens = val.split() # Converts each token into lowercase for i in range(len(tokens)): tokens[i] = tokens[i].lower() real_words += " ".join(tokens)+" " for val in train[train['Label']==1].questions: # split the value tokens = val.split() # Converts each token into lowercase for i in range(len(tokens)): tokens[i] = tokens[i].lower() fake_words += " ".join(tokens)+" " for val in train[train['Label']==2].questions: # split the value tokens = val.split() # Converts each token into lowercase for i in range(len(tokens)): tokens[i] = tokens[i].lower() fake_words += " ".join(tokens)+" "
wordcloud = WordCloud(width = 800, height = 800, background_color ='white', stopwords = stopwords, min_font_size = 10).generate(real_words) # plot the WordCloud image plt.figure(figsize = (8, 8), facecolor = None) plt.imshow(wordcloud) plt.axis("off") plt.tight_layout(pad = 0) plt.show()
We’ll get a beautiful image as output when we execute the above code, which is actually our cloud visual.

Applying NLP Techniques
Next, we’ll be applying NLP techniques.
from nltk.stem import WordNetLemmatizer from nltk.corpus import stopwords nltk.download('stopwords') stop_words = stopwords.words('english') lemmatizer=WordNetLemmatizer() for index,row in train.iterrows(): filter_sentence = '' sentence = row['questions'] sentence = re.sub(r'[^\w\s]','',sentence) #cleaning words = nltk.word_tokenize(sentence) #tokenization words = [w for w in words if not w in stop_words] #stopwords removal for word in words: filter_sentence = filter_sentence + ' ' + str(lemmatizer.lemmatize(word)).lower() train.loc[index,'total'] = filter_sentence train = train[['questions','Label']]
The above lines of code results a clean dataset.
Applying TFid vectorizer
Next, we’ll prepare a function for Tf-idf vectorizer. Tf-idf transforms the texts into the feature vectors and use it as input to the estimator.
from sklearn.feature_extraction.text import TfidfTransformer from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfVectorizer X_train=train['questions'] Y_train=train['Label'] vectorizer = TfidfVectorizer( stop_words='english', decode_error='strict', analyzer='word', ngram_range=(1, 2), max_features=max_features #max_df=0.5 # Verwendet im ML-Kurs unter Preprocessing ) feature_vec = vectorizer.fit_transform(features) return feature_vec.toarray() #Feature extraction using count vectorization and tfidf. count_vectorizer = CountVectorizer() count_vectorizer.fit_transform(X_train) freq_term_matrix = count_vectorizer.transform(X_train) tfidf = TfidfTransformer(norm="l2") tfidf.fit(freq_term_matrix) tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)
Congratulations!! Your text is successfully stored in tf_idf_matrix variable. It is now fit for training the model.
Training the Model
In the next step, we’ll be training our model.
Let’s train our data using Logistic Regression, MultinomialNB, and Random Forest Classifier.
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(tf_idf_matrix, Y_train,test_size=0.3, random_state=0)
Logistic Regression
#Logistic Regression from sklearn.linear_model import LogisticRegression logreg = LogisticRegression(C=1e5) logreg.fit(X_train, y_train) pred = logreg.predict(X_test) print('Accuracy of Lasso classifier on training set: {:.2f}' .format(logreg.score(X_train, y_train))) print('Accuracy of Lasso classifier on test set: {:.2f}' .format(logreg.score(X_test, y_test))) from sklearn.metrics import f1_score,recall_score,precision_score print(f1_score(y_test,pred,average='micro')) print(recall_score(y_test,pred,average='micro')) print(precision_score(y_test,pred,average='micro'))
MultinomialNB
from sklearn.naive_bayes import MultinomialNB NB = MultinomialNB() NB.fit(X_train, y_train) pred = NB.predict(X_test) print('Accuracy of NB classifier on training set: {:.2f}' .format(NB.score(X_train, y_train))) print('Accuracy of NB classifier on test set: {:.2f}' .format(NB.score(X_test, y_test))) from sklearn.metrics import f1_score,recall_score,precision_score print(f1_score(y_test,pred,average='micro')) print(recall_score(y_test,pred,average='micro')) print(precision_score(y_test,pred,average='micro'))
Random Forest Classifier
from sklearn.metrics import accuracy_score from sklearn.ensemble import RandomForestClassifier from sklearn.datasets import make_classification X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False) clf = RandomForestClassifier(max_depth=9, random_state=0) clf.fit(X_train, y_train) pred=clf.predict(X_test) print('Accuracy of RF classifier on training set: {:.2f}' .format(clf.score(X_train, y_train))) print('Accuracy of RF classifier on test set: {:.2f}' .format(clf.score(X_test, y_test))) from sklearn.metrics import f1_score,recall_score,precision_score print(f1_score(y_test,pred,average='micro')) print(recall_score(y_test,pred,average='micro')) print(precision_score(y_test,pred,average='micro'))
Considering all the above three models, we found that Logistic Regression is performing better as compared to the rest two.
So, for further processing, I’ll be choosing Logistic Regression. It may vary if you use larger datasets.
Setting Pipeline and saving it
Now, we’ll be setting the pipeline and saving it, so that we can directly access our model without training it again and again.
from sklearn.pipeline import Pipeline from sklearn.externals import joblib from sklearn import linear_model import warnings from sklearn.feature_extraction.text import TfidfTransformer from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfVectorizer pipeline = Pipeline([ ('vect', CountVectorizer()), ('tfidf', TfidfTransformer(norm='l2')), ('clf', linear_model.LogisticRegression(C=1e5)), ]) X_train = train['questions'] Y_train = train['Label'] pipeline.fit(X_train,Y_train) #saving the pipeline filename = 'pipeline.sav' joblib.dump(pipeline, filename)
I had saved the model as pipeline.sav
Exporting WhatsApp chat
As our project is about exporting WhatsApp data and passing it through our model and storing the predictions in the MongoDB database.
So, let’s not delay it. Let’s begin with further execution.
Firstly, we need to export the chat.

Next, we need to download the exported chat from the email inbox. It should resemble the following:

We’ll save the above file as file.txt
Building API and Storing prediction in MongoDB
Before moving further let’s take a look on MongoDB. For storing your data there, you first have an account in Mongo Cloud. If you are using it for the first time kindly go through the article for setup.
By now, you must have successfully setup your account on Mongo cloud along with creating a cluster.
Create a database on Mongo cloud by giving the name ‘Projects’ and give ‘python_test’ as the name of the collection.
Now, let’s import the packages.
import pandas as pd import numpy as np from pymongo import MongoClient import csv import pandas as pd import urllib import os import pymongo import joblib
Here we have imported
- pymongo for creating a connection between python and mobgoDB, and
- joblib for applying our saved model.
Algorithm:
- Pre-processing the data and storing it in df.
- Applying saved model and storing the predictions in result.
- Converting result in to original labels
- Storeing it in MongoDB
def final(link,feature,filename,path,db_name,coll_name): df = pd.read_csv(link,header=None) df = df.drop(0,1) df[feature]=df[1].str.split('@').str.get(1) df=df.drop(1,1) df=df.dropna() #applying model loaded_model = joblib.load(filename) result = loaded_model.predict(df[feature]) df['results']=result df['results'].replace({0:'ML', 1:'ReactJs', 2:'BigData'}, inplace=True) #storing result in mongoDB client = pymongo.MongoClient(path) db=client[db_name] collection = db[coll_name] df.reset_index(inplace=True) data_dict = df.to_dict("records") # Insert collection collection.insert_many(data_dict)
Congratulations!!! We are done.
Let’s test it with the exported WhatsApp chat which we had stored in the file.txt file.
final('file.txt','questions', './pipeline.sav', 'your/path/', 'Projects', 'python_test')
Note: The path is different for every user so while putting its value follow the above-mentioned link for reference.
Conclusion
In this tutorial, we covered a simple technique for analysing WhatsApp chats.
We prepared a proper pathway, we started with training our model with the dataset we collected manually. In this tutorial, I tried three models and chose the best among them by considering their F1-Score. We extracted the Whatsapp chat data and passed it through our model to find the label using ML. Finally, we saved our predicted labels in MongoDB.
You can access the source code on github.
Adios!!
-Modabbir Tarique
0 Comments