Data Science · Data Story

Text Analytics: A Tale of Lengthy WhatsApp Messages!

I don’t know whether it’s just me or whether everyone else feels the same, but I really do not like lengthy, “forward-this-to” WhatsApp messages. In fact, I don’t even read them! I had this notion that WhatsApp was meant for a quick exchange of “hello, how was your day?” and not “The story of the Bamboo” or “… forward this to twelve people you care about, and twelve blessings will come to you this week”. I move from 😕 to 😡 in milliseconds. Like, seriously? I remember that when I first got the application, I presumed the founders coined the name from the greeting “What’s up”, and that the app would therefore be used mostly for exchanging pleasantries. After using it for some time, my thinking changed.

Just recently, I got a series of lengthy stories on an alumnae WhatsApp group I had joined. I ignored them for a while, but they just kept coming (plus, other group members were asking for more): Part 10, Part 12, Part 14. So I finally decided to read just one part. I was scrolling really fast, skipping several lines, when suddenly it struck me like a revelation: THIS IS ALL DATA!!! I knew just what to do with it: Text Analytics.

I exported the chat data from April 1 to May 5 to my e-mail, loaded it into R, the open-source statistical software, and began working on it. It required a lot of data tidying, which was tedious because of all the back and forth, but it was the most important step. The tidying included removing non-ASCII characters, punctuation, numbers, URLs and stop words (commonly used words like “the”, “is”, “so”, etc.) and converting all words to lower case (R is case-sensitive, so “God” and “god” would otherwise be counted as different words).
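For readers who want to see what that tidying looks like in practice, here is a minimal sketch. It is written in Python rather than R, and it is not the code behind this post. The function name `clean_message` and the tiny stop-word list are made up for illustration; a real stop-word list is much longer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not the list used in the post).
STOP_WORDS = {"the", "is", "so", "a", "and", "to", "of", "how", "was", "your", "at"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_message(text):
    """Return the cleaned list of words for one chat message."""
    # Remove URLs first, while the punctuation that identifies them is intact.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Drop non-ASCII characters such as emoji and curly quotes.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Lower-case so that "God" and "god" count as the same word.
    text = text.lower()
    # Remove numbers, then punctuation.
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    # Split on whitespace and drop stop words.
    return [word for word in text.split() if word not in STOP_WORDS]

print(clean_message("Hello!!! How was your day? 😕 See https://example.com at 10"))
```

The order matters a little: URLs are removed before punctuation, because stripping the punctuation first would leave fragments like "https" and "com" behind as ordinary words.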

To keep this post as concise as possible, I will not walk through how the results below were obtained. If you have any questions about the results, feel free to comment below. That being said, the bar plot below shows the words that occurred most often, at least 40 times each. The names “Rosy” and “Edu” occurred the most (134 times); they are, in fact, one person’s first and second names, so I guess she’s very active and popular. Next is “God” (113 times). Women fuss about men that love them, so it’s quite obvious why we mention God a lot.
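The counting behind a plot like this is simple. As a rough Python sketch (again, not the R code used for the post), assuming the chat has already been reduced to a flat list of cleaned words; the function name `frequent_words` and the toy word list are hypothetical:

```python
from collections import Counter

def frequent_words(tokens, min_count):
    """Return (word, count) pairs with count >= min_count, most frequent first."""
    counts = Counter(tokens)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Toy data standing in for the cleaned chat; the real threshold in the post was 40.
tokens = ["rosy"] * 5 + ["god"] * 3 + ["day"]
for word, n in frequent_words(tokens, min_count=3):
    # A text-only stand-in for the bar plot: one '#' per occurrence.
    print(f"{word:<6} {'#' * n} ({n})")
```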


The word cloud below shows again that the top three words are “Rosy”, “Edu” and “God”. It also shows other frequently used words: the more prominent a word is, the more frequently it appears in the data.


Note: I didn’t produce this word cloud in R. I exported the results to Wordclouds and built the viz there. I tend to lean towards Wordclouds most of the time because it has prettier options for the viz, like the overall shape of the word cloud, which in this case is the “comment” shape.

To explore the opinions of members in the group chat, I used sentiment analysis. The results showed that the group chat scored highly for expressions of positivity, joy, trust and anticipation. Disgust, sadness and anger rank low for this group, so we’re pretty much positive, joyous and trustworthy ladies. See the plot below:
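One common way to get scores like these is a lexicon-based approach: every word is looked up in a dictionary that tags it with emotion categories, and the tags are tallied. The sketch below shows the idea in Python. It assumes that approach; the post does not state which lexicon or package was used, and the six-word `TOY_LEXICON` and the function name `emotion_counts` are invented for illustration only.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> set of emotion/sentiment categories.
# A real word-emotion lexicon has thousands of entries.
TOY_LEXICON = {
    "love": {"positive", "joy"},
    "blessing": {"positive", "joy", "anticipation"},
    "faith": {"positive", "trust"},
    "hope": {"positive", "anticipation"},
    "hate": {"negative", "anger", "disgust"},
    "cry": {"negative", "sadness"},
}

def emotion_counts(tokens, lexicon=TOY_LEXICON):
    """Tally how many times each category is triggered by the given words."""
    counts = Counter()
    for word in tokens:
        # Words missing from the lexicon contribute nothing.
        for category in lexicon.get(word, ()):
            counts[category] += 1
    return counts

scores = emotion_counts(["love", "faith", "cry", "hello", "love"])
for category, n in scores.most_common():
    print(category, n)
```

Because the method only counts words, it misses negation and sarcasm ("not love" still counts as joy), so the scores are best read as a rough mood of the chat and not a precise measurement.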


There are other fun things to do with text data: clustering, topic modelling (one of my personal favourites), etc. Consider this hypothetical scenario: you want to start an online bookshop. Let’s say you’re mostly interested in kids’ books, novels, Christian literature, motivational books and magazines, and every other book will be categorized as “other books”. You get in touch with a very liberal donor and he donates 5000 e-books to you. Now your job is to categorize these 5000 books based on their content. For a small collection, you could do this by simply going through each document. But your time is limited and you can’t possibly go through every single one of them, so you call me. Geek to the rescue: I solve your problem using topic modelling, smiling at my credit alert from you.

Topic modelling deals with the problem of automatically sorting a set of documents into themes/categories. Here’s a very good explanation of the concept.
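To make the idea a bit more concrete, here is a small Python sketch of one popular topic model, Latent Dirichlet Allocation (LDA), fitted with collapsed Gibbs sampling. This is an illustration under assumptions, not a method the post commits to: the function names `lda_gibbs` and `dominant_topic`, the parameter values and the four toy "books" are all made up, and a real project would use a tested library instead. Note that the topics come out unlabelled (topic 0, topic 1, ...), so someone still has to look at the top words and decide which one is "Christian literature" and which one is "Kids Books".

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. docs is a list of word lists.

    Returns (doc_topic, topic_word, vocab), where doc_topic[d][k] counts the
    words of document d assigned to topic k and topic_word[k][v] counts how
    often vocabulary word v is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Take the current assignment out of the counts ...
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # ... and resample it given all the other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

def dominant_topic(row):
    """Index of the topic with the highest count in one doc_topic row."""
    return max(range(len(row)), key=row.__getitem__)

# Four toy "books", already cleaned and split into words.
docs = [
    "god faith prayer church god blessing".split(),
    "prayer faith god church worship".split(),
    "dragon princess castle bedtime story".split(),
    "bedtime story dragon castle rabbit".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for d, row in enumerate(doc_topic):
    print("book", d, "-> topic", dominant_topic(row))
```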

Finally, I hope my article today was comprehensible. Above all, I hope I’ve succeeded in conveying a sense of the possibilities in text analytics, one corner of the vast and rapidly expanding discipline of Data Science.



If you spot a typo or any error in the post, please let me know in the comment section – so I can fix it.
