I could have titled this article "How to use NLP for sentiment analysis to expand any business and outpace competitors." I chose to write in non-technical terms and simplify the procedures so that readers unfamiliar with programming can understand the advantages of using reviews to grow a business. I also tried to keep enough technical detail that those working in data, or studying data science at Dataquest.io, for example, can explore the topic further.
Before we dive in, let us find out just how much impact customer review data can have on a business and why you should consider leveraging this type of data.
According to a 2020 Trustpilot report:
- 9 out of 10 customers will take their time to read reviews before making a purchase decision.
- 62% of customers are concerned about the authenticity and transparency of online reviews.
- Almost half (47%) of online shoppers drop reviews monthly.
- More importantly, more than half of reviewers expect businesses to respond to their negative reviews within seven days, yet about 63.3 percent report that they never heard back (ReviewTrackers, 2018).
You may have personally experienced the truth of these statistics. What we are particularly interested in here is how we can understand customers' views of the business through their reviews. This is what we mean by customer sentiment analysis. Our job is to create a system that automatically analyzes customer reviews to ascertain the sentiments behind them. This system will help the business know exactly what to improve and which aspects of the business are currently doing well. We could also find out where our competitors are falling short through their customers' negative reviews or, with the positive reviews, find the areas where they are doing better than us. We can then forward this information to departments like R&D and marketing for prompt implementation.
HOW TO DO IT
Are you ready to crack the hard nut? Don't worry; it isn't as difficult as you may be thinking. I will break this up into steps. But before we go right in, I'd like to explain some terms.
Machine Learning (ML): This is simply the use of mathematical and statistical models and algorithms to discover patterns in a set of data and to infer results from what the computer system has learned. In most machine learning projects, we divide our data into training and test datasets. We feed the training data into the model, which analyzes it for patterns. Once the model is trained, we use the test data to judge how correct its predictions are.
For example, in this article, we can divide our reviews into two sets. We feed in the training set in terms of adjectives. Our system discerns that positive reviews are associated with expressions such as great, delicious, nice, and awesome, and that negative reviews have words like poor, bad, dirty, and expensive. We then test the correctness of the model on a set of new data. In this case, the machine is allowed to see only the reviews and not the ratings. Now, when the system sees new feedback, it says, based on experience: oh! I think I have seen something like this before; it is associated with a positive review. It then labels the feedback positive.
In ML, we try our best to achieve the highest level of accuracy. One technique is to make the training dataset larger than the test set, because the more examples the system learns from, the more likely it is to predict correctly. We also try different algorithms to find the one with the least error.
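To make the training/test idea concrete, here is a minimal sketch in plain Python. The sample reviews, the 80/20 ratio, and the function name are my own assumptions for illustration, not the exact setup used in this project.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled reviews: (text, label)
reviews = [
    ("great pastrami", "positive"),
    ("delicious pickles", "positive"),
    ("nice staff", "positive"),
    ("awesome sandwich", "positive"),
    ("great portions", "positive"),
    ("poor service", "negative"),
    ("bad coffee", "negative"),
    ("dirty tables", "negative"),
    ("expensive menu", "negative"),
    ("bad experience", "negative"),
]

train, test = train_test_split(reviews)
print(len(train), len(test))  # 8 2
```

The training set is deliberately the larger part, which matches the point above about giving the system more examples to learn from.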
STEP 1: GET THE DATA
If you own a business with a website, you can get reviews from the site's back end. You can also find feedback on review platforms like Yelp.com. You can scrape the pages if that is legal or, better, use the site's API if one is provided. If you do not have the technical know-how, consider hiring a freelancer. I have done many data collection jobs for clients around the world. If you need my services, you can hit me up.
STEP 2: CLEAN AND PREPARE THE DATASET
For the sake of this article, I collected the review data of a popular New York restaurant. When you collect data in this raw form, there may be inconsistencies like typos, blanks, etc. We also need to ensure that each column has the correct type; for example, integers should not be stored as strings.
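A cleaning pass along those lines might look like the sketch below. The field names `text` and `rating` and the sample rows are hypothetical; your collected data will have its own column names.

```python
def clean_reviews(raw_rows):
    """Drop rows with blank text or unusable ratings; cast ratings to int."""
    cleaned = []
    for row in raw_rows:
        text = (row.get("text") or "").strip()
        rating = str(row.get("rating") or "").strip()
        if not text or not rating.isdigit():
            continue  # skip blanks and malformed ratings
        cleaned.append({"text": text, "rating": int(rating)})
    return cleaned

# Hypothetical raw rows, as they might come out of a scrape or export
raw_rows = [
    {"text": "  Best pastrami in town! ", "rating": "5"},
    {"text": "", "rating": "4"},               # blank review text
    {"text": "Too expensive", "rating": ""},   # missing rating
    {"text": "Dirty tables", "rating": "2"},
]

cleaned = clean_reviews(raw_rows)
print(cleaned)
```

Only the first and last rows survive, and their ratings are now integers rather than strings.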
The reviews I collected for this project are shown in the bar chart below.
We need to divide the reviews into two groups: positive and negative. Any review with a rating of 4 stars or above is labeled positive. Those below 3 stars are labeled negative. Reviews with a 3-star rating are considered neutral and dropped from our dataset.
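The labeling rule above can be sketched as a small function. I am assuming whole-star ratings and the same hypothetical `text` and `rating` fields; the sample rows are made up.

```python
def label_review(rating):
    """Return 'positive', 'negative', or None for neutral reviews."""
    if rating >= 4:
        return "positive"
    if rating < 3:
        return "negative"
    return None  # 3 stars: neutral, dropped from the dataset

def label_reviews(rows):
    """Turn cleaned rows into (text, label) pairs, skipping neutral ones."""
    labelled = []
    for row in rows:
        label = label_review(row["rating"])
        if label is not None:
            labelled.append((row["text"], label))
    return labelled

rows = [
    {"text": "Best pastrami in town!", "rating": 5},
    {"text": "It was okay", "rating": 3},
    {"text": "Dirty tables", "rating": 2},
    {"text": "Nice pickles", "rating": 4},
]
labelled = label_reviews(rows)
print(labelled)
```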
Reviews are usually written as sentences. Unfortunately, we can't feed raw sentences into our model. We break these sentences into words using the tokenize feature of the Natural Language Toolkit (NLTK) library in Python. After this, we remove stop words. Stop words are words like in, on, of, and at. These words carry little meaning on their own, so they aren't necessary. The next stage is stemming, which reduces each word to its root form so that variants of the same word are counted together. For example, the word "poorly" can be reduced to "poor".
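To show what these three stages do, here is a deliberately simplified sketch that uses only the Python standard library. The tiny stop-word list and the suffix-stripping stemmer are my own stand-ins; in a real project you would more likely reach for NLTK's `word_tokenize`, its `stopwords` corpus, and `PorterStemmer`, which are far more thorough.

```python
import re

# A tiny illustrative stop-word list; NLTK ships a much longer one.
STOP_WORDS = {"in", "on", "of", "at", "the", "a", "an", "is", "was", "were", "and", "to", "it"}
# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ly", "ing", "ed", "s")

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(word) for word in tokenize(sentence) if word not in STOP_WORDS]

tokens = preprocess("The food was poorly cooked and the tables were dirty")
print(tokens)  # ['food', 'poor', 'cook', 'table', 'dirty']
```

Notice how "poorly" becomes "poor", so both forms are counted as the same word later on.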
STEP 3: ANALYSIS AND VISUALIZATION
We now have two bags of words: one from the bad reviews and one from the good reviews. Next, we find the words most used in positive and negative reviews. In a word cloud, the larger a term appears, the more frequently it has been used by reviewers. For example, the word cloud below shows the frequency of terms used in positive reviews.
You may notice that words like Pickle and Pastrami stand out. This suggests that many customers love the restaurant because of these items.
Next, it helps to see the top positive and negative phrases used in the reviews. These are visualized below.
There seem to be some similarities in the positive and negative words/phrases. Do you notice, for example, that most refer to meals? For better tracking and visualization, we take two further steps here:
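Counting term frequencies is the step behind any word cloud. Here is a small sketch using `collections.Counter`; the token lists are invented for illustration.

```python
from collections import Counter

def top_terms(token_lists, n=3):
    """Count every token across all reviews and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Hypothetical preprocessed positive reviews
positive_tokens = [
    ["pastrami", "great"],
    ["pickle", "pastrami", "delicious"],
    ["pastrami", "pickle", "nice"],
]

top = top_terms(positive_tokens, n=2)
print(top)  # [('pastrami', 3), ('pickle', 2)]
```

A word cloud tool can then size each term according to these counts.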
- We write a function to loop over the words and combine similar words into the same category.
- We combine the positive and negative bar plots by stacking them.
The results of these steps are shown below.
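The two steps above might be sketched as follows. The category names and keyword sets are hypothetical; you would build yours from the words that actually show up in your reviews.

```python
from collections import Counter

# Hypothetical mapping from stemmed words to business categories
CATEGORY_KEYWORDS = {
    "food": {"pastrami", "pickle", "sandwich", "meal"},
    "service": {"staff", "waiter", "service", "rude"},
    "price": {"price", "expensive", "cheap"},
}

def categorize(word):
    """Return the category a word belongs to, or None if it has none."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if word in keywords:
            return category
    return None

def category_counts(labelled_tokens):
    """Count category mentions per sentiment label."""
    counts = {category: Counter() for category in CATEGORY_KEYWORDS}
    for tokens, label in labelled_tokens:
        for word in tokens:
            category = categorize(word)
            if category is not None:
                counts[category][label] += 1
    return counts

labelled_tokens = [
    (["pastrami", "great", "staff"], "positive"),
    (["sandwich", "expensive"], "negative"),
    (["rude", "waiter", "price"], "negative"),
    (["pickle", "meal"], "positive"),
]

counts = category_counts(labelled_tokens)
for category, by_label in counts.items():
    # One row per category: positive count, then negative count
    print(category, by_label["positive"], by_label["negative"])
```

Each row holds the two numbers one stacked bar needs. With a plotting library such as matplotlib, you could draw the positive counts first and then the negative counts on top using the `bottom` argument of `bar`.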
Our analysis shows that the business is doing great. While most customers love the food, attention should be given to providing excellent customer service and adjusting the pricing if possible. Also, if this business belongs to your competitor, you now know some of their weaknesses and you could exploit them to gain more market share.
NATURAL LANGUAGE PROCESSING (NLP)
Now we go into machine learning proper!
We will use Naive Bayes, Bernoulli Naive Bayes, Multinomial Naive Bayes, Support Vector Classifier (SVC), Stochastic Gradient Descent (SGD), Random Forest, and Logistic Regression classifiers for training.
The purpose of using several algorithms is to find the one that gives the most accurate predictions.
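To give a feel for what training and testing a classifier involves, here is a from-scratch sketch of just one of the families listed above, a Multinomial Naive Bayes with add-one smoothing, together with an accuracy function. This is an illustration on made-up data, not the code behind the results in this article; in practice you would use ready-made classifiers from a library and run the same accuracy check on each one to compare them.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        # samples is a list of (tokens, label) pairs
        self.label_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        for tokens, label in samples:
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior for this label
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokens:
                # smoothed log likelihood of the word given the label
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(model.predict(tokens) == label for tokens, label in samples)
    return correct / len(samples)

# Hypothetical preprocessed reviews
train = [
    (["great", "pastrami"], "positive"),
    (["delicious", "pickle"], "positive"),
    (["nice", "staff", "great"], "positive"),
    (["poor", "service"], "negative"),
    (["dirty", "table", "bad"], "negative"),
    (["expensive", "bad", "food"], "negative"),
]
test = [
    (["great", "pickle"], "positive"),
    (["bad", "service"], "negative"),
]

model = NaiveBayes().fit(train)
print(accuracy(model, test))
```

The model never sees the test labels while training; they are used only to score its predictions.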
On testing, the Logistic Regression classifier gave the highest accuracy, about 90%.
We check this accuracy using the list of fresh reviews shown below.
NEW CUSTOMER REVIEWS
“Was simply amazing. And that tea leaf salad. Definitely coming back!”
“Used to be good. I have no idea how it got so bad. Everything tasted like leftovers from last week. Shrimp were mush. Platha was a brick. No way it was made that day. Rainbow salad was 90% cabbage and noodles. No visible papaya. Eggplant was edible. The rest was not. Too bad”.
"I just really don’t understand why people wait in line to eat here. The first time I ate here, after waiting outside for an hour to be seated, I was underwhelmed with the food and just generally disappointed after hearing so many rave reviews from friends. But I figured that could just have been an off day or that perhaps I just hadn’t ordered the right dish, especially since people really seem to love this place. So I went back and tried it a few more times. I’ve tried to like this place, I really have. But the food just isn’t that great. I tried the famed tea leaf salad…why do people love this dish so much? Is it just because it is a unique dish that you won’t see at other places? Because aside from that I just don’t see what the big deal is. I think people just feel special saying “tea leaf salad”. Seriously I don’t know what else would make someone say this is such a good dish. I tried many other items there as well and it’s just rather ordinary asian food. The garlic noodles, citrus chicken, and everything else I tried there. left me totally underwhelmed and questioning why I’ve spent hours of my life standing around outside this place waiting to eat this completely overrated food. The portions are small, too. I definitely would not wait in line to eat there again. Actually, I’m just going to say it, I wouldn’t eat there again at all. I read these yelp reviews to try to understand the appeal and I noticed that 9 out of 10 of the rave reviews appear to have been written by young women. Maybe chicks just like being able to tell their friends “ohhmygawd I had the tea leaf salad and it was ah-amazing. Or, maybe they enjoy getting to drag their poor boyfriends here when it’s their turn to pick a restaurant. I don’t know. All I know is it’s definitely not the food. This place sucks.”
“Great place, I’ve been here a couple times but just now posting review because I wanted to try a few more dishes before saying this place is DELICIOUS. Service is great, the dishes come fast and hot. There is always a wait so do call in or get on waitlist. Must try.”
“Okay I don’t get it? I love Burmese food and if try this place several times. And always seems to have them a line to get in. The staff inspired with walkie talkies and earpieces.”
“The wait for this place is not worth it. I waited one hour and a half only to enjoy a mediocre meal. I’m either cursed by the hype or the food quality has went down. I’m not from the area and it’s my first time here so I’m not sure what it could’ve been. I do not consider myself to have a fair share of Burmese cuisine but I do know that they’re usually a unique combination of spices and seasonings. I have my favorite Burmese restaurant on the east side but Burma Superstar did not compare.”
“I was driving through SF and was starving when I saw on my yelp app that there’s a Burma Superstar in SF–awesome. I’ve only ever been to the one in Oakland. I have to say, this just did not live up to the Oakland restaurant’s standards. The waitstaff, host etc were all super nice, but the food was just not that good. The Samusa salad was off, and the sesame chicken was not crisp, gooey and meh. we were there really late, And maybe being the last seating we got the dregs of the night. But I was really disappointed. I think I’ll stick to the one in Oakland.”
Running our program gives the following result:
The sentiment of this statement is: positive
The sentiment of this statement is: negative
The sentiment of this statement is: negative
The sentiment of this statement is: positive
The sentiment of this statement is: positive
The sentiment of this statement is: positive
The sentiment of this statement is: negative
Our Logistic Regression model correctly predicted 5 out of 7 reviews, or about 71%. That's a decent result for such a small sample, though below the roughly 90% measured on the test set. The 5th review is a sarcastic statement, so it is understandable that the model couldn't predict its sentiment correctly.
The sentiment analysis of the restaurant's customer reviews proved to be very effective. It is apparent that the business needs to look mainly at its food pricing. It should also address concerns over wait time, customer service, and the ticketing and payment systems before they get out of hand.
Our model also affords us the convenience of predicting future reviews as positive or negative. This saves us the time of reading each review to find out what the customers feel about the services and food.
We could also improve this model by predicting the service category each negative or positive review falls into.