From Brontë to Tolstoy: Using Sentiment Analysis in Free Text
by Jamie McLaughlin,
Data Consultant, Eyecademy
Part II of a two-part series:
Part I – Finding Patterns in Free Text
What is the Natural Extension to Analysing Text?
In my last blog on analysis of free text, I looked at how you could use R to draw data into a long tabular form, and then draw out information from the free text about the general points of emphasis. As an extension of this practice, it is also possible to take your analysis one step further, and identify the positive or negative sentiment used in the text and trends therein.
From a business point of view, this technique could be useful to monitor customer language over time, potentially to ensure that there’s a continually positive relationship. It’s entirely conceivable that this analysis could also be matched against various types of contact, using natural language processing (NLP). Alternatively, survey-type data could be analysed for sentiment to give an idea of the general feeling of the respondents.
For this work, I’ll continue with my previous example using Emily Brontë’s ‘Wuthering Heights’ and Leo Tolstoy’s ‘War and Peace’.
Constructing the Same Data Set and Adding the Sentiments
Having already completed the earliest stages of this analysis in my previous blog, specifically the “Reading in the Data” and “Tidying the Data” segments, we can now pick up at the start of our sentiment analysis. It’s key that we get the same form of data set to give us the best possible data quality for the analysis.
We’ll use the ‘NRC’ and ‘AFINN’ word banks, stored in the ‘sentiments’ data set from the tidytext package in R. Once again, this can quite easily be performed in Power BI or other packages that allow R scripts to be executed, so that you get the right information into the right place. By joining this information onto our original set of data, we now have a set of sentiments applied to each word (from the NRC lexicon) and a score applied to each matched word (from the AFINN lexicon).
NRC – The NRC lexicon categorises words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The general sentiment is quite a simple concept: we match each word in our data to a sentiment in the NRC sentiment data set, and the number of words grouped under each given sentiment can then be counted and mapped onto a plot. The score concept is slightly trickier.
AFINN – The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. So a word with a score of 5 is exceptionally positive, while a word with a score of -1 is very slightly negative. All of this information is now bolted onto our data set.
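The join described above can be sketched in a few lines. The article's own workflow is in R with tidytext; the sketch below is Python purely for illustration, and the two small lexicons (`TOY_NRC`, `TOY_AFINN`) and the function name `attach_sentiments` are invented stand-ins, not the real NRC or AFINN entries.

```python
# Toy stand-ins for the real lexicons. Every entry here is invented for illustration.
TOY_NRC = {
    "love": {"positive", "joy"},
    "storm": {"negative", "fear"},
    "friend": {"positive", "trust"},
}
TOY_AFINN = {"love": 3, "storm": -2, "friend": 1}


def attach_sentiments(words):
    """Return one row per word with NRC-style categories and an AFINN-style score."""
    rows = []
    for position, word in enumerate(words):
        rows.append({
            "position": position,                            # order of the word in the text
            "word": word,
            "categories": sorted(TOY_NRC.get(word, set())),  # empty list if unmatched
            "score": TOY_AFINN.get(word),                    # None if the word has no score
        })
    return rows


# A tidy, one-word-per-row input, as produced by the earlier tidying step.
tidy_words = ["the", "storm", "passed", "my", "friend"]
rows = attach_sentiments(tidy_words)
for row in rows:
    print(row)
```

Words that appear in neither lexicon are kept here with empty values, which makes it easy to see how much of the text the lexicons actually cover before any counting starts.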
Analysing the Data
So, now we have our library with our sentiments and scores attached. We can really start to have a dig to see what the mood is in our data. For this, as before, I’m using the ‘Wuthering Heights’ and ‘War and Peace’ data sets. We could easily apply this to a batch of complaints data, customer account data or survey data.
Checking Positivity and Negativity
Our first port of call can be to use the NRC lexicon to measure how much positive and negative language can be found in each book. This is the most basic level of analysis and I’ve included it in the plot below:
So, as per the plot above, neither novel is particularly overwhelmed by negative language. Brontë, however, is significantly closer to the middle ground than Tolstoy, with Wuthering Heights appearing to be almost as full of negative words as positive ones.
It’s important to remember that this might just be the nature of novels – that they always have more positive language – but it also gives us an apples-to-apples comparison using the same lexicon and the same approach. Tolstoy’s novel contains more positive language than Brontë’s.
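As a rough illustration of the positive/negative count, here is a minimal sketch in Python (the article itself works in R). The lexicon `TOY_NRC`, the sample word lists and the helper names are all made up for the example. Because the two novels differ greatly in length, the sketch also reports the positive share, which is an addition of this example and not necessarily how the original plot was scaled.

```python
from collections import Counter

# Invented NRC-style entries, for illustration only.
TOY_NRC = {
    "love": {"positive", "joy"},
    "storm": {"negative", "fear"},
    "friend": {"positive", "trust"},
    "grave": {"negative", "sadness"},
}


def polarity_counts(words):
    """Count how many words match the 'positive' and 'negative' categories."""
    counts = Counter()
    for word in words:
        for category in TOY_NRC.get(word, ()):
            if category in ("positive", "negative"):
                counts[category] += 1
    return counts


def positive_share(counts):
    """Fraction of matched polarity words that are positive (None if no matches)."""
    total = counts["positive"] + counts["negative"]
    return counts["positive"] / total if total else None


# Hypothetical word lists standing in for two tidied books.
books = {
    "book_a": ["love", "storm", "grave", "friend"],
    "book_b": ["love", "friend", "love", "storm"],
}
for title, words in books.items():
    counts = polarity_counts(words)
    print(title, dict(counts), positive_share(counts))
```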
What About the Actual Sentiment?
So we’re pretty comfortable at this stage that War and Peace is a more positive novel than Wuthering Heights. However, the question still remains – why? What emotions drive the swell of general negativity (or positivity)?
The NRC lexicon allows us to do this as well. Instead of filtering out everything that’s not a “positive” or “negative” match, we’ll filter out the “positive” and “negative” matches themselves and leave in the remaining emotion categories offered by the ‘sentiments’ data set in R.
From the Detailed Sentiment analysis, we can see that the emotions of fear, sadness, anticipation and anger are all fairly common within Wuthering Heights – with really only trust featuring as a key positive theme.
On the other hand, War and Peace has trust as its key sentiment, and there is a reasonable drop-off thereafter.
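The filtering step behind the detailed sentiment view can be sketched as follows. This is Python for illustration only (the article uses R), with an invented lexicon `TOY_NRC` and a hypothetical helper `emotion_counts`; the idea is simply to drop the two polarity labels and count what is left.

```python
from collections import Counter

# Invented NRC-style entries, for illustration only.
TOY_NRC = {
    "love": {"positive", "joy"},
    "storm": {"negative", "fear"},
    "friend": {"positive", "trust"},
    "grave": {"negative", "sadness", "fear"},
}
POLARITY = {"positive", "negative"}


def emotion_counts(words):
    """Count emotion categories per word, ignoring the positive/negative labels."""
    counts = Counter()
    for word in words:
        # Set difference removes the polarity labels and keeps only the emotions.
        counts.update(TOY_NRC.get(word, set()) - POLARITY)
    return counts


sample = ["storm", "grave", "friend", "storm", "love"]
counts = emotion_counts(sample)
for emotion, n in counts.most_common():
    print(emotion, n)
```

A word can belong to several emotions at once, so the emotion totals can exceed the number of matched words.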
Thoughts So Far
It makes sense at this stage to stop and consider application. The notion that a lexicon of positive and negative language could be matched with a novel isn’t groundbreaking stuff, but the question is what could actually be done with this sort of analysis?
The first implication could be on customer / complaints analysis, looking specifically at the mood implied at various points of contact. The real value however, might come from time series data with this approach.
Positivity/Negativity Over Time
So with the previous charts we have identified how generally positive a set of text is, and additionally which emotions are typically most common. This doesn’t tell much of a story though. Considering a customer account or other data from a given relationship, it’s probably more important to identify how that has developed over time.
With that in mind, we can use the AFINN lexicon from the same data set to assign a score to each word, helping deliver a rolling score from which we can see the ups and downs of the language used. With the data before, we’d expect the plots to end in the positive, but it might be interesting to see how the rises and falls in positive language are different between the texts.
The first thing the chart shows is how much shorter Wuthering Heights is as a book; we’ve seen this throughout with the significant difference in word counts. It also shows quite clearly the difference in language over time between the two books. Wuthering Heights is exceptionally negative straight away – and the trend down is pretty constant. War and Peace, on the other hand, separates early with a huge notch around 15% of the way in, and then around halfway through the book it suddenly becomes a general trend down.
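One way to build such a rolling score is a running total of the per-word scores, plotted against how far through the text each word sits. The sketch below assumes that interpretation; it is written in Python for illustration (the article uses R), and the lexicon `TOY_AFINN` and the function `rolling_score` are invented for the example. Expressing position as a percentage is what allows texts of very different lengths to share one axis.

```python
from itertools import accumulate

# Invented AFINN-style scores, for illustration only.
TOY_AFINN = {"love": 3, "storm": -2, "friend": 1, "grave": -1}


def rolling_score(words, lexicon):
    """Return (percent_through_text, cumulative_score) pairs, one per word."""
    scores = [lexicon.get(word, 0) for word in words]  # unmatched words score 0
    running = list(accumulate(scores))                 # running total of the scores
    n = len(words)
    return [((i + 1) / n * 100, total) for i, total in enumerate(running)]


sample = ["love", "storm", "the", "storm"]
trajectory = rolling_score(sample, TOY_AFINN)
for percent, total in trajectory:
    print(f"{percent:5.1f}%  {total:+d}")
```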
Final Thoughts on Practical Application
If we were to forget the source for a second and consider the impact of these data being collected from customer interactions within business, we could quite easily create ratings over time – identifying which customers had been engaged with both positively and negatively – and act on that basis.
This sort of sentiment analysis could really shake up the world of customer engagement or population surveys. The ability to crunch customer contact notes or NLP transcripts at speed could give some indicators as to the likelihood of customers buying new products or cancelling existing ones.
Customers who trigger a given threshold of negative language (potentially dropping below a long rolling average) could be analysed for likelihood to leave – determining if it’s significantly different to the base customer set. If so, the size of that difference can be identified and a suitably priced gesture, such as a bouquet of flowers or a discount voucher, can be sent to the customer proactively.
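A threshold of this kind might look something like the sketch below, which compares a customer's most recent contact scores with their longer-run average. It is Python for illustration, and the function name, window sizes, margin and score histories are all hypothetical choices that would need tuning against real data.

```python
from statistics import mean


def flag_negative_drift(scores, long_window=6, short_window=2, margin=1.0):
    """True when the recent average falls more than `margin` below the long-run average.

    `scores` is a chronological list of per-contact sentiment scores.
    All default values are illustrative assumptions, not recommendations.
    """
    if len(scores) < long_window:
        return False  # not enough history to judge
    long_avg = mean(scores[-long_window:])
    short_avg = mean(scores[-short_window:])
    return short_avg < long_avg - margin


# Hypothetical customers with made-up score histories.
customers = {
    "steady": [2, 2, 2, 2, 2, 2],
    "souring": [2, 3, 2, 3, -3, -4],
    "new": [-5],
}
flagged = [name for name, scores in customers.items() if flag_negative_drift(scores)]
print(flagged)
```

Customers flagged this way are only candidates: as noted above, the next step would be to check whether they really do leave at a different rate from the base customer set.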
On the other hand, customers who have shown ‘extreme’ levels of positive language could be considered a sales lead (marketing consent considered, of course) and sent to a sales team for pitching. The conversion rate against the company or demographic average should be identified to quantify a cost per sale improvement and to unearth whether it’s better to only dial customers who have met a particular level of positive relationship.
All of these opportunities, of course, require some thought and strategic application – it’s not an easy task, but segmenting customers based on actual relationship metrics when considering relationship-driven actions is something that may prove to be of value as the world becomes increasingly data-dependent.