From Brontë to Tolstoy: Finding Patterns in Free Text
by Jamie McLaughlin,
Data Consultant, Eyecademy
What Value Possibly Exists in Open Text Data?
Many of our day-to-day processes struggle with how to capture information properly. On the data input side, most people would always prefer an open text box, and with it the freedom to type out information without having to categorize it or be limited in how they explain situations and descriptions. However, for every process that adopts that free-for-all approach, there is at least one data analyst behind the scenes pulling their hair out, trying to work out how to provide any meaningful insight for the business.
Although it’s fair to suggest that we should be able to categorize our data, it’s equally true that to get definitive information on each case, we will want open text available as storage for supplementary information.
How could we use that supplementary information? The classic use case is for a case study, where somebody could use it to build a timeline of events and actions, but perhaps a more holistic way to explore this data is with an analysis of word counts and related words.
Let’s look into that, and see if we can identify trends and infer information from sets of open-sourced data.
Reading in the Data
Using the excellent “Project Gutenberg”, we can collect some data to explore with this methodology. Since I’m never going to read these tomes, I’ve decided that Emily Brontë’s ‘Wuthering Heights’ and Leo Tolstoy’s ‘War and Peace’ will be the texts we analyse in this example. Obviously, in the business world you’re not likely to be analysing 19th century literature, but this technique could easily be applied to customer account notes, complaint details, or any other free-text field in almost any context.
Conveniently for us, Project Gutenberg publish these two novels online, so I’ve linked to those specific web pages and pulled the text down to use as our dataset. It’s important to remember that this could just as easily be business data in an open form – things like complaints, descriptive data or incident reporting notes come to mind.
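As a rough illustration of the reading step, here is a minimal Python sketch (the analysis in this article was done in R, so this is a stand-in rather than the code actually used). The function name read_book and the in-memory sample are assumptions made for the example. It produces one row per line of text, tagged with the book title, and decodes with 'utf-8-sig' so that a leading byte-order mark does not end up in the first line.

```python
import io

def read_book(title, stream):
    """Return one (title, line) row per line of text read from a binary stream."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark,
    # so the first line does not start with a stray U+FEFF character.
    text = io.TextIOWrapper(stream, encoding="utf-8-sig")
    return [(title, line.rstrip("\r\n")) for line in text]

# Hypothetical stand-in for a downloaded file. With the real address of the
# text, urllib.request.urlopen(url) returns a binary stream that can be
# passed in the same way.
sample = io.BytesIO(
    "\ufeffThe Project Gutenberg eBook\n\nThis eBook is for the use of anyone\n".encode("utf-8")
)
rows = read_book("Wuthering Heights", sample)
for row in rows:
    print(row)
```

The same function would work on an export of complaints or account notes, as long as each record sits on its own line.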
Tidying the Data
The data set we’re going to explore is shown below. It contains huge tracts of text that are full of punctuation, something that could disrupt our later analyses. The head (first six lines) of the data structure for Wuthering Heights is:
##   X.Wuthering.Heights.
## 1    Wuthering Heights
## 2    Wuthering Heights
## 3    Wuthering Heights
## 4    Wuthering Heights
## 5    Wuthering Heights
## 6    Wuthering Heights
##   readLines.pap_url..encoding....UTF.8..
## 1 <U+FEFF>The Project Gutenberg eBook, Wuthering Heights, by Emily Bronte
## 2
## 3
## 4 This eBook is for the use of anyone anywhere at no cost and with
## 5 almost no restrictions whatsoever. You may copy it, give it away or
## 6 re-use it under the terms of the Project Gutenberg License included
Whether this is Wuthering Heights, complaints information or customer account notes, we would ideally like to do 3 things to make this data set more useful for our analyses:
- Remove the rows that add no value to the analysis, such as empty lines or system-generated waffle (in this case, legal statements and contents pages).
- Remove punctuation and convert everything to lower case, leaving a long column of lines with little or no punctuation.
- ‘Unnest’ and re-build the table with just two columns: one holding the book title and the other holding every word that appears, one word per row.
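The three steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the R code behind the article: the function name tidy, the skip_prefixes parameter and the sample lines are all made up for the example, and real boilerplate removal would need rules suited to the source.

```python
import re

def tidy(title, lines, skip_prefixes=()):
    """Turn raw lines into (title, word) rows, one word per row."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # Step 1: drop empty lines and lines flagged as boilerplate.
        if not stripped or stripped.startswith(tuple(skip_prefixes)):
            continue
        # Step 2: lower-case, then replace anything that is not a letter,
        # digit, apostrophe or whitespace with a space.
        cleaned = re.sub(r"[^\w'\s]|_", " ", stripped.lower())
        # Step 3: 'unnest' the line into one row per word.
        rows.extend((title, word) for word in cleaned.split())
    return rows

# Illustrative input only; the "***" prefix is a hypothetical boilerplate marker.
raw = ["BOOK ONE: 1805", "", "*** legal notice ***", "CHAPTER I", '"Well, Prince..."']
rows = tidy("War and Peace", raw, skip_prefixes=("***",))
for book, word in rows:
    print(book, word)
```

Keeping apostrophes means contractions such as “don't” survive as single words; whether that is desirable depends on the analysis.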
Once these 3 steps are completed, we should have a good looking table of the individual books. As it turns out, we have 118,748 words in Wuthering Heights and 567,211 words in War and Peace. A small excerpt of the War and Peace table is shown below (after being “unnested”):
##            Book   Words
## 1 War and Peace    book
## 2 War and Peace     one
## 3 War and Peace    1805
## 4 War and Peace chapter
## 5 War and Peace       i
## 6 War and Peace    well
Analysing the Data
Now we have a pair of huge tables, each holding the broken-down text of one book, and there are multiple ways to approach the coming analysis. If we had wanted, we could have left the line number in to let us analyse groups of text. This is a useful approach with complaints or customer account data, where we can trace examples that include certain key words or triggers. However, in this case, we’ll just do a general analysis of common words.
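To make the line-number idea concrete, here is a small Python sketch of tracing records by trigger word. The function name lines_mentioning and the sample notes are hypothetical, invented purely for illustration.

```python
import re

def lines_mentioning(lines, triggers):
    """Return (line_number, text) for every line containing a trigger word."""
    wanted = {t.lower() for t in triggers}
    hits = []
    for number, text in enumerate(lines, start=1):
        # Compare whole words so that "cost" does not match "costume".
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        if words & wanted:
            hits.append((number, text))
    return hits

# Made-up account notes standing in for real business data.
notes = [
    "Customer called about delivery delay.",
    "Refund issued after complaint about cost.",
    "No further action needed.",
    "Customer unhappy with the cost of renewal.",
]
hits = lines_mentioning(notes, ["cost", "refund"])
for number, text in hits:
    print(number, text)
```

Because the line number is kept, each hit can be traced back to the original record for a closer read.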
Most Common Words
The ‘tidyverse’ group of packages in R (which can also be integrated with Microsoft’s Power BI software quite closely) will help us break down the tables and visualise the most common words found in the two books. We will plot the data on a chart using the famous ‘ggplot2’ package:
With this method, it’s fairly easy to see straight away that the books are somewhat different in nature. In Wuthering Heights, the largest group of words is names, suggesting that relationships between the characters are a key theme. Interestingly, “answered”, “house” and “door” all appear as well, which may indicate that the story centres around a house.
On the other hand, Tolstoy’s classic has a much more grandiose set of common words. Along with names, there are numerous references to nations and royal court titles and, perhaps unsurprisingly for a book titled ‘War and Peace’, the word “army” comes up very frequently.
Interestingly, ‘time’ appears in both!
This is an interesting first take on the two books. Without ever reading a page, we can reasonably surmise that Wuthering Heights is heavily driven by relationships – with a possible setting in the environment of a house. War and Peace on the other hand, at first glance, looks like it may be a war novel – speaking primarily about high titles, nations and armies.
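For readers who want to try the counting step outside R, here is a minimal Python sketch that ranks the most frequent words per book from (book, word) rows like the unnested table shown earlier. It is not the tidyverse code used for the charts. The tiny stop-word list is an assumption: the article does not say how filler words such as “the” were handled, but some filtering of that kind is usually needed before names and nouns rise to the top. The sample rows are invented.

```python
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = frozenset({"the", "and", "a", "of", "to", "in", "it", "was", "i", "he", "she"})

def top_words(rows, n=10, stop_words=STOP_WORDS):
    """Return {book: [(word, count), ...]} with the n most frequent words per book."""
    counts = {}
    for book, word in rows:
        if word not in stop_words:
            counts.setdefault(book, Counter())[word] += 1
    return {book: counter.most_common(n) for book, counter in counts.items()}

# Toy rows, not real counts from either novel.
rows = (
    [("Wuthering Heights", w) for w in "heathcliff the house heathcliff door house heathcliff".split()]
    + [("War and Peace", w) for w in "prince army the prince army prince".split()]
)
top = top_words(rows, n=3)
for book, ranking in top.items():
    print(book)
    for word, count in ranking:
        # A crude text bar chart in place of a plotted one.
        print(f"  {word:<12} {'#' * count} {count}")
```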
If these data were complaints, customer account notes or similar, we could very quickly identify the larger key trends in what has been input. It wouldn’t be hard to tell from complaints data that costs are a key issue if “cost” is the most common word. That, however, doesn’t tell the whole story. How do we investigate further?
War and Peace
Considering that we’ve never read the books, we’re able to infer quite a lot about them from our analysis. War and Peace is a story about the Napoleonic Wars between France and Russia, and it appears to centre around royal court characters, with numerous mentions of Princes, Princesses, Emperors and Counts. It is likely that the main protagonist is called ‘Pierre’, as that is the most commonly referenced name in the analyses.
Wuthering Heights appears to centre around a house and the relationships of the characters central to the plot, primarily Catherine and Heathcliff (if we assume Linton is a surname). We can, however, suggest that Catherine’s story isn’t necessarily all positive, judging by the related words in the novel.
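One common way to find “related words” is to count which words appear within a few positions of a word of interest. The article does not specify how its related words were derived, so the Python sketch below is an assumption about one reasonable approach, with a hypothetical function name, window size and toy sentence.

```python
from collections import Counter

def related_words(words, target, window=3, stop_words=frozenset()):
    """Count the words that appear within `window` positions of `target`."""
    neighbours = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        # Look at up to `window` words on either side of this occurrence.
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i and words[j] not in stop_words:
                neighbours[words[j]] += 1
    return neighbours

# Toy word list standing in for an unnested column of words.
words = "the cost was too high and the cost keeps rising".split()
related = related_words(words, "cost", window=3, stop_words={"the", "was", "and"})
print(related.most_common())
```

Applied to a name such as “catherine”, the same count would surface the words that most often surround her, which is the kind of evidence the paragraph above relies on.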
Final Thoughts on Practical Application
It’s fair to suggest that with additional time diving into the detail of some of the most popular words, we’d be able to identify some really strong key trends. The same logic and thought process can be applied to almost any character/string data set, helping to topple the notion that open-text boxes are of little use in general data exploration and only useful for deep-dive case study or root cause analysis.
Using this exact technique for complaints or survey results could give great insight into the data that’s drawn out. If a company were looking at complaints details then they might be able to identify different demographics, and their likelihood to complain about key items like cost or value. If cost was the issue, they could drill into related words and properly draw out what the actual issue with the cost was – for example, is it value or is it just too expensive (a subtle, but important, difference)?
These opportunities to learn about data could also be included in Microsoft Power BI or other visualization tools and trended over time – is cost an increasing issue for example? A trend of the appearance of the word ‘cost’ over time in complaints data might hold the answer to that question!
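As a sketch of how such a trend could be prepared before it reaches a visualization tool, the Python below counts, month by month, how many records mention a given word. The function name monthly_mentions, the record layout (date, text) and the sample complaints are all hypothetical.

```python
import re
from collections import Counter
from datetime import date

def monthly_mentions(records, word):
    """Count, per calendar month, how many records mention `word`."""
    # \b anchors keep the match to whole words, ignoring case.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for logged_on, text in records:
        if pattern.search(text):
            counts[logged_on.strftime("%Y-%m")] += 1
    # Months with no mentions are simply absent from the result.
    return dict(sorted(counts.items()))

# Invented complaints with invented dates, for illustration only.
complaints = [
    (date(2024, 1, 5), "Delivery was late."),
    (date(2024, 1, 20), "The cost is too high."),
    (date(2024, 2, 3), "Cost went up again without warning."),
    (date(2024, 2, 14), "Unhappy with the cost of renewal."),
    (date(2024, 2, 28), "Website kept crashing."),
]
trend = monthly_mentions(complaints, "cost")
print(trend)
```

Dividing each monthly count by the total number of complaints in that month would give a rate, which is fairer to compare when overall volumes change.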