by Jamie McLaughlin,
Data Consultant, Eyecademy

What Value Possibly Exists in Open Text Data?

Many of our day-to-day processes struggle with how to capture information properly. On the data input side, most users would prefer an open text box, where they can simply type without having to categorize anything or be limited in how they explain situations, information and descriptions. However, for every process that adopts that free-for-all approach, there is at least one data analyst behind the scenes pulling their hair out, trying to work out how to provide any meaningful insight for the business.

Although it’s fair to suggest that we should be able to categorize our data, it’s equally true that to get definitive information on each case, we will want open text available as storage for supplementary information.

How could we use that supplementary information? The classic use case is for a case study, where somebody could use it to build a timeline of events and actions, but perhaps a more holistic way to explore this data is with an analysis of word counts and related words.

Let’s look into that, and see if we can identify trends and infer information from sets of openly available data.

Reading in the Data

Using the excellent “Project Gutenberg”, we can collect some data to explore with this methodology. Since I’m never going to read these tomes, I’ve decided that Emily Bronte’s ‘Wuthering Heights’ and Leo Tolstoy’s ‘War and Peace’ are going to be the texts we analyse in this example. Obviously, in the business world you’re not likely to be analysing 19th century literature, but this technique could easily be applied to examples such as customer account notes, complaints details notes, or any other free-text fields in almost any context.

Conveniently for us, Project Gutenberg publish these two novels online, so I’ve linked to those specific web pages and pulled the text down to use as our dataset. It’s important to remember that this could just as easily be business data in an open form: things like complaints, descriptive data or incident reporting notes come to mind.
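As a rough illustration of this step, here is a minimal Python sketch (the analysis in this article was done in R, so this is a stand-in rather than the code actually used). The helper names `fetch_bytes` and `to_rows` are hypothetical, and the URL is left as a parameter rather than guessed. The sketch pairs every line of text with its book title, giving a simple two-column structure, and strips the byte-order mark that can sit at the very start of such a file.

```python
import urllib.request


def fetch_bytes(url):
    """Download the raw bytes of a plain-text page (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def to_rows(title, raw):
    """Decode the text and pair every line with its book title.

    The 'utf-8-sig' codec drops a leading byte-order mark if one is present.
    """
    text = raw.decode("utf-8-sig")
    return [(title, line) for line in text.splitlines()]


# Small in-memory stand-in so the sketch runs without a download.
sample = (
    "\ufeffThe Project Gutenberg eBook, Wuthering Heights, by Emily Bronte\n"
    "\n"
    "\n"
    "This eBook is for the use of anyone anywhere at no cost and with"
).encode("utf-8")

rows = to_rows("Wuthering Heights", sample)
for row in rows:
    print(row)
```

With a real download, `to_rows(title, fetch_bytes(url))` would produce the same kind of structure for a whole book.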

Tidying the Data

The raw data set we’re going to explore is shown below. It contains huge tracts of text that are full of punctuation, something that could disrupt our later analyses. The head (first six lines) of the data structure for Wuthering Heights is shown below:

##   X.Wuthering.Heights.
## 1    Wuthering Heights
## 2    Wuthering Heights
## 3    Wuthering Heights
## 4    Wuthering Heights
## 5    Wuthering Heights
## 6    Wuthering Heights
##                                    readLines.pap_url..encoding....UTF.8..
## 1 <U+FEFF>The Project Gutenberg eBook, Wuthering Heights, by Emily Bronte
## 2                                                                        
## 3                                                                        
## 4        This eBook is for the use of anyone anywhere at no cost and with
## 5    almost no restrictions whatsoever.  You may copy it, give it away or
## 6     re-use it under the terms of the Project Gutenberg License included

Whether this is Wuthering Heights, complaints information or customer account notes, we would ideally like to do 3 things to make this data set more useful for our analyses:

  1. Remove the rows that add no value to the analysis, such as empty lines or system generated waffle (in this case it’s legal statements and contents pages).
  2. Remove any punctuation and convert all text to lower case, leaving only a long column of lines with little or no punctuation at all.
  3. ‘Unnest’ and re-build the table with simply two columns; one column being the book title and the other being a list of all words that appear – one by one.
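The three steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the R code behind this article: the helper names (`drop_noise`, `unnest_words`), the `boilerplate` phrase list and the sample rows are all hypothetical, and a real clean-up would need a fuller list of phrases to discard.

```python
import re

# A word is a run of letters/digits, optionally joined by apostrophes (e.g. "don't").
WORD = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def drop_noise(rows, boilerplate=()):
    """Step 1: drop blank lines and lines containing a boilerplate phrase."""
    kept = []
    for title, line in rows:
        if not line.strip():
            continue
        lowered = line.lower()
        if any(phrase in lowered for phrase in boilerplate):
            continue
        kept.append((title, line))
    return kept


def unnest_words(rows):
    """Steps 2 and 3: lower-case, discard punctuation, one word per row."""
    return [
        (title, word)
        for title, line in rows
        for word in WORD.findall(line.lower())
    ]


# A few sample rows in the (title, line) shape used for the raw data.
rows = [
    ("War and Peace", "This eBook is for the use of anyone anywhere"),
    ("War and Peace", "BOOK ONE: 1805"),
    ("War and Peace", ""),
    ("War and Peace", "CHAPTER I"),
    ("War and Peace", '"Well, Prince, so Genoa and Lucca are now just family estates'),
]

words = unnest_words(drop_noise(rows, boilerplate=("this ebook",)))
for row in words[:6]:
    print(row)
```

Matching words with a pattern, instead of deleting punctuation character by character, handles steps 2 and 3 in one pass.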

Once these 3 steps are completed, we should have a tidy table for each of the individual books. As it turns out, we have 118,748 words in Wuthering Heights and 567,211 words in War and Peace. A small excerpt of the War and Peace table is shown below (after being “unnested”):

##            Book   Words
## 1 War and Peace    book
## 2 War and Peace     one
## 3 War and Peace    1805
## 4 War and Peace chapter
## 5 War and Peace       i
## 6 War and Peace    well

Analysing the Data

Now we have a pair of huge tables, each holding the broken-down text of one book, and there are multiple ways to approach the coming analysis. If we had wanted, we could have left the line number in to let us analyse groups of text. This is a useful approach with complaints or customer account data, where we can trace examples that include certain key words or triggers. However, in this case, we’ll just do a general analysis of common words.

Most Common Words

The ‘tidyverse’ group of packages in R (which can also be integrated quite closely with Microsoft’s Power BI software) will help us break down the tables and visualise the most common words found in the two books. We will plot the data on a chart using the well-known ‘ggplot2’ package.
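The counting behind such a chart can be sketched in a few lines. The following is a Python illustration, not the R code used for this article. It assumes that very common filler words (“the”, “and” and so on) are filtered out before counting, since none of them feature in the results discussed here; the tiny `STOP_WORDS` set, the `top_words` helper and the toy rows are all made up for the example.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real analysis needs a fuller one.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "at", "was"}


def top_words(word_rows, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words for each book title."""
    counts = {}
    for title, word in word_rows:
        if word in stop_words:
            continue
        counts.setdefault(title, Counter())[word] += 1
    return {title: counter.most_common(n) for title, counter in counts.items()}


# Made-up toy rows in the unnested (title, word) shape, not text from the novels.
toy_text = "heathcliff opened the door and catherine saw heathcliff at the house"
word_rows = [("Wuthering Heights", word) for word in toy_text.split()]

top = top_words(word_rows, n=3)
print(top["Wuthering Heights"])
```

The resulting lists of (word, count) pairs are what would be passed to a plotting library to draw one bar chart per book.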

With this method, it’s fairly easy to see straight away that the books are somewhat different in nature. In Wuthering Heights, the largest group of words is names, suggesting that relationships between the characters are a key theme. Interestingly, “answered”, “house” and “door” also all appear, which may indicate that the story centres around a house.

On the other hand, Tolstoy’s classic has a much more grandiose set of common words. Along with names, there are numerous references to nations and royal court titles and, perhaps unsurprisingly for a book titled ‘War and Peace’, the word “army” comes up very frequently.

Interestingly, ‘time’ appears in both!

This is an interesting first take on the two books. Without ever reading a page, we can reasonably surmise that Wuthering Heights is heavily driven by relationships – with a possible setting in the environment of a house. War and Peace on the other hand, at first glance, looks like it may be a war novel – speaking primarily about high titles, nations and armies.

If these data were complaints, customer account notes or similar, we could very quickly identify the larger key trends in what has been input. It wouldn’t be hard to tell from complaints data that costs are a key issue if “cost” is the most common word. That, however, doesn’t tell the whole story, so how do we investigate further?

Finding Strongly Related Words

Let’s say we now have an idea that Wuthering Heights centres around relationships with other characters, while War and Peace is (unsurprisingly) about war. We can further inspect this hypothesis by finding words that are related to some of the most used words in the respective books.

We can begin by identifying lines of the original data set that include our target word. Then, from there, we take those lines, unnest them and do an identical analysis. It may be worth picking words that we have an intuition on, so perhaps “army” from War and Peace and “Catherine” from Wuthering Heights.
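A minimal Python sketch of that idea is shown below (again a stand-in for the R analysis, not the code actually used). The `related_words` helper, its parameters and the sample lines are hypothetical. It keeps only the lines that contain the target word, unnests them, and counts every other word that appears alongside it.

```python
import re
from collections import Counter

# A word is a run of letters/digits, optionally joined by apostrophes.
WORD = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def related_words(lines, target, n=10, stop_words=frozenset()):
    """Count the words that share a line with the target word."""
    target = target.lower()
    counts = Counter()
    for line in lines:
        words = WORD.findall(line.lower())
        if target not in words:
            continue  # this line never mentions the target, so skip it
        counts.update(
            w for w in words if w != target and w not in stop_words
        )
    return counts.most_common(n)


# Made-up sample lines, not quotations from War and Peace.
lines = [
    "The French army advanced.",
    "The Russian army held its position.",
    "Pierre went home.",
    "The French army retreated.",
]

result = related_words(lines, "army", stop_words={"the", "its"})
print(result)
```

Matching on whole words rather than substrings means a target such as “army” would not also pick up lines that only contain a longer word with “army” inside it.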

We might expect to see things related to war in the ‘army’ related words, and perhaps some clarification of the role of ‘Catherine’ to the story in Wuthering Heights.

So, with that in mind we can produce the same visualisations, slightly amended for the situation:

Immediately, it’s easy to see that Catherine likely has the title ‘Miss’, and has a strong relationship of some kind with Heathcliff. It doesn’t look like Catherine’s story is overwhelmingly positive though, with “cried” and “exclaimed” among the words most commonly related to her.

With War and Peace, the word ‘army’ immediately yields two results above all others: French and Russian. This is not just a likely confirmation that the book is about a war, but other supporting words give us additional detail:

  • Napoleon: The book can be assumed to be related in some way to the Napoleonic Wars.
  • Position/Battle/Thousand: These words all hint at descriptions of armies being involved somehow in the story, not just war as an environmental setting.


War and Peace

Considering that we’ve never read the books, we’re able to infer quite a lot about them from our analysis. War and Peace is a story about the Napoleonic Wars between France and Russia, and it appears to centre around royal court characters, with numerous mentions of Princes, Princesses, Emperors and Counts. It is likely that the main protagonist is called ‘Pierre’, as that is the most commonly referenced name in the analyses.

Wuthering Heights

Wuthering Heights appears to centre around a house and the relationships of the characters central to the plot, primarily Catherine and Heathcliff (if we assume Linton is a surname). We can however suggest that Catherine’s story isn’t necessarily all positive, given the words related to her in the novel.

Final Thoughts on Practical Application

It’s fair to suggest that with additional time diving into the detail of some of the most popular words, we’d be able to identify some really strong key trends. The same logic and thought process can be applied to almost any character/string data set, helping to topple the notion that open-text boxes are of little use in general data exploration and only useful for deep-dive case studies or root cause analysis.

Using this exact technique for complaints or survey results could give great insight into the data that’s drawn out. If a company were looking at complaints details then they might be able to identify different demographics, and their likelihood to complain about key items like cost or value. If cost was the issue, they could drill into related words and properly draw out what the actual issue with the cost was – for example, is it value or is it just too expensive (a subtle, but important, difference)?

These opportunities to learn about data could also be included in Microsoft Power BI or other visualization tools and trended over time – is cost an increasing issue for example? A trend of the appearance of the word ‘cost’ over time in complaints data might hold the answer to that question!
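As a closing illustration, the trend question can be sketched as a simple count per month. This is a hedged Python example on hypothetical data: the `monthly_mentions` helper and the complaint records are invented purely to show the shape of the calculation.

```python
import re
from datetime import date

# A word is a run of letters/digits, optionally joined by apostrophes.
WORD = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def monthly_mentions(records, keyword):
    """Count, per calendar month, how many notes mention the keyword."""
    keyword = keyword.lower()
    counts = {}
    for when, note in records:
        month = when.strftime("%Y-%m")
        counts.setdefault(month, 0)  # keep months with no mentions at zero
        if keyword in WORD.findall(note.lower()):
            counts[month] += 1
    return sorted(counts.items())


# Hypothetical complaint notes, invented for this example.
records = [
    (date(2023, 1, 5), "The cost is too high"),
    (date(2023, 1, 20), "Late delivery"),
    (date(2023, 2, 2), "Cost went up again"),
    (date(2023, 2, 14), "Poor value for the cost"),
    (date(2023, 2, 28), "Rude staff"),
]

trend = monthly_mentions(records, "cost")
print(trend)
```

Dividing each month’s count by the number of notes received that month would turn this into a rate, which is usually a fairer basis for a trend line when volumes vary.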