Our Top Data Horror Stories
What happens when data turns horrible? Which data management mishaps keep us awake at night? What data breaches give us goosebumps? This Halloween, we’ve broken down our team’s top ‘data horror stories’ to give you some dos and don’ts for your data management:
The most publicised data horror stories often come from large data breaches that have exposed millions of users’ information at scale. These can be the result of hackers attacking a company directly, for example with brute-force techniques, or of businesses leaving their data without sufficient protection to prevent a breach.
Business Insider reports that hackers have become so sophisticated that nearly 4 billion records were stolen over the last decade, a period that also saw the two largest data exposures in history. The most famous misuse of data last year was the Facebook and Cambridge Analytica scandal, in which more than 50 million Facebook profiles were harvested by Cambridge Analytica through the third-party app ‘thisisyourdigitallife’. Other records that have been left exposed or stolen by hackers include over 885 million financial records at First American, 500 million guest details from Marriott’s reservation system, and 150 million users’ fitness information from the Under Armour app ‘MyFitnessPal’.
From a company perspective, these horror stories should be enough to scare anyone into checking the protection of their company and customers’ data. As a user, if you want to check whether your own email address has been involved in a breach, you can search for your details on the website HaveIBeenPwned.com, a free resource for anyone to assess whether their details have been put at risk in a major reported breach.
Sometimes the impact of a data leak is unavoidable for users. Although it is an organisation’s responsibility to implement appropriate measures, there are steps that users and businesses can take to protect their online third-party accounts. The most commonly publicised advice is to use a strong password, or better yet, a password manager. Common passwords to avoid include ‘password’, ‘qwerty’, and ‘123456’, which was declared the most used password of 2019, appearing in 23 million hacked accounts.
However, it could be argued that having a weak password is still better than having none at all. One famous data horror story occurred in February this year, when 763 million unique email addresses held by email verification service ‘Verification.io’ were exposed. Discovered by Bob Diachenko and Vinny Troia, the breach occurred because the data was stored in a 150 GB MongoDB instance that was left publicly accessible without a password. Many records also included additional personal attributes such as names, phone numbers, IP addresses, dates of birth and genders.
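The root cause here was an unauthenticated database reachable from the public internet. As a minimal illustration (an excerpt only, not a complete hardening guide), a MongoDB deployment can require authentication and stop listening on public interfaces with two settings in its `mongod.conf`:

```yaml
# mongod.conf -- illustrative excerpt, not a complete configuration
net:
  bindIp: 127.0.0.1        # listen on localhost only, not on every interface
security:
  authorization: enabled   # require authenticated, authorised users for all access
```

A production deployment would combine this with TLS, firewall rules, and named user accounts, but even these two lines would have kept this particular instance off the open internet.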
The aforementioned HaveIBeenPwned.com also has a password checker, where you can check whether your password has ever appeared in a breach and is therefore unsuitable for use. These data horror stories are occurring more and more frequently; however, by applying the measures above, users and businesses alike can minimise their risk.
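That password checker is also exposed programmatically as the Pwned Passwords ‘range’ API, which uses k-anonymity: only the first five characters of the password’s SHA-1 hash are sent over the network, and the match is done locally. A minimal Python sketch (the endpoint is real; error handling and rate-limit handling are omitted for brevity):

```python
import hashlib
import urllib.request

def sha1_split(password):
    """Return the 5-char SHA-1 prefix sent to the API and the
    35-char suffix that is matched locally and never transmitted."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def pwned_count(password):
    """Number of breached data sets the password appears in, per
    Pwned Passwords; returns 0 if it has never been seen."""
    prefix, suffix = sha1_split(password)
    url = "https://api.pwnedpasswords.com/range/" + prefix
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    # Each response line is "HASH_SUFFIX:COUNT" for the given prefix.
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0
```

At the time of writing, `pwned_count('123456')` returns a count in the tens of millions, in line with the figure above, while a long random passphrase should return 0.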
There is an old principle in data management: ‘rubbish in, rubbish out’. Ultimately, poor quality data will lead to poor quality information and bad visualisations, no matter how many best practices are followed. Data quality issues can come from a multitude of sources, including typing mistakes and manual input errors, duplication, missing or null rows, errors in business systems and more.
One famous data quality horror story, reported by Utopia, was the Enron scandal, which was uncovered to be largely the fault of bad data. Once the sixth-biggest company in the world, Enron saw fraudulent and incorrect data contribute significantly to the exponential rise in its stock price, misleading shareholders and contributing to the company’s eventual demise.
No business is immune to data quality issues, as NASA found out in 1999. Due to an inconsistency in the data for its Mars Climate Orbiter, where a contractor’s software reported values in imperial units while NASA’s software expected metric units, communication with the orbiter was lost during orbital insertion. In total, the mistake cost NASA $125 million through the loss of the spacecraft.
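The underlying failure was one system producing thrust figures in pound-force seconds while another read them as newton-seconds. A minimal sketch of the standard defence, carrying the unit alongside the value and refusing to guess (the function and unit labels here are illustrative, not NASA’s actual interfaces):

```python
# One pound-force is defined as exactly 4.4482216152605 newtons, so the
# same factor converts an impulse from lbf*s to N*s.
LBF_S_TO_N_S = 4.4482216152605

def impulse_in_si(value, unit):
    """Normalise a thrust impulse to newton-seconds; raise rather
    than silently assume when the unit is unrecognised."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError("unknown impulse unit: " + repr(unit))
```

Forcing every reading through a conversion step like this turns a silent factor-of-4.45 error into a loud, testable failure.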
To understand whether your data is clean, you need to follow the data journey at every point, from collection right through to analysis, to see where errors occur. Techniques include using statistical methods to uncover outliers, data profiling to summarise quality, and software checks to flag violations of defined data constraints. As a data consultancy, we spend a lot of time with our clients using our experience to uncover issues in their data quality. Because data quality issues are so common, we have even developed our own 5-phase iterative methodology to identify and remediate them on each cycle.
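The kinds of checks described above can start very simply. A minimal sketch in plain Python (the field name and the 3-standard-deviation outlier threshold are illustrative choices, not part of any particular methodology):

```python
import statistics

def quality_report(rows, field):
    """Report missing values, duplicates and crude z-score outliers
    for one numeric field in a list of row dictionaries."""
    values = [row.get(field) for row in rows]
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    duplicates = len(present) - len(set(present))
    outliers = []
    if len(present) >= 2:
        mean = statistics.mean(present)
        stdev = statistics.stdev(present)
        if stdev > 0:
            # Flag values more than 3 sample standard deviations from the mean.
            outliers = [v for v in present if abs(v - mean) / stdev > 3]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}
```

A real project would layer on schema validation and cross-field constraints, but even a report this basic surfaces the null rows, duplication and outliers mentioned above.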
Data visualisation is the process of taking data and information and placing it into a visual context, which could include a range of different graphs, charts, maps and more. Visualising data is extremely important in helping users detect patterns and trends quickly within a data set, allowing for fast, evidence-based decision making. With good quality data, visualisations can be a valuable asset for any business looking to adopt an analytical approach to making decisions across the organisation.
However, the quality of a data visualisation is extremely important. As visualisations enable decision-makers to view lots of information in an instant, a badly designed graph or misleading chart can actually harm your ability to make decisions at speed, and it is extremely easy for users to draw incorrect conclusions from a confusing source. Common mistakes include choosing the wrong type of visualisation for the data you are dealing with, trying to display too much information at once, misrepresenting data, or using inconsistent scales. These mistakes can quickly become a horror story in themselves, even if your data quality is relatively good.
Even publications can sometimes get their data visualisations wrong, creating unrepresentative information for their readers. The Economist has published an analysis acknowledging where some of its visualisations have been confusing or misleading. Its mistakes include choosing the wrong method to represent attitudes to the EU referendum, using confusing colours to compare different governments’ spending on pension benefits, and misrepresenting a correlation between the weight and neck sizes of dogs by cherry-picking scales. While not exactly terrifying, The Economist’s reflection and transparency on its visualisation errors perfectly illustrate how easy such mistakes are to make and the impact they can have on the reader.
You can also view some of the worst-offending visualisations from a variety of sources on the blog Viz.WTF, to see how a poorly designed visualisation can be over-complicated, unclear, or at worst, misleading.
And there we have it – our top 4 data horror stories and how to avoid them. If you have a favourite scary data tale to tell, let us know in the comments below!