Tristan Mobbs

How Classifying and Fixing Dirty Data Can Help Even Those Using Spreadsheets

Examining the ground we stand on

Published: Wednesday, March 9, 2022 - 13:02

All too often, the topic of fixing dirty data is neglected amid the plethora of online media covering artificial intelligence (AI), data science, and analytics. That neglect is a mistake for many reasons.

To highlight just one: confidence in the quality of data is the vital foundation of all analysis. That holds at every level of complexity, from spreadsheets to complex machine-learning models.

So, I was delighted to review Susan Walsh’s book, Between the Spreadsheets: Classifying and Fixing Dirty Data (Facet Publishing, 2021). Here are some highlights from her book, and my own advice on who should read it.

Yes, we’re finally talking about dirty data

Between the Spreadsheets covers a topic that gets less daylight amid the glamour of AI and machine learning. Having followed Walsh for a while on LinkedIn, I've enjoyed how she highlights the benefits of making sure your data have their COAT on (consistent, organized, accurate, trustworthy), and how she takes the time to explain the challenges associated with poor data quality—as well as the painful real-world consequences of it.

Exploring dirty data in the world of procurement, Walsh looks at spend data classification and provides real examples of how she goes about validating and cleaning it up.

Data horror stories to inspire action

Data quality and data validation are often unloved topics in the world of data, but they're crucial. Anyone working with data will have encountered issues that had consequences for the company or people involved. In the energy industry, I often saw customers billed by the company I worked for even though they weren't on supply with us. It gets even worse when the debt collectors are about to be unleashed.

There are many examples of data errors causing major issues, from NASA losing its $125 million Mars Climate Orbiter due to a metric conversion error to a town in Yorkshire not paying its gas bill for 17 years. The impact can range from looking a bit foolish to something extremely costly. Between the Spreadsheets shows the importance of these topics, and offers practical examples of steps to ensure your data have their COAT on.

Cleaning data can be tedious, but Walsh's practical examples guide you through a process that makes it a little less painful. With the benefits laid out and the horror stories shared, this book will motivate you to get on with it and clean up your data.

How can you get started with your dirty data?

With Walsh's guidance, plus our own methods and techniques, we can all spot errors and clean up our data. Often it can be as simple as sorting the data, as a schoolboy found out when he corrected NASA (yes, them again). Rocket science is easy compared to data validation, apparently.
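
To make that concrete, here's a minimal sketch of my own (not an example from Walsh's book) showing how a simple sort in Python with pandas pushes near-duplicate names and odd values next to each other. The supplier and spend data are purely illustrative.

```python
import pandas as pd

# Hypothetical supplier spend data; names and values are illustrative only.
df = pd.DataFrame({
    "supplier": ["Acme Ltd", "ACME Ltd.", "Bolt & Co", "Acme Ltd"],
    "spend": [1200.00, 1200.00, 980.25, -35.50],
})

# Sorting groups near-duplicate supplier names together and puts odd values
# (such as the negative spend) where they are easy to spot by eye.
print(df.sort_values(["supplier", "spend"]))
```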

Throughout her book, Walsh provides guidance on how to clean up your data in Excel. She shares tips and tricks, such as catching common misspellings and avoiding the pitfalls of replacing data without context.
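
Walsh works in Excel, but the same thinking carries over to any tool. The sketch below is my own illustration, with made-up category names, of correcting known misspellings against an explicit mapping rather than running a blind find-and-replace, so every change is deliberate and reviewable.

```python
import pandas as pd

# Made-up spend categories containing common misspellings.
df = pd.DataFrame({
    "category": ["Stationery", "Stationary", "IT Hardware", "IT Hardwear"],
})

# Correct known misspellings against an explicit mapping rather than a
# blind find-and-replace, so every change is visible and reviewable.
corrections = {"Stationary": "Stationery", "IT Hardwear": "IT Hardware"}
df["category"] = df["category"].replace(corrections)

print(df)
```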

Walsh also highlights the importance of regularly cleaning up your data. If quality is regularly checked, then cleanup is a relatively small task. If neglected, however, the task can become huge. The longer it’s left, the bigger the effect on the business, too. What dodgy decisions might be based on false information in your organization?
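
As a rough idea of what a small, repeatable check could look like (again, my own sketch rather than a method from the book), a few lines of pandas can count duplicates, missing values, and unexpected categories each time new data arrive. The column name and expected values here are hypothetical.

```python
import pandas as pd

def quality_report(df, column, expected_values):
    """A small, repeatable health check to run whenever new data arrive."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df[column].isna().sum()),
        "unexpected_values": sorted(set(df[column].dropna()) - set(expected_values)),
    }

# Hypothetical data and expected categories, for illustration only.
df = pd.DataFrame({"category": ["Stationery", None, "Stationery", "Misc??"]})
print(quality_report(df, "category", {"Stationery", "IT Hardware"}))
```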

Who can benefit from reading this book?

None of the techniques and methods shared by Walsh is particularly complex. Most people working with data should have the skills to execute the advice and methods in this book. By illuminating the issues, Walsh hopefully will motivate more people to take an interest in ensuring their data are cleaned on a regular basis.

If you have limited experience in managing and maintaining data, then this book is for you. If you work in finance, procurement, or marketing, you deal with data daily but may not have the technical knowledge of a data team. For this reason, Walsh’s Excel tips are relatable and easy to implement. So almost anyone can improve the quality of their data using this book as a prompt or guide.

I’d be betraying confidences if I shared specific examples, but I too have seen big organizations make costly mistakes due to data errors. These days, it feels like all the focus on advanced analytics and data science is making such neglect even more likely. Thank goodness for the rise of DataOps as a topic. Hopefully, these techniques and a growing legion of chief data officers (CDOs) can ensure dirty data are tackled quickly and often. As leaders in data, analytics, and insight, let’s resolve to continually question the quality of our data. They are the very ground we stand on.

About The Author

Tristan Mobbs

As a true data translator, Tristan Mobbs excels at giving data meaning. Mobbs's goal is to ensure that analytics deliver business results.