



Published: 03/09/2022
All too often, the topic of fixing dirty data is neglected in the flood of online media covering artificial intelligence (AI), data science, and analytics. That neglect is a mistake for many reasons.
To highlight just one: confidence in the quality of data is the vital foundation of all analysis. This holds at every level of complexity, from spreadsheets to machine-learning models.
So, I was delighted to review Susan Walsh’s book, Between the Spreadsheets: Classifying and Fixing Dirty Data (Facet Publishing, 2021). Here are some highlights from her book, and my own advice on who should read it.
Between the Spreadsheets covers a topic that gets less daylight amid the glamour of AI and machine learning. Having followed Walsh for a while on LinkedIn, I've enjoyed how she highlights the benefits of making sure your data have their COAT on (consistent, organized, accurate, trustworthy), and how she takes the time to explain the challenges associated with poor data quality, as well as its painful real-world consequences.
Exploring dirty data in the world of procurement, Walsh looks at spend data classification and provides real examples of how she would go about validating and sorting out the dirty data.
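To give a flavor of what spend classification involves (my own minimal sketch in Python/pandas, not Walsh's method; the supplier names and categories are hypothetical): normalize the messy supplier field first, then map it to a category via a keyword lookup, flagging anything unmatched for manual review.

```python
import pandas as pd

# Hypothetical spend records with inconsistent supplier names
spend = pd.DataFrame({
    "supplier": ["Acme Ltd", "ACME LIMITED", "acme ltd.", "Globex Corp", "globex"],
    "amount": [1200.00, 350.50, 99.99, 4500.00, 210.00],
})

# Normalize before classifying: lowercase, strip punctuation and whitespace
spend["supplier_clean"] = (
    spend["supplier"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
)

# Hypothetical keyword-to-category lookup
categories = {"acme": "Office Supplies", "globex": "IT Services"}

def classify(name: str) -> str:
    """Return the first category whose keyword appears in the name."""
    for keyword, category in categories.items():
        if keyword in name:
            return category
    return "Unclassified"  # flag for manual review

spend["category"] = spend["supplier_clean"].apply(classify)
print(spend[["supplier", "amount", "category"]])
```

The point of the normalization step is that "Acme Ltd", "ACME LIMITED", and "acme ltd." are the same supplier; until they look the same, any spend totals by supplier are wrong.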
Data quality and data validation are often unloved topics in the world of data. But they’re crucial. Anyone working in the world of data will have encountered issues that had consequences for the company or people involved. In the energy industry, I often saw customers being billed who weren’t on supply with the company I worked for. It gets worse when the debt collectors are about to be unleashed.
There are many examples where data errors cause major issues, from NASA losing its $125 million Mars orbiter due to a metric conversion error to a Yorkshire town not paying its gas bill for 17 years. The impact ranges from looking a bit foolish to something extremely costly. Between the Spreadsheets shows the importance of these topics, and offers practical steps to ensure your data have their COAT on.
Cleaning data can be tedious, but Walsh’s practical examples guide you through a process that makes it a little less painful. Between the benefits it lays out and the horror stories it shares, this book will motivate you to get on with it and clean up your data.
We can all spot errors and clean up our data with Walsh’s guidance along with our own methods and techniques. Often, it can be as simple as sorting the data, as a schoolboy found out when he corrected NASA (yes, them again). Rocket science is easy compared to data validation, apparently.
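As a trivial illustration of that point (my example, not the book's): sorting a column pushes the suspicious extremes to the ends, where they stand out.

```python
import pandas as pd

# Hypothetical sensor readings; the bad values are easy to miss unsorted
readings = pd.Series([22.1, 21.8, -999.0, 23.0, 2210.0, 22.4])

# Sorting surfaces the outliers at either end of the output
print(readings.sort_values())
# -999.0 looks like a sentinel for "missing"; 2210.0 looks like a
# misplaced decimal point (22.10?) — both merit investigation.
```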
Throughout her book, Walsh provides guidance on how to clean up your data in Excel. She shares tips and tricks, such as how to spot common misspellings and how to avoid the pitfalls of replacing data without considering its context.
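Walsh's examples are in Excel; to show the same pitfall in code (my sketch, not hers, with made-up city names): a blind substring replace can corrupt values that were already correct, so it's safer to constrain corrections to an explicit lookup of known misspellings applied to whole values.

```python
import pandas as pd

cities = pd.Series(["Luton", "Lond", "London", "Londn", "Lutton"])

# Risky: a blind substring replace also mangles the correct value —
# cities.str.replace("Lond", "London") would turn "London" into "Londonon"

# Safer: an explicit lookup of known misspellings, matched whole-value only
fixes = {"Lond": "London", "Londn": "London", "Lutton": "Luton"}
print(cities.replace(fixes))  # only exact matches are corrected
```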
Walsh also highlights the importance of regularly cleaning up your data. If quality is regularly checked, then cleanup is a relatively small task. If neglected, however, the task can become huge. The longer it’s left, the bigger the effect on the business, too. What dodgy decisions might be based on false information in your organization?
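One way to make that regular check cheap (again my own sketch, assuming pandas, not a method from the book): a small profiling function run on every data refresh, so drift is caught while the fix is still small.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts of nulls and distinct values; also prints duplicates."""
    report = pd.DataFrame({
        "nulls": df.isna().sum(),
        "distinct": df.nunique(),
    })
    print(f"duplicate rows: {df.duplicated().sum()}")
    return report

# Hypothetical usage: rising null or duplicate counts between refreshes
# are an early warning that the data's COAT is slipping off
df = pd.DataFrame({"id": [1, 2, 2], "name": ["Ann", "Bob", "Bob"]})
print(quality_report(df))
```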
None of the techniques and methods shared by Walsh is particularly complex. Most people working with data should have the skills to execute the advice and methods in this book. By illuminating the issues, Walsh hopefully will motivate more people to take an interest in ensuring their data are cleaned on a regular basis.
If you have limited experience in managing and maintaining data, then this book is for you. If you work in finance, procurement, or marketing, you deal with data daily but may not have the technical knowledge of a data team. For this reason, Walsh’s Excel tips are relatable and easy to implement. So almost anyone can improve the quality of their data using this book as a prompt or guide.
I’d be betraying confidences if I shared specific examples, but I too have seen big organizations make costly mistakes due to data errors. These days, it feels like all the focus on advanced analytics and data science is making such neglect even more likely. Thank goodness for the rise of DataOps as a topic. Hopefully, these techniques and a growing legion of chief data officers (CDOs) can ensure dirty data are tackled quickly and often. As leaders in data, analytics, and insight, let’s resolve to continually question the quality of our data. They are the very ground we stand on.
Links:
[1] https://www.linkedin.com/in/susanewalsh/
[2] https://www.amazon.com/Between-Spreadsheets-Classifying-Fixing-Dirty-ebook/dp/B094ZHDX57
[3] https://www.customerinsightleader.com/others/data-quality-reporting-1/
[4] https://www.customerinsightleader.com/books/how-all-leaders-can-learn-that-data-means-business/
[5] https://www.latimes.com/archives/la-xpm-1999-oct-01-mn-17288-story.html
[6] https://www.yorkshirepost.co.uk/news/people/beverley-town-council-found-to-have-not-paid-gas-bill-for-17-years-3523528
[7] https://www.bbc.co.uk/news/uk-39351833
[8] https://www.customerinsightleader.com/events/gdpr-requires-data-quality/
[9] https://www.customerinsightleader.com/opinion/3-tips-maximising-data-team/
[10] https://www.customerinsightleader.com/books/getting-practical-with-dataops/