Quality Transformation With David Schwinn

A History of Statistics

Including the dangers of some contemporary applications of big data

Published: Thursday, April 13, 2017 - 11:02

This month’s column comes from a convergence of finishing my article, “Statistical Thinking for OD Professionals,” for the OD Practitioner, and reading “How Statistics Lost their Power—and Why We Should Fear What Comes Next” in the Guardian, and Weapons of Math Destruction by Cathy O’Neil (Crown Publishing Group, 2016). I modestly titled this column “A History of Statistics,” even though I have only six credit hours of statistics education. A better title might be “Dave’s Pretty True Story of the Evolution of Statistics.” Here goes.

The Guardian article led me to believe that once upon a time, rulers of nations wanted to know what was going on in their countries. When they looked around and sent out scouts, they got very different observations. They had trouble making sense of so many different views of the same country. They wanted a single “objective” view of the whole. Measuring and counting things seemed to be a good way to establish “objectivity.” Mathematicians proposed getting a single view by finding a mathematical central tendency, such as an average.

Those mathematicians went one step further. They also proposed understanding how much things varied. As they studied that variation, they noticed that things frequently varied in a way that could be graphically depicted by a bell-shaped curve. They logically called this a “normal distribution.” They then needed a way to calculate how much things varied. Along came the range. When they noticed that some distributions were not normally distributed, they concluded that maybe the range was too crude a measure. They came up with the standard deviation and noticed that almost everything they attempted to measure or count fell within plus or minus three standard deviations of the average. Statistics was born!
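For readers who want to see the arithmetic, here is a minimal Python sketch of that three-standard-deviation idea. The measurements are simulated, and the numbers are invented purely for illustration; with bell-shaped data, roughly 99.7 percent of the values land within plus or minus three standard deviations of the average, while the range gives the cruder view of spread.

```python
# Illustration of the "plus or minus three standard deviations" idea,
# using simulated bell-shaped measurements (made-up data for demonstration).
import numpy as np

rng = np.random.default_rng(seed=1)
measurements = rng.normal(loc=100.0, scale=5.0, size=10_000)

mean = measurements.mean()
std = measurements.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Fraction of measurements inside the three-standard-deviation band
within = np.mean((measurements >= lower) & (measurements <= upper))
print(f"average = {mean:.2f}, standard deviation = {std:.2f}")
print(f"fraction within +/- 3 standard deviations: {within:.4f}")  # about 0.997
print(f"range (the cruder measure of spread): {measurements.max() - measurements.min():.2f}")
```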

While their statistical system was working pretty well for doing what it was designed to do—i.e., objectively describing a complex system like a country—they noticed that sometimes the stuff they were measuring or counting didn’t fall within their convenient six-standard-deviation spread. They decided to call these “outliers” and not to pay too much attention to them. After all, there were only a few of them at most.

The discipline of statistics was well on its way. Some mathematicians began calling themselves statisticians, and they began doing much more sensitive analysis of other distributions beyond the normal one that started the whole thing. They even figured out how to estimate central tendencies and variabilities based on just a sample of all the stuff that was to be analyzed. The next step along the way, however, took a real turn.

During the 1920s, Walter A. Shewhart, a physicist, wanted to use statistics to help Bell Labs control the quality of the products it was producing. Shewhart started tracking product measurements over time on a run chart. He noticed that sometimes things went haywire, and an adjustment would have to be made in the production process. One of the primary signals that things had gone haywire was that outliers would appear. Rather than ignore the outliers, Shewhart found that he needed to react to them to bring the process back to where it needed to be.

He also noticed that sometimes workers would make adjustments intended to improve the process that, in fact, made it worse. He decided that if he could find a balance between these two kinds of errors (adjusting a process when nothing significant had changed, and failing to adjust when something unusual happened that required an adjustment), he could make the best products with the least effort. He and many others began conducting experiments to figure out that balance. As a result of these experiments, Shewhart invented a new statistical tool, the control chart, and a new purpose for statistics. This new kind of statistics proved so valuable that a colleague of Shewhart’s, W. Edwards Deming, took it beyond the walls of Bell Labs.
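To make Shewhart's idea concrete, here is a simplified sketch of an individuals control chart in Python. The measurements are made up, and the sketch only illustrates the logic (a center line at the average, limits three standard deviations away, and a flag on any point outside them); it is not a full SPC implementation.

```python
# Simplified individuals control chart in the spirit of Shewhart's control chart:
# track measurements in time order and flag points beyond the control limits.
# (Made-up data; a sketch, not a complete SPC tool.)
import numpy as np

data = np.array([10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.1, 12.9, 10.0])

center = data.mean()
# Estimate routine process variation from the average moving range (d2 = 1.128 for n = 2).
moving_range = np.abs(np.diff(data))
sigma_hat = moving_range.mean() / 1.128
ucl = center + 3 * sigma_hat  # upper control limit
lcl = center - 3 * sigma_hat  # lower control limit

for i, x in enumerate(data, start=1):
    flag = "  <-- investigate: outside control limits" if (x > ucl or x < lcl) else ""
    print(f"point {i:2d}: {x:5.1f}{flag}")
print(f"center = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
```

Run as written, the ninth point falls outside the limits and gets flagged, while the routine ups and downs do not, which is exactly the balance between overreacting and underreacting that the paragraph above describes.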

During World War II, Deming and others taught Shewhart’s techniques to the folks producing weapons for the war effort in America. They significantly helped improve the quality of American armaments while helping keep the cost of scrap and rework down. Deming called this new kind of statistics “analytic studies,” as opposed to the descriptive and inferential studies that had been developed earlier to simply and objectively describe a system. The purpose of analytic studies was to guide decisions that would influence the future, a very different but very important purpose.

The next step in the evolution of statistics is called big data.

As we used statistics to describe and improve systems, we began gathering extra data just in case we needed them. Sometimes we did need them, but more often their mere existence tempted us into analyses that took us off track from our intended purpose. We decided to be more disciplined about when and where to analyze that “just in case” data, but we kept the data because we might still need them another time. Now big data is putting those data to work, because they sit in our computer systems and our social media. Big data taps into those data and, in some cases, creates new data. Although I am not a big data guy, Cathy O’Neil is.

O’Neil’s book, Weapons of Math Destruction, indicates that big data looks for patterns and correlations. Because big data is big—and fast—it can seemingly find cause-and-effect relations quickly and decisively. For example, if I want to increase sales at my restaurant, I can search the database of my customers for characteristics or behaviors that seem common among them. I might find that most of my customers live within 10 miles of my restaurant and make $40,000 to $60,000 a year. I can then focus my advertising and promotions on people with those characteristics. There are, however, some big problems with big data.
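As a rough illustration of that kind of search, here is a toy Python sketch over a handful of hypothetical customer records; the field names and figures are invented for the example.

```python
# Toy version of the restaurant example: look for characteristics that are
# common among existing customers. All records below are hypothetical.
customers = [
    {"miles_from_restaurant": 3.2, "annual_income": 52_000},
    {"miles_from_restaurant": 8.5, "annual_income": 47_000},
    {"miles_from_restaurant": 22.0, "annual_income": 95_000},
    {"miles_from_restaurant": 6.1, "annual_income": 58_000},
    {"miles_from_restaurant": 9.4, "annual_income": 44_000},
]

nearby = [c for c in customers if c["miles_from_restaurant"] <= 10]
mid_income = [c for c in customers if 40_000 <= c["annual_income"] <= 60_000]

print(f"{len(nearby)}/{len(customers)} customers live within 10 miles")
print(f"{len(mid_income)}/{len(customers)} customers earn $40,000 to $60,000")
# A pattern like this only describes who the customers are;
# it says nothing about *why* they became customers.
```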

In analytic studies, correlations like the one above are usually used to verify cause-and-effect relations that we already believe to be true. We all know about correlations that do not indicate cause-and-effect relations. There is the old example that the more ice cream cones people eat in New York, the higher the murder rate. That is a correlation, but not a cause-and-effect relationship. Big data establishes the correlations without necessarily establishing cause and effect.
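Here is a small Python sketch of that point using simulated data: two series that are both driven by daily temperature correlate strongly even though neither one causes the other.

```python
# Correlation without causation: both series below are driven by a third
# variable (temperature), so they correlate strongly even though neither
# causes the other. The numbers are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(seed=7)
temperature = rng.uniform(30, 95, size=365)  # daily temperature in degrees F

ice_cream_sales = 5.0 * temperature + rng.normal(0, 20, size=365)
incidents = 0.2 * temperature + rng.normal(0, 2, size=365)

r = np.corrcoef(ice_cream_sales, incidents)[0, 1]
print(f"correlation between ice cream sales and incidents: r = {r:.2f}")
# r comes out large, but cutting ice cream sales would not reduce incidents;
# temperature is the common driver behind both series.
```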

In analytic studies, we usually substitute scatter plots for correlation analyses because they are easier to understand. The analyses that big data conducts are usually opaque because the firms that use them consider their analysis techniques proprietary; they believe that secrecy gives them a strategic advantage over their competitors. Weapons of Math Destruction expands on this shortfall and explains others. It summarizes the dangers of big data as being caused by the following:

1. A big data model frequently does not clearly disclose its purpose, inputs, analysis techniques, outputs, and ability to test its own veracity and learn from that feedback. Clients and others usually know virtually nothing about how the big data supplier came up with its recommendation or how to test its success.
2. Big data sometimes harms the very people it is analyzing. A few examples will follow.
3. Big data is sometimes scalable in such a way that, although it may work in one application, it may cause major harm when applied to others.

One set of examples is how big data can use zip codes or other geographic descriptors to concentrate policing. That concentration results in more arrests, frequently for minor offenses, which send more citizens to jail, make it harder for them to get jobs, and thereby increase the crime rate. Big data has created similar “death spirals” around creditworthiness, employability, and educational success.

Big data has been used to rank colleges by “excellence.” We all know that excellence is an emergent property, one of Deming’s “most important” characteristics that are “unknown and unknowable.” Big data created surrogate metrics to simulate “excellence” without really knowing the quality of that simulation. The colleges figured out the algorithm and spent resources to game the system, knowing full well that there was no real cause-and-effect relationship between the things the money was spent on and the education the students got.

In one big data application, FICO credit scores were used as a surrogate for trustworthiness, a desired employability attribute. If a person had a low FICO score, perhaps because of some short-term medical or other personal emergency, they had trouble getting a job as a result of a big data model. If they couldn’t get a job, their FICO score declined... another death spiral.

Another example revolved around the “A Nation at Risk” studies criticizing K-12 teachers because U.S. SAT scores were going down. Big data failed to account for the fact that a higher proportion of students had begun taking the SAT. We are still paying a price for that big data error.

In Weapons of Math Destruction, O’Neil argues that, left unchecked, big data can increase inequality and even threaten our democracy. I think she may be right. We are the statistical experts. Even though the big data providers will likely tell us that their black box is proprietary and too complex for us to understand, we must tell them that if that is so, it may be too dangerous for us to use. We are all smart enough to understand it and, I hope, ethical enough to keep their analytical secrets.

I think big data can be a valuable tool when used with analytic studies to improve systems, but it needs to be transparent, to do no harm, and to be applied at an appropriate scale.

As always, I treasure your thoughts and questions.

About The Author

Quality Transformation With David Schwinn

David Schwinn, an associate of PQ Systems, is a full-time professor of management at Lansing Community College and a part-time consultant in the college’s Small Business and Technology Development Center. He is also a consultant in systems and organizational development with InGenius and INTERACT Associates.

Schwinn worked at Ford’s corporate quality office and collaborated with W. Edwards Deming from the early 1980s until Deming’s death. Schwinn is a professional engineer with an MBA from Wright State University. You can reach him at support@pqsystems.com.

 

Comments

Big Data

Hi David

Big data is definitely opaque, because we can see only the final analysis, not the underlying causes and situations in detail for ourselves. It reminds me of what the CEO of Boeing said: It is not that we should not have complex (big) data to deal with, but that complex data, when it fails, will fail in complex ways.

Big Data

Hello David,

Enjoyed your article, especially the section on Big Data.

 "A big data model frequently does not clearly disclose its purpose, inputs, analysis techniques, outputs, and ability to test its own veracity and learn from that feedback."

A big data model so described is not a scientific model.  Reminds me of Professor Michael Mann's "hockey stick" with results that have directed the careers and funds of so many.  And he fought tooth and nail to prevent verification of his data and methods, indeed is still fighting today.

William H. Pound, PhD