Donald J. Wheeler

Analyzing Observational Data

Shewhart’s economic operation is the starting point for lean production

Published: Monday, February 7, 2022 - 12:03

Most of the world’s data are obtained as byproducts of operations. These observational data track what happens over time and have a structure that requires a different approach to analysis than that used for experimental data. An understanding of this approach will reveal how Shewhart’s generic, three-sigma limits are sufficient to define economic operation for all types of observational data.

Management requires prediction, yet all data are historical. To use historical data to make predictions, we will have to use some sort of extrapolation. We might extrapolate from the product we have measured to a product not measured, or we might even extrapolate from the product measured to a product not yet made. Either way, the problem of prediction requires that we know when these extrapolations are reasonable and when they are not.

The structure of observational data

Before we talk about prediction, we need to consider the structure of observational data. For any one product characteristic we can usually list dozens, or even hundreds, of cause-and-effect relationships that affect that characteristic. Some of these causes will have larger effects than others. So, if we had perfect knowledge, we could arrange the causes in order of the size of their effects to obtain a Pareto chart like the one in Figure 1.

Figure 1: A typical cause-and-effect Pareto

This gives us a model for the structure of the variation for a given product characteristic: A critical few causes have dominant effects, and the many other causes have trivial effects. In production it is never going to be economical to control all of the causes, so we seek to identify and control the critical few.

Economic operation

Given the model in Figure 1, economic production would require that we control the first four factors. Attempting to control more factors would involve diminishing returns, and controlling fewer would result in excess variation in production.

Figure 2: Economic operation requires a balance

Figure 2 defines economic operation where all of the causes with large effects are controlled and the remaining causes with trivial effects are allowed to vary. The levels we choose for each of the controlled factors will determine the average value for the process outcomes. By holding these factors constant at the chosen levels, we will also eliminate them as a source of variation.

As a result, virtually all of the variation in the product stream will come from the remaining, uncontrolled factors (causes 5 through 20 in Figure 2). Here, the process variation will be the result of a large number of cause-and-effect relationships where no one cause has an effect that dominates the effects of the other causes. If these causes are independent of each other, then their variation will combine in an additive manner and the central limit theorem assures us that the histogram will be approximately normal.

Figure 3: Economic operation results in approximate normality

We have known about the central limit theorem for more than 200 years. Last month’s column, “How Can the Sum of Skewed Variables Be Normally Distributed?” illustrated this fact of life. When we are operating economically, the effects of the uncontrolled causes will sum up to produce a bell-shaped histogram.
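As a rough illustration of this summing-up, the sketch below simulates product values as sums of independent, skewed effects and compares the skewness of the resulting histograms. The exponential effect model, the sample sizes, and the function names are assumptions made for illustration; they are not taken from Figure 2 or from the earlier column.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric histogram)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def simulate_outcomes(n_causes, n_items, seed=0):
    """Each item is the sum of n_causes independent exponential effects.

    The exponential model is an assumption: it is simply a convenient,
    strongly skewed stand-in for one uncontrolled cause.
    """
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(n_causes))
        for _ in range(n_items)
    ]


# One skewed cause acting alone versus 16 comparable causes
# (causes 5 through 20 make 16 uncontrolled causes).
skew_single = skewness(simulate_outcomes(1, 20000))
skew_combined = skewness(simulate_outcomes(16, 20000))

print(f"skewness with 1 cause:   {skew_single:.2f}")
print(f"skewness with 16 causes: {skew_combined:.2f}")
```

Under this assumed model the theoretical skewness of a sum of n such effects is 2 divided by the square root of n, so it falls from 2 for a single cause to 0.5 for 16 causes and keeps shrinking as more small causes contribute.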

But the effects of the trivial many do not have to combine as a sum. For a given mean and variance, the normal distribution is the distribution of maximum entropy, and any operation that increases entropy will eventually produce a histogram that approaches a normal distribution. So, regardless of how the effects of the trivial many combine, economic operation will result in a histogram that is approximately normal.

In his 1931 book Economic Control of Manufactured Product, Shewhart described a “state of control” (a predictable process) as being one where the variation can be described as the result of a large number of uncontrolled causes where no one cause has a predominant effect (as in Figure 2).

Shewhart also defined a “state of maximum control” as one where the histogram is approximately normal, as in Figure 3. Thus, Figures 2 and 3 provide useful models for what Shewhart defined as economic operation.

With this conceptual model of what economic operation should look like, we are ready to begin to analyze observational data. We know what we are looking for, and we can begin to compare what we find with the conceptual model. However, we need to first consider what happens when a process is not operated economically.

What usually happens

We never have the perfect knowledge assumed by Figures 1, 2, and 3. While we do our best to identify and control all of the causes with dominant effects, it is difficult to find them all when the number of known causes may be in the hundreds. As a result, we will usually unknowingly end up with one or more dominant causes remaining in the uncontrolled set.

Figure 4: Cause-and-effect Pareto for an unpredictable process

When the set of uncontrolled causes contains one or more causes with dominant effects, everything changes. First, the set of controlled causes will only partially determine the process average. This happens because the large effect of Factor 4 means that it can take the whole process histogram on walkabout. As Factor 4 varies, you can count on having unexpected changes in the process average from time to time.

Second, as Factor 4 takes the process on walkabout, the operators may start adjusting Factors 1, 2, and 3 to keep the process on-target. This will create further variation.

Third, as Factor 4 and the operators take the process on walkabout, the variation from all the other uncontrolled causes will be taken along, making the histogram even fatter. The result for one such process is shown in Figure 5.

Figure 5: The effect of an uncontrolled dominant cause

Thus, the impact of having a dominant cause in the set of uncontrolled factors will inevitably be increased variation in the product stream. This increased variation can result in substantial excess costs of production and use.
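To put rough numbers on this increase, the sketch below assumes that the uncontrolled causes act independently and additively, so that their variances add. The effect sizes are hypothetical, chosen only to mimic a Pareto like Figure 4, and the sketch ignores the extra variation that operator adjustments would contribute.

```python
import math


def combined_sd(effect_sds):
    """Standard deviation of a sum of independent effects (variances add)."""
    return math.sqrt(sum(sd ** 2 for sd in effect_sds))


# Hypothetical effect sizes, in arbitrary measurement units.
trivial_many = [1.0] * 16   # causes 5 through 20, each with a small effect
dominant_effect = 6.0       # an uncontrolled dominant cause such as Factor 4

routine_sd = combined_sd(trivial_many)
inflated_sd = combined_sd(trivial_many + [dominant_effect])

print(f"routine variation only:    sd = {routine_sd:.2f}")
print(f"with the dominant cause:   sd = {inflated_sd:.2f}")
print(f"inflation factor:          {inflated_sd / routine_sd:.2f}")
```

With these assumed values a single dominant cause contributes more variance than all 16 trivial causes combined, which is why bringing it into the controlled set removes such a large share of the variation.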

We can only reverse the situation shown in Figure 5 by identifying the assignable causes that have dominant effects (Factor 4 in Figure 4) and making them part of the set of controlled inputs for the process. Thus, the problem of analysis for observational data is the problem of detecting the dominant cause-and-effect relationships that remain in the set of uncontrolled factors. And this is precisely what Shewhart’s process behavior charts allow us to do.

Shewhart’s approach

A process behavior chart makes no assumptions about your process. It makes no assumptions about the data that characterize the product stream. It simply compares the variation found in your observational data with the generic limits we expect when your process is operated in an economic equilibrium.

Figure 6: We characterize a process by comparing performance with potential

To get limits that approximate the process potential, we use an appropriate within-subgroup measure of dispersion to capture the routine variation inherent in the process. (In the case of individual values, this will be either the average or the median of the set of two-point moving ranges.)

We then use this within-subgroup dispersion to compute generic, fixed-width, three-sigma limits centered on the average. Both theory and 100 years of practice have shown that these limits approximate what the process can do when it is operated in a state of economic equilibrium.
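For individual values, the computation can be sketched as follows. The article does not state the scaling factors; 2.660 (for the average moving range) and 3.145 (for the median moving range) are the conventional constants for a chart of individual values, and the function name and data are hypothetical.

```python
import statistics


def xmr_limits(values, use_median=False):
    """Return (lower limit, central line, upper limit) for individual values.

    The limits are three-sigma limits centered on the average, with sigma
    estimated from the two-point moving ranges rather than from a global
    standard deviation.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = statistics.fmean(values)
    if use_median:
        half_width = 3.145 * statistics.median(moving_ranges)
    else:
        half_width = 2.660 * statistics.fmean(moving_ranges)
    return center - half_width, center, center + half_width


# Hypothetical measurements in time order.
data = [10, 12, 11, 13, 12, 11, 10, 12]
lower, center, upper = xmr_limits(data)
lower_med, _, upper_med = xmr_limits(data, use_median=True)
print(f"average moving range: {lower:.3f} to {upper:.3f} around {center:.3f}")
print(f"median moving range:  {lower_med:.3f} to {upper_med:.3f}")
```

Because the moving ranges compare only successive values, an occasional shift in the process has far less influence on these limits than it would on a standard deviation computed from all the data at once.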

If (as in Figure 2) the process variation is the result of a large number of causes where no one cause has a predominant effect, then the process performance should fall within the generic limits defining the process potential.

Figure 7: How process potential works

However, as in Figure 7, if one or more assignable causes with dominant effects are present, the running record is likely to go outside the limits. When this happens, we have strong evidence that an assignable cause is present.
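A minimal sketch of this comparison, assuming a chart for individual values with limits based on the average moving range and the conventional 2.660 scaling factor: it reports the positions of any values that fall outside the limits. The running record is made up for illustration, and only the points-outside-the-limits signal is checked.

```python
import statistics


def points_outside_limits(values):
    """Return the zero-based positions of values outside the three-sigma limits."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = statistics.fmean(values)
    half_width = 2.660 * statistics.fmean(moving_ranges)
    lower, upper = center - half_width, center + half_width
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical running record; the eleventh value reflects an upset.
running_record = [50, 52, 49, 51, 50, 48, 51, 50, 52, 49, 62, 50, 51, 49, 50]

signals = points_outside_limits(running_record)
quiet_signals = points_outside_limits(running_record[:10])
print("signals in full record:", signals)
print("signals in first ten values:", quiet_signals)
```

The flagged position tells us when to look for an assignable cause; it does not by itself say what that cause is.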

Predictable operation

When a process is operated predictably, it is being operated up to its full potential. It will have the minimum variance that is consistent with economic operation. This is why seeking to control a common cause of routine variation is a low-payback strategy.

When a process has been operated predictably in the past, then it is logical to expect that it will continue to be operated predictably in the future. The past behavior forms a reasonable basis for predictions of what will be.

Unpredictable operation

If the process shows evidence of assignable causes, it will behave unpredictably. Such a process will be operating at less than its full potential, and it will almost always be economical to make the assignable causes part of the set of controlled factors.

This is a high-payback strategy for two reasons. When we make an assignable cause part of the group of controlled factors, we not only gain an additional process input to use in adjusting the process average, but we also remove a large chunk of variation from the product stream. In this way, even though the process may have been unpredictable in the past, we learn how to improve the process and come closer to operating it predictably and on-target in the future.

An unpredictable process is not going to spontaneously begin to be operated predictably in the future. Assignable causes will continue to take our process on walkabout, and no computation will let us predict what our unpredictable process will actually produce.

Figure 8: Assignable causes belong in the set of controlled factors


When analyzing observational data, we have to focus on how the data vary. However, this focus has nothing to do with the shape of the histogram. We examine how the data vary over time to characterize the data as either predictable or unpredictable. These two broad categories suffice to tell us how to use the data. With a predictable process, prediction is feasible; with an unpredictable process, prediction is futile but action may be taken to move the process closer to its full potential.

The generic, three-sigma limits of a process behavior chart define this broad model of predictable and economic behavior. When assignable causes are present they will disrupt the process and take the data outside these limits.

Because assignable causes can affect the histogram in various ways, it is always a mistake to start your analysis of observational data by looking at the histogram. Until you know whether your process is being operated predictably or unpredictably, you will not know how to interpret the histogram.

So, the first question in the analysis of observational data is the question of predictability. When the process is operated predictably, the data will be homogeneous and the histogram will tend to be normally distributed due to the high-entropy condition known as routine variation.

When the process is operated unpredictably, the data will not be homogeneous and the histogram will be a smash-up of different conditions resulting in all kinds of different shapes. So, with observational data, an irregular or skewed histogram is more likely to be the result of some assignable cause taking the process on walkabout than anything else.
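The following sketch illustrates how such a smash-up can arise. It assumes a process that runs at one average most of the time and at a shifted average for the rest, with symmetric routine variation in both conditions; the averages, the proportions, and the normal model are all assumptions made for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric histogram)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


rng = random.Random(1)

# Two hypothetical operating conditions, each symmetric on its own.
condition_a = [rng.gauss(10.0, 1.0) for _ in range(8000)]  # usual average
condition_b = [rng.gauss(14.0, 1.0) for _ in range(2000)]  # after a shift

# The histogram of all the data pools both conditions together.
pooled = condition_a + condition_b

skew_a = skewness(condition_a)
skew_b = skewness(condition_b)
skew_pooled = skewness(pooled)
print(f"condition A skewness: {skew_a:.2f}")
print(f"condition B skewness: {skew_b:.2f}")
print(f"pooled skewness:      {skew_pooled:.2f}")
```

Neither condition is skewed on its own, yet the pooled histogram is, so the skewness here describes the shift between conditions rather than any property of the routine variation.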

This is why you do not need to “prequalify” your data. You do not need to fit a probability model to your data. Neither should you place your data on a normal probability plot. And you certainly do not need to transform your data to make them “more normal.” All of these prequalification activities assume the process is already being operated predictably. Assignable causes completely undermine this assumption, making these activities nonsense.

So regardless of what your histogram may look like, put your data on a suitable process behavior chart and characterize your process behavior as predictable or unpredictable. Use the data to make predictions for your predictable processes, and use the chart to look for the assignable causes that are taking your unpredictable processes on walkabout.

Learn how to operate your processes economically and you will improve quality, productivity, and competitive position. In their landmark book The Machine That Changed the World, Womack, Jones, and Roos explicitly and repeatedly state that lean production is predicated upon having predictable processes. Without the foundation of predictability, all you have is wishful thinking.


About The Author

Donald J. Wheeler

Dr. Wheeler is a fellow of both the American Statistical Association and the American Society for Quality who has taught more than 1,000 seminars in 17 countries on six continents. He welcomes your questions; you can contact him at djwheeler@spcpress.com.