Davis Balestracci

Six Sigma

Bears Repeating: Given Two Different Numbers...

Is it statistically significant? Who cares?

Published: Tuesday, February 17, 2015 - 12:17

I have evolved to using fewer, simpler tools in my consulting and have never been more effective, as I commented upon in my last column. It made me ponder the relevance of much of what I learned in my master’s statistics program. Thinking of the most basic concepts, I decided to look up what the American Society for Quality considers the (Six Sigma) Green Belt body of knowledge. If you click on its link, I want to draw your particular attention to: “III. Six Sigma—Measure (B, C, D),” “IV. Six Sigma—Analyze (A. B),” and “V. Six Sigma—Improve & Control (A, B).”

In the foreword to Quality Improvement Through Planned Experimentation, by Ronald Moen, Thomas Nolan, and Lloyd Provost (McGraw-Hill, 2012)—which I believe is the best book on industrial design of experiments—Deming himself writes:

“Unfortunately, the statistical methods in textbooks and in the classroom do not tell the student that the problem in the use of data is prediction. What the student learns is how to calculate a variety of tests (t-test, F-test, chi-square, goodness of fit, etc.) in order to announce that the difference between the two methods or treatments is either significant or not significant. Unfortunately, such calculations are a mere formality. Significance or the lack of it provides no degree of belief—high, moderate, or low—about prediction of performance in the future, which is the only reason to carry out the comparison, test, or experiment in the first place.”

Some of that Green Belt stuff occasionally comes in handy, but it always comes down to “plot the dots”

Wichita State University published an article in 2010 titled, “Airline performance improves; one of best years ever, according to Airline Quality Rating.” The airline industry had improved in three of the four major elements of the Airline Quality Rating (AQR): on-time performance, baggage handling, and customer complaints.

Several professors pored over some data to conclude, in essence, that some numbers were bigger than others. I found the latest report, “Airline Quality Rating 2014,” which compares 2013 to 2012. It’s authored by Brent D. Bowen from Embry-Riddle Aeronautical University–Prescott, and Dean E. Headley from Wichita State University.

I am going to show charts comparing three cases from Bowen and Headley's analyses, starting with the overall annual quality rating score, followed by one case where their conclusion about a difference seemed to be correct, and one case where their alleged difference did not exist.

Here are their high-level conclusions taken directly from the report:

As shown below, the AQR is an overall weighted composite score of the other components mentioned in the conclusion:
OT = on-time, DB = denied boarding, MB = mishandled baggage, CC = customer complaints

This score is an example of a weighted aggregate of several factors. Some high-level indicators for balanced scorecards (a/k/a analytics) are created like this. They can be useful, but are always tricky because the individual elements being combined to make up the score can exhibit special causes.
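As a sketch of how such a weighted composite works, consider the function below. The weights are hypothetical placeholders chosen only to illustrate the mechanics (a positive weight rewards a factor, a negative weight penalizes it); they are not the AQR's actual weights.

```python
# Sketch of a weighted composite score like the AQR.
# The weights here are hypothetical placeholders, not the actual AQR weights.
def composite_score(on_time, denied_boarding, mishandled_bags, complaints,
                    weights=(8.0, -8.0, -8.0, -7.0)):
    """Weighted aggregate: positive weights reward, negative weights penalize.

    The result is normalized by the sum of the absolute weights.
    """
    w_ot, w_db, w_mb, w_cc = weights
    total = sum(abs(w) for w in weights)
    return (w_ot * on_time + w_db * denied_boarding
            + w_mb * mishandled_bags + w_cc * complaints) / total

# Illustrative inputs: on-time fraction, denied-boarding rate,
# mishandled bags per 1,000 passengers, complaints per 1,000 passengers
score = composite_score(0.80, 0.0009, 3.2, 0.012)
```

Because the penalty terms (bags, complaints) are on much larger scales than the reward term, the composite comes out negative, as the published AQR scores do. The tricky part the text warns about is exactly this mixing: a special cause in any one component propagates into the aggregate, where it is harder to diagnose.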

Here are two displays of the overall industry AQR data, which were taken from the report:
• A simple line graph for the 24 months, including the trend line
• A month-by-month bar graph comparison of each of the two years’ results.

Do you agree with their conclusion of improvement in 2013?

These two graphs were included in the report for each of the 16 individual airlines as well, which made up more than half of the report. The report also contained, for each airline, two years of monthly data for the on-time performance, mishandled baggage, and customer complaints (along with the rate of people being denied boarding—but those data were calculated quarterly). Other than the two types of graphs above, there wasn’t another graph in sight. The emphasis was on treating the difference between the 2012 average and the 2013 average for every indicator as a special cause needing explanation.

Also note that one airline (United) had the biggest improvement from 2012 to 2013, and one other airline (AirTran) had the largest decline. Wait a minute—given a set of 16 numbers (in this case, the 2012–2013 differences in AQR), isn't one going to be the largest and one the smallest? And the point is...? Special cause, or a lottery?

I started a cursory analysis by simply plotting the 24 months of overall industry AQR as a control chart:

2013: -1.03 vs. 2012: -1.11
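For readers who want to reproduce this kind of chart, here is a minimal sketch of computing individuals (XmR) control chart limits from a monthly series. The values below are illustrative stand-ins, not the report's actual AQR figures.

```python
# Minimal individuals (XmR) chart limits from a series of monthly values.
# The data below are illustrative, not the report's actual AQR figures.
def xmr_limits(values):
    n = len(values)
    mean = sum(values) / n
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, n)]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 = 3 / d2 for moving ranges of two consecutive points (d2 = 1.128)
    return mean - 2.66 * mr_bar, mean, mean + 2.66 * mr_bar

monthly = [-1.1, -1.2, -1.0, -1.15, -1.05, -1.3, -0.9, -1.2,
           -1.1, -1.0, -1.25, -1.05]
lcl, center, ucl = xmr_limits(monthly)
out_of_control = [v for v in monthly if v < lcl or v > ucl]
```

With these made-up values, every point falls inside the limits, i.e., the month-to-month wiggles are common cause, which is the "plot the dots" question the bar graphs never ask.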

I also did some higher-powered analysis using linear regression, assuming a simple seasonal model. The year-to-year difference doesn't seem to be significant, i.e., common cause. For those of you who care about such things, the p-value was ~0.163, and as in all of my subsequent analyses, it passed all appropriate regression analysis diagnostics.
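A minimal sketch of that style of regression: a 0/1 year indicator plus monthly dummy variables, fit by least squares. The 24 monthly values here are synthetic (seasonal pattern plus noise, no true year effect), standing in for the report's actual data.

```python
import numpy as np

# Sketch: test a year-to-year shift after adjusting for monthly seasonality.
# Synthetic data stand in for the report's 24 monthly AQR values.
rng = np.random.default_rng(0)
season = np.tile(np.sin(np.linspace(0, 2 * np.pi, 12, endpoint=False)), 2)
y = -1.1 + 0.05 * season + rng.normal(0, 0.05, 24)

months = np.tile(np.arange(12), 2)   # month index, repeated for both years
year = np.repeat([0, 1], 12)         # 0 = first year, 1 = second year

# Design matrix: intercept, year indicator, dummies for months 1..11
# (month 0 is the baseline)
X = np.column_stack([np.ones(24), year] +
                    [(months == m).astype(float) for m in range(1, 12)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
year_effect = beta[1]  # estimated year-2-minus-year-1 shift, net of seasonality
```

The coefficient on the year indicator is the seasonally adjusted year-to-year difference; in a full analysis its standard error and t-statistic would give the p-value the column quotes. With no true year effect in these synthetic data, the estimate is just noise.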

Their conclusion above treats the difference between -1.11 and -1.30 as special cause.

For on-time performance, my initial chart was:

2013: 0.784 vs. 2012: 0.819

Using the same simple analysis as above, there seemed to be evidence of a year-to-year difference (p ≈ 0.014). After adjusting for the model, the December 2013 result is not the special cause it appears to be on the chart above. So in this case their conclusion was correct. Good luck, perhaps?

The initial chart for mishandled baggage was:

2013: 3.19 vs. 2012: 3.08

A moving range chart confirmed my initial suspicion: the November to December moving ranges for both years were special causes indicating seasonality. After I applied the simple seasonality model, the year seems to break up into four “chunks”: January–May, June–August, September–November, and December.
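A sketch of how such special-cause moving ranges can be flagged: the upper control limit for a moving range of two consecutive points is 3.268 times the average moving range. The monthly rates below are illustrative, not the report's baggage data.

```python
# Sketch: flag special-cause moving ranges on a moving range chart.
# UCL for a moving range of n=2 is 3.268 * the average moving range.
# The rates below are illustrative, not the report's baggage data.
def special_cause_ranges(values):
    mrs = [abs(b - a) for a, b in zip(values, values[1:])]
    ucl = 3.268 * (sum(mrs) / len(mrs))
    # (i + 1, mr): index of the later point in each flagged transition
    return [(i + 1, mr) for i, mr in enumerate(mrs) if mr > ucl]

rates = [3.0, 3.1, 2.9, 3.0, 3.2, 3.1, 3.0, 2.9, 3.1, 3.0, 3.1, 5.0]
flags = special_cause_ranges(rates)  # the November-to-December jump is flagged
```

A flagged November-to-December moving range in both years is exactly the kind of repeating signal that points to seasonality rather than a one-off special cause.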

Because this analysis comes to a different conclusion than that of the report, I wanted to show the chart with the seasonality put in:

If you compare the two years’ “needles,” there is no difference (once again, for those of you who care, p ≈ 0.23).

In this case, their conclusion treats the difference between 3.19 and 3.08 as special cause when it is common cause.

Isn’t it amazing how non-random randomness can look?

My point for today is that it’s amazing how non-random randomness can look—and those bar graphs and trend lines are quite pretty, aren’t they? They put you at the mercy of the “human variation” in perception of the people in the room. Any differences are treated as special causes needing explanation, and people have no trouble finding them. As Deming used to say, “Off to the Milky Way!” How many meetings do you attend with data presented this way?

It reminds me of one of my favorite quotes: “When I die, let it be in a meeting. The transition from life to death will be barely perceptible.”

One of my charts did more than the work of two of their graphs—and would certainly lead to different, more productive conversations.

Help people resist their initial tendency to “explain” any differences in tables of numbers. New, unprecedented results will require new conversations.


About The Author

Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.



Comments
Do you suppose they got what they were looking for, rather than what the data actually say? A set of numbers can be manipulated to show whatever you are looking for.

Great, great example, Davis!

It remains as true as ever... any fool can make a trend out of two points! What does it say about those who manage using those "trends"?