Davis Balestracci

Health Care

The Famous DOE Question

How many experiments should I run?

Published: Sunday, August 14, 2016 - 23:00

I hope this little diversion into design of experiments (DOE) that I’ve explored in my last few columns has helped clarify some things that may have been confusing. Even if you don’t use DOE, there are still some good lessons about understanding the ever-present, insidious, lurking cloud of variation.

Building on my June column, consider another of C. M. Hendrix’s “ways to mess up an experiment”: Insufficient data to average out random errors (aka, a failure to appreciate the toxic effect of variation).

This is where the issue of sample size comes in, and it’s by no means trivial.

How many experiments should I run? It depends.

How large an effect you can detect depends on your process’s standard deviation. In the simulated tar scenario from my May column, it was +/– 4 (in the real process, it was actually +/– 8).

Here’s a surprising reality for many: The number of variables doesn’t necessarily directly determine the number of experiments. But let’s continue the tar scenario:

“Three variables? It’s obvious: Let’s run a 2 x 2 x 2 factorial.” (Eight experiments.)
Most people might not realize that this design would allow detection of an approximate 8 percent to 9 percent difference between the high and low levels of a variable, e.g., the average difference in tar if one goes from 55° to 65°, or 26 percent to 31 percent copper sulfate, or 0 percent to 12 percent excess nitrite.

“I want to do only four experiments, so I’ll do a 2 x 2 factorial and study the other variables later.”
There are consequences! Running an unreplicated 2 x 2 (four experiments) on two of the variables (e.g., omitting excess nitrite) would allow detection of an 11 percent to 13 percent difference between the high and low variable settings. Interactions between excess nitrite and each of the other two variables would be unknown.

“What do you mean, replicate it?”
Replication of your 2 x 2 x 2 factorial (16 experiments) would then allow detection of an approximate 5.5 percent to 6.5 percent difference. To get this same sensitivity with only two variables, the 2 x 2 factorial would have to be repeated three more times (16 total experiments), a wasted opportunity.

If you knew up front that 16 experiments would be needed for your objective, you could easily include excess nitrite. You could even add two more variables (five total, perhaps the two you were planning to “study later”?) with no serious consequences.

Wouldn’t it be nice to discover up front that the excess nitrite could subsequently be set to zero? It’s your decision, and it depends on the answer to this question: What size effect must you detect to take a desired action?
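The detectable differences quoted for 4, 8, and 16 runs can be roughly reproduced with a standard normal-approximation formula. This is a sketch under assumptions the column does not state: a two-sided 5 percent significance level, 80 percent to 90 percent power, and the simulated standard deviation of 4. The function name `detectable_difference` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(n_runs, sigma, alpha=0.05, power=0.80):
    """Smallest high-vs-low difference detectable in a two-level design.

    Each effect compares the average of n_runs/2 results with the average
    of the other n_runs/2, so its standard error is 2 * sigma / sqrt(n_runs).
    alpha and power are assumed values, not figures taken from the column.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * 2 * sigma / sqrt(n_runs)


SIGMA = 4  # standard deviation of the simulated tar process

for n_runs in (4, 8, 16):
    low = detectable_difference(n_runs, SIGMA, power=0.80)
    high = detectable_difference(n_runs, SIGMA, power=0.90)
    print(f"{n_runs:2d} runs: about {low:.1f} to {high:.1f}")
```

With a standard deviation of 4, this gives roughly 11 to 13 for 4 runs, 8 to 9 for 8 runs, and 5.6 to 6.5 for 16 runs, in line with the ranges above. Note that the number of variables never enters the formula; only the total number of runs does.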

I’ve often had this conversation in various guises:

Client: I have three variables I can test. Given the potential cost savings for each percent of tar reduction, I need to detect a 1 percent difference.

Davis: Sit down. I’m afraid I have some bad news. That would require 500 to 680 experiments, depending on how badly you want to detect that effect.

Client: Ohhh... what if I cut it down to two variables?

Davis: Sorry. Still 500 to 680.

Client: Really? OK, I’ll settle for detecting 2 percent.

Davis: You’d better stay sitting. That would now require 130 to 170 experiments. But wait; let’s chat some more. Under the right circumstances, I might be able to recommend a 15-run design (a three-variable Box-Behnken) that will map the region. Or, should you wish to study two additional variables, there is a five-variable design that would map the region in 33 experiments. (Believe it or not, four variables would also take about 30 experiments.)
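The run counts in this conversation can be approximated by solving the usual detectable-difference formula for the number of runs. This is a sketch, not necessarily the calculation behind the quoted figures: it assumes a two-sided 5 percent significance level, 80 percent to 90 percent power, and the simulated standard deviation of 4. The function name `runs_needed` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def runs_needed(delta, sigma, alpha=0.05, power=0.80):
    """Total two-level runs needed to detect a high-vs-low difference delta.

    Solves delta = multiplier * 2 * sigma / sqrt(N) for N. The number of
    variables does not appear: in a factorial, every effect uses all N runs.
    alpha and power are assumed values, not figures taken from the column.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil((multiplier * 2 * sigma / delta) ** 2)


SIGMA = 4  # standard deviation of the simulated tar process

for delta in (1, 2):
    low = runs_needed(delta, SIGMA, power=0.80)
    high = runs_needed(delta, SIGMA, power=0.90)
    print(f"Detect {delta} percent: about {low} to {high} runs")
```

Under these assumptions, a 1 percent difference needs roughly 500 to 670 runs and a 2 percent difference roughly 125 to 170. Halving the difference to be detected quadruples the runs, and dropping a variable changes nothing, which is why the answer was "still 500 to 680."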

Why the dramatic difference in numbers? It depends on your objective, which brings me to another Hendrix “mess up”: Establishing effects (factorial) when the objective was to optimize (response surface), or vice versa.
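To make the response-surface alternative concrete, here is a minimal sketch of how a three-variable Box-Behnken design is laid out in coded units (-1 = low, 0 = midpoint, +1 = high). Three center points are assumed in order to reach the 15 runs mentioned in the conversation; the function name `box_behnken` is hypothetical, and this pairwise construction should only be relied on for three to five variables.

```python
from itertools import combinations, product


def box_behnken(n_factors=3, n_center=3):
    """Return coded runs (-1, 0, +1) for a small Box-Behnken design.

    For every pair of variables, run a 2 x 2 factorial at +/-1 while the
    remaining variables sit at their midpoint (0), then add center points.
    """
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * n_factors
            run[i], run[j] = a, b
            runs.append(tuple(run))
    # Repeated center points (n_center is an assumed value)
    runs.extend([(0,) * n_factors] * n_center)
    return runs


design = box_behnken()
for run in design:
    print(run)
print(len(design), "runs")
```

Unlike a two-level factorial, each variable appears at three levels, which is what allows curvature to be estimated and the region to be mapped rather than just testing high against low.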

People think it’s as simple as running a factorial design based on the number of variables, and then performing statistical t-tests. It’s not.

A healthcare example—for everyone

Suppose you’re interested in examining three components of a weight-loss intervention:
• Keeping a food diary (yes or no)
• Increasing activity (yes or no)
• Home visit (yes or no)

You plan on randomly assigning individuals to one of the eight experimental conditions, each representing a different treatment protocol. For example, the individuals randomly assigned to Condition 2 would receive a home visit, but neither of the other two intervention components. Those randomly assigned to Condition 7 would receive the “keeping a food diary” and “increasing physical activity” components but wouldn’t receive a home visit. People assigned to Condition 1 would have to rely on sheer willpower.
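The eight conditions are simply every yes/no combination of the three components. A minimal sketch of that enumeration follows; the numbering assumes that home visit changes fastest, which is an assumption chosen because it matches the Condition 1, 2, and 7 descriptions above. The names `COMPONENTS` and `conditions` are hypothetical.

```python
from itertools import product

# Order matters for the numbering: the last component changes fastest.
COMPONENTS = ("food diary", "increased activity", "home visit")


def conditions():
    """Map condition number (1 to 8) to a yes/no flag per component."""
    result = {}
    combos = product((False, True), repeat=len(COMPONENTS))
    for number, flags in enumerate(combos, start=1):
        result[number] = dict(zip(COMPONENTS, flags))
    return result


for number, assignment in conditions().items():
    received = [name for name, on in assignment.items() if on] or ["none"]
    print(f"Condition {number}: {', '.join(received)}")
```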

Sounds simple enough.

I happen to be visiting your facility, and you ask me for a sample size recommendation for the number of people needed.

I smile and say, “Please sit down.”

To be continued next time.


About The Author

Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.