Donald J. Wheeler
Published: Monday, August 6, 2018 - 12:03

Data mining is the foundation for the current fad of “big data.” Today’s software makes it possible to look for all kinds of relationships among the variables contained in a database. But owning a pick and shovel will not do you much good if you do not know the difference between gold and iron pyrite.

When you start rummaging around in a collection of existing data (a database) to discover whether you can use some variables to “predict” other variables, you are data snooping (known today as data mining). With today’s software we can go snooping in very large databases in an effort to extract useful relationships. However, in the interest of clarity we will use a small data set and do our snooping using nothing more than bivariate linear regression. The issues illustrated here are the same regardless of the size of the data set and regardless of the techniques used to “model the data.”

The data set consists of five weekly production variables from a chemical plant. Figure 1 shows the data for a baseline of eight weeks of production. We will treat Y as our response variable and see how well the other four variables do in predicting the value for Y.

Figure 2 shows the results of four simple regressions using the baseline data. The relationship modeled is shown in the first column. The second column lists the coefficient of determination for each regression equation. These values show how much of the variation in Y can be explained by the regression equation. The third column lists the p-value for the slope term of the regression equation. As always, small p-values indicate regression line slopes that are detectably different from a horizontal line. Three of the relationships modeled have p-values that are less than the traditional 0.05. Each of these models can explain more than 80 percent of the variation in Y.
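For readers who want to reproduce this kind of table, a simple regression with its coefficient of determination and slope p-value takes only a few lines. The sketch below uses made-up weekly values (Wheeler's actual data are in figure 1, which is not reproduced here) and scipy's `linregress`:

```python
# Sketch of one row of a table like figure 2: a simple linear
# regression with its R-squared and slope p-value.
# The values below are made-up stand-ins, NOT the plant's data.
from scipy.stats import linregress

y  = [34, 31, 38, 40, 36, 33, 41, 37]              # hypothetical response
x1 = [5.1, 4.6, 5.9, 6.3, 5.5, 4.9, 6.5, 5.7]      # hypothetical predictor

fit = linregress(x1, y)
r_squared = fit.rvalue ** 2       # coefficient of determination
print(f"Y = {fit.intercept:.2f} + {fit.slope:.2f} X1")
print(f"R-squared = {r_squared:.3f}, slope p-value = {fit.pvalue:.4f}")
```

A small slope p-value here plays the same role as in figure 2: it says the fitted line is detectably different from a horizontal line.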
The simple regression of Y = f(X4) has a p-value of 0.65, so we conclude that by itself X4 appears to be useless for predicting the value for Y.

In the case of the first three simple regressions listed in figure 2, we might logically ask if we can improve things by adding a second variable to our regression equation. Since the inclusion of additional variables will always increase the coefficient of determination, we will need to use the conditional p-values to decide if a given addition is likely to represent a real improvement in the way our model fits the data. As always, it is the small conditional p-values that represent detectably better fits.

Figure 3 shows the results for adding a second variable to the model Y = f(X1). The bivariate regression model Y = f(X1, X2) explains 86.7 percent of the variation in the response variable Y. However, the conditional p-value for using X2 in addition to X1 is 0.21, which means that this bivariate regression model is not detectably better than Y = f(X1). The bivariate regression model Y = f(X1, X3) explains 92.7 percent of the variation in the response variable Y. The conditional p-value for using X3 in addition to X1 is 0.029. Since this value is less than the traditional 0.05 alpha level, this bivariate regression model can be said to be detectably better than Y = f(X1). The bivariate regression model Y = f(X1, X4) explains 82.9 percent of the variation in the response variable Y. However, the conditional p-value for using X4 in addition to X1 is 0.577, which means that this bivariate regression model is not detectably better than Y = f(X1). Thus, out of these four models we would pick Y = f(X1, X3) as the best choice for explaining and predicting the response variable Y.

Figure 4 shows the results for adding a second variable to the model Y = f(X2). The bivariate regression model Y = f(X2, X1) explains 86.7 percent of the variation in the response variable Y. However, the conditional p-value for using X1 in addition to X2 is 0.73, which means that this bivariate regression model is not detectably better than Y = f(X2). The bivariate regression model Y = f(X2, X3) explains 91.2 percent of the variation in the response variable Y. However, the conditional p-value for using X3 in addition to X2 is 0.143, which means that this bivariate regression model is not detectably better than Y = f(X2). The bivariate regression model Y = f(X2, X4) explains 90.2 percent of the variation in the response variable Y. However, the conditional p-value for using X4 in addition to X2 is 0.207, which means that this bivariate regression model is not detectably better than Y = f(X2). Thus, out of these four models we would pick Y = f(X2) as the best choice for explaining and predicting the response variable Y.

Figure 5 shows the results for adding a second variable to the model Y = f(X3). The bivariate regression model Y = f(X3, X1) explains 92.7 percent of the variation in the response variable Y. However, the conditional p-value for using X1 in addition to X3 is 0.103, which means that this bivariate regression model is not detectably better than Y = f(X3). So while figure 3 shows that adding X3 to X1 is better than using X1 alone, here we find that adding X1 to X3 is not better than using X3 alone. The bivariate regression model Y = f(X3, X2) explains 91.2 percent of the variation in the response variable Y. However, the conditional p-value for using X2 in addition to X3 is 0.190, which means that this bivariate regression model is not detectably better than Y = f(X3). The regression model Y = f(X3, X4) explains 87.6 percent of the variation in the response variable Y. However, the conditional p-value for using X4 in addition to X3 is 0.826, which means that this bivariate regression model is not detectably better than Y = f(X3).
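The conditional p-values above correspond to the standard partial F-test for an added variable (which, for a single added variable, is equivalent to the t-test on the new coefficient). A minimal sketch of that test, using made-up data rather than the plant's values:

```python
# Sketch of a "conditional p-value" for adding a second predictor:
# a partial F-test comparing Y = f(X1) with Y = f(X1, X2).
# All data here are made-up stand-ins for illustration only.
import numpy as np
from scipy.stats import f as f_dist

def sse(X, y):
    """Residual sum of squares for an OLS fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

rng = np.random.default_rng(1)
n = 8
x1 = rng.uniform(4, 7, n)
x2 = rng.uniform(10, 20, n)
y = 5 + 6 * x1 + rng.normal(0, 1, n)       # in this toy setup X2 contributes nothing

sse_reduced = sse(x1.reshape(-1, 1), y)            # Y = f(X1)
sse_full = sse(np.column_stack([x1, x2]), y)       # Y = f(X1, X2)

df_full = n - 3                    # n minus three fitted coefficients
F = (sse_reduced - sse_full) / (sse_full / df_full)
p_conditional = f_dist.sf(F, 1, df_full)
print(f"conditional p-value for adding X2: {p_conditional:.3f}")
```

A small conditional p-value would say the extra variable produces a detectably better fit; adding a variable always lowers the residual sum of squares, so the test asks whether the drop is larger than noise alone would produce.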
Thus, out of these 10 models our best choices for explaining and predicting the response variable Y are either Y = f(X2) or Y = f(X3). In this case, bivariate regressions do not result in detectably better fits to the baseline data. Given the extremely small size of this illustrative data set, we shall not consider regressions using three or four variables.

The data for weeks 9 through 25 are shown in figure 6. When we go data snooping using all 25 records of figures 1 and 6 combined, we get the simple regressions of figure 7. There are only two simple regressions that have a p-value less than 0.05. The regression on X3 explains about 29 percent of the variation in Y, while the regression on X4 explains 71 percent of the variation in the response variable Y. These results are considerably different from what we found earlier in figure 2. The baseline data led us to expect that a simple regression using either X2 or X3 would predict about 85 percent of the variation in the response variable Y. The combined data suggest that a simple regression using X4 will predict about 71 percent of the variation in Y. So which analysis is right?

The key to understanding what is happening here is to compare the values for X4 in figure 1 with the values for X4 in figure 6. The response variable Y represents the weekly steam usage for the plant. X1 represents the amount of fatty acid in storage. X2 represents the amount of glycerin produced. X3 is the number of hours of operation for the plant. X4 is the weekly average temperature for the plant site. Since steam is used both for process heat and for heating the buildings, the amount of steam used increased during colder weeks. This was missed in the initial analysis because the baseline data came from summer weeks. The scatterplots and regressions for the model Y = f(X4) for the baseline and combined data sets are shown in figures 8 and 9.
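The restricted-range effect behind figures 8 and 9 is easy to simulate. The sketch below invents a steam-versus-temperature relationship, draws eight warm baseline weeks and 17 wider-ranging later weeks, and fits both data sets (all numbers are made up, not Wheeler's):

```python
# Sketch of how a restricted baseline range hides a real relationship.
# Steam use (Y) genuinely depends on temperature (X4), but the
# baseline covers only warm summer weeks.  Numbers are invented.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
temps = np.concatenate([rng.uniform(78, 85, 8),     # 8 warm baseline weeks
                        rng.uniform(30, 80, 17)])   # 17 later weeks, wider range
steam = 120 - 0.8 * temps + rng.normal(0, 2, 25)    # colder weeks -> more steam

baseline = linregress(temps[:8], steam[:8])         # figure 8 analogue
combined = linregress(temps, steam)                 # figure 9 analogue
print(f"baseline R-sq: {baseline.rvalue**2:.2f}, p = {baseline.pvalue:.3f}")
print(f"combined R-sq: {combined.rvalue**2:.2f}, p = {combined.pvalue:.2e}")
```

Over the narrow summer range the temperature signal is comparable to the noise, so the baseline regression can easily look useless; over the full range the same relationship dominates.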
Because the range of values for X4 was restricted in the baseline data, the first analysis missed the relationship between X4 and Y even though this relationship is the most dominant single relationship in these data. This illustrates the first caveat.

The First Caveat of Data Snooping: Important relationships may be missed when the data set does not contain the full range of routine values for some variables.

This one caveat places all data snooping on a slippery slope. All of your data are historical. All of your data snooping applies to what has already happened. All of the questions of interest pertain to what will happen in the future. We collect and analyze data in order to make predictions so that we can take appropriate actions. But how can we do this when we may end up missing some important predictors?

There are situations where data snooping is the only option. For example, when studying accidents we have to use the existing data simply because it is hard to find any volunteers for experiments. I call this data snooping out of necessity. When data snooping out of necessity we will generally have some idea that we are trying to examine. Here the objective is to discover if the database contains any evidence to support, or to refute, the idea. Whether this idea is derived from theory or is based on empirical observations, the idea will provide a framework for interpreting the results of the data snooping.

One such study that I was associated with was a study of the impact of the type of on-street parking (low-angle, high-angle, or parallel) upon mid-block traffic accidents. The data were sparse and expensive to collect. When analyzing these data I had to be very careful to take the context into account. Those comparisons that made no sense in context were interpreted as representing the noise in the data. Once we had established the noise level, we could look for comparisons that were more strongly supported by the data and which also made sense in context.
When several different comparisons that were strongly supported by the data all told the same story, and when that story made sense in the context, then the combination of plausibility, replication, and strength of signal made the findings credible. The result? The hazard increases with the utilization. When fully utilized, every type of on-street parking appeared to be equally hazardous.

However, data snooping out of necessity is completely different from data snooping out of convenience. Data snooping out of convenience occurs when we rummage around in a database just because it is there and we want to see if we can find something. Working without either theory or experience as a guide, we risk being led on walkabout by the missing data within the structure of the database. In one case I was called in to rescue some data miners who had gotten lost in their database. I found the miners were using more than 100 variables to define what was, at most, a 12-dimensional vector space. In addition, more than 99.9 percent of the variable combinations were missing from the database. When dealing with such sparse and non-orthogonal data structures, everything that is found has to be considered as very tentative simply because you are virtually certain to be fitting noise more often than you are fitting signals.

While the example used here was extremely simple, the principle illustrated is real. The first caveat of data snooping is that we can miss important relationships because the data may not contain the full range of values for some variables. This is what makes data snooping so unsatisfactory as a general approach to data analysis. Yet with the advent of big data it seems that anyone with a computer thinks they are the data miner who will discover the mother lode. Let all data miners beware. Next month we will look at three additional caveats of data snooping.
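The noise-fitting hazard in the convenience-snooping story above can be demonstrated with a toy simulation: regress a pure-noise response on a large collection of pure-noise candidate predictors, and the search alone will produce apparently interesting p-values. A sketch:

```python
# Sketch of why convenience snooping tends to fit noise: regress a
# pure-noise response on 100 unrelated candidate predictors and keep
# the "best" one.  Every variable here is random, so any small
# p-value is a false alarm manufactured by the search itself.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(7)
n_weeks, n_candidates = 8, 100
y = rng.normal(size=n_weeks)                  # response: pure noise
X = rng.normal(size=(n_candidates, n_weeks))  # predictors: pure noise

p_values = [linregress(x, y).pvalue for x in X]
best = min(p_values)
print(f"smallest p-value among {n_candidates} noise predictors: {best:.4f}")
```

Under the null hypothesis each p-value is roughly uniform on (0, 1), so scanning 100 candidates all but guarantees that some will look like discoveries.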
Dr. Wheeler is a fellow of both the American Statistical Association and the American Society for Quality who has taught more than 1,000 seminars in 17 countries on six continents. He welcomes your questions; you can contact him at djwheeler@spcpress.com.

Data Snooping, Part 1
What pitfalls lurk within your database?
Data snooping
Figure 1: Baseline: Five variables for eight weeks of production
Figure 2: Four simple linear regressions
Regressions using two independent variables
Adding variables to X1
Figure 3: Adding variables to Y = f(X1)
Adding variables to X2
Figure 4: Adding variables to Y = f(X2)
Adding variables to X3
Figure 5: Adding variables to Y = f(X3)
Using additional data
Figure 6: Data for weeks 9 through 25
Figure 7: Four simple linear regressions for all 25 weeks
What happened?
Figure 8: Regression of Y upon X4 for baseline data
Figure 9: Regression of Y upon X4 for combined data
Important relationships may be missed when the data set does not contain the full range of routine values for some variables.
Data snooping out of necessity
Data snooping out of convenience
Summary
© 2023 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.
“Quality Digest" is a trademark owned by Quality Circle Institute, Inc.