I recently got hold of the set of data shown in figure 1. What can be done to analyze and make sense of these 65 data values is the theme of this article. Read on to see what is exceptional about these data, not only statistically speaking.
 Figure 1: Example data set.
A good start?
At a training class I attended several years ago, the recommended starting point for an analysis was the “Graphical Summary” in the statistical software, found under the options for “Basic Statistics.” The default output for figure 1’s data set is shown in figure 2.
 Figure 2: Output of the Graphical Summary
With 18 different statistics on display, it seems hard to believe that anything further could be required. Yet this array of statistics serves to describe these data rather than analyze them. Not only that, how many people in your audience would really know what all these basic statistics represent?
Making sense of these data does not depend on figure 2, as will be discussed. Using figure 2 as a first step in an analysis could, moreover, send you on a wild goose chase, especially if the p-value of 0.008 encourages you to try transformations of the data to get more normally distributed data.
A rational sequence is often key
Often, the most important piece of context in a data set is its ordering sequence, with time order usually the most rational. Figure 3 charts the above data in five different sequences to help explain why: Would each ordering lead to the same interpretation and prediction? No, even though each chart uses the exact same 65 values.
 Figure 3: Graphical plots of the example data in different ordering sequences
Why did I not use the software’s “Graphical Summary” for figure 3? Because I’d have had five identical outputs. This option in the software is completely oblivious to whether the data are ordered in a rational or an irrational sequence, which also means that common statistics like the average and standard deviation are unaffected by the ordering of the data. (For the enthusiast, such statistics are called “symmetric functions of the data.”)
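To see this order (in)sensitivity for yourself, here is a minimal Python sketch using made-up numbers rather than figure 1’s values: the average and standard deviation come out identical for both orderings, while the average moving range, on which a process behavior chart’s limits are built (see the next section), does not.

import statistics

# Made-up values with a shift halfway through, not figure 1's data
time_order = [10, 12, 11, 13, 30, 31, 29, 32]
scrambled = [30, 11, 32, 10, 29, 13, 31, 12]  # same values, random order

def average_moving_range(values):
    # Average absolute difference between successive values
    return statistics.mean(abs(b - a) for a, b in zip(values, values[1:]))

# Symmetric functions of the data: identical whatever the ordering
print(statistics.mean(time_order) == statistics.mean(scrambled))    # True
print(statistics.stdev(time_order) == statistics.stdev(scrambled))  # True

# The average moving range is order-sensitive, so chart limits differ
print(average_moving_range(time_order))  # 4.0  (successive values are close)
print(average_moving_range(scrambled))   # ~19  (successive values jump around)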
Although it is unlikely for a data set to be mistakenly presented for analysis in an ordering of low to high or high to low, I have seen it happen. One process capability case study was left with nowhere to go but the exit door when the Excel file shared with fellow training participants consisted of only one column of data, with the values ordered from lowest to highest.
The rational ordering of these example data will be revealed in a moment, but first a look at the method of analysis.
Method of analysis
While time-series plots, and run charts with the median included, are great ways to visualize and learn from the data, the method used herein to analyze and make the most sense of the data is a process behavior chart, commonly known as a control chart for individual values. Process behavior charts for three of the orderings of figure 3 are shown in figure 4. Note that the y-axis scale is the same on each chart to make clear how the width of the red process limits is sensitive to the ordering sequence used. (How the limits on the chart—called “natural process limits”—are computed is shown in the Postscript section below.)
 Figure 4: Process behavior charts for three possible organizations of the example data set
The upper chart—ordered by column—is the most visually different.
First, its natural process limits, or control limits, are the narrowest of the three charts.
Second, the chart shows a sustained upwards shift that is evident toward the end of the record, which is a clear signal of change in this process or system. These data are non-homogeneous, i.e., there is a lack of statistical control, the meaning of which is discussed below.
On the other hand, figure 4’s lower two charts, using randomized order and ordering by row, are consistent with reasonably homogeneous sets of data—i.e., a state of statistical control, meaning the system is effectively stable, unchanging, and predictable over time.
So, which is it? Are the data homogeneous or non-homogeneous? Without the context of sequence, it would be impossible to answer the homogeneity question. The answer is non-homogeneous because the data’s time sequence follows the ordering by column in figure 1.
Combining context with the method of analysis
To make sense of data, a good method of analysis is not enough. As was just shown, context is needed to make the method of analysis effective.
The data of figure 1 are yearly values starting in 1953 and continuing to the present. In fact, the last value in the record was calculated on May 11, 2018. (The moving-range method calculations used to obtain the limits on the process behavior charts for individual values are shown in the Postscript.) The process behavior chart showing the data’s natural time sequence on the x-axis is seen in figure 5.
 Figure 5: Process behavior chart using the data’s natural time sequence
Why, you might ask, are figure 5’s data said to be non-homogeneous? And what does non-homogeneous mean in plain English?
Why? Because by using the two most common detection rules, we can detect the signals of change in this stream of data:
Rule 1: A point beyond the (red) natural process limits.
Rule 2: A sequence of eight or more values on either side of the (green) central line.
The red points in figure 5 are the signals and have a “1” or “2” next to them to indicate which rule detected the signal. (A brief discussion of the detection rules is found in the Postscript.)
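For readers who like to see such things in code, a minimal Python sketch of the two rules follows; the function names and the example limits are illustrative assumptions, not the software’s own implementation.

def rule_1(values, lower_limit, upper_limit):
    # Rule 1: indices of points beyond the natural process limits
    return [i for i, x in enumerate(values) if x < lower_limit or x > upper_limit]

def rule_2(values, central_line, run_length=8):
    # Rule 2: flag each point at which a run of `run_length` or more
    # successive values on the same side of the central line is reached
    signals, run, side = [], 0, 0
    for i, x in enumerate(values):
        s = 1 if x > central_line else (-1 if x < central_line else 0)
        run = run + 1 if (s == side and s != 0) else (1 if s != 0 else 0)
        side = s
        if run >= run_length:
            signals.append(i)
    return signals

# Example with hypothetical limits: the second value is a rule 1 signal
print(rule_1([5.2, 9.8, 4.1], lower_limit=3.0, upper_limit=9.0))  # [1]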
What does non-homogeneous mean? It means that something has changed. A lack of stability, or consistency, over time has been detected by the process behavior chart method of analysis. In practical terms, this jargon means that the red points with a “1” or “2” next to them provide the green light to ask some “what happened when?” questions:
• When did the change, or changes, occur? The change in the process or system under study does not necessarily occur at the same time as the signal on the process behavior chart; hence, the need for careful, context-driven interpretation to identify the likely starting time of the change.
• What caused the change, or changes? To answer this question, investigative skills are put to work.
Detection rule 1 finds the biggest signals and, for this data set, these are toward the end of figure 5’s record, with three points above the upper limit by quite some margin. Here, the change appears to have started before the time of the first “1” (2011–2012) in figure 5. A useful rule of thumb for judging when a change in a process occurred is to find the last time the running record crossed the central line (sketched in code after figure 6). Applying this rule of thumb to figure 5 results in figure 6:
• The first set of limits uses the data up to 2007–2008.
• The second set of limits uses the data from 2008–2009 to the end of the record.
 Figure 6: Process behavior chart with two sets of limits determined from the data
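For the curious, here is a minimal Python sketch of the rule of thumb just described; the function name and the split into a baseline segment and a recent segment are my own illustration.

def last_crossing(values, central_line):
    # Walk backward to find the last pair of successive values that
    # straddle the central line; the record is split at that index
    for i in range(len(values) - 1, 0, -1):
        if (values[i] - central_line) * (values[i - 1] - central_line) < 0:
            return i
    return 0

# Each segment then gets its own central line and natural process
# limits, computed as shown in the Postscript:
# k = last_crossing(values, central_line)
# baseline, recent = values[:k], values[k:]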
Let us digress for a moment and assume that figure 6’s chart was an ongoing production process. Why does this chart look useful, even though we have two signals in the first set of limits? (The two runs below the central line are detected by detection rule 2.)
First, the two signals are small, sustained shifts, so their impact on the location of the natural process limits should be relatively small.
Second, what has been going on most recently in an ongoing process is often of greatest interest, and figure 6 seems to identify pretty well when the most recent, and biggest, change occurred (i.e., starting around 2008–2009).
Third, investigations to find an assignable cause, or root cause, have the greatest chance of success when they start promptly (the further back in time we must go, the harder an investigation becomes).
Fourth, resource availability is often limited, so why open up multiple investigations until the first—focused on the biggest signals, and with the expected biggest payback—brings some successes and builds confidence?
The two runs below the central line in figure 6—signaled with a “2” next to the red points—raise the question of whether it would be worthwhile to create further sets of limits within the period 1953–1954 to 2007–2008. Although several sets of limits can sometimes be required—see, for example, “Why We Keep Having 100-Year Floods”—this was judged unnecessary here. The reason why is influenced by the question of interest, which is introduced below.
Data analysis should help you learn something
While statisticians might analyze data for fun or out of curiosity, does anybody else? To make things worthwhile, before collecting or analyzing data, always be clear on “What questions are the data supposed to answer?”
This leads us to highlight a big omission up to this point: The y-axis in the charts has been left without a label because we have not yet known what the data represent. This is essential context.
The example data are El Pichichi goal-scoring data from the Spanish soccer league, with the values themselves being the ratio of goals scored to games played in each yearly season. (The ratio value is used to allow for fair comparisons from year to year because a winner who played more games had more opportunities to score.) Because a key expectation of a goal scorer is to score, the higher the data value, the better the goal-scoring performance.
As a soccer fan (football in the UK), the question I was interested in is this: Are Lionel Messi and Cristiano Ronaldo exceptional soccer goal scorers?
For those not familiar with Messi and Ronaldo, they are currently the two most famous soccer players in the world and are often in the news due to their goal-scoring exploits.
 Figure 7: Messi and Ronaldo making the headlines in the Spanish sporting press, having broken some records
While Lionel Messi made his debut for FC Barcelona in 2004, his first uninterrupted campaign came in the 2008–2009 season, during which he won the Ballon d’Or and FIFA World Player of the Year award by record voting margins (source: Wikipedia). Cristiano Ronaldo joined Real Madrid, and hence the Spanish league, in 2009 for a then world-record transfer fee. Messi and Ronaldo have each been voted world player of the year five times, monopolizing the award between them since 2008.
This context argues that the second set of limits in figure 6, starting from 2008–2009, will help to answer the question of interest. Figure 8, using figure 6’s second set of limits only, shows who has won the Pichichi top goal-scorer trophy since 2008–2009.
 Figure 8: Process behavior chart of the Pichichi data since 2008–2009 with the yearly winner noted on the chart
As seen in figure 8, Messi has won the Pichichi trophy five times and Ronaldo three, giving us good reason to consider their goal-scoring performance as possibly exceptional.
What, however, of the two other winners, Diego Forlán and Luis Suárez, found in figure 8? Can we ignore them when speaking of Messi and Ronaldo? After all, their respective values are consistent with the limits computed since 2008–2009.
Figure 9 is used to help answer this. In this process behavior chart, the computed limits use the data from 1953–1954 up to 2007–2008 (the first set of limits in figure 6). These limits are then extended to cover the period 2008–2009 up to 2017–2018.
 Figure 9: Process behavior chart of all Pichichi winners with the limits based on the data up to 2007–2008
Forlán’s win, in 2008–2009, is consistent with the level of variation in the baseline period. Even though he had a great season, it would be hard to argue that Forlán is exceptional in relation to the previous winners since 1953–1954.
Suárez won in 2015–2016 (third to last value in figure 9), and his value is above the upper limit from the baseline period. We could argue, in reference to this upper limit, that he had an exceptional season of goal scoring, noting that not one player got above this upper limit during the period 1953–1954 to 2007–2008.
Nonetheless, returning to the question of interest: Are Lionel Messi and Cristiano Ronaldo exceptional soccer goal scorers? Thanks to the process behavior chart analysis, it is proposed that the answer has to be yes. Not only did Messi and Ronaldo each exceed figure 9’s upper limit twice, but the extent to which they exceeded this limit is also important. No other player got close to this level of performance. Context also agrees with this interpretation: As stated above, Messi and Ronaldo have shared the world player of the year trophy since 2008.
Does FC Barcelona’s Luis Suárez’ Pichichi win in 2015–2016 mean that we judge him to be exceptional, like Messi and Ronaldo? Few, if any, soccer writers would say yes, which is where subject matter knowledge comes in. But if Suárez could repeat his goal-scoring feat once or twice more....
Finally, a common discussion point in the media is, “Who is better: Messi or Ronaldo?” The Pichichi data cannot be used to separate the two, only to tell us that both share a comparable, and exceptional, level of goal-scoring performance. But if either Messi or Ronaldo helps his national team to win the current World Cup in Russia, and his goal-scoring exploits are a decisive factor, the “Who is better?” question might take on a new dimension.
What have we learned?
Although no single example can be expected to generalize to all eventualities, the following four elements have been shown to be essential to analyze data effectively.
An objective
This is often best stated as the question(s) the data are supposed to answer.
Without an objective, you are doing things for fun, meaning you may well be a statistician doing something in your spare time, or probably (but hopefully not) wasting your time.
The objective is often a key factor in how the data will be collected in the first place, thereby influencing the context as well as helping to decide which method of analysis is most appropriate.
Context, or what the data represent
Time sequence: Very often, the most important piece of context belonging to a data set is its time order—ignore this at your peril!
A method of analysis
To be effective, time-series plots and process behavior charts must be built on a rational ordering of the data.
Process behavior charts filter out the noise in a data set—the job of the natural process limits—so the signals of interest (sometimes called “exceptional variation”) can be reliably detected.
Although histograms can be useful, they look the same no matter which sequence the data are organized in; hence, do not default to histograms to detect signals of change in a data set.
Judgment
To use the method of analysis effectively, judgment is often a key ingredient.
Sometimes, process behavior charts require more than one set of limits to be effective. This decision places a demand on user judgment, which often must be defended in front of one’s peers.
Closing comments
As just stated, judgment has a role to play in data analysis.
Should I have created more than one set of limits in the baseline period (i.e., figure 6 up to 2007–2008)?
What of my interpretation of Forlán and Suárez?
If your interpretation differs from mine, please post a comment.
As I’ve seen on a near-weekly basis for years now, a great way to make sense of any data set is to start an analysis with a process behavior chart, the primary tool of statistical process control. Because many man-made processes are unstable in behavior over time—like the Pichichi data—this basic chart can be used to catalyze learning and improvement. The Pichichi data aren’t from a production process, but their behavior over time could just as well have come from one.
Postscript: Detection rules and how to compute the limits
Detection rules
Software packages offer numerous detection rules to detect signals of process change on a process behavior chart. Although detection rule 1 is always applied, opinions differ concerning the other rules. The detection rules are there to help you discover what is going on in the process or system through an understanding of variation. I used two detection rules:
Rule 1: A point beyond the (red) natural process limits
Rule 2: A sequence of eight or more values on either side of the (green) central line
Rule 1 requires just one point to fall beyond a natural process limit and detects process changes of bigger magnitude than rule 2, which is effective at detecting smaller, sustained shifts in a process. Rules 1 and 2 are also pretty easy to explain to others. An analogy for rule 2 uses a two-sided coin: If I tossed the coin eight or more times and it always returned a head, wouldn’t you think something fishy was going on? (With a fair coin, the chance of eight heads in a row is (1/2)^8, or about 0.4 percent.) If so, you’d have the green light to look at the coin yourself or ask what’s going on.
Other detection rules are also used, such as two out of three successive points beyond 2-sigma, and four out of five beyond 1-sigma. The only rule that is definite is rule 1, and it is often sufficient. Other rules can be added as needed, as discussed in the article “When Should We Use Extra Detection Rules?”
As a general rule, use extra detection rules if 1) you think you need them; 2) you can also explain the additional rules; and 3) those using and interpreting the charts are comfortable with the extra rules.
Computing the limits
The Pichichi goal-scoring data are naturally one-value-per-time-period data, meaning the natural subgroup size is one. When this is the case, a process behavior chart for individual values is the first chart to look at. The 3-sigma limits (i.e., natural process limits) in figure 5 are calculated from the original data with the moving-range method (SDwithin is the estimate of “sigma”):
• Moving range: mR(i) = |x(i) − x(i−1)|, the absolute difference between successive values.
• Average moving range: the average of the n − 1 moving ranges.
• SDwithin = average moving range ÷ 1.128, where 1.128 is the bias-correction constant for moving ranges of two values.
• Central line = the average of the n individual values.
• Natural process limits = average ± 3 × SDwithin (equivalently, average ± 2.66 × average moving range).
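The same calculation as a minimal Python sketch, using illustrative values rather than the actual Pichichi ratios:

import statistics

def natural_process_limits(values):
    # Moving ranges: absolute differences between successive values
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sd_within = statistics.mean(moving_ranges) / 1.128  # 1.128 is d2 for n = 2
    centre = statistics.mean(values)
    return centre - 3 * sd_within, centre, centre + 3 * sd_within

# Illustrative values only, not the actual Pichichi ratios
lnpl, centre, unpl = natural_process_limits([0.55, 0.61, 0.48, 0.72, 0.66, 0.58])
print(f"LNPL = {lnpl:.3f}, central line = {centre:.3f}, UNPL = {unpl:.3f}")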