© 2023 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.

“Quality Digest” is a trademark owned by Quality Circle Institute, Inc.

Published on *Quality Digest* (https://m.qualitydigest.com)

**Published:** 06/30/2021

Measurement systems analysis (MSA) for attributes, or attribute agreement analysis, is a lot like eating broccoli or Brussels sprouts. We must often do things we don't like because they are necessary or good for us. While IATF 16949:2016, Clause 7.1.5.1.1—“Measurement systems analysis,” does not mention attribute agreement analysis explicitly, it does say that MSA shall be performed to assess “variation present in the results of each type of inspection, measurement, and test equipment system identified in the control plan.” It does not limit this requirement to the familiar real-number measurements with which we are comfortable.

Common sense says, meanwhile, that it is beneficial to understand the capabilities and limitations of inspections for attributes. The last thing we want to hear from a customer is, for example, “Your ANSI/ASQ Z1.4 sampling plan with an acceptable quality level of 0.1 percent just shipped us a lot with 2-percent nonconforming work.” Samuel Windsor describes how an attribute gage study saved a company $400,000 a year, which is a powerful incentive to learn about this and use it where applicable.^{1} Jd Marhevko has done an outstanding job of extending attribute agreement analysis to judgment inspections as well as go/no-go gages.^{2}

This is the first part of a two-part series that will address the following issues:

1. This part will describe how MSA for attributes is performed, along with the signal detection method and hypothesis test risk analysis. The former allows rough quantification of the gage’s repeatability or equipment variation. The latter includes interrater comparison. An appendix will show how to simulate attribute gage studies for training and other purposes.

2. The second part will cover the analytical method that allows detailed quantification of the attribute gage’s repeatability.

Attribute agreement analysis is, despite the relative simplicity of attribute data, far more complicated than MSA for variables data, for which Donald Wheeler provides an excellent overview.^{3} *Gage reproducibility and repeatability for variables is, however, independent of the part measurements*. Equipment variation (i.e., repeatability) and appraiser variation (reproducibility) are the same for parts that are in specification, out of specification, or borderline. This is not true for attribute inspections; hence, the comparison to unpopular cruciferous vegetables.

MSA for attributes seeks to quantify equipment variation, appraiser variation, and also accuracy (i.e., the ability to distinguish good parts from bad ones) but *all three considerations depend on the status of the parts*. There will be almost no equipment or appraiser variation, or inaccuracy, for parts that are well within specification or clearly out of specification. The definition of Zone I in figure 1 is that inspectors (and the gage) will agree consistently that the parts are nonconforming. They will similarly agree consistently that the parts in Zone III are good, again for perfect repeatability and reproducibility. Only in Zone II will any equipment and/or appraiser variation make themselves known.

*The results of the study therefore depend heavily on the nature of the parts used*. Brussels sprouts, broccoli, and MSA for attributes are therefore at best acquired tastes, but we know we have to eat or do them, respectively, because they are good for us. If attribute agreement analysis can deliver $400,000 bottom-line results, we must acquire a taste for that as well.

The Automotive Industry Action Group (AIAG) offers some of the best manuals available to support quality management systems, and their content is applicable to most non-automotive applications. AIAG’s MSA manual provides a comprehensive overview of MSA for attributes, and StatGraphics supports the chapter’s entire content.^{4} StatGraphics also includes the data files (gageatt1.sgd and gageanalytic.sgd) so the user can try the AIAG example.

The AIAG manual describes several procedures for assessing attribute measurement systems. The signal detection approach and analytic method require quantitative (real-number) part measurements by gages capable of real-number measurements. The hypothesis test analysis does not and is applicable to sensory and judgment inspections.

1. The signal detection approach requires dimensional measurements of the parts. Its deliverables are:

• Boundaries of Zones I, II, and III

• Gage standard deviation

2. Hypothesis test analysis does not require dimensional measurements or even knowledge of whether the parts are good or bad, although the latter information is necessary to assess how well the inspectors agree with the parts’ actual status. A caveat is that *the results will depend on the selection of parts*, but appraiser variation and accuracy can be addressed through improvements in methods and conditions. Deliverables include:

• Effectiveness, or correct decisions divided by decision opportunities; this requires knowledge as to whether the parts are good or bad.

• Repeatability (does not require knowledge of the part status)

• Reproducibility (does not require knowledge of the part status)

• Cohen’s kappa measurement for interrater agreement, and also agreement of the appraisers with the actual status of the part. The former does not require knowledge of the part status, while the latter does.

• Miss rate, or acceptances of nonconforming parts divided by opportunities to detect nonconforming parts. This, and the false alarm rate, require knowledge of the part status. Both depend on the status of the parts, but both are actionable if they are found to depend on gage bias, gage variation, and/or appraiser variation.

• False alarm rate, or rejections of good parts divided by opportunities to accept good parts

3. The analytic method quantifies gage bias and variation. The deliverables are:

• Gage bias, and remember that many applications use different gages at each end of the specification. Jody Muelaner reminds us in the context of plug gages, “Each end of a go/no-go gage must be considered as a separate gage, requiring its own uncertainty evaluation.”^{5}

• Equipment variation in the form of the gage standard deviation and percent GRR.

The AIAG reference provides an excellent and comprehensive overview of how to perform an attribute agreement analysis. More parts will be required than the usual five to 15 (depending on the number of appraisers) that are used by MSA for variables data because pass/fail results do not contain nearly as much information. In addition, roughly 25 percent of the parts should come from the Zone II region, where mischaracterization is possible.

The *reference value*, or dimensional or other real-number measurement, must then be measured for each part, if possible. This determines whether the part is in or out of specification, as indicated by 1 (pass) and 0 (fail), respectively. If the parts are not measurable, then people with expertise in visual or other judgment inspection must classify the parts as good or bad. This precludes use of the signal detection approach and analytic method that require reference values.

Each inspector then assesses each part a certain number of times, as they would do in an MSA for variables data. The inspector records 1 for acceptance and 0 for rejection. If there are three inspectors A, B, and C, then A-1 refers to A’s first assessment of the part, A-2 to his or her second, and A-3 to his or her third. The results are recorded as shown in the AIAG reference for an attribute study data sheet. The first 10 rows of the example used in this article appear in figure 2.

The AIAG reference offers an example in which the process performance index is so bad (0.5, or a 1.5 sigma process) that it will generate a relatively large fraction of nonconforming and borderline parts. The lower specification limit is 0.45, and the upper specification limit is 0.55, which means, given 1.5 standard deviations between the nominal and each specification limit, the process standard deviation is 0.0333.
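That standard deviation is easy to verify; the sketch below uses the values stated above (the variable names are mine):

```python
# Back out the process standard deviation from the AIAG example.
# A performance index of 0.5 puts 1.5 standard deviations between
# the nominal and each specification limit.
LSL, USL = 0.45, 0.55
Pp = 0.5

# Pp = (USL - LSL) / (6 * sigma)  =>  sigma = (USL - LSL) / (6 * Pp)
sigma = (USL - LSL) / (6 * Pp)
print(round(sigma, 4))  # 0.0333
```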

The organization wants to use a go/no-go gage to contain the poor quality, even though the part dimension is measurable in real numbers. This is probably because the latter would be expensive or even cost-prohibitive, while a go/no-go gage can check far more parts per hour at the cost of getting only pass/fail results.

The team wanted to know explicitly how well the gages would perform in the region of the specification limits. The team accordingly selected 25 percent of the sample from near each specification limit *as determined from the gage capable of measuring real numbers*. Three inspectors used the go/no-go gage to test each part three times, for a total of nine assessments of each part.

The attribute study data sheet (Table III-C 1, page 134) is available in the AIAG reference and also as gageatt1.sgd in StatGraphics 18’s sample data files. I decided, however, to generate another example not copyrighted by AIAG to provide an entire data set as an Excel file for *Quality Digest’s* readers, as shown in Appendix 1. The first 10 rows appear in figure 2.

The signal detection approach (AIAG reference, pp. 143–144) allows us to estimate the measurement system’s gage reproducibility and repeatability (GRR) and also shows how to identify Zones I, II, and III. This requires that the parts be measurable in terms of real numbers, as stated earlier.

To do this with StatGraphics, copy the attribute study data sheet (for all 50 parts) into StatGraphics in the Data and Code Columns format. (I was unable to make it work in the One Row per Part format from figure 2, but I might have done something wrong). Part of this appears in figure 3.

Then select SPC/Gage Studies/Attribute Data/Signal Theory Method and enter the required information, as shown in figure 4. Note that the “measurement” is the pass/fail result and not the real-number measurement (i.e., reference value), which goes in the Reference Value field.

The next dialogue box will ask for the tolerance, which is 0.1 (the difference between 0.55 and 0.45). The results appear in figure 5. The Zone II regions are the pink bars, where parts are neither rejected nor accepted consistently. *Each Zone II should be centered roughly on the specification limit, where we expect 50 percent of the parts to be accepted*. The one for the LSL appears to be too far to the left, and remember that a bias of 0.005 was introduced for the smaller of the two gages. If Zone II is not where we expect it, then the gage may be biased. *Figure 5 therefore exemplifies an actionable deliverable from an attribute gage study*. It tells us that there might be a problem with one of the gages upon which we depend to protect customers from poor quality.

Figure 6, again from StatGraphics, shows how the signal detection approach estimates the gage variation.

The AIAG manual divides the average Zone II width, d = 0.0221, by the specification width of 0.1 to obtain a GRR of 22.1 percent. StatGraphics instead calculates Kd/(5.15 × specification width), where K is the user-specified number of sigma intervals (6 being the default for gage studies); the extra factor of 6/5.15 yields 25.78 percent.

The estimated gage standard deviation is d/6. The AIAG manual explains that d is an estimate of the GRR, and GRR = 6 times the gage standard deviation. Division of 0.0221 by 6 yields 0.00368, which compares favorably to the simulated gage standard deviation of 0.004.
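Both GRR calculations and the gage standard deviation can be reproduced in a few lines; d and the tolerance below come from the text, and the small difference from the quoted 25.78 percent reflects the rounding of d to 0.0221:

```python
d = 0.0221        # average Zone II width from the study
tolerance = 0.10  # USL - LSL = 0.55 - 0.45

grr_aiag = d / tolerance              # AIAG: d divided by the tolerance
grr_sg = 6 * d / (5.15 * tolerance)   # StatGraphics: K*d/(5.15*tolerance), K = 6
sigma_gage = d / 6                    # d estimates GRR = 6 * sigma_gage

print(f"{grr_aiag:.1%}")     # 22.1%
print(f"{grr_sg:.2%}")       # 25.75% (25.78% in the text, from an unrounded d)
print(round(sigma_gage, 5))  # 0.00368
```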

We have seen so far, then, that the signal detection approach provides 1) an estimate of the gage standard deviation; and 2) a visual indicator of potential gage bias. The gage standard deviation also allows us to illustrate Zones I, II, and III. The probability of acceptance for a part whose reference value (actual measurement) is x is, where Φ is the cumulative standard normal distribution and σ_{gage} is the gage standard deviation,

P(acceptance) = Φ((USL − x)/σ_{gage}) − Φ((LSL − x)/σ_{gage})
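This acceptance probability can be evaluated with Python’s standard library; the sketch below assumes an unbiased gage with the standard deviation estimated above:

```python
from statistics import NormalDist

LSL, USL = 0.45, 0.55
sigma_gage = 0.00368  # estimated by the signal detection approach

phi = NormalDist().cdf  # cumulative standard normal distribution

def p_accept(x: float) -> float:
    # The gage reads x plus N(0, sigma_gage) error; the part is
    # accepted when that reading falls inside the specification limits.
    return phi((USL - x) / sigma_gage) - phi((LSL - x) / sigma_gage)

print(round(p_accept(0.45), 2))  # 0.5: 50-percent acceptance at the LSL
print(round(p_accept(0.50), 2))  # 1.0: well inside specification
```

Plotting p_accept over a grid of reference values reproduces the gage performance chart, with Zone II falling where the probability is neither near 0 nor near 1.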

Figure 7 (from Excel) shows the resulting gage performance chart and also the previously calculated zones. Recall that we want the 50-percent acceptance points on the gage performance chart to be in the centers of their zones, and the fact that the one for the lower specification limit is too far to the right reflects the bias for the smaller gage (or “go” side of a plug or pin gage, which must fit into the hole, while the other side must not). This kind of information can therefore be extremely useful for assessing the system.

Now that we have defined Zones I, II, and III, we can tabulate the study and proceed with the hypothesis test risk analysis.

StatGraphics will perform this for data in either format (data and code columns, or one row per part). Since we have the data in the former format, we will use that. Figure 8 shows the relevant dialogue box. *The measurement is the pass/fail result and not the real-number reference value, which is in fact optional because this analysis can be used for judgment inspections as well*.

In the subsequent dialogue, leave the default confidence level of 95 percent. We can obtain 1) an analysis summary; 2) an interrater comparison; and 3) an agreement plot. Figure 9 shows the analysis summary.

1. Measurement System Effectiveness

• Repeatability reflects the number of times each inspector got consistent results for a part, i.e., agreed with his or her own repeated assessments of its status. The 135 out of 150 result (three inspectors times 50 parts) is the total from the Repeatability by Operators section (46+44+45). The gage bias will not affect this outcome, just as an out-of-calibration gage will not affect the repeatability component of MSA for variables data. If the bias is bad enough, then the results will be wrong, but they will still be consistent. *Accuracy and precision, while both important, are not the same thing*.

• The Reproducibility result of 44 out of 50 means that all three inspectors agreed about the parts 88 percent of the time. The gage bias will not, however, affect this outcome, either, although an appraiser bias would. *The proportion of Zone II parts will, on the other hand, affect both repeatability and reproducibility*. If the appraisers agree among themselves and each other consistently, then the parts are in Zone I or Zone III by definition.

• The Operators vs. Reference result shows that all three inspectors got the same correct result for 44 of the 50 parts. The AIAG reference (page 139) explains that the inspectors agreed 1) with themselves for uniform repeatability; 2) with each other for uniform reproducibility; and 3) with the reference, for uniform accuracy. This is far from a perfect metric, however, because it will again depend on the proportion of items in Zones I, II, and III. If almost all the parts are in Zones I and/or III, i.e., very bad or very good, we will get close to 100 percent. If most of them are in Zone II, we will get very mediocre results. The same goes for the miss rate and false-alarm rate discussed below. Note, however, that if the gage bias is sufficiently large, it will affect the Operators vs. Reference result.

2. Repeatability by operators breaks down the information from Repeatability in Measurement System Effectiveness.

3. “Operators vs. Reference” repeats the information in “Repeatability by Operators” but adds fields for false positives and false negatives. A false positive means the inspector passed the item *consistently*, 3 out of 3 times in this case, when the item should have failed. A false negative means the inspector rejected an acceptable item *consistently*. If the inspector did not obtain consistent results for a part, it does not count.

• This example had a false negative, which means Inspector A rejected a part three out of three times, even though it was in specification. This was for part 8, which was just below the USL (at 0.5493), which means it had almost a 50:50 chance of rejection by random chance, so three out of three rejections by the same inspector are consistent with random chance (1 in 8). The same would go for three out of three acceptances, even though this is the correct outcome.

4. “Operators summary” is similar except for the fact that it assesses each inspector’s individual assessments of the parts. Equation Set 1 summarizes the miss rate and false-alarm rate.

The StatGraphics documentation adds that the miss rate should be no more than 2 percent and the false alarm rate no more than 5 percent. In this case, each inspector had a 25-percent miss rate. Eight of the parts were out of specification, which means there were 72 opportunities (eight parts times nine assessments of each) to accept nonconforming work, and nonconforming work was accepted 18 times. Ten of these were at the lower end and could have been aggravated by the 0.005 bias of the smaller gage; it thinks the parts (or holes, if a plug gage) are larger than they really are. Half of the eight misses at the upper end were for a 0.551 part, which had close to a 50-percent chance of acceptance, so these four misses can be written off as random chance. Another three were for a 0.553 part, which is still less than one gage standard deviation from the specification limit.
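The miss-rate arithmetic in that paragraph can be reproduced directly; the counts below come from the text:

```python
nonconforming_parts = 8   # parts outside the specification limits
assessments_per_part = 9  # 3 inspectors x 3 trials each
misses = 18               # times nonconforming work was accepted

opportunities = nonconforming_parts * assessments_per_part
miss_rate = misses / opportunities

print(opportunities)       # 72
print(f"{miss_rate:.0%}")  # 25%
```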

This reinforces the previous statement that *the outcome of these studies will depend heavily on the nature and mix (Zone I, Zone II, and Zone III) of the parts*. The miss rate and false-alarm rate are therefore not as informative as we might like to think.

StatGraphics also calculates Cohen’s kappa for agreement between the inspectors, as shown in figure 10. Kappa measures how well the inspectors agree with each other (e.g., 0.769 for agreement of A with B) and with the reference or actual part status (e.g., 0.764 for A and the reference). Lower and upper confidence limits also are provided. Appendix 2 shows how kappa is obtained.

The AIAG manual adds that kappa values of 0.75 or greater indicate good to excellent agreement, while those less than 0.40 indicate poor agreement. *The practical deliverable here is an indicator of how well the inspectors agree with each other*. If the gage is not inspector-dependent, then there should be good agreement. If the agreement is poor, then the next step is to assess why the gage is inspector-dependent; the same principle applies to appraiser variation for gages that measure real numbers.

The rate at which inspectors agree with the standard also becomes important when the inspection is a visual or other sensory one, i.e., there is no gage. The Windsor reference cites visual inspections for blisters, voids, scratches, and roughness on an electroplated product, and a complicating issue was the fact that no part was totally free of these defects. This resulted, as one might expect, in substantial inconsistency in inspection results. Creation of standards with visual aids, including photographs, improved consistency and effectiveness enormously.

1. Windsor, Samuel E. “Attribute Gage R&R.” *Six Sigma Forum Magazine*, August 2003. http://rube.asq.org/pub/sixsigma/past/vol2_issue4/windsor.html

2. Marhevko, Jd. “Attribute Agreement Analysis (AAA): Calibrating the Human Gage!” *Statistics Digest, The Newsletter of the ASQ Statistics Division*, vol. 36, no. 1, 2017. https://asq.org/statistics/2016/10/statistics/statistics-digest-february-2017.pdf

3. Wheeler, Donald. “Gauge R&R Methods Compared.” *Quality Digest*, Dec. 2, 2013. https://www.qualitydigest.com/inside/quality-insider-article/gauge-rr-methods-compared.html

4. Automotive Industry Action Group. *Measurement Systems Analysis*, Fourth Edition, Section C: Attribute Measurement Systems Study, 2010. https://www.aiag.org/store/publications/details?ProductCode=MSA-4

5. Muelaner, Jody. “Attribute Gage Uncertainty.” Engineering.com, May 2, 2019. https://www.engineering.com/story/attribute-gage-uncertainty

Readers may wish to simulate attribute gage studies for training and other purposes. This can be done in Excel as follows. Table 1 shows that it is possible to introduce a bias for each gage and/or inspector to make matters even more interesting and, more important, shows how the study can expose the problem in question. This example includes a bias of 0.005 for the gage at the lower specification limit.

The first step is to simulate 50 measurements from a normal distribution whose mean is 0.5 and whose standard deviation is 0.0333. This can be done easily with Excel’s built-in random number generator. These are the reference values for the parts, and they appeared in Column B of figure 2.

The rest of the row is completed as follows.

Reference (column C) =IF(AND(B22<USL,LSL<B22),1,0). If the reference value in cell B22 is inside the specification limits, the result is 1 (pass); otherwise it is 0 (fail).

Code (column D) requires the information in columns E through M. =IF(SUM(E22:M22)=9,"+",IF(SUM(E22:M22)=0,"-","x")), which means that if the inspectors pass the parts nine out of nine times (three for each inspector), a + is recorded to indicate Zone III. If they fail the parts nine out of nine times, a - is recorded to indicate Zone I. Otherwise an x is recorded to indicate Zone II.

A-1 (Column E) is assessed as follows. =IF($B22<Target,IF($B22+Lower_gage_bias+bias_A+sigma_gage*NORMSINV(RAND())<LSL,0,1),IF($B22+Upper_gage_bias+bias_A+sigma_gage*NORMSINV(RAND())>USL,0,1)), which can be broken down as follows:

• =IF($B22<Target,IF($B22+Lower_gage_bias+bias_A+sigma_gage*NORMSINV(RAND())<LSL,0,1) means that, if the reference value is less than the target or nominal, then add to it 1) the bias, if any, for the smaller go/no-go gage; 2) the bias, if any, for inspector A; and 3) the gage standard deviation multiplied by a randomly-generated standard normal deviate or z score. *This is the measurement that the gage perceives*. If this is less than the LSL, then the inspector rejects the part (0); if not, the inspector accepts the part (1).

• IF($B22+Upper_gage_bias+bias_A+sigma_gage*NORMSINV(RAND())>USL,0,1)) is used when the reference value is larger than the target, so if the part is rejected, it will be by the larger gage. Add to the reference value 1) the bias, if any, of the larger gage; 2) the bias, if any, of the inspector; and 3) the gage standard deviation multiplied by a randomly generated standard normal deviate or z score. If the result exceeds the USL, then the inspector rejects the part (0) and otherwise accepts it (1).

• A-2 and A-3 are determined similarly.

• B-1, B-2, and B-3 are the same except they use the bias for inspector B. C-1, C-2, and C-3 use the bias for inspector C.
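The same simulation can be sketched in Python instead of Excel. The parameter values follow the text (gage standard deviation 0.004, a 0.005 bias for the lower gage); the zero inspector biases are an assumption you can change to study appraiser effects:

```python
import random

LSL, USL, TARGET = 0.45, 0.55, 0.50
SIGMA_PROCESS, SIGMA_GAGE = 0.0333, 0.004
LOWER_GAGE_BIAS, UPPER_GAGE_BIAS = 0.005, 0.0
INSPECTOR_BIAS = {"A": 0.0, "B": 0.0, "C": 0.0}

rng = random.Random(1)  # fixed seed, so results stay put between runs

def assess(ref: float, inspector: str) -> int:
    """One pass/fail assessment of a part: 1 = accept, 0 = reject."""
    noise = INSPECTOR_BIAS[inspector] + rng.gauss(0.0, SIGMA_GAGE)
    if ref < TARGET:  # rejection, if any, would come from the smaller gage
        return 0 if ref + LOWER_GAGE_BIAS + noise < LSL else 1
    return 0 if ref + UPPER_GAGE_BIAS + noise > USL else 1

rows = []
for part in range(1, 51):
    ref = rng.gauss(TARGET, SIGMA_PROCESS)        # reference value
    results = [assess(ref, i) for i in "ABC" for _ in range(3)]
    reference = 1 if LSL < ref < USL else 0       # in or out of specification
    code = "+" if sum(results) == 9 else "-" if sum(results) == 0 else "x"
    rows.append((part, round(ref, 4), reference, code, *results))

print(rows[0])  # first row of the simulated attribute study data sheet
```

The fixed seed serves the same purpose as pasting the Excel table as values: the simulated study does not change each time it is recalculated.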

Note that, once you generate results, you might want to copy the entire table as cell values only because Excel seems to recalculate the RAND() results every time a new calculation is performed. Another way to do this job is to generate an array of random numbers from a uniform distribution (0,1), i.e., 0 to 1 not inclusive of 0 and 1, and then use them in place of the RAND() function. Then the simulation results will not change until you create a new set of random numbers.

Kappa is calculated from cross-tabulations, as shown for inspectors A and B in table 2, and it is obtained by comparing the same assessment (first, second, or third) from each inspector. That is, A-1 is compared to B-1, A-2 to B-2, and A-3 to B-3, where 1, 2, and 3 refer to each inspector’s first, second, and third measurement, respectively.

In this case:

• The inspectors agreed to reject (0) the part 16 times.

• Inspector B accepted a part A rejected five times.

• Inspector A accepted a part B rejected three times.

• The inspectors agreed to accept (1) the part 126 times.

• The expected number of rejection agreements is 150 times the probability that both inspectors reject the part; that is, 150 × (21/150) × (19/150), because A’s total rejections add to 21, and B’s add to 19. The result is 2.66, so we expect A to agree with B about rejections for 2.66 parts. These are indicated in parentheses, and note that the expected counts must add up to the actual counts both horizontally and vertically.

• The expected number of acceptance agreements is similarly 150 × (129/150) × (131/150), or 112.66, where 129 and 131 are the acceptance totals for inspectors A and B, respectively.

Kappa measures interrater agreement, and the concept is similar to that of a contingency table that uses the chi square statistic with one degree of freedom. The AIAG manual warns, however, that “Kappa is a measure rather than a test” and adds that “a large number of parts covering the entire spectrum of possibilities is necessary.” This brings us back to the observation that, if all the parts are (for example) in Zone III, interrater agreement will be perfect by definition while agreement will be poor if all the parts are in Zone II. *Remember also that gage variation and/or bias will affect the agreement of all the inspectors with the reference values*. Kappa is computed as follows for Inspector A and Inspector B.

• p_{0} = sum of observed probabilities in the diagonal cells from top left to lower right, i.e. (16+126)/150 = 0.9466

• p_{e} = sum of expected probabilities in the same cells, i.e. (2.66+112.66)/150 = 0.7688

• κ = (p_{0} − p_{e})/(1 − p_{e}) = (0.9466 − 0.7688)/(1 − 0.7688) = 0.769, which matches the result from StatGraphics.
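The kappa calculation can be confirmed with a few lines of Python, using the counts from table 2:

```python
# 2x2 cross-tabulation of inspectors A and B.
both_reject, a_rej_b_acc = 16, 5
b_rej_a_acc, both_accept = 3, 126
n = both_reject + a_rej_b_acc + b_rej_a_acc + both_accept  # 150

a_rejects = both_reject + a_rej_b_acc  # 21
b_rejects = both_reject + b_rej_a_acc  # 19

# Observed vs. chance-expected agreement on the diagonal cells.
p_observed = (both_reject + both_accept) / n
p_expected = (a_rejects * b_rejects + (n - a_rejects) * (n - b_rejects)) / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 3))  # 0.769
```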
