A Method for Analysis of Categorical Data for
Robust Product or Process Design
Serkan Erdural¹, Gülser Köksal¹,* and Özlem İlk²
¹ Industrial Engineering Department, Middle East Technical University, 06531 Ankara, Turkey
² Statistics Department, Middle East Technical University, 06531 Ankara, Turkey
* Corresponding author, e-mail address: koksal@ie.metu.edu.tr
Summary. In industrial processes, it is very important to decrease variation
while keeping the average on target. Identifying and using robust
product and process parameter levels that are insensitive to sources of
variation provide a serious competitive advantage to producers. A quality
response of products or processes is typically measured quantitatively on
an interval or ratio scale. However, in some cases, the response has to be
measured qualitatively on a nominal or ordinal scale. Many effective
methods based on statistical design and analysis of experiments have been
developed and used to find the robust levels of product and process parameters when the response variables are continuous. However, methods
proposed in the literature to find robust parameter levels when the response
variable is categorical are considerably limited in variety and performance.
This paper offers a simple and effective method for the analysis of categorical response data for robust product or process design. An example
demonstrates how the method handles both location and dispersion effects to explore robust settings.
Key words: Categorical data, robust design, general linear models.
1. Introduction
Variation in the performance of a product degrades the product’s
quality. Today’s producers constantly search for better ways of producing
goods and services at target performance each time. One of these approaches is robust design (also known as parameter design or optimization). The fundamental principle of robust design is that there exist certain
levels of product and process parameters at which performance is insensitive to sources of
variation. These levels are sought through statistical design and analysis of
experiments. This is a very cost-effective approach introduced by Taguchi
(Phadke, 1989) and extended and successfully applied by many since then.
A quality response of products or processes is typically measured
quantitatively on an interval or ratio scale. However, in some cases, the response has to be measured qualitatively using a nominal or ordinal scale.
Quality characteristics that require expert judgment or comparison to a
standard or go/no-go gauges in their measurement are of the latter type.
Manufacturers find it easy and cheap to assess quality on an ordinal scale
such as “very good”, “good” and “bad”, or “pass” and “fail”.
Many approaches have been proposed to find robust parameter levels when the response variables are measured on an interval or ratio scale
(Kackar, 1985; Phadke, 1989). To investigate the relationship between the
response and the product or process parameters (control factors) under the
influence of uncontrollable (noise) factors, conventional experimental design techniques are employed. The data collected through experimentation
are typically used to estimate the quality characteristic’s (or the response
variable’s) mean and variance for given parameter levels. Then, a search is
performed to find the parameter levels that yield the minimum variance
and bring the mean to the target. Taguchi’s method (Taguchi and Wu,
1980) makes this search considerably easier for practitioners by suggesting
that they choose the levels that maximize the so-called signal-to-noise ratios
(SNR). The SNR is an appropriate function of the mean and the variance that
allows minimization of the expected quality loss to the customers.
Techniques available for the robust design analysis of ordinal categorical data include Accumulation Analysis (AA), scoring methods, Generalized Linear models and Bayesian Analysis (BA).
Accumulation Analysis was introduced by Taguchi (1974) for analyzing ordered categorical data from industrial experiments. It is an
ANOVA-like approach using cumulative frequencies of the response categories. AA has been heavily criticized by Nair (1986), Hamada and Wu
(1986, 1990) and Box and Jones (1986). The main pitfalls of the method
are: (1) its cumulative frequencies do not satisfy the necessary model assumptions, (2) factor effects become dependent, (3) sometimes it detects
spurious factor effects, (4) it detects a mixture of location and dispersion
effects.
Another method for analyzing ordered categories is to assign
scores to the ordered categories and perform ANOVA on these scores.
This method is simple, but it has two shortcomings: (1) the
scores are not continuous either, and (2) the results depend entirely
on the scores assigned. (See Nair (1986) and Hamada and Wu (1990) for
further discussion.)
Logistic regression, which is a special case of generalized linear
models, is commonly used for analyzing categorical data. It uses a link
function, and the parameters are estimated by maximum likelihood (McCullagh, 1980). This method generally analyzes location effects, but gives less information about dispersion.
Chipman and Hamada (1996) introduce Bayesian Analysis for ordered categorical data. It is a powerful technique for analyzing both location
and dispersion effects. Using the Gibbs sampling algorithm, they sample
from the posterior distribution of the factor coefficients of the generalized
linear models. The approach has many advantages, but its main disadvantage
is that it requires complex computer applications and expert knowledge
to determine the priors. It is therefore difficult for practitioners to implement.
In section 2, we propose a new method for analyzing both location
and dispersion effects. An illustrative case study is presented in section 3
with a comparison of the results of the proposed method to those of AA
and BA approaches.
2. Proposed Approach
The proposed approach for analysis of ordinal categorical data for
robust product or process design is given as follows.
1) Generate an appropriate experimental design and collect data:
By considering the factors of the design problem, generate an appropriate (fractional factorial) experimental design. Conduct the experiments
and collect the response data.
2) Fit an ordinal categorical regression model and calculate event probabilities for each category: By using ordinal categorical regression, fit
a model that estimates the event probabilities for each category. The
model should be as follows:
$$\mathrm{Link}\bigl(P(Y_i \le j)\bigr) = \gamma_j + \beta' X_i \qquad (1)$$
where $j$ is the category ($j = 0, 1, \ldots, J-1$), $\gamma_j$ is the cut point (constant), $\beta$
is the vector of coefficients and $X_i$ is the vector of the control factors’
levels at combination $i$ of the experiment.
3) Estimate expected category for each factor combination: By using the factor level combinations in step 1 and the estimated event probabilities for
each category in step 2, estimate the expected category and the variance for each factor combination $i$ of the experiment as follows:
$$E(Y_i) = \sum_{j=0}^{J-1} j\,P(Y_i = j), \qquad V(Y_i) = E(Y_i^2) - [E(Y_i)]^2 \qquad (2)$$
4) Calculate the SNR: Calculate Taguchi’s signal-to-noise ratios using
$E(Y_i)$ and $V(Y_i)$ calculated in step 3 for each factor combination $i$. For
instance, the SNR for a smaller-the-better type of response is
$$\mathrm{SNR}_i = -10 \log_{10}\left[(E(Y_i))^2 + V(Y_i)\right] \qquad (3)$$
(For other types of SNR, see Phadke, 1989.)
5) Find the optimal factor levels that maximize the SNR: By using
ANOVA, main effects and interaction effects, find the optimal factor
levels that maximize SNR, hence achieve the minimum variance. If the
mean is not at the target at these factor levels, use the factors not significantly affecting the variance (or the SNR) to bring the mean to the
target.
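Steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a logit link in equation (1) and a base-10 logarithm in equation (3), and the function names and the toy cut points and coefficient below are hypothetical.

```python
import math

def category_probabilities(cut_points, beta, x):
    """P(Y = j), j = 0..J-1, from a cumulative model with an assumed logit link.

    cut_points holds the J-1 increasing cut points; beta and x have equal length.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cum = [1.0 / (1.0 + math.exp(-(g + eta))) for g in cut_points] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def expected_category(probs):
    """E(Y) as in equation (2)."""
    return sum(j * p for j, p in enumerate(probs))

def category_variance(probs):
    """V(Y) = E(Y^2) - [E(Y)]^2 as in equation (2)."""
    second_moment = sum(j * j * p for j, p in enumerate(probs))
    return second_moment - expected_category(probs) ** 2

def snr_smaller_the_better(probs):
    """Smaller-the-better SNR as in equation (3), with log taken to base 10."""
    mean = expected_category(probs)
    return -10.0 * math.log10(mean ** 2 + category_variance(probs))

# Hypothetical three-category model with one control factor set to +1.
probs = category_probabilities(cut_points=[-1.0, 1.0], beta=[0.5], x=[1.0])
print(probs, expected_category(probs), category_variance(probs),
      snr_smaller_the_better(probs))
```

The SNR is undefined when all probability mass falls on category 0, since the argument of the logarithm is then zero; a practical implementation would need to decide how to handle that case.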
3. Illustrative Case Study
In this section, the “Foam Molding Experiment” data, originally
analyzed by Jinks (1987) and Chipman and Hamada (1996), are used to illustrate the proposed approach. The data arise from an experiment to reduce voids in a urethane-foam product. The response can be “good”, “acceptable”, or “poor”, and all the design variables are at two levels, -1 and 1.
The design is a fractional factorial eight-run control array crossed with a four-run
noise array. There are seven control factors, A, B, C, D, E, F, G, and two
noise factors, H and I. The experimental layout and the collected data are
given in Table 1.
Table 1. Foam Molding Experiment Design and Frequencies for Good (0), OK (1)
and Poor (2)

                                  H =  1       H =  1       H = -1       H = -1
                                  I =  1       I =  1       I = -1       I = -1
  A   B   C   D   E   F   G      0   1   2    0   1   2    0   1   2    0   1   2
 -1  -1  -1  -1  -1  -1  -1      3   6   1    6   4   0    1   4   5    0  10   0
 -1  -1  -1   1   1   1   1      0   3   7    3   4   3    0   6   4    0   7   3
 -1   1   1  -1  -1   1   1      0   0  10    0   1   9    0   0  10    0   0  10
 -1   1   1   1   1  -1  -1      0   0  10    0  10   0    0   3   7    0   9   1
  1  -1   1  -1   1  -1   1      3   5   2    3   7   0    3   5   2    1   6   3
  1  -1   1   1  -1   1  -1      2   8   0    4   5   1    0   5   5    1   5   4
  1   1  -1  -1   1   1  -1      2   7   1    2   5   3    2   7   1    1   6   3
  1   1  -1   1  -1  -1   1      0   4   6    1   7   2    0   4   6    0   3   7
The data in Table 1 have been modeled by the ordinal logistic regression method (Agresti, 2002). Noise factors H and I are not included directly in the model, but all of the data obtained from the four replicates are
used in modeling. The analysis results show that factors A, B, C, E, F and
G are statistically significant. The logistic regression equations for the
event probabilities of the categories are given as:
$$\begin{aligned}
\mathrm{Logit}[P(Y = 0)] &= -2.59611 + 0.693708\,A - 0.912559\,B - 0.488463\,C + 0.523686\,E - 0.513168\,F - 0.768099\,G \\
\mathrm{Logit}[P(Y \le 1)] &= 0.360502 + 0.693708\,A - 0.912559\,B - 0.488463\,C + 0.523686\,E - 0.513168\,F - 0.768099\,G
\end{aligned} \qquad (4)$$
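To make the use of equations (4) concrete, the following sketch evaluates them for a given setting of the six significant factors. The cut points and coefficients are copied from (4); the function and variable names are illustrative assumptions, and the inverse logit is applied to each cumulative equation before differencing.

```python
import math

# Cut points and coefficients copied from equations (4).
CUT_POINTS = (-2.59611, 0.360502)
BETA = {"A": 0.693708, "B": -0.912559, "C": -0.488463,
        "E": 0.523686, "F": -0.513168, "G": -0.768099}

def logistic(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def foam_probabilities(levels):
    """Return [P(Y=0), P(Y=1), P(Y=2)] for coded levels (-1 or 1) of A, B, C, E, F, G."""
    eta = sum(BETA[name] * levels[name] for name in BETA)
    p_le0 = logistic(CUT_POINTS[0] + eta)   # P(Y = 0) = P(Y <= 0)
    p_le1 = logistic(CUT_POINTS[1] + eta)   # P(Y <= 1)
    return [p_le0, p_le1 - p_le0, 1.0 - p_le1]

# All six significant factors at their low level (-1).
all_low = dict.fromkeys("ABCEFG", -1)
print([round(p, 3) for p in foam_probabilities(all_low)])
```

For the all-low setting, the three probabilities should agree with the first row of Table 2 up to rounding.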
For each experiment trial, the event probabilities of each category
are calculated by using the equations in (4). Then, by using the equations in
(2), the expected category and the variance of the category are obtained. These
results are presented in Table 2.
In the problem, achieving the smallest category each time is desired. Hence a parameter optimization that will minimize both the mean
and the variance is needed. Taguchi’s SNR in equation (3) is a suitable
choice to solve the problem. Table 2 shows the calculated SNR for each
experiment trial.
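As a worked check of equations (2) and (3), the first trial can be computed by hand from the rounded probabilities reported for it in Table 2 (0.244, 0.617, 0.139). The logarithm is assumed to be base 10, and small differences from the tabulated SNR come from rounding.

```latex
\begin{aligned}
E(Y_1)   &= 0(0.244) + 1(0.617) + 2(0.139) = 0.895 \\
E(Y_1^2) &= 0^2(0.244) + 1^2(0.617) + 2^2(0.139) = 1.173 \\
V(Y_1)   &= 1.173 - (0.895)^2 \approx 0.372 \\
\mathrm{SNR}_1 &= -10\log_{10}\left[(0.895)^2 + 0.372\right]
               \approx -10\log_{10}(1.173) \approx -0.69
\end{aligned}
```

Because $(E(Y_i))^2 + V(Y_i) = E(Y_i^2)$, the smaller-the-better SNR in (3) depends on the category probabilities only through the second moment of the category.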
Table 2. Estimates for probabilities of categories, expected category, category
variance and SNR

         Factors                      P(Yi=j)
  A   B   C   E   F   G      j=0     j=1     j=2     E(Yi)   V(Yi)    SNRi
 -1  -1  -1  -1  -1  -1     0.244   0.617   0.139    0.895   0.372   -0.691
 -1  -1  -1   1   1   1     0.066   0.511   0.423    1.357   0.362   -3.430
 -1   1   1  -1   1   1     0.002   0.027   0.972    1.970   0.032   -5.926
 -1   1   1   1  -1  -1     0.053   0.465   0.482    1.429   0.351   -3.791
  1  -1   1   1  -1   1     0.230   0.622   0.148    0.919   0.372   -0.847
  1  -1   1  -1   1  -1     0.148   0.622   0.230    1.081   0.372   -1.878
  1   1  -1   1   1  -1     0.175   0.628   0.196    1.021   0.371   -1.504
  1   1  -1  -1  -1   1     0.043   0.420   0.537    1.494   0.336   -4.096
Fig. 1. Main Effects of the Factors on SNR (panels for factors A, B and G; vertical axis: mean of SNR; horizontal axis: factor levels -1 and 1)
In order to obtain a smaller mean and variance of the categories, the
SNR values should be maximized. ANOVA of the SNR data shows
that factors A, B and G are significant. Figure 1 shows the main effects of
these factors. Levels for factors A, B and G should be selected as +, -, -,
respectively, to achieve maximum SNR. From the logistic regression results
and the equations in (4), factors C, E and F also have significant location effects on the response. The -, +, - levels of these factors, respectively, should be chosen for the
smallest mean value. As a result, the optimum factor levels to achieve
minimum mean and variance are identified as (A,B,C,E,F,G) = (+,-,-,+,-,-).
At these levels, we estimate P(Good) = 0.79 and V(P(Good)) = 0.0069.
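The level selection for A, B and G can be illustrated by averaging the SNR values of Table 2 at each level of a factor. This sketch covers only the main-effect means plotted in Figure 1, not the ANOVA significance tests; the data layout and function name are illustrative assumptions.

```python
# SNR values from Table 2 with the coded levels of A, B and G for each trial.
runs = [
    {"A": -1, "B": -1, "G": -1, "snr": -0.691},
    {"A": -1, "B": -1, "G":  1, "snr": -3.430},
    {"A": -1, "B":  1, "G":  1, "snr": -5.926},
    {"A": -1, "B":  1, "G": -1, "snr": -3.791},
    {"A":  1, "B": -1, "G":  1, "snr": -0.847},
    {"A":  1, "B": -1, "G": -1, "snr": -1.878},
    {"A":  1, "B":  1, "G": -1, "snr": -1.504},
    {"A":  1, "B":  1, "G":  1, "snr": -4.096},
]

def main_effect(runs, factor):
    """Mean SNR at the low (-1) and high (+1) level of one factor."""
    means = {}
    for level in (-1, 1):
        values = [r["snr"] for r in runs if r[factor] == level]
        means[level] = sum(values) / len(values)
    return means

best = {}
for factor in ("A", "B", "G"):
    means = main_effect(runs, factor)
    best[factor] = max(means, key=means.get)  # level with the larger mean SNR
    print(factor, {lvl: round(m, 3) for lvl, m in means.items()},
          "-> choose", best[factor])
```

Picking the level with the larger mean SNR for each factor gives A at +1 and B and G at -1, in line with the levels selected above.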
Table 3 presents a summary of the results of the proposed approach
as well as the Accumulation Analysis and the Bayesian Approach. AA
provides the worst estimate for the mean and no information about the
variance. It also detects a spurious effect of factor D. AA provides a mixture
of location and dispersion information; therefore, it cannot detect
these effects separately. BA provides much better estimates, as discussed by Chipman and Hamada (1996) in detail. The proposed method
yields almost the same results as BA, except that it finds that factors C, E
and F significantly affect only the mean.
Table 3. Comparison of the Proposed Approach to the Accumulation Analysis and
the Bayesian Approach

Approach                Significant Factors & Levels                                  P(Good)   V(P(Good))
Accumulation Analysis   A+, B-, C-, D-, E+, F-, G- (location & dispersion effects)    0.45      -
Bayesian Approach       A+, B-, C-, E+, F-, G- (location & dispersion effects)        0.80      0.0076
Proposed Approach       A+, B-, G- (location & dispersion effects);                   0.79      0.0069
                        C-, E+, F- (location effects)
4. Summary and Conclusions
In this study, a method is developed to analyze categorical data for
the purpose of easy and effective robust product or process design. As the
example shows, the method is fast, easy to implement and almost as effective as the more comprehensive Bayesian Approach. The effectiveness of the method in finding the true location and dispersion effects
and in estimating the mean and the variance of the response can be increased further. For this purpose, empirical models of E(Yi) and V(Yi) can be developed in terms of the control factors, using equations (1) and (2), and
then the optimal factor levels can be sought through non-linear programming or response surface optimization. The authors are currently working on
these extensions and conducting more real-life case studies.
References
Agresti A (2002) Categorical Data Analysis. John Wiley & Sons Inc. Hoboken
New Jersey
Chipman H, Hamada M (1996) Bayesian Analysis of Ordered Categorical Data
From Industrial Experiments. Technometrics 38(1):1-10
Hamada M, Wu CFJ (1986) Should Accumulation Analysis and Related Methods
be Used for Industrial Experiments? Discussion of Testing in Industrial Experiments with Ordered Categorical Data by VN Nair. Technometrics 28:302-306
Hamada M, Wu CFJ (1990) A Critical Look at Accumulation Analysis and Related Methods.
Technometrics 32:119-162
Jinks J (1987) Reduction of Voids in a Urethane-Foam Product. In Fifth Symposium on Taguchi Methods Dearborn, MI: American Suppliers Institute Inc, pp
135-148
Kackar RN (1985) Off-Line Quality Control, Parameter Design, and the Taguchi
Method. Journal of Quality Technology 17(4):176-188
McCullagh P (1980) Regression Models for Ordinal Data. Journal of the Royal
Statistical Society B 42:109-142
Nair VN (1986) Testing in Industrial Experiments With Ordered Categorical Data.
Technometrics 28(4):283-291
Phadke MS (1989) Quality Engineering Using Robust Design. Prentice-Hall,
Englewood Cliffs, New Jersey, USA
Taguchi G (1974) A New Statistical Analysis for Clinical Data, the Accumulating
Analysis, in Contrast With the Chi-Square Test. Suishin Igaku 29:806-813