A Method for Analysis of Categorical Data for Robust Product or Process Design

Serkan Erdural1, Gülser Köksal1,* and Özlem İlk2

1 Industrial Engineering Department, Middle East Technical University, 06531 Ankara, Turkey
2 Statistics Department, Middle East Technical University, 06531 Ankara, Turkey
* Corresponding author, e-mail address: koksal@ie.metu.edu.tr

Summary. In industrial processes, it is very important to reduce variation while achieving the target on average. Identifying and using product and process parameter levels that are robust, that is, insensitive to sources of variation, gives producers a serious competitive advantage. A quality response of a product or process is typically measured quantitatively on an interval or ratio scale. In some cases, however, the response has to be measured qualitatively on a nominal or ordinal scale. Many effective methods based on statistical design and analysis of experiments have been developed and used to find robust levels of product and process parameters when the response variables are continuous. However, the methods proposed in the literature for finding robust parameter levels when the response variable is categorical are considerably limited in variety and performance. This paper offers a simple and effective method for the analysis of categorical response data for robust product or process design. It demonstrates on an example how the method handles both location and dispersion effects to explore robust settings, and shows that the method is simple and effective.

Key words: Categorical data, robust design, general linear models.

1. Introduction

Variation in the performance of a product degrades the product's quality. Today's producers constantly search for better ways of producing goods and services at target performance each time. One of these approaches is robust design (also known as parameter design or optimization).
The fundamental principle of robust design is that there exist certain levels of product and process parameters at which performance is insensitive to sources of variation. These levels are sought through statistical design and analysis of experiments. This very cost-effective approach was introduced by Taguchi (Phadke, 1989) and has since been extended and successfully applied by many others.

A quality response of a product or process is typically measured quantitatively on an interval or ratio scale. In some cases, however, the response has to be measured qualitatively on a nominal or ordinal scale. Quality characteristics whose measurement requires expert judgment, comparison to a standard, or go/no-go gauges are of the latter type. Manufacturers find it easy and cheap to assess quality on an ordinal scale such as "very good", "good" and "bad", or "pass" and "fail".

Many approaches have been proposed to find robust parameter levels when the response variables are measured on an interval or ratio scale (Kackar, 1985; Phadke, 1989). Conventional experimental design techniques are employed to investigate the relationship between the response and the product or process parameters (control factors) under the influence of uncontrollable (noise) factors. The data collected through experimentation are typically used to estimate the mean and variance of the quality characteristic (the response variable) at given parameter levels. Then, a search is performed to find the parameter levels that yield the minimum variance and bring the mean to the target. Taguchi's method (Taguchi and Wu, 1980) makes this search considerably easier for practitioners by suggesting that they choose the levels that maximize the so-called signal-to-noise ratios (SNR). An SNR is an appropriate function of the mean and the variance that allows minimization of the expected quality loss to the customers.
Techniques available for the robust design analysis of ordinal categorical data include Accumulation Analysis (AA), scoring methods, generalized linear models and Bayesian Analysis (BA).

Accumulation Analysis was introduced by Taguchi (1974) for analyzing ordered categorical data from industrial experiments. It is an ANOVA-like approach that uses cumulative frequencies of the response categories. AA has been heavily criticized by Nair (1986), Hamada and Wu (1986, 1990) and Box and Jones (1986). The main pitfalls of the method are: (1) its cumulative frequencies do not satisfy the necessary model assumptions, (2) factor effects become dependent, (3) it sometimes detects spurious factor effects, and (4) it detects a mixture of location and dispersion effects.

Another method for analyzing ordered categories is to assign scores to the ordered categories and perform ANOVA on these scores. This method is simple, but it has two shortcomings: (1) the scored categories are not continuous either, and (2) the results depend entirely on the scores assigned. (See Nair (1986) and Hamada and Wu (1990) for further discussion.)

Logistic regression, a special type of generalized linear model, is commonly used for analyzing categorical data. It relates the category probabilities to the factors through a link function, and its parameters are estimated by maximum likelihood (McCullagh, 1980). This method generally analyzes location effects but gives less information about dispersion.

Chipman and Hamada (1996) introduced Bayesian Analysis for ordered categorical data. It is a powerful technique for analyzing both location and dispersion effects. Using the Gibbs sampling algorithm, they sample from the posterior distribution of the factor coefficients of the generalized linear model. The approach has many advantages, but its main disadvantage is that it requires complex computer applications and expert knowledge to determine the priors, which makes it difficult for practitioners to implement.
In section 2, we propose a new method for analyzing both location and dispersion effects. An illustrative case study is presented in section 3, where the results of the proposed method are compared with those of the AA and BA approaches.

2. Proposed Approach

The proposed approach for the analysis of ordinal categorical data for robust product or process design is as follows.

1) Generate an appropriate experimental design and collect data: Considering the factors of the design problem, generate an appropriate (fractional factorial) experimental design. Conduct the experiments and collect the response data.

2) Fit an ordinal categorical regression model and calculate event probabilities for each category: Using ordinal categorical regression, fit a model that estimates the event probabilities for each category. The model should have the form

$$
\mathrm{Link}\big(P(Y_i \le j)\big) = \gamma_j + \beta' X_i \qquad (1)
$$

where $j$ indexes the ordered response categories ($j = 0, 1, \ldots, J-1$; the cumulative model is needed only for $j = 0, \ldots, J-2$, since $P(Y_i \le J-1) = 1$), $\gamma_j$ is the cut point (constant) for category $j$, $\beta$ is the vector of coefficients, and $X_i$ is the vector of the control factors' levels at combination $i$ of the experiment.

3) Estimate the expected category for each factor combination: Using the factor level combinations from step 1 and the estimated event probabilities for each category from step 2, estimate the expected category and the variance for each factor combination $i$ of the experiment as follows:

$$
E(Y_i) = \sum_{j=0}^{J-1} j\,P(Y_i = j), \qquad V(Y_i) = E(Y_i^2) - [E(Y_i)]^2 \qquad (2)
$$

where $E(Y_i^2) = \sum_{j=0}^{J-1} j^2\,P(Y_i = j)$.

4) Calculate the SNR: Calculate Taguchi's signal-to-noise ratios using the E(Yi) and V(Yi) obtained in step 3 for each factor combination i. For instance, the SNR for a smaller-the-better type of response is

$$
\mathrm{SNR}_i = -10 \log_{10}\left[(E(Y_i))^2 + V(Y_i)\right] \qquad (3)
$$

(For other types of SNR, see Phadke, 1989.)

5) Find the optimal factor levels that maximize the SNR: Using ANOVA, main effects and interaction effects, find the optimal factor levels that maximize the SNR and hence achieve the minimum variance.
If the mean is not at the target at these factor levels, use the factors that do not significantly affect the variance (or the SNR) to bring the mean to the target.

3. Illustrative Case Study

In this section, the "Foam Molding Experiment" data, originally analyzed by Jinks (1987) and Chipman and Hamada (1996), are used to illustrate the proposed approach. The data arise from an experiment to reduce voids in a urethane-foam product. The response can be "good", "acceptable", or "poor", and all the design variables are at two levels, -1 and 1. The design is an eight-run fractional factorial control array crossed with a four-run noise array. There are seven control factors, A, B, C, D, E, F and G, and two noise factors, H and I. The experimental layout and the collected data are given in Table 1.

Table 1. Foam Molding Experiment Design and Frequencies for Good (0), OK (1) and Poor (2)

                                 H:    1            1           -1           -1
                                 I:    1            1           -1           -1
 A   B   C   D   E   F   G       0   1   2    0   1   2    0   1   2    0   1   2
-1  -1  -1  -1  -1  -1  -1       3   6   1    6   4   0    4   5   1    0  10   0
-1  -1  -1   1   1   1   1       0   3   7    3   4   3    0   6   4    0   7   3
-1   1   1  -1  -1   1   1       0   0  10    0   1   9    0   0  10    0   0  10
-1   1   1   1   1  -1  -1       0   0  10    0  10   0    0   3   7    0   9   1
 1  -1   1  -1   1  -1   1       3   5   2    3   7   0    3   5   2    1   6   3
 1  -1   1   1  -1   1  -1       2   8   0    4   5   1    0   5   5    1   5   4
 1   1  -1  -1   1   1  -1       2   7   1    2   5   3    2   7   1    1   6   3
 1   1  -1   1  -1  -1   1       0   4   6    1   7   2    0   4   6    0   3   7

The data in Table 1 have been modeled by the ordinal logistic regression method (Agresti, 2002). Noise factors H and I are not included directly in the model, but all of the data obtained from the four replicates are used in modeling. The analysis results show that factors A, B, C, E, F and G are statistically significant.
The logistic regression equations for the event probabilities of the categories are given as:

$$
\begin{aligned}
\mathrm{Logit}[P(Y = 0)] &= -2.59611 + 0.693708\,A - 0.912559\,B - 0.488463\,C + 0.523686\,E - 0.513168\,F - 0.768099\,G \\
\mathrm{Logit}[P(Y \le 1)] &= 0.360502 + 0.693708\,A - 0.912559\,B - 0.488463\,C + 0.523686\,E - 0.513168\,F - 0.768099\,G
\end{aligned}
\qquad (4)
$$

For each experimental trial, the event probabilities of each category are calculated using the equations in (4). The expected category and the variance around it are then obtained using the equations in (2). These results are presented in Table 2.

In this problem, it is desired to achieve the smallest category each time. Hence, a parameter optimization that minimizes both the mean and the variance is needed. Taguchi's smaller-the-better SNR in equation (3) is a suitable choice for this purpose. Table 2 shows the calculated SNR for each experimental trial.

Table 2. Estimates for probabilities of categories, expected category, category variance and SNR

        Factors                     P(Yi = j)
 A   B   C   E   F   G      j=0     j=1     j=2     E(Yi)   V(Yi)    SNRi
-1  -1  -1  -1  -1  -1     0.244   0.617   0.139   0.895   0.372   -0.691
-1  -1  -1   1   1   1     0.066   0.511   0.423   1.357   0.362   -3.430
-1   1   1  -1   1   1     0.002   0.027   0.972   1.970   0.032   -5.926
-1   1   1   1  -1  -1     0.053   0.465   0.482   1.429   0.351   -3.791
 1  -1   1   1  -1   1     0.230   0.622   0.148   0.919   0.372   -0.847
 1  -1   1  -1   1  -1     0.148   0.622   0.230   1.081   0.372   -1.878
 1   1  -1   1   1  -1     0.175   0.628   0.196   1.021   0.371   -1.504
 1   1  -1  -1  -1   1     0.043   0.420   0.537   1.494   0.336   -4.096

Fig. 1. Main Effects of the Factors on SNR (panels for factors A, B and G; vertical axis: mean of SNR; horizontal axis: factor levels -1 and 1)

In order to obtain a smaller mean and variance of the categories, the SNR values should be maximized. ANOVA of the SNR data shows that factors A, B and G are significant. Figure 1 shows the main effects of these factors. The levels of factors A, B and G should be selected as +, -, -, respectively, to achieve the maximum SNR.
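The calculations behind Table 2 and Fig. 1 can be retraced with a short Python sketch. It is an illustration only, not the software used in the study: the function and variable names are hypothetical, the cut points and coefficients are copied from equation (4), the run layout is the eight-run control array restricted to the six significant factors, and the logarithm in equation (3) is assumed to be base 10, which agrees with the tabulated SNR values. The main-effect step simply averages the SNR at each factor level; it does not reproduce the ANOVA significance test.

```python
import math

# Cut points (gamma_0, gamma_1) and coefficients copied from equation (4).
CUT_POINTS = [-2.59611, 0.360502]
COEFFS = {"A": 0.693708, "B": -0.912559, "C": -0.488463,
          "E": 0.523686, "F": -0.513168, "G": -0.768099}

# Levels of the six significant control factors in the eight experimental runs.
RUNS = [
    {"A": -1, "B": -1, "C": -1, "E": -1, "F": -1, "G": -1},
    {"A": -1, "B": -1, "C": -1, "E": 1, "F": 1, "G": 1},
    {"A": -1, "B": 1, "C": 1, "E": -1, "F": 1, "G": 1},
    {"A": -1, "B": 1, "C": 1, "E": 1, "F": -1, "G": -1},
    {"A": 1, "B": -1, "C": 1, "E": 1, "F": -1, "G": 1},
    {"A": 1, "B": -1, "C": 1, "E": -1, "F": 1, "G": -1},
    {"A": 1, "B": 1, "C": -1, "E": 1, "F": 1, "G": -1},
    {"A": 1, "B": 1, "C": -1, "E": -1, "F": -1, "G": 1},
]


def category_probabilities(levels):
    """Invert the cumulative logits of equation (4) into P(Y = j), j = 0, 1, 2."""
    eta = sum(COEFFS[name] * levels[name] for name in COEFFS)
    cumulative = [1.0 / (1.0 + math.exp(-(cut + eta))) for cut in CUT_POINTS]
    cumulative.append(1.0)  # P(Y <= 2) = 1 for the last category
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


def summarize(levels):
    """Return (probabilities, E(Y), V(Y), SNR) following equations (2) and (3)."""
    probs = category_probabilities(levels)
    mean = sum(j * p for j, p in enumerate(probs))
    second_moment = sum(j * j * p for j, p in enumerate(probs))
    variance = second_moment - mean ** 2
    snr = -10.0 * math.log10(mean ** 2 + variance)  # smaller-the-better SNR
    return probs, mean, variance, snr


def snr_main_effects(factors=("A", "B", "G")):
    """Average the SNR over the runs at each level of the given factors."""
    snrs = [summarize(run)[3] for run in RUNS]
    effects = {}
    for name in factors:
        effects[name] = {}
        for level in (-1, 1):
            values = [s for run, s in zip(RUNS, snrs) if run[name] == level]
            effects[name][level] = sum(values) / len(values)
    return effects


def best_levels(effects):
    """For each factor, pick the level with the larger mean SNR."""
    return {name: max(by_level, key=by_level.get)
            for name, by_level in effects.items()}


if __name__ == "__main__":
    for run in RUNS:
        probs, mean, variance, snr = summarize(run)
        print([round(p, 3) for p in probs],
              round(mean, 3), round(variance, 3), round(snr, 3))
    print(best_levels(snr_main_effects()))
```

Running the script prints one line per run in the column order of Table 2, followed by the level of each of A, B and G that has the larger mean SNR.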
From the logistic regression results and the equations in (4), factors C, E and F also have significant location effects on the response. The -, +, - levels of these factors, respectively, should be chosen for the smallest mean value. As a result, the optimum factor levels to achieve minimum mean and variance are identified as (A,B,C,E,F,G) = (+,-,-,+,-,-). At these levels, we estimate P(Good) = 0.79 and V(P(Good)) = 0.0069.

Table 3 presents a summary of the results of the proposed approach as well as those of the Accumulation Analysis and the Bayesian Approach. AA gives the poorest estimate of the mean and no information about the variance. It also detects a spurious effect of factor D. Because AA provides a mixture of location and dispersion information, it is impossible to detect these effects separately. BA provides much better estimates, as discussed in detail by Chipman and Hamada (1996). The proposed method yields almost the same results as BA, except that it finds that factors C, E and F significantly affect only the mean.

Table 3. Comparison of the Proposed Approach to the Accumulation Analysis and the Bayesian Approach

Approach                Significant Factors & Levels                    P(Good)   V(P(Good))
Accumulation Analysis   A+, B-, C-, D-, E+, F-, G-                      0.45      -
                        (location & dispersion effects)
Bayesian Approach       A+, B-, C-, E+, F-, G-                          0.80      0.0076
                        (location & dispersion effects)
Proposed Approach       A+, B-, G- (location & dispersion effects)      0.79      0.0069
                        C-, E+, F- (location effects)

4. Summary and Conclusions

In this study, a method is developed to analyze categorical data for the purpose of easy and effective robust product or process design. As the example shows, the method is fast, easy to implement and almost as effective as the more comprehensive Bayesian Approach. The effectiveness of the method in finding the true location and dispersion effects and in estimating the mean and the variance of the response can be increased further.
For this purpose, empirical models of E(Yi) and V(Yi) can be developed in terms of the control factors, using equations (1) and (2), and the optimal factor levels can then be sought through non-linear programming or response surface optimization. The authors are currently working on these extensions and conducting more real-life case studies.

References

Agresti A (2002) Categorical Data Analysis. John Wiley & Sons Inc, Hoboken, New Jersey
Chipman H, Hamada M (1996) Bayesian Analysis of Ordered Categorical Data From Industrial Experiments. Technometrics 38(1):1-10
Hamada M, Wu CFJ (1986) Should Accumulation Analysis and Related Methods be Used for Industrial Experiments? Discussion of Testing in Industrial Experiments with Ordered Categorical Data by VN Nair. Technometrics 28:302-306
Hamada M, Wu CFJ (1990) A Critical Look at Accumulation Analysis and Related Methods. Technometrics 32:119-162
Jinks J (1987) Reduction of Voids in a Urethane-Foam Product. In Fifth Symposium on Taguchi Methods, Dearborn, MI: American Suppliers Institute Inc, pp 135-148
Kackar RN (1985) Off-Line Quality Control, Parameter Design, and the Taguchi Method. Journal of Quality Technology 17(4):176-188
McCullagh P (1980) Regression Models for Ordinal Data. Journal of the Royal Statistical Society B 42:109-142
Nair VN (1986) Testing in Industrial Experiments With Ordered Categorical Data. Technometrics 28(4):283-291
Phadke MS (1989) Quality Engineering Using Robust Design. Prentice-Hall, Englewood Cliffs, New Jersey, USA
Taguchi G (1974) A New Statistical Analysis for Clinical Data, the Accumulating Analysis, in Contrast With the Chi-Square Test. Suishin Igaku 29:806-813