EXTRACTING CORRELATED SETS USING THE CHI-SQUARED MEASUREMENT WITHIN N-ARY RELATIONS: AN IMPLEMENTATION

Alain Casali 1, Christian Ernst 2, Franck Gasnier 3, Jamil Stephan 2

1 Université de la Méditerranée / LIF, alain.casali@lif.univ-mrs.fr
2 Ecole des Mines de St Etienne / CMP-GC, ernst@emse.fr – stephan@emse.fr
3 STMicroElectronics Rousset, franck.gasnier@st.com

Abstract. This paper presents a practical approach for improving R2R/device analysis performance by extracting correlated sets within measurement data. The method handles n-ary relations with a very large number of numerical attributes and uses a levelwise data mining algorithm based on both the chi-squared and the support measurements. Significant results are obtained.

– 1 – MOTIVATION

Context. The field of APC aims at highlighting correlations between production parameters, so that possible drifts of the associated processes can be corrected. Within this framework, and in collaboration with STMicroElectronics Rousset, this work focused on the device analysis of the principal trajectories having an impact on the yield.

Objectives. The analysis was based on Excel files of measurements associated with product lots. These files contain a large number of attributes (the nature of the measurements) relative to the number of data rows (about 1800 attributes for 300 rows on average), and the data are all numerical and often incomplete (see Fig. 1a). Moreover, we looked for correlations involving at least one target attribute.

Fig. 1a: An example of data file
Fig. 1b: Equivalent n-ary relation after preprocessing

– 2 – DESCRIPTION OF THE APPROACH

General. Detecting links between values in a database often amounts to extracting decision rules from binary relations, which yields directional links (A, B → C for example) based on the support-confidence framework.
This step consists in extracting the frequent patterns (such as the 3-pattern or 3-ItemSet A, B, C) among the items using levelwise algorithms, and then deducing the decision or association rules [AGR 96], [HIP 00], [CEG 06].

Selected step. An alternative to that procedure is to extract the set of correlated patterns according to the statistical chi-squared measurement [BRI 97]. This measurement quantifies the deviation from independence, takes the absence of values into account, and is non-directional. From a theoretical point of view, we have [STE 06]
- shown that, in some cases, correlation according to this measurement is a monotonic constraint, so that the problem reduces to searching for the border containing the minimal correlated patterns;
- proposed a generalization of the method for n-ary relations, using a binary function which preserves the context and the initial semantics of the analyzed files.

Practical aspects. As is usually the case, we first developed preprocessing stages [PYL 99] in order to transform the raw data into a format usable by the levelwise algorithm presented in Fig. 2. We successively
- deleted the attributes of low significance (duplicates, attributes having few distinct values or too many null values, etc.), which eliminated on average over 2/3 of the attributes of the initial files;
- discretized the values of the remaining attributes: for each attribute, we normalized its values, split the normalized values into intervals, and finally assigned an interval code to each value. Two identical values of two distinct attributes are coded differently, and thus cannot interfere in the analysis.

The result of these preprocessing stages on the example file in Fig. 1a is shown in Fig. 1b. The resulting values are the interval codes associated with the initial raw values (0 representing the null value), and correspond to the input items used by the selected algorithm. Note that the axes of Fig.
1b are reversed compared to Fig. 1a.

IN : ItemSet I, Fraction p%, Threshold mc (chi2), Threshold s (support), Target Attribute ta, Relation r
OUT : Set of minimal correlated patterns

C2 := APrioriGen(I);   // generate the candidate 2-Patterns having support ≥ s
i := 2
while Ci ≠ ∅ do
    Li := ∅
    for each X ∈ Ci do
        Build the contingency table of X
        if p% of the table's cells have a support ≥ s then
            if chi2(X) ≥ mc then Li := Li ∪ {X}   // and mark up the Li's including a ta item
        endif
    end for
    Ci+1 := APrioriGen(Ci – Li)
    i := i + 1
end while
return ∪i Li   // limited to the patterns including one item of ta

Fig. 2: The Correlation – Chi2 – Support algorithm developed on n-ary relations

– 3 – EVALUATION OF RESULTS

General. The results of 4 runs of the program on a standard PC, in different configurations for the same example measurement file (438 attributes, 298 rows, 657 items having s ≥ 0.2), are shown in Fig. 3. The field Pot is the potential number of k-Patterns likely to be chosen by the algorithm at level k. Next is the number of k-Patterns found at level k and used at level k+1, and Res is the number of k-Patterns including one target attribute item and kept at level k.

Test no 1 (p% : 0.3, mc : 4.2, s : 0.2)
                        Pot          Next       Res
Nb of 2-Patterns        215 245      30 597     35
Nb of 3-Patterns        478 104      21 675     6
Nb of 4-Patterns        115          0          0

Test no 2 (p% : 0.3, mc : 6.2, s : 0.2)
                        Pot          Next       Res
Nb of 2-Patterns        215 245      19 984     9
Nb of 3-Patterns        206 870      11 122     0
Nb of 4-Patterns        105          0          0

Test no 3 (p% : 0.3, mc : 1.6, s : 0.2)
                        Pot          Next       Res
Nb of 2-Patterns        215 245      59 589     107
Nb of 3-Patterns        1 923 124    57 297     124
Nb of 4-Patterns        174          0          0

Test no 4 (p% : 0.25, mc : 3.6, s : 0.2)
                        Pot          Next       Res
Nb of 2-Patterns        215 245      67 911     78
Nb of 3-Patterns        2 328 884    588 777    466
Nb of 4-Patterns        1 427 457    18         1

Fig.
3: Detected correlations with four different configurations using the same n-ary relation

The lower the thresholds of a triplet (p%, mc, s), the larger the result, with inclusion relations between the sets of patterns that are kept. The aim of these variations is to admit candidates that are rejected under more constraining thresholds, and then to check whether modifying the choices made at the outset reveals correlations that were not detected initially. Overall, this was verified.

Conclusions. The methodology and the analysis made it possible for STMicroElectronics Rousset to highlight new correlations between parameters of its measurement files: approximately one fourth of the correlations determined by the first test were not known to STMicroElectronics, and most of the results have been validated by electrical and/or physical analysis. While the proposed method confirmed that levelwise algorithms do not provide results beyond four search levels, it proved its value for n-ary relations with a very large number of numerical attributes. The study aims at supporting the development of effective R2R control loops.

– 4 – FUTURE WORK / REFERENCES

We are now working along the following lines:
- optimization of the procedure (aggregation of attributes, merge/split of intervals) while preserving the context;
- development of other methods (detection of frequent patterns using breadth-first / depth-first search algorithms, clustering, etc.), with a comparative study of the various methods and of their results.

In the medium term, we intend to initiate a background-based procedure integrating a whole set of methods, measurements and results, so that each new analysis can generate the most suitable result.
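To make the inner test of the Fig. 2 algorithm concrete, the following Python sketch builds the contingency table of a candidate pattern, applies the support test on its cells, and computes the chi-squared value against the independence hypothesis. It is only an illustration under assumptions that are not taken from the study: each row of the preprocessed relation is represented as a set of interval codes, s is read as a fraction of the rows, and the function names (contingency_table, has_support, chi_squared) are hypothetical. The binary function of [STE 06] is not reproduced here.

```python
from itertools import product


def contingency_table(rows, pattern):
    """Count the rows falling in each presence/absence cell of `pattern`.

    rows    : list of sets of item codes (assumed representation of the relation)
    pattern : tuple of item codes (a k-pattern)
    Returns a dict mapping a tuple of k booleans to the observed row count.
    """
    counts = {cell: 0 for cell in product((True, False), repeat=len(pattern))}
    for row in rows:
        counts[tuple(item in row for item in pattern)] += 1
    return counts


def has_support(rows, pattern, p, s):
    """True if at least a fraction p of the cells hold at least a fraction s of the rows."""
    n = len(rows)
    observed = contingency_table(rows, pattern)
    supported = sum(1 for count in observed.values() if count / n >= s)
    return supported >= p * len(observed)


def chi_squared(rows, pattern):
    """Chi-squared statistic of `pattern` against independence of its items."""
    n = len(rows)
    observed = contingency_table(rows, pattern)
    # Marginal frequency of presence of each item of the pattern.
    freq = [sum(item in row for row in rows) / n for item in pattern]
    chi2 = 0.0
    for cell, count in observed.items():
        # Expected count of the cell if the items were independent.
        expected = n
        for present, f in zip(cell, freq):
            expected *= f if present else 1.0 - f
        if expected > 0:  # skip cells that cannot occur under independence
            chi2 += (count - expected) ** 2 / expected
    return chi2
```

On eight rows in which the items A1 and B1 appear together three times, are both absent three times, and each appear alone once, chi_squared returns 2.0, so the pattern would be kept with mc = 1.6 and rejected with mc = 3.6. Under this reading of s, a 4-pattern (16 cells) tested with p% = 0.3 needs at least 5 cells that each hold at least 20% of the rows, which may help explain why Fig. 3 reports level-4 results only for p% = 0.25.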
[AGR 96] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo, Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, 1996, pp 307-328
[BRI 97] S. Brin, R. Motwani, and C. Silverstein, Beyond market baskets: generalizing association rules to correlations, Proc. of the International Conference on Management of Data, SIGMOD, 1997, pp 265-276
[CEG 06] A. Ceglar, and J.F. Roddick, Association mining, ACM Computing Surveys, Vol. 38 (2), 2006, 42 p.
[HIP 00] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, Algorithms for Association Rule Mining - A General Survey and Comparison, SIGKDD Explorations, Vol. 2, 2000, pp 58-64
[PYL 99] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Pub., 1999, 539 p.
[STE 06] J. Stephan, Extraction des règles de corrélation pour la décision dans une base de données relationnelle, Mémoire de stage MR2-IF, Université des Sciences de Luminy, France, 2006, 28 p.