EXTRACTING CORRELATED SETS USING THE CHI-SQUARED MEASUREMENT
WITHIN N-ARY RELATIONS: AN IMPLEMENTATION
Alain Casali (1), Christian Ernst (2), Franck Gasnier (3), Jamil Stephan (2)
(1) Université de la Méditerranée / LIF, alain.casali@lif.univ-mrs.fr
(2) Ecole des Mines de St Etienne / CMP-GC, ernst@emse.fr, stephan@emse.fr
(3) STMicroElectronics Rousset, franck.gasnier@st.com
Abstract. This paper presents a practical approach for improving R2R (run-to-run) / device analysis performance
by extracting correlated sets from measurement data. The method considers n-ary relations with a
very large number of numerical attributes and uses a levelwise data mining algorithm based on both the
chi-squared and the support measurements. On industrial measurement files, it revealed correlations that were not previously known.
– 1 – MOTIVATION
Context. The field of APC (Advanced Process Control) aims at highlighting correlations between production parameters, so that
possible drifts of the associated processes can be corrected. Within this framework, in collaboration with STMicroElectronics Rousset, this work
focused on the device analysis of the principal trajectories having an impact on the yield.
Objectives. The analysis was based on Excel files of measurements associated with product lots. These files contain
many more attributes (kinds of measurement) than rows of data (about 1800 attributes for 300 rows
on average), and the data are all numerical and often incomplete (see Fig. 1a). Moreover, we looked for correlations
involving at least one target attribute.
Fig. 1a: An example of data file
Fig. 1b: Equivalent n-ary relation after preprocessing
– 2 – DESCRIPTION OF THE APPROACH
General. Detecting links between values in a database often amounts to extracting decision rules from binary relations, which
yields directional links (A, B → C for example) within the support-confidence framework. This step consists in
extracting the frequent patterns (such as the 3-pattern or 3-ItemSet A, B, C) among the items using levelwise algorithms, and then
deducing the decision or association rules [AGR 96], [HIP 00], [CEG 06].
Selected step. An alternative to that procedure is to extract the set of correlated patterns according to the statistical chi-squared
measurement [BRI 97]. This measures the deviation from independence, takes into account the absence of values, and is
non-directional. From a theoretical point of view, we have [STE 06]
- shown that, in some cases, correlation according to this measurement is a monotonic constraint, so that the problem
reduces to searching for the border containing the minimal correlated patterns;
- proposed a generalization of the method to n-ary relations, using a binary function which preserves the context and the
initial semantics of the analyzed files.
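For reference, the chi-squared statistic behind this approach can be written as below for a pattern X = {i_1, ..., i_k} over a relation of n rows, in the general form used by [BRI 97]. The symbols CT(X), O(c), E(c) and P are notation introduced here for illustration and are not taken from the paper.

```latex
\chi^2(X) \;=\; \sum_{c \,\in\, \mathrm{CT}(X)} \frac{\bigl(O(c) - E(c)\bigr)^2}{E(c)},
\qquad
E(c) \;=\; n \prod_{j=1}^{k} P(c_j),
\qquad
c_j \in \{\, i_j,\ \lnot i_j \,\},
\quad
P(\lnot i_j) = 1 - P(i_j)
```

Here CT(X) is the contingency table of X (one cell per presence/absence combination of its k items, hence 2^k cells), O(c) is the number of rows observed in cell c, and P(i_j) is the fraction of rows containing item i_j. As a small worked case, take a 2-pattern over n = 8 rows whose two items always occur together or not at all, each in half of the rows: the observed counts are (4, 0, 0, 4), every expected count is 8 x 0.5 x 0.5 = 2, and the statistic is 4 x (2^2 / 2) = 8, above the usual 3.84 cutoff (95 % level, one degree of freedom).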
Practical aspects. As is usually the case, we first developed preprocessing stages [PYL 99] in order to transform the raw
data into a format exploitable by the levelwise algorithm presented in Fig. 2. We successively
- deleted the poorly significant attributes (duplicates, attributes having few distinct values or too many null values,
etc.), which eliminated on average over 2/3 of the attributes of the initial files;
- discretized the values of the remaining attributes: for each attribute, we normalized its values, cut the normalized values
into intervals, and finally assigned an interval code to each value. Two identical values of two distinct attributes are
coded differently, and thus cannot interfere in the analysis.
The result of these preprocessing stages on the example file of Fig. 1a is shown in Fig. 1b. The resulting values are the interval codes
associated with the initial raw values (0 representing the null value), and are the input items used by the selected
algorithm. Note that the axes of Fig. 1b are transposed compared to Fig. 1a.
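The two preprocessing stages described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify the normalization, the interval-cutting rule or the pruning thresholds, so z-score normalization, equal-width intervals, and the parameters `max_null_ratio`, `min_distinct` and `n_intervals` are hypothetical choices, as are the function names.

```python
from statistics import mean, pstdev

def prune_attributes(table, max_null_ratio=0.5, min_distinct=3):
    """Drop duplicate columns, columns with too many nulls (None),
    and columns with too few distinct values. Thresholds are assumptions."""
    kept, seen = {}, set()
    for name, values in table.items():
        signature = tuple(values)
        if signature in seen:                      # duplicate of an earlier column
            continue
        seen.add(signature)
        present = [v for v in values if v is not None]
        if len(values) - len(present) > max_null_ratio * len(values):
            continue                               # too many null values
        if len(set(present)) < min_distinct:
            continue                               # too few distinct values
        kept[name] = values
    return kept

def discretize(table, n_intervals=3):
    """Normalize each column (z-score, an assumption), cut it into
    equal-width intervals, and give every interval a code that is unique
    across attributes. Code 0 is reserved for the null value."""
    coded, next_code = {}, 1
    for name, values in table.items():
        present = [v for v in values if v is not None]
        mu, sigma = mean(present), pstdev(present) or 1.0
        z = [None if v is None else (v - mu) / sigma for v in values]
        lo = min(x for x in z if x is not None)
        hi = max(x for x in z if x is not None)
        width = (hi - lo) / n_intervals or 1.0
        column = []
        for x in z:
            if x is None:
                column.append(0)
            else:
                k = min(int((x - lo) / width), n_intervals - 1)
                column.append(next_code + k)
        coded[name] = column
        next_code += n_intervals                   # codes never overlap between attributes
    return coded

raw = {
    "A": [1.0, 2.0, 3.0, None],
    "B": [1.0, 2.0, 3.0, None],      # duplicate of A
    "C": [5.0, 5.0, 5.0, 5.0],       # a single distinct value
    "D": [10.0, 11.0, 25.0, 40.0],
}
coded = discretize(prune_attributes(raw))
print(coded)
```

Because the code ranges of two attributes never overlap, equal raw values in different columns end up as different items, which is the property the text relies on.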
IN : ItemSet I, Fraction p%, Threshold mc (chi2), Threshold s (support), Target Attribute ta, Relation r
OUT : Set of minimal correlated patterns
C2 := APrioriGen(I)    // generate the 2-Patterns candidates having the support
i := 2
while Ci ≠ ∅ do
    Li := ∅
    for each X ∈ Ci do
        Build the contingency table of X
        if p% of the table's cells have a support ≥ s then
            if chi2(X) ≥ mc then Li := Li ∪ {X}    // and mark up the Li's including a ta item
        endif
    end for
    Ci+1 := APrioriGen(Ci – Li)
    i := i + 1
end while
return ∪i Li    // limited to the patterns including one item of ta
Fig. 2: The Correlation – Chi2 – Support algorithm developed on n-ary relations
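A minimal Python sketch of the loop of Fig. 2 is given below, for illustration only and not as the authors' implementation. It assumes that items are (attribute, interval code) pairs, that the relation is a list of rows mapping attribute names to codes with 0 for null, that s is a fraction of the number of rows, and that p is a fraction between 0 and 1. The helper names `contingency`, `chi2`, `apriori_gen` and `correlated_patterns` are hypothetical.

```python
from itertools import combinations, product

def contingency(pattern, rows):
    """Count the rows falling in each presence/absence combination of the items."""
    cells = {c: 0 for c in product((True, False), repeat=len(pattern))}
    for row in rows:
        cells[tuple(row.get(a) == v for a, v in pattern)] += 1
    return cells

def chi2(cells, n, k):
    """Chi-squared statistic of a k-item contingency table against independence."""
    p = [sum(cnt for c, cnt in cells.items() if c[i]) / n for i in range(k)]
    total = 0.0
    for c, observed in cells.items():
        expected = n
        for i, present in enumerate(c):
            expected *= p[i] if present else 1 - p[i]
        if expected > 0:
            total += (observed - expected) ** 2 / expected
    return total

def apriori_gen(patterns, k):
    """Build k-patterns (one item per attribute) whose (k-1)-subsets are all in `patterns`."""
    pool = set(patterns)
    items = {item for pat in pool for item in pat}
    out = set()
    for pat in pool:
        used = {a for a, _ in pat}
        for item in items:
            if item[0] in used:
                continue
            cand = frozenset(pat | {item})
            if all(frozenset(sub) in pool for sub in combinations(cand, k - 1)):
                out.add(cand)
    return out

def correlated_patterns(rows, p, mc, s, ta):
    n = len(rows)
    counts = {}
    for row in rows:
        for a, v in row.items():
            if v != 0:                                   # 0 encodes the null value
                counts[(a, v)] = counts.get((a, v), 0) + 1
    frequent = {frozenset([it]) for it, c in counts.items() if c >= s * n}
    k, cands, result = 2, apriori_gen(frequent, 2), []   # C2 := APrioriGen(I)
    while cands:
        level = set()                                    # Li
        for x in cands:
            pat = tuple(sorted(x))
            cells = contingency(pat, rows)
            supported = sum(c >= s * n for c in cells.values())
            if supported >= p * len(cells) and chi2(cells, n, len(pat)) >= mc:
                level.add(x)
        # keep only the patterns that include an item of the target attribute
        result += [x for x in level if any(a == ta for a, _ in x)]
        k += 1
        cands = apriori_gen(cands - level, k)            # Ci+1 := APrioriGen(Ci - Li)
    return result

# Toy relation: A and T move together, B is independent of both.
rows = [{"A": 1 if i < 4 else 2, "T": 5 if i < 4 else 6, "B": 3 + i % 2} for i in range(8)]
found = correlated_patterns(rows, p=0.5, mc=3.84, s=0.25, ta="T")
print(sorted(sorted(x) for x in found))
```

On this toy relation only the four A/T pairs reach the chi-squared threshold. No 3-pattern is generated afterwards, since every candidate would contain an already correlated A/T pair, which is how the levelwise search keeps only minimal correlated patterns.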
– 3 – EVALUATION OF RESULTS
General. Fig. 3 shows the results of 4 runs of the program on a standard PC, in different configurations, for the same measurement
example file (438 attributes, 298 rows, 657 items having s ≥ 0.2). The field Pot is the potential number of k-Patterns likely to
be chosen by the algorithm at level k, Next is the number of k-Patterns found at level k and used at level k+1, and Res is the number of
k-Patterns including one target attribute item and kept at level k.
Test no   p%     mc    s     Level                    Pot        Next    Res
1         0.3    4.2   0.2   Nb of 2-Patterns     215 245      30 597     35
                             Nb of 3-Patterns     478 104      21 675      6
                             Nb of 4-Patterns         115           0      0
2         0.3    6.2   0.2   Nb of 2-Patterns     215 245      19 984      9
                             Nb of 3-Patterns     206 870      11 122      0
                             Nb of 4-Patterns         105           0      0
3         0.3    1.6   0.2   Nb of 2-Patterns     215 245      59 589    107
                             Nb of 3-Patterns   1 923 124      57 297    124
                             Nb of 4-Patterns         174           0      0
4         0.25   3.6   0.2   Nb of 2-Patterns     215 245      67 911     78
                             Nb of 3-Patterns   2 328 884     588 777    466
                             Nb of 4-Patterns   1 427 457          18      1
Fig. 3: Detected correlations with four different configurations using the same n-ary relation
The lower the thresholds of a triplet (p%, mc, s), the larger the number of results, and the patterns kept under stricter thresholds
are included in those kept under looser ones. The aim of the variations is to admit candidates that are rejected under constraining
thresholds, and then to check whether, by modifying the choices made at the outset, correlations that were not initially detected can
be found; overall, this was verified.
Conclusions. The methodology and the analysis made it possible for STMicroElectronics Rousset to highlight new correlations
between parameters of its measurement files: approximately one fourth of the correlations determined by the first test were not
known to STMicroElectronics, and most of the results have been validated by electrical and/or physical analysis. While the proposed
method confirmed that levelwise algorithms do not provide results beyond four search levels, it proved its value for n-ary relations
with a very large number of numerical attributes. The study aims at supporting the development of effective R2R control loops.
– 4 – FUTURE WORK / REFERENCES
We are now working along the following lines:
- optimization of the procedure (aggregation of attributes, merge/split of intervals) while preserving the context;
- development of other methods (detection of frequent patterns using breadth-first / depth-first search algorithms, clustering, etc.),
with a comparative study of the various methods and of their results.
In the medium term, we intend to set up a background-based procedure integrating a whole set of methods, measurements and
results, so that, when a new analysis is made, the most suitable result can be generated.
[AGR 96] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo, Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, 1996, pp. 307-328.
[BRI 97] S. Brin, R. Motwani, and C. Silverstein, Beyond market baskets: generalizing association rules to correlations, Proc. of the International Conference on Management of Data, SIGMOD, 1997, pp. 265-276.
[CEG 06] A. Ceglar and J.F. Roddick, Association mining, ACM Computing Surveys, Vol. 38 (2), 2006, 42 p.
[HIP 00] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, Algorithms for Association Rule Mining - A General Survey and Comparison, SIGKDD Explorations, Vol. 2, 2000, pp. 58-64.
[PYL 99] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Pub., 1999, 539 p.
[STE 06] J. Stephan, Extraction des règles de corrélation pour la décision dans une base de données relationnelle, Mémoire de stage MR2-IF, Université des Sciences de Luminy, France, 2006, 28 p.