Finding Local Patterns in Data
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies
interpret this as a ‘rule’: if X=x and Y=y, then Z is unusual
Complex data: nominal, numeric, relational?
[figure: a dataset with an unusual region highlighted as the subgroup]
Exploratory Data Analysis
Classification: model the dependency of the target on the remaining attributes.
Problem: sometimes the classifier is a black box, or uses only some of the available dependencies.
For example, in decision trees some attributes may never appear because they are overshadowed by others.
Exploratory Data Analysis: understanding the effects of all attributes on the target.
Interactions between Attributes
Single-attribute effects are not enough
The XOR problem is an extreme example: two attributes with no individual information gain can together form a good subgroup.
Apart from single conditions
A=a, B=b, C=c, …
consider also conjunctions
A=a & B=b, A=a & C=c, …, B=b & C=c, …
A=a & B=b & C=c, …
…
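To make the combinatorial blow-up concrete, here is a minimal sketch (the toy domain and helper name are illustrative, not from the slides) that enumerates every conjunction of attribute=value conditions up to a given depth:

```python
from itertools import combinations, product

# Hypothetical toy domain: each attribute mapped to its possible values.
domains = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}

def candidate_descriptions(domains, max_depth):
    """Yield every conjunction of attribute=value conditions up to
    max_depth, using each attribute at most once per conjunction."""
    for depth in range(1, max_depth + 1):
        for attrs in combinations(sorted(domains), depth):
            for values in product(*(domains[a] for a in attrs)):
                yield tuple(zip(attrs, values))

for description in candidate_descriptions(domains, 2):
    print(" & ".join(f"{a}={v}" for a, v in description))
```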
Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attributes”
Inductive constraints:
Minimum support
(Maximum support)
Minimum quality (information gain, χ², WRAcc)
Maximum complexity
…
Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
within subgroup, positive
within subgroup, negative
outside subgroup, positive
outside subgroup, negative

                 target
                 T      F
subgroup   T    .42    .12    .54
           F    .13    .33    .46
                .55    .45    1.0
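A minimal sketch of how such a table can be derived from Boolean membership vectors (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def confusion_matrix(in_subgroup, target):
    """Relative frequencies of the four subgroup/target combinations."""
    in_subgroup = np.asarray(in_subgroup, dtype=bool)
    target = np.asarray(target, dtype=bool)
    n = len(target)
    return {
        ("T", "T"): np.sum(in_subgroup & target) / n,    # within subgroup, positive
        ("T", "F"): np.sum(in_subgroup & ~target) / n,   # within subgroup, negative
        ("F", "T"): np.sum(~in_subgroup & target) / n,   # outside subgroup, positive
        ("F", "F"): np.sum(~in_subgroup & ~target) / n,  # outside subgroup, negative
    }
```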
Confusion Matrix
High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
Target distribution on DB is fixed
Only two degrees of freedom

                 target
                 T      F
subgroup   T    .42    .12    .54
           F    .13    .33    .46
                .55    .45    1.0
Quality Measures
A quality measure for subgroups summarizes the interestingness of a subgroup’s confusion matrix in a single number
WRAcc, weighted relative accuracy
Balance between coverage and unexpectedness
WRAcc(S,T) = p(ST) – p(S) p(T)
between −.25 and .25, 0 means uninteresting subgroup
                 target
                 T      F
subgroup   T    .42    .12    .54
           F    .13    .33    .46
                .55    .45    1.0

WRAcc(S,T) = p(ST) − p(S)·p(T) = .42 − .54 · .55 = .42 − .297 = .123
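A sketch of the same computation in code, using the numbers from the slide:

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t

# Values from the running example: p(ST) = .42, p(S) = .54, p(T) = .55
print(wracc(0.42, 0.54, 0.55))  # ~0.123, as on the slide
```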
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
Subgroup Discovery as Search
[figure: the refinement lattice. The root is the empty description ‘true’ (the entire database); the first level holds the single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …; deeper levels hold conjunctions such as A=a1 & B=b1, A=a1 & B=b2, A=a2 & B=b1, …, A=a1 & B=b1 & C=c1, …. Each node has its own confusion matrix; a branch is cut off once the minimum support level is reached.]
Refinements are (anti-)monotonic
Refinements are (anti-)monotonic in their support…
…but not in interestingness: that may go up or down.
[figure: subgroup S1 within the entire database, S2 a refinement of S1, and S3 a refinement of S2, each covering fewer records while the overlap with the target concept varies]
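A minimal level-wise search sketch exploiting this: support prunes whole branches, while quality must still be evaluated at every node (the helpers `refine`, `support`, and `quality` are hypothetical):

```python
def levelwise_search(root, min_support, quality, refine, support):
    """Subgroup search with anti-monotonic support pruning.
    refine(s) yields the direct refinements of description s;
    support(s) returns its coverage, which never grows under refinement."""
    results, frontier = [], [root]
    while frontier:
        next_level = []
        for s in frontier:
            if support(s) < min_support:
                continue  # every refinement of s is at most as large: prune
            results.append((quality(s), s))  # quality may go up OR down
            next_level.extend(refine(s))
        frontier = next_level
    return sorted(results, key=lambda pair: pair[0], reverse=True)
```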
ROC Space
ROC = Receiver Operating Characteristics
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of the positive cases inside the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of the negative cases inside the subgroup)
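As a small sketch (illustrative function name), mapping a confusion matrix to its ROC point:

```python
def roc_point(tp, fp, fn, tn):
    """Map a subgroup's confusion matrix (as counts) to ROC space."""
    tpr = tp / (tp + fn)  # fraction of all positives covered by the subgroup
    fpr = fp / (fp + tn)  # fraction of all negatives covered by the subgroup
    return fpr, tpr

# Counts from the running example (out of 100 records):
print(roc_point(tp=42, fp=12, fn=13, tn=33))  # (0.266..., 0.763...)
```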
ROC Space Properties
[figure: ROC space. The origin (0,0) is the empty subgroup and (1,1) the entire database; (0,1) is ‘ROC heaven’, the perfect subgroup, and (1,0) is ‘ROC hell’, the perfect negative subgroup; random subgroups lie on the diagonal; a minimum support threshold cuts off the lower-left corner.]
Measures in ROC Space
[figure: isometrics of WRAcc (parallel straight lines) and information gain (curved) over ROC space]
Other measures: Precision, Gini index, Foil gain, Correlation coefficient
Refinements in ROC Space
Refinements of S reduce both FPR and TPR, so they appear to the left of and below S.
[figure: the blue polygon represents the possible refinements of S]
With a convex measure, the quality of any refinement is bounded by the measure at the corners of this polygon.
If no corner exceeds the minimum quality or the current best (top-k), the search space below S can be pruned.
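For WRAcc this bound has a simple closed form: the best conceivable refinement keeps every positive in S and drops every negative, i.e. the corner with the same TP but FP = 0. A sketch of pruning on that optimistic estimate (illustrative names, assumed top-k setting):

```python
def wracc_from_counts(tp, fp, pos, neg):
    """WRAcc = p(ST) - p(S) * p(T), expressed in counts."""
    n = pos + neg
    return tp / n - ((tp + fp) / n) * (pos / n)

def optimistic_wracc(tp, pos, neg):
    """Upper bound over all refinements: keep all true positives,
    lose all false positives (the corner (FPR=0, TPR=tp/pos))."""
    return wracc_from_counts(tp, 0, pos, neg)

def can_prune(tp, pos, neg, best_so_far):
    """Prune below S if even the bound cannot beat the current best
    (e.g. the k-th best quality found so far)."""
    return optimistic_wracc(tp, pos, neg) <= best_so_far
```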
Multi-class problems
Generalising to problems with more than two classes is fairly straightforward:

                 target
            C1     C2     C3
subgroup T  .27    .06    .22    .55
         F  .03    .19    .23    .45
            .30    .25    .45    1.0

Combine the cell values into a quality measure, e.g. χ² or information gain.
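One way to combine the cells is the χ² statistic over the full contingency table; a sketch using SciPy, with the slide’s relative frequencies rescaled to counts out of 100 records (an assumption for illustration):

```python
from scipy.stats import chi2_contingency

table = [[27, 6, 22],   # subgroup = T: counts per class C1, C2, C3
         [3, 19, 23]]   # subgroup = F
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # large chi2 / small p = unusual target distribution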
Numeric Subgroup Discovery
Target is numeric: find subgroups with significantly higher or lower average value
Trade-off between the size of the subgroup and its average target value
[figure: subgroups with average target values h = 2200, h = 3600, and h = 3100]
Types of SD for Numeric Targets
Regression subgroup discovery
numeric target has order and scale
Ordinal subgroup discovery
numeric target has order
Ranked subgroup discovery
numeric target has order or scale
Vancouver 2010 Winter Olympics
The official IOC ranking of countries (medals > 0) provides two targets: the rank is an ordinal target, the medal count a regression target.
Partial ranking: countries sharing a rank receive fractional ranks (e.g. two countries tied after rank 12 both receive rank 13.5).
[table: one row per country, with columns Country, Medals, Rank, Athletes, Continent, Population, Language Family, Republic (y/n), and Polar (y/n); top rows include USA (37 medals), Germany (30), Canada (26), Norway (23), Austria (16), Russian Federation (15), Korea (14), …, down to Great Britain, Estonia, and Kazakhstan with 1 medal each]
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland

‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
Intuitions
Size: larger subgroups are more reliable
Rank: the majority of objects should appear at the top
Position: the ‘middle’ of the subgroup should differ from the middle of the ranking
Deviation: objects should have similar ranks
[figures: the positions within the ranking of the members of four example subgroups: language_family = Slavic; polar = yes; population ≤ 10M; language_family = Slavic & population ≤ 10M]
Quality Measures
Mean test
z-score
t-statistic
Median χ² statistic
AUC of ROC
Wilcoxon-Mann-Whitney ranks statistic
Median MAD metric
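A sketch of the z-score variant: compare the subgroup’s mean target value to the database mean, scaled by the database’s standard deviation and the subgroup size (an illustrative formulation, not necessarily Cortana’s exact implementation):

```python
import numpy as np

def z_score(subgroup_values, all_values):
    """z-score of the subgroup mean against the database mean."""
    mu, sigma = np.mean(all_values), np.std(all_values)
    n = len(subgroup_values)
    return (np.mean(subgroup_values) - mu) * np.sqrt(n) / sigma
```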
Cortana: the open source Subgroup Discovery tool
Cortana Features
Generic Subgroup Discovery algorithm
quality measure
search strategy
inductive constraints
Flat file, .txt, .arff, (DB connection to come)
Support for complex targets
41 quality measures
ROC plots
Statistical validation
Target Concepts
‘ Classical ’ Subgroup Discovery
nominal targets (classification)
numeric targets (regression)
Exceptional Model Mining
(to be discussed in a few slides)
multiple targets
regression, correlation
multi-label classification
Mixed Data
Data types
binary
nominal
numeric
Numeric data is treated dynamically (no discretisation as preprocessing)
all: consider all available thresholds
bins: discretise the current candidate subgroup
best: find best threshold, and search from there
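A sketch of the ‘all’ strategy for a single numeric attribute (helper names are illustrative): every value occurring in the candidate subgroup is tried as a threshold and scored:

```python
import numpy as np

def best_threshold(values, quality_of_mask):
    """Try every occurring threshold t for the condition `attribute <= t`
    and return the best quality and threshold."""
    best_q, best_t = -np.inf, None
    for t in np.unique(values):
        q = quality_of_mask(values <= t)
        if q > best_q:
            best_q, best_t = q, t
    return best_q, best_t
```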
Statistical Validation
Determine distribution of random results
random subsets
random conditions
swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures
how exceptional?
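A sketch of the ‘random subsets’ variant: score many random subsets of the same size as the subgroup to estimate the distribution of qualities obtainable by chance, and take a quantile as the minimum quality (illustrative names and defaults):

```python
import numpy as np

def minimum_quality(quality_of_mask, n_records, subgroup_size,
                    n_samples=1000, significance=0.05, seed=0):
    """Quality threshold below which a result is indistinguishable
    from a random subset of the same size."""
    rng = np.random.default_rng(seed)
    qualities = []
    for _ in range(n_samples):
        mask = np.zeros(n_records, dtype=bool)
        mask[rng.choice(n_records, subgroup_size, replace=False)] = True
        qualities.append(quality_of_mask(mask))
    return np.quantile(qualities, 1.0 - significance)
```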
Open Source
You can:
use the Cortana binary: datamining.liacs.nl/cortana.html
use and modify the Cortana sources (Java)
Subgroup Discovery with multiple target attributes
Mixture of Distributions
[figures: scatter plots (axes 0–100) of two numeric attributes; the data is a mixture of distributions, and successive plots highlight the components, one of which behaves exceptionally]
For each datapoint it is unclear whether it belongs to G or to its complement G^c
Intensional description of exceptional subgroup G?
Model class unknown
Model parameters unknown
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Model over subgroup becomes target of SD
Exceptional Model Mining
Exceptional Model Mining: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.
Exceptional Model Mining
[figure: each record is split into an object description X and a target concept y; a model is fitted over the targets]
Define a target concept (X and y)
Choose a model class C
Define a quality measure φ over C
Use Subgroup Discovery to find exceptional subgroups G and associated models M
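Once those three ingredients are fixed, the EMM loop itself is short; a schematic sketch with hypothetical helpers `candidates`, `fit`, and `phi`:

```python
def exceptional_model_mining(candidates, fit, phi, top_k=10):
    """For each candidate subgroup G (described over X), fit a model
    from class C on its targets y and score it with quality measure phi."""
    scored = []
    for G in candidates:      # e.g. generated by a subgroup discovery search
        model = fit(G)        # model over the target concept of G
        scored.append((phi(model, G), G))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```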
Quality Measure
Specify what defines an exceptional subgroup G based on properties of model M
Absolute measure (absolute quality of M)
Correlation coefficient
Predictive accuracy
Difference measure (difference between M, the model on G, and the model on its complement)
Difference in slope
qualitative properties of classifier
Reliable results
Minimum support level
Statistical significance of G
Correlation Model
Correlation coefficient:
φ_ρ = ρ(G)
Absolute difference in correlation:
φ_abs = |ρ(G) − ρ(G^c)|
Entropy weighted absolute difference:
φ_ent = H(p) · |ρ(G) − ρ(G^c)|
Statistical significance of correlation difference, φ_scd:
compute a z-score from ρ through the Fisher transform z' = ½ ln((1+ρ)/(1−ρ)):
z* = (z'(G) − z'(G^c)) / √(1/(n−3) + 1/(n^c−3))
where n and n^c are the sizes of G and G^c; then compute a p-value from the z-score
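A sketch of φ_scd (assuming SciPy; x and y are the two target attributes, split over the subgroup and its complement):

```python
import numpy as np
from scipy.stats import norm, pearsonr

def phi_scd(x_g, y_g, x_c, y_c):
    """Significance of the correlation difference between a subgroup
    (x_g, y_g) and its complement (x_c, y_c), via the Fisher transform."""
    def fisher_z(x, y):
        r, _ = pearsonr(x, y)
        return np.arctanh(r)  # z' = 0.5 * ln((1 + r) / (1 - r))
    n_g, n_c = len(x_g), len(x_c)
    z_star = (fisher_z(x_g, y_g) - fisher_z(x_c, y_c)) \
             / np.sqrt(1.0 / (n_g - 3) + 1.0 / (n_c - 3))
    return 2 * norm.sf(abs(z_star))  # two-sided p-value
```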
Regression Model
Compare the slope b of y = a + b·x + e fitted on the subgroup with the slope fitted on its complement
Compute the significance of the slope difference: φ_ssd
[example: subgroup drive = 1 & basement = 0 & #baths ≤ 1, with regression lines y = 41 568 + 3.31·x and y = 30 723 + 8.45·x]
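A sketch of φ_ssd using a normal approximation on the two estimated slopes and their standard errors (a common simplification; the exact test statistic may differ):

```python
import numpy as np
from scipy.stats import linregress, norm

def phi_ssd(x_g, y_g, x_c, y_c):
    """Significance of the slope difference between the regression line
    fitted on the subgroup and the one fitted on its complement."""
    fit_g = linregress(x_g, y_g)
    fit_c = linregress(x_c, y_c)
    z = (fit_g.slope - fit_c.slope) / np.hypot(fit_g.stderr, fit_c.stderr)
    return 2 * norm.sf(abs(z))  # two-sided p-value
```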
Gene Expression Data
[example: subgroup 11_band = ‘no deletion’ & survival time ≤ 1919 & XP_498569.1 ≤ 57, with regression lines y = 3313 − 1.77·x and y = 360 + 0.40·x]
Classification Model
Decision Table Majority classifier
BDeu measure (predictiveness)
Hellinger distance (unusual distribution)
[example: a subgroup defined by a threshold of 160.45 on RIF1, with predicted prognosis = ‘unknown’, compared against the whole database]
General Framework
[figure: each record is split into an object description and a target concept (X, y); sub-populations are identified on the object description, and a model is fitted over the target concept. The instantiations marked in the slides are Subgroup Discovery over propositional and multi-relational descriptions, with regression, classification, and graphical modeling as target models.]
Identification of sub-populations: Subgroup Discovery, Decision Trees, SVM, …
Object descriptions: propositional, multi-relational, …
Modeling over the target concept: Regression, Classification, Clustering, Association, Graphical modeling, …