Subgroup - LIACS Data Mining Group


Subgroup Discovery

Finding Local Patterns in Data

Exploratory Data Analysis

 Scan the data without much prior focus

 Find unusual parts of the data

 Analyse attribute dependencies

 Interpret this as a ‘rule’: if X=x and Y=y, then Z is unusual

 Complex data: nominal, numeric, relational?

the Subgroup

Exploratory Data Analysis

 Classification: model the dependency of the target on the remaining attributes.

 Problem: a classifier is sometimes a black box, or uses only some of the available dependencies.

 for example: in decision trees, some attributes may not appear because of overshadowing.

 Exploratory Data Analysis: understanding the effects of all attributes on the target.

Interactions between Attributes

 Single-attribute effects are not enough

 XOR problem is extreme example: 2 attributes with no info gain form a good subgroup

 Apart from

A=a, B=b, C=c, …

 consider also

A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …

A=a ∧ B=b ∧ C=c, …
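The XOR case can be checked numerically. The sketch below builds a hypothetical toy dataset (not from the slides) in which the target is true exactly when A and B differ, and compares the information gain of single-attribute conditions with that of a conjunction. The function names are illustrative only.

```python
from math import log2

# Hypothetical toy XOR data: the target is true exactly when A and B differ.
rows = [(a, b, a != b) for a in (0, 1) for b in (0, 1)] * 25  # 100 rows

def entropy(labels):
    """Shannon entropy (in bits) of a list of booleans."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(rows, condition):
    """Information gain of splitting the rows by a boolean condition on (A, B)."""
    inside = [t for a, b, t in rows if condition(a, b)]
    outside = [t for a, b, t in rows if not condition(a, b)]
    n = len(rows)
    split = len(inside) / n * entropy(inside) + len(outside) / n * entropy(outside)
    return entropy([t for _, _, t in rows]) - split

gain_a = info_gain(rows, lambda a, b: a == 1)               # single attribute
gain_b = info_gain(rows, lambda a, b: b == 1)               # single attribute
gain_ab = info_gain(rows, lambda a, b: a == 1 and b == 0)   # conjunction
print(gain_a, gain_b, gain_ab)
```

Each single condition leaves the target distribution unchanged, while the conjunction isolates a pure group, which is why the search has to consider combinations of conditions.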

Subgroup Discovery Task

“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute(s)”

 Inductive constraints:

 Minimum support

 (Maximum support)

 Minimum quality (Information gain, χ², WRAcc)

 Maximum complexity

 …

Subgroup Discovery: the Binary Target Case

Confusion Matrix

 A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:

 within subgroup, positive

 within subgroup, negative

 outside subgroup, positive

 outside subgroup, negative

               subgroup
               T      F
target   T    .42    .13    .55
         F    .12    .33    .45
              .54    .46    1.0

Confusion Matrix

 High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target

 High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target

 Target distribution on DB is fixed

 Only two degrees of freedom

               subgroup
               T      F
target   T    .42    .13    .55
         F    .12    .33    .45
              .54    .46    1.0

Quality Measures

A quality measure for subgroups summarizes the interestingness of a subgroup's confusion matrix into a single number

WRAcc, weighted relative accuracy

 Balance between coverage and unexpectedness

 WRAcc(S,T) = p(ST) − p(S)·p(T)

 between −.25 and .25, 0 means uninteresting subgroup

               subgroup
               T      F
target   T    .42    .13    .55
         F    .12    .33    .45
              .54    .46    1.0

WRAcc(S,T) = p(ST) − p(S)·p(T) = .42 − .297 = .123
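The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function name `wracc` and its argument names are illustrative, and the inputs are the fractions from the confusion matrix on this slide.

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t

# Values from the slide: p(ST) = .42, p(S) = .54, p(T) = .55
quality = wracc(0.42, 0.54, 0.55)
print(round(quality, 3))  # 0.123

# The extremes of the measure: a subgroup that coincides with a balanced
# target, and one that coincides with its complement.
upper = wracc(0.5, 0.5, 0.5)
lower = wracc(0.0, 0.5, 0.5)
print(upper, lower)  # 0.25 -0.25
```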

Quality Measures

 WRAcc: Weighted Relative Accuracy

 Information gain

 χ²

 Correlation Coefficient

 Laplace

 Jaccard

 Specificity

 …

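Subgroup Discovery as Search

[Figure: search lattice rooted at ‘true’. First level: A=a1, A=a2, B=b1, B=b2, C=c1. Second level: A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1. Third level: A=a1 ∧ B=b1 ∧ C=c1. A confusion matrix (.42, .13, .12, .33; marginals .55, .54; total 1.0) is shown for a candidate, and a branch is marked ‘minimum support level reached’.]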

minimum support level reached

Refinements are (anti-)monotonic

Refinements are (anti-)monotonic in their support…

…but not in interestingness.

This may go up or down.

[Figure: nested regions inside the entire database: subgroup S1, S2 (a refinement of S1) and S3 (a refinement of S2), drawn against the target concept.]
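The search and its support-based pruning can be sketched as a level-wise refinement loop. The code below is a minimal illustration under stated assumptions, not the algorithm of any particular tool: the toy data, the function names and the choice of WRAcc as quality measure are all hypothetical. Because support can only shrink under refinement, a candidate below `min_support` is dropped together with everything beneath it; quality, in contrast, is simply recorded for every candidate because it may rise or fall.

```python
def covers(row, conditions):
    """True if the row satisfies every (attribute, value) condition."""
    return all(row[attr] == value for attr, value in conditions)

def wracc(rows, conditions, target):
    """Return (WRAcc, support count) of the subgroup described by conditions."""
    n = len(rows)
    covered = [r for r in rows if covers(r, conditions)]
    p_s = len(covered) / n
    p_t = sum(r[target] for r in rows) / n
    p_st = sum(r[target] for r in covered) / n
    return p_st - p_s * p_t, len(covered)

def discover(rows, target, min_support, max_depth):
    """Level-wise search over conjunctions of attribute=value conditions."""
    attributes = sorted(a for a in rows[0] if a != target)
    values = {a: sorted({r[a] for r in rows}) for a in attributes}
    results = []
    frontier = [()]  # the empty conjunction 'true' covers the entire database
    for _ in range(max_depth):
        next_frontier = []
        for conditions in frontier:
            last = conditions[-1][0] if conditions else ""
            for attr in attributes:
                if attr <= last:
                    continue  # fixed attribute order avoids duplicate conjunctions
                for value in values[attr]:
                    refined = conditions + ((attr, value),)
                    quality, size = wracc(rows, refined, target)
                    if size < min_support:
                        continue  # support only shrinks further down: prune
                    results.append((quality, size, refined))
                    next_frontier.append(refined)
        frontier = next_frontier
    return sorted(results, key=lambda r: r[0], reverse=True)

# Hypothetical XOR-like toy data: no single condition is interesting.
rows = [{"A": a, "B": b, "target": (a == "a1") != (b == "b1")}
        for a in ("a1", "a2") for b in ("b1", "b2")] * 10
ranked = discover(rows, "target", min_support=5, max_depth=2)
best_quality, best_size, best_conditions = ranked[0]
print(best_quality, best_size, best_conditions)
```

On this toy data every first-level subgroup has quality 0, while two of the second-level refinements reach 0.125: support went down and interestingness went up.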

Subgroup Discovery and ROC space

ROC Space

ROC = Receiver Operating Characteristics

Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.

TPR = TP/Pos = TP/(TP+FN) (fraction of the positive cases that fall in the subgroup)

FPR = FP/Neg = FP/(FP+TN) (fraction of the negative cases that fall in the subgroup)
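As a small worked example, the ROC coordinates of the subgroup from the earlier confusion matrix can be computed directly. The sketch assumes the matrix is read as: in subgroup and positive .42, in subgroup and negative .12, outside and positive .13, outside and negative .33. The function name `roc_point` is illustrative.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup, given the four confusion-matrix cells."""
    tpr = tp / (tp + fn)  # share of all positives covered by the subgroup
    fpr = fp / (fp + tn)  # share of all negatives covered by the subgroup
    return fpr, tpr

# Assumed reading of the confusion matrix shown earlier (fractions, not counts).
fpr, tpr = roc_point(tp=0.42, fp=0.12, fn=0.13, tn=0.33)
print(round(fpr, 3), round(tpr, 3))
```

Fractions work as well as counts here, since both rates are ratios within one class.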

ROC Space Properties

[Figure: ROC space annotated with the labels: entire database; ‘ROC heaven’, perfect subgroup; ‘ROC hell’, random subgroup; perfect negative subgroup; empty subgroup; minimum support threshold.]

Measures in ROC Space

[Figure: isometrics of WRAcc and Information Gain in ROC space, labelled 0, positive and negative.]

Other Measures

Precision

Gini index

Foil gain

Correlation coefficient

Refinements in ROC Space


Refinements of S will reduce the FPR and TPR, so will appear to the left and below S.

Blue polygon represents possible refinements of S.

With a convex quality measure f, the quality of any refinement is bounded by the value of f at the corners of the polygon.

If corners are not above minimum quality or current best (top k?), prune search space below S.
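The pruning rule can be sketched for WRAcc, which in ROC coordinates equals p(T)·(1 − p(T))·(TPR − FPR). The code below is an illustration under assumptions: it bounds the refinements of S by the full rectangle between the origin and S, ignoring the minimum support cut (the rectangle contains the polygon, so the bound stays valid but may be looser). All function names are hypothetical.

```python
def wracc_roc(fpr, tpr, p_pos):
    """WRAcc expressed in ROC coordinates, with p_pos = p(T)."""
    return p_pos * (1 - p_pos) * (tpr - fpr)

def optimistic_estimate(fpr, tpr, p_pos, measure=wracc_roc):
    """Largest value of a convex measure over the corners of the region
    to the left of and below the point (fpr, tpr)."""
    corners = [(0.0, 0.0), (fpr, 0.0), (0.0, tpr), (fpr, tpr)]
    return max(measure(f, t, p_pos) for f, t in corners)

def can_prune(fpr, tpr, p_pos, threshold):
    """True if no refinement can get above the threshold (minimum quality
    or the current best)."""
    return optimistic_estimate(fpr, tpr, p_pos) <= threshold

# The subgroup from the earlier confusion matrix, with p(T) = .55.
fpr, tpr, p_pos = 0.12 / 0.45, 0.42 / 0.55, 0.55
current = wracc_roc(fpr, tpr, p_pos)
bound = optimistic_estimate(fpr, tpr, p_pos)
print(round(current, 3), round(bound, 3))  # 0.123 0.189
```

Here the best corner is (0, TPR): a refinement that keeps all covered positives and drops all covered negatives.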

Multi-class problems

 Generalising to problems with more than 2 classes is fairly straightforward:

                  target
                  C1     C2     C3
subgroup    T    .27    .06    .22    .55
            F    .03    .19    .23    .45
                 .3     .25    .45    1.0

Combine the values into a quality measure: χ², Information gain
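Both measures can be computed from the joint distribution in the table. The sketch below is illustrative: the function names are hypothetical, and the χ² statistic needs a database size, for which n = 1000 is an arbitrary assumption since the slide only gives fractions.

```python
from math import log2

# Joint distribution from the slide: subgroup membership -> class fractions.
joint = {True: [0.27, 0.06, 0.22], False: [0.03, 0.19, 0.23]}

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

def class_marginal(joint):
    n_classes = len(joint[True])
    return [joint[True][c] + joint[False][c] for c in range(n_classes)]

def information_gain(joint):
    """Entropy of the target minus the expected entropy after the split."""
    gain = entropy(class_marginal(joint))
    for row in joint.values():
        p_row = sum(row)
        if p_row > 0:
            gain -= p_row * entropy([p / p_row for p in row])
    return gain

def chi_squared(joint, n):
    """Pearson chi-squared statistic for a database of n records."""
    marginal = class_marginal(joint)
    stat = 0.0
    for row in joint.values():
        p_row = sum(row)
        for c, observed in enumerate(row):
            expected = p_row * marginal[c]
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return n * stat

ig = information_gain(joint)
chi2 = chi_squared(joint, n=1000)  # n is an assumed database size
print(ig, chi2)

# A subgroup that is independent of the target scores zero on both.
independent = {True: [0.15, 0.125, 0.225], False: [0.15, 0.125, 0.225]}
print(information_gain(independent), chi_squared(independent, n=1000))
```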

Subgroup Discovery for Numeric targets

Numeric Subgroup Discovery

 Target is numeric: find subgroups with significantly higher or lower average value

 Trade-off between size of subgroup and average target value

[Figure labels: h = 2200, h = 3600, h = 3100]

Types of SD for Numeric Targets

 Regression subgroup discovery

 numeric target has order and scale

 Ordinal subgroup discovery

 numeric target has order

 Ranked subgroup discovery

 numeric target has order or scale

Vancouver 2010 Winter Olympics

Official IOC ranking of countries (med > 0)

Rank is the ordinal target, Medals the regression target. Partial ranking: objects share a rank (fractional ranks).

Rank   Country         Medals   Language Family
1      USA             37       Germanic
2      Germany         30       Germanic
3      Canada          26       Germanic
4      Norway          23       Germanic
5      Austria         16       Germanic
6      Russ. Fed.      15       Slavic
7      Korea           14       Altaic
9      China           11       Sino-Tibetan
9      Sweden          11       Germanic
9      France          11       Italic
11     Switzerland     9        Germanic
12     Netherlands     8        Germanic
13.5   Czech Rep.      6        Slavic
13.5   Poland          6        Slavic
16     Italy           5        Italic
16     Japan           5        Japonic
16     Finland         5        Finno-Ugric
20     Australia       3        Germanic
20     Belarus         3        Slavic
20     Slovakia        3        Slavic
20     Croatia         3        Slavic
20     Slovenia        3        Slavic
23     Latvia          2        Slavic
25     Great Britain   1        Germanic
25     Estonia         1        Finno-Ugric
25     Kazakhstan      1        Turkic

[Further columns in the source table: Athletes, Continent, Popul., Repub., Polar]
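The fractional ranks in the table follow from the medal counts: tied countries share the mean of the positions they occupy. A minimal sketch, with the medal counts taken from the table and a hypothetical helper name `fractional_ranks`:

```python
# Medal counts in ranking order, as listed in the table.
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]

def fractional_ranks(scores):
    """Rank scores in descending order; ties share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

ranks = fractional_ranks(medals)
print(ranks)
```

For example, the three countries with 11 medals occupy positions 8, 9 and 10 and therefore all receive rank 9.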

Interesting Subgroups

‘polar = yes’

1. United States

3. Canada

4. Norway

6. Russian Federation

9. Sweden

16. Finland

‘language_family = Germanic & athletes ≥ 60’

1. United States

2. Germany

3. Canada

4. Norway

5. Austria

9. Sweden

11. Switzerland

Intuitions

 Size: larger subgroups are more reliable

 Rank: majority of objects appear at the top

 Position: ‘middle’ of subgroup should differ from middle of ranking

 Deviation: objects should have similar rank

[Figure: positions in the ranking of the subgroup language_family = Slavic]

[Figure: positions in the ranking of the subgroup polar = yes]

[Figure: positions in the ranking of a subgroup defined by a population threshold of 10M]

[Figure: positions in the ranking of the subgroup language_family = Slavic combined with a population threshold of 10M]

Quality Measures

 Average

 Mean test

 z-Score

 t-Statistic

 Median χ² statistic

 AUC of ROC

 Wilcoxon-Mann-Whitney Ranks statistic

 Median MAD Metric
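Three of these measures can be sketched on the Olympic data. The definitions used here are common textbook forms and are an assumption: average is the subgroup mean, the mean test is √n·(μ_S − μ_0), and the z-score divides that by the standard deviation σ_0 of the whole dataset. The medal counts and the ‘polar = yes’ members come from the earlier slides; the function names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

# Medal counts of all 26 countries and of the subgroup 'polar = yes'
# (USA, Canada, Norway, Russian Federation, Sweden, Finland).
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
polar = [37, 26, 23, 15, 11, 5]

def mean_test(subgroup, population):
    """sqrt(n) * (subgroup mean - overall mean): balances size and deviation."""
    return sqrt(len(subgroup)) * (mean(subgroup) - mean(population))

def z_score(subgroup, population):
    """Mean test normalised by the overall (population) standard deviation."""
    return mean_test(subgroup, population) / pstdev(population)

print(mean(polar), mean(medals))
print(mean_test(polar, medals), z_score(polar, medals))
```

The √n factor expresses the trade-off between subgroup size and average target value: a larger subgroup with the same mean scores higher.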

Meet Cortana

the open source Subgroup Discovery tool

Cortana Features

 Generic Subgroup Discovery algorithm

 quality measure

 search strategy

 inductive constraints

 Flat files: .txt, .arff (DB connection to come)

 Support for complex targets

 41 quality measures

 ROC plots

 Statistical validation

Target Concepts

 ‘ Classical ’ Subgroup Discovery

 nominal targets (classification)

 numeric targets (regression)

 Exceptional Model Mining

(to be discussed in a few slides)

 multiple targets

 regression, correlation

 multi-label classification

Mixed Data

 Data types

 binary

 nominal

 numeric

 Numeric data is treated dynamically (no discretisation as preprocessing)

 all: consider all available thresholds

 bins: discretise the current candidate subgroup

 best: find best threshold, and search from there
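The ‘best’ strategy can be illustrated with a small sketch. This is not Cortana's implementation, which the slides do not show; it only demonstrates the idea of evaluating every distinct value of a numeric attribute as a threshold for a condition x ≤ t and keeping the one with the highest quality. WRAcc, the toy data and the function name are assumptions.

```python
def best_threshold(values, labels):
    """Return (quality, t) for the condition 'value <= t' with the highest WRAcc."""
    n = len(values)
    p_t = sum(labels) / n
    best = None
    for t in sorted(set(values)):          # 'all': every distinct value is a candidate
        covered = [l for v, l in zip(values, labels) if v <= t]
        quality = sum(covered) / n - len(covered) / n * p_t
        if best is None or quality > best[0]:
            best = (quality, t)            # 'best': keep only the top threshold
    return best

# Hypothetical numeric attribute and binary target.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [1, 1, 1, 0, 0, 1, 0, 0]
best = best_threshold(values, labels)
print(best)  # (0.1875, 3)
```

Because thresholds are derived from the records under consideration, the same routine can be applied to the records of any candidate subgroup, without discretising the attribute beforehand.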

Statistical Validation

 Determine distribution of random results

 random subsets

 random conditions

 swap-randomization

 Determine minimum quality

 Significance of individual results

 Validate quality measures

 how exceptional?
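The ‘random subsets’ idea can be sketched as follows: draw many random subsets of the same size as a found subgroup, score them with the quality measure, and use a high quantile of those scores as the minimum quality a real result must exceed. Everything here is an illustrative assumption (the function name, the 95% quantile, the number of trials, WRAcc as measure); the target distribution and subgroup size mirror the earlier binary example.

```python
import random

def random_quality_threshold(labels, size, trials=1000, quantile=0.95, seed=0):
    """Quantile of WRAcc over random subsets of the given size."""
    rng = random.Random(seed)
    n = len(labels)
    p_t = sum(labels) / n
    qualities = []
    for _ in range(trials):
        members = rng.sample(range(n), size)        # a random 'subgroup'
        p_st = sum(labels[i] for i in members) / n
        qualities.append(p_st - size / n * p_t)     # its WRAcc
    qualities.sort()
    return qualities[int(quantile * (trials - 1))]

# 100 records with 55 positives; the found subgroup covered 54 records.
labels = [1] * 55 + [0] * 45
threshold = random_quality_threshold(labels, size=54)
print(threshold)
```

A subgroup whose quality lies well above this threshold, such as the .123 computed earlier, is unlikely to be a product of chance alone.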

Open Source

 You can

 Use Cortana binary datamining.liacs.nl/cortana.html

 Use and modify Cortana sources (Java)

Exceptional Model Mining

Subgroup Discovery with multiple target attributes

Mixture of Distributions

[Figure: scatter plots with both axes running from 0 to 100]

 For each datapoint it is unclear whether it belongs to G or to its complement Ḡ

 Intensional description of exceptional subgroup G?

 Model class unknown

 Model parameters unknown

Solution: extend Subgroup Discovery

 Use other information than X and Y: object descriptions D

 Use Subgroup Discovery to scan sub-populations in terms of D

Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.


 Model over subgroup becomes target of SD

Exceptional Model Mining

Subgroup Discovery: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.

Exceptional Model Mining

[Figure labels: object description, target concept (X, y), modeling, Subgroup Discovery]

 Define a target concept (X and y)

 Choose a model class C

 Define a quality measure φ over C

 Use Subgroup Discovery to find exceptional subgroups G and associated model M

Quality Measure

 Specify what defines an exceptional subgroup G based on properties of model M

 Absolute measure (absolute quality of M)

 Correlation coefficient

 Predictive accuracy

 Difference measure (difference between M and M̄, the model on the complement of G)

 Difference in slope

 qualitative properties of classifier

 Reliable results

 Minimum support level

 Statistical significance of G

Correlation Model

 Correlation coefficient

$\varphi_{\rho} = \rho(G)$

 Absolute difference in correlation

$\varphi_{abs} = |\rho(G) - \rho(\bar{G})|$

 Entropy weighted absolute difference

$\varphi_{ent} = H(p) \cdot |\rho(G) - \rho(\bar{G})|$

 Statistical significance of correlation difference $\varphi_{scd}$

 compute z-score from ρ through the Fisher transform

$z^{*} = \dfrac{z' - \bar{z}'}{\sqrt{\dfrac{1}{n-3} + \dfrac{1}{\bar{n}-3}}}$

 compute p-value from z-score
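The two steps can be sketched in code. The Fisher transform of a correlation ρ is z' = atanh(ρ); the z-score follows the formula above, and the p-value is taken from the standard normal distribution. The function names, the two-sided test and the example numbers are assumptions; how the p-value is turned into the final quality φ_scd is not spelled out here and is left open.

```python
from math import atanh, erf, sqrt

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def correlation_difference(rho_g, n_g, rho_c, n_c):
    """z-score and two-sided p-value for the difference between the
    correlation in subgroup G (n_g records) and in its complement (n_c)."""
    z_g, z_c = atanh(rho_g), atanh(rho_c)           # Fisher transform
    z_star = (z_g - z_c) / sqrt(1 / (n_g - 3) + 1 / (n_c - 3))
    p_value = 2 * (1 - normal_cdf(abs(z_star)))
    return z_star, p_value

# Hypothetical example: strong correlation inside G, weak outside.
z_star, p_value = correlation_difference(0.9, 50, 0.1, 50)
print(z_star, p_value)

# Identical correlations give z* = 0 and p = 1.
z_same, p_same = correlation_difference(0.5, 40, 0.5, 60)
```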

Regression Model

 Compare the slope b of $y_i = a + b \cdot x_i + e$ fitted on G with the slope of the same model fitted on the complement Ḡ

 Compute significance of slope difference $\varphi_{ssd}$

drive = 1 ∧ basement = 0 ∧ #baths ≤ 1

y = 41 568 + 3.31·x

y = 30 723 + 8.45·x
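Comparing slopes amounts to fitting the same linear model twice, once on the subgroup and once on its complement. The sketch below uses ordinary least squares on hypothetical, noise-free toy data; it reports only the raw slope difference, not the significance-based φ_ssd, and all names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical records: (x, y, in_subgroup). Inside the subgroup y = 2 + 3x,
# outside it y = 10 + x.
records = [(x, 2 + 3 * x, True) for x in range(10)] + \
          [(x, 10 + x, False) for x in range(10)]

inside = [(x, y) for x, y, g in records if g]
outside = [(x, y) for x, y, g in records if not g]
a_in, b_in = fit_line(*zip(*inside))
a_out, b_out = fit_line(*zip(*outside))
slope_difference = abs(b_in - b_out)
print(b_in, b_out, slope_difference)
```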

Gene Expression Data

11_band = ‘no deletion’ ∧ survival time ≤ 1919 ∧ XP_498569.1 ≤ 57

y = 3313 − 1.77·x

y = 360 + 0.40·x

Classification Model

 Decision Table Majority classifier

 BDeu measure (predictiveness)

[Figure label: whole database]

RIF1  160.45

 Hellinger (unusual distribution)

prognosis = ‘unknown’

General Framework

[Figure labels: object description, target concept (X, y), modeling, Subgroup Discovery]

Object description: propositional ●, multi-relational ●

Search: Subgroup Discovery ●, Decision Trees, SVM, …

Modeling of the target concept: Regression ●, Classification ●, Clustering, Association, Graphical modeling ●
