Are we really discovering "interesting" knowledge from data?

advertisement
Are We Really Discovering
“Interesting” Knowledge from Data?
Alex A. Freitas
University of Kent, UK
Outline of the Talk

Introduction to Knowledge Discovery
–

Selecting “interesting” rules/patterns
–
–


Criteria to evaluate discovered patterns
User-driven (subjective) approach
Data-driven (objective) approach
Summary
Research Directions
The Knowledge Discovery Process
– the “standard” definition

“Knowledge Discovery in Databases is the non-trivial
process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data”
(Fayyad et al. 1996)

Focus on the quality of discovered patterns
– independent of the data mining algorithm

This definition is often quoted, but not very seriously
taken into account
Criteria to Evaluate the “Interestingness” of
Discovered Patterns
useful
Amount of
Research
novel, surprising
comprehensible
valid (accurate)
Difficulty of
measurement
Main Machine Learning / Data Mining
Advances in the Last Decade ??

Ensembles (boosting, bagging, etc)
Support Vector Machines
Etc…

These methods aim at maximizing accuracy, but…


–
they usually discover patterns that are not comprehensible
Motivation for comprehensibility


User’s interpretation and validation of
discovered patterns
Give the user confidence in the discovered
patterns
–

Improves actionability, usefulness
Importance of comprehensibility depends on
the application
–
Pattern recognition vs. medical/scientific applications
Discovering comprehensible
knowledge

Most comprehensible kinds of knowledge
representation ??
–
–
–
–


Decision trees
IF-THEN rules
Bayesian nets
Etc…
No major comparison of the comprehensibility of these
representations - (Pazzani, 2000)
Typical comprehensibility measure: size of the model
–
Syntactical measure, ignores semantics
Accuracy and comprehensibility do not
imply novelty or surprisingness

(Brin et al. 1997) found 6,732 rules with a maximum
‘conviction’ value in Census data, e.g.:
–
–
–
“five years olds don’t work”
“unemployed residents don’t earn income from work”
“men don’t give birth”

(Tsumoto 2000) found 29,050 rules, out of which only
220 (less than 1%) were considered interesting or
surprising by the user

(Wong & Leung 2000) found rules with 40-60%
confidence which were considered novel and more
accurate than some doctor’s domain knowledge
The intersection between useful
(actionable) and surprising patterns

(Silberchatz & Tuzhilin 1996) argue that there is
a large intersection between these two notions
actionable

surprising
The remainder of the talk focuses on the notions
of novelty or surprisingness (unexpectedness)
Selecting the Most Interesting
Rules in a Post-Processing Step


Goal: select a subset of interesting rules out of all
discovered rules, in order to show to the user just the
most interesting rules
User-driven, “subjective” approach
–

Uses the domain knowledge, believes or preferences of the
user
Data-driven, “objective” approach
–
–
Based on statistical properties of discovered patterns
More generic, independent of the application domain and user
Data-driven vs user-driven rule
interestingness measures
all rules
all rules
data-driven
measure
user-driven
measure
Note: user-driven
measures 
real user interest
believes, Dom. Knowl.
user
user
selected
rules
selected
rules
Selecting Interesting Association Rules
based on User-Defined Templates

User-driven approach: the user specifies templates for
interesting and non-interesting rules

Inclusive templates specify interesting rules

Restrictive templates specify non-interesting rules

A rule is considered interesting if it matches at least
one inclusive template and it matches no restrictive
template
(Klemettinen et al. 1994)
Selecting Interesting Association Rules
– an Example

Hierarchies of items and their categories
food
drink
clothes
rice fish … hot dog coke wine … water trousers … socks

Inclusive template: IF (food) THEN (drink)
Restrictive template: IF (hot dog) THEN (coke)

Rule IF (fish) THEN (wine) is interesting

Selecting Surprising Rules based
on General Impressions

User-driven approach: the user specifies her/his
general impressions about the application domain
(Liu et al. 1997, 2000); (Romao et al. 2004)

Example:
IF (salary = high) THEN (credit = good)
(different from “reasonably precise knowledge”, e.g.
IF salary > £35,500 …)

Discovered rules are matched with the user’s general
impressions in a post-processing step, in order to
determine the most surprising (unexpected) rules
Selecting Surprising Rules based
on General Impressions – cont.

Case I – rule with unexpected consequent
A rule and a general impression have similar antecedents
but different consequents

Case II – rule with unexpected antecedent
A rule and a general impression have similar consequents
but the rule antecedent is different from the antecedent of
all general impressions

In both cases the rule is considered unexpected
Selecting Surprising Rules based
on General Impressions - example

GENERAL IMPRESSION:
IF (salary = high) AND (education_level = high)
THEN (credit = good)

UNEXPECTED RULES:
IF (salary > £30,000) AND (education  BSc) AND (mortage = yes)
THEN (credit = bad)
/* Case I - unexpected consequent */
IF (mortgage = no) AND (C/A_balance > £10,000)
THEN (credit = good)
/* Case II – unexpected antecedent */
/* assuming no other general impression has a similar antecedent */
Data-driven approaches for
discovering interesting patterns

Two broad groups of approaches:

Approach based on the statistical strength of patterns
–

entirely data-driven, no user-driven aspect
Based on the assumption that a certain kind of pattern
is surprising to “any user”, in general
–
–
But still independent from the application domain and
from believes / domain knowledge of the target user
So, mainly data-driven
Rule interestingness measures based on
the statistical strength of the patterns

More than 50 measures have been called rule
“interestingness” measures in the literature

For instance, using rule notation: IF (A) THEN (C)
Interest = |A  C| – (|A|  |C|) / N (Piatetsky-Shapiro, 1991)

–
–

Measures deviation from statistical independence between A, C
Measures co-occurrence of A and C, not implication A  C
The vast majority of these measures are based only on
statistical strength or mathematical properties
(Hilderman & Hamilton 2001), (Tan et al. 2002)

Are they correlated with real human interest ??
Evaluating the correlation between datadriven measures and real human interest

3-step Methodology:
–
–
–

Rank discovered rules according to each of a number of
data-driven rule interestingness measures
Show (a subset of) discovered rules to the user, who
evaluates how interesting each rule is
Measure the linear correlation between the ranking of
each data-driven measure and real human interest
Rules are evaluated by an expert user, who
also provided the dataset for the data miner
First Study – (Ohsaki et al. 2004)



A single dataset (hepatitis), and a single user
39 data-driven rule interestingness measures
First experiment: 30 discovered rules
–

1 measure had correlation  0.4; correlation: 0.48
Second experiment: 21 discovered rules
–
4 measures had correlation  0.4; highest corr.: 0.48

In total, out of 78 correlations, 5 (6.4%) were  0.4

NB: the paper also shows other performance measures
Second Study – (Carvalho et al. 2005)




8 datasets, one user for each dataset
11 data-driven rule interestingness measures
Each data-driven measure was evaluated over
72 <user-rule> pairs (8 users  9 rules/dataset)
88 correlation values (8 users  11 measures)
–
–
31 correlation values (35%) were  0.6
The correlations associated with each measure
varied considerably across the 8 datasets/users
Data-driven approaches based on finding
specific kinds of surprising patterns

Principle: a few special kinds of patterns tend to be
surprising to most users, regardless of the “meaning” of
the attributes in the pattern and the application domain

Does not require domain knowledge or believes
specified by user

We discuss three approaches:
–
–
–
Discovery of exception rules contradicting general rules
Measuring the novelty of text-mined rules with WordNet
Detection of Simpson’s paradox
Selecting Surprising Rules Which Are
Exceptions of a General Rule

Data-driven approach: consider the following 2 rules
R1: IF Cond1 THEN Class1 (general rule)
R2: IF Cond1 AND Cond2 THEN Class2 (exception)
Cond1, Cond2 are conjunctions of conditions

Rule R2 is a specialization of rule R1

Rules R1 and R2 predict different classes

An exception rule is interesting if both the rule and its
general rule have a high predictive accuracy
(Suzuki & Kodratoff 1998), (Suzuki 2004)
Selecting Surprising Exception Rules
– an Example

Mining data about car accidents – (Suzuki 2004)
IF (used_seat_belt = yes)
THEN (injury = no)
(general rule)
IF (used_seat_belt = yes) AND (passenger = child)
THEN (injury = yes)
(exception rule)

If both rules are accurate, then the exception rule would be
considered interesting (it is both accurate and surprising)
Measuring Novelty of Text-Mined Rules
with WordNet
(Basu et al. 2001)




Motivation: avoid the discovery of trivial text-mined
rules such as “SQL  database”
WordNet is an online lexical knowledge-base of about
130,000 English words and some semantic
relationships between them
Rule format: IF A (words) THEN C (words)
Novelty measure based on measuring the semantic
distance between each pair of words in A and C
–
Distance based on number of edges in the shortest path
between words in the WordNet network
Simpson’s Paradox – an example

Income and taxes in the USA (in 109 US$)
Year
Value
1974
income
880.18
tax
123.69
rate
0.141
From 1974 to 1978, the
1978
tax rate (tax / income)
income 1242.40
has increased
tax
188.58
rate
0.152
Simpson’s Paradox – an example
(cont.)
Year
1974
income
tax
rate
1978
income
tax
rate
<5
income category (in 103 US$)
5..10 10..15 15..100 >100
Total
41.7 146.4 192.7
2.2
13.7
21.5
0.054 0.093 0.111
470.0
75.0
0.160
29.4
11.3
0.384
880.18
123.69
0.141
19.9 122.9
0.7
8.8
0.035 0.072
865.0
137.9
0.159
62.81
24.05
0.383
1242.40
188.58
0.152
171.9
17.2
0.100
Overall the rate increased, but it decreased in each category!
Discovering Instances of Simpson’s
Paradox as Surprising Patterns


Simpson’s paradox occurs when:
One value of attribute A (year) increases the probability
of event X (tax) in a given population but, at the same
time, decreases the probability of X in every
subpopulation defined by attribute B (income category)

Although Simpson’s paradox is known by statisticians,
occurrences of the paradox are surprising to users

Algorithms that find all instances of the paradox in data
and rank them in decreasing order of surprisingness:
–
(Fabris & Freitas 1999, 2005) – algorithms for “flat”, singlerelation data and for hierarchical multidimensional (OLAP) data
Summary



Data mining algorithms can easily discover too many
rules/patterns to the user – there is a clear motivation
for selecting the most interesting rules/patterns
The main challenge is to discover novel, useful
patterns, going beyond accuracy and comprehensibility
User-driven and data-driven approaches have
complementary advantages and disadvantages
–

Using a hybrid approach seems sensible
Discovering patterns that are truly interesting to the
user without using a lot of user-specified domain
knowledge is an open problem
Research Direction – improving the
effectiveness of data-driven measures

Estimate the user’s interest in a rule with an ensemble
of data-driven rule interestingness measures: int1…intm
Training Set
Test Set
Rule 1
:
int1 . . . Intm user’s int
0.9
0.4 high
:
Rule n
... :
0.3
0.8
int1 . . . Intm user’s int
0.2
0.5
?
:
:
low
0.9
...
:
:
0.2
?

Mapping from (int1…intm) into predicted user’s interest:
can be a predefined function (e.g. voting) or learned

In (Abe et al. 2005) a classification algo learns the mapping
Research Direction – increasing the
autonomy of the user-driven approach


User-driven approach is very user-dependent, requires
“prior model” (e.g. general impressions) from the user
Automatic learning of general impressions with text
mining
data
Web
text
mining
general
impressions
data
mining
patterns
Research Direction – a broader role
for interestingness measures

Think about interestingness (surprisingness,
usefulness) since the start of the KDD process:
pre-proc
data mining
post-proc
interestingness measures
E.g. Attribute selection and construction trying to
maximize not only accuracy and comprehensibility, but
also surprisingness and/or usefulness
Any Questions ??
Thanks for listening!
References




(Abe et al. 2005) H. Abe, S. Tsumoto, M. Ohsaki, T. Yamaguchi. A rule
evaluation support method with learning models based on objective rule
evaluation indices. To appear in 2005 IEEE Int. Conf. on Data Mining.
(Basu et al. 2001) S. Basu, R.J. Mooney, K.V. Pasupuleti, J. Ghosh.
Evaluating the novelty of text-mined rules using lexical knowledge. Proc.
KDD-2001, 233-238.
(Brin et al. 1997) S. Brin, R. Motwani, J.D. Ullman, S. Tsur. Dynamic
itemset counting and implication rules for market basket data. Proc. KDD97.
(Carvalho et al. 2005) D.R. Carvalho, A.A. Freitas, N.F. Ebecken.
Evaluating the correlation between objective rule interestingness
measures and real human interest. To appear in Proc. PKDD-2005.
References (cont.)




(Fabris & Freitas1999) C.C. Fabris and A.A. Freitas. Discovering
surprising patterns by detecting occurrences of Simpson's paradox. In:
Research and Development in Intelligent Systems XVI, 148-160. Springer
(Fabris & Freitas 2005) C.C. Fabris and A.A. Freitas. Discovering
surprising instances of Simpson’s paradox in hierarchical
multidimensional data. To appear in Int. J. of Data Warehousing and
Mining.
(Fayyad et al., 1996) U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From
data mining to knowledge discovery: an overview. In: Advances in
Knowledge Discovery and Data Mining, 1-34. AAAI Press.
(Hilderman & Hamilton 2001) R.J. Hilderman and H.J. Hamilton.
Knowledge Discovery and Measures of Interest, Kluwer.
References (cont.)




(Klemettinen et al. 1994) M. Klemettinen, H. Mannila, P. Ronkainen, H.
Toivonen, A.I. Verkamo. Finding interesting rules from large sets of
discovered association rules. Proc. 3rd Int. Conf. on Information and
Knowledge Management, 401-407.
(Liu et al. 1997) B. Liu, W. Hsu, S. Chen. Using general impressions to
analyze discovered classification rules. Proc. KDD-97, 31-36
(Liu et al. 1999) B. Liu, W. Hsu, L-F. Mun, H-Y Lee. Finding interesting
patterns using user expectations. IEEE Trans. on Knowledge Engineering
11(6), 817-832.
(Ohsaki et al. 2004) M. Ohsaki, S. Kitaguchi, K. Okamoto, H. Yokoi, T.
Yamaguchi. Evaluation of rule interestingness measures with a clinical
dataset on hepatitis. Proc. PKDD-2004, 362-373.
References (cont.)





(Pazzani 2000) M.J. Pazzani. Knowledge discovery from data? IEEE
Intelligent Systems, Mar/Apr. 2000, pp. 10-13.
(Piatetsky-Shapiro, 1991) G. Piatetsky-Shapiro. Discovery, analysis and
presentation of strong rules. In: Knowledge Discovery in Databases, 229248. AAAI/MIT Press.
(Romao et al. 2004) W. Romao, A.A. Freitas, I.M.S. Gimenes.
Discovering Interesting Knowledge from a Science & Technology
Database with a Genetic Algorithm. Applied Soft Computing 4, 121-137.
(Silberchatz & Tuzhilin 1996) S. Silberchatz and A. Tuzhilin. What makes
patterns interesting in knowledge discovery systems. IEEE Trans.
Knowledge and Data Engineering, 8(6).
(Suzuki & Kodratoff 1998) E. Suzuki and Y. Kodratoff. Discovery of
surprising exception rules based on intensity of implication. Proc. PKDD98.
References (cont.)




(Suzuki 2004) E. Suzuki. Discovering interesting exception rules with rule
pair. Proc. Workshop on Advances in Inductive Rule Learning at PKDD2004, 163-178.
(Tan et al. 2002) P-N. Tan, V. Kumar, J. Srivastava. Selecting the right
interestingness measure for association patterns. Proc. KDD-2002.
(Tsumoto 2000) S. Tsumoto. Clinical knowledge discovery in hospital
information systems: two case studies. Proc. PKDD-2000, 652-656.
(Wong & Leung 2000) M.L. Wong and K.S. Leung. Data mining using
grammar based genetic programming and applications. Kluwer.
Download