Lecture1

advertisement
What is Data Mining about?

Basic Introduction

Main Tasks in DM

Applications of DM

Relations to other disciplines

Machine Learning: Supervised and
unsupervised learning

Concepts, Concept Space…
Reading material:
 Chapter 1 of the textbook by Witten,

Chapters 1, 2 of Textbook by T.
Mitchell,

Chapter 1 of the book by Han.
Data vs. information
—Huge amount of data from sports,
business, science, medicine,
conomics. Kb,Mb,Gb,Tb, Lb…
Data: recorded facts
Information: patterns underlying the data.
—Raw data is not helpful
—How to extract patterns from raw data?
Example:
—Marketing companies use historical
response data to build models to predict
who will respond to a direct mail or
telephone solicitation.
—The government agency sifts through the
records of financial transactions to detect
money laundering or drug smuggling.
—Diagnosis, building expert systems to
help physicians based on the previous
experience
Data Mining = KDD:
— Knowledge Discovery in Database
System: extensive works in database
system,
— Statistical learning has been being
active for decades.
— Machine Learning: a mature
subfield in Artificial Intelligence.
Data Mining is
Extraction of implicit, previously unknown, and
potentially useful information from data;
— Needed: programs that detect patterns
and regularities in the data;
— Strong patterns can be used to make
predictions.
Problems:
— Most patterns are not interesting;
— Patterns may be inexact (or even
completely spurious) if data is garbled or
missing.
Machine learning techniques
— Algorithms for acquiring structural
descriptions from examples
— Structural descriptions represent
patterns explicitly
Machine Learning Methods can be used to
— predict outcome in new situation;
— understand and explain how prediction
is derived (maybe even more important).
Can machines really learn?
— Definitions of “learning” from dictionary:
To get knowledge of by study, experience, or
being taught; To commit to memory; To receive
instruction…
How to measure this `learning’? The last two
tasks are easy for computers.
— Operational definition:
Things learn when they change their behavior in
a way that makes them perform better in the
future.
 Does a slipper learn?
 Does learning imply intention?
Definition: A computer program is
said to learn from experience E with
respect some class of tasks T and
perfomance P, if its performance at
tasks in T, as measured by P,
improves with experience E.
Designing a learning system:
Example: A checkers learning problem:
Task T: Playing Checkers;
Performance measure P: percent of games
won against opponents;
Training experience E: playing practice
games against itself.
A Learning system includes:
— Choosing the training Experience
1. Whether the training experience provides direct or
indirect feedback regarding the choices made by
performance system.
2. To which degree the learner can control the
sequence of training examples.
3. How well the training experience represents the
distribution of examples of the final system
performance P must be measured.
— Choosing the target function
Determine which kind of knowledge will be learned
and how this will be used by the performance
program.
A Data Mining Process consists of:
— Choosing a representation for the target
function
— Choosing a learning algorithm
The weather problem
— Conditions for playing an unspecified
game
Play
Yes
Yes
No
No
Yes
……
Windy
False
False
True
False
True
……
Humidity
Normal
High
High
High
Normal
……
Temperature
Mild
Hot
Hot
Hot
Cool
……
Outlook
Rainy
Overcast
Sunny
Sunny
Overcast
……
Structural Description:
If---Then Structure:
If outlook = sunny and humidity = high then play
= no
Classification vs. association rules
— Classification rule: predicts value of prespecified attribute (the classification of an
example)
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
— Association rule: predicts value of
arbitrary attribute or combination of
attributes
If outlook = sunny and humidity = high then
play = no
If temperature = cool then humidity = normal
If humidity = normal and windy = false then
play = yes
Enumerating the version space:
Domain space: All possible combinations of
the examples, equals to the product of the
number of possible values for each
attribute. For the weather problem,
3*3*2*2=36. (`Play’ is the target attribute).
How many possible classification rules?
— If some attributes do not appear in the
if…then structure, we use `?’ to denote the
value of the corresponding attribute.
For example, (?,mild, normal,?, play) means
If the temperature = mild and humidity = normal,
then play=yes.
Therefore, the concept space is: 4*4*3*3*2=288
Space of rules set: approximately 2.7*10^27
Version space is the space of consistent
concept with respect to the present training
set.
Weather data with mixed
attributes
— Two attributes with numeric values
Play
Yes
Yes
No
No
Yes
……
Windy
False
False
True
False
True
……
Humidity
85
90
86
80
95
……
Temperature
85
80
83
75
65
……
Outlook
Rainy
Overcast
Sunny
Sunny
Overcast
Classification Rules:
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Question: How to count the version space
and concept space? How to add test in the
classification rule?
The contact lenses data
Age
Spectacle prescription Astigmatism Tear production rate Recommended lenses
Prepresbyopic
Hypermetrope
Yes
Reduced
None
Prepresbyopic
Hypermetrope
Yes
Normal
None
Presbyopic
Myope
No
Reduced
None
Presbyopic
Myope
No
Normal
None
Presbyopic
Myope
Yes
Reduced
None
Presbyopic
Myope
Yes
Normal
Hard
Presbyopic
Hypermetrope
No
Reduced
None
Presbyopic
Hypermetrope
No
Normal
Soft
Presbyopic
Hypermetrope
Yes
Reduced
None
Presbyopic
Hypermetrope
Yes
Normal
None
Prepresbyopic
Hypermetrope
No
Normal
Soft
Prepresbyopic
Hypermetrope
No
Reduced
None
Prepresbyopic
Myope
Yes
Normal
Hard
Prepresbyopic
Myope
Yes
Reduced
None
Prepresbyopic
Myope
No
Normal
Soft
Prepresbyopic
Myope
No
Reduced
None
Young
Hypermetrope
Yes
Normal
Hard
Young
Hypermetrope
Yes
Reduced
None
Young
Hypermetrope
No
Normal
Soft
Young
Hypermetrope
No
Reduced
None
Young
Myope
Yes
Normal
Hard
Young
Myope
Yes
Reduced
None
Young
Myope
No
Normal
Soft
Young
Myope
No
Reduced
None
Issues: Instances with little difference might
have the same value for the target attribute.
A complete and correct rule set
If tear production rate = reduced then
recommendation = none
If age = young and astigmatic = no and tear production rate
= normal then recommendation = soft
If age = pre- presbyopic and astigmatic = no and tear
production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal
then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then
recommendation = hard
If age young and astigmatic = yes and tear production rate =
normal then recommendation = hard
If age = pre- presbyopic and spectacle prescription =
hypermetrope and astigmatic = yes then
recommendation = none
If age = presbyopic and spectacle prescription =
hypermetrope and astigmatic = yes
then recommendation = none
In total, we have 9 rules. Can we
summarize the patterns more efficiently?
Classifying iris flowers
Sepal length Sepal width
5.1
3.5
4.9
3.0
7.0
3.2
6.4
3.2
6.3
3.3
5.8
2.7
……
Petal length
1.4
1.4
4.7
4.5
6.0
5.1
Petal width
0.2
0.2
1.4
1.5
2.5
1.9
Type
Setosa
Setosa
Versicolor
Versicolor
Virginica
Virginica
If petal length < 2.45 then Iris
setosa
If sepal width < 2.10 then Iris
versicolor...
Predicting CPU performance
Cycle time(ns) Main memory(kb) Cache Channels Performance
MYCT
Mmin Mmax
Cach Chmin Chmax PRP
125
256 6000
256
16
128
198
29
8000 32000
32
8
32
269
480
512
8000
32
0
0
67
480
1000 4000
0
0
0
45
……
PRP = -55.9 + 0.0489 MYCT + 0.0153
MMIN + 0.0056 MMAX + 0.6410 CACH 0. 2700 CHMIN + 1.480 CHMAX
Examples: 209 different computer configurations.
Using Linear regression function
Data from labor negotiations
— A case with missing values
Attribute
Type
1
2
Duration
{number of years} 1
2
Wage raise first year
Percentage%
2
4
Wage raise second year Percentage%
?
5
Wage raise third year
Percentage%
?
?
Living cost adjustment
{none,tcf,tc}
none tcf
Working time per week {Hours}
25
35
Pension
{none,ret-allw,empl-cntr} none ?
Standby pay
Percentage%
?
13
Shift supplement
Percentage%
?
5
Education allowance
{yes,no}
yes
?
Statutory holidays
number of days 11
15
Vacation
{below,avg,gen}
avg gen
Long-term disability assist {yes,no}
no
?
Dental plan
{none,half,full} none ?
Bereavement assist
{yes,no}
no
?
Health plan contribution
{none,half,full} none ?
Acceptability of contract {good,bad}
bad good
3 …
3
4.3
4.4
?
?
38
?
?
4
?
12
gen
?
full
?
full
good
Why have these values been missed and
how to estimate these missing values?
40
2
4.5
4
?
none
40
?
?
4
?
12
avg
yes
full
yes
half
good
Soybean classification
Attribute
Number of values Sample value
Environment Time of Occurrence
7
July
Precipitation
3
Above normal
Seed
Condition
2
Normal
Mold growth
2
Absent
Fruit
Condition of fruit pods
4
Normal
Fruit spots
5
?
Leaves
Condition
2
Abnormal
Leaf spot size
3
?
Stem
Condition
2
Abnormal
Stem Lodging
2
Yes
Root
Condition
3
Normal
Diagnosis
19 Diaporthe stem canker
Domain knowledge plays an important role.
If leaf condition is normal and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown
then diagnosis is rhizoctonia root rot
If leaf malformation is absent and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown
then diagnosis is rhizoctonia root rot
Data Mining Applications:
Processing loan application
— Given: questionnaire with financial and
personal information
— Problem: should money be lent?
— Simple statistical method covers 90% of
cases
— Borderline cases referred to loan officers
— But: 50% of accepted borderline cases
defaulted!
— Solution(?): reject all borderline cases
No! Borderline cases are most active customers
Enter machine learning
— 1000 training examples of borderline cases
— 20 attributes: age, years with current
employer, years at current address, years with
the bank, other credit cards possessed…
— Learned rules predicted 2/ 3 of borderline
cases correctly: a big deal in business.
— Rules could be used to explain decisions to
customers
More Applications: Screening images,
Load forecasting, Diagnosis of Machine
fault, Marketing and sales, DNA recognition,
etc…
Inductive Learning: finding a concept
that fits the data.
Let us recall the weather problem. It is
possible that the target attribute `play=no’
no matter what are the values of the other
attributes. We use the symbol `’ to denote
this situation or equivalently (,,,,) in
the concept space.
General-to-specific ordering:
Two descriptions:
— d1=(sunny,?,?,hot,?), d2=(sunny,?,?,?,?).
Consider the sets s1, s2 of instances
classified positive by d1 and d2. Because d2
poses fewer constraints on the instances, s1
is a subset of s2. Correspondingly, we use
the symbol d1<d2. This yields an order of the
components in the version space.
Finding max-general description:
It is possible that there is no `<’ or `>’ relation
between two general descriptions. For a specific
description d, if there is no other description d’ in
a set S of examples satisfying d’>d, then we say
d is the maximally general in S.
We can similarly define the maximally specific
(or minimally general) description in S
We can use intuitive greedy algorithm to find
a max-general description in S based on the
general-to-specific (or its converse) ordering
search in the concept space.
The space of consistent concept descriptions is
completely determined by two sets
L: most specific descriptions that cover all
positive examples and no negative ones
G: most general descriptions that do not cover
any negative examples and all positive ones
— Only L and G need to be maintained and
updated
Candidate- elimination algorithm
Initialize L () and G (?)
For each example e
If e is positive:
Delete all elements from G that do not cover e
For each element r in L that does not cover e:
Replace r by all of its most specific
generalizations that cover e and that are
more specific than some element in G
Remove elements from L that are more general
than some other element in L
If e is negative:
Delete all elements from L that cover e
For each element r in G that covers e:
Replace r by all of its most general
specializations that do not cover e and that
are more general than some element in L
Remove elements from G that are more specific
than some other element in G
Example of Candidate Elimination:
Play
Yes
Yes
No
Yes
Windy
T
T
F
T
Humidity
Normal
High
High
Normal
Temperature
Hot
Hot
Cold
Cold
Outlook
Sunny
Sunny
Rainy
Sunny
L(0)={ (,,,)} L(1)={(sunny,hot,normal,t)}
L(2)={(sunny,hot,?,T)} L(3)=L(2)
L(4)={(sunny,?,?,T)} L(0)<L(1)<L(2)=L(3)<L(4)=L
G(0)={(?,?,?,?)} G(1)=G(0) G(2)=G(1)
G(3)={(sunny,?,?,?), (?,hot,?,?),(?,?,?,T)}
G(4)={(sunny,?,?,?),(?,?,?,T)}=G
Bias: important decisions in learning systems:
The concept description language
The order in which the space is searched
The way that overfitting to the particular
training data is avoided
— These properties form the “bias” of the
search: Language bias,Search bias
and Overfitting- avoidance bias
Language bias
— Most important question: is language
universal or does it restrict what can be
learned?
— Universal language can express arbitrary
subsets of examples
— If language can represent statements
involving logical or (“ disjunctions”) it is
universal
— Example: rule sets
— Domain knowledge can be used to
exclude some concept descriptions
a priori from the search
Search bias
— Search heuristic “Greedy” search:
performing the best single step
“Beam search”: keeping several
alternatives …
— Direction of search, General-to-specific
’ E. g. specializing a rule by adding
conditions, Specific-to-general
’ E. g. generalizing an individual instance
into a rule
Overfitting- avoidance bias
— Can be seen as a form of search bias
— Modified evaluation criterion
E. g. balancing simplicity and number of
errors
— Modified search strategy
E. g. pruning (simplifying a description)
Pre-pruning: stops at a simple description
before search proceeds to an overly
complex one
Post-pruning: generates a complex
description first and simplifies it afterwards
Concepts, Instances, Attributes:
1. Concepts: kinds of things that can be learned.
’ Aim: intelligible and operational concept
description
2. Instances: the individual, independent
examples of a concept.
3. Attributes: measuring aspects of an instance:
nominal and numeric ones
— Practical issue: a file format for the input
Concept description: output of learning
scheme
Concepts in Data Mining:
— Styles of learning:
1. Classification: predicting a discrete class
2. Association: detecting association rules
among features
3. Clustering: grouping similar instances into
clusters
4. Numeric prediction: predicting a numeric
quantity
Reading material: Chapters 2 and 3 of
textbook by Witten, Chapter 1, Sections
3.1,3.2 and 5.2 Of the book by Han
(Reference 2).
Classification learning:
1. Classification learning is the so-called
supervised learning where the scheme will
present a final actual outcome: the class
of the example
Example problems: weather data, contact
lenses, irises, labor negotiations
— Success can be measured on fresh data
for which class labels are known
— In practice success is often measured
subjectively
Association learning is the learning
where no class is specified and any kind of
structure is considered “interesting”
Difference to classification learning:
predicting any attribute’s value, not just the
class, and more than one attribute’s value at
a time
There are far more association rules than
classification rules
To measure the success of an
association rule, we introduce two
notions
Coverage: instances that can be covered
by the rule
Accuracy: correctly predicted instances.
Minimum coverage and accuracy are posed
in learning to avoid too many useless rules.
Clustering is to find groups of items
that are similar to each other.
Clustering is unsupervised where the
class of an example is not known
Each group can be assigned as a
class. Success of clustering often
measured subjectively: how useful are
these groups to the user?
— Example: iris data without class
Numeric prediction: classification with
numeric “class”: supervised learning.
Success is measured on test data (or
subjectively if concept description is
intelligible)
— Example: weather data with numeric
attributes, the performance of CPU…
Issues in Data Mining:
In methodologies and interactions:
 Mining different kind of knowledge in
databases;
 Interactive mining at various level;
 Incorporating domain knowledge;
 Query languages and ad hoc mining;
 Presentation and visualization;
 Dealing with noisy and incomplete data.
Performance Issues:

Efficiency and scalability of the
algorithm;

Parallel, distributed and incremental
mining algorithm.
Issues relevant to database:

Handling relational and complex
data;

Mining from heterogeneous
databases and global information
system.
Download