Advanced Topics in Computer
Systems: Machine Learning and
Data Mining Systems
Winter 2007
Stan Matwin
Professor
School of Information Technology and Engineering/
École d’ingénierie et de technologie de l’information
University of Ottawa
Canada
1
Goals of this course
• Dual seminar/tutorial structure
• The tutorial part will teach basic concepts of
Machine Learning (ML) and Data Mining (DM)
• The seminar part will
– introduce interesting areas of current and future research
– Introduce successful applications
• Preparation to enable advanced self-study on
ML/DM
2
Course outline
1. Machine Learning/Data Mining: basic terminology.
2. Symbolic learning: Decision Trees;
3. Basic Performance Evaluation
4. Introduction to the WEKA system
5. Probabilistic learning: Bayesian learning.
6. Text classification
3
7. Kernel-based methods: Support Vector Machines
8. Ensemble-based methods: boosting
9. Advanced Performance Evaluation: ROC curves
10. Applications in bioinformatics
11. Data mining concepts and techniques: Association Rules
12. Feature selection and discretization
4
Machine Learning / Data Mining:
basic terminology
• Machine Learning:
– given a certain task, and a data set that constitutes the
task,
– ML provides algorithms that resolve the task based on
the data, and the solution improves with time
• Examples:
  – predicting lottery numbers next Saturday
  – detecting oil spills on sea surface
  – assigning documents to a folder
  – identifying people likely to want a new credit card (cross selling)
5
• Data Mining: extracting regularities from a
VERY LARGE dataset/database as part of a
business/application cycle
• examples:
  – cell phone fraud detection
  – customer churn
  – direct mail targeting / cross sell
  – prediction of aircraft component failures
6
Basic ML tasks
• Supervised learning
– classification/concept learning
– estimation: essentially, extrapolation
• Unsupervised learning:
– clustering: finding groups of “similar” objects
– associations: in a database, finding that some values of attributes go together with some other values
7
Concept learning (also known as
classification): a definition
• the concept learning problem:
• given
– a set E = {e1, e2, …, en} of training instances of concepts,
each labeled with the name of a concept C1, …,Ck to
which it belongs
• determine
– definitions of each of C1, …,Ck which correctly cover E.
Each definition is a concept description
8
Dimensions of concept learning
• representation;
– data
» symbolic
» numeric
– concept description
» attribute-value (propositional logic)
» relational (first order logic)
• Language of examples and hypotheses
– Attribute-value (AV) = propositional representation
– Relational (ILP) = first-order logic representation
• method of learning
– top-down
– bottom-up (covering)
– different search algorithms
9
2. Decision Trees
A decision tree as a concept representation:
[Figure: a decision tree for the labour negotiations data; the root tests wage incr. 1st yr, internal nodes test working hrs, statutory holidays, contribution to hlth plan and wage incr. 1st yr again, and the leaves are labelled good or bad]
10
building a univariate (single attribute is tested)
decision tree from a set T of training cases for
a concept C with classes C1,…Ck
Consider three possibilities:
• T contains 1 or more cases all belonging to
the same class Cj. The decision tree for T is a
leaf identifying class Cj
• T contains no cases. The tree is a leaf, but the
label is assigned heuristically, e.g. the
majority class in the parent of this node
11
• T contains cases from different classes. T is divided into subsets that seem to lead towards single-class collections of cases. A test t based on a single attribute is chosen, and it partitions T into subsets {T1,…,Tn}. The decision tree consists of a decision node identifying the tested attribute, and one branch for each outcome of the test. Then, the same process is applied recursively to each Ti
12
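The three cases translate almost directly into code. Below is a minimal sketch (not C4.5 itself), assuming the training cases are (attribute-dict, class) pairs and that a choose_test function (e.g. maximum information gain, discussed on the next slides) is supplied:

from collections import Counter

def build_tree(cases, attributes, choose_test, parent_majority=None):
    """Univariate decision tree construction following the three cases above."""
    if not cases:                        # case 2: no cases -> leaf labelled heuristically
        return ('leaf', parent_majority)
    classes = [c for _, c in cases]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1 or not attributes:   # case 1: single class (or no attrs left)
        return ('leaf', majority)
    # case 3: choose a test on a single attribute and recurse on each outcome
    attr = choose_test(cases, attributes)          # e.g. the max-gain attribute
    branches = {}
    for value in {x[attr] for x, _ in cases}:
        subset = [(x, c) for x, c in cases if x[attr] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != attr],
                                     choose_test, majority)
    return ('node', attr, branches)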
Choosing the test
• why not explore all possible trees and choose the simplest (Occam’s razor)? Because this is an NP-complete problem; e.g. in the ‘union’ example there are millions of trees consistent with the data
• notation: S: set of the training examples; freq(Ci, S) = number of examples in S that belong to Ci
• the information content (in bits) of a message is - log2 of the probability of that message
• idea: maximize the difference between the info needed to identify the class of an example in T, and the same info after T has been partitioned in accordance with a test X
13
selecting 1 case and announcing its class carries information of
  - log2(freq(Ci, S)/|S|) bits
to find the information pertaining to class membership over all classes:
  info(S) = - Σi (freq(Ci, S)/|S|) * log2(freq(Ci, S)/|S|)

after partitioning according to the outcomes of test X:
  infoX(T) = Σi (|Ti|/|T|) * info(Ti)
  gain(X) = info(T) - infoX(T) measures the gain from partitioning T according to X
We select X to maximize this gain
14
Data for learning the weather
(play/don’t play) concept (Witten p. 10)
Day  Outlook  Temp  Humidity  Wind    Play?
 1   Sunny    Hot   High      Weak    No
 2   Sunny    Hot   High      Strong  No
 3   Ovcst    Hot   High      Weak    Yes
 4   Rain     Mild  High      Weak    Yes
 5   Rain     Cool  Normal    Weak    Yes
 6   Rain     Cool  Normal    Strong  No
 7   Ovcst    Cool  Normal    Strong  Yes
 8   Sunny    Mild  High      Weak    No
 9   Sunny    Cool  Normal    Weak    Yes
10   Rain     Mild  Normal    Weak    Yes
11   Sunny    Mild  Normal    Strong  Yes
12   Ovcst    Mild  High      Strong  Yes
13   Ovcst    Hot   Normal    Weak    Yes
14   Rain     Mild  High      Strong  No
Info(S) = 0.940
15
Selecting the attribute
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temp) = 0.029
• Choose Outlook as the top test
16
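The info and gain numbers above can be reproduced with a short script; this is a sketch that re-types the weather table from the previous slide (small differences in the last digit are rounding):

from math import log2
from collections import Counter

data = [  # (Outlook, Temp, Humidity, Wind, Play?)
    ('Sunny','Hot','High','Weak','No'),      ('Sunny','Hot','High','Strong','No'),
    ('Ovcst','Hot','High','Weak','Yes'),     ('Rain','Mild','High','Weak','Yes'),
    ('Rain','Cool','Normal','Weak','Yes'),   ('Rain','Cool','Normal','Strong','No'),
    ('Ovcst','Cool','Normal','Strong','Yes'),('Sunny','Mild','High','Weak','No'),
    ('Sunny','Cool','Normal','Weak','Yes'),  ('Rain','Mild','Normal','Weak','Yes'),
    ('Sunny','Mild','Normal','Strong','Yes'),('Ovcst','Mild','High','Strong','Yes'),
    ('Ovcst','Hot','Normal','Weak','Yes'),   ('Rain','Mild','High','Strong','No')]

def info(rows):
    """info(S) = - sum_i p_i log2 p_i over the class (last column)."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c/n * log2(c/n) for c in counts.values())

def gain(rows, col):
    """gain(X) = info(T) - sum_i |Ti|/|T| * info(Ti) for the test on column col."""
    n = len(rows)
    split = sum(len(sub)/n * info(sub)
                for v in {r[col] for r in rows}
                for sub in [[r for r in rows if r[col] == v]])
    return info(rows) - split

print(round(info(data), 3))                   # 0.94
for name, col in [('Outlook', 0), ('Temp', 1), ('Humidity', 2), ('Wind', 3)]:
    print(name, round(gain(data, col), 3))    # 0.247, 0.029, 0.152, 0.048 (cf. 0.246/0.029/0.151/0.048)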
How does info gain work?
17
Gain ratio
• info gain favours tests with many outcomes (patient id example)
• consider split info(X) = - Σi (|Ti|/|T|) * log2(|Ti|/|T|)
  it measures the potential info generated by dividing T into n subsets (without considering the class info)
gain ratio(X) = gain(X) / split info(X)
shows the proportion of info generated by the split that is useful for classification: in the example (Witten p. 96), log(k)/log(n)
maximize gain ratio
18
Partition of
cases and
corresp. tree
19
In fact, learning DTs with the gain ratio heuristic
is a search:
20
continuous attrs
• a simple trick: sort the examples on the values of the attribute considered; consider the midpoint between each two consecutive values as a candidate split. For m values, there are m-1 possible splits, but they can be examined in one linear scan
• cost?
21
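A sketch of the trick for one numeric attribute; impurity is any measure such as info() from the gain example above (a smarter implementation would keep running class counts so the m-1 candidate splits are evaluated in a single linear scan):

def best_numeric_split(rows, col, impurity):
    """Sort on the attribute, try the midpoint between each two consecutive
    distinct values, and return (best_threshold, best_score)."""
    values = sorted({r[col] for r in rows})
    best = (None, float('inf'))
    for lo, hi in zip(values, values[1:]):          # m values -> m-1 candidate splits
        t = (lo + hi) / 2
        left  = [r for r in rows if r[col] <= t]
        right = [r for r in rows if r[col] > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(rows)
        if score < best[1]:
            best = (t, score)
    return best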
From trees to rules:
traversing a decision tree from root to leaf gives
a rule, with the path conditions as the
antecedent and the leaf as the class
rules can then be simplified by removing
conditions that do not contribute to
discriminate the nominated class from other
classes
rulesets for a whole class are simplified by
removing rules that do not contribute to the
accuracy of the whole set
22
Geometric interpretation of decision trees:
axis-parallel area
[Figure: the input space partitioned into axis-parallel regions by the tests b > b1 (y/n), a > a1 and a < a2, with thresholds a1, a2, b1 on the axes; each region is labelled with a class]
23
Decision rules can be obtained from decision trees (regions as in the figure above):
(1) if b > b1 then class is -
(2) if b <= b1 and a > a1 then class is +
(3) if b <= b1 and a < a2 then class is +
(4) if b <= b1 and a2 <= a <= a1 then class is -
notice the inference involved in rule (3)
24
1R
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value.
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
25
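A compact sketch of 1R, for data given as rows of attribute values with the class label in the last position (as in the weather table above):

from collections import Counter, defaultdict

def one_r(rows, n_attrs):
    """Return (best_attr, rules, error_count) where rules maps value -> majority class."""
    best = None
    for a in range(n_attrs):
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[a]][r[-1]] += 1         # count classes per attribute value
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(sum(cnt.values()) - cnt.most_common(1)[0][1]
                     for cnt in by_value.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

# on the weather data, attribute 0 (Outlook) gives 4/14 errors and is chosen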
26
lots of datasets can be obtained from
ftp ics.uci.edu
cd pub/machine-learning-databases
contents are described in the file README in
the
dir machine-learning-databases at Irvine
27
Empirical evaluation of
accuracy in classification tasks
• the usual approach:
– partition the set E of all labeled examples (examples
with their classification labels) into a training set and a
testing set
– use the training set for learning, obtain a hypothesis H,
set acc := 0
– for each element t of the testing set: apply H to t; if H(t) = label(t) then acc := acc + 1
– acc := acc/|testing set|
28
Testing - cont’d
• Given a dataset, how do we split it between
the training set and the test set?
• cross-validation (n-fold)
– partition E into n groups
– choose n-1 groups from n, perform learning on their
union
– repeat the choice n times
– average the n results
– usually, n = 3, 5, 10
• another approach - learn on all but one
example, test that example.
“Leave One Out”
29
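A minimal sketch of n-fold cross-validation in this setting; learn and classify are assumed (hypothetical) callables, and leave-one-out is simply the case n = |E|:

import random

def cross_validate(examples, n, learn, classify, seed=0):
    """Split examples into n folds; train on n-1 folds, test on the held-out one,
    and return the average accuracy over the n repetitions."""
    ex = examples[:]
    random.Random(seed).shuffle(ex)
    folds = [ex[i::n] for i in range(n)]          # n roughly equal groups
    accs = []
    for i in range(n):
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        h = learn(train)
        correct = sum(1 for x, label in test if classify(h, x) == label)
        accs.append(correct / len(test))
    return sum(accs) / n

# leave-one-out: cross_validate(E, n=len(E), learn=..., classify=...)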
Confusion matrix
                      classifier-determined    classifier-determined
                      positive label           negative label
true positive label   a                        b
true negative label   c                        d
Accuracy = (a+d)/(a+b+c+d)
a = true positives
b = false negatives
c = false positives
d = true negatives
30
• Precision = a/(a+c)
• Recall = a/(a+b)
• F-measure combines Recall and Precision:
• Fβ = (β²+1)*P*R / (β²*P + R)
• β reflects the importance of Recall versus Precision; e.g. F0 = P
31
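The same measures in code, using the a/b/c/d naming of the confusion matrix above; beta is the recall-vs-precision weight (beta = 1 gives the usual F1, beta -> 0 gives precision only):

def prf(a, b, c, d, beta=1.0):
    """a=true positives, b=false negatives, c=false positives, d=true negatives."""
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall    = a / (a + b)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return accuracy, precision, recall, f_beta

print(prf(a=40, b=10, c=20, d=30))   # hypothetical counts, just for illustration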
Cost matrix
• It is like the confusion matrix, except that costs of errors are assigned to the off-diagonal elements (misclassifications)
• this may be important in applications, e.g. when the
classifier is a diagnosis rule
• see
http://ai.iit.nrc.ca/bibliographies/cost-sensitive.html
for a survey of learning with misclassification costs
32
Bayesian learning
• incremental, noise-resistant method
• can combine prior knowledge (the knowledge is probabilistic)
• predictions are probabilistic
33
Bayes’ law of conditional probability:

  P(h | D) = P(D | h) * P(h) / P(D)

results in a simple “learning rule”: choose the most likely (Maximum A Posteriori) hypothesis

  hMAP = argmax_{h ∈ H} P(D|h) * P(h)

Example:
Two hypotheses:
(1) the patient has cancer
(2) the patient is healthy
34
Priors: 0.8% of the population has cancer;
the test is 98% reliable: it returns positive in 98% of cases when the disease is present, and returns negative in 97% of cases when the disease is actually absent.

P(cancer) = .008          P(not cancer) = .992
P(+|cancer) = .98         P(-|cancer) = .02
P(+|not cancer) = .03     P(-|not cancer) = .97

We observe a new patient with a positive test. How should they be diagnosed?

P(cancer|+) ∝ P(+|cancer)P(cancer) = .98*.008 = .0078
P(not cancer|+) ∝ P(+|not cancer)P(not cancer) = .03*.992 = .0298
35
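The two numbers above are unnormalized; dividing each by their sum gives the actual posteriors, so even with a positive test the MAP diagnosis is “not cancer”. A quick check:

p_pos_cancer, p_cancer = 0.98, 0.008
p_pos_healthy, p_healthy = 0.03, 0.992

joint_cancer  = p_pos_cancer * p_cancer        # 0.0078
joint_healthy = p_pos_healthy * p_healthy      # 0.0298

evidence = joint_cancer + joint_healthy        # P(+)
print(joint_cancer / evidence)                 # P(cancer | +)     ~ 0.21
print(joint_healthy / evidence)                # P(not cancer | +) ~ 0.79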
Minimum Description Length
revisiting the def. of hMAP:

  hMAP = argmax_{h ∈ H} P(D|h) P(h)

we can rewrite it as:

  hMAP = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]

or

  hMAP = argmin_{h ∈ H} [ - log2 P(D|h) - log2 P(h) ]

But the first log is the cost of coding the data given the theory, and the second - the cost of coding the theory
36
Observe that:
for the data, we only need to code the exceptions; the others are correctly predicted by the theory
the MAP principle tells us to choose the theory which encodes the data in the shortest manner
the MDL principle states the trade-off between the complexity of the hypothesis and the number of errors
37
Bayes optimal classifier
• so far, we were looking at the “most probable hypothesis, given a priori probabilities”. But we really want the most probable classification
• this we can get by combining the predictions of all hypotheses, weighted by their posterior probabilities:

  P(vj | D) = Σ_{hi} P(vj | hi) P(hi | D)

• this is the Bayes optimal classifier (BOC):

  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)

Example: hypotheses h1, h2, h3 with posterior probabilities .4, .3, .3
A new instance is classified positive by h1 and negative by h2, h3
38
Bayes optimal classifier
V = {+, -}
P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
…
P(+|D) = Σi P(+|hi) P(hi|D) = .4,  P(-|D) = Σi P(-|hi) P(hi|D) = .6
so the classification is ”–”
39
• Bayesian belief networks capture probability dependencies
• each node has a probability distribution: the task is to determine the joint probability on the data
• In an application, the model structure is designed manually and the forms of the probability distributions are given
• The training set is used to fit the model to the data
• Then probabilistic inference can be carried out, e.g. for prediction
The first five variables are observed, and the model is used to predict diabetes:
P(A, N, M, I, G, D) = P(A)*P(N)*P(M|A, N)*P(D|M, A, N)*P(I|D)*P(G|I, D)
40
• how do we specify the probability distributions?
• discretize the variables and represent each probability distribution as a table
• the tables can be approximated from frequencies, e.g. the table for P(M|A, N) requires 24 parameters
• For prediction, we want P(D|A, N, M, I, G): we need a large table to do that
41
42
• no other classifier using the same hypothesis space and prior knowledge can outperform the BOC
• the BOC has mostly a theoretical interest; practically, we will not have the required probabilities
• another approach, the Naive Bayes Classifier (NBC):

  vMAP = argmax_{vj ∈ V} P(vj | a1,…,an)
       = argmax_{vj ∈ V} P(a1,…,an | vj) P(vj) / P(a1,…,an)
       = argmax_{vj ∈ V} P(a1,…,an | vj) P(vj)

To estimate this, we need (# of possible values) * (# of possible instances) examples.
Under a simplifying assumption of independence of the attribute values given the class value:

  vNB = argmax_{vj ∈ V} P(vj) Πi P(ai | vj)
43
44
• in NBC, the conditional probabilities
are estimated from training data
simply as normalized frequencies:
how many times a given attribute
value is associated with a given
class
• no search!
• example
• m-estimate
45
Example (see the Decision Tree section in these notes): we are trying to predict yes or no for Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong

  vNB = argmax_{vj ∈ [yes,no]} P(vj) Πi P(ai | vj)
      = argmax_{vj ∈ [yes,no]} P(vj) P(Outlook=sunny | vj) P(Temperature=cool | vj) P(Humidity=high | vj) P(Wind=strong | vj)

P(yes) = 9/14, P(no) = 5/14
P(Wind=strong|yes) = 3/9, P(Wind=strong|no) = 3/5, etc.
P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053
P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206
so we will predict no (compare to 1R!)
46
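The same computation as a short sketch, with the conditional probabilities written as the count ratios from the weather data:

from math import prod

# class priors and per-class conditional probabilities for the test instance
# Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong
probs = {
    'yes': [9/14, 2/9, 3/9, 3/9, 3/9],   # P(yes), P(sunny|yes), P(cool|yes), P(high|yes), P(strong|yes)
    'no':  [5/14, 3/5, 1/5, 4/5, 3/5],   # P(no),  P(sunny|no),  P(cool|no),  P(high|no),  P(strong|no)
}

scores = {c: prod(p) for c, p in probs.items()}
print(scores)                                   # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))              # 'no'
print(scores['no'] / sum(scores.values()))      # ~0.795, the probability of 'no'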
• Further, we get not only a decision, but also the probability of that decision:
  .0206 / (.0206 + .0053) = .795
• we rely on nc/n for the conditional probability (nc = number of training examples of the class that have the given attribute value, n = number of training examples of the class)
• if the true conditional probability is very small and n is small too, then nc may well be 0; but a zero estimate biases the NBC too strongly
• So: smoothen; see textbook p. 85
• Instead, we will use the estimate
  (nc + m*p) / (n + m)
where p is the prior estimate of the probability and m is the equivalent sample size. If we do not know otherwise, p = 1/k for k values of the attribute; m has the effect of augmenting the number of samples of the class; a large value of m means that the priors p are important wrt the training data when probability estimates are computed, a small value - less important
47
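The m-estimate in code, with the symbols as above (nc, n, p, m); the numbers in the usage lines are made up for illustration:

def m_estimate(nc, n, p, m):
    """Smoothed estimate of P(attribute value | class): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# an attribute value never seen with the class (nc = 0), 5 examples of the class (n = 5),
# uniform prior over k = 3 values (p = 1/3), equivalent sample size m = 3:
print(m_estimate(0, 5, 1/3, 3))   # 0.125 instead of 0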
Text Categorization
• Representations of text are very high dimensional (one
feature for each word).
• High-bias algorithms that prevent overfitting in high-dimensional space are best.
• For most text categorization tasks, there are many
irrelevant and many relevant features.
• Methods that sum evidence from many or all features
(e.g. naïve Bayes, KNN, neural-net) tend to work better
than ones that try to isolate just a few relevant features
(decision-tree or rule induction).
48
Naïve Bayes for Text
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a vocabulary V =
{w1, w2,…wm} based on the probabilities P(wj | ci).
• Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in each category
exactly once.
49
Text Naïve Bayes Algorithm
(Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
  Let Di be the subset of documents in D in category ci
  P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di
  Let ni be the total number of word occurrences in Ti
  For each word wj ∈ V
    Let nij be the number of occurrences of wj in Ti
    Let P(wj | ci) = (nij + 1) / (ni + |V|)
50
Text Naïve Bayes Algorithm
(Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:

  argmax_{ci ∈ C} P(ci) Π_{i=1..n} P(ai | ci)

where ai is the word occurring in the ith position in X
51
Naïve Bayes Time Complexity
• Training Time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  – Generally just O(|D| Ld) since usually |C||V| < |D| Ld
• Test Time: O(|C| Lt)
where Lt is the
average length of a test document.
• Very efficient overall, linearly proportional to the time needed
to just read in all the data.
• Similar to Rocchio time complexity.
52
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
53
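A sketch of the log-space decision rule for naive Bayes; class_priors and cond_probs are assumed to be dictionaries of already-estimated probabilities:

from math import log

def nb_classify_log(words, class_priors, cond_probs):
    """Return the class with the highest sum of log probabilities.
    cond_probs[c][w] is P(w | c); summing logs avoids floating-point underflow."""
    best_class, best_score = None, float('-inf')
    for c, prior in class_priors.items():
        score = log(prior) + sum(log(cond_probs[c][w]) for w in words if w in cond_probs[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class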
Naïve Bayes Posterior
Probabilities
• Classification results of naïve Bayes (the class
with maximum posterior probability) are usually
fairly accurate.
• However, due to the inadequacy of the
conditional independence assumption, the actual
posterior-probability numerical estimates are not.
– Output probabilities are generally very close to 0 or 1.
54
Textual Similarity Metrics
• Measuring similarity of two texts is a well-studied
problem.
• Standard metrics are based on a “bag of words”
model of a document that ignores word order and
syntactic structure.
• May involve removing common “stop words” and
stemming to reduce words to their root form.
• Vector-space model from Information Retrieval (IR) is
the standard approach.
• Other metrics (e.g. edit-distance) are also used.
55
The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call
them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
  dj = (w1j, w2j, …, wtj)
56
Graphic Representation
Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q drawn as vectors in the 3-dimensional term space T1, T2, T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
57
Document Collection
• A collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document or it simply doesn’t exist in the document.
       T1    T2    …    Tt
D1     w11   w21   …    wt1
D2     w12   w22   …    wt2
:      :     :          :
Dn     w1n   w2n   …    wtn
58
Term Weights: Term
Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing
by the frequency of the most common term in the
document:
tfij = fij / maxi{fij}
59
Term Weights: Inverse Document
Frequency
• Terms that appear in many different documents are less
indicative of overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
60
TF-IDF Weighting
• A typical combined term importance indicator is tf-idf
weighting:
wij = tfij idfi = tfij log2 (N/ dfi)
• A term occurring frequently in the document but rarely
in the rest of the collection is given high weight.
• Many other ways of determining term weights have
been proposed.
• Experimentally, tf-idf has been found to work well.
61
Computing TF-IDF -- An
Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
62
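The example can be checked in a few lines. Note that the slide's numbers come out with the natural log; with log2, as in the idf definition two slides back, all weights are scaled by a constant factor, which does not change the ranking:

from math import log

N = 10_000                               # documents in the collection
freqs = {'A': 3, 'B': 2, 'C': 1}         # term frequencies in the document
df    = {'A': 50, 'B': 1300, 'C': 250}   # document frequencies

max_f = max(freqs.values())
for term, f in freqs.items():
    tf  = f / max_f                      # normalized term frequency
    idf = log(N / df[term])              # natural log reproduces 5.3 / 2.0 / 3.7
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0 5.3 5.3   B 0.67 2.0 1.4 (the slide's 1.3 multiplies the rounded idf)   C 0.33 3.7 1.2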
Similarity Measure
• A similarity measure is a function that computes the degree
of similarity between two vectors.
• Using a similarity measure between the query and each
document:
– It is possible to rank the retrieved documents in the order of presumed
relevance.
– It is possible to enforce a certain threshold so that the size of the
retrieved set can be controlled.
63
Similarity Measure - Inner Product
• Similarity between vectors for the document dj and query q can be computed as the vector inner product:

  sim(dj, q) = dj • q = Σ_{i=1..t} wij * wiq

  where wij is the weight of term i in document j and wiq is the weight of term i in the query
• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
64
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of
unique terms.
• Measures how many terms matched but not how
many terms are not matched.
65
Inner Product -- Examples
Binary:
  D = (1, 1, 1, 0, 1, 1, 0)
  Q = (1, 0, 1, 0, 0, 1, 1)
  sim(D, Q) = 3
  Size of vector = size of vocabulary = 7
  0 means corresponding term not found in document or query

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q  = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
66
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:

  CosSim(dj, q) = (dj • q) / (|dj| |q|) = Σ_{i=1..t} wij wiq / ( sqrt(Σ_{i=1..t} wij²) * sqrt(Σ_{i=1..t} wiq²) )

[Figure: D1, D2 and Q in the term space t1, t2, t3; the angle between each document vector and Q determines the similarity]

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q  = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
67
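The two worked examples, as a small sketch:

from math import sqrt

def cos_sim(d, q):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (sqrt(sum(x * x for x in d)) * sqrt(sum(y * y for y in q)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13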
K Nearest Neighbor for Text
Training:
For each training example <x, c(x)> ∈ D
  Compute the corresponding TF-IDF vector, dx, for document x
Test instance y:
  Compute TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let sx = cosSim(d, dx)
  Sort examples, x, in D by decreasing value of sx
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of examples in N
68
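A direct transcription of the algorithm as a sketch; tfidf_vector is an assumed helper that maps a document to its weight vector, and cos_sim is the cosine function shown earlier:

from collections import Counter

def knn_classify(test_doc, training, k, tfidf_vector, cos_sim):
    """training is a list of (document, class) pairs; returns the majority
    class among the k most cosine-similar training documents."""
    d = tfidf_vector(test_doc)
    scored = [(cos_sim(d, tfidf_vector(x)), c) for x, c in training]
    scored.sort(key=lambda sc: sc[0], reverse=True)     # most similar first
    neighbours = [c for _, c in scored[:k]]
    return Counter(neighbours).most_common(1)[0][0]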
Illustration of 3 Nearest Neighbor for
Text
69
3 Nearest Neighbor
Comparison
• Nearest Neighbor tends to handle
polymorphic categories better.
70
Nearest Neighbor Time
Complexity
• Training Time: O(|D| Ld) to compose
TF-IDF
vectors.
• Testing Time: O(Lt + |D||Vt|) to compare to all
training vectors.
– Assumes lengths of dx vectors are computed and stored during
training, allowing cosSim(d, dx) to be computed in time proportional
to the number of non-zero entries in d (i.e. |Vt|)
• Testing time can be high for large training sets.
71
Nearest Neighbor with Inverted
Index
• Determining k nearest neighbors is the same as
determining the k best retrievals using the test
document as a query to a database of training
documents.
• Use standard VSR inverted index methods to find the
k nearest neighbors.
• Testing Time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears.
• Therefore, overall classification time is O(Lt + B|Vt|)
– Typically B << |D|
72
Support Vector Machines (SVM)
• a new classifier
• Attractive because
– Has sound mathematical foundations
– Performs very well in diverse and difficult applications
– See textbook (ch. 6.3) and papers by Scholkopf placed
on the class website
73
Review of basic analytical
geometry
• Dot product of vectors: by coordinates and via the angle between them
• If vectors a, b are perpendicular, then (a • b) = 0 (e.g. (0, c) • (d, 0) = 0)
• A hyperplane in an n-dimensional space is the set {x | (w • x) + b = 0}; w is the weight vector, b is the threshold; x = (x1, …, xn); w = (w1, …, wn)
• A hyperplane divides the n-dimensional space into two subspaces: one is {x | (w • x) + b > 0}, the other is the complementary {x | (w • x) + b < 0}
74
• Let's revisit the general classification problem.
• We want to estimate an unknown function f; all we know about it is the training set (x1,y1),…,(xn,yn)
• The objective is to minimize the expected error (risk)

  R[f] = ∫ l(f(x), y) dP(x, y)

  where l is a loss function, e.g. l(f(x), y) = θ(-y f(x)), and θ(z) = 0 for z < 0 and θ(z) = 1 otherwise
• Since we do not know P, we cannot measure the risk
• We want to approximate the true error (risk) by the empirical error (risk):

  Remp[f] = (1/n) Σ_{i=1..n} l(f(xi), yi)
75
• We know from PAC theory that conditions can be given on the learning task so that the empirical risk converges towards the true risk
• We also know that the difficulty of the learning task depends on the complexity of f (VC dimension)
• The following relationship between the true risk, the empirical risk and the complexity of the language (h denotes the VC dimension of the class of f) holds with probability at least 1-δ for n > h:

  R[f] ≤ Remp[f] + sqrt( ( h (ln(2n/h) + 1) - ln(δ/4) ) / n )
76
SRM
Structural Risk Minimization (SRM) chooses the class of functions F to find a balance between the simplicity of f (a very simple f may result in a large empirical risk) and the empirical risk (a small empirical risk may require a function class with a large h)
77
Points lying on the margin are called support vectors;
w can be constructed efficiently – quadratic optimization problem.
78
Basic idea of SVM
• Linearly separable problems are easy (quadratic optimization), but of course most problems are not linearly separable
• Take any problem and transform it into a high-dimensional space, so that it becomes linearly separable, but
• Calculations to obtain the separating hyperplane can be done in the original input space (the kernel trick)
79
Basic idea of SVM
80
The original data is mapped into another dot product space called the feature space F via a non-linear map

  Φ : R^N → F

Then a linear separating classifier is learned in F.
Note that the only operations needed in F are dot products:

  k(x, y) = (Φ(x) • Φ(y))

Consider e.g.

  Φ : R² → R³,  (x1, x2) ↦ (z1, z2, z3) = (x1², √2 x1 x2, x2²)
81
Lets see that  geometrically, and that it does what we want it to do :
transform a hard classification problem into an easy one, albeit in a
higher dimension
82
But in general, quadratic optimization in the feature space could be very expensive.
Consider classifying 16 x 16 pixel pictures, and 5th order monomials.
The feature space dimension in this example is

  O( C(256, 5) ) ≈ 10^10
83
Here we show that the transformation from an ellipsoidal decision surface to a linear one, requiring a dot product in the feature space, can be performed by a kernel function in the input space:

  Φ(x) • Φ(y) = (x1², √2 x1 x2, x2²) • (y1², √2 y1 y2, y2²)
              = ((x1, x2) • (y1, y2))² = (x • y)² =: k(x, y)

in general, k(x, y) = (x • y)^d can be computed in the input space
kernels replace computation in the feature space by computation in the input space
in fact, the transformation Φ need not be applied at all when a kernel is used!
84
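A quick numeric check of this identity for the quadratic kernel and the feature map Φ from the previous slide (any 2-dimensional x and y will do):

from math import sqrt

def phi(v):
    """Feature map R^2 -> R^3 from the previous slide."""
    x1, x2 = v
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(y)))      # dot product computed in the feature space
print(dot(x, y) ** 2)           # kernel k(x, y) = (x . y)^2 in the input space - same value up to float rounding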
Some common kernels used:
Using different kernels we in fact use different
classifiers in the input space: gaussian,
polynomial, 3-layer neural nets, …
85
Simplest kernel
• Is the linear kernel (w  x) + b
• But this only works if the training set is
linearly separable. This may not be the case
– For the linear kernel, or even
– In the feature space
86
The solution for the non-separable case is to optimize not just the margin, but the margin plus the influence of the training errors ξi:

  min_{w,b,ξ}  (1/2) ||w||² + C Σ_{i=1..n} ξi
87
Classification with SVMs:
• Convert each example x to Φ(x)
• Perform the optimal hyperplane algorithm in F; but since we use the kernel, all we need to do is compute

  f(x) = sgn( Σi αi yi k(x, xi) + b )

  where xi, yi are training instances and the αi are computed as the solution of the quadratic programming problem
88
Examples of classifiers in the input space
89
Overall picture
90
Geometric interpretation of SVM classifier
• Normalize the weight vector to 1 (||w||₂ = 1) and set the threshold b = 0
• The set of all w that separate the training set is {w : yi (w • xi) > 0 for all i, ||w||₂ = 1}
• But this is the Version Space
• VS has a geometric centre (the Bayes Optimal Classifier) near the gravity point
• If VS has a shape in which the SVM solution is far from the VS centre, SVM works poorly
91
92
93
Applications
• Text classification
• Image analysis – face recognition
• Bioinformatics – gene expression
• Can the kernel reflect domain knowledge?
94
SVM cont’d
• A method of choice when examples are represented by vectors or matrices
• Input space cannot always be readily used as the attribute-vector (e.g. too many attrs)
• Kernel methods: map data from input space to feature space (FS); perform learning in FS provided that examples are only used within dot products (the kernel trick: Φ(x) • Φ(x') = k(x, x'))
• SVM but also Perceptron, PCA, NN can be done on that basis
• Collectively – kernel-based methods
• The kernel defines the classifier
• The classifier is independent of the dimensionality of the FS – it can even be infinite (gaussian kernel)
• LIMITATION of SVMs: they only work for two-class problems
• Remedy: use of ECOC
95
Applications – face detection [IEEE INTELLIGENT SYSTEMS]
• The task: to find a rectangle containing a face in an image; applicable in face recognition, surveillance, HCI etc., and also in medical image processing and structural defect detection
• Difficult task – variations that are hard to represent explicitly (hair, moustache, glasses)
• Cast as a classification problem: image regions that are faces and non-faces
• Scanning the image in multiple scales, dividing it into (overlapping) frames and classifying the frames with an SVM:
96
Face detection cont’d
SVM performing face detection – support vectors are faces and non-faces
Examples are 19x19 pixels, class +1 or -1
SVM: 2nd degree polynomial with slack variables
Representation tricks:
  masking out the near-boundary area (361 -> 283 pixels) removes noise
  illumination correction: reduction of light and shadow
  discretization of pixel brightness by histogram equalization
97
Face detection – system
architecture
98
Bootstrapping:
using the
system on
images with
no faces and
storing false
positives to
use as
negative
examples in
later training
99
Performance on 2 test sets:
Set A = 313 high quality
Images with 313 faces, set B=
23 images with 155 faces
This results in >4M frames
for A and >5M frames for B.
SVM achieved recall of 97%
on A and 74% on B, with
4 and 20 false positives, resp.
100
SVM in text classification
• Example of classifiers (the Reuters corpus – 13K stories, 118 categories, time split)
• Essential in document organization (emails!), indexing etc.
• The first comes from a PET; the second from an SVM
• Text representation: BOW: mapping docs to large vectors indicating which word occurs in a doc; as many dimensions as words in the corpus (many more than in a given doc);
• often extended to (normalized) frequencies of stemmed words
101
Text classification
• Still a large number of features, so a stop list is applied, and some form of feature selection (e.g. based on info gain, or tf/idf) is done, down to 300 features
• Then a simple, linear SVM is used (experiments with polynomial and RBF kernels indicated they are not much better than linear). A one-against-all scheme is used
• What is a polynomial (e.g. degree 2) kernel representing in text classification?
• Performance measured with the micro-averaged break-even point (explain)
• SVM obtained the best results, with DT second (on 10 categories) and Bayes third. Other authors report better IB performance (findSim) than here
102
A ROC curve for the above experiments (class = “grain”)
The ROC is obtained by varying the threshold; the threshold is learned from the values of x • w and discriminates between the classes
How about another representation?
• N-grams = sequences of N consecutive characters, e.g. the 3-grams of ‘support vector’ are sup, upp, ppo, por, …, tor
• Language-independent, but a large number of features (>> |words|)
• The more substrings in common between 2 docs, the more similar the 2 docs are
• What if we make these substrings non-contiguous? With a weight measuring non-contiguity? car – custard
• We will make each substring a feature, with a value depending on how frequently and how compactly the substring occurs in the text
104
• The latter is represented by a decay factor λ
• Example: cat, car, bat, bar
  Unnormalized K(car,cat) = λ⁴, K(car,car) = K(cat,cat) = 2λ⁴ + λ⁶;
  normalized K(car,cat) = λ⁴ / (2λ⁴ + λ⁶) = 1 / (2 + λ²)
• Impractical (too many features) for larger substrings and docs, but the kernel using such features can be calculated efficiently (the ‘substring kernel’ SSK) – it maps strings (a whole doc) to a feature vector indexed by all k-tuples
• Value of the feature = sum over the occurrences of the k-tuple of a decay factor of the length of the occurrence
• Def of SSK: Σ is an alphabet; a string is a finite sequence of elements of Σ. |s| = length of s; s[i:j] = substring of s. u is a subsequence of s if there exist indices i = (i1,…,i|u|) with 1 ≤ i1 < … < i|u| ≤ |s| such that uj = s_{ij} for j = 1,…,|u| (u = s[i] for short).
• The length l(i) of the subsequence in s is i|u| - i1 + 1 (its span in s)
• The feature space mapping φ for s is defined by

  φu(s) = Σ_{i: u = s[i]} λ^{l(i)}

  for each u ∈ Σⁿ (the set of all finite strings of length n): features measure the number of occurrences of subsequences in s, weighted by their lengths (λ ≤ 1)
• The kernel can be evaluated in O(n|s||t|) time (see the Lodhi paper)
106
Experimental results with SSK
• The method is NOT fast, so a subset of Reuters (n = 470/380) was used, and only 4 classes: corn, crude, earn, acquisition
• Compared to the BOW representation (see earlier in these notes) with stop words removed, features weighted by tf/idf = log(1+tf)*log(n/df)
• F1 was used for the evaluation, C set experimentally
• Best k is between 4 and 7
• Performance comparable to a classifier based on k-grams (contiguous), and also BOW
• λ controls the penalty for gaps in substrings: best precision for high λ = 0.7. This seems to result in a high similarity score for docs that share the same but semantically different words - WHY?
• Results on full Reuters are not as good as with BOW and k-grams; the conjecture is that the kernel performs something similar to stemming, which is less important on large datasets where there is enough data to learn the ‘sameness’ of different inflections
107
Bioinformatics application
• Coding sequences in DNA encode proteins.
• DNA alphabet: A, C, G, T. Codon = triplet of adjacent nucleotides, codes for one aminoacid.
• Task: identify where in the genome the coding starts (Translation Initiation Sites). The potential start codon is ATG.
• Classification task: does a sequence window around the ATG indicate a TIS?
• Each nucleotide is encoded by 5 bits, exactly one of which is set to 1, indicating whether the nucleotide is A, C, G, T, or unknown. So the dimension n of the input space = 1000 for a window size of 100 to the left and right of the ATG sequence.
• Positive and negative windows are provided as the training set
• This representation is typical for the kind of problem where SVMs do well
108
• What is a good feature space for this problem? How about including in the kernel some prior – domain – knowledge? E.g.:
• Dependencies between distant positions are not important or are known not to exist
• Compare, at each sequence position, two sequences locally in a window of size 2l+1 around that position, with decreasing weight away from the centre of the window:

  winp(x, y) = ( Σ_{j=-l..+l} pj match_{p+j}(x, y) )^{d1}

  where d1 is the order of importance of local (within the window) correlations, and match_{p+j}(x, y) is 1 for matching nucleotides at position p+j, 0 otherwise
109
• Window scores are summed over the length l of the sequence, and correlations between up to d2 windows are taken into account:

  k(x, y) = ( Σ_{p=1..l} winp(x, y) )^{d2}

• It is also known that the region downstream of the TIS is a coding sequence (CDS), and a CDS shifted by 3 nucleotides is still a CDS
• Trained with 8000 patterns and tested with 3000
110
Results
111
Further results on UCI benchmarks
112
Ensembles of learners
• not a learning technique on its own, but a method in which a family of “weakly” learning agents (simple learners) is used for learning
• based on the fact that multiple classifiers that disagree with one another can together be more accurate than their component classifiers
• if there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 hypotheses
113
Boosting as ensemble of
learners
• The very idea: focus on ‘difficult’ parts of the example space
• Train a number of classifiers
• Combine their decisions in a weighted manner
114
115
• Bagging [Breiman] is to learn multiple hypotheses from different subsets of the training set, and then take a majority vote. Each sample is drawn randomly with replacement (a bootstrap). Each bootstrap contains, on avg., 63.2% of the training set
• Boosting is a refinement of bagging, where the sample is drawn according to a distribution, and that distribution emphasizes the misclassified examples. Then a vote is taken.
116
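A minimal sketch of bagging under these definitions; learn is an assumed callable that returns a classifier h with h(x) giving a label. Boosting would differ by reweighting the training distribution towards misclassified examples after each round instead of sampling uniformly:

import random
from collections import Counter

def bagging(train, L, learn, seed=0):
    """Learn L hypotheses, each from a bootstrap sample of the training set,
    and return a classifier that takes the majority vote."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(L):
        bootstrap = [rng.choice(train) for _ in train]   # sample n times with replacement
        hypotheses.append(learn(bootstrap))
    def vote(x):
        return Counter(h(x) for h in hypotheses).most_common(1)[0][0]
    return vote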
117
• Let’s make sure we understand the makeup of
the final classifier
118
• AdaBoost (Adaptive Boosting) uses the
probability distribution. Either the learning
algorithm uses it directly, or the distribution is
used to produce the sample.
• See
http://www.research.att.com/~yoav/adaboost/index.html
for a Web-based demo.
119
120
Intro. to bioinformatics
• Bioinformatics = collection, archiving, organization and interpretation of biological data
• integrates in vitro, in vivo and in silico approaches
• Requires understanding of basic genetics
• Based on
  – genomics,
  – proteomics,
  – transcriptomics
121
What is Bioinformatics?
• Bioinformatics is about integrating biological themes together with the help of computer tools and biological databases.
• It is a “new” field of science where mathematics, computer science and biology combine to study and interpret genomic and proteomic information
122
Basic biology
• Information in biology: DNA
• Genotype (hereditary make-up of an organism) and
phenotype (physical/behavioral characteristics) (late
19th century)
• Biochemical structure of DNA – double helix – 1953;
nucleotides A, C, G, T
• Progress in biology and IT made it possible to map entire genomes: the total genetic material of a species written in the DNA code
• For a human, 3×10⁹ nucleotides long
• The same in all the cells of a person
124
What is a gene?
125
• Interesting to see if there are genes
(functional elements of the genome)
responsible for some aspects of the
phenotype (e.g. an illness)
– Testing
– Cure
• Genes result in proteins: Gene → RNA (transcription) → protein
126
What is gene expression?
127
• We say that genes code for proteins
• In simple organisms (prokaryotes), a high percentage of the genome are genes (85%)
• In eukaryotes this drops: yeast 70%, fruit fly 25%, flowers 3%
• Databases with gene information: GenBank/DDBJ, EMBL
• Databases with protein information: SwissProt, GenPept, TrEMBL, PIR…
128
• Natural interest to find repetitive and/or common subsequences in the genome: BLAST
• For this, it is interesting to study gene expression (clustering):
[Figure: expression levels of Gene X and Gene Y over time; Y’s curve follows X’s with a delay deltaX – Y is activated by X]
• Activation + and Inhibition –
129
Microarrays
• Microarrays give us information about the rate of protein production of genes during an experiment. These technologies give us a lot of information.
• Analyzing microarray data tells us how the protein production of the genes evolves.
• Each data point represents the log expression ratio of a particular gene under two different experimental conditions. The numerator of each ratio is the expression level of the gene in the varying condition, whereas the denominator is the expression level of the gene in some reference condition. The expression measurement is positive if the gene expression is induced with respect to the reference state and negative if it is repressed. We use those values as derivatives.
130
Microarrays
[Figure: microarray workflow – labels: Kernel, Stranded DNA to analyze, Cell, Synthesized DNA strand, DNA, Micro array, Test-tube with a solution]
131
Microarray technology
132
Scanning
133
Scanning (cont’d)
134
Scanning (cont’d)
135
Hybridization simulation
136
9. Data mining
• definition
• basic concepts
• applications
• challenges
137
Definition - Data Mining
• extraction of [unknown] patterns from data
• combines methods from:
– databases
– machine learning
– visualization
• involves large datasets
138
Definition - KDD
• Knowledge Discovery in Databases
• consists of:
– selection
– preprocessing
– transformation
– Data Mining
– Interpretation/Evaluation
• no explicit req’t of large datasets
139
Model building
• Need to normalize data
• data labelling – replacement of the starter
• use of other data sources to label
• linear regression models on STA
• contextual setting of the time interval
140
Associations
Given:
I = {i1,…, im} a set of items
D a set of transactions (a database); each transaction is a set of items T ∈ 2^I
Association rule: X → Y, X ⊂ I, Y ⊂ I, X ∩ Y = ∅
confidence c: ratio of the # of transactions that contain both X and Y to the # of all transactions that contain X
support s: ratio of the # of transactions that contain both X and Y to the # of transactions in D
An itemset is frequent if its support > σ
141
An association rule A  B is a conditional implication
among itemsets A and B, where A  I, B  I and A  B
= .
The confidence of an association rule r: A  B is the
conditional probability that a transaction contains B,
given that it contains A.
The support of rule r is defined as: sup(r) = sup(AB).
The confidence of rule r can be expressed as conf(r) =
sup(AB)/sup(A).
Formally, let A 2I; sup(A)= |{t: t  D, A  t}|/|D|, if R= AB
then sup(R) = SUP(AB), conf(R)= sup(A  B)/sup(A)
142
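The definitions translate directly into code (a sketch; the baskets below are made-up transactions):

def support(itemset, transactions):
    """sup(A) = fraction of transactions that contain all items of A."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(A, B, transactions):
    """conf(A -> B) = sup(A u B) / sup(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

baskets = [{'bread', 'milk'}, {'bread', 'beer'}, {'bread', 'milk', 'beer'}, {'milk'}]
print(support({'bread', 'milk'}, baskets))        # 0.5
print(confidence({'bread'}, {'milk'}, baskets))   # 2/3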
Associations - mining
Given D, generate all assoc rules with c, s > thresholds
minc, mins
(items are ordered, e.g. by barcode)
Idea:
find all itemsets that have transaction support > mins :
large itemsets
143
Associations - mining
to do that: start with individual items with large support
in each next step, k:
• use the itemsets from step k-1 to generate the new candidate itemsets Ck,
• count the support of Ck (by counting the candidates which are contained in any transaction t),
• prune the ones that are not large
144
Associations - mining
Only keep those that
are contained in some
transaction
145
Candidate generation
Ck = apriori-gen(Lk-1)
146
• From large itemsets to association rules
147
Subset function
Subset(Ck, t) checks if an itemset Ck is in a transaction t
It is done via a tree structure through a series of hashing:
Hash C on every item in t: itemsets not
containing anything from t are ignored
If you got here by hashing item i of t, hash
on all following items of t
set of itemsets
set of itemsets
Check if itemset contained in this leaf
148
Example
L3={{1 2 3}, {1 2 4},{1 3 4},{1 3 5},{2 3 4}}
C4={{1 2 3 4} {1 3 4 5}}
pruning deletes {1 3 4 5} because {1 4 5} is not in L3.
See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details
149
DM: result evaluation
• Accuracy
• ROC
• lift curves
• cost
• but also INTERPRETABILITY
150
Feature Selection [sec. 7.1 in Witten, Frank]
• Attribute-vector representation: the coordinates of the vector are referred to as attributes or features
• ‘curse of dimensionality’ – learning is search, and the search space increases drastically with the number of attributes
• Theoretical justification: we know from the PAC theorems that this increase is exponential – [discuss; e.g. slide 70]
• Practical justification: with divide-and-conquer algorithms the partition sizes decrease and at some point irrelevant attributes may be selected
• The task: find a subset of the original attribute set such that the classifier will perform at least as well on this subset as on the original set of attributes
Some foundations
• We are in the classification setting: the Xi are the attributes and Y is the class. We can define relevance of features wrt the Optimal Bayes Classifier (OBC)
• Let S be a subset of features, and X a feature not in S
• X is strongly relevant if removal of X alone deteriorates the performance of the OBC
• Xi is weakly relevant if it is not strongly relevant AND the performance of the OBC on S ∪ {X} is better than on S
152
three main approaches
• Manually: often unfeasible
• Filters: use the data alone, independent of
the classifier that will be used on this data
(aka scheme-independent selection)
• Wrappers: the FS process is wrapped in the
classifier that will be used on the data
153
Filters - discussion
• Find the smallest attribute set in which all the
instances are distinct. Problem: cost if exhaustive
search used
• But learning and FS are related: in a way, the classifier already includes the ‘good’ (separating) attributes. Hence the idea:
• Use one classifier for FS, then another on the results. E.g. use a DT, and pass the data on to NB. Or use 1R for a DT.
154
Filters cont’d: RELIEF [Kira, Rendell]
1. Initialize the weights of all attrs to 0
2. Sample instances and check the similar ones.
3. Determine pairs which are in the same class (near hits) and in different classes (near misses).
4. For each hit, identify the attributes with different values. Decrease their weights.
5. For each miss, attributes with different values have their weights increased.
6. Repeat the sample selection and weighing (2-5) many times.
7. Keep only the attrs with positive weight.
Discussion: high variance unless the # of samples is very high
Deterministic RELIEF: use all instances and all hits and misses
155
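A sketch of the sampling version of RELIEF described above, for examples given as (attribute-tuple, class) pairs and a supplied distance function; the near hit and near miss are taken as the closest same-class and different-class instances to the sampled one:

import random

def relief(examples, n_samples, distance, seed=0):
    """Return the attribute indices that end up with positive weight."""
    rng = random.Random(seed)
    n_attrs = len(examples[0][0])
    w = [0.0] * n_attrs
    for _ in range(n_samples):
        x, cls = rng.choice(examples)
        same  = [(v, c) for v, c in examples if c == cls and v != x]
        other = [(v, c) for v, c in examples if c != cls]
        near_hit  = min(same,  key=lambda e: distance(x, e[0]))[0]
        near_miss = min(other, key=lambda e: distance(x, e[0]))[0]
        for a in range(n_attrs):
            if x[a] != near_hit[a]:     # differs on a near hit -> attribute penalized
                w[a] -= 1.0
            if x[a] != near_miss[a]:    # differs on a near miss -> attribute rewarded
                w[a] += 1.0
    return [a for a in range(n_attrs) if w[a] > 0]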
A different approach
• View attribute selection as a search in the space of all
attributes
• Search needs to be driven by some heuristic
(evaluation criterion)
• This could be some measure of the discrimination
ability of the result of search, or
• Cross-validation, on the part of the training set put
aside for that purpose. This means that the classifier is
wrapped in the FS process, hence the name wrapper
(scheme-specific selection)
156
Greedy search example
• A single attribute is added (forward) or deleted (backward)
• Could also be done as best-first search or beam search, or some randomized (e.g. genetic) search
157
Wrappers
• Computationally expensive (k-fold x-val at each search step)
• backward selection often yields better accuracy
  – x-val is just an optimistic estimate that may stop the search prematurely
  – in backward mode attr sets will be larger than optimal
  – forward mode may result in better comprehensibility
• Experimentally FS does particularly well with NB on data on which NB does not do well
  – NB is sensitive to redundant and dependent (!) attributes
  – Forward selection with training-set performance does well [Langley and Sage 94]
158
Discretization
• Getting away from numerical attrs
• We know it from DTs, where numerical attributes were sorted and splitting points between each two values were considered
• Global (independent of the classifier) and local (different results in each tree node) schemes exist
• The result of discretization is a value of a nominal attribute
• Ordering information can be used if the discretized attribute with k values is converted into k-1 binary attributes – the (i-1)th attribute = true represents the fact that the value is <= i
• Supervised and unsupervised discretization
Supervised and unsupervised discretization
159
Unsupervised discretization
• Fixed-length intervals (equal interval binning): e.g. (max-min)/k
  – How do we know k?
  – May distribute instances unevenly
• Variable-length intervals, each containing the same number of instances (equal frequency binning, or histogram equalization)
160
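Both schemes in a few lines (a sketch returning the k-1 cut points); the example values are e.g. the numeric Temperature values of the weather data:

def equal_interval_cuts(values, k):
    """k fixed-length intervals between min and max: k-1 cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """k intervals each containing (roughly) the same number of instances."""
    v = sorted(values)
    return [v[len(v) * i // k] for i in range(1, k)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_interval_cuts(temps, 4))    # [69.25, 74.5, 79.75]
print(equal_frequency_cuts(temps, 4))   # roughly the 25/50/75 percentiles: [69, 72, 80]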
Supervised discretization
Example: the Temperature attribute in the play/don’t play data
• A recursive algorithm using the information measure. We go for the cut point with the lowest information (cleanest subsets)
161
Supervised discretization cont’d
• What’s the right stopping criterion?
• How about MDL? Compare the info needed to transmit the label of each instance before the split with the cost of the split: the split point in log2(N-1) bits, + the info for the points below and the info for the points above.
• Each instance costs 1 bit before the split, and slightly > 0 bits after the split
• This is the Fayyad & Irani (1993) method
162
Error-correcting Output Codes (ECOC)
• Method of combining classifiers for a two-class problem into a classifier for a k-class problem
• Often when working with a k-class problem, k one-against-all classifiers are learned and then combined using ECOC
• Consider a 4-class problem, and suppose that there are 7 classifiers and the classes are coded as follows:

  class   col. 1 (one-per-class)   col. 2 (error-correcting)
  a       1000                     1111111
  b       0100                     0000111
  c       0010                     0011001
  d       0001                     0101010

• Suppose an instance of class a is classified as 1011111 (a mistake in the 2nd classifier).
• But this classification is still the closest to class a in terms of edit (Hamming) distance. Also note that the class encodings in col. 1 are not error correcting
163
ECOC cont’d
• What makes an encoding error-correcting?
• Depends on the distance between encodings: an encoding with
distance d between encodings may correct up to (d-1)/2 errors
(why?)
• In col. 1, d=2, so this encoding may correct 0 errors
• In col. 2, d=4, so single-bit errors will be corrected
• This example describes row separation; there must also be
column separation (=1 in col. 2); otherwise two classifiers would
make identical errors; this weakens error correction
• For a small number of classes, exhaustive codes as in col. 2 are
used
• See the Dietterich paper on how to design good error-correcting
codes
• Gives good results in practice, eg with SVM, decision trees,
backprop NNs
165
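Decoding in code: a predicted bit string is assigned to the class whose codeword is closest in Hamming distance. A sketch using the col. 2 code from the table above:

codes = {'a': '1111111', 'b': '0000111', 'c': '0011001', 'd': '0101010'}

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def ecoc_decode(bits, codes):
    """Return the class whose codeword has the smallest Hamming distance to bits."""
    return min(codes, key=lambda c: hamming(bits, codes[c]))

print(ecoc_decode('1011111', codes))   # 'a' - the single-bit error is corrected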
What is ILP?
• Machine learning when instances, results and background knowledge are represented in First Order Logic, rather than in an attribute-value representation:
• Given E+, E-, BK
• Find h such that h ⊨ E+ and h ⊭ E-
166
E+:
boss(mary,john). boss(phil,mary).boss(phil,john).
E-:
boss(john,mary). boss(mary,phil). boss(john,phil).
BK:
employee(john, ibm). employee(mary,ibm). employee(phil,ibm).
reports_to(john,mary). reports_to(mary,phil). reports_to(john,phil).
h: boss(X,Y) :- employee(X,O), employee(Y,O), reports_to(Y,X).
167
Historical justification of the name:
• From facts and BK, induce a FOL hypothesis
(hypothesis in Logic Programming)
[Figure: Examples and Background knowledge (in PROLOG) go into the learner, which outputs Hypotheses (rules), also in PROLOG]
168
Why ILP? - practically
• Constraints of “classical” machine learning:
attribute-value (AV) representation
• instances are represented as rows in a single
table, or must be combined into such table
• This is not the way data comes from
databases
169
PARTICIPANT Table
NAME    JOB         COMPANY  PARTY  R_NUMBER
adams   researcher  scuf     no     23
blake   president   jvt      yes    5
king    manager     ucro     no     78
miller  manager     jvt      yes    14
scott   researcher  scuf     yes    94
turner  researcher  ucro     no     81

SUBSCRIPTION Table
NAME    COURSE
adams   erm
adams   so2
adams   srw
blake   cso
blake   erm
king    cso
king    erm
king    so2
king    srw
miller  so2
scott   erm
scott   srw
turner  so2
turner  srw

COURSE Table
COURSE  LENGTH  TYPE
cso     2       introductory
erm     3       introductory
so2     4       introductory
srw     3       advanced

COMPANY Table
COMPANY  TYPE
jvt      commercial
scuf     university
ucro     university
170
From tables to models to examples and background knowledge

begin(model(adams)).
participant(researcher,scuf,no,23).
subscription(erm).
subscription(so2).
subscription(srw).
end(model(adams)).

begin(model(blake)).
participant(president,jvt,yes,5).
subscription(cso).
subscription(erm).
end(model(blake)).

begin(model(king)).
participant(manager,ucro,no,78).
subscription(cso).
subscription(erm).
subscription(so2).
subscription(srw).
end(model(king)).

begin(model(miller)).
participant(manager,jvt,yes,14).
subscription(so2).
end(model(miller)).

begin(model(scott)).
participant(researcher,scuf,yes,94).
subscription(erm).
subscription(srw).
end(model(scott)).

begin(model(turner)).
participant(researcher,ucro,no,81).
subscription(so2).
subscription(srw).
end(model(turner)).

begin(background).
company(jvt,commercial).
company(scuf,university).
company(ucro,university).
course(cso,2,introductory).
course(erm,3,introductory).
course(so2,4,introductory).
course(srw,3,advanced).
job(_J):-participant(_J,_,_,_).
party(_P):-participant(_,_,_P,_).
company_type(_T):-participant(_,_C,_,_),company(_C,_T).
course_len(_C,_L):-course(_C,_L,_).
course_type(_C,_T):-course(_C,_,_T).
…
end(background).
Results of learning (in Prolog)
party(yes):-participant(_J, senior, _C)
party(yes):-participant(president,_S,_C)
171
Why ILP - theoretically
• AV – all examples are the same length
• no recursion…
• How could we learn the concept of
reachability in a graph:
172
Expressive power of relations,
impossible in AV
[Figure: a directed graph on nodes 0-8]
cannot really be expressed in AV representation, but is very easy in relational representation:
linked-to: {<0,1>, <0,3>, <1,2>,…,<7,8>}
can-reach(X,Y) :- linked-to(X,Y).
can-reach(X,Y) :- linked-to(X,Z), can-reach(Z,Y).
173
Another example of recursive
learning:
E+:
boss(mary,john). boss(phil,mary).boss(phil,john).
E-:
boss(john,mary). boss(mary,phil). boss(john,phil).
BK:
employee(john, ibm). employee(mary,ibm). employee(phil,ibm).
reports_to_imm(john,mary). reports_to_imm(mary,phil).
h: boss(X,Y):-
employee(X,O), employee(Y,O),reports_to(Y, X).
reports_to(X,Y):-reports_to_imm(X,Z), reports_to(Z,Y).
reports_to(X,X).
174
How is learning done: covering algorithm
Initialize the training set T
while the global training set contains +ex:
  find a clause that describes part of relationship Q
  remove the +ex covered by this clause
Finding a clause:
  initialize the clause to Q(V1,…,Vk) :-
  while T contains -ex:
    find a literal L to add to the right-hand side of the clause
Finding a literal: greedy search
175
• The ‘find a clause’ loop describes a search –
• Need to structure the search space
  – generality – semantic and syntactic
• since logical generality is not decidable, a stronger property of θ-subsumption is used
• then search from general to specific (refinement)
176
Refinement
Heuristics:
link to head
new variables
boss(X,Y):-
  boss(X,Y):-empl(X,O).
  boss(X,Y):-X=Y.
  boss(X,Y):-reports_to(X,Y).
  …
    boss(X,Y):-empl(X,O),empl(Y,O1).
    boss(X,Y):-empl(X,O),empl(Y,O).
      boss(X,Y):-empl(X,O),empl(Y,O),rep_to(Y,X).
      boss(X,Y):-empl(X,O),empl(Y,O),rep_to(X,Y).
177
Constructive learning
•
•
•
•
Do we really learn something new?
Hypotheses are in the same language as examples
constructive induction
How do we learn multiplication from examples? We need to
invent plus –we have shown [IJCAI93] that true
constructivism requires recursion, i.e. in
mult(X,s(Y),Z) :- mult(X,Y,T), newp(T,X,Z)
mult(X,0,0).
• Newp – plus - must be recursive.
178
Philosophical motivation
•
Constructive induction is analogical to “revolution” in the methodology
of science
• Kuhn’s Structure of Scientific Revolution:
normal science -> crisis -> revolution -> normal science
• Normal science = learning a “theory” in a fixed language
• Crisis = failure to cope with anomalies observed, due to inadequate
language
• Revolution = introduction of new terms into the language (cannot be done
in AV)
179
Example: predicting colour in flowers
• Language: colours r, y; a is any red flower, b is any yellow flower; col(X,Y): X is of colour Y; ch(X,Y) = result of breeding X and Y
• Observations (that Czech monk and his peas…)
1. col(a,r). % Adam and Eve
2. col(b,y).
3. col(ch(a,a),r). % first generation
4. col(ch(a,b),r).
5. col(ch(b,b),y).
6. col(ch(a,ch(b,b)),r). % original and 1st
7. …
8. col(ch(ch(a,b),ch(a,b)),y). % 1st and 1st
9. …
10. :- col(ch(a,a),y).
180
col(ch(a,X),r).
col(ch(X,Y),r) :- col(X,r), col(Y,r).
col(ch(b,b),y).
col(ch(X,Y),y) :- col(X,y), col(Y,y).
• But in some generations y and r produce r, and in some – y
• We need either infinitely many clauses, or infinitely long clauses
• A revolution is necessary
181
new necessary predicates are invented
col(X,y)  n00(X).
col(X,r)  n11(X).
col(X,r)  n10(X).
n00(ch(X,Y))n00(ch(X,X)),n00(ch(Y,Y))
n10(ch(X,Y))n11(ch(X,X)),n00(ch(Y,Y))
n11(ch(X,Y))n11(ch(X,X)),n11(ch(Y,Y))
type n11 and type n11 produce type n11 only
type n11 and type n00 produce type n10
…
182
• n00 represents purebred flowers with the recessive character, n11 – purebred with the dominant one, and n10 – hybrids with the dominant character
• n00 is the recessive purebred, n11 is the dominant purebred
• n10 is the dominant hybrid
• In fact, the invented predicates represent the concept of a gene (dominant and recessive)!
183
Success story: mutagenicity
• heterogeneous chemical compounds – their structure
requires relational representation
• BK: properties of specific atoms and bonds between
them (relation!) and generic organic chemistry info (e.g.
structure of benzene rings, etc.)
• Regression-unfriendly
A learned rule has been published in Science
conjugated
double bond in
a five-member
ring
184
problems
• Expressivity – efficiency
• Dimensionality reduction
• Therefore, interest in feature selection
(relational selection)
185
Dimensionality Reduction
(Relational Feature Selection)
• Relational feature selection
• Difficulties
  – Clause lengths vary
  – Different literals in different clauses
• Our approach
  » Propositionalization
  » Bounded MIP
  » Attribute selection in such a MIP, in the presence of the noise introduced by the CDR
  » Back to the relational representation
  » Empirical results
186
Conclusion ILP
• ILP “contains” normal ML and goes beyond
some of its limitations
• Is necessary if recursive definitions or ‘term
invention’ are to be used
• Has both solid theoretical foundations and
impressive practical achievements
• Lots to be done – e.g. in relational selection
187
Course conclusion
• What we have not talked about:
  – clustering
  – reinforcement learning
  – Inductive Logic Programming
  – Explanation-based learning/speedup
188
Course conclusion
• We have covered the basic inductive learning
paradigms:
– decision trees
– neural networks
– bayesian learning
– instance-based learning
• we have discussed different ways of
evaluation of the results of learning
• we have connected the learning material to
the data mining material
• we have presented a number of interesting
applications
189