Data Mining
Practical Machine Learning Tools and Techniques
By I. H. Witten, E. Frank and M. A. Hall
6.9: Semi-Supervised Learning
Rodney Nielsen
Many / most of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall
Implementation: Real Machine Learning Schemes
• Decision trees
• From ID3 to C4.5 (pruning, numeric attributes, ...)
• Classification rules
• From PRISM to RIPPER and PART (pruning, numeric data, …)
• Association Rules
• Frequent-pattern trees
• Extending linear models
• Support vector machines and neural networks
• Instance-based learning
• Pruning examples, generalized exemplars, distance functions
Implementation: Real Machine Learning Schemes
• Numeric prediction
• Regression/model trees, locally weighted regression
• Bayesian networks
• Learning and prediction, fast data structures for learning
• Clustering
• Hierarchical, incremental, probabilistic, Bayesian
• Semisupervised learning
• Clustering for classification, co-training
Semisupervised Learning
• Semisupervised learning: attempts to use
unlabeled data as well as labeled data
• The aim is to improve classification performance
• Why try to do this? Unlabeled data is often
plentiful and labeling data can be expensive
• Web mining: classifying web pages
• Text mining: identifying names in text
• Video mining: classifying people in the news
• Leveraging the large pool of unlabeled examples
would be very attractive
Clustering for Classification
• Idea: use Naïve Bayes on labeled examples and
then apply EM
• Build Naïve Bayes model on labeled data
• Until convergence:
• Label unlabeled data based on class probabilities
(“Expectation” step)
• Train new Naïve Bayes model based on all the data
(“Maximization” step)
• Essentially the same as EM for clustering with
fixed cluster membership probabilities for labeled
data and #clusters = #classes
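A minimal sketch of this loop in Python, as a hard-assignment variant: it assumes scikit-learn's MultinomialNB over count features (e.g., word counts for document classification), and simplifies the soft "Expectation" step by hard-labeling each unlabeled example with its most probable class; a fuller version would weight unlabeled examples by their class probabilities.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB  # assumes count features

def em_semisupervised(X_lab, y_lab, X_unlab, n_iter=10):
    """Labeled examples keep their fixed labels; unlabeled examples
    are relabeled by the current model on every round."""
    model = MultinomialNB()
    model.fit(X_lab, y_lab)                  # initial model: labeled data only
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):                  # "until convergence", simplified
        # "Expectation": label unlabeled data based on class probabilities
        probs = model.predict_proba(X_unlab)
        y_unlab = model.classes_[probs.argmax(axis=1)]
        # "Maximization": train a new Naive Bayes model on all the data
        model.fit(X_all, np.concatenate([y_lab, y_unlab]))
    return model
```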
Comments
• Has been applied successfully to document
classification
• Certain phrases are indicative of classes
• Some of these phrases occur only in the unlabeled
data, some in both sets
• EM can generalize the model by taking advantage
of co-occurrence of these phrases
• Refinement 1: reduce weight of unlabeled data
• Refinement 2: allow multiple clusters per class
Co-training
• Method for learning from multiple views (multiple sets of attributes), e.g.:
• First set of attributes describes content of web page
• Second set of attributes describes links that link to the web
page
• Until stopping criteria:
• Step 1: build model from each view
• Step 2: use models to assign labels to unlabeled data
• Step 3: select those unlabeled examples that were most confidently
predicted (often, preserving ratio of classes)
• Step 4: add those examples to the training set
• Assumption: views are independent
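A rough sketch of the loop above, under stated assumptions: h1 and h2 are scikit-learn-style classifiers with predict_proba, X1/X2 are the two views of the same instances, and the class-ratio preservation of Step 3 is omitted for brevity.

```python
import numpy as np

def co_train(h1, h2, X1, X2, y, X1_unl, X2_unl, k=5, n_rounds=10):
    """Each round, both models train on the shared labeled set; each then
    labels the k unlabeled examples it is most confident about."""
    for _ in range(n_rounds):                    # simple stopping criterion
        if len(X1_unl) < 2 * k:
            break
        h1.fit(X1, y)                            # Step 1: one model per view
        h2.fit(X2, y)
        picks, labels = [], []
        for h, X_unl in ((h1, X1_unl), (h2, X2_unl)):
            probs = h.predict_proba(X_unl)       # Step 2: label unlabeled data
            best = np.argsort(probs.max(axis=1))[-k:]  # Step 3: most confident
            picks.extend(best)
            labels.extend(h.classes_[probs[best].argmax(axis=1)])
        picks = np.array(picks)                  # Step 4: grow the training set
        X1 = np.vstack([X1, X1_unl[picks]])
        X2 = np.vstack([X2, X2_unl[picks]])
        y = np.concatenate([y, labels])
        keep = np.setdiff1d(np.arange(len(X1_unl)), picks)
        X1_unl, X2_unl = X1_unl[keep], X2_unl[keep]
    return h1, h2
```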
EM and Co-training
• Like EM for semisupervised learning, but view is
switched in each iteration of EM
• Uses all the unlabeled data (probabilistically labeled) for
training
• Has also been used successfully with support vector
machines
• Using logistic models fit to output of SVMs to estimate a
class probability distribution
• Co-training sometimes also seems to work when views are chosen randomly!
• Why? Maybe the co-trained classifier is more robust
Self-Training
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h)
  • L ← L0 + <U*, h(U*)>
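A direct transcription of this loop in Python. Here f and select are hypothetical stand-ins: f trains and returns a fitted model, and select returns the indices of the unlabeled examples to self-label.

```python
import numpy as np

def self_train(f, X0, y0, U, select, max_rounds=20):
    """h <- f(L); U* <- select(U, h); L <- L0 + <U*, h(U*)>.
    L is rebuilt from the seed set L0 each round, so self-labels can change."""
    X, y = X0, y0                      # L <- L0 = <X(0), Y(0)>
    for _ in range(max_rounds):        # until stopping-criteria
        h = f(X, y)                    # h(x) <- f(L)
        star = select(U, h)            # U* <- select(U, h): indices into U
        if len(star) == 0:
            break
        X = np.vstack([X0, U[star]])   # L <- L0 + <U*, h(U*)>
        y = np.concatenate([y0, h.predict(U[star])])
    return h
```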
Example Selection
• Probability
• Probability ratio or probability margin
• Entropy
• Or several other possibilities (e.g., search Burr Settles' Active Learning Tutorial)
Stopping Criteria
• T rounds,
• Repeat until convergence,
• Use held-out validation data, or
• k-fold cross-validation
Seed
• Seed Data vs. Seed Classifier
• Training on seed data does not necessarily
result in a classifier that perfectly labels the
seed data
• Training on data output by a seed classifier
does not necessarily result in the same
classifier
Indelibility
Indelible
• L ← <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h)
  • L ← L + <U*, h(U*)>
  • U ← U – U*

Original: Y(U) can change
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h)
  • L ← L0 + <U*, h(U*)>
Persistence
Indelible
• L ← <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h)
  • L ← L + <U*, h(U*)>
  • U ← U – U*

Persistent: X(L) can't change
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← U* + select(U, h)
  • L ← L0 + <U*, h(U*)>
  • U ← U – U*
Throttling
Throttled
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h, k)
  • L ← L0 + <U*, h(U*)>
Select the k examples from U with the greatest confidence.

Original: Threshold
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h, θ)
  • L ← L0 + <U*, h(U*)>
Select all examples from U with confidence > θ.
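The two select() variants, sketched assuming a classifier h with a scikit-learn-style predict_proba; each returns indices into the unlabeled pool U.

```python
import numpy as np

def select_throttled(U, h, k):
    """Throttled: the k examples predicted with the greatest confidence."""
    conf = h.predict_proba(U).max(axis=1)
    return np.argsort(conf)[-k:]

def select_threshold(U, h, theta):
    """Threshold: all examples predicted with confidence > theta."""
    conf = h.predict_proba(U).max(axis=1)
    return np.flatnonzero(conf > theta)
```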
Balanced
Balanced (& Throttled)
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h, k)
  • L ← L0 + <U*, h(U*)>
Select k+ positive & k– negative examples; often k+ = k–, or they are proportional to N+ & N–.

Throttled
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h, k)
  • L ← L0 + <U*, h(U*)>
Select the k examples from U with the greatest confidence.
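A sketch of the balanced variant for a binary task. It assumes column 1 of predict_proba is the positive class; in practice k_pos and k_neg would be set equal or proportional to the class counts.

```python
import numpy as np

def select_balanced(U, h, k_pos, k_neg):
    """k_pos most confidently positive and k_neg most confidently
    negative examples from the unlabeled pool U."""
    p_pos = h.predict_proba(U)[:, 1]        # assumes column 1 = positive class
    order = np.argsort(p_pos)
    return np.concatenate([order[-k_pos:],  # most confidently positive
                           order[:k_neg]])  # most confidently negative
```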
Preselection
Preselect Subset of U
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U' ← select(U, φ)
  • U* ← select(U', h, θ)
  • L ← L0 + <U*, h(U*)>
Select examples from U', a subset of U (typically random).

Original: Test all of U
• L ← L0 = <X(0), Y(0)>
• Until stopping-criteria:
  • h(x) ← f(L)
  • U* ← select(U, h, θ)
  • L ← L0 + <U*, h(U*)>
Select examples from all of U.
Co-training
• X = X1 × X2 ; two different views of the data
• x = (x1, x2) ; i.e., each instance comprises two distinct sets of features and values
• Assume each view is sufficient for correct classification
Co-Training Algorithm 1
[Table 1: the co-training algorithm, from Blum and Mitchell, 1998]
Companionbots
Perceptive, emotive, conversational, healthcare, companion robots
NSF CISE Smart Health & Wellbeing, PI
Collaborators: UC Denver, CU Boulder, BLT, U Denver
Consultants: Columbia, Worcester Polytechnic Institute, UCD Depression Center
Elderly and Depression
Stats for 65+ and depression:
• Number of elderly will double by 2030
• Prevalence: 12–20%
• Leading cause of disability (M/F, all ages, worldwide; WHO)
• 50–58% of hospital patients
• 36–50% of healthcare expenditures
• Doubles the cost of care for chronic diseases
Companionbots Architecture
[Architecture diagram: sensory inputs (audio, vision, radar/IR, force/touch, body & motion) feed interpretation modules (speech recognition, natural language understanding, object recognition and tracking, emotion recognition, situation/scenario/environment understanding, user modeling & history tracking); prediction modules (dialogue, scenario, emotion, environment) and goal managers drive mechatronic outputs (natural language, movement, and expression generation; text-to-speech; visual displays; motor and locomotion controls). Highlighted here: instance selection for co-training in emotion recognition.]
Multimodal Emotion Recognition
• Vision
• Speech
• Language
Example utterance: "Why does this always have to happen to me"
Co-Training Emotion Recognition
Given a set L of labeled training examples and a set U of unlabeled training examples:

Create a pool U' of examples by choosing u examples at random from U
Loop for k iterations:
  Use L to train a classifier h1 that considers only the vision features
  Use L to train a classifier h2 that considers only the speech features
  Use L to train a classifier h3 that considers only the language features
  Allow h1 to label p1 positive and n1 negative examples from U'
  Allow h2 to label p2 positive and n2 negative examples from U'
  Allow h3 to label p3 positive and n3 negative examples from U'
  Add these self-labeled examples to L
  Randomly choose examples from U to replenish U'

(Adapted from Blum & Mitchell, 1998)
Semisupervised & Active Learning
• Most common strategy for instance selection
• Based on class probability estimates
• Semisupervised learning
• Select k instances with highest class probabilities
• Active learning
• Select k instances with lowest class probabilities
Active Learning
• Usually an abundance of unlabeled data
• How much should you label?
• Which instances should you label?
• Does it matter?
• Can the learner benefit from selective labeling?
• Active Learning: incrementally requests
labels for key instances
Learning Paradigms
[Figure: side-by-side schematic of supervised learning (labels obtained for randomly chosen instances), unsupervised learning (no labels), and active learning (the learner queries labels for selected instances).]
Active Learning Applications
• Speech Recognition
  • 10 minutes to annotate the words in 1 minute of speech
  • 7 hours to annotate the phonemes in 1 minute of speech
• Named Entity Recognition
  • Half an hour for a simple newswire article
  • A PhD for a bioinformatics article
• Image annotation
Face/Pedestrian/Object Detection
Heuristic Active Learning Algorithm
• Start with unlabeled data
• Randomly pick a small number of examples to have labeled
• Repeat:
  • Train classifier on labeled data
  • Query the unlabeled example that:
    • Is closest to the boundary,
    • Has the least certainty, or
    • Minimizes overall uncertainty
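A minimal pool-based version of this loop. It assumes a scikit-learn-style model with predict_proba and a hypothetical oracle(i) that returns the true label of pool instance i (i.e., the human annotator).

```python
import numpy as np

def active_learn(model, X_pool, oracle, n_seed=10, n_queries=50, seed=0):
    """Label a few random examples, then repeatedly train and query the
    pool instance the model is least certain about."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    y = [oracle(i) for i in labeled]       # initial random labels
    for _ in range(n_queries):
        model.fit(X_pool[labeled], y)      # train on labeled data
        conf = model.predict_proba(X_pool).max(axis=1)
        conf[labeled] = np.inf             # never re-query a labeled example
        q = int(conf.argmin())             # least certain / closest to boundary
        labeled.append(q)
        y.append(oracle(q))
    return model
```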
Two Gaussians
a) Two classes with Gaussian distributions
b) Logistic regression on 30 random labeled examples: 70% accuracy
c) Logistic regression on 30 examples chosen by active learning: 90% accuracy
Space of Active Learning
Query types:
• Membership query synthesis
• Stream-based selective sampling
• Pool-based active learning

Sampling methods:
• Uncertainty sampling
• Query by committee
• Expected model change
• Variance reduction
• Estimated error reduction
• Density-weighted methods
Active Learning Query Types
From Burr Settles, 2009, AL Tutorial
Membership Query Synthesis
• Dynamically construct query instances based on expected informativeness
• Applications
  • Character recognition
  • Robot scientist: find the optimal growth medium for a yeast
    • 3× lower labeling cost than the next-cheapest strategy
    • 100× lower labeling cost than random selection
Stream-based Selective Sampling
• Informativeness measure
  • Region of uncertainty / version space
• Applications
  • Part-of-speech tagging (POST)
  • Sensor scheduling
  • IR ranking
  • Word-sense disambiguation (WSD)
Pool-based Active Learning
• Informativeness measure
• Applications
  • Cancer diagnosis
  • Text classification
  • Information extraction (IE)
  • Image classification & retrieval
  • Video classification & retrieval
  • Speech recognition
Pool-based Active Learning Loop
From Burr Settles, 2009, AL Tutorial
Questions
• Questions???
Instance Sampling in Active Learning
Query types:
• Membership query synthesis
• Stream-based selective sampling
• Pool-based active learning

Sampling methods:
• Uncertainty sampling
• Query by committee
• Expected model change
• Variance reduction
• Estimated error reduction
• Density-weighted methods
Uncertainty Sampling
• Uncertainty sampling: select examples based on confidence in the prediction
• Least confident:
  $\hat{y} = \operatorname{argmax}_y P(y \mid x; \theta)$, $\quad x^*_{LC} = \operatorname{argmin}_x P(\hat{y} \mid x; \theta)$
• Margin sampling:
  $x^*_M = \operatorname{argmin}_x \big( P(\hat{y} \mid x; \theta) - P(\tilde{y} \mid x; \theta) \big)$, where $\tilde{y} = \operatorname{argmax}_{y \neq \hat{y}} P(y \mid x; \theta)$
• Entropy-based models:
  $x^*_H = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta) \log P(y_k \mid x; \theta)$
  (minimizing $\sum_k P \log P$ maximizes the entropy of the predictive distribution)
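The three measures in NumPy, given a matrix P of predicted class probabilities with one row per unlabeled instance; each function returns the index of the instance to query.

```python
import numpy as np

def least_confident(P):
    """x*_LC: instance whose most likely label has the lowest probability.
    P is (n, k) with P[i, j] = P(y_j | x_i; theta)."""
    return int(np.argmin(P.max(axis=1)))

def margin(P):
    """x*_M: instance with the smallest gap between the top two labels."""
    top2 = np.sort(P, axis=1)[:, -2:]
    return int(np.argmin(top2[:, 1] - top2[:, 0]))

def entropy(P):
    """x*_H: argmin of sum_k P log P, i.e., maximum predictive entropy."""
    return int(np.argmin(np.sum(P * np.log(P + 1e-12), axis=1)))
```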
Query by Committee
• Train a committee of hypotheses $C = \{h_1, h_2, \ldots, h_m\}$ representing different regions of the version space
• Obtain some measure of (dis)agreement on the instances in the dataset (e.g., vote entropy):
  $x^*_{VE} = \operatorname{argmin}_x \sum_k \frac{V(y_k)}{|C|} \log \frac{V(y_k)}{|C|}$
  where $V(y_k)$ is the number of committee votes for label $y_k$
• Assume the most informative instance is the one on which the committee has the most disagreement
• Goal: minimize the version space
• No agreement on the size of the committee, but even 2–3 members provides good results
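A vote-entropy sketch, assuming committee members with a scikit-learn-style predict; it returns the pool instance on which the committee disagrees most.

```python
import numpy as np

def vote_entropy_query(committee, U):
    """x*_VE: instance whose vote distribution V(y_k)/|C| has maximum
    entropy (the slide's argmin of sum p log p)."""
    votes = np.stack([h.predict(U) for h in committee])         # (|C|, n)
    label_set = np.unique(votes)
    V = np.stack([(votes == y).sum(axis=0) for y in label_set], axis=1)
    p = V / len(committee)                                      # vote fractions
    sum_plogp = np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=1)
    return int(np.argmin(sum_plogp))      # min sum p log p = max disagreement
```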
Competing Hypotheses
From Burr Settles, 2009, AL Tutorial
Expected Model Change
• Query the instance that would result in the largest expected change in the hypothesis h, based on the current model's expectations
  • E.g., the instance that would produce the largest gradient-descent step in the model parameters
• Prefer the instance x that leads to the most significant change in the model
Expected Model Change
• Which learning algorithms does this work for?
• What are the issues?
  • Can be computationally expensive for large datasets and feature spaces
  • Can be led astray if features aren't properly scaled
    • How do you properly scale the features?
Estimated Error Reduction
• Other models approximate the goal of minimizing future error by minimizing a proxy (e.g., uncertainty, ...)
• Estimated error reduction attempts to directly minimize E[error]:

$x^*_{0/1} = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta)\, E\big[\mathrm{Error}_{\theta^{+\langle x, y_k \rangle}}\big]$

$\phantom{x^*_{0/1}} = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta) \left( \sum_{u=1}^{U} 1 - P\big(\hat{y} \mid x^{(u)}; \theta^{+\langle x, y_k \rangle}\big) \right)$

where $\theta^{+\langle x, y_k \rangle}$ is the model retrained with $\langle x, y_k \rangle$ added to the labeled set, and $\hat{y}$ is its most likely label for $x^{(u)}$.
Estimated Error Reduction
• Often computationally prohibitive
  • Binary logistic regression would be O(|U| |L| G)
    • Where G is the number of gradient-descent iterations to convergence
  • Conditional random fields would be O(T |Y|^{T+2} |U| |L| G)
    • Where T is the number of instances in the sequence
Variance Reduction
• Regression problems
• E[error²] = noise + bias + variance:

$E\big[(\hat{y} - y)^2 \mid x\big] = \underbrace{E\big[(y - E[y \mid x])^2\big]}_{\text{noise}} + \underbrace{\big(E_L[\hat{y}] - E[y \mid x]\big)^2}_{\text{bias}} + \underbrace{E_L\big[(\hat{y} - E_L[\hat{y}])^2\big]}_{\text{variance}}$

• The learner can't change the noise or the bias, so minimize the variance
• The Fisher Information Ratio is used for classification
Outlier Phenomenon
• Uncertainty sampling and Query by Committee
might be hindered by querying many outliers
Density Weighted Methods
• Uncertainty sampling and Query by Committee might be hindered by querying many outliers
• Density-weighted methods overcome this potential problem by also considering whether the example is representative of the input distribution:

$x^* = \operatorname{argmax}_x\; \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}\big(x, x^{(u)}\big) \right)^{\beta}$

where $\phi_A(x)$ is the base informativeness measure (e.g., uncertainty)
• Tends to work better than the base selection methods on their own
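A sketch of the formula above, with cosine similarity standing in for sim(·,·) and a hypothetical phi_A that returns the base informativeness score (e.g., predictive entropy) of each pool instance.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_query(phi_A, U, beta=1.0):
    """x*: base informativeness phi_A(x), weighted by the instance's
    average similarity to the rest of the pool raised to beta."""
    density = cosine_similarity(U).mean(axis=1) ** beta  # (1/U) sum sim(x, x_u)
    return int(np.argmax(phi_A(U) * density))
```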
Diversity
• Naïve selection by earlier methods results in
selecting examples that are very similar
• Must factor this in and look for diversity in
the queries
Active Learning Empirical Results
• Appears to work well, barring publication bias
From Burr Settles, 2009, AL Tutorial
Labeling Costs
• Are all labels created equal?
  • Generating labels by experiments
  • Some instances are easier to label (e.g., shorter sentences)
  • Can pre-label data for a small savings
  • Experimental problems
• Value of information (VOI)
  • Considers labeling & estimated misclassification costs
• Critical to the goal of active learning
• Divide informativeness by cost?
Batch Mode Active Learning
Active Learning Evaluation
• Learning curves for text classification: baseball vs. hockey. Curves plot
classification accuracy as a function of the number of documents queried for
two selection strategies: uncertainty sampling (active learning) and random
sampling (passive learning). We can see that the active learning approach is
superior here because its learning curve dominates that of random sampling.
From Burr Settles, 2009, AL Tutorial
Active Learning Evaluation
• We can conclude that an active learning algorithm is superior to
some other approach (e.g., a random baseline like traditional passive
supervised learning) if it dominates the other for most or all of the
points along their learning curves.
From Burr Settles, 2009, AL Tutorial