COMS 6998-4:
Learning and Empirical Inference
Irina Rish
IBM T.J. Watson Research Center
Outline
Motivation
Active learning approaches
Membership queries
Uncertainty Sampling
Information-based loss functions
Uncertainty-Region Sampling
Query by committee
Applications
Active Collaborative Prediction
Active Bayes net learning
Standard supervised learning model
Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: need m to be roughly d/ε, in the realizable case.
Active learning
In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
What is Active Learning?
Unlabeled data are readily available; labels are expensive
Want to use adaptive decisions to choose which labels to acquire for a given dataset
Goal is accurate classifier with minimal cost
Active learning warning
Choice of data is only as good as the model itself
If we assume a linear model, then two data points suffice.
What happens when the data are not linear?
Active Learning Flavors
Selective sampling
- Pool-based vs. sequential
- Myopic vs. batch
Membership queries
Active Learning Approaches
Membership queries
Uncertainty Sampling
Information-based loss functions
Uncertainty-Region Sampling
Query by committee
Problem
Many results in this framework, even for complicated hypothesis classes.
[Baum and Lang, 1991] tried fitting a neural net to handwritten characters.
Synthetic instances created were incomprehensible to humans!
[Lewis and Gale, 1992] tried training text classifiers.
“an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”
Uncertainty Sampling
[Lewis & Gale, 1994]
Query the example that the current classifier is most uncertain about.
If uncertainty is measured as Euclidean distance to the decision boundary: [Figure: the unlabeled points nearest the boundary are the most uncertain]
Used trivially in SVMs, graphical models, etc.
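As a concrete illustration, a minimal pool-based sketch (assuming scikit-learn is available): a linear SVM is refit after each query, and the pooled point closest to the current decision boundary is labeled next. The synthetic pool and the "oracle" labels are illustrative assumptions, not part of any particular system.

```python
# Hedged sketch: uncertainty = Euclidean distance to an SVM decision boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 2))
y_oracle = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # hidden true labels

# seed with a few labels from each class
labeled = list(np.flatnonzero(y_oracle == 0)[:5]) + list(np.flatnonzero(y_oracle == 1)[:5])

for _ in range(20):  # 20 label queries
    clf = SVC(kernel="linear").fit(X_pool[labeled], y_oracle[labeled])
    candidates = np.array([i for i in range(len(X_pool)) if i not in labeled])
    dist = np.abs(clf.decision_function(X_pool[candidates]))  # distance to boundary
    labeled.append(int(candidates[np.argmin(dist)]))          # query the most uncertain point
```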
Information-based Loss Function
[MacKay, 1992]
Maximize KL-divergence between posterior and prior
Maximize reduction in model entropy between posterior and prior
Minimize cross-entropy between posterior and prior
All of these are notions of information gain
Query by Committee
[Seung et al. 1992, Freund et al. 1997]
Prior distribution over hypotheses
Samples a set of classifiers from the distribution
Queries an example based on the degree of disagreement among the committee of classifiers [Figure: committee members A, B, C disagree most on points near the boundary]
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q
D:
t  Sex  Age    TestA  TestB  TestC  Disease
1  M    40-50  0      1      1      ?
2  F    50-60  0      1      0      ?
3  F    30-40  0      0      0      ?
4  F    60+    1      1      1      ?
5  M    10-20  0      1      0      ?
6  M    40-50  0      0      0      ?
7  F    0-10   0      0      0      ?
8  M    30-40  1      1      1      ?
9  M    20-30  0      0      1      ?
Probabilistic Classifier
[Figure: two-node model, class S_t generating observations O_t]
Notation:
T : number of examples
O_t : vector of features of example t
S_t : class of example t
Patient state (S_t):
S_t : DiseaseState
Patient observations (O_t):
O_t1 : Gender
O_t2 : Age
O_t3 : TestA
O_t4 : TestB
O_t5 : TestC
[Figure: two candidate model structures over S, Gender, Age, TestA, TestB, TestC — e.g., a naive Bayes structure with S as parent of every feature, and an alternative structure with dependencies among the features]
Model:
[Figure: S_t → O_t]
Model parameters: P(S_t), P(O_t | S_t)
Generative model: must be able to compute P(S_t = i, O_t = o_t | w)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W) = P(W) ∏_{t=1}^{T} P(S_t, O_t | W)
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q: q(W, D) returns t*, the next example to label
while NotDone
• Learn P( W | D )
• q chooses next example to label
• Expert adds label to D
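As code, this loop might look like the following skeleton; learn_posterior, q, and ask_expert are hypothetical placeholders for the model fit, the query heuristic, and the human labeler — not the API of any particular library.

```python
# Skeleton of the active-learning loop above, with placeholder callables.
def active_learning(D_labeled, D_unlabeled, budget, learn_posterior, q, ask_expert):
    for _ in range(budget):                       # "while NotDone"
        P_W_given_D = learn_posterior(D_labeled)  # learn P(W | D)
        t_star = q(P_W_given_D, D_unlabeled)      # q chooses the next example
        x = D_unlabeled.pop(t_star)
        D_labeled.append((x, ask_expert(x)))      # expert adds the label to D
    return learn_posterior(D_labeled)
```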
[Figure: a pool of unlabeled examples (O_1, S_1?), …, (O_7, S_7?); the query algorithm q ponders ("hmm…") which one to ask about]
• Pool ("random access" to patients)
• Sequential (must decide as patients walk in the door)
q
• Recall: q(W, D) returns the "most interesting" unlabeled example.
• Well, what makes a doctor curious about a patient?
Strategy #1: Uncertainty Sampling [Lewis & Gale, 1994]
score_uncert(S_t) = H(S_t) = -Σ_i P(S_t = i) log P(S_t = i),
where P(S_t) is the predictive distribution P(S_t | O_t, D).
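A small sketch of this score; base-10 logarithms are assumed here, since they reproduce the H(S_t) values in the tables below.

```python
# Entropy-based uncertainty score for each unlabeled example (binary class).
import numpy as np

def score_uncert(p_disease):
    """H(S_t) for a binary class with P(S_t = 1) = p_disease, base-10 logs."""
    p = np.clip(np.asarray(p_disease, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log10(p) + (1 - p) * np.log10(1 - p))

p = np.array([0.02, 0.01, 0.05, 0.12, 0.01, 0.96])  # P(S_t) from the table below
print(np.round(score_uncert(p), 3))   # [0.043 0.024 0.086 0.159 0.024 0.073]
print(int(np.argmax(score_uncert(p))) + 1)  # query t = 4, the highest entropy
```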
t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  H(S_t)
1  M    20-30  0      1      1      ?    0.02    0.043
2  F    20-30  0      1      0      ?    0.01    0.024
3  F    30-40  1      0      0      ?    0.05    0.086
4  F    60+    1      1      0      ?    0.12    0.159
5  M    10-20  0      1      0      ?    0.01    0.024
6  M    20-30  1      1      1      ?    0.96    0.073
After querying the highest-entropy example (t = 4) and updating the model:

t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  H(S_t)
1  M    20-30  0      1      1      ?    0.01    0.024
2  F    20-30  0      1      0      ?    0.02    0.043
3  F    30-40  1      0      0      ?    0.04    0.073
4  F    60+    1      1      0      0    0.00    0.00
5  M    10-20  0      1      0      ?    0.06    0.112
6  M    20-30  1      1      1      ?    0.97    0.059
GOOD: couldn’t be easier
GOOD: often performs pretty well
BAD: H(S_t) measures information gain about the samples, not the model
BAD: sensitive to noisy samples
Information-based Loss Functions [MacKay, 1992]
Informative with respect to what?
[Figure: three posteriors P(W | D) over model space W, progressively narrowing — from H(W) high, to better, to H(W) = 0]
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) - H(W | S_t), where H(W) is the current model-space entropy and H(W | S_t) is the expected model-space entropy if we learn S_t
score_IG(S_t) = MI(S_t ; W) = H(W) - H(W | S_t)
We usually can't just sum over all models to get these entropies:
H(W) = -∫ P(w) log P(w) dw
…but we can sample a committee C of models from P(W | D):
H(W) ≈ H(C) = -Σ_{c ∈ C} P(c) log P(c)
H(W) = -∫ P(w) log P(w) dw
H(W | S_t = i) = -∫ P(w | S_t = i) log P(w | S_t = i) dw
H(W | S_t) = -Σ_i P(S_t = i) ∫ P(w | S_t = i) log P(w | S_t = i) dw = Σ_i P(S_t = i) H(W | S_t = i)
score_IG(S_t) = H(C) - H(C | S_t)
t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  Score = H(C) - H(C|S_t)
1  M    20-30  0      1      1      ?    0.02    0.53
2  F    20-30  0      1      1      ?    0.01    0.58
3  F    30-40  1      0      0      ?    0.05    0.40
4  F    60+    1      1      0      ?    0.12    0.49
5  M    10-20  0      1      0      ?    0.01    0.57
6  M    20-30  0      0      1      ?    0.02    0.52

Note: the most informative query (t = 2, score 0.58) is not the most uncertain one (t = 4).
score_IG(S_t) = H(C) - H(C | S_t) = H(S_t) - H(S_t | C)
Familiar?
score_Uncertain(S_t) = H(S_t)
score_InfoGain(S_t) = H(S_t) - H(S_t | C)
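A sketch of this estimator in its committee form: H(S_t) is the entropy of the committee-averaged prediction, and H(S_t | C) is the weighted average of each member's predictive entropy. The preds matrix and the committee weights here are illustrative assumptions.

```python
# score_InfoGain(S_t) = H(S_t) - H(S_t | C), with a committee C sampled from P(W | D).
import numpy as np

def H(p):  # binary entropy, elementwise
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def score_infogain(preds, weights):
    """preds: (|C|, T) matrix of P(S_t = 1 | c); weights: posterior P(c)."""
    weights = np.asarray(weights) / np.sum(weights)
    p_mean = weights @ preds            # P(S_t) = sum_c P(c) P(S_t | c)
    H_St = H(p_mean)                    # H(S_t)
    H_St_given_C = weights @ H(preds)   # H(S_t | C) = sum_c P(c) H(S_t | c)
    return H_St - H_St_given_C          # mutual information MI(S_t; C)

preds = np.array([[0.9, 0.6, 0.1],
                  [0.1, 0.6, 0.2]])        # two committee members, three examples
print(score_infogain(preds, [0.5, 0.5]))   # example 0 (disagreement) scores highest
```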
If our objective is to reduce the prediction error, then
“the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries ”
Strategy #2: Query by Committee
Temporary assumptions:
- Pool → Sequential
- P(W | D) → Version space
- Probabilistic → Noiseless
QBC attacks the size of the "version space".
[Figure: two models sampled from the version space both predict FALSE for the incoming example — no disagreement, no query]
[Figure: both sampled models predict TRUE — again no disagreement, no query]
[Figure: model #1 predicts FALSE and model #2 predicts TRUE — the committee disagrees, so this example is queried]
Ooh, now we're going to learn something for sure!
One of them is definitely wrong.
The Original QBC Algorithm
As each example arrives…
1. Choose a committee, C (usually of size 2), randomly from the version space
2. Have each member of C classify it
3. If the committee disagrees, select it (query its label)
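A toy sketch of this loop for linear separators through the origin in R^2. The version-space sampler here is naive rejection sampling over random unit vectors consistent with the labels so far — an illustrative assumption, not an efficient implementation.

```python
# Sequential QBC with a size-2 committee and a rejection-sampled version space.
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, 2.0]); w_true /= np.linalg.norm(w_true)
label = lambda w, x: int(np.dot(w, x) > 0)

def sample_version_space(X, y, n=2, tries=10000):
    """Rejection-sample n hypotheses consistent with the labeled set (X, y)."""
    out = []
    for _ in range(tries):
        w = rng.normal(size=2); w /= np.linalg.norm(w)
        if all(label(w, x) == yi for x, yi in zip(X, y)):
            out.append(w)
            if len(out) == n:
                return out
    raise RuntimeError("version space too small to hit by rejection")

X_lab, y_lab, queries = [], [], 0
for t in range(200):                                # stream of unlabeled points
    x_t = rng.normal(size=2)
    h1, h2 = sample_version_space(X_lab, y_lab)     # committee of size 2
    if label(h1, x_t) != label(h2, x_t):            # committee disagrees?
        X_lab.append(x_t); y_lab.append(label(w_true, x_t)); queries += 1
print(f"{queries} labels queried out of 200 points seen")
```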
Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]
First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.
[Figure: hypothesis space H] Which pair of hypotheses is closest? It depends on the data distribution P.
Distance measure on H: d(h, h') = P(h(x) ≠ h'(x))
Query-by-committee
First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.
To keep things simple, say d(h, h') = Euclidean distance.
[Figure: hypothesis space H — the volume shrinks, but the error is likely to remain large!]
Query-by-committee
Elegant scheme which decreases volume in a manner which is sensitive to the data distribution.
Bayesian setting: given a prior π on H, set H_1 = H.
For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1} ⊆ H_t accordingly
[Informally: is there a lot of disagreement about x_t within H_t?]
Problem: how to implement it efficiently?
Query-by-committee
For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1}
Observation: the probability of getting the pair (h, h') in the inner loop (when a query is made) is proportional to π(h) π(h') d(h, h').
[Figure: version space H_t]
Query-by-committee
Label bound: for H = {linear separators in R^d} and P = uniform distribution, just d log(1/ε) labels are needed to reach a hypothesis with error < ε.
Implementation: need to randomly pick h according to (π, H_t), e.g. H = {linear separators in R^d}, π = uniform distribution.
[Figure: the version space H_t is a convex region]
How do you pick a random point from a convex body?
Sampling from convex bodies
By random walk!
1. Ball walk
2. Hit-and-run
[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
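A sketch of the hit-and-run walk over a convex body given as {x : Ax ≤ b} (assumed bounded, with a strictly interior start point); in QBC the body would be the version space of linear separators. This is illustrative, not a tuned implementation.

```python
# Hit-and-run sketch: sample approximately uniformly from {x : Ax <= b}.
import numpy as np

def hit_and_run(A, b, x0, steps=1000, seed=2):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)        # must start strictly inside the body
    for _ in range(steps):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)             # random direction through x
        # constraints on x + t*d:  (A d) t <= b - A x
        ad, slack = A @ d, b - A @ x
        t_hi = np.min(slack[ad > 0] / ad[ad > 0])   # forward chord endpoint
        t_lo = np.max(slack[ad < 0] / ad[ad < 0])   # backward chord endpoint
        x = x + rng.uniform(t_lo, t_hi) * d         # uniform point on the chord
    return x

# Example: the box [-1, 1]^2, written as four halfspace constraints.
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.ones(4)
print(hit_and_run(A, b, x0=[0.0, 0.0]))
```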
Some challenges
[1] For linear separators, analyze the label complexity for some distribution other than uniform!
[2] How to handle nonseparable data? Need a robust base learner.
[Figure: noisy labels on both sides of the true boundary]
Approach: Collaborative Prediction (CP)
[Figure: two partially observed matrices — a client × server QoS matrix (e.g., bandwidth) and a user × movie ratings matrix — with many "?" entries]
Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict unobserved ratings:
- will Alina like "The Matrix"? (unlikely)
- will Client 86 have fast download from Server 39?
- will member X of the funding committee approve our project Y?
Collaborative Prediction = Matrix Approximation
[Figure: a large, sparsely observed client × server matrix (e.g., 100 servers)]
• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes
• Approaches: mainly factorized models assuming hidden "factors" that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)
[Figure: a user's rating row, e.g. (2 4 5 1 4 2), expressed through the user's weights on hidden factors]
Assumptions:
- there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights on such factors; a user's ratings are weighted (linear) combinations of the movies' values
[Figure: a sparsely observed ratings matrix Y and a fully filled-in rank-k approximation X]
Objective: find a factorizable X = UV' that approximates Y:
X = argmin_{X'} Loss(X', Y), subject to "regularization" constraints (e.g., rank(X) < k)
Loss function: depends on the nature of your problem
Matrix Factorization Approaches
• Singular value decomposition (SVD): low-rank approximation (see the sketch below)
  - assumes a fully observed Y and sum-squared loss
• In collaborative prediction, Y is only partially observed
  - low-rank approximation becomes a non-convex problem with many local minima
• Furthermore, we may not want sum-squared loss, but instead:
  - accurate predictions (0/1 loss, approximated by hinge loss)
  - cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  - ranking cost (e.g., suggest the k "best" movies for a user)
  - NON-CONVEX PROBLEMS!
• Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]:
  - replaces the bounded-rank constraint by a bound on the norms of the U, V' vectors
  - a convex optimization problem! can be solved exactly by semi-definite programming
  - strongly related to learning max-margin classifiers (SVMs)
• Exploit MMMF's properties to augment it with active sampling!
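For intuition on the first bullet, a quick sketch assuming numpy: for a fully observed matrix, truncated SVD gives the best sum-squared-loss rank-k approximation (Eckart-Young) — and it is exactly this recipe that breaks once entries are missing.

```python
# Best rank-k approximation of a fully observed matrix via truncated SVD.
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(7, 5)) @ rng.normal(size=(5, 9))   # fully observed 7x9 matrix
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
k = 2
X = (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k approximation X = U V'
print(np.linalg.norm(Y - X))          # minimal sum-squared error over rank-k X
```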
Key Idea of MMMF
Rows of U are feature vectors; columns of V' are linear classifiers (weight vectors)
[Figure: samples and a separating line; the "margin" here = Dist(sample, line)]
X_ij = sign_ij × margin_ij
Predictor_ij = sign_ij: if sign_ij > 0, classify as +1, otherwise classify as -1
MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers
Active Learning with MMMF
- We extend MMMF to Active-MMMF using margin-based active sampling
- We investigate exploitation vs. exploration trade-offs imposed by different heuristics
[Figure: matrix of predicted margins for the unobserved entries]
Margin-based heuristics:
- min-margin (most uncertain)
- min-margin positive ("good" uncertain)
- max-margin ("safe" choice, but no information)
Active Max-Margin Matrix Factorization: A-MMMF(M, s)
1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
2. Using current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an "enforced" download)
3. Add the new samples to Y
4. Repeat 1-3
Issues:
- Beyond simple greedy margin-based heuristics?
- Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions???)
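A rough sketch of this loop. A real MMMF solve requires semi-definite programming; here a simple alternating-least-squares low-rank fit stands in for MMMF(Y), and |X_ij| stands in for the predicted margin (min-margin heuristic). All names are illustrative.

```python
# A-MMMF sketch: alternate between (re)fitting X and min-margin queries.
import numpy as np

def als_factorize(Y, mask, k=3, iters=20, lam=0.1):
    """Low-rank X = U V' fit to the observed entries of Y (mask == True)."""
    n, m = Y.shape
    rng = np.random.default_rng(3)
    U, V = rng.normal(size=(n, k)), rng.normal(size=(m, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n):                 # update each row factor
            o = mask[i]
            U[i] = np.linalg.solve(V[o].T @ V[o] + I, V[o].T @ Y[i, o])
        for j in range(m):                 # update each column factor
            o = mask[:, j]
            V[j] = np.linalg.solve(U[o].T @ U[o] + I, U[o].T @ Y[o, j])
    return U @ V.T

def a_mmmf(Y_true, mask, rounds=5, s=10):
    # Usage assumption: Y_true holds all entries; mask marks the observed ones.
    for _ in range(rounds):
        X = als_factorize(Y_true * mask, mask)       # 1. X = "MMMF"(Y)
        margin = np.where(mask, np.inf, np.abs(X))   # only unobserved entries
        flat = np.argsort(margin, axis=None)[:s]     # 2. s min-margin queries
        for idx in flat:
            i, j = np.unravel_index(idx, Y_true.shape)
            mask[i, j] = True                        # 3. add new samples to Y
    return als_factorize(Y_true * mask, mask)        # 4. repeat, then final fit
```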
Empirical Results
Network latency prediction
Bandwidth prediction (peer-to-peer)
Movie Ranking Prediction
Sensor net connectivity prediction
Empirical Results: Latency Prediction
[Figures: P2Psim data; NLANR-AMP data]
Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.
Movie Rating Prediction (MovieLens)
Sensor Network Connectivity
Introducing Cost: Exploration vs. Exploitation
[Figures: DownloadGrid (bandwidth prediction); PlanetLab (latency prediction)]
Active sampling → lower prediction errors at lower cost (saves 100s of samples)
(better prediction → better server-assignment decisions → faster downloads)
Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.
Conclusions
Common challenge in many applications: need for cost-efficient sampling
This talk: linear hidden factor models with active sampling
Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications
Future work:
Better active sampling heuristics?
Theoretical analysis of active sampling performance?
Dynamic Matrix Factorizations: tracking time-varying matrices
Incremental MMMF? (solving from scratch every time is too costly)
Some of the most influential papers
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590-604, 1992.
• Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06
• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05
• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
• Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05
• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
• Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04
• T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01
• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
• Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97
• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
• Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91
• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments with Application to Gene Expression Analysis. ICML-05
• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
• Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00
COLT papers
• S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05
• H. S. Seung, M. Opper, H. Sompolinsky. Query by Committee. COLT-92, pages 287-294
• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 6:1579-1619, 2005.
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590-604, 1992.
• D. Haussler, M. Kearns, R. E. Schapire. Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension. Machine Learning, 14:83-113, 1994.
• V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
• M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153-178, 2004.
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html
Entropy [Shannon, 1948]
• A measure of information in a random event X with possible outcomes {x_1, …, x_n}:
  H(X) = -Σ_i p(x_i) log_2 p(x_i)
• Comments on the entropy function:
  - entropy of an event is zero when the outcome is known
  - entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
KL Divergence
• P is the true distribution; the Q distribution is used to encode data instead of P
• KL divergence is the expected extra message length per datum that must be transmitted using Q:
  D_KL(P || Q) = Σ_i P(x_i) log (P(x_i) / Q(x_i))
              = -Σ_i P(x_i) log Q(x_i) + Σ_i P(x_i) log P(x_i)
              = H(P, Q) - H(P)
              = cross-entropy - entropy
• A measure of how "wrong" Q is with respect to the true distribution P
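A quick numeric check of both appendix definitions — a sketch using base-2 logs, so entropies are in bits:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(entropy(p))                      # 1.0 bit: fair coin, maximal entropy
print(entropy([1.0, 0.0]))             # 0 bits: outcome known
cross = -np.sum(p * np.log2(q))        # cross-entropy H(P, Q)
print(kl(p, q), cross - entropy(p))    # equal: D_KL = cross-entropy - entropy
```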
Data + Prior Knowledge → Learner
• Model building
• Parameter estimation
• Causal structure discovery
• Passive learning vs. active learning
[Figure: example Bayesian network over E (Earthquake), B (Burglary), A (Alarm), R, C, with a conditional probability table P(A | E, B)]
• Selective Active Learning
• Interventional Active Learning
• Obtain measure of quality of current model
• Choose query that most improves quality
• Update model
• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (the query) and receives a response x
[Figure: initial network (G, p(θ)) over E, B, A; query q plus training data yield the updated density p′(θ)]
Two questions: how to update the parameter density, and how to select the next query based on p.
Updating after a query on A [Figure: network over nodes J, B, A, M]:
• Do not update A, since we are fixing it
• If we select A, then do not update B: we would be sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B: sampling from P(B | A := a) = P(B)* (*Pearl 2000)
• Update all other nodes as usual
• Obtain the new density p(θ | A := a, X = x)
• Goal: a single parameter estimate θ̂, instead of a distribution p(θ)
• If we choose θ̂ and the true model is θ′, then we incur some loss L(θ′ || θ̂)
• We do not know the true θ′
• The density p represents our beliefs over θ′
• Choose the θ̂ that minimizes the expected loss:
  θ̂ = argmin_θ ∫ p(θ′) L(θ′ || θ) dθ′
• Call θ̂ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ):
  Risk(p) = ∫ p(θ′) L(θ′ || θ̂) dθ′
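For a single Bernoulli parameter with a Beta posterior and KL loss, the Bayesian point estimate is the posterior mean (the minimizer of E[KL(θ′ || θ̂)]), and Risk(p) can be estimated by Monte Carlo. A sketch; the Beta(3, 7) posterior is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 3.0, 7.0                       # illustrative Beta(a, b) posterior p(theta)
theta_hat = a / (a + b)               # Bayesian point estimate under KL loss

def kl_bern(t, t_hat):                # KL(theta' || theta_hat) for Bernoullis
    return t * np.log(t / t_hat) + (1 - t) * np.log((1 - t) / (1 - t_hat))

thetas = np.clip(rng.beta(a, b, size=100_000), 1e-9, 1 - 1e-9)
risk = np.mean(kl_bern(thetas, theta_hat))   # Risk(p) = E_p[ KL(theta' || theta_hat) ]
print(theta_hat, risk)
```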
• Set the controllable variables so as to minimize the expected posterior risk:
  ExPRisk(p | Q := q) = Σ_x P(X = x | Q := q) ∫ p(θ | x) KL(θ || θ̂_x) dθ
  (θ̂_x is the Bayesian point estimate of the posterior after observing x)
• KL divergence is used for the loss; conditional KL-divergence:
  KL(θ || θ′) = Σ_i KL(P_θ(X_i | U_i) || P_θ′(X_i | U_i))
• For each potential query q, compute Risk(X | q)
• Choose the q for which Risk(X | q) is greatest
• Cost of computing Risk(X | q): the cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)
Uncertainty sampling
Maintain a single hypothesis, based on labels seen so far.
Query the point about which this hypothesis is most “uncertain”.
Problem: confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
[Figure: points on a line labeled - - - + + + + + - - - - - -; the single hypothesis's most-uncertain point X need not reflect the diversity of hypotheses consistent with the data]
Region of uncertainty
Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty" = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
Suppose the data lie on a circle in R^2 and hypotheses are linear separators.
[Figure: the spaces X and H superimposed — the current version space and the region of uncertainty in data space]
Region of uncertainty
Algorithm [CAL92]: of the unlabeled points that lie in the region of uncertainty, pick one at random to query.
[Figure: data and hypothesis spaces superimposed (both are the surface of the unit sphere in R^d), showing the current version space and the region of uncertainty in data space]
Region of uncertainty
The number of labels needed depends on H and also on P.
Special case: H = {linear separators in R^d}, P = uniform distribution over the unit sphere.
Then just d log(1/ε) labels are needed to reach a hypothesis with error rate < ε.
[1] Supervised learning: d/ε labels.
[2] Best we can hope for.
Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
For more general distributions: suboptimal…
Need to measure quality of a query – or alternatively, size of version space.
Expected infogain of a sample ⇒ uncertainty sampling!