COMS 6998-4:
Learning and Empirical Inference
Irina Rish
IBM T.J. Watson Research Center
Outline
Motivation
Active learning approaches
Membership queries
Uncertainty Sampling
Information-based loss functions
Uncertainty-Region Sampling
Query by committee
Applications
Active Collaborative Prediction
Active Bayes net learning
Standard supervised learning model
Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: need m to be roughly d/ε, in the realizable case.
Active learning
In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
What is Active Learning?
Unlabeled data are readily available; labels are expensive
Want to use adaptive decisions to choose which labels to acquire for a given dataset
Goal is accurate classifier with minimal cost
Active learning warning
Choice of data is only as good as the model itself
If we assume a linear model, then two data points suffice.
What happens when the data are not linear?
Active Learning Flavors
Selective sampling
- Pool-based vs. sequential
- Myopic vs. batch
Membership queries
Active Learning Approaches
Membership queries
Uncertainty Sampling
Information-based loss functions
Uncertainty-Region Sampling
Query by committee
Problem
Many results in this framework, even for complicated hypothesis classes.
[Baum and Lang, 1991] tried fitting a neural net to handwritten characters.
Synthetic instances created were incomprehensible to humans!
[Lewis and Gale, 1992] tried training text classifiers.
“an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”
Uncertainty Sampling
[Lewis & Gale, 1994]
Query the example that the current classifier is most uncertain about.
If uncertainty is measured as Euclidean distance to the decision boundary: [Figure: the unlabeled points nearest the boundary are the most uncertain]
Used trivially in SVMs, graphical models, etc.
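As a concrete illustration, a minimal pool-based sketch (assuming scikit-learn is available): a linear SVM is refit after each query, and the pooled point closest to the current decision boundary is labeled next. The synthetic pool and the "oracle" labels are illustrative assumptions, not part of any particular system.

```python
# Hedged sketch: uncertainty = Euclidean distance to an SVM decision boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 2))
y_oracle = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # hidden true labels

# seed with a few labels from each class
labeled = list(np.flatnonzero(y_oracle == 0)[:5]) + list(np.flatnonzero(y_oracle == 1)[:5])

for _ in range(20):  # 20 label queries
    clf = SVC(kernel="linear").fit(X_pool[labeled], y_oracle[labeled])
    candidates = np.array([i for i in range(len(X_pool)) if i not in labeled])
    dist = np.abs(clf.decision_function(X_pool[candidates]))  # distance to boundary
    labeled.append(int(candidates[np.argmin(dist)]))          # query the most uncertain point
```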
Information-based Loss Function
[MacKay, 1992]
Maximize KL-divergence between posterior and prior
Maximize reduction in model entropy between posterior and prior
Minimize cross-entropy between posterior and prior
All of these are notions of information gain
Query by Committee
[Seung et al. 1992, Freund et al. 1997]
Prior distribution over hypotheses
Samples a set of classifiers from the distribution
Queries an example based on the degree of disagreement among the committee of classifiers [Figure: committee members A, B, C disagree most on points near the boundary]
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q
D:
t  Sex  Age    TestA  TestB  TestC  Disease
1  M    40-50  0      1      1      ?
2  F    50-60  0      1      0      ?
3  F    30-40  0      0      0      ?
4  F    60+    1      1      1      ?
5  M    10-20  0      1      0      ?
6  M    40-50  0      0      0      ?
7  F    0-10   0      0      0      ?
8  M    30-40  1      1      1      ?
9  M    20-30  0      0      1      ?
Probabilistic Classifier
[Figure: two-node model, class S_t generating observations O_t]
Notation:
T : number of examples
O_t : vector of features of example t
S_t : class of example t
Patient state (S_t):
S_t : DiseaseState
Patient observations (O_t):
O_t1 : Gender
O_t2 : Age
O_t3 : TestA
O_t4 : TestB
O_t5 : TestC
[Figure: two candidate model structures over S, Gender, Age, TestA, TestB, TestC — e.g., a naive Bayes structure with S as parent of every feature, and an alternative structure with dependencies among the features]
Model:
[Figure: S_t → O_t]
Model parameters: P(S_t), P(O_t | S_t)
Generative model: must be able to compute P(S_t = i, O_t = o_t | w)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W) = P(W) ∏_{t=1}^{T} P(S_t, O_t | W)
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q: q(W, D) returns t*, the next example to label
while NotDone
• Learn P( W | D )
• q chooses next example to label
• Expert adds label to D
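As code, this loop might look like the following skeleton; learn_posterior, q, and ask_expert are hypothetical placeholders for the model fit, the query heuristic, and the human labeler — not the API of any particular library.

```python
# Skeleton of the active-learning loop above, with placeholder callables.
def active_learning(D_labeled, D_unlabeled, budget, learn_posterior, q, ask_expert):
    for _ in range(budget):                       # "while NotDone"
        P_W_given_D = learn_posterior(D_labeled)  # learn P(W | D)
        t_star = q(P_W_given_D, D_unlabeled)      # q chooses the next example
        x = D_unlabeled.pop(t_star)
        D_labeled.append((x, ask_expert(x)))      # expert adds the label to D
    return learn_posterior(D_labeled)
```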
[Figure: a pool of unlabeled examples (O_1, S_1?), …, (O_7, S_7?); the query algorithm q ponders ("hmm…") which one to ask about]
• Pool ("random access" to patients)
• Sequential (must decide as patients walk in the door)
q
• Recall: q(W, D) returns the "most interesting" unlabeled example.
• Well, what makes a doctor curious about a patient?
Strategy #1: Uncertainty Sampling [Lewis & Gale, 1994]
score_uncert(S_t) = H(S_t) = -Σ_i P(S_t = i) log P(S_t = i),
where P(S_t) is the predictive distribution P(S_t | O_t, D).
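A small sketch of this score; base-10 logarithms are assumed here, since they reproduce the H(S_t) values in the tables below.

```python
# Entropy-based uncertainty score for each unlabeled example (binary class).
import numpy as np

def score_uncert(p_disease):
    """H(S_t) for a binary class with P(S_t = 1) = p_disease, base-10 logs."""
    p = np.clip(np.asarray(p_disease, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log10(p) + (1 - p) * np.log10(1 - p))

p = np.array([0.02, 0.01, 0.05, 0.12, 0.01, 0.96])  # P(S_t) from the table below
print(np.round(score_uncert(p), 3))   # [0.043 0.024 0.086 0.159 0.024 0.073]
print(int(np.argmax(score_uncert(p))) + 1)  # query t = 4, the highest entropy
```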
t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  H(S_t)
1  M    20-30  0      1      1      ?    0.02    0.043
2  F    20-30  0      1      0      ?    0.01    0.024
3  F    30-40  1      0      0      ?    0.05    0.086
4  F    60+    1      1      0      ?    0.12    0.159
5  M    10-20  0      1      0      ?    0.01    0.024
6  M    20-30  1      1      1      ?    0.96    0.073
After querying the highest-entropy example (t = 4) and updating the model:

t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  H(S_t)
1  M    20-30  0      1      1      ?    0.01    0.024
2  F    20-30  0      1      0      ?    0.02    0.043
3  F    30-40  1      0      0      ?    0.04    0.073
4  F    60+    1      1      0      0    0.00    0.00
5  M    10-20  0      1      0      ?    0.06    0.112
6  M    20-30  1      1      1      ?    0.97    0.059
GOOD: couldn’t be easier
GOOD: often performs pretty well
BAD: H(S_t) measures information gain about the samples, not the model
BAD: sensitive to noisy samples
Information-based Loss Functions [MacKay, 1992]
Informative with respect to what?
[Figure: three posteriors P(W | D) over model space W, progressively narrowing — from H(W) high, to better, to H(W) = 0]
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) - H(W | S_t), where H(W) is the current model-space entropy and H(W | S_t) is the expected model-space entropy if we learn S_t
score_IG(S_t) = MI(S_t ; W) = H(W) - H(W | S_t)
We usually can't just sum over all models to get these entropies:
H(W) = -∫ P(w) log P(w) dw
…but we can sample a committee C of models from P(W | D):
H(W) ≈ H(C) = -Σ_{c ∈ C} P(c) log P(c)
H(W) = -∫ P(w) log P(w) dw
H(W | S_t = i) = -∫ P(w | S_t = i) log P(w | S_t = i) dw
H(W | S_t) = -Σ_i P(S_t = i) ∫ P(w | S_t = i) log P(w | S_t = i) dw = Σ_i P(S_t = i) H(W | S_t = i)
score_IG(S_t) = H(C) - H(C | S_t)
t  Sex  Age    TestA  TestB  TestC  S_t  P(S_t)  Score = H(C) - H(C|S_t)
1  M    20-30  0      1      1      ?    0.02    0.53
2  F    20-30  0      1      1      ?    0.01    0.58
3  F    30-40  1      0      0      ?    0.05    0.40
4  F    60+    1      1      0      ?    0.12    0.49
5  M    10-20  0      1      0      ?    0.01    0.57
6  M    20-30  0      0      1      ?    0.02    0.52

Note: the most informative query (t = 2, score 0.58) is not the most uncertain one (t = 4).
score_IG(S_t) = H(C) - H(C | S_t) = H(S_t) - H(S_t | C)
Familiar?
score_Uncertain(S_t) = H(S_t)
score_InfoGain(S_t) = H(S_t) - H(S_t | C)
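A sketch of this estimator in its committee form: H(S_t) is the entropy of the committee-averaged prediction, and H(S_t | C) is the weighted average of each member's predictive entropy. The preds matrix and the committee weights here are illustrative assumptions.

```python
# score_InfoGain(S_t) = H(S_t) - H(S_t | C), with a committee C sampled from P(W | D).
import numpy as np

def H(p):  # binary entropy, elementwise
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def score_infogain(preds, weights):
    """preds: (|C|, T) matrix of P(S_t = 1 | c); weights: posterior P(c)."""
    weights = np.asarray(weights) / np.sum(weights)
    p_mean = weights @ preds            # P(S_t) = sum_c P(c) P(S_t | c)
    H_St = H(p_mean)                    # H(S_t)
    H_St_given_C = weights @ H(preds)   # H(S_t | C) = sum_c P(c) H(S_t | c)
    return H_St - H_St_given_C          # mutual information MI(S_t; C)

preds = np.array([[0.9, 0.6, 0.1],
                  [0.1, 0.6, 0.2]])        # two committee members, three examples
print(score_infogain(preds, [0.5, 0.5]))   # example 0 (disagreement) scores highest
```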
If our objective is to reduce the prediction error, then
“the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries ”
Strategy #2: Query by Committee
Temporary assumptions:
- Pool → Sequential
- P(W | D) → Version space
- Probabilistic → Noiseless
QBC attacks the size of the "version space".
[Figure: two models sampled from the version space both predict FALSE for the incoming example — no disagreement, no query]
[Figure: both sampled models predict TRUE — again no disagreement, no query]
[Figure: model #1 predicts FALSE and model #2 predicts TRUE — the committee disagrees, so this example is queried]
Ooh, now we're going to learn something for sure!
One of them is definitely wrong.
The Original QBC Algorithm
As each example arrives…
1. Choose a committee, C (usually of size 2), randomly from the version space
2. Have each member of C classify it
3. If the committee disagrees, select it (query its label)
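A toy sketch of this loop for linear separators through the origin in R^2. The version-space sampler here is naive rejection sampling over random unit vectors consistent with the labels so far — an illustrative assumption, not an efficient implementation.

```python
# Sequential QBC with a size-2 committee and a rejection-sampled version space.
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, 2.0]); w_true /= np.linalg.norm(w_true)
label = lambda w, x: int(np.dot(w, x) > 0)

def sample_version_space(X, y, n=2, tries=10000):
    """Rejection-sample n hypotheses consistent with the labeled set (X, y)."""
    out = []
    for _ in range(tries):
        w = rng.normal(size=2); w /= np.linalg.norm(w)
        if all(label(w, x) == yi for x, yi in zip(X, y)):
            out.append(w)
            if len(out) == n:
                return out
    raise RuntimeError("version space too small to hit by rejection")

X_lab, y_lab, queries = [], [], 0
for t in range(200):                                # stream of unlabeled points
    x_t = rng.normal(size=2)
    h1, h2 = sample_version_space(X_lab, y_lab)     # committee of size 2
    if label(h1, x_t) != label(h2, x_t):            # committee disagrees?
        X_lab.append(x_t); y_lab.append(label(w_true, x_t)); queries += 1
print(f"{queries} labels queried out of 200 points seen")
```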
Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]
First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.
[Figure: hypothesis space H] Which pair of hypotheses is closest? It depends on the data distribution P.
Distance measure on H: d(h, h') = P(h(x) ≠ h'(x))
Query-by-committee
First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.
To keep things simple, say d(h, h') = Euclidean distance.
[Figure: hypothesis space H — the volume shrinks, but the error is likely to remain large!]
Query-by-committee
Elegant scheme which decreases volume in a manner which is sensitive to the data distribution.
Bayesian setting: given a prior π on H, set H_1 = H.
For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1} ⊆ H_t accordingly
[Informally: is there a lot of disagreement about x_t within H_t?]
Problem: how to implement it efficiently?
Query-by-committee
For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1}
Observation: the probability of getting the pair (h, h') in the inner loop (when a query is made) is proportional to π(h) π(h') d(h, h').
[Figure: version space H_t]
Query-by-committee
Label bound: for H = {linear separators in R^d} and P = uniform distribution, just d log(1/ε) labels are needed to reach a hypothesis with error < ε.
Implementation: need to randomly pick h according to (π, H_t), e.g. H = {linear separators in R^d}, π = uniform distribution.
[Figure: the version space H_t is a convex region]
How do you pick a random point from a convex body?
Sampling from convex bodies
By random walk!
1. Ball walk
2. Hit-and-run
[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
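A sketch of the hit-and-run walk over a convex body given as {x : Ax ≤ b} (assumed bounded, with a strictly interior start point); in QBC the body would be the version space of linear separators. This is illustrative, not a tuned implementation.

```python
# Hit-and-run sketch: sample approximately uniformly from {x : Ax <= b}.
import numpy as np

def hit_and_run(A, b, x0, steps=1000, seed=2):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)        # must start strictly inside the body
    for _ in range(steps):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)             # random direction through x
        # constraints on x + t*d:  (A d) t <= b - A x
        ad, slack = A @ d, b - A @ x
        t_hi = np.min(slack[ad > 0] / ad[ad > 0])   # forward chord endpoint
        t_lo = np.max(slack[ad < 0] / ad[ad < 0])   # backward chord endpoint
        x = x + rng.uniform(t_lo, t_hi) * d         # uniform point on the chord
    return x

# Example: the box [-1, 1]^2, written as four halfspace constraints.
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.ones(4)
print(hit_and_run(A, b, x0=[0.0, 0.0]))
```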
Some challenges
[1] For linear separators, analyze the label complexity for some distribution other than uniform!
[2] How to handle nonseparable data? Need a robust base learner.
[Figure: noisy labels on both sides of the true boundary]
Approach: Collaborative Prediction (CP)
[Figure: two partially observed matrices — a client × server QoS matrix (e.g., bandwidth) and a user × movie ratings matrix — with many "?" entries]
Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict unobserved ratings:
- will Alina like "The Matrix"? (unlikely)
- will Client 86 have fast download from Server 39?
- will member X of the funding committee approve our project Y?
Collaborative Prediction = Matrix Approximation
[Figure: a large, sparsely observed client × server matrix (e.g., 100 servers)]
• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes
• Approaches: mainly factorized models assuming hidden "factors" that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)
[Figure: a user's rating row, e.g. (2 4 5 1 4 2), expressed through the user's weights on hidden factors]
Assumptions:
- there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights on such factors; a user's ratings are weighted (linear) combinations of the movies' values
[Figure: a sparsely observed ratings matrix Y and a fully filled-in rank-k approximation X]
Objective: find a factorizable X = UV' that approximates Y:
X = argmin_{X'} Loss(X', Y), subject to "regularization" constraints (e.g., rank(X) < k)
Loss function: depends on the nature of your problem
Matrix Factorization Approaches
• Singular value decomposition (SVD): low-rank approximation (see the sketch below)
  - assumes a fully observed Y and sum-squared loss
• In collaborative prediction, Y is only partially observed
  - low-rank approximation becomes a non-convex problem with many local minima
• Furthermore, we may not want sum-squared loss, but instead:
  - accurate predictions (0/1 loss, approximated by hinge loss)
  - cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  - ranking cost (e.g., suggest the k "best" movies for a user)
  - NON-CONVEX PROBLEMS!
• Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]:
  - replaces the bounded-rank constraint by a bound on the norms of the U, V' vectors
  - a convex optimization problem! can be solved exactly by semi-definite programming
  - strongly related to learning max-margin classifiers (SVMs)
• Exploit MMMF's properties to augment it with active sampling!
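For intuition on the first bullet, a quick sketch assuming numpy: for a fully observed matrix, truncated SVD gives the best sum-squared-loss rank-k approximation (Eckart-Young) — and it is exactly this recipe that breaks once entries are missing.

```python
# Best rank-k approximation of a fully observed matrix via truncated SVD.
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(7, 5)) @ rng.normal(size=(5, 9))   # fully observed 7x9 matrix
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
k = 2
X = (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k approximation X = U V'
print(np.linalg.norm(Y - X))          # minimal sum-squared error over rank-k X
```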
Key Idea of MMMF
Rows of U are feature vectors; columns of V' are linear classifiers (weight vectors)
[Figure: samples and a separating line; the "margin" here = Dist(sample, line)]
X_ij = sign_ij × margin_ij
Predictor_ij = sign_ij: if sign_ij > 0, classify as +1, otherwise classify as -1
MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers
Active Learning with MMMF
- We extend MMMF to Active-MMMF using margin-based active sampling
- We investigate exploitation vs. exploration trade-offs imposed by different heuristics
[Figure: matrix of predicted margins for the unobserved entries]
Margin-based heuristics:
- min-margin (most uncertain)
- min-margin positive ("good" uncertain)
- max-margin ("safe" choice, but no information)
Active Max-Margin Matrix Factorization: A-MMMF(M, s)
1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
2. Using current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an "enforced" download)
3. Add the new samples to Y
4. Repeat 1-3
Issues:
- Beyond simple greedy margin-based heuristics?
- Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions???)
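A rough sketch of this loop. A real MMMF solve requires semi-definite programming; here a simple alternating-least-squares low-rank fit stands in for MMMF(Y), and |X_ij| stands in for the predicted margin (min-margin heuristic). All names are illustrative.

```python
# A-MMMF sketch: alternate between (re)fitting X and min-margin queries.
import numpy as np

def als_factorize(Y, mask, k=3, iters=20, lam=0.1):
    """Low-rank X = U V' fit to the observed entries of Y (mask == True)."""
    n, m = Y.shape
    rng = np.random.default_rng(3)
    U, V = rng.normal(size=(n, k)), rng.normal(size=(m, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n):                 # update each row factor
            o = mask[i]
            U[i] = np.linalg.solve(V[o].T @ V[o] + I, V[o].T @ Y[i, o])
        for j in range(m):                 # update each column factor
            o = mask[:, j]
            V[j] = np.linalg.solve(U[o].T @ U[o] + I, U[o].T @ Y[o, j])
    return U @ V.T

def a_mmmf(Y_true, mask, rounds=5, s=10):
    # Usage assumption: Y_true holds all entries; mask marks the observed ones.
    for _ in range(rounds):
        X = als_factorize(Y_true * mask, mask)       # 1. X = "MMMF"(Y)
        margin = np.where(mask, np.inf, np.abs(X))   # only unobserved entries
        flat = np.argsort(margin, axis=None)[:s]     # 2. s min-margin queries
        for idx in flat:
            i, j = np.unravel_index(idx, Y_true.shape)
            mask[i, j] = True                        # 3. add new samples to Y
    return als_factorize(Y_true * mask, mask)        # 4. repeat, then final fit
```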
Empirical Results
Network latency prediction
Bandwidth prediction (peer-to-peer)
Movie Ranking Prediction
Sensor net connectivity prediction
Empirical Results: Latency Prediction
[Figures: P2Psim data; NLANR-AMP data]
Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.
Movie Rating Prediction (MovieLens)
Sensor Network Connectivity
Introducing Cost: Exploration vs. Exploitation
[Figures: DownloadGrid (bandwidth prediction); PlanetLab (latency prediction)]
Active sampling → lower prediction errors at lower cost (saves 100s of samples)
(better prediction → better server-assignment decisions → faster downloads)
Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.
Conclusions
Common challenge in many applications: need for cost-efficient sampling
This talk: linear hidden factor models with active sampling
Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications
Future work:
Better active sampling heuristics?
Theoretical analysis of active sampling performance?
Dynamic Matrix Factorizations: tracking time-varying matrices
Incremental MMMF? (solving from scratch every time is too costly)
Some of the most influential papers
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590-604, 1992.
• Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06
• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05
• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
• Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05
• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
• Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04
• T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01
• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
• Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97
• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
• Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91
• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments with Application to Gene Expression Analysis. ICML-05
• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
• Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00
COLT papers
• S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05
• H. S. Seung, M. Opper, H. Sompolinsky. Query by Committee. COLT-92, pages 287-294
• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 6:1579-1619, 2005.
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, 2:45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590-604, 1992.
• D. Haussler, M. Kearns, R. E. Schapire. Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension. Machine Learning, 14:83-113, 1994.
• V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
• M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153-178, 2004.
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html
Entropy [Shannon, 1948]
• A measure of information in a random event X with possible outcomes {x_1, …, x_n}:
  H(X) = -Σ_i p(x_i) log_2 p(x_i)
• Comments on the entropy function:
  - entropy of an event is zero when the outcome is known
  - entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
KL Divergence
• P is the true distribution; the Q distribution is used to encode data instead of P
• KL divergence is the expected extra message length per datum that must be transmitted using Q:
  D_KL(P || Q) = Σ_i P(x_i) log (P(x_i) / Q(x_i))
              = -Σ_i P(x_i) log Q(x_i) + Σ_i P(x_i) log P(x_i)
              = H(P, Q) - H(P)
              = cross-entropy - entropy
• A measure of how "wrong" Q is with respect to the true distribution P
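A quick numeric check of both appendix definitions — a sketch using base-2 logs, so entropies are in bits:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(entropy(p))                      # 1.0 bit: fair coin, maximal entropy
print(entropy([1.0, 0.0]))             # 0 bits: outcome known
cross = -np.sum(p * np.log2(q))        # cross-entropy H(P, Q)
print(kl(p, q), cross - entropy(p))    # equal: D_KL = cross-entropy - entropy
```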
Data + Prior Knowledge → Learner
• Model building
• Parameter estimation
• Causal structure discovery
• Passive learning vs. active learning
[Figure: example Bayesian network over E (Earthquake), B (Burglary), A (Alarm), R, C, with a conditional probability table P(A | E, B)]
• Selective Active Learning
• Interventional Active Learning
• Obtain measure of quality of current model
• Choose query that most improves quality
• Update model
• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (the query) and receives a response x
[Figure: initial network (G, p(θ)) over E, B, A; query q plus training data yield the updated density p′(θ)]
Two questions: how to update the parameter density, and how to select the next query based on p.
Updating after a query on A [Figure: network over nodes J, B, A, M]:
• Do not update A, since we are fixing it
• If we select A, then do not update B: we would be sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B: sampling from P(B | A := a) = P(B)* (*Pearl 2000)
• Update all other nodes as usual
• Obtain the new density p(θ | A := a, X = x)
• Goal: a single parameter estimate θ̂, instead of a distribution p(θ)
• If we choose θ̂ and the true model is θ′, then we incur some loss L(θ′ || θ̂)
• We do not know the true θ′
• The density p represents our beliefs over θ′
• Choose the θ̂ that minimizes the expected loss:
  θ̂ = argmin_θ ∫ p(θ′) L(θ′ || θ) dθ′
• Call θ̂ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ):
  Risk(p) = ∫ p(θ′) L(θ′ || θ̂) dθ′
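For a single Bernoulli parameter with a Beta posterior and KL loss, the Bayesian point estimate is the posterior mean (the minimizer of E[KL(θ′ || θ̂)]), and Risk(p) can be estimated by Monte Carlo. A sketch; the Beta(3, 7) posterior is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 3.0, 7.0                       # illustrative Beta(a, b) posterior p(theta)
theta_hat = a / (a + b)               # Bayesian point estimate under KL loss

def kl_bern(t, t_hat):                # KL(theta' || theta_hat) for Bernoullis
    return t * np.log(t / t_hat) + (1 - t) * np.log((1 - t) / (1 - t_hat))

thetas = np.clip(rng.beta(a, b, size=100_000), 1e-9, 1 - 1e-9)
risk = np.mean(kl_bern(thetas, theta_hat))   # Risk(p) = E_p[ KL(theta' || theta_hat) ]
print(theta_hat, risk)
```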
• Set the controllable variables so as to minimize the expected posterior risk:
  ExPRisk(p | Q := q) = Σ_x P(X = x | Q := q) ∫ p(θ | x) KL(θ || θ̂_x) dθ
  (θ̂_x is the Bayesian point estimate of the posterior after observing x)
• KL divergence is used for the loss; conditional KL-divergence:
  KL(θ || θ′) = Σ_i KL(P_θ(X_i | U_i) || P_θ′(X_i | U_i))
• For each potential query q, compute Risk(X | q)
• Choose the q for which Risk(X | q) is greatest
• Cost of computing Risk(X | q): the cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)
Uncertainty sampling
Maintain a single hypothesis, based on labels seen so far.
Query the point about which this hypothesis is most “uncertain”.
Problem: confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
[Figure: points on a line labeled - - - + + + + + - - - - - -; the single hypothesis's most-uncertain point X need not reflect the diversity of hypotheses consistent with the data]
Region of uncertainty
Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty" = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
Suppose the data lie on a circle in R^2 and hypotheses are linear separators.
[Figure: the spaces X and H superimposed — the current version space and the region of uncertainty in data space]
Region of uncertainty
Algorithm [CAL92]: of the unlabeled points that lie in the region of uncertainty, pick one at random to query.
[Figure: data and hypothesis spaces superimposed (both are the surface of the unit sphere in R^d), showing the current version space and the region of uncertainty in data space]
Region of uncertainty
The number of labels needed depends on H and also on P.
Special case: H = {linear separators in R^d}, P = uniform distribution over the unit sphere.
Then just d log(1/ε) labels are needed to reach a hypothesis with error rate < ε.
[1] Supervised learning: d/ε labels.
[2] Best we can hope for.
Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
For more general distributions: suboptimal…
Need to measure quality of a query – or alternatively, size of version space.
Expected infogain of a sample ⇒ uncertainty sampling!