Active Learning

COMS 6998-4:

Learning and Empirical Inference

Irina Rish

IBM T.J. Watson Research Center

Outline

• Motivation
• Active learning approaches
  - Membership queries
  - Uncertainty Sampling
  - Information-based loss functions
  - Uncertainty-Region Sampling
  - Query by committee
• Applications
  - Active Collaborative Prediction
  - Active Bayes net learning

2

Standard supervised learning model

Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.

VC theory: we need m to be roughly d/ε, in the realizable case.

3

Active learning

In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.

What is the minimum number of labels needed to achieve the target error rate?

4

5

What is Active Learning?

• Unlabeled data are readily available; labels are expensive
• Want to use adaptive decisions to choose which labels to acquire for a given dataset
• Goal is an accurate classifier with minimal labeling cost

6

Active learning warning

The choice of data is only as good as the model itself.

If we assume a linear model, then two data points are sufficient.

What happens when the data are not linear?

7

Active Learning Flavors

• Selective sampling
  - Pool-based vs. sequential
  - Myopic vs. batch
• Membership queries

8

Active Learning Approaches

• Membership queries
• Uncertainty Sampling
• Information-based loss functions
• Uncertainty-Region Sampling
• Query by committee

9


Problem

Many theoretical results exist in this (membership query) framework, even for complicated hypothesis classes.

[Baum and Lang, 1991] tried fitting a neural net to handwritten characters.

Synthetic instances created were incomprehensible to humans!

[Lewis and Gale, 1992] tried training text classifiers.

“an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”

12


Uncertainty Sampling

[Lewis & Gale, 1994]

• Query the example that the current classifier is most uncertain about
• If uncertainty is measured as Euclidean distance, the query is the unlabeled point closest to the current decision boundary
• Used trivially in SVMs, graphical models, etc.

16
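As an illustration of the SVM case above, here is a minimal sketch (assuming scikit-learn's SVC and a small synthetic labeled set and pool, which are not part of the original slides) that queries the unlabeled example closest to the current decision boundary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Tiny labeled seed set and a larger unlabeled pool (synthetic, illustrative data).
X_labeled = np.array([[-1.0, -1.0], [-2.0, 0.5], [1.0, 1.5], [2.0, 0.3]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = rng.normal(size=(200, 2))

# Fit the current classifier on the labels seen so far.
clf = SVC(kernel="linear").fit(X_labeled, y_labeled)

# Uncertainty = distance to the separating hyperplane; query the closest pool point.
margins = np.abs(clf.decision_function(X_pool))
query_idx = int(np.argmin(margins))
print("next query:", query_idx, X_pool[query_idx])
```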

Information-based Loss Function

[MacKay, 1992]

• Maximize the KL-divergence between posterior and prior
• Maximize the reduction in model entropy between posterior and prior
• Minimize the cross-entropy between posterior and prior
• All of these are notions of information gain

17

Query by Committee

[Seung et al. 1992, Freund et al. 1997]

• Maintain a prior distribution over hypotheses
• Sample a set of classifiers from this distribution
• Query an example based on the degree of disagreement among the committee of classifiers

18

Infogain-based Active Learning

Notation

We Have:

1. Dataset, D

2. Model parameter space, W

3. Query algorithm, q

20


Dataset (D) Example

Sex   Age     Test A   Test B   Test C   Disease
M     40-50   0        1        1        ?
F     50-60   0        ·        ·        ?
F     30-40   0        ·        ·        ?
F     60+     1        ·        ·        ?
M     10-20   0        ·        ·        ?
M     40-50   0        ·        ·        ?
F     0-10    0        ·        ·        ?
M     30-40   1        ·        ·        ?
M     20-30   0        ·        ·        ?

21

Notation

We Have:

1. Dataset, D

2. Model parameter space, W

3. Query algorithm, q

22

Model Example

S_t → O_t : a probabilistic classifier.

Notation:
  T   : number of examples
  O_t : vector of features of example t
  S_t : class of example t

23

Model Example

Patient state (S_t):
  S_t  : DiseaseState

Patient observations (O_t):
  O_t1 : Gender
  O_t2 : Age
  O_t3 : TestA
  O_t4 : TestB
  O_t5 : TestC

24

Possible Model Structures

Two candidate graphical-model structures over the nodes S, Gender, Age, TestA, TestB, and TestC (different arrangements of edges between the class node S and the observation nodes).

25

Model Space

Model: S_t → O_t

Model parameters: P(S_t) and P(O_t | S_t)

Generative model: must be able to compute P(S_t = i, O_t = o_t | w)

26

Model Parameter Space (W)

• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models:

  P(W | D) ∝ P(D | W) P(W) = [ ∏_{t=1..T} P(S_t, O_t | W) ] P(W)

27

Notation

We Have:

1. Dataset, D

2. Model parameter space, W

3. Query algorithm, q: q(W, D) returns t*, the next sample to label

28

Game

while NotDone

• Learn P( W | D )

• q chooses next example to label

• Expert adds label to D

29
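A minimal Python sketch of this loop; the model, query strategy, and oracle are all stand-ins, and any concrete learner or scoring rule from the following slides can be plugged in:

```python
def active_learning_loop(model, query, oracle, X_pool, labeled, budget):
    """Generic pool-based active learning loop (schematic).

    model   -- object with fit(X, y), plus whatever query() needs (hypothetical interface)
    query   -- function (model, X_pool, unlabeled_idx) -> index of next example
    oracle  -- function idx -> true label (the human expert)
    labeled -- dict mapping index -> label for the initial seed set
    """
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    for _ in range(budget):
        X = [X_pool[i] for i in labeled]
        y = [labeled[i] for i in labeled]
        model.fit(X, y)                            # learn P(W | D)
        t_star = query(model, X_pool, unlabeled)   # q chooses next example to label
        labeled[t_star] = oracle(t_star)           # expert adds label to D
        unlabeled.remove(t_star)
    return model
```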

Simulation

A stream of examples (O_1, S_1), …, (O_7, S_7): some labels S_t are already known, others are still unlabeled ('?'), and the query algorithm q ponders ('hmm…') which unlabeled example to ask about next.

30

Active Learning Flavors

• Pool

(“random access” to patients)

• Sequential

(must decide as patients walk in the door)

31

q = ?

• Recall: q(W, D) returns the "most interesting" unlabeled example.

• Well, what makes a doctor curious about a patient?

32


Score Function

score_uncert(S_t) = uncertainty( P(S_t | O_t) )

e.g., entropy:  H(S_t) = - Σ_i P(S_t = i) log P(S_t = i)

34
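A small sketch of this score: the entropy of the predicted class distribution, computed with numpy on illustrative probabilities (the log base only rescales the score, so the argmax is unchanged):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(S_t) = -sum_i P(S_t = i) log P(S_t = i)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

# P(S_t | O_t) for a few unlabeled examples (placeholder values).
posteriors = [[0.02, 0.98], [0.12, 0.88], [0.50, 0.50]]
scores = [entropy(p) for p in posteriors]
t_star = int(np.argmax(scores))   # query the most uncertain example
print(scores, "-> query example", t_star)
```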

Uncertainty Sampling Example

t   Sex   Age     Test A   Test B   Test C   S_t   P(S_t)   H(S_t)
1   M     20-30   ·        ·        ·        ?     0.02     0.043
2   F     20-30   ·        ·        ·        ?     0.01     0.024
3   F     30-40   1        0        0        ?     0.05     0.086
4   F     60+     1        1        0        ?     0.12     0.159
5   M     10-20   ·        ·        ·        ?     0.01     0.024
6   M     20-30   ·        ·        ·        ?     0.96     0.073

Uncertainty sampling queries the example with the largest H(S_t): here t = 4.

35

Uncertainty Sampling Example

After querying: example t = 4 now has zero uncertainty, and the remaining posteriors have shifted.

t   Sex   Age     Test A   Test B   Test C   S_t   P(S_t)   H(S_t)
1   M     20-30   ·        ·        ·        ?     0.01     0.024
2   F     20-30   ·        ·        ·        ?     0.02     0.043
3   F     30-40   1        0        0        ?     0.04     0.073
4   F     60+     1        1        0        ·     0.00     0.00
5   M     10-20   ·        ·        ·        ?     0.06     0.112
6   M     20-30   ·        ·        ·        ?     0.97     0.059

36

Uncertainty Sampling

GOOD: couldn’t be easier

GOOD: often performs pretty well

BAD: H(S_t) measures information gain about the samples, not the model; it is sensitive to noisy samples

37

Can we do better than uncertainty sampling?

38


Informative with respect to what?

39

Model Entropy

Three sketches of the posterior P(W | D) over the parameter space W: a flat one with H(W) high, a more concentrated one (…better…), and one peaked on a single model with H(W) = 0.

40

Information-Gain

• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) - H(W | S_t), where H(W) is the current model-space entropy and H(W | S_t) is the expected model-space entropy if we learn S_t

41

Score Function

score_IG(S_t) = MI(S_t ; W) = H(W) - H(W | S_t)

42

We usually can’t just sum over all models to get H(S t

| W )

H ( W )

   w

P ( w ) log P ( w ) dw

…but we can sample from P( W | D )

H ( W )

H

  c

C

( C )

P ( c ) log P ( c )

43

Conditional Model Entropy

H(W) = - ∫_w P(w) log P(w) dw

H(W | S_t = i) = - ∫_w P(w | S_t = i) log P(w | S_t = i) dw

H(W | S_t) = - Σ_i P(S_t = i) ∫_w P(w | S_t = i) log P(w | S_t = i) dw

44

Score Function

score_IG(S_t) = H(C) - H(C | S_t)

45
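A rough sketch of this committee approximation; everything here is illustrative: the committee weights stand in for P(c), and the per-member class probabilities P(S_t = i | c) would come from whatever probabilistic classifier is being used:

```python
import numpy as np

def infogain_score(committee_weights, class_probs):
    """score_IG(S_t) = H(C) - H(C | S_t), estimated over a sampled committee.

    committee_weights -- P(c) for each committee member c (sums to 1)
    class_probs       -- array [n_members, n_classes]: P(S_t = i | c) per member
    """
    pc = np.asarray(committee_weights, dtype=float)
    p_s_given_c = np.asarray(class_probs, dtype=float)

    def H(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    h_c = H(pc)                                     # H(C)
    p_s = pc @ p_s_given_c                          # P(S_t = i) = sum_c P(c) P(S_t = i | c)
    h_c_given_s = 0.0
    for i, p_i in enumerate(p_s):
        if p_i == 0:
            continue
        pc_given_i = pc * p_s_given_c[:, i] / p_i   # P(c | S_t = i) by Bayes' rule
        h_c_given_s += p_i * H(pc_given_i)
    return h_c - h_c_given_s                        # mutual information between S_t and C

# Example: 3 equally weighted committee members that disagree about a binary S_t.
print(infogain_score([1/3, 1/3, 1/3], [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
```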

t   Sex   Age     Test A   Test B   Test C   S_t   P(S_t)   Score = H(C) - H(C|S_t)
1   M     20-30   ·        ·        ·        ?     0.02     0.53
2   F     20-30   ·        ·        ·        ?     0.01     0.58
3   F     30-40   1        0        0        ?     0.05     0.40
4   F     60+     1        1        0        ?     0.12     0.49
5   M     10-20   ·        ·        ·        ?     0.01     0.57
6   M     20-30   ·        ·        ·        ?     0.02     0.52

Note that the infogain score ranks the examples differently from raw uncertainty: the highest-scoring example (t = 2) is not the one with the largest H(S_t).

46

Score Function

score_IG(S_t) = H(C) - H(C | S_t) = H(S_t) - H(S_t | C)

Familiar?

47

Uncertainty Sampling & Information Gain

score_Uncertain(S_t) = H(S_t)

score_InfoGain(S_t) = H(S_t) - H(S_t | C)

48
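The equality between the model-side and sample-side forms above is just the symmetry of mutual information between the label S_t and the committee approximation C; written out:

```latex
\begin{align*}
\mathrm{score}_{\mathrm{IG}}(S_t) \;=\; I(S_t;\,C)
  \;=\; H(C) - H(C \mid S_t)
  \;=\; H(S_t) - H(S_t \mid C).
\end{align*}
```

Uncertainty sampling keeps only the H(S_t) term and drops the correction H(S_t | C).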

But there is a problem…

49

If our objective is to reduce the prediction error, then

“the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries ”

50

Strategy #2: Query by Committee

Temporary assumptions:
  Pool → Sequential
  P(W | D) → Version space
  Probabilistic → Noiseless

QBC attacks the size of the "version space".

51

A stream of examples (O_1, S_1), …, (O_7, S_7) arrives. On the current example, committee members Model #1 and Model #2 both predict FALSE.

52

On the next example, Model #1 and Model #2 both predict TRUE.

53

On this example, Model #1 predicts FALSE while Model #2 predicts TRUE.

Ooh, now we're going to learn something for sure!

One of them is definitely wrong.

54

The Original QBC Algorithm

As each example arrives…
1. Choose a committee, C (usually of size 2), randomly from the version space
2. Have each member of C classify it
3. If the committee disagrees, select it

55
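A minimal sketch of this stream-based loop; `sample_version_space` and `oracle` are hypothetical stand-ins (a sampler returning a hypothesis consistent with the labels seen so far, and the labeler), and each hypothesis is assumed to be a callable x -> label:

```python
def qbc_stream(stream, sample_version_space, oracle, committee_size=2):
    """Original query-by-committee: query a label only where a random committee disagrees."""
    labeled = []                                    # (x, y) pairs collected so far
    for x in stream:
        # Draw a small committee of hypotheses consistent with the labels so far.
        committee = [sample_version_space(labeled) for _ in range(committee_size)]
        votes = {h(x) for h in committee}           # each member classifies the example
        if len(votes) > 1:                          # disagreement -> ask for the label
            labeled.append((x, oracle(x)))
    return labeled
```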


Infogain vs Query by Committee

[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]

First idea: Try to rapidly reduce volume of version space?

Problem: doesn’t take data distribution into account.

Which pair of hypotheses is closest? It depends on the data distribution P.

Distance measure on H: d(h, h') = P( h(x) ≠ h'(x) )

57

Query-by-committee

First idea: Try to rapidly reduce volume of version space?

Problem: doesn’t take data distribution into account.

To keep things simple, say d(h, h') = Euclidean distance: the version-space volume can shrink while the error is likely to remain large!

58

Query-by-committee

Elegant scheme which decreases volume in a manner which is sensitive to the data distribution.

Bayesian setting: given a prior π on H; set H_1 = H.

For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' at random from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1} accordingly
  [informally: is there a lot of disagreement about x_t in H_t?]

Problem: how to implement it efficiently?

59

Query-by-committee

For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' at random from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label and set H_{t+1}

Observation: the probability of getting the pair (h, h') in the inner loop (when a query is made) is proportional to π(h) π(h') d(h, h').

60


Query-by-committee

Label bound: for H = {linear separators in R^d} and P = uniform distribution, just d log 1/ε labels are needed to reach a hypothesis with error < ε.

Implementation: need to randomly pick h according to (π, H_t).

E.g. H = {linear separators in R^d}, π = uniform distribution: H_t is a convex region.

How do you pick a random point from a convex body?

62

Sampling from convex bodies

By random walk!

1. Ball walk

2. Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] Studies random walks and also ways to kernelize QBC.

63
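A rough sketch of hit-and-run sampling for the simple case where the version space is the set of unit-norm weight vectors w consistent with labeled pairs (x_i, y_i), i.e. y_i (w · x_i) ≥ 0. This is a toy illustration under those assumptions, not the kernelized algorithm of Gilad-Bachrach et al.:

```python
import numpy as np

def hit_and_run(w, A, n_steps=100, rng=None):
    """Hit-and-run random walk inside {w : A w >= 0, ||w|| <= 1}.

    A holds one row y_i * x_i per labeled example, so A w >= 0 is the version
    space of homogeneous linear separators; w must start strictly inside.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w, dtype=float)
    for _ in range(n_steps):
        d = rng.normal(size=w.shape)
        d /= np.linalg.norm(d)
        # Chord endpoints from the unit-ball constraint ||w + t d||^2 <= 1.
        b = w @ d
        disc = np.sqrt(b * b - (w @ w - 1.0))
        lo, hi = -b - disc, -b + disc
        # Tighten the chord with each halfspace a . (w + t d) >= 0.
        for a in A:
            ad, aw = a @ d, a @ w
            if ad > 0:
                lo = max(lo, -aw / ad)
            elif ad < 0:
                hi = min(hi, -aw / ad)
        w = w + rng.uniform(lo, hi) * d               # uniform point on the chord
    return w

# e.g. two labeled points in R^2 (rows are y_i * x_i), starting from a feasible w:
A = np.array([[1.0, 0.2], [0.5, 1.0]])
print(hit_and_run(np.array([0.3, 0.3]), A))
```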


Some challenges

[1] For linear separators, analyze the label complexity for some distribution other than uniform!

[2] How to handle nonseparable data?

Need a robust base learner


65

Active Collaborative Prediction

Approach: Collaborative Prediction (CP)

Example 1: a server × client matrix of QoS measurements (e.g. bandwidth), with some entries observed (values such as 3674, 187, 3009, 567, 703, 18, 1688) and others unknown ('?').

Example 2: a movie × user ratings matrix (Alina, Gerry, Irina, Raja rating Matrix, Geisha, Shrek), again with many entries missing.

Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict the unobserved ratings:
- will Alina like "The Matrix"? (unlikely)
- will Client 86 have fast download from Server 39?
- will member X of the funding committee approve our project Y?

67

Collaborative Prediction = Matrix Approximation


• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes

• Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings

(pLSA, MCVQ, SVD, NMF, MMMF, …)

68

User's 'weights' associated with 'factors'

Assumptions:
- there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights on such factors; a user's ratings are weighted (linear) combinations of the movies' factor values

69


Objective: find a factorizable X = UV' (rank k) that approximates the partially observed matrix Y:

  X = argmin_{X'} Loss(X', Y), subject to some "regularization" constraint (e.g. rank(X) < k)

Loss function: depends on the nature of your problem.

72
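A toy sketch of this objective under squared loss on the observed entries, fit by alternating least squares; the rank, regularization, and data below are illustrative choices, and this is one simple baseline rather than the MMMF method discussed on the next slides:

```python
import numpy as np

def als_factorize(Y, mask, k=2, reg=0.1, n_iters=50):
    """Fit X = U V' to the observed entries of Y (mask == 1) by alternating least squares."""
    n, m = Y.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(n, k)), rng.normal(size=(m, k))
    I = reg * np.eye(k)
    for _ in range(n_iters):
        for i in range(n):                        # update each user's weight vector
            obs = mask[i] == 1
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + I, V[obs].T @ Y[i, obs])
        for j in range(m):                        # update each item's factor vector
            obs = mask[:, j] == 1
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + I, U[obs].T @ Y[obs, j])
    return U @ V.T

# Tiny example: a 4x5 ratings matrix with missing entries (zeros are unobserved).
Y = np.array([[2, 4, 5, 0, 4], [3, 0, 2, 2, 5], [4, 2, 0, 1, 3], [0, 3, 4, 2, 0]], float)
mask = (Y > 0).astype(int)
print(np.round(als_factorize(Y, mask), 1))
```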

Matrix Factorization Approaches

• Singular value decomposition (SVD): low-rank approximation
  - Assumes a fully observed Y and sum-squared loss
  - In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima
• Furthermore, we may not want sum-squared loss, but instead
  - accurate predictions (0/1 loss, approximated by hinge loss)
  - cost-sensitive predictions (missing a good server vs. suggesting a bad one)
  - ranking cost (e.g., suggest the k 'best' movies for a user)
  These are NON-CONVEX PROBLEMS!
• Use instead the state-of-the-art Max-Margin Matrix Factorization (MMMF) [Srebro 05]
  - replaces the bounded-rank constraint by a bound on the norms of the U, V' vectors: a convex optimization problem that can be solved exactly by semi-definite programming
  - strongly relates to learning max-margin classifiers (SVMs)

Exploit MMMF's properties to augment it with active sampling!

73

Key Idea of MMMF

Rows of U are feature vectors; columns of V' are linear classifiers (weight vectors).

"Margin" here = distance of a sample from the classifier's separating line.

X_ij = sign_ij × margin_ij

Predictor_ij = sign_ij: if sign_ij > 0, classify as +1; otherwise classify as -1.

74

MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers

75

Active Learning with MMMF

- We extend MMMF to Active-MMMF using margin-based active sampling
- We investigate the exploitation vs. exploration trade-offs imposed by different heuristics

Margin-based heuristics:
- min-margin (most uncertain)
- min-margin positive ("good" uncertain)
- max-margin ('safe choice' but no info)

Active Max-Margin Matrix Factorization

A-MMMF(M, s):

1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
2. Using the current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an 'enforced' download)
3. Add the new samples to Y
4. Repeat 1-3

Issues:
• Beyond simple greedy margin-based heuristics?
• Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions?)

77
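A schematic sketch of this loop with the min-margin heuristic; `mmmf_fit` stands in for an actual MMMF solver and `probe(i, j)` for the measurement that produces a new label, both hypothetical:

```python
import numpy as np

def active_mmmf(Y, observed, mmmf_fit, probe, s=10, n_rounds=5):
    """Active-MMMF skeleton: refit, pick the s smallest-|margin| unobserved entries, probe them."""
    Y, observed = Y.copy(), observed.copy()
    for _ in range(n_rounds):
        X = mmmf_fit(Y, observed)                     # step 1: real-valued margin matrix
        margins = np.abs(X)
        margins[observed == 1] = np.inf               # only consider unobserved entries
        flat = np.argsort(margins, axis=None)[:s]     # step 2: min-margin (most uncertain)
        for idx in flat:
            i, j = np.unravel_index(idx, Y.shape)
            Y[i, j] = probe(i, j)                     # step 3: request the label
            observed[i, j] = 1
    return mmmf_fit(Y, observed)
```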

Empirical Results

• Network latency prediction
• Bandwidth prediction (peer-to-peer)
• Movie Ranking Prediction
• Sensor net connectivity prediction

78

Empirical Results: Latency Prediction

Datasets: P2Psim and NLANR-AMP.

Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.

79

Movie Rating Prediction (MovieLens)

80

Sensor Network Connectivity

81

Introducing Cost: Exploration vs Exploitation

DownloadGrid: bandwidth prediction

PlanetLab: latency prediction

Active sampling → lower prediction errors at lower cost (saves 100s of samples)
(better prediction → better server assignment decisions → faster downloads)

Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.

82

Conclusions

Common challenge in many applications: need for cost-efficient sampling

This talk: linear hidden factor models with active sampling

Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications

Future work:

Better active sampling heuristics?

Theoretical analysis of active sampling performance?

Dynamic Matrix Factorizations: tracking time-varying matrices

Incremental MMMF? (solving from scratch every time is too costly)

83

References

Some of the most influential papers

• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, Volume 2, pages 45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, (4):129-145, 1996.
• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.

84

NIPS papers

• Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06
• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05
• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
• Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05
• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
• Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04
• T. Jaakkola and H. Siegelmann. Active Information Retrieval. NIPS-01
• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
• Thomas Hofmann and Joachim M. Buhmann. Active Data Clustering. NIPS-97
• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
• Sebastian B. Thrun and Knut Moller. Active Exploration in Dynamic Environments. NIPS-91

85

ICML papers

• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05
• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
• Greg Schohn and David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00

COLT papers

• S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05
• H. S. Seung, M. Opper, and H. Sompolinsky. Query by Committee. COLT-92, pages 287-294, 1992.

86

Journal Papers

• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579-1619, 2005.
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, Volume 2, pages 45-66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.
• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, (4):129-145, 1996.
• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.
• Haussler, D., Kearns, M., and Schapire, R. E. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83-113, 1994.
• Fedorov, V. V. Theory of Optimal Experiments. Academic Press, 1972.
• Saar-Tsechansky, M. and F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153-178, 2004.

87

Workshops

• http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html

88

Appendix

89

Active Learning of Bayesian Networks

Entropy Function

• A measure of information in a random event X with possible outcomes {x_1, …, x_n}:

  H(X) = - Σ_i p(x_i) log_2 p(x_i)

• Comments on the entropy function:
  - Entropy of an event is zero when the outcome is known
  - Entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)

[Shannon, 1948]

91

Kullback-Leibler divergence

• P is the true distribution; Q is the distribution used to encode data instead of P
• KL divergence is the expected extra message length per datum that must be transmitted when using Q:

  D_KL(P || Q) = Σ_i P(x_i) log ( P(x_i) / Q(x_i) )
               = - Σ_i P(x_i) log Q(x_i) + Σ_i P(x_i) log P(x_i)
               = H(P, Q) - H(P)
               = cross-entropy - entropy

• A measure of how "wrong" Q is with respect to the true distribution P

92
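A quick numeric check of this identity (the two distributions below are arbitrary illustrative values):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = np.sum(P * np.log2(P / Q))             # D_KL(P || Q)
cross_entropy = -np.sum(P * np.log2(Q))     # H(P, Q)
entropy = -np.sum(P * np.log2(P))           # H(P)
print(kl, cross_entropy - entropy)          # the two values agree
```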

Learning Bayesian Networks

Data + Prior Knowledge → Learner

• Model Building
• Parameter estimation
• Causal structure discovery
• Passive Learning vs Active Learning

(Example: a Bayesian network over the nodes R, E, A, C, B, with a conditional probability table P(A | E, B).)

93

Active Learning

• Selective Active Learning

• Interventional Active Learning

• Obtain measure of quality of current model

• Choose query that most improves quality

• Update model

94

Active Learning: Parameter Estimation

[Tong & Koller, NIPS-2000]

• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (a query), receives a response x from the training data, and updates the parameter density from p(θ) to p'(θ)

Two questions:
• How to update the parameter density
• How to select the next query based on p

Updating parameter density

• Do not update A, since we are fixing it
• If we select A, then do not update B: we would be sampling from P(B | A = a) ≠ P(B)
• If we force A (an intervention), then we can update B: we are sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A := a, X = x)

96

* Pearl 2000

Bayesian point estimation

• Goal: a single estimate θ*
  - instead of a distribution p(θ) over θ
• If we choose θ* and the true model is θ', then we incur some loss L(θ' || θ*)

97

Bayesian point estimation

• We do not know the true θ'
• The density p represents our optimal beliefs over θ'
• Choose the θ* that minimizes the expected loss:
  θ* = argmin_θ ∫ p(θ') L(θ' || θ) dθ'
• Call θ* the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of the quality of p(θ):
  Risk(p) = ∫ p(θ') L(θ' || θ*) dθ'

98

The Querying component

• Set the controllable variables so as to minimize the expected posterior risk:

  ExPRisk(p | Q := q) = Σ_x P(X = x | Q := q) ∫ p(θ | x) KL(θ || θ*_x) dθ

  where θ*_x is the Bayesian point estimate under the updated density p(θ | x).

• KL divergence is used as the loss; the conditional KL divergence between networks is

  KL(θ || θ') = Σ_i KL( P_θ(X_i | U_i) || P_θ'(X_i | U_i) )

99

Algorithm Summary

• For each potential query q:
  - Compute ΔRisk(X | q), the expected reduction in risk
• Choose the q for which ΔRisk(X | q) is greatest
• Cost of computing ΔRisk(X | q): the cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)

100
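A schematic sketch of this selection step; `current_risk` and `expected_posterior_risk` are hypothetical stand-ins for the BN-inference computations defined on the previous slides:

```python
def select_query(candidate_queries, current_risk, expected_posterior_risk):
    """Pick the query with the largest expected risk reduction, Delta Risk(X | q)."""
    best_q, best_gain = None, float("-inf")
    for q in candidate_queries:
        gain = current_risk() - expected_posterior_risk(q)   # Delta Risk for this query
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q
```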

Uncertainty sampling

Maintain a single hypothesis, based on labels seen so far.

Query the point about which this hypothesis is most “uncertain”.

Problem: confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.


101


Region of uncertainty

Current version space: portion of H consistent with labels so far.

“Region of uncertainty” = part of data space about which there is still some uncertainty (i.e. disagreement within version space)

Suppose the data lie on a circle in R^2 and the hypotheses are linear separators (spaces X, H superimposed); the figure marks the current version space and the region of uncertainty in data space.

103

Region of uncertainty

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

Data and hypothesis spaces, superimposed (both are the surface of the unit sphere in R^d); the figure marks the current version space and the region of uncertainty in data space.

104

Region of uncertainty

Number of labels needed depends on H and also on P.

Special case: H = {linear separators in R^d}, P = uniform distribution over the unit sphere.

Then: just d log 1/ε labels are needed to reach a hypothesis with error rate < ε.

[1] Supervised learning: d/ε labels.
[2] Best we can hope for.

105

Region of uncertainty

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

For more general distributions: suboptimal…

Need to measure quality of a query – or alternatively, size of version space.

106

Expected infogain of a sample → uncertainty sampling!

107
