2010 - University of Florida

Multiple Instance Hidden Markov
Model: Application to Landmine
Detection in GPR Data
Jeremy Bolton, Seniha Yuksel, Paul Gader
CSI Laboratory
University of Florida
Highlights
• Hidden Markov Models (HMMs) are useful tools
for landmine detection in GPR imagery
• Explicitly incorporating the Multiple Instance
Learning (MIL) paradigm in HMM learning is
intuitive and effective
• Classification performance is improved when
using the MI-HMM over a standard HMM
• Results further support the idea that explicitly
accounting for the MI scenario may lead to
improved learning under class label uncertainty
Outline
I. HMMs for Landmine Detection in GPR
   I. Data
   II. Feature Extraction
   III. Training
II. MIL Scenario
III. MI-HMM
IV. Classification Results
HMMs for landmine detection
GPR Data
• GPR data
  – 3-D image cube: down-track (dt), cross-track (xt), and depth
  – Subsurface objects are observed as hyperbolas
GPR Data Feature Extraction
• Many features extracted from GPR data
  measure the occurrence of an “edge”
  – For the typical HMM algorithm (Gader et al.),
    • Preprocessing techniques are used to emphasize edges
    • Image morphology and structuring elements can be used to
      extract edges
[Figure: original image, preprocessed image, and extracted edges]
4-d Edge Features
[Figure: edge extraction producing the 4-dimensional edge features]
Concept behind the HMM for GPR
• Using the extracted features (an observation sequence
when scanning from left to right in an image) we will
attempt to estimate some hidden states
Concept behind the HMM for GPR
HMM Features
• Current AIM viewer by Smock
[Figure: image, feature image, rising edge feature, falling edge feature]
Sampling HMM Summary
• Feature Calculation
  – Dimensions (whether a positive or negative diagonal is observed
    is not always relevant; only that a diagonal is observed)
    • HMMSamp: 2d
  – Down-sampling depth
    • HMMSamp: 4
• HMM Models
  – Number of states
    • HMMSamp: 4
  – Gaussian components per state (fewer total components
    for the probability calculation)
    • HMMSamp: 1 (recent observation)
Training the HMM
• Xuping Zhang proposed a Gibbs sampling algorithm for HMM learning
  – But, given one or more images, how do we choose the training sequences?
  – Which sequence(s) do we choose from each image?
• There is an inherent problem in many image analysis settings due to
  class label uncertainty per sequence
• That is, each image has a class label associated with it, but each
  image contains multiple instances of samples or sequences. Which
  sample(s) is truly indicative of the target?
  – Using standard training techniques, this translates to identifying the
    optimal training set within a set of sequences
  – If an image has N sequences, this translates to a search over 2^N possibilities
Training Sample Selection Heuristic
• Currently, an MRF approach (Collins et al.) is used to bound the
  search to a localized area within the image rather than search all
  sequences within the image.
  – Reduces the search space, but the multiple instance problem still exists
[Figure: TM46-MB at 1"; log-likelihood trace (% change in LL: 0.0017),
H0/H1 segmentations, original data, and data with bounding box]
Multiple Instance Learning
Standard Learning vs. Multiple Instance Learning
• Standard supervised learning
  – Optimize some model (or learn a target concept) given training
    samples and corresponding labels:
    X = {x_1, ..., x_n},  Y = {y_1, ..., y_n}
• MIL
  – Learn a target concept given multiple sets of samples and
    corresponding labels for the sets
  – Interpretation: learning with uncertain labels / a noisy teacher:
    X_i = {x_i1, ..., x_in_i},  Y_i = 1,  {y_i1 = ?, ..., y_in_i = ?}
Multiple Instance Learning (MIL)
• Given:
  – A set of I bags, labeled + or -:
    B = {B_1^+, ..., B_i^+, B_{i+1}^-, ..., B_I^-}
  – The ith bag is a set of J_i samples in some feature space:
    B_i = {x_i1, ..., x_iJ_i}
  – Interpretation of labels:
    B_i^+  ⇒  ∃ j : label(x_ij) = 1
    B_i^-  ⇒  ∀ j : label(x_ij) = 0
• Goal: learn the concept
  – What characteristic is common to the positive bags that is not
    observed in the negative bags?
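The label interpretation above reduces to a logical OR over instance labels; a minimal sketch (the function name is ours, not from the slides):

```python
def bag_label(instance_labels):
    # A bag is positive iff at least one of its instances is positive:
    # the bag label is a logical OR over the (possibly unknown) instance labels.
    return int(any(instance_labels))

print(bag_label([0, 0, 1, 0]))  # 1: one positive instance suffices
print(bag_label([0, 0, 0]))     # 0: every instance is negative
```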
Standard learning doesn’t always fit: GPR Example
• Standard Learning
  – Each training sample (feature vector) must have a label
  – But which ones, and how many, compose the optimal training set?
• Arduous task: many feature vectors per image and multiple images
• Difficult to label given GPR echoes, ground truthing errors, etc.
• Label of each vector may not be known
[Figure: EHD feature vectors x_1, ..., x_n with labels y_1 = ?, y_2 = -,
y_3 = -, y_4 = ?, ..., y_n = -]
Learning from Bags
• In MIL, a label is attached to a set of samples.
• A bag is a set of samples.
• A sample within a bag is called an instance.
• A bag is labeled as positive if and only if at least
  one of its instances is positive.
[Figure: negative bags and positive bags; each bag is an image]
MI Learning: GPR Example
• Multiple Instance Learning
  – Each training bag must have a label
  – No need to label all feature vectors; just identify images
    (bags) where targets are present
  – Implicitly accounts for class label uncertainty
[Figure: EHD feature vectors; bag {x_1, x_2, x_3, x_4, ..., x_15} with label Y = +]
Multiple Instance Learning HMM:
MI-HMM
MI-HMM
• In MI-HMM, instances are sequences
[Figure: negative and positive bags of sequences; an arrow indicates the
direction of movement]
MI-HMM
• Assuming independence between the bags and
  assuming the Noisy-OR (Pearl) relationship
  between the sequences within each bag:
  P(Y_i = 1 | B_i, λ) = 1 - ∏_j (1 - P(y_ij = 1 | x_ij, λ))
• where P(y_ij = 1 | x_ij, λ) is the probability the HMM λ assigns
  a positive label to sequence x_ij of bag B_i
MI-HMM learning
• Due to the cumbersome nature of the
  noisy-OR, the parameters of the HMM are
  learned using Metropolis-Hastings sampling.
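A minimal sketch of such a sampler, assuming Dirichlet proposals centered on the current transition rows and a caller-supplied bag log-likelihood; the talk does not specify the proposal concentration, and the proposal asymmetry is ignored here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_matrix(A, conc=50.0):
    # Propose a new stochastic matrix: each row is drawn from a Dirichlet
    # centered on the current row. The concentration value is an assumption;
    # the talk does not give the proposal parameters.
    return np.stack([rng.dirichlet(conc * row + 1e-6) for row in A])

def mh_update(A, loglik, n_iter=200, conc=50.0):
    """Metropolis-style sampling over an HMM transition matrix A.
    `loglik` maps a candidate matrix to the log of the noisy-OR bag
    likelihood. (Ignoring proposal asymmetry makes this a simplification
    of full Metropolis-Hastings.)"""
    cur_ll = loglik(A)
    for _ in range(n_iter):
        cand = propose_matrix(A, conc)
        cand_ll = loglik(cand)
        # Accept with probability min(1, exp(cand_ll - cur_ll)).
        if np.log(rng.uniform()) < cand_ll - cur_ll:
            A, cur_ll = cand, cand_ll
    return A, cur_ll
```

Each accepted candidate remains row-stochastic by construction, since Dirichlet draws sum to one.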
Sampling
• HMM parameters are sampled from Dirichlet distributions
• A new state is accepted or rejected based on the
  ratio r at iteration t + 1:
  r = P(B | λ^(t+1)) / P(B | λ^(t))
• where P is the noisy-OR model.
Discrete Observations
• Note that since we have chosen a Metropolis Hastings
sampling scheme using Dirichlets, our observations must
be discretized.
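One common way to obtain such a discretization is vector quantization of the 2-d features with a small codebook; k-means is our illustrative choice here, not necessarily the method used in the talk:

```python
import numpy as np

def fit_codebook(features, n_symbols=16, n_iter=25, seed=0):
    # Toy k-means codebook over 2-d edge features. The talk only states
    # that observations are discretized into 16 symbols; k-means is an
    # assumed, illustrative quantizer.
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    centers = features[rng.choice(len(features), n_symbols, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each feature vector to its nearest center, then recenter.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_symbols):
            if np.any(assign == k):
                centers[k] = features[assign == k].mean(axis=0)
    return centers

def discretize(features, centers):
    # Map each feature vector to the index (symbol) of its nearest center.
    dists = np.linalg.norm(np.asarray(features)[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

The resulting symbol indices (0..15) are what the discrete MI-HMM emission mixtures would be defined over.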
MI-HMM Summary
• Feature Calculation
  – Dimensions
    • HMMSamp: 2d
    • MI-HMM: 2d features are discretized into 16 symbols
  – Down-sampling depth
    • HMMSamp: 4
    • MI-HMM: 4
• HMM Models
  – Number of states
    • HMMSamp: 4
    • MI-HMM: 4
  – Components per state (fewer total components for the probability
    calculation)
    • HMMSamp: 1 Gaussian
    • MI-HMM: discrete mixture over 16 symbols
Classification Results
MI-HMM vs Sampling HMM
• Small Millbrook
[Figure: ROC comparison of HMM Samp (12,000) and MI-HMM (100)]
What’s the deal with HMM Samp?
Concluding Remarks
Concluding Remarks
• Explicitly incorporating the Multiple Instance
Learning (MIL) paradigm in HMM learning is
intuitive and effective
• Classification performance is improved when
using the MI-HMM over a standard HMM
– More effective and efficient
• Future Work
– Construct bags without using MRF heuristic
– Apply to EMI data: spatial uncertainty
Back up Slides
CSI Laboratory
MIL Application: Example GPR
EHD: Feature Vector
• Collaboration: Frigui,
Collins, Torrione
• Construction of bags
– Collect 15 EHD feature
vectors from the 15
depth bins
– Mine images = + bags
– FA images = - bags
x1 , x 2 , x 3 , x 4 ,..., x15 
Standard vs. MI Learning: GPR Example
• Standard Learning
  – Each training sample (feature vector) must have a label
• Arduous task
  – many feature vectors per image and multiple images
  – difficult to label given GPR echoes, ground truthing errors, etc.
  – label of each vector may not be known
[Figure: EHD feature vectors x_1, ..., x_n, each with its own label y_1, ..., y_n]
Random Set Framework for Multiple
Instance Learning
Random Set Brief
• Random Set
[Diagram: a random variable X maps the probability space (Ω, B(Ω), P)
into (R, B(R)); a random set Γ maps the same probability space into a
measurable space of sets]
How can we use Random Sets for MIL?
• Random set for MIL: bags are sets
  X = {x_1, ..., x_n}
  – The idea of finding the commonality of positive bags is inherent in the
    random set formulation
    • Sets have an empty-intersection or non-empty-intersection relationship
    • Find commonality using the intersection operator
    • The random set’s governing functional is based on the intersection operator
  – Capacity functional T, a.k.a. the Noisy-OR gate (Pearl 1988):
    T(X) = 1 - ∏_{x∈X} (1 - T({x}))
    “It is NOT the case that EACH element is NOT the target concept”
Random Set Functionals
• Capacity functionals for the intersection calculation:
  P(Γ ∩ X ≠ ∅) = T(X)
• Use a germ and grain model for the random set, with model parameters {γ, β}
  – Multiple (J) concepts:
    Γ = ∪_{j=1}^{J} ({γ_j} ⊕ G_j)   (germs γ_j, grains G_j)
  – Calculate the probability of intersection given X and the germ and grain pairs:
    T(X) = 1 - ∏_j ∏_{x∈X} (1 - T_j({x}))
  – Grains are governed by random radii with an assumed cumulative:
    T_j({x}) = P(R_j ≥ r_j) = 1 - P(R_j < r_j) = 2 / (1 + exp(r_j^T β_j r_j)),
    where r_j = x - γ_j
RSF-MIL: Germ and Grain Model
• Positive bags = blue
• Negative bags = orange
• Distinct shapes = distinct bags
[Figure: germs γ with capacity contours T_γ covering instances x from the
positive bags]
Multiple Instance Learning with
Multiple Concepts
Multiple Concepts: Disjunction or Conjunction?
• Disjunction
  – When you have multiple types of concepts
  – When each instance can indicate the presence of a target
• Conjunction
  – When you have a target type that is composed of multiple
    (necessary) concepts
  – When each instance can indicate a concept, but not necessarily
    the composite target type
Conjunctive RSF-MIL
• Previously developed disjunctive RSF-MIL (RSF-MIL-d):
  T(X) = 1 - ∏_j ∏_{x∈X} (1 - T_j(x))
  – Noisy-OR combination across concepts and samples;
    the inner product is the standard noisy-OR for one concept j
• Conjunctive RSF-MIL (RSF-MIL-c):
  T(X) = ∏_j [ 1 - ∏_{x∈X} (1 - T_j(x)) ]
  – Noisy-AND combination across concepts
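The two combinations differ only in where the product over concepts sits; a minimal sketch (function names ours):

```python
import numpy as np

def noisy_or(probs):
    # "At least one": 1 - prod(1 - p).
    return 1.0 - float(np.prod(1.0 - np.asarray(probs, dtype=float)))

def rsf_mil_d(concept_probs):
    # Disjunctive: a single noisy-OR across all concepts and all samples.
    flat = np.concatenate([np.ravel(p) for p in concept_probs])
    return noisy_or(flat)

def rsf_mil_c(concept_probs):
    # Conjunctive: noisy-AND (product) across concepts of per-concept
    # noisy-ORs -- every concept must be exhibited by some sample.
    return float(np.prod([noisy_or(p) for p in concept_probs]))
```

With per-concept instance probabilities [[0.9, 0.1], [0.0, 0.0]] (the second concept never observed), the disjunctive score is 0.91 while the conjunctive score collapses to 0, which is exactly the behavior the Extreme Conjunct experiment below probes.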
Synthetic Data Experiments
• The Extreme Conjunct data set requires that a target bag exhibit two
  distinct concepts rather than one or none
[Table: AUC, with AUC when initialized near the solution in parentheses]
Application to Remote Sensing
Disjunctive Target Concepts
[Diagram: noisy-OR nodes for target concept types 1 through n, combined
by an OR into “target concept present?”]
• Using large overlapping bins (gross extraction), the target concept can
  be encapsulated within one instance: therefore a disjunctive
  relationship exists
What if we want features with finer granularity?
[Diagram: noisy-OR nodes for constituent concept 1 (top of hyperbola) and
constituent concept 2 (wings of hyperbola), combined by an AND into
“target concept present?”]
• Fine extraction
  – More detail about the image and more shape information, but we may
    lose the disjunctive nature between (multiple) instances
  – Our features have more granularity, therefore our concepts may be
    constituents of a target, rather than encapsulating the target concept
GPR Experiments
• Extensive GPR data set
  – ~800 targets
  – ~5,000 non-targets
• Experimental Design
  – Run RSF-MIL-d (disjunctive) and RSF-MIL-c (conjunctive)
  – Compare both feature extraction methods
    • Gross extraction: large enough to encompass the target concept
    • Fine extraction: non-overlapping bins
• Hypothesis
  – RSF-MIL-d will perform well when using gross extraction, whereas
    RSF-MIL-c will perform well using fine extraction
Experimental Results
• Highlights
  – RSF-MIL-d using gross extraction performed best
  – RSF-MIL-c performed better than RSF-MIL-d when using fine
    extraction
  – Other influencing factors: the optimization methods for RSF-MIL-d
    and RSF-MIL-c are not the same
[Figure: ROC curves for gross extraction and fine extraction]
Future Work
• Implement a general form that can learn the disjunction or conjunction
  relationship from the data
• Implement a general form that can learn the number of concepts
• Incorporate spatial information
• Develop an improved optimization scheme for RSF-MIL-c
HMM Model Visualization
[Figure: DTXT-HMM visualization. Points = Gaussian component means;
color = state index. Axes: rising diagonal vs. falling diagonal. Bar
charts show the initial probabilities for state indices 1-3 and the
transition probabilities from state to state (red = high probability),
characterizing the pattern.]
Backup Slides
MIL Example (AHI Imagery)
• Robust learning tool
  – MIL tools can learn a target signature with limited or incomplete
    ground truth
• Which spectral signature(s) should we use to train a target model or
  classifier?
  1. Spectral mixing
  2. Background signal
  3. Ground truth is not exact
MI-RVM
• Addition of set observations, and inference using a noisy-OR, to an
  RVM model:
  P(y = 1 | X) = 1 - ∏_{j=1}^{K} (1 - σ(w^T x_j)),   σ(z) = 1 / (1 + exp(-z))
• Prior on the weights w:
  p(w) = N(w | 0, A^{-1})
SVM review
• Classifier structure:
  y(x) = w^T φ(x) + b
• Optimization:
  min_{w,b} (1/2)||w||^2 + C Σ_i ξ_i
  s.t. ∀i: t_i (w^T φ(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0
MI-SVM Discussion
• RVM was altered to fit MIL problem by changing
the form of the target variable’s posterior to
model a noisy-OR gate.
• SVM can be altered to fit the MIL problem by
changing how the margin is calculated
– Boost the margin between the bag (rather than
samples) and decision surface
– Look for the MI separating linear discriminant
• There is at least one sample from each bag in the half space
2010
67/31
CSI Laboratory
mi-SVM
• Enforce MI scenario using extra
constraints
1
2
min min w  C  i
{ti } w,b 2
i
st i : ti (wT φ(xi )  b)  1  i , i  0, ti {1,1}
At least one sample in
each positive bag must
have a label of 1.
All samples in each
negative bag must
have a label of -1.
ti  1
 1, I : TI  1,

2
iI
ti  1, I : TI  1
Mixed integer
program: Must find
optimal hyperplane
and optimal labeling
set
Current Applications
I. Multiple Instance Learning
   I. MI Problem
   II. MI Applications
II. Multiple Instance Learning: Kernel Machines
   I. MI-RVM
   II. MI-SVM
III. Current Applications
   I. GPR imagery
   II. HSI imagery
HSI: Target Spectra Learning
• Given labeled areas of interest: learn
target signature
• Given test areas of interest: classify set of
samples
Overview of MI-RVM Optimization
• Two step optimization
1. Estimate optimal w, given posterior of w
• There is no closed form solution for the parameters
of the posterior, so a gradient update method is
used
• Iterate until convergence. Then proceed to step 2.
2. Update parameter on prior of w
• The distribution on the target variable has no
specific parameters.
• Until system convergence, continue at step 1.
1) Optimization of w
• Optimize the posterior (Bayes’ rule) of w:
  ŵ_MAP = argmax_w [ log p(X | w) + log p(w) ]
• Update the weights using the Newton-Raphson method:
  w_{t+1} = w_t - H^{-1} g
2) Optimization of Prior
• Optimization of the covariance of the prior:
  Â = argmax_A p(X | A) = argmax_A ∫ p(X | w) p(w | A) dw
• Making a large number of assumptions, the diagonal elements of A can
  be estimated:
  a_i^new = 1 / (w_i^2 + H_ii^{-1})
Random Sets: Multiple Instance Learning
• Random set framework for multiple instance learning
  – Bags are sets
  – The idea of finding the commonality of positive bags is inherent
    in the random set formulation
    • Find commonality using the intersection operator
    • The random set’s governing functional is based on the intersection
      operator:
      T(K) = P(Γ ∩ K ≠ ∅)
MI issues
• MIL approaches
– Some approaches are biased to believe only
one sample in each bag caused the target
concept
– Some approaches can only label bags
– It is not clear whether anything is gained over
supervised approaches
RSF-MIL
• MIL-like
• Positive bags = blue
• Negative bags = orange
• Distinct shapes = distinct bags
[Figure: germ and grain model, as before]
Side Note: Bayesian Networks
• Noisy-OR Assumption
– Bayesian Network representation of Noisy-OR
– Polytree: singly connected DAG
Side Note
• A full Bayesian network may be intractable
  – Occurrences of causal factors are rare (sparse co-occurrence)
    • So assume a polytree
    • So assume the result has a Boolean relationship with the causal factors
  – Absorb I, X and A into one node, governed by the randomness of I
• These assumptions greatly simplify the inference calculation
• Calculate Z based on probabilities rather than constructing a
  distribution using X:
  P(Z = 1 | {X_1, X_2, X_3, X_4}) = 1 - ∏_j (1 - P(Z = 1 | X_j))
Diverse Density (DD)
• Probabilistic approach
  – Goal:
    • Standard statistical approaches identify areas in a feature space
      with a high density of target samples and a low density of
      non-target samples
    • DD: identify areas in a feature space with a high “density” of
      samples from EACH of the positive bags (“diverse”), and a low
      density of samples from negative bags
  – Identify attributes or characteristics similar to the positive bags
    and dissimilar to the negative bags
  – Assume t is a target characterization
  – Goal:
    argmax_t P(B_1^+, ..., B_n^+, B_1^-, ..., B_m^- | t)
  – Assuming the bags are conditionally independent:
    argmax_t ∏_i P(B_i^+ | t) ∏_j P(B_j^- | t)
Diverse Density
• Calculation (Noisy-OR Model), with B_i = {x_i1, ..., x_iJ_i}:
  P(t | B_i^+) = 1 - ∏_j (1 - P(t | B_ij^+))
  P(t | B_i^-) = ∏_j (1 - P(t | B_ij^-))
  P(t | B_ij) = exp( -||B_ij - t||^2 ) = exp( -||x_ij - t||^2 )
  “It is NOT the case that EACH element is NOT the target concept”
  argmax_t ∏_i P(B_i^+ | t) ∏_j P(B_j^- | t)
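The DD objective above can be sketched directly; the unit-scale exponential instance model follows the slide, while the function names are ours:

```python
import numpy as np

def p_t_given_x(t, x):
    # exp(-||x - t||^2): the Gaussian-like instance model from the slide.
    d = np.asarray(x, dtype=float) - np.asarray(t, dtype=float)
    return float(np.exp(-np.sum(d * d)))

def diverse_density(t, pos_bags, neg_bags):
    # DD(t) = prod_i [1 - prod_j (1 - P(t|x_ij+))]
    #       * prod_i prod_j (1 - P(t|x_ij-))
    dd = 1.0
    for bag in pos_bags:
        dd *= 1.0 - np.prod([1.0 - p_t_given_x(t, x) for x in bag])
    for bag in neg_bags:
        dd *= np.prod([1.0 - p_t_given_x(t, x) for x in bag])
    return float(dd)
```

A candidate t near an instance of every positive bag but far from all negative instances scores near 1; a candidate sitting on a negative instance scores near 0.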
Random Set Functionals
• Capacity and avoidance functionals:
  T(K) = P(Γ ∩ K ≠ ∅)
  T(K) = 1 - Q(K)
  Q(K) = P(Γ ∩ K = ∅)
• Given a germ and grain model:
  Γ_i = ∪_{j=1}^{n_i} ({γ_ij} ⊕ G_ij)
  P({x} | γ_ij) = T_{γ_ij}({x}) = P(R_ij ≥ r_ij) = 1 - P(R_ij < r_ij)
                = 2 / (1 + exp(r_ij^T β_ij r_ij)),  where r_ij = x - γ_ij
When disjunction makes sense
[Diagram: an OR node feeding “target concept present”]
• Using large overlapping bins, the target concept can be encapsulated
  within one instance: therefore a disjunctive relationship exists
Theoretical and Developmental Progress
• Previous optimization:
  argmax Σ_j ∏_i T_{γ_j,β}(B_i^+) Q_{γ_j,β}(B_i^-)
  – Did not necessarily promote diverse density
• Current optimization:
  argmax ∏_j ∏_i T_{γ_j,β}(B_i^+) Q_{γ_j,β}(B_i^-)
  – Better for context learning and MIL
• Previous TO DO list
  – Improve existing code
  – Develop joint optimization for context learning and MIL
  – Apply MIL approaches (broad scale)
    • Learn similarities between feature sets of mines
    • Aid in training existing algos: find “best” EHD features for
      training / testing
    • Construct set-based classifiers?
How do we impose the MI scenario?: Diverse Density (Maron et al.)
• Calculation (Noisy-OR Model), inherent in the Random Set formulation,
  with B_i = {x_i1, ..., x_iJ_i}:
  P(t | B_i^+) = 1 - ∏_j (1 - P(t | B_ij^+))
  P(t | B_i^-) = ∏_j (1 - P(t | B_ij^-))
  P(t | B_ij) = exp( -||B_ij - t||^2 ) = exp( -||x_ij - t||^2 )
  “It is NOT the case that EACH element is NOT the target concept”
• Optimization:
  argmax_t ∏_i P(t | B_i^+) ∏_j P(t | B_j^-)
  – A combination of exhaustive search and gradient ascent
How can we use Random Sets for MIL?
• Random set for MIL: bags are sets
  – The idea of finding the commonality of positive bags is inherent in
    the random set formulation
    • Sets have an empty-intersection or non-empty-intersection relationship
    • Find commonality using the intersection operator
    • The random set’s governing functional is based on the intersection
      operator
• Example:
  Bags with target: {l,a,e,i,o,p,u,f}, {f,b,a,e,i,z,o,u},
    {a,b,c,i,o,u,e,p,f}, {a,f,t,e,i,u,o,d,v}  →  intersection
  Bags without target: {s,r,n,m,p,l}, {z,s,w,t,g,n,c}, {f,p,k,r},
    {q,x,z,c,v}, {p,l,f}  →  union
  Target concept = {a,e,i,o,u,f} \ {f,s,r,n,m,p,l,z,w,g,n,c,v,q,k}
                 = {a,e,i,o,u}