Multiple Instance Hidden Markov Model: Application to Landmine Detection in GPR Data
Jeremy Bolton, Seniha Yuksel, Paul Gader
CSI Laboratory, University of Florida
2010

Highlights
• Hidden Markov Models (HMMs) are useful tools for landmine detection in GPR imagery
• Explicitly incorporating the Multiple Instance Learning (MIL) paradigm in HMM learning is intuitive and effective
• Classification performance is improved when using the MI-HMM over a standard HMM
• Results further support the idea that explicitly accounting for the MI scenario may lead to improved learning under class label uncertainty

Outline
I. HMMs for landmine detection in GPR
  I. Data
  II. Feature Extraction
  III. Training
II. MIL Scenario
III. MI-HMM
IV. Classification Results

HMMs for landmine detection

GPR Data
• GPR data
  – 3d image cube
    • Dt, xt, depth
  – Subsurface objects are observed as hyperbolas

GPR Data Feature Extraction
• Many features extracted from GPR data measure the occurrence of an "edge"
  – For the typical HMM algorithm (Gader et al.),
    • Preprocessing techniques are used to emphasize edges
    • Image morphology and structuring elements can be used to extract edges
[Figure panels: Image, Preprocessed, Edge Extraction]

4-d Edge Features
[Figure: Edge Extraction]

Concept behind the HMM for GPR
• Using the extracted features (an observation sequence when scanning from left to right in an image) we will attempt to estimate some hidden states
[Figure]

HMM Features
• Current AIM viewer by Smock
[Figure panels: Image, Feature Image, Rising Edge Feature, Falling Edge Feature]

Sampling HMM Summary
• Feature Calculation
  – Dimensions (it is not always relevant whether a positive or negative diagonal is observed, only that a diagonal is observed)
    • HMMSamp: 2d
  – Down sampling depth
    • HMMSamp: 4
• HMM Models
  – Number of States
    • HMMSamp: 4
  – Gaussian components per state (fewer total components for probability calculation)
    • HMMSamp: 1 (recent observation)

Training the HMM
• Xuping Zhang proposed a Gibbs Sampling algorithm for HMM learning
  – But, given one or more images, how do we choose the training sequences?
  – Which sequence(s) do we choose from each image?
• There is an inherent problem in many image analysis settings due to class label uncertainty per sequence
• That is, each image has a class label associated with it, but each image has multiple instances of samples or sequences. Which sample(s) are truly indicative of the target?
  – Using standard training techniques this translates to identifying the optimal training set within a set of sequences
  – If an image has N sequences this translates to a search of $2^N$ possibilities

Training Sample Selection Heuristic
• Currently, an MRF approach (Collins et al.) is used to bound the search to a localized area within the image rather than search all sequences within the image.
  – Reduces search space, but multiple instance problem still exists
[Figure: TM46-MB @ 1", % change in LL: 0.0017; panels: H0 Segmentation, H1 Segmentation, Original Data, Data + Bounding Box]

Multiple Instance Learning

Standard Learning vs. Multiple Instance Learning
• Standard supervised learning
  – Optimize some model (or learn a target concept) given training samples and corresponding labels
    $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
• MIL
  – Learn a target concept given multiple sets of samples and corresponding labels for the sets.
  – Interpretation: Learning with uncertain labels / noisy teacher
    $X_i = \{x_{i1}, \ldots, x_{in_i}\}$, $Y_i = 1$, $\{y_{i1} = ?, \ldots, y_{in_i} = ?\}$

Multiple Instance Learning (MIL)
• Given:
  – Set of I bags, labeled + or -
    $B = \{B_1^+, \ldots, B_i^+, B_{i+1}^-, \ldots, B_I^-\}$
  – The ith bag is a set of $J_i$ samples in some feature space
    $B_i = \{x_{i1}, \ldots, x_{iJ_i}\}$
  – Interpretation of labels
    $B_i^+ \Rightarrow \exists j : \mathrm{label}(x_{ij}) = 1$
    $B_i^- \Rightarrow \forall j,\ \mathrm{label}(x_{ij}) = 0$
• Goal: learn concept
  – What characteristic is common to the positive bags that is not observed in the negative bags

Standard learning doesn't always fit: GPR Example
• Standard Learning
  – Each training sample (feature vector) must have a label
  – But which ones and how many compose the optimal training set?
• Arduous task: many feature vectors per image and multiple images
• Difficult to label given GPR echoes, ground truthing errors, etc.
• Label of each vector may not be known
[Figure: EHD feature vectors $x_1, x_2, x_3, x_4, \ldots, x_n$ with per-vector labels $y_1, \ldots, y_n$, some marked "?"]

Learning from Bags
• In MIL, a label is attached to a set of samples.
• A bag is a set of samples.
• A sample within a bag is called an instance.
• A bag is labeled as positive if and only if at least one of its instances is positive.
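
The bag-labeling rule above can be illustrated with a minimal Python sketch. The function name `bag_label` and the 0/1 encoding of instance labels are assumptions made for illustration; in actual MIL training the instance labels are hidden and only the bag label is observed.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Return 1 if at least one instance label is 1, otherwise 0."""
    return int(any(label == 1 for label in instance_labels))


# Hypothetical instance labels for two bags (images).
positive_bag = [0, 0, 1, 0]   # one instance indicates the target
negative_bag = [0, 0, 0]      # no instance indicates the target

print(bag_label(positive_bag), bag_label(negative_bag))
```

The rule is an OR over the instances, which is why the noisy-OR model is a natural probabilistic relaxation of it.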
[Figure: NEGATIVE BAGS and POSITIVE BAGS; each bag is an image]

MI Learning: GPR Example
• Multiple Instance Learning
  – Each training bag must have a label
  – No need to label all feature vectors, just identify images (bags) where targets are present
  – Implicitly accounts for class label uncertainty
[Figure: EHD feature vectors $x_1, x_2, x_3, x_4, \ldots, x_{15}$ sharing one bag label $Y$]

Multiple Instance Learning HMM: MI-HMM

MI-HMM
• In MI-HMM, instances are sequences
[Figure: NEGATIVE BAGS and POSITIVE BAGS of sequences; arrow marks the direction of movement]

MI-HMM
• Assuming independence between the bags and assuming the Noisy-OR (Pearl) relationship between the sequences within each bag
• where [equation not recoverable]

MI-HMM learning
• Due to the cumbersome nature of the noisy-OR, the parameters of the HMM are learned using Metropolis-Hastings sampling.

Sampling
• HMM parameters are sampled from Dirichlet
• A new state is accepted or rejected based on the ratio r at iteration t + 1
• where P is the noisy-OR model.

Discrete Observations
• Note that since we have chosen a Metropolis-Hastings sampling scheme using Dirichlets, our observations must be discretized.
[Figure: discretized feature plot]

MI-HMM Summary
• Feature Calculation
  – Dimensions
    • HMMSamp: 2d
    • MI-HMM: 2d features are discretized into 16 symbols
  – Down sampling depth
    • HMMSamp: 4
    • MI-HMM: 4
• HMM Models
  – Number of States
    • HMMSamp: 4
    • MI-HMM: 4
  – Components per state (fewer total components for probability calculation)
    • HMMSamp: 1 Gaussian
    • MI-HMM: Discrete mixture over 16 symbols

Classification Results

MI-HMM vs Sampling HMM
• Small Millbrook
[Figure legend: HMM Samp (12,000), MI-HMM (100)]

What's the deal with HMM Samp?
[Figure]
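
Returning to the MI-HMM objective: the noisy-OR bag model under independent bags, and the accept/reject step of the sampler, can be sketched as follows. This is an illustrative sketch and not the authors' implementation. It assumes each sequence already has a probability of indicating a target (how that probability is obtained from the HMM is not recoverable from the slides), and the acceptance test omits any proposal-correction term. The names `noisy_or`, `log_likelihood`, and `accept` are hypothetical.

```python
import math
from typing import Sequence


def noisy_or(instance_probs: Sequence[float]) -> float:
    """P(bag is positive) = 1 - prod_j (1 - p_j)."""
    prob_all_negative = 1.0
    for p in instance_probs:
        prob_all_negative *= 1.0 - p
    return 1.0 - prob_all_negative


def log_likelihood(positive_bags, negative_bags, eps: float = 1e-12) -> float:
    """Sum of log bag probabilities, assuming the bags are independent."""
    total = 0.0
    for bag in positive_bags:
        total += math.log(max(noisy_or(bag), eps))
    for bag in negative_bags:
        total += math.log(max(1.0 - noisy_or(bag), eps))
    return total


def accept(log_p_new: float, log_p_old: float, u: float) -> bool:
    """Accept when u < r, with r = P_new / P_old compared in the log domain.

    u is a uniform draw in (0, 1]. Assumption: no proposal-correction term.
    """
    return math.log(u) < log_p_new - log_p_old


# Hypothetical per-sequence probabilities for one positive and one negative bag.
current = log_likelihood([[0.1, 0.7, 0.2]], [[0.3, 0.2]])
proposed = log_likelihood([[0.1, 0.9, 0.2]], [[0.1, 0.1]])
print(accept(proposed, current, u=0.5))
```

Working in the log domain avoids underflow when many bags are multiplied together.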
Concluding Remarks
• Explicitly incorporating the Multiple Instance Learning (MIL) paradigm in HMM learning is intuitive and effective
• Classification performance is improved when using the MI-HMM over a standard HMM
  – More effective and efficient
• Future Work
  – Construct bags without using MRF heuristic
  – Apply to EMI data: spatial uncertainty

Back up Slides

MIL Application: Example GPR
• Collaboration: Frigui, Collins, Torrione
• Construction of bags
  – Collect 15 EHD feature vectors from the 15 depth bins
  – Mine images = + bags
  – FA images = - bags
[Figure: EHD feature vectors $x_1, x_2, x_3, x_4, \ldots, x_{15}$]

Standard vs. MI Learning: GPR Example
• Standard Learning
  – Each training sample (feature vector) must have a label
• Arduous task
  – many feature vectors per image and multiple images
  – difficult to label given GPR echoes, ground truthing errors, etc.
  – label of each vector may not be known
[Figure: EHD feature vectors $x_1, \ldots, x_n$ with per-vector labels $y_1, \ldots, y_n$]

Random Set Framework for Multiple Instance Learning

Random Set Brief
• Random Set
[Diagram: mappings from a probability space $(\cdot, B(\cdot), P)$ to measurable spaces, including $(\mathbb{R}, B(\mathbb{R}))$; the remaining symbols are not recoverable]

How can we use Random Sets for MIL?
• Random set for MIL: Bags are sets, $X = \{x_1, \ldots, x_n\}$
  – Idea of finding commonality of positive bags inherent in random set formulation
    • Sets have an empty intersection or non-empty intersection relationship
    • Find commonality using intersection operator
    • The governing functional of a random set is based on the intersection operator
  – Capacity functional $T_\Gamma$ of the random set $\Gamma$, a.k.a. Noisy-OR gate (Pearl 1988)
    $T_\Gamma(X) = 1 - \prod_{x \in X} \left(1 - T_\Gamma(\{x\})\right)$
    It is NOT the case that EACH element is NOT the target concept

Random Set Functionals
• Capacity functionals for intersection calculation
  $P(\Gamma \cap X \neq \emptyset) = T_\Gamma(X)$
• Use germ and grain model to model random set
  – Random set model parameters: {germ, grain}
  – Multiple (J) concepts: one germ and grain pair per concept, $j = 1, \ldots, J$
  – Calculate probability of intersection given X and germ and grain pairs:
    $T_\Gamma(X) = 1 - \prod_{j} \prod_{x \in X} \left(1 - T_{\Gamma_j}(\{x\})\right)$
  – Grains are governed by random radii with assumed cumulative:
    $T_{\Gamma_j}(\{x\}) = P(R_j \ge r_j) = 1 - P(R_j < r_j)$, where $r_j$ is the offset of $x$ from germ $j$ and the assumed cumulative is an exponential of a quadratic form in $r_j$

RSF-MIL: Germ and Grain Model
• Positive Bags = blue
• Negative Bags = orange
• Distinct shapes = distinct bags
[Figure: scatter of instances (x) and target germs (T)]

Multiple Instance Learning with Multiple Concepts

Multiple Concepts: Disjunction or Conjunction?
• Disjunction
  – When you have multiple types of concepts
  – When each instance can indicate the presence of a target
• Conjunction
  – When you have a target type that is composed of multiple (necessary) concepts
  – When each instance can indicate a concept, but not necessarily the composite target type

Conjunctive RSF-MIL
• Previously developed disjunctive RSF-MIL (RSF-MIL-d): noisy-OR combination across concepts and samples
  $T_\Gamma(X) = 1 - \prod_{j} \prod_{x \in X} \left(1 - T_{\Gamma_j}(\{x\})\right)$
• Conjunctive RSF-MIL (RSF-MIL-c): standard noisy-OR for one concept j, then noisy-AND combination across concepts
  $T_\Gamma(X) = \prod_{j} \left[1 - \prod_{x \in X} \left(1 - T_{\Gamma_j}(\{x\})\right)\right]$

Synthetic Data Experiments
• AUC (AUC when initialized near solution)
• Extreme Conjunct data set requires that a target bag exhibits two distinct concepts rather than one or none
[Table of AUC values not recoverable]

Application to Remote Sensing

Disjunctive Target Concepts
[Diagram: one Noisy-OR per Target Concept Type 1, 2, …, n, combined by OR into "Target Concept Present?"]
• Using large overlapping bins (GROSS extraction) the target concept can be encapsulated within 1 instance; therefore a disjunctive relationship exists

What if we want features with finer granularity
[Diagram: Noisy-OR for Constituent Concept 1 (top of hyperbola) and Noisy-OR for Constituent Concept 2 (wings of hyperbola), combined by AND into "Target Concept Present?"]
• Fine Extraction
  – More detail about image and more shape information, but may lose disjunctive nature between (multiple) instances
• Our features have more granularity, therefore our concepts may be constituents of a target, rather than encapsulating the target concept

GPR Experiments
• Extensive GPR Data set
  – ~800 targets
  – ~5,000 non-targets
• Experimental Design
  – Run RSF-MIL-d (disjunctive) and RSF-MIL-c (conjunctive)
  – Compare both feature extraction methods
    • Gross extraction: large enough to encompass target concept
    • Fine extraction: Non-overlapping bins
• Hypothesis
  – RSF-MIL-d will perform well when using gross extraction whereas RSF-MIL-c will perform well using fine extraction

Experimental Results
• Highlights
  – RSF-MIL-d using gross extraction performed best
  – RSF-MIL-c performed better than RSF-MIL-d when using fine extraction
  – Other influencing factors: optimization methods for RSF-MIL-d and RSF-MIL-c are not the same
[Figure panels: Gross Extraction, Fine Extraction]

Future Work
• Implement a general form that can learn disjunction or conjunction relationship from the data
• Implement a general form that can learn the number of concepts
• Incorporate spatial information
• Develop an improved optimization scheme for RSF-MIL-c

HMM Model Visualization
[Figure: DTXTHMM. Points = Gaussian component means, plotted as Falling Diagonal vs. Rising Diagonal; color = state index (1, 2, 3). Panels: initial probabilities per state index; transition probabilities from state to state (red = high probability); pattern characterized]
Backup Slides

MIL Example (AHI Imagery)
• Robust learning tool
  – MIL tools can learn target signature with limited or incomplete ground truth
• Which spectral signature(s) should we use to train a target model or classifier?
  1. Spectral mixing
  2. Background signal
  3. Ground truth not exact

MI-RVM
• Addition of set observations and inference using noisy-OR to an RVM model
  $P(y = 1 \mid X) = 1 - \prod_{j=1}^{K} \left(1 - \sigma(w^T x_j)\right)$, $\sigma(z) = \frac{1}{1 + \exp(-z)}$
• Prior on the weight w
  $p(w) = N(w \mid 0, A^{-1})$

SVM review
• Classifier structure
  $y(x) = w^T \phi(x) + b$
• Optimization
  $\min_{w,b} \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$
  s.t. $\forall i : t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0$

MI-SVM Discussion
• RVM was altered to fit MIL problem by changing the form of the target variable's posterior to model a noisy-OR gate.
• SVM can be altered to fit the MIL problem by changing how the margin is calculated
  – Boost the margin between the bag (rather than samples) and decision surface
  – Look for the MI separating linear discriminant
    • There is at least one sample from each bag in the half space

mi-SVM
• Enforce MI scenario using extra constraints
  $\min_{\{t_i\}} \min_{w,b} \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$
  s.t. $\forall i : t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ t_i \in \{-1, 1\}$
  – At least one sample in each positive bag must have a label of 1:
    $\sum_{i \in I} \frac{t_i + 1}{2} \ge 1,\ \forall I : T_I = 1$
  – All samples in each negative bag must have a label of -1:
    $t_i = -1,\ \forall I : T_I = -1$
• Mixed integer program: must find optimal hyperplane and optimal labeling set

Current Applications
I. Multiple Instance Learning
  I. MI Problem
  II. MI Applications
II. Multiple Instance Learning: Kernel Machines
  I. MI-RVM
  II. MI-SVM
III. Current Applications
  I. GPR imagery
  II. HSI imagery

HSI: Target Spectra Learning
• Given labeled areas of interest: learn target signature
• Given test areas of interest: classify set of samples

Overview of MI-RVM Optimization
• Two step optimization
  1. Estimate optimal w, given posterior of w
    • There is no closed form solution for the parameters of the posterior, so a gradient update method is used
    • Iterate until convergence. Then proceed to step 2.
  2. Update parameter on prior of w
    • The distribution on the target variable has no specific parameters.
    • Until system convergence, continue at step 1.

1) Optimization of w
• Optimize posterior (Bayes' Rule) of w
  $\hat{w}_{MAP} = \arg\max_w \left[\log p(X \mid w) + \log p(w)\right]$
• Update weights using Newton-Raphson method
  $w_{t+1} = w_t - H^{-1} g$

2) Optimization of Prior
• Optimization of covariance of prior
  $\hat{A} = \arg\max_A p(X \mid A) = \arg\max_A \int p(X \mid w)\, p(w \mid A)\, dw$
• Making a large number of assumptions, diagonal elements of A can be estimated
  $a_i^{new} = \frac{1}{w_i^2 + H_{ii}^{-1}}$

Random Sets: Multiple Instance Learning
• Random set framework for multiple instance learning
  – Bags are sets
  – Idea of finding commonality of positive bags inherent in random set formulation
    • Find commonality using intersection operator
    • The governing functional of a random set $\Gamma$ is based on the intersection operator
      $T_\Gamma(K) = P(\Gamma \cap K \neq \emptyset)$

MI issues
• MIL approaches
  – Some approaches are biased to believe only one sample in each bag caused the target concept
  – Some approaches can only label bags
  – It is not clear whether anything is gained over supervised approaches

RSF-MIL
• MIL-like
• Positive Bags = blue
• Negative Bags = orange
• Distinct shapes = distinct bags
[Figure: scatter of instances (x) and target germs (T)]

Side Note: Bayesian Networks
• Noisy-OR Assumption
  – Bayesian Network representation of Noisy-OR
  – Polytree: singly connected DAG

Side Note
• Full Bayesian network may be intractable
  – Occurrences of causal factors are rare (sparse co-occurrence)
    • So assume polytree
    • So assume result has boolean relationship with causal factors
  – Absorb I, X and A into one node, governed by randomness of I
    • These assumptions greatly simplify inference calculation
    • Calculate Z based on probabilities rather than constructing a distribution using X
  $P(Z = 1 \mid \{X_1, X_2, X_3, X_4\}) = 1 - \prod_j \left(1 - P(Z = 1 \mid X_j)\right)$

Diverse Density (DD)
• Probabilistic Approach
  – Goal:
    • Standard statistics approaches identify areas in a feature space with high density of target samples and low density of non-target samples
    • DD: identify areas in a feature space with a high "density" of samples from EACH of the positive bags ("diverse"), and low density of samples from negative bags.
  – Identify attributes or characteristics similar to positive bags, dissimilar with negative bags
  – Assume t is a target characterization
  – Goal: $\arg\max_t P(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid t)$
  – Assuming the bags are conditionally independent:
    $\arg\max_t \prod_i P(B_i^+ \mid t) \prod_j P(B_j^- \mid t)$

Diverse Density
• Calculation (Noisy-OR Model), with $B_i = \{x_{i1}, \ldots, x_{iJ_i}\}$:
  $P(t \mid B_i^+) = 1 - \prod_j \left(1 - P(t \mid B_{ij}^+)\right)$
  It is NOT the case that EACH element is NOT the target concept
  $P(t \mid B_i^-) = \prod_j \left(1 - P(t \mid B_{ij}^-)\right)$
  $P(t \mid B_{ij}) = \exp\left(-\lVert B_{ij} - t \rVert^2\right) = \exp\left(-\lVert x_{ij} - t \rVert^2\right)$
  $\arg\max_t \prod_i P(B_i^+ \mid t) \prod_j P(B_j^- \mid t)$

Random Set Functionals
• Capacity and avoidance functionals of a random set $\Gamma$
  $T_\Gamma(K) = P(\Gamma \cap K \neq \emptyset)$
  $T_\Gamma(K) = 1 - Q_\Gamma(K)$
  $Q_\Gamma(K) = P(\Gamma \cap K = \emptyset)$
  – Given germ and grain model with pairs indexed by $ij$, $j = 1, \ldots, n_i$:
    $P(\{x\} \mid \Gamma_{ij}) = T_{\Gamma_{ij}}(\{x\}) = P(R_{ij} \ge r_{ij}) = 1 - P(R_{ij} < r_{ij})$, where $r_{ij}$ is the offset of $x$ from germ $ij$ and the assumed cumulative is an exponential of a quadratic form in $r_{ij}$

When disjunction makes sense
[Diagram: OR gate feeding "Target Concept Present"]
• Using large overlapping bins the target concept can be encapsulated within 1 instance; therefore a disjunctive relationship exists

Theoretical and Developmental Progress
• Previous optimization
  – Objective of the form $\arg\max\, T_{\Gamma_j}(B_i^+)\, Q_{\Gamma_j}(B_i^-)$ over concepts j and bags i (aggregation operators not recoverable)
  – Did not necessarily promote diverse density
• Current optimization
  – Objective of the form $\arg\max\, T_{\Gamma_j}(B_i^+)\, Q_{\Gamma_j}(B_i^-)$ over concepts j and bags i (aggregation operators not recoverable)
  – Better for context learning and MIL
• Previous TO DO list
  – Improve existing code
  – Develop joint optimization for context learning and MIL
  – Apply MIL approaches (broad scale)
    • Learn similarities between feature sets of mines
    • Aid in training existing algorithms: find "best" EHD features for training / testing
    • Construct set-based classifiers?

How do we impose the MI scenario?: Diverse Density (Maron et al.)
• Calculation (Noisy-OR Model), with $B_i = \{x_{i1}, \ldots, x_{iJ_i}\}$:
  – Inherent in Random Set formulation
  $P(t \mid B_i^+) = 1 - \prod_j \left(1 - P(t \mid B_{ij}^+)\right)$
  It is NOT the case that EACH element is NOT the target concept
  $P(t \mid B_i^-) = \prod_j \left(1 - P(t \mid B_{ij}^-)\right)$
  $P(t \mid B_{ij}) = \exp\left(-\lVert B_{ij} - t \rVert^2\right) = \exp\left(-\lVert x_{ij} - t \rVert^2\right)$
• Optimization
  $\arg\max_t \prod_i P(t \mid B_i^+) \prod_j P(t \mid B_j^-)$
  – Combo of exhaustive search and gradient ascent

How can we use Random Sets for MIL?
• Random set for MIL: Bags are sets
  – Idea of finding commonality of positive bags inherent in random set formulation
    • Sets have an empty intersection or non-empty intersection relationship
    • Find commonality using intersection operator
    • The governing functional of a random set is based on the intersection operator
• Example:
  – Bags with target (take the intersection): {l,a,e,i,o,p,u,f}, {f,b,a,e,i,z,o,u}, {a,b,c,i,o,u,e,p,f}, {a,f,t,e,i,u,o,d,v}
  – Bags without target (take the union): {s,r,n,m,p,l}, {z,s,w,t,g,n,c}, {f,p,k,r}, {q,x,z,c,v}, {p,l,f}
  – Target concept = {a,e,i,o,u,f} \ {f,s,r,n,m,p,l,z,w,t,g,c,v,q,x,k} = {a,e,i,o,u}
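
The letter-bag example can be checked with a short Python sketch using built-in sets. The function name `target_concept` is a hypothetical label for the intersection-minus-union rule; real bags contain feature vectors, where the hard intersection is replaced by the capacity functional.

```python
# Bags from the example, written as sets of letters.
positive_bags = [set("laeiopuf"), set("fbaeizou"), set("abciouepf"), set("afteiuodv")]
negative_bags = [set("srnmpl"), set("zswtgnc"), set("fpkr"), set("qxzcv"), set("plf")]


def target_concept(positive_bags, negative_bags):
    """Elements common to every positive bag that never occur in a negative bag."""
    common_to_positives = set.intersection(*positive_bags)
    seen_in_negatives = set().union(*negative_bags)
    return common_to_positives - seen_in_negatives


print(sorted(target_concept(positive_bags, negative_bags)))
```

The letter f is shared by all positive bags but is removed because it also appears in negative bags.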