Learning to distinguish cognitive subprocesses based on fMRI
Tom M. Mitchell

Center for Automated Learning and Discovery
Carnegie Mellon University
Collaborators: Luis Barrios, Rebecca Hutchinson, Marcel Just, Francisco Pereira, Jay Pujara, John Ramish, Indra Rustandi
Can we distinguish brief cognitive processes using fMRI?
(e.g., does the subject find a sentence ambiguous or not?)

Can we classify/track multiple overlapping processes?
[Figure: read sentence, view picture, decide whether consistent – together with the observed fMRI and the observed button press]
Mental Algebra Task
[Anderson, Qin, & Sohn, 2002]
Activity Predicted by ACT-R Model
[Anderson, Qin, & Sohn, 2002]
Typical ACT-R rule:
IF “_ op a = b”
THEN “ _ = <b <inv op> a>”
[Anderson, Qin, & Sohn, 2002]
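The rule above can be sketched as executable code. A minimal illustration, assuming ordinary arithmetic operators; the operator tables and the `solve` helper are illustrative, not part of ACT-R itself:

```python
# Sketch of the ACT-R production "IF '_ op a = b' THEN '_ = b <inv op> a'":
# solve for the blank by applying the inverse operator to b and a.

INVERSE = {"+": "-", "-": "+", "*": "/", "/": "*"}
APPLY = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
         "*": lambda x, y: x * y, "/": lambda x, y: x / y}

def solve(op, a, b):
    """Solve '_ op a = b' for the blank via the inverse operator."""
    return APPLY[INVERSE[op]](b, a)

# e.g., "_ * 3 = 24"  ->  solve("*", 3, 24)  ->  24 / 3
```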
Outline
• Training classifiers for short cognitive processes
  – Examples
  – Classifier learning algorithms
  – Feature selection
  – Training across multiple subjects
• Simultaneously classifying multiple overlapping processes
  – Linear model and classification
  – Hidden processes and EM
Training “Virtual Sensors” of Cognitive Processes
Train classifiers of the form:
fMRI(t, t+d) → CognitiveProcess
e.g., fMRI(t, t+8) → {ReadSentence, ViewPicture}
• Fixed set of cognitive processes
• Fixed time interval [t, t+d]
Study 1: Pictures and Sentences
Data from [Keller et al., 2001]
[Trial timeline: first stimulus (view picture or read sentence) presented at t=0 for 4 sec, followed by fixation; second stimulus (read sentence or view picture) presented next, followed by button press and rest; 8 sec per stimulus interval.]
• Subject answers whether sentence describes picture by
pressing button.
• 13 subjects, TR=500msec
Example sentence: "It is not true that the star is above the plus."
[Picture: a plus sign and a star arranged vertically.]
• Learn fMRI(t, t+8) → {Picture, Sentence}, for t = 0, 8
[Timeline: each 8-sec window following a stimulus (first stimulus at t=0, second at t=8) is classified as "picture or sentence?"]
Difficulties:
– only 8 seconds of very noisy data
– overlapping hemodynamic responses
– additional cognitive processes occurring simultaneously
Learning task formulation:
• Learn fMRI(t, …, t+8) → {Picture, Sentence}
  – 40 trials (40 pictures and 40 sentences)
  – fMRI(t, …, t+8) = voxels × time (~32,000 features)
  – Train a separate classifier for each of 13 subjects
  – Evaluate cross-validated prediction accuracy
• Learning algorithms:
  – Gaussian Naïve Bayes
  – Linear Support Vector Machine (SVM)
  – k-Nearest Neighbor
  – Artificial Neural Networks
• Feature selection/abstraction
  – Select a subset of voxels (by signal, by anatomy)
  – Select a subinterval of time
  – Summarize by averaging voxel activities over space, time
  – …
Learning a Gaussian Naïve Bayes (GNB) classifier for <f1, … fn> → C

For each class value ci:
1. Estimate the class prior P(C = ci)
2. For each feature fj, estimate P(fj | C = ci), modeling the distribution for each ci, fj as a Gaussian: P(fj | C = ci) = N(fj; μji, σji)

Applying the GNB classifier to a new instance:
C ← argmax_ci P(C = ci) ∏j P(fj | C = ci)
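The training and classification rules above can be sketched in a few lines of NumPy. A minimal illustration, assuming continuous features and a small variance floor for numerical stability; all names are illustrative:

```python
import numpy as np

# Gaussian Naive Bayes sketch: estimate P(C=ci) and a per-feature Gaussian
# P(fj | C=ci), then classify by the maximum (log) posterior.

def train_gnb(X, y):
    """X: (n_examples, n_features); y: class labels. Returns per-class stats."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X),        # prior P(C=c)
                    Xc.mean(axis=0),         # per-feature mean mu_cj
                    Xc.var(axis=0) + 1e-6)   # per-feature variance (floored)
    return model

def classify_gnb(model, x):
    """Return argmax_c of log P(C=c) + sum_j log N(x_j; mu_cj, var_cj)."""
    def log_posterior(stats):
        prior, mu, var = stats
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return max(model, key=lambda c: log_posterior(model[c]))
```

Working in log space avoids underflow when the product runs over thousands of voxel-time features.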
Support Vector Machines [Vapnik et al. 1992]
• Method for learning classifiers corresponding
to linear decision surface in high dimensional
spaces
• Chooses maximum margin decision surface
• Useful in many high-dimensional domains
– Text classification
– Character recognition
– Microarray analysis
Support Vector Machines (SVM)
[Figure: linear SVM decision surface with maximum-margin separating hyperplane]
Non-linear Support Vector Machines
• Based on applying kernel functions to data points
– Equivalent to projecting data into higher dimensional
space, then finding linear decision surface
– Select kernel complexity (H) to minimize 'structural risk':
  true error rate ≤ error on training data + variance term (which grows with kernel complexity H and shrinks with the number of training examples m)
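The kernel idea can be illustrated with a small runnable sketch. This is a kernel perceptron, not the SVM optimizer itself, but it shows the same point: an RBF kernel implicitly maps points into a higher-dimensional space where XOR, which no line separates in the input space, becomes learnable by a linear rule. All names and parameters are illustrative:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel: implicit high-dimensional feature-space dot product."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, epochs=20):
    """y in {-1, +1}; learns dual coefficients alpha (one per example)."""
    alpha = np.zeros(len(X))
    K = np.array([[rbf(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(len(X)):
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0   # mistake-driven dual update
    return alpha

def predict(X_train, y, alpha, x):
    return np.sign(sum(a * yi * rbf(xi, x)
                       for a, yi, xi in zip(alpha, y, X_train)))

# XOR: not linearly separable in 2-D, but separable in the kernel's feature space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = train_kernel_perceptron(X, y)
```

Unlike an SVM, this rule does not maximize the margin; it only demonstrates the kernel trick of working entirely through pairwise kernel evaluations.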
Generative vs. Discriminative Classifiers
Goal: learn f: data → C, equivalently P(C | data)

Discriminative classifier:
• Learn P(C | data) directly

Generative classifier:
• Learn P(data | C) and P(C)
• Classify using Bayes rule: P(C | data) ∝ P(data | C) P(C)
Generative vs. Discriminative Classifiers

                                      Discriminative                  Generative
What they estimate:                   P(C|data)                       P(data|C)
Examples:                             SVMs, Artificial Neural Nets    Naïve Bayes, Bayesian networks
Robustness to modeling errors:        Typically more robust           Less robust
Criterion for estimating parameters:  Minimize classification error   Maximize data likelihood
GNB vs. Logistic regression
[Ng, Jordan NIPS03]

Gaussian naïve Bayes:
• Model P(X|C) as a class-conditional Gaussian
• Decision surface: hyperplane
• Learning converges in O(log(n)) examples, where n is the number of data attributes

Logistic regression:
• Model P(C|X) as a logistic function
• Decision surface: hyperplane
• Learning converges in O(n) examples
• Asymptotic error less than or equal to GNB's
Accuracy of Trained Pict/Sent Classifier
• Results (leave one out cross validation)
– Guessing → 50% accuracy
– SVM: 91% mean accuracy
• Single subject accuracies ranged from 75% to 98%
– GNB: 84% mean accuracy
– Feature selection step important for both
• ~10,000 voxels x 16 time samples = 160,000 features
• Selected only 240 voxels x 16 time samples
Can We Train Subject-Indep Classifiers?
Training Cross-Subject Classifiers for Picture/Sentence
[Wang, Hutchinson, Mitchell. NIPS03]
• Approach 1: define "supervoxels" based on anatomically defined brain regions
  – Abstract to seven brain-region supervoxels
  – Each supervoxel contains 100s to 1000s of voxels
• Train on n-1 subjects, test on nth subject
• Result: 75% prediction accuracy over subjects
outside training set
– Compared to 91% avg. single-subject accuracies
– Significantly better than 50% guessing accuracy
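The supervoxel abstraction above can be sketched directly: average all voxels within each anatomically defined region into a single time course, giving a low-dimensional representation shared across subjects. A minimal sketch; the region labels and function name are illustrative:

```python
import numpy as np

def to_supervoxels(voxel_data, region_of_voxel, regions):
    """Collapse voxels into per-region 'supervoxel' time courses.

    voxel_data: (n_voxels, n_timepoints) activity matrix.
    region_of_voxel: anatomical label for each voxel.
    regions: ordered list of region labels to keep.
    Returns (n_regions, n_timepoints) of region-mean activity.
    """
    return np.array([voxel_data[region_of_voxel == r].mean(axis=0)
                     for r in regions])
```

Because the regions are defined anatomically rather than per-subject, the resulting feature vector has the same meaning for every subject, which is what makes training on n-1 subjects and testing on the nth possible.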
Study 2: Semantic Word Categories
[Francisco Pereira]
Word categories:
• Fish
• Trees
• Vegetables
• Tools
• Dwellings
• Building parts
Experimental setup:
• Block design
• Two blocks per category
• Each block begins by presenting the category name, then 20 words
• Subject indicates whether each word fits the category
Learning task formulation
• Learn fMRI(t, …, t+32) → WordCategory
– fMRI(t,…t+32) represented by mean fMRI image
– Train on presentation 1, test on presentation 2 (and vice versa)
• Learning algorithm:
– 1-Nearest Neighbor, based on spatial correlation [after Haxby]
• Feature selection/abstraction
– Select most ‘object selective’ voxels, based on multiple regression
on boxcars convolved with gamma function
– 300 voxels in ventral temporal cortex produced greatest accuracy
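The 1-nearest-neighbor rule above can be sketched concisely: represent each block by its mean fMRI image over the selected voxels, and label a test image with the category of the training image it is most spatially correlated with. A minimal sketch with illustrative names:

```python
import numpy as np

def correlation(a, b):
    """Spatial (Pearson) correlation between two flattened fMRI images."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.mean(a * b)

def classify_1nn(train_images, train_labels, test_image):
    """Label the test image with the category of its most correlated neighbor."""
    corrs = [correlation(img, test_image) for img in train_images]
    return train_labels[int(np.argmax(corrs))]
```

Correlation, unlike Euclidean distance, ignores per-image shifts and scaling in overall signal level, which is why it is a common similarity measure for fMRI patterns.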
Results predicting word semantic category
Mean pairwise prediction accuracy averaged over 8 subjects:
• Ventral temporal: 77% (low: 57%, high: 88%)
• Parietal: 70%
• Frontal: 67%
Random guess: 50%
[Figure: mean activation per voxel for word categories, P(fMRI | WordCategory), shown for Vegetables, Tools, and Dwellings; one horizontal slice, ventral temporal cortex. [Pereira, et al., 2004]]
[Figure: plot of single-voxel classification accuracies for a Gaussian naïve Bayes classifier (yellow and red are most predictive). Images from three different subjects (Subjects 1–3) show similar regions with highly informative voxels.]
Single-voxel GNB classification error vs. p value from T-statistic:
• N = 10^6: P < 0.0001, Error = 0.51
• N = 10^3: P < 0.0001, Error = 0.01
With enough samples, a voxel can be highly "significant" yet nearly useless for classification.
Cross-validated prediction error is an unbiased estimate of the Bayes optimal error – the area under the intersection of the class-conditional distributions.
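The dissociation between p values and classification error can be reproduced in a short simulation. This uses invented Gaussian data, not the study's data: with a tiny effect and a huge sample the t statistic is large (so p is astronomically small), yet a single-feature classifier barely beats chance; with a large effect and a modest sample the classifier is nearly perfect. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def t_stat(a, b):
    """Welch two-sample t statistic."""
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))

def midpoint_error(a, b):
    """Error of the classifier thresholding at the midpoint of the two means
    (assumes a.mean() > b.mean())."""
    thresh = (a.mean() + b.mean()) / 2
    return 0.5 * (np.mean(a < thresh) + np.mean(b >= thresh))

# Tiny effect, huge sample: statistically 'significant' but useless.
a = rng.normal(0.01, 1.0, 1_000_000)
b = rng.normal(0.00, 1.0, 1_000_000)

# Large effect, modest sample: also 'significant', nearly perfect classifier.
c = rng.normal(5.0, 1.0, 1000)
d = rng.normal(0.0, 1.0, 1000)
```

The t statistic grows with sqrt(N), but classification error depends only on how much the two distributions overlap, which is exactly the point of the slide.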
Question:
Do different people’s brains
‘encode’ semantic categories
using the same spatial patterns?
No.
But, there are cross-subject
regularities in “distances”
between categories, as
measured by classifier
error rates.
Six-Category Study: Pairwise Classification Errors
(ventral temporal cortex)
(* marks each subject's worst and best category errors)

        Fish   Vegetables  Tools   Dwellings  Trees   Bldg Parts
Subj1   .20    .55 *       .20     .15        .05 *   .15
Subj2   .10 *  .55 *       .35     .20        .30     .10 *
Subj3   .20    .35 *       .15 *   .20        .20     .20
Subj4   .15    .45 *       .15     .15        .05 *   .25
Subj5   .60 *  .55         .25     .20        .15 *   .15 *
Subj6   .20    .25         .00 *   .30 *      .05     .30 *
Subj7   .15    .55 *       .15     .25        .05 *   .15
Mean    .23    .46         .18     .21        .12     .19
LDA classification of semantic categories of photographs. [Carlson, et al., J. Cog. Neurosci., 2003]

Cox & Savoy, NeuroImage 2003: trained SVM and LDA classifiers for semantic photo categories. Classifiers applied to the same subject a week later were equally accurate.
Lessons Learned
Yes, one can train machine learning classifiers to distinguish a variety of cognitive processes:
– Comprehend Picture vs. Sentence
– Read ambiguous sentence vs. unambiguous
– Read Noun vs. Verb
– Read Nouns about "tools" vs. "building parts"
Failures too:
– True vs. false sentences
– Negative vs. affirmative sentences
Which Machine Learning Method Works Best?
• GNB and SVM tend to outperform KNN
• Feature selection important
[Figure: average per-subject classification error for each method, with ("Yes") and without ("No") feature selection]
Which Feature Selection Works Best?
Wish to learn F: <x1, x2, … xn> → {A, B}
• Conventional wisdom: pick features xi that best distinguish between classes A and B
  – E.g., sort xi by mutual information, choose the top n
• Surprise: an alternative strategy worked much better
[Figure: the learning setting – voxel-activity distributions for Class A, Class B, and Rest/Fixation, and the resulting voxel discriminability]
GNB Classifier Errors: Feature Selection
(rows: feature selection method; columns: fMRI study)

                              Picture   Syntactic  Nouns vs.  Word
                              Sentence  Ambiguity  Verbs      Categories
All features                  .29       .43        .36        .10
Discriminate target classes   .26       .34        .36        .10
Active                        .16       .25        .34        .08
ROI Active                    .18       .27        .31        .09
ROI Active Average            .21       .27        .23        NA
"Zero Signal" learning setting.
Select features based on discrim(X1, X2) or discrim(Z, Xi)?

Goal: learn f: X → Y, or P(Y|X)
Given:
1. Training examples <Xi, Yi>, where Xi = Si + Ni, with signal Si ~ P(S | Y = Yi) and noise Ni ~ Pnoise
   (Class 1 observations: X1 = S1 + N1; Class 2 observations: X2 = S2 + N2)
2. Observed noise with zero signal (fixation): Z = N0, where N0 ~ Pnoise
“Zero Signal” learning setting
Conjecture: feature selection using
discrim(Z,Xi) will improve relative to
discrim(X1,X2) as:
• # of features increases
• # of training examples decreases
• signal/noise ratio decreases
• fraction of relevant features decreases
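The conjecture can be probed with a toy simulation (invented data, not from any of the studies above): with many features, few training examples, and few relevant features, ranking features by how well they separate the two training classes overfits to noise, while ranking them by how much they differ from the zero-signal (fixation) condition recovers the truly relevant features more reliably. All sizes and signal levels below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat, n_rel, n_train = 1000, 20, 10   # features, relevant ones, examples/class

# Relevant features carry signal in both classes (class 1: +1, class 2: +2);
# the remaining features are pure noise.
signal1 = np.zeros(n_feat); signal1[:n_rel] = 1.0
signal2 = np.zeros(n_feat); signal2[:n_rel] = 2.0

X1 = signal1 + rng.normal(0, 1, (n_train, n_feat))   # class-1 training data
X2 = signal2 + rng.normal(0, 1, (n_train, n_feat))   # class-2 training data
Z = rng.normal(0, 1, (2 * n_train, n_feat))          # fixation: noise only

score_disc = np.abs(X1.mean(0) - X2.mean(0))                   # discrim(X1, X2)
score_zero = np.abs(np.vstack([X1, X2]).mean(0) - Z.mean(0))   # discrim(Z, X)

top_disc = np.argsort(score_disc)[-n_rel:]   # top-20 features per criterion
top_zero = np.argsort(score_zero)[-n_rel:]
hits_disc = np.sum(top_disc < n_rel)   # how many truly relevant were selected
hits_zero = np.sum(top_zero < n_rel)
```

The zero-signal criterion benefits from a larger effective effect size (both classes' signals push the task mean away from fixation) and from not chasing chance class separations in the small training sample.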
2. Can we classify/track multiple overlapping processes?
[Figure: input stimuli (read sentence, view picture, decide whether consistent – onset unknown) together with the observed fMRI and the observed button press]
Bayes Net related State-Space Models
HMMs, DBNs, etc.; e.g., [Ghahramani, 2001]
[Figure: cognitive subprocesses / state variables generating the observed fMRI]
see [Hojen-Sorensen et al., NIPS99]
Hidden Process Model [with Rebecca Hutchinson]
Each process is defined by:
– ProcessID: e.g., <comprehend sentence>
– Maximum HDR duration: R
– Emission distribution: [W(v, t)]
An interpretation Z of the data is a set of process instances:
– Desire the maximum-likelihood {<ProcessIDi, StartTimei>}
– Where the data likelihood treats the observed fMRI as the superposition of the instantiated process responses W, plus noise
Generative model for classifying overlapping hidden processes
Classifying Processes with HPMs
Start time known: choose the process assignment that maximizes the data likelihood.
Start time unknown: also maximize over a set of candidate start times S.
The GNB classifier is a special case of the HPM classifier.
[Timeline figure: in the picture/sentence trial, GNB classifies each fixed 8-sec window separately ("picture or sentence?"), while the HPM classifies over the full 16-sec trial, accounting for overlapping responses.]
Learning HPMs
• Known start times:
  Least squares regression, e.g., see Dale [Human Brain Mapping, 1999]
• Unknown start times: EM algorithm
  – Repeat:
    • E step: estimate P(S | Y, W)
    • M step: W′ ← arg max over W of the expected log likelihood under the estimated P(S | Y, W)
  – The M step is currently implemented with gradient ascent.
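The known-start-times case can be sketched as a small deconvolution under the superposition assumption: build a 0/1 design matrix whose columns mark "process p started d steps ago," and recover both process responses at once by ordinary least squares. The responses, onsets, and noise level below are invented for illustration, not the study's data:

```python
import numpy as np

R, T = 8, 40                                        # response length, scan length
resp1 = np.array([0., .3, .8, 1., .8, .5, .2, .1])  # "true" process responses
resp2 = np.array([0., .5, 1., .7, .4, .2, .1, 0.])
starts1, starts2 = [0, 20], [4, 26]                 # known, overlapping onsets

# Design matrix: column (p, d) is 1 at times where process p started d steps ago.
X = np.zeros((T, 2 * R))
for s in starts1:
    for d in range(R):
        X[s + d, d] = 1.0
for s in starts2:
    for d in range(R):
        X[s + d, R + d] = 1.0

# Simulated 1-voxel signal: sum of the overlapping responses plus noise.
rng = np.random.default_rng(0)
y = X @ np.concatenate([resp1, resp2]) + rng.normal(0, 0.05, T)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # OLS estimate of both responses
```

Note that the onsets of the second process are offset differently in the two trials; if every trial had identical relative timing the design matrix would be rank-deficient and the two responses could not be disentangled, which matches the degenerate estimates seen in the zero-noise example below.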
OLS learns 2 processes, overlapping in time, 1 voxel, zero noise, start times known, 10 trials [Indra Rustandi]
[Figure: observed data, reconstructed data, and the two learned process responses, with the estimated response values]
OLS learns 2 processes, overlapping in time, 1 voxel, noise 0.2, start times known, 10 trials [Indra Rustandi]
[Figure: observed data, reconstructed data, and the two learned process responses, with the estimated response values]
Estimate Noun and Verb impulse responses [Indra Rustandi]
Phase II, words every 3 seconds. Mean LFEF, subject 08179.
[Figure: verb impulse response estimated from the overlapping-stimulus data above, compared to the "ground truth" response from non-overlapping stimuli]
Can we classify/track multiple overlapping processes?
Learned HPM with 3 processes (S, P, D), and R = 13 sec (TR = 500 msec).
[Figure: learned process models and observed vs. reconstructed fMRI for trials with process sequences S–P–D and P–S–D (S = read sentence, P = view picture, D = decide whether consistent); the D start time is picked to be trialStart + 18]
Initial results: HPMs on PictSent
• EM chooses start time = 18 for the hidden D process
• Classification accuracy for held-out PS/SP trials = 15/20 = 0.75
• Held-out classification accuracy is the same for the 2-process (P,S) and 3-process (P,S,D) models
• Data likelihood over held-out data is slightly better for the 3-process (P,S,D) model
Further reading
• Carlson, et al., J. Cog. Neurosci., 2003.
• Cox, D.D. and R.L. Savoy. Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, Volume 19, Pages 261-270, 2003.
• Kjems, U., L. Hansen, J. Anderson, S. Frutiger, S. Muley, J. Sidtis, D. Rottenberg, and S. C. Strother. The quantitative evaluation of functional neuroimaging experiments: mutual information learning curves. NeuroImage 15, pp. 772-786, 2002.
• Mitchell, T.M., R. Hutchinson, M. Just, S. R. Niculescu, F. Pereira, X. Wang. Classifying Instantaneous Cognitive States from fMRI Data. Proceedings of the 2003 American Medical Informatics Association Annual Symposium, Washington D.C., November 2003.
• Mitchell, T.M., R. Hutchinson, S. R. Niculescu, F. Pereira, X. Wang, M. Just, S. Newman. Learning to Decode Cognitive States from Brain Images. Machine Learning, 2004.
• Strother, S.C., J. Anderson, L. Hansen, U. Kjems, R. Kustra, J. Siditis, S. Frutiger, S. Muley, S. LaConte, and D. Rottenberg. The quantitative evaluation of functional neuroimaging experiments: The NPAIRS data analysis framework. NeuroImage 15:747-771, 2002.
• Wang, X., R. Hutchinson, and T. M. Mitchell. Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects. Proceedings of the 2003 Conference on Neural Information Processing Systems, Vancouver, December 2003.