Searching for structure
in random field data
Keith J. Worsley1,2,
Thomas W. Yee3, Russell B. Millar3
1Department
of Mathematics and Statistics, McGill University,
2McConnell Brain Imaging Centre, Montreal Neurological Institute,
Montreal, Canada, and
3Department of Statistics, University of Auckland, New Zealand
www.math.mcgill.ca/keith
What is Data Mining?
The June 26, 2000, issue of TIME predicted that
one of the 10 hottest jobs of the 21st century
will be Data Mining:
“ … research gurus will be on hand to extract
useful tidbits from mountains of data,
pinpointing behaviour patterns for marketers
and epidemiologists alike.”
Some definitions:
• Data mining is the process of selecting, exploring, and modeling large
amounts of data to uncover previously unknown patterns for business
advantage (SAS 1998 Annual Report, p51)
• Data mining is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data (Fayyad)
• Data mining is the process of discovering advantageous patterns in data
(John)
• Data mining is the computer automated exploratory data analysis of
(usually) large complex data sets (Friedman, 1998)
• Data mining is the search for valuable information in large volumes of data
(Weiss and Indurkhya, 1998)
• In contrast, Statistics is the science of collecting, organizing and presenting
data.
Why is it called “Data Mining”?
• Plentiful data can be mined for nuggets of gold (i.e. truth
/insight/knowledge) by sifting through vast amounts of raw data.
• Some statisticians have criticized it as “data dredging” or a
“fishing expedition” in search of publishable P-values, or
“torturing the data until it confesses”.
• Many DM methods are heuristic, complex, computer intensive,
so their statistical properties are usually not tractable.
• The focus of DM is often prediction and not statistical inference.
• I understand mining to be a very carefully planned search for
valuables hidden out of sight, not a haphazard ramble. Mining is
thus rewarding, but, of course, a dangerous activity. (D.R. Cox,
in the discussion of Chatfield, 1995).
Striking fool’s gold
• “The Bible Code”, a best-selling book by Michael
Drosnin, claims to find hidden messages in the Bible
about dinosaurs, Bill Clinton, the Rabin assassination
etc. from searches of arrays of letters …
• In 1992, ProCyte Corp. was dismayed when a newly
developed drug, Iamin, failed to promote general
healing of diabetic ulcer wounds. So the company
searched through subsets of data and found that Iamin
appeared to work on certain foot wounds. But that
was a statistical fluke, as it turned out after an
expensive clinical trial. Not allowed drug status,
Iamin is now sold as a wound dressing …
Confirming vs. Discovering
There are two types of DM:
1. Hypothesis testing (aka top-down approach)
2. Knowledge Discovery in Databases (KDD)
(aka bottom-up approach)
• Directed KDD: want to explain the value of some
particular variable in terms of other variables
• Undirected KDD: identifies patterns in the data.
Undirected KDD recognizes relationships in data;
Directed KDD explains those relationships once they
have been found.
Mining the miners
DM so far has been largely a commercial enterprise. As
in most gold rushes of the past, the goal is to “mine
the miners”. The largest profits are made by selling
the tools to the miners, rather than in doing the actual
mining:
• Hardware manufacturers emphasize high
computational requirements of DM.
• Software developers emphasize competitive edge
“Your competitor is doing it, so you had better keep
up.”
Some commercial software
• SAS “Enterprise Miner”
• SPSS “Clementine”, “Neural Connection”
and “AnswerTree”
• IBM “Intelligent Miner”
• SGI “MineSet”
• NeoVista Software “ASIC”
• Mathsoft “S-PLUS” (for small data sets)
Some methods
• Hypothesis testing: Regression, analysis of
variance, time series analysis.
• Directed KDD: Classification, discrimination,
structural equation modeling, supervised
neural networks.
• Undirected KDD: Cluster analysis, tree
methods (AID, CHAID, CART), principal
components analysis (PCA), independent
components analysis (ICA), unsupervised
neural networks.
Allied fields
• Exploratory Data Analysis (EDA): Tukey defined
statistics in terms of problems rather than tools.
• Informatics “is research on, development of, and use of technological, sociological, and
organizational tools and applications for the dynamic acquisition, indexing, dissemination, storage,
querying, retrieval, visualization, integration, analysis, synthesis, sharing (which includes electronic means
of collaboration), and publication of data such that economic and other benefits may be derived from the
information by users of all sections of society.”
• Pattern recognition: given some examples of complex signals
and the correct decisions for them, make decisions
automatically for a stream of future examples, e.g. identify
plants, tumors, decide to buy or sell stocks.
• Machine learning “is the study of computer algorithms that improve
automatically through experience. Applications range from data mining programs
that discover rules in large data sets, to information filtering systems that
automatically learn users’ interests.” (Mitchell, 1997).
• Meta-Analysis is the statistical analysis of a large collection of
analysis results from individual studies for the purpose of
integrating the findings.
Brain mapping data
• We have huge data bases of brain images (MRI,
fMRI, PET, EEG, MEG …) together with patient
information (age, sex, psychological tests, disease,
genotype …)
• The novelty is that the image variables are 3D images
rather than single numbers (such as blood pressure,
cholesterol level …)
• These images can themselves be mined for interesting
information, e.g. peaks or clusters of activated
regions
Some data mining tools already
used in brain mapping
• Regression, analysis of variance, time series
• Cluster analysis (e.g. clustering of fMRI time
courses)
• PCA and ICA of voxels × scans matrix
• Structural equation modeling to analyze
connectivity
• Pattern recognition to segment gray/white/CSF
• Meta-analysis to combine locations of
activation from different studies
Tree methods: Automatic
Interaction Detection (AID)
Morgan, J.N. and Sonquist, J.A. (1963). Problems in the
analysis of survey data, and a proposal. Journal of the
American Statistical Association, 58, 415-434.
Kass, G.V. (1980). An exploratory technique for
investigating large quantities of categorical data.
Applied Statistics, 29, 119-127.
Worsley, K.J. (1978). Significance testing in Automatic
Interaction Detection (AID). PhD Thesis, University
of Auckland.
How AID works:
• Split observations into two groups according to the
values of a predictor
• Two types of predictors:
– Monotonic: split by thresholding:
{predictor ≤ x} | {predictor > x}
– Free: split into any two subsets, e.g. if predictor
takes values {x1, …, x7}:
{x1, x5, x6} | {x2, x3, x4, x7}
• Choose the split that maximizes a test statistic for the
difference in dependent or target variable
• Repeat on two subgroups until some stopping
criterion is reached (split is not ‘significant’ or
subgroup size is too small)
SPSS example: credit risk data
[Screenshot of an SPSS data table: a dependent or target variable plus several predictors, each marked M or F.]
M = monotonic (split by thresholding), F = free (split into any two subsets)
Brain mapping example: cortical thickness
Dependent or target: Sex. Predictors (M = monotonic): cortical thickness at each of 40,962 nodes.

Subject  Node1  Node2  Node3  Node4  …  Node40962  Sex
1        3.73   3.05   3.93   2.30   …  1.59       m
2        2.95   1.17   3.33   2.75   …  1.03       f
3        2.30   1.23   2.56   1.20   …  1.46       f
4        2.64   2.19   2.57   2.25   …  1.29       m
5        2.39   2.76   2.51   2.82   …  1.02       f
6        3.26   1.85   3.31   1.70   …  1.65       f
7        2.68   2.52   3.23   2.30   …  1.47       m
8        3.60   3.66   2.90   2.25   …  1.79       m
9        3.27   1.43   2.88   1.81   …  2.14       f
:         :      :      :      :          :         :
321      4.10   2.67   2.83   1.78   …  1.70       f
Misclassification matrix: cortical thickness

                   Actual Male   Actual Female
Predicted Male         145            18
Predicted Female        18           140
fMRI data: 120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, …
[Figure: first scan of fMRI data; voxel time courses over 0-300 seconds with hot/rest/warm blocks marked: one voxel with a highly significant effect (T = 6.59), one with no significant effect (T = -0.74), one showing drift; plus the T statistic image for the hot - warm effect.]
T = (hot - warm effect) / S.d. ~ t110 if no effect
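The hot - warm T statistic for one voxel can be sketched as an ordinary two-sample t on that voxel's time course. This is a toy version under simplifying assumptions (real fMRI analyses also model drift and temporal correlation); `hot_warm_t` is an illustrative name.

```python
import numpy as np

def hot_warm_t(y, stimulus):
    """Two-sample t statistic for the hot - warm effect at one voxel.

    y        : response time course at the voxel
    stimulus : array of 'hot' / 'warm' labels, same length as y
    """
    hot, warm = y[stimulus == "hot"], y[stimulus == "warm"]
    n1, n2 = len(hot), len(warm)
    # pooled variance estimate of the noise
    sp2 = ((n1 - 1) * hot.var(ddof=1) + (n2 - 1) * warm.var(ddof=1)) / (n1 + n2 - 2)
    # Under no effect (independent Gaussian noise), T ~ t with n1 + n2 - 2 df
    return (hot.mean() - warm.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
```

Applying this voxel by voxel gives the T statistic image shown above.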
Brain mapping example: fMRI
Dependent or target: Stimulus. Predictors (M = monotonic): Voxel1, …, Voxel30786.

Frame  Voxel1  Voxel2  Voxel3  Voxel4  …  Voxel30786  Stimulus
1       1.10    1.66    1.53    0.77   …  -0.12       hot
2      -0.59    0.23    0.38   -0.43   …  -1.73       hot
3       1.06    1.57    1.56    1.14   …   0.64       hot
4       1.63    1.79    0.88   -0.22   …  -0.07       hot
5       2.30    1.96    1.41    1.33   …   1.76       hot
6       1.27    1.36    0.73    0.24   …   1.22       warm
7       1.18    1.33    1.35    1.30   …   0.88       warm
8       0.98    0.90    0.47    0.18   …   0.60       warm
9       1.46    1.25    0.77    0.73   …   1.30       warm
10      0.07    0.70    1.29    1.96   …   2.04       warm
11      0.39    0.68    1.13    1.81   …   1.80       warm
12      0.04   -0.04   -0.18    0.37   …   1.63       hot
13     -0.06    0.20    0.29    0.49   …   0.70       hot
14     -0.48   -0.26   -0.19   -0.16   …  -0.42       hot
15     -0.09   -0.39   -0.84   -0.94   …  -0.68       hot
16     -0.24    0.02    0.51    1.20   …   1.38       hot
17     -1.52   -1.11   -1.44   -1.88   …  -1.11       hot
18     -0.07    0.10   -0.07   -0.24   …   0.17       warm
19     -1.40   -0.57    0.01    0.30   …   0.41       warm
:         :       :       :       :           :        :
117    -0.01    0.50    0.74    0.83   …   0.99       warm
Misclassification matrix: fMRI

                  Actual Hot   Actual Warm
Predicted Hot         51            7
Predicted Warm         1           58
Splitting the SPM itself:
Dependent or target: T statistic (?). Predictors: x, y, z (split type ?).

Voxel    x        y         z        T statistic
1        1.1719   -10.5469   7.2921  5.4852
2        3.5156   -10.5469   7.2921  5.9170
3        5.8594   -10.5469   7.2921  5.0115
4        1.1719    -8.2031   7.2921  6.1082
5        3.5156    -8.2031   7.2921  6.4825
6        5.8594    -8.2031   7.2921  5.7299
7        1.1719    -5.8594   7.2921  6.7113
8        3.5156    -5.8594   7.2921  7.3540
9        5.8594    -5.8594   7.2921  6.5934
10       1.1719   -10.5469  14.2921  5.4519
11       3.5156   -10.5469  14.2921  6.3674
12       5.8594   -10.5469  14.2921  6.3184
13       1.1719    -8.2031  14.2921  6.2774
14       3.5156    -8.2031  14.2921  6.5888
15       5.8594    -8.2031  14.2921  6.2456
16       1.1719    -5.8594  14.2921  6.3583
17       3.5156    -5.8594  14.2921  6.4093
18       5.8594    -5.8594  14.2921  5.8665
:          :          :         :       :
30786      .          .         .       .
How do we split on a spatial predictor?
Splits can be regarded as models with
different means for the two groups:
[Figure: SPM models for a monotonic predictor and a free predictor; for a spatial predictor, treated as a ‘free’ predictor, the smoothed vs. unsmoothed SPM model - smooth the SPM with a filter that matches the model.]
So …
• Treating spatial location as a free predictor (for
the smoothed SPM) is equivalent to simply
thresholding the smoothed SPM
• We can choose the threshold to control the
false splitting rate to P < 0.05 using Bonferroni
corrections or random field theory
• If model width is unknown, we can make filter
width another parameter of the model, which
leads to scale space:
Scale space: smooth X(t) with a range of filter widths, s
= continuous wavelet transform
adds an extra dimension to the random field: X(t, s)
Scale space, no signal / One 15mm signal
[Figure: scale-space images X(t, s); S = FWHM (mm, on log scale, 6.8-34) vs. t (mm, -60 to 60); top panel: no signal; bottom panel: one 15mm signal.]
15mm signal best detected with a ~15mm smoothing filter
Matched Filter Theorem (= Gauss-Markov Theorem):
“to best detect a signal + white noise, filter should match signal”
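Both ideas can be checked numerically. The sketch below (illustrative code, not the software behind these figures; `scale_space` is a hypothetical name) builds the scale-space image by smoothing with unit-energy Gaussian filters over a range of FWHMs, so the noise variance is constant across rows and peak responses are comparable:

```python
import numpy as np

def scale_space(x, fwhms, dt=1.0):
    """Smooth a 1D signal with Gaussian filters over a range of FWHMs,
    giving the scale-space image X(t, s), one row per filter width s.
    Filters have unit energy, so noise variance is the same in every row."""
    n = len(x)
    t = (np.arange(n) - n // 2) * dt
    rows = []
    for fwhm in fwhms:
        sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))   # FWHM -> Gaussian sigma
        g = np.exp(-t**2 / (2 * sigma**2))
        g /= np.sqrt(np.sum(g**2) * dt)               # unit-energy filter
        rows.append(np.convolve(x, g, mode="same") * dt)
    return np.vstack(rows)

# A 15mm Gaussian signal on t = -60..60 mm, at the FWHMs from the figure:
t = np.arange(-60.0, 61.0)
signal = np.exp(-t**2 / (2 * (15 / 2.3548)**2))
X = scale_space(signal, [6.8, 10.2, 15.2, 22.7, 34.0])
```

Consistent with the matched filter theorem, the row with the largest peak response (`X.max(axis=1)`) is the 15.2mm filter, the one closest to the 15mm signal width.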
10mm and 23mm signals / Two 10mm signals 20mm apart
[Figure: scale-space images; S = FWHM (mm, on log scale, 6.8-34) vs. t (mm, -60 to 60); top panel: 10mm and 23mm signals; bottom panel: two 10mm signals 20mm apart.]
But if the signals are too close together they are
detected as a single signal half way between them
Scale space can even separate two signals at the same location!
8mm and 150mm signals at the same location
[Figure: scale-space image; S = FWHM (mm, on log scale, 6.8-170) vs. t (mm, -60 to 60), showing separate peaks near FWHM = 8mm and FWHM = 150mm; below, the smoothed data at FWHM = 6.8, 9, 11, 15, 20, 26 and 34mm.]
Functional connectivity
• Measured by the correlation between residuals
at every pair of voxels (6D data!)
[Figure: scatterplots of voxel 1 vs. voxel 2 residuals, illustrating activation only (shifted means) vs. correlation only (correlated residuals).]
• Local maxima are larger than all 12 neighbours
• P-value can be calculated using random field
theory
• Good at detecting focal connectivity, but
• PCA of residuals x voxels is better at detecting
large regions of co-correlated voxels
[Figure: |correlations| > 0.7, P < 10^-10 (corrected), vs. first principal component > threshold.]
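Both connectivity measures can be sketched from the scans-by-voxels residual matrix. This is an illustrative sketch (`connectivity` is a hypothetical name, and the threshold 0.7 echoes the figure): correlate every voxel pair for focal connectivity, and take the first principal component of the residuals for large co-correlated regions.

```python
import numpy as np

def connectivity(residuals, r_thresh=0.7):
    """residuals: scans x voxels matrix, after removing the activation effect.

    Returns the voxel x voxel correlation matrix, a boolean matrix of
    strongly correlated (candidate connected) pairs, and the first
    principal component loadings over voxels."""
    r = np.corrcoef(residuals, rowvar=False)    # correlate every voxel pair
    focal = np.abs(r) > r_thresh                # focal connectivity candidates
    centred = residuals - residuals.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return r, focal, vt[0]                      # vt[0] = first PC loadings
```

In practice the significance of the correlations would be assessed with random field theory, as the bullets above describe, not by the raw 0.7 cutoff alone.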
False Discovery Rate (FDR)
Benjamini and Hochberg (1995), Journal of the Royal Statistical Society
Benjamini and Yekutieli (2001), Annals of Statistics
Genovese et al. (2001), NeuroImage
• FDR controls the expected proportion of false
positives amongst the discoveries, whereas
• Bonferroni / random field theory controls the
probability of any false positives
• No correction (uncorrected P-values) controls the expected
proportion of the volume that is a false positive
[Figure: signal + Gaussian white noise, with the signal and the noise shown separately; thresholded at P < 0.05 uncorrected (T > 1.64, 5% of volume is false +), FDR < 0.05 (T > 2.82, 5% of discoveries are false +), and P < 0.05 corrected (T > 4.22, 5% probability of any false +); true + and false + regions marked.]
Comparison of thresholds
• FDR depends on the ordered P-values P(1) < P(2) < … < P(n).
To control the FDR at α = 0.05, find K = max{i : P(i) < (i/n)α}
and threshold the P-values at P(K):

Proportion of true +   1     0.1   0.01  0.001  0.0001
Threshold T            1.64  2.56  3.28  3.88   4.41

• Bonferroni thresholds the P-values at α/n:

Number of voxels       1     10    100   1000   10000
Threshold T            1.64  2.58  3.29  3.89   4.42

• Random field theory: resels = volume / FWHM³:

Number of resels       0     1     10    100    1000
Threshold T            1.64  2.82  3.46  4.09   4.65
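The FDR rule above takes only a few lines (an illustrative sketch, after Benjamini and Hochberg, 1995; `bh_threshold` and `bonferroni_threshold` are hypothetical names), with Bonferroni shown for comparison:

```python
import numpy as np

def bh_threshold(p, alpha=0.05):
    """Benjamini-Hochberg: sort the P-values and find
    K = max{i : P_(i) <= (i/n) * alpha}; threshold at P_(K).
    Returns 0 if no P-value passes (nothing is declared a discovery)."""
    p = np.sort(np.asarray(p, dtype=float))
    n = len(p)
    passed = np.nonzero(p <= np.arange(1, n + 1) / n * alpha)[0]
    return p[passed.max()] if passed.size else 0.0

def bonferroni_threshold(n, alpha=0.05):
    """Bonferroni: threshold every P-value at alpha / n."""
    return alpha / n
```

Note that the BH cutoff adapts to the data: the more small P-values there are, the more generous the threshold, which is why the FDR thresholds in the table above depend on the proportion of true positives.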
[Figure: a second example thresholded at P < 0.05 uncorrected (T > 1.64, 5% of volume is false +), FDR < 0.05 (T > 2.66, 5% of discoveries are false +), and P < 0.05 corrected (T > 4.90, 5% probability of any false +).]