Searching for structure in random field data

Keith J. Worsley(1,2), Thomas W. Yee(3), Russell B. Millar(3)
(1) Department of Mathematics and Statistics, McGill University, (2) McConnell Brain Imaging Centre, Montreal Neurological Institute, Montreal, Canada, and (3) Department of Statistics, University of Auckland, New Zealand
www.math.mcgill.ca/keith

What is Data Mining?

The June 26, 2000, issue of TIME predicted that one of the 10 hottest jobs of the 21st century will be Data Mining: "… research gurus will be on hand to extract useful tidbits from mountains of data, pinpointing behaviour patterns for marketers and epidemiologists alike."

Some definitions:
• Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage (SAS 1998 Annual Report, p. 51).
• Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad).
• Data mining is the process of discovering advantageous patterns in data (John).
• Data mining is the computer-automated exploratory data analysis of (usually) large, complex data sets (Friedman, 1998).
• Data mining is the search for valuable information in large volumes of data (Weiss and Indurkhya, 1998).
• In contrast, Statistics is the science of collecting, organizing and presenting data.

Why is it called "Data Mining"?
• Plentiful data can be mined for nuggets of gold (i.e. truth/insight/knowledge) by sifting through vast amounts of raw data.
• Some statisticians have criticized it as "data dredging" or a "fishing expedition" in search of publishable P-values, or "torturing the data until it confesses".
• Many DM methods are heuristic, complex and computer-intensive, so their statistical properties are usually not tractable.
• The focus of DM is often prediction rather than statistical inference.
• I understand mining to be a very carefully planned search for valuables hidden out of sight, not a haphazard ramble. Mining is thus rewarding, but, of course, a dangerous activity. (D.R. Cox, in the discussion of Chatfield, 1995)

Striking fool's gold
• "The Bible Code", a best-selling book by Michael Drosnin, claims to find hidden messages in the Bible about dinosaurs, Bill Clinton, the Rabin assassination etc. from searches of arrays of letters …
• In 1992, ProCyte Corp. was dismayed when a newly developed drug, lamin, failed to promote general healing of diabetic ulcer wounds. So the company searched through subsets of the data and found that lamin appeared to work on certain foot wounds. But that was a statistical fluke, as it turned out after an expensive clinical trial. Not allowed drug status, lamin is now sold as a wound dressing …

Confirming vs. Discovering

There are two types of DM:
1. Hypothesis testing (aka the top-down approach)
2. Knowledge Discovery in Databases (KDD) (aka the bottom-up approach)
• Directed KDD: tries to explain the value of some particular variable in terms of the other variables.
• Undirected KDD: identifies patterns in the data.
Undirected KDD recognizes relationships in the data; directed KDD explains those relationships once they have been found.

Mining the miners

DM so far has been largely a commercial enterprise. As in most gold rushes of the past, the goal is to "mine the miners": the largest profits are made by selling tools to the miners, rather than by doing the actual mining.
• Hardware manufacturers emphasize the high computational requirements of DM.
• Software developers emphasize competitive edge: "Your competitor is doing it, so you had better keep up."

Some commercial software
• SAS "Enterprise Miner"
• SPSS "Clementine", "Neural Connection" and "AnswerTree"
• IBM "Intelligent Miner"
• SGI "MineSet"
• NeoVista Software "ASIC"
• Mathsoft "S-PLUS" (for small data sets)

Some methods
• Hypothesis testing: regression, analysis of variance, time series analysis.
• Directed KDD: classification, discrimination, structural equation modeling, supervised neural networks.
• Undirected KDD: cluster analysis, tree methods (AID, CHAID, CART), principal components analysis (PCA), independent components analysis (ICA), unsupervised neural networks.

Allied fields
• Exploratory Data Analysis (EDA): Tukey defined statistics in terms of problems rather than tools.
• Informatics "is research on, development of, and use of technological, sociological, and organizational tools and applications for the dynamic acquisition, indexing, dissemination, storage, querying, retrieval, visualization, integration, analysis, synthesis, sharing (which includes electronic means of collaboration), and publication of data such that economic and other benefits may be derived from the information by users of all sections of society."
• Pattern recognition: given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples; e.g. identify plants or tumors, or decide to buy or sell stocks.
• Machine learning "is the study of computer algorithms that improve automatically through experience. Applications range from data mining programs that discover rules in large data sets, to information filtering systems that automatically learn users' interests." (Mitchell, 1997)
• Meta-analysis is the statistical analysis of a large collection of analysis results from individual studies, for the purpose of integrating the findings.
Brain mapping data
• We have huge databases of brain images (MRI, fMRI, PET, EEG, MEG, …) together with patient information (age, sex, psychological tests, disease, genotype, …).
• The novelty is that the image variables are 3D images rather than single numbers (such as blood pressure or cholesterol level).
• These images can themselves be mined for interesting information, e.g. peaks or clusters of activated regions.

Some data mining tools already used in brain mapping
• Regression, analysis of variance, time series
• Cluster analysis (e.g. clustering of fMRI time courses)
• PCA and ICA of the voxels × scans matrix
• Structural equation modeling to analyze connectivity
• Pattern recognition to segment gray/white/CSF
• Meta-analysis to combine locations of activation from different studies

Tree methods: Automatic Interaction Detection (AID)

Morgan, J.N. and Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415-434.
Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119-127.
Worsley, K.J. (1978). Significance testing in Automatic Interaction Detection (AID). PhD Thesis, University of Auckland.

How AID works:
• Split observations into two groups according to the values of a predictor.
• Two types of predictors:
  – Monotonic: split by thresholding: {predictor ≤ x} | {predictor > x}
  – Free: split into any two subsets, e.g.
if the predictor takes values {x1, …, x7}: {x1, x5, x6} | {x2, x3, x4, x7}
• Choose the split that maximizes a test statistic for the difference in the dependent or target variable.
• Repeat on the two subgroups until some stopping criterion is reached (the split is not 'significant', or the subgroup size is too small).

SPSS example: credit risk data

[Tree diagram: the dependent or target variable and the predictors, each marked M = monotonic (split by thresholding) or F = free (split into any two subsets).]

Brain mapping example: cortical thickness

The dependent or target variable is sex (free); the predictors are the cortical thickness measurements (monotonic).

[Table: cortical thickness (mm) at nodes 1, …, 40962 for subjects 1, …, 321, together with each subject's sex (m/f).]

Misclassification matrix: cortical thickness

                    Actual male   Actual female
Predicted male          145             18
Predicted female         18            140

fMRI data: 120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, …

[Figure: the first scan of the fMRI data; the time course of a voxel with a highly significant effect (T = 6.59) and of a voxel with no significant effect (T = −0.74), against the hot/rest/warm stimulus and the scanner drift; the T statistic image for the hot − warm effect, where T = (hot − warm effect) / s.d. ~ t_110 if there is no effect.]

Brain mapping example: fMRI

The dependent or target variable is the stimulus, hot or warm; the predictors are the values at voxels 1, …, 30786 (monotonic).

[Table: fMRI values at voxels 1, …, 30786 for frames 1, …, 117, together with each frame's stimulus (hot/warm).]

Misclassification matrix: fMRI

                    Actual hot   Actual warm
Predicted hot           51             1
Predicted warm           7            58

Splitting the SPM itself:

The dependent or target variable is the T statistic; the predictors are each voxel's x, y, z coordinates, whose type (monotonic or free?) is not yet decided.

[Table: for voxels 1, …, 30786, the x, y, z coordinates (mm) and the T statistic.]

How do we split on a spatial predictor?
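The AID splitting step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it scans every cut point of a monotonic predictor and keeps the split that maximizes the pooled two-sample t statistic for the difference in a continuous target; the toy data at the bottom are invented.

```python
# Sketch of one AID split on a monotonic predictor (illustrative only):
# try every threshold x, split into {predictor <= x} | {predictor > x},
# and keep the split that maximizes |t| for the target difference.
import math

def t_statistic(a, b):
    """Pooled two-sample t statistic for the difference in means."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((y - ma) ** 2 for y in a)
    ssb = sum((y - mb) ** 2 for y in b)
    sp2 = (ssa + ssb) / (na + nb - 2)      # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

def best_monotonic_split(predictor, target):
    """Return (threshold, t) of the best split {predictor <= threshold}."""
    pairs = sorted(zip(predictor, target))
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                        # cannot split between equal values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        t = t_statistic(left, right)
        if abs(t) > abs(best[1]):
            best = (pairs[k - 1][0], t)
    return best

# Toy data (invented): the target jumps when the predictor exceeds 3
predictor = [1, 2, 3, 4, 5, 6]
target = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]
threshold, t = best_monotonic_split(predictor, target)
```

On this toy data the best split is at predictor ≤ 3, exactly where the target jumps. In AID proper the recursion then repeats on each subgroup, and (as noted above) stops when the split is not significant or the subgroup is too small.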
Splits can be regarded as models with different means for the two groups:

[Figure: the SPM and its model under a monotonic predictor and under a free predictor; the model for the smoothed and the unsmoothed SPM. Smoothing the SPM with a filter that matches the model turns the spatial predictor into a 'free' predictor.]

So …
• Treating spatial location as a free predictor (for the smoothed SPM) is equivalent to simply thresholding the smoothed SPM.
• We can choose the threshold to control the false splitting rate to P < 0.05 using Bonferroni corrections or random field theory.
• If the model width is unknown, we can make the filter width another parameter of the model, which leads to scale space.

Scale space: smooth X(t) with a range of filter widths s; this continuous wavelet transform adds an extra dimension to the random field: X(t, s).

[Figure: scale space, S = FWHM (mm, on a log scale) against t (mm), with no signal and with one 15mm signal. The 15mm signal is best detected with a ~15mm smoothing filter.]

Matched Filter Theorem (= Gauss-Markov Theorem): "to best detect a signal + white noise, the filter should match the signal".

[Figure: scale space for 10mm and 23mm signals, and for two 10mm signals 20mm apart.]

But if the signals are too close together, they are detected as a single signal half way between them.

Scale space can even separate two signals at the same location!

[Figure: scale space for 8mm and 150mm signals at the same location, with sections at FWHM = 6.8mm, 9mm, 11mm, 15mm, 20mm, 26mm and 34mm.]

Functional connectivity
• Measured by the correlation between the residuals at every pair of voxels (6D data!)
[Figure: scatter plots of the residuals at voxel 1 against voxel 2, showing activation only versus correlation only.]

• Local maxima are larger than all 12 neighbours.
• The P-value can be calculated using random field theory.
• Good at detecting focal connectivity, but
• PCA of the residuals × voxels matrix is better at detecting large regions of co-correlated voxels.

[Figure: |correlations| > 0.7, P < 10^-10 (corrected); first principal component > threshold.]

False Discovery Rate (FDR)

Benjamini and Hochberg (1995), Journal of the Royal Statistical Society
Benjamini and Yekutieli (2001), Annals of Statistics
Genovese et al. (2001), NeuroImage

• FDR controls the expected proportion of false positives amongst the discoveries, whereas
• Bonferroni / random field theory controls the probability of any false positives, and
• no correction controls the proportion of false positives in the volume.

[Figure: signal + Gaussian white noise, thresholded three ways: P < 0.05 (uncorrected), T > 1.64, 5% of the volume is false +; FDR < 0.05, T > 2.82, 5% of discoveries are false +; P < 0.05 (corrected), T > 4.22, 5% probability of any false +.]

Comparison of thresholds
• FDR depends on the ordered P-values P(1) < P(2) < … < P(n). To control the FDR at α = 0.05, find K = max{i : P(i) < (i/n)α} and threshold the P-values at P(K):

Proportion of true +   1      0.1    0.01   0.001  0.0001
Threshold T            1.64   2.56   3.28   3.88   4.41

• Bonferroni thresholds the P-values at α/n:

Number of voxels       1      10     100    1000   10000
Threshold T            1.64   2.58   3.29   3.89   4.42

• Random field theory: resels = volume / FWHM³:

Number of resels       0      1      10     100    1000
Threshold T            1.64   2.82   3.46   4.09   4.65

[Figure: P < 0.05 (uncorrected), T > 1.64, 5% of the volume is false +; FDR < 0.05, T > 2.66, 5% of discoveries are false +; P < 0.05 (corrected), T > 4.90, 5% probability of any false +.]
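The FDR and Bonferroni thresholding rules above can be sketched directly from their definitions. This is a minimal illustrative sketch (using the usual ≤ in the Benjamini-Hochberg step, and assuming independent P-values, as in Benjamini and Hochberg, 1995); the toy P-values are invented:

```python
# Sketch of the FDR (Benjamini-Hochberg) and Bonferroni P-value
# thresholds (illustrative only).

def fdr_threshold(pvalues, alpha=0.05):
    """Find K = max{i : P(i) <= (i/n)*alpha} over the ordered
    P-values and return P(K); None if there are no discoveries."""
    p = sorted(pvalues)
    n = len(p)
    k = 0
    for i in range(1, n + 1):          # i indexes the ordered P-values
        if p[i - 1] <= (i / n) * alpha:
            k = i
    return p[k - 1] if k else None     # threshold the P-values at P(K)

def bonferroni_threshold(n, alpha=0.05):
    """Bonferroni thresholds the P-values at alpha/n."""
    return alpha / n

# Toy P-values (invented): three small 'signal' P-values among noise
pvals = [0.0001, 0.004, 0.006, 0.12, 0.34, 0.45, 0.56, 0.67, 0.81, 0.95]
t_fdr = fdr_threshold(pvals)           # P-values <= t_fdr are discoveries
t_bon = bonferroni_threshold(len(pvals))
```

On this toy data FDR declares three discoveries (threshold 0.006) while Bonferroni (threshold 0.005) declares only two, illustrating why the FDR threshold T in the tables above sits between the uncorrected and the Bonferroni/random field thresholds.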