From Spatio-Temporal Data to a Weighted and Lagged Network Between Functional Domains: Applications in Climate and Neuroscience
Ilias Fountalis
PhD Thesis Defense Examination 2016

Spatio-Temporal Data
Climate, Human Brain, Ecological, Social, Economical

Spatio-Temporal Data Applications

Spatio-Temporal Data Representation
• Embedded in a two- or three-dimensional grid
• Grid cells contain measurements (time series) for the variables of interest
• Grid cells do not correspond to functionally distinct units

Spatio-Temporal Data Functional Components
• Spatio-temporal systems are modular
• Functional components:
  – Spatially contiguous
  – Functionally homogeneous
  – Possibly overlapping
  – Weighted and lagged interactions

Thesis Overview
• Geo Cluster (presented in PhD proposal)
  – Identifies functional components of a spatio-temporal system
  – Spatially contiguous non-overlapping areas
  – Models their interactions as a complete and weighted network
    • Spatio-temporal network analysis for studying climate patterns (Fountalis et al., Clim. Dyn. 2014)
• ENSO in CMIP5 simulations (presented in PhD proposal)
  – Evaluation of cutting-edge climate models
  – Ranking models in terms of their ability to reproduce the climate of the past
  – Investigating model trajectories under future climate warming scenarios
    • ENSO in CMIP5 simulations: network connectivity from the recent past to the twenty-third century (Fountalis et al., Clim. Dyn.
2015)

Thesis Overview
• δ-MAPS (the focus of this talk)
  – Identifies domains: distinct semi-autonomous components of the system
    • Spatially contiguous, possibly overlapping regions
  – Infers their potentially lagged and weighted interactions
• Applied to
  – Climate data
  – Resting-state fMRI
    • δ-MAPS: From spatio-temporal data to a weighted and lagged network between functional domains (Fountalis et al., submitted to KDD’16)

Outline
• Related Work
  – Methods for Analyzing Spatio-Temporal Data
  – Synthetic Data
  – Method Limitations
• δ-MAPS
  – Method Overview
  – Domain Identification
  – Network Inference
  – Application on Synthetic Data
• Applications in Climate Science
• Applications in Neuroscience
• Conclusions & Future Work

Methods for Analyzing Spatio-Temporal Data
• Objective
  – Infer the functional components of a spatio-temporal system and study their interactions
• Methods
  – Multivariate statistical methods
    • Principal Component Analysis (PCA) / Empirical Orthogonal Function (EOF) analysis
    • Independent Component Analysis (ICA)
  – Clustering
    • Spatial contiguity constraints
• Drawbacks
  – Illustration of limitations in a data set for which we know the ground truth

Synthetic Data Generation Setup
• Synthetic component
  – Modeled as a circle of radius rp
  – Core of radius rc
  – Time series at core modulated by a factor
• Decay function:
• Connecting components i,j:
• Final steps:
  – Given source signals yi(t), yj(t)
  – xi(t) = (1-α)yi(t) ± αyj(t+τ)
  – α: controls the strength of the connection
  – Additive superimposition of component time series
  – Addition of white Gaussian noise N(0,1)

Dimensionality Reduction: PCA/EOF
• PCA/EOF analysis:
  – Identifies orthogonal components of high energy content in terms of the variance of the signal
  – Variance of the field is dominated by a few “modes of variability”, masking weaker regions of interest
  – Orthogonality constraint is difficult to interpret physically
  – See also: A cautionary note on the interpretation of EOFs (Dommenget and Latif, J. Clim. 2002)

Dimensionality Reduction: ICA
• ICA
  – Separates a mixed signal into independent non-Gaussian subcomponents
  – No orthogonality constraints
  – Cannot determine the variance, sign or correct ordering of the independent components
  – Identified components are noisy
  – Difficult to interpret functional structure if we do not know the ground truth
  – See also: Modulation of temporally coherent brain networks estimated using ICA at rest and during cognitive tasks (Calhoun et al., Hum. Brain Mapp.
2008)

Dimensionality Reduction: Clustering
• Clustering
  – Many flavors (spectral, agglomerative, region growing)
  – Typically requires the number of clusters as input
  – Each grid cell belongs to a cluster
  – No spatial contiguity guarantees
  – Normalized cut group clustering of resting-state fMRI data (van den Heuvel et al., PLoS ONE, 2008)
• K-Means example
  – Clusters correspond to noise
  – Separate components are joined in the same cluster
  – Cannot separate local diffusion from remote interactions

Dimensionality Reduction: Spatial Clustering
• Geo Cluster:
  – Spatially contiguous clustering
  – Automatically identifies the number of underlying components
  – Clusters are used as the nodes of a functional network
  – Applied extensively to investigate the Earth’s climate and evaluate climate models
    • Spatio-temporal network analysis for studying climate patterns (Fountalis et al., Clim. Dyn. 2014)
    • ENSO in CMIP5 simulations: network connectivity from the recent past to the twenty-third century (Fountalis et al., Clim. Dyn.
2015)
  – Does not allow overlap between identified clusters

δ-MAPS: Method Overview

δ-MAPS: Notations
• Spatio-temporal field X(t)
• Embedded in a grid
  – Modeled as a planar graph G(V,E)
• Similarity between grid cells i,j
  – Pearson correlation:

δ-MAPS: Domain Constraints
• Domain: spatially contiguous set of grid cells that participate in the same function
• Homogeneity of a domain
• Homogeneity constraint
  – Homogeneity greater than a threshold δ

δ-MAPS: Domain Constraints
• Domains have an epicenter of action
• K-neighborhood ΓK(i) of grid cell i
  – K nearest grid cells to i, including i
• Local homogeneity of grid cell i:
• Domain core: cell at which local homogeneity is
  – A local maximum
  – Larger than δ

δ-MAPS: Problem Statement
• A domain is spatially contiguous (IG(A) = 1) if it forms a connected component in G
• Given cell c: core of domain A
• A homogeneity threshold δ
• Domain must satisfy: (1)
• Exact boundaries of a domain are unknown
• Domain identification problem:
  – Given field X(t) on spatial grid G, core cell c and threshold δ
  – Identify domain A as the maximum-sized set of cells that satisfies (1)
• Problem is NP-Complete
  – Reduction
of densest connected k-subgraph to domain identification problem
    • The complexity of clustering in planar graphs (Keil and Brecht, J. Combin. Math. Combin. Comput., 1991)
• Greedy algorithm for domain identification
  – Identify seeds (cores)
  – Iteratively expand and merge seeds to identify domains

δ-MAPS: Seed Selection
• Seed: grid cell including its local neighborhood
  – Must satisfy:
    • Local homogeneity is a local maximum
    • Local homogeneity is larger than δ
• A single domain can have more than one seed
  – Noise
  – Overlapping regions

δ-MAPS: Domain Identification
• Input: set of seeds S
• Iterative process
  – (1) Merging and (2) expansion of domains

δ-MAPS: Domain Identification
• Merging:
  – Domains can be merged if
    • They are spatially adjacent
    • Their union satisfies the homogeneity constraint
  – Merge first the two domains whose union has maximum homogeneity
  – Terminate when no merging is possible

δ-MAPS: Domain Identification
• Expansion:
  – Domains sorted by homogeneity
  – Expand by considering all adjacent grid cells
  – Expand by adding the grid cell that gives maximum homogeneity
  – After each
expansion check if merging is possible

δ-MAPS: Domain Identification
• Termination:
  – No further merging or expansion is possible

Network Inference: Prior Work
• Functional components might be correlated at a non-zero lag
  – Compute Pearson correlation for a range of lags -τmax ≤ τ ≤ τmax
• Testing for significant correlations [1]
  – T-test given significance level α
    • Ignores autocorrelation structure
  – Multiple testing problem
    • Need to control for the number of false positives
• Selecting the appropriate lag [2]
  – Consecutive lags produce almost maximal correlations
  – Point estimates are not robust
• [1] A new dynamical mechanism for major climate shifts (Tsonis et al., GRL, 2007)
• [2] Network inference with confidence from multivariate time series (Kramer et al., Phys. Rev.
E, 2009)

δ-MAPS: Domain Signal
• Domain-level signal XA(t)
  – Application specific

δ-MAPS: Test for Statistical Significance
• Domains might be correlated at a non-zero lag τ
• For each pair of domains A, B:
  – Compute Pearson correlation rA,B(τ) for a range of lags -τmax ≤ τ ≤ τmax
• Statistical significance
  – Uncorrelated signals can produce spurious correlations if they have a strong autocorrelation structure
• Bartlett’s formula
  – Estimates the variance of rA,B(τ)
  – Null hypothesis: XA(t), XB(t) uncorrelated
  – E[rA,B(τ)] = 0 and rA,B(τ) divided by its estimated standard deviation ~ N(0,1)

δ-MAPS: Multiple Testing Problem
• Multiple testing
  – For N domains and a max lag τmax: N(N-1)(2τmax+1)/2 tests
  – N = 100, τmax = 5, α = 1%: roughly 550 false positives
• False Discovery Rate (FDR)
  – Select false discovery rate q
    • q: controls the expected fraction of false positives
  – Sort the M p-values in ascending order: pm-1 < pm < pm+1
  – Keep the first m < M p-values, where m is the largest index with
    • pm < qm/M
    • pm: m’th lowest p-value

δ-MAPS: Lag Inference
• Domains are connected if there exists at least one significant correlation
• What is the appropriate lag?
• Proposed approach
  – Associate a range of lags
  – That produce significant correlations
  – Located within one standard deviation from the max.
absolute correlation

δ-MAPS: Edge Direction
• Edge direction
  – Lag range positive: A -> B (A precedes B)
  – Lag range negative: A <- B (A follows B)
  – Lag range includes zero: bi-directed edge

δ-MAPS: Edge Weight
• Edge weight:
  – Covariance between the domain signals
  – Maximum correlation in the absolute sense
  – Edge weight captures the magnitude of the signal of the two domains
  – Weights can be positive or negative
• Final network
  – Directed, weighted graph

δ-MAPS: Domain Strength
• Domain strength:
  – Sum of the absolute weights of the edges of a domain

Application on Synthetic Data
• Revisiting the synthetic data set example

Application on Synthetic Data
• δ-MAPS parameters: K = 4, δ = 0.55, q = 10%, τmax = 20
• More than one seed in the core, seeds in the overlapping regions
  – Eventually merged into a single domain
• Identified domains: subset of ground truth
• Correctly identifies overlaps
• Network
  – Correctly identifies all three edges and their polarity
  – Lag ranges always include the correct value
  – Hierarchy of edge weights is preserved

Applications in Climate: Climate Modes of Variability
• Recurring patterns with identifiable characteristics and specific regional effects
• Represent the state of the climate
• An example:
  – El Niño Southern Oscillation (ENSO)

Applications in Climate: Teleconnections
• Teleconnections
  – Climate anomalies related
to each other at large distances
• An example:
  – Variations in temperature in the tropical Pacific
  – Cause
    • Rainfall in remote places of the world
    • Temperature increase in others

Applications in Climate: Data Description
• Data:
  – Monthly averages of sea surface temperature (SST) from HadISST
  – Period: 1956-2005 (50 years, 600 months)
• Preprocessing:
  – Removal of seasonal cycle
  – Removal of linear trends (Theil-Sen estimator)
  – Transform to zero mean
• δ-MAPS parameters
  – Neighborhood size K = 4 grid cells
  – δ = 0.37
  – False discovery rate q = 3%
    • 30 edges in network (no more than 1 false positive)
  – Max lag τmax = 12 months

Applications in Climate: The Climate Network
• From 6000 grid cells to 18 domains
• 35% of the grid cells do not belong to a domain
• Largest domain: ENSO

Applications in Climate: The Climate Network
• Strongest domain: ENSO
  – Hierarchy in terms of strength in the three ocean basins
• ENSO teleconnections
  – Indian Ocean, horse-shoe pattern, North Atlantic
• ENSO predecessors
  – Domain C
    • Improved El Niño forecasting by cooperativity detection (Ludescher et al., PNAS 2013)
  – Domain Q (South Atlantic) precedes all other domains in the climate network
    • See also: Are Atlantic Niños enhancing Pacific ENSO events in recent decades (Rodriguez-Fonseca, GRL 2009)

Applications in Climate: Structural Balance
• Network decomposed into 5 weakly connected components
• Network is structurally balanced
  – Partitioned into two groups of domains

Applications in Neuroscience: Introduction
• Functional Magnetic Resonance Imaging (fMRI)
  – Blood Oxygen Level Dependent (BOLD) signal
  – Measures changes in the level of oxygen concentration in
the brain
• The brain as a functional network
  – Spatially distributed regions
  – Each having their own task and function
  – Continuously interacting with each other
  – How is functional connectivity altered due to neurodegenerative diseases?
• Resting-state fMRI
  – Subject scanned while at rest
  – During rest the functional network is not idle
  – Resting state networks: strongly functionally linked subnetworks between different brain regions

Applications in Neuroscience: Data Description
• Data
  – HCP, cortical resting-state fMRI
  – Single subject, 2 scans, ≈ 15 minutes per scan
  – Time resolution 0.72 seconds
• Data preprocessing
  – HCP “fix-extended” minimal preprocessing pipeline
  – Correction for B0 distortions
  – Head motion correction
  – Registration to structural image
  – Masking of non-brain voxels
  – Removal of physiological artifacts…
    • Resting-state fMRI in the Human Connectome Project (Smith et al., NeuroImage, 2013)
  – Bandpass filtering 0.01-0.08 Hz

Applications in Neuroscience: Network Properties
• δ-MAPS parameters:
  – Neighborhood size K = 6
  – δ = 0.37
  – False discovery rate q = 10^-4
    • (expected 1 out of 10k edges is a false positive)
  – τmax = 3 (2.2 seconds)
• Majority of domains are small (95% < 250 voxels)
• Polarity of edges changes across scans (time varying?)
• Degree and size of a domain are positively correlated
• Networks are assortative

Applications in Neuroscience: Resting State Networks
• Resting state networks: regions highly interconnected to each other during rest
• Community detection (OSLOM*):
  – Network communities correspond to well-known resting state networks
  – Consistent between the two scans
  – * Finding statistically significant communities in networks (Lancichinetti et al., PLoS One, 2011)

Applications in Neuroscience: K-Core Decomposition
• A process that iteratively removes nodes based on their degree
  – After the extraction of the k’th core all nodes have degree > k
• After the removal of the 14th core (16th for scan 2) the density of the network increases by a factor of 2
• Reveals a few domains, densely interconnected to each other
  – Backbone of the functional brain network
    • Rich-club organization of the human connectome (van den Heuvel and Sporns, Journal of Neuroscience, 2011)

Conclusions
• δ-MAPS
  – Bridges overlapping community detection and spatial clustering
  – Identifies the functional components of a spatio-temporal system
  – Used to study their possibly lagged and weighted interactions
  – Validated against synthetic data
  – Outperforms traditional dimensionality reduction and network-based methods
• Applied to climate data
  – Successfully uncovers known modes of variability and teleconnections
• Applications in neuroscience
  – Successfully uncovers well-known resting state networks in a single-subject analysis
  – Identifies the backbone of the brain network

Future Work
• Climate networks over time
  – Investigate trajectories of the functional components as reflected by their size/strength
• Climate models and controlled perturbation experiments
  – How do perturbations propagate at the scale of the climate network?
• Effective connectivity
  – Application of probabilistic graphical models to remove non-causal edges
• Dynamic networks using contextual time series detection
  – Automatic identification of changes between two time series
• Structural-functional networks
  – Combine functional connectivity with structural connectivity
• Extension to other spatio-temporal data
  – Species migration patterns, seismic data

Publications – Posters – Talks
• Book chapters
  – A. Bracco, R.K. Archibald, C. Dovrolis, I. Fountalis, H. Luo and J.D. Neelin. The parameter optimization problem in state-of-the-art climate models and network analysis for systematic data mining in model intercomparison projects. CISM Courses and Lectures: The Fluid Dynamics of Climate, Springer Ed. 201
• Journal papers
  – I. Fountalis, A. Bracco, C. Dovrolis. Spatio-temporal network analysis for studying climate patterns. Climate Dynamics 42 3-4 (2014) 879-899
  – I. Fountalis, A. Bracco, C. Dovrolis. ENSO in CMIP5 simulations: Network connectivity from the recent past to the twenty-third century. Climate Dynamics (2014)
• Under review
  – I. Fountalis, A. Bracco, B. Dilkina, C. Dovrolis, S. Keilholz. δ-MAPS: From spatio-temporal data to a weighted and lagged network between functional domains (Submitted KDD 2016)
  – C. Dovrolis, I. Fountalis, B. Dilkina, S. Keilholz. From fMRI data to a weighted network between functional domains (Submitted PRNI 2016)
• Poster abstracts
  – I. Fountalis, A. Bracco, C. Dovrolis. A network based analysis of CMIP5 historical experiments. Climate Informatics (2013). Poster abstract.
  – I. Fountalis, C. Dovrolis, A. Bracco. A network based methodology for the study of climate teleconnections. Second workshop on understanding climate change from Data (2012). Poster abstract.
• Invited talks
  – Evaluation of climate models using network analysis. Complenets 2012.
  – Validation of the CMIP5 models using network analysis. Conference on artificial intelligence applications to environmental sciences (2012)

Thank you!

Backup Slides

Network Based Methods
• Related work
  – Step 1: grid cells -> nodes
    • Do not correspond to functionally distinct units
  – Step 2: Compute pairwise correlations between all pairs of nodes
  – Step 3: Threshold to obtain network
    • Fixed threshold approach
      – “Community structure and dynamics in climate networks, Tsonis et al., 2011”
    • Fixed density approach
      – “The backbone of the climate network, Donges et al., 2009”
    • Not clear which edges to prune
  – Step 4: Final network
    • Binary (ignores magnitude and sign of correlations)
      – “Simple models of human brain functional networks, Vértes et al., 2011”
      – “Improved El Niño forecasting by cooperativity detection, Ludescher et al., 2013”
    • Weighted (typically only positive correlations are considered)

Heuristic to Infer δ
• δ determines the minimum degree of homogeneity of a domain
• δ heuristic
  – Start with a random sample of pairs of grid cells
  – Calculate their zero-lag correlation
  – Infer the significant correlations for a given significance level α
  – δ depends on:
    • Significance level α
(input to domain identification)
    • Underlying correlation distribution
    • Underlying autocorrelation structure
• Intuition
  – A domain is a spatially contiguous set of grid cells
  – Mean pairwise correlation should be higher than the mean correlation of randomly picked grid cells

Dimensionality Reduction: Community Detection
• OSLOM
  – Automatically identifies hierarchical structure of communities
  – Accounts for overlaps between communities
  – Input: pruned cell-level network
  – Cannot distinguish between positive and negative correlations
  – Cannot infer actual connectivity
  – Communities are not guaranteed to be spatially contiguous
  – Finding statistically significant communities in networks (Lancichinetti et al., PLoS One, 2011)

Applications in Climate: Lag-Consistent Triangles
• Lag-consistent triangle: nodes can be placed in a consistent temporal order
• All triangles are consistent with one exception (C,D,G)
  – However, an alternate path exists

Applications in Climate: EOFs, Communities & Spatial Clustering
• EOFs:
  – ENSO dominates, teleconnections to South Atlantic are lost
• Communities:
  – Grouping spatially disjoint modes of variability together
• Spatial clustering:
  – More areas. No overlaps.
  – Spatial extent of an area constrained because no overlaps are allowed

Applications in Neuroscience: Data Representation
• Volumetric representation
  – Grid cell (voxel): isotropic cube
  – Confounds arise from the variability of the convolutions of the human brain
  – Neighboring voxels might belong to different sulci/gyri
  – Functional organization of the cortex is largely two-dimensional

Applications in Neuroscience: Data Representation
• Surface-based registration
  – Individual volumetric data are projected to a surface mesh
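
Backup: Multiple Testing Sketch
• A minimal Python sketch of the test count and the FDR selection rule from the δ-MAPS multiple testing slide; an illustration under stated assumptions, not the thesis implementation
• Assumptions: the names num_tests and fdr_select are hypothetical, NumPy is assumed to be available, the test count N(N-1)(2τmax+1)/2 is inferred from the slide's own example (N = 100, τmax = 5, α = 1%, roughly 550 false positives), and the strict inequality pm < qm/M follows the slide (the usual Benjamini-Hochberg statement uses ≤)

```python
import numpy as np


def num_tests(n_domains: int, max_lag: int) -> int:
    """Assumed test count: N(N-1)/2 domain pairs times (2*tau_max + 1) lags."""
    return n_domains * (n_domains - 1) // 2 * (2 * max_lag + 1)


def fdr_select(p_values, q):
    """Return the (sorted) indices of the tests kept at false discovery rate q.

    Hypothetical helper following the slide's rule: sort the M p-values in
    ascending order and keep the first m of them, where m is the largest
    rank with p_(m) < q * m / M.
    """
    p = np.asarray(p_values, dtype=float)
    M = p.size
    order = np.argsort(p)                      # ranks p-values in ascending order
    thresholds = q * np.arange(1, M + 1) / M   # q*m/M for m = 1..M
    passed = np.nonzero(p[order] < thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)         # nothing is significant
    m = passed[-1] + 1                         # largest rank that passes
    return np.sort(order[:m])                  # keep every test ranked 1..m


if __name__ == "__main__":
    M = num_tests(100, 5)
    print(M, "tests,", M * 0.01, "expected false positives at alpha = 1%")
    print(fdr_select([0.01, 0.04, 0.03, 0.5], q=0.1))
```

• With N = 100 and τmax = 5 the assumed count is 54450 tests, so α = 1% gives about 545 expected false positives, in line with the slide's "roughly 550"
• The selection is step-up: a p-value that fails its own threshold is still kept when a higher-ranked p-value passes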