Extreme Scale Analytics on SpatioTemporal Datasets

Joel Saltz
Center for Comprehensive Informatics & Biomedical Informatics Department
Emory University

Morphometric Image Analysis Pipeline
• Preprocessing: normalization, tiling, etc.
• Segmentation: identify nuclei as objects
• Feature Extraction: compute morphometric features
• Classification: unsupervised learning (k-means) after patient-level aggregation and analysis

Satellite Data Analysis for Monitoring and Change Analysis

Subsurface Reservoir Management
• Numerical models of porous media
  – Fluids flow from one region of reservoir to another region
  – Rock and sediment properties change over time
• Simulate multiple realizations of multiple models and management strategies
• Evaluate geologic uncertainty and management strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios

Core Operation Categories and Patterns
• Data Cleaning and Low Level Transformations
  – Operations: Transformations to reduce effects of sensor/measurement artifacts. Transform sensor acquired measurements to domain specific variables.
  – Data Access Patterns and Computational Complexity: Mainly local and regular data access patterns. Moderate computational complexity.
• Data Subsetting, Filtering, Subsampling
  – Operations: Select portions of a dataset corresponding to regions in atlas and/or time intervals. Select portions of a dataset based on value ranges (e.g., regions with temperature larger than X degrees). Subsample data to reduce resolution and data size.
  – Data Access Patterns and Computational Complexity: Local data access patterns as well as indexed access. Low to moderate, mainly data intensive computations.
• Spatio-temporal Mapping and Registration
  – Operations: Map datasets to an atlas. Resolve data redundancy at tile boundaries to form mosaics. Create composite dataset from multiple spatially co-incident datasets. Create derived dataset from spatially co-incident datasets obtained at different times.
  – Data Access Patterns and Computational Complexity: Irregular local and global data access patterns. Moderate to high computational complexity.
• Object Segmentation
  – Operations: Segment “base level” objects such as nuclei, buildings, lakes. Extract features from “base level” objects.
  – Data Access Patterns and Computational Complexity: Irregular, but primarily local, data access patterns. High computational complexity.
• Object Classification
  – Operations: Classify “base level” objects through possibly iterative combination of clustering, machine learning and human input (active learning).
  – Data Access Patterns and Computational Complexity: Irregular and global data access patterns. High computational complexity.
• Spatio-temporal Aggregation
  – Operations: Construct “high level” objects composed of classified “base level” object aggregates, e.g., residential areas vs industrial complexes. Compute timeseries aggregates over a given imaged area.
  – Data Access Patterns and Computational Complexity: Primarily local with a crucial global component for aggregation. Moderate/high computation complexity.
• Change Detection, Comparison, and Quantification
  – Operations: Quantify changes over time in domain specific low level variables, base level objects and high level objects. Construct “change objects” to describe changes in low level domain specific variables, base level and high level objects. Spatial queries for selecting and comparing segmented regions and objects.
  – Data Access Patterns and Computational Complexity: Compute and data-intensive computations. Mixture of local and global data access patterns as well as indexed access.

Challenges
• Spatial-temporal disk-resident, on-the-fly, dynamically updated datasets
• Access and manipulate multiple datasets generated and stored on multiple, distributed systems
• Analysis of raw data can generate millions to trillions of features (e.g., millions of cells and nuclei in high resolution tissue images) to be mined and compared
• Take advantage of hardware platforms for analysis
  – Clusters containing hybrid CPU-GPU nodes
  – Extreme scale machines consisting of hundreds of thousands of CPU cores
  – Systems with deep memory and storage hierarchies
  – Cloud computing platforms

Using Hybrid CPU-GPU Systems

Data Structures: Region Templates
• Describe 2D/3D static and temporal regions.
• Provides a container for points, arrays, regions, and object sets within a spatial and temporal bounding box.
• A region template can represent collections of spatial areas and objects where these entities vary from one another in size and shape; e.g. regions generated by segmenting cells in microscopy images, man-made structures or hurricanes in satellite imagery.
• Primary datasets are defined as point data elements and arrays, and derived datasets as sets of regions and objects.
• Region templates may be related to one another in a defined manner.

Programming Abstractions and Runtime Middleware Services
• Programming abstractions
  – Multi-level dataflow pipelines
  – MapReduce style programs
  – Spatial query capabilities
• I/O and Storage Services
  – Indexing and metadata management for ensembles of datasets
  – I/O support for retrieving data from multiple storage systems and for streaming data
  – Query capabilities
• Memory Management
  – Careful management and staging of large data structures across memory hierarchies. Masking data movement costs with computation.
• Execution Services
  – Distributing and rearranging computations and data to minimize data movement
  – Coordinated scheduling and mapping of analysis operations to heterogeneous and hybrid (CPU cores and GPUs) systems to increase overall application throughput
  – Quality of service/data requirements
  – Function variants
• Provenance Tracking, Fault-detection and tolerance

End
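Supplementary sketch: the region template idea from the "Data Structures: Region Templates" slide (a container for points, arrays, and object sets inside a spatial and temporal bounding box, with primary data kept apart from derived objects) can be illustrated in a few lines of Python. This is only a minimal illustration under stated assumptions, not the middleware's actual interface: the names BoundingBox, SpatialObject, RegionTemplate, add_object, and query are all hypothetical, the example is restricted to 2D plus time, and a real implementation would use spatial indexes rather than a linear scan.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical 2D spatial box plus a time interval; all bounds inclusive."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    t_min: float
    t_max: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap only if they overlap on every axis, including time.
        return (self.x_min <= other.x_max and other.x_min <= self.x_max
                and self.y_min <= other.y_max and other.y_min <= self.y_max
                and self.t_min <= other.t_max and other.t_min <= self.t_max)


@dataclass
class SpatialObject:
    """Derived data: e.g. one segmented nucleus with its computed features."""
    label: str
    box: BoundingBox
    features: Dict[str, float] = field(default_factory=dict)


@dataclass
class RegionTemplate:
    """Container for primary data (points, arrays) and derived objects."""
    box: BoundingBox
    points: List[Tuple[float, float, float]] = field(default_factory=list)  # (x, y, t)
    arrays: Dict[str, List[List[float]]] = field(default_factory=dict)      # named 2D arrays
    objects: List[SpatialObject] = field(default_factory=list)

    def add_object(self, obj: SpatialObject) -> None:
        # Reject objects that lie entirely outside this template's bounding box.
        if not self.box.intersects(obj.box):
            raise ValueError("object lies outside the region template")
        self.objects.append(obj)

    def query(self, window: BoundingBox) -> List[SpatialObject]:
        # Linear-scan spatial query: objects whose boxes overlap the window.
        return [o for o in self.objects if o.box.intersects(window)]


# Usage: one image tile at a single time point holding two segmented nuclei.
tile = RegionTemplate(box=BoundingBox(0, 0, 100, 100, 0, 0))
tile.add_object(SpatialObject("nucleus", BoundingBox(10, 10, 14, 15, 0, 0), {"area": 20.0}))
tile.add_object(SpatialObject("nucleus", BoundingBox(80, 80, 90, 88, 0, 0), {"area": 80.0}))
hits = tile.query(BoundingBox(0, 0, 50, 50, 0, 0))
print([o.features["area"] for o in hits])
```

Objects of different sizes and shapes share one container here, and the query window plays the role of the spatial query capability listed under the programming abstractions; both are simplifications chosen for brevity.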