Position Statement

Extreme Scale Analytics on SpatioTemporal Datasets
Joel Saltz
Center for Comprehensive Informatics &
Biomedical Informatics Department
Emory University
Morphometric Image Analysis Pipeline
• Preprocessing: normalization, tiling, etc.
• Segmentation: identify nuclei as objects
• Feature Extraction: compute morphometric features
• Classification: unsupervised learning (k-means) after
patient-level aggregation and analysis
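The last pipeline stage can be sketched in a few lines. This is a minimal illustration only, assuming per-nucleus feature vectors are already available as a NumPy array; the function names `aggregate_by_patient` and `kmeans`, the choice of a per-patient mean as the aggregate, and the plain Lloyd's iteration are assumptions made for the example, not the pipeline's actual implementation.

```python
import numpy as np

def aggregate_by_patient(features, patient_ids):
    """Average per-nucleus feature vectors into one vector per patient."""
    patients = np.unique(patient_ids)
    agg = np.vstack([features[patient_ids == p].mean(axis=0) for p in patients])
    return patients, agg

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(c):
        # Label each row of X with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(centroids), centroids
```

In use, one would call `aggregate_by_patient(features, patient_ids)` on the output of the feature extraction stage and pass the aggregated matrix to `kmeans`.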
Satellite Data Analysis for Monitoring and Change Analysis
Subsurface Reservoir Management
• Numerical models of porous media
– Fluids flow from one region of the reservoir to another
– Rock and sediment properties change over time
• Simulate multiple realizations of multiple models and
management strategies
• Evaluate geologic uncertainty and management
strategies simultaneously
• Enable on-demand exploration and comparison of
multiple scenarios
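The idea of evaluating geologic uncertainty and management strategies together can be sketched as a loop over an ensemble. In this illustration, `simulate` is a hypothetical toy stand-in for a real porous-media simulator, and the parameters (`permeability`, `injection_rate`) and summary statistics are assumptions chosen only to show the structure of the computation.

```python
import statistics

def simulate(permeability, injection_rate, n_steps=10):
    """Hypothetical toy model, NOT a real reservoir simulator.

    Returns cumulative production for one geologic realization
    (permeability) under one management strategy (injection_rate).
    """
    pressure, produced = 1.0, 0.0
    for _ in range(n_steps):
        flow = permeability * pressure
        produced += flow
        # Production lowers pressure; injection partly restores it.
        pressure = max(pressure - 0.1 * flow + 0.05 * injection_rate, 0.0)
    return produced

def evaluate_ensemble(realizations, strategies):
    """Run every strategy against every realization.

    Returns {strategy name: (mean outcome, spread across realizations)}.
    """
    results = {}
    for name, rate in strategies.items():
        outcomes = [simulate(k, rate) for k in realizations]
        results[name] = (statistics.mean(outcomes), statistics.pstdev(outcomes))
    return results
```

The spread across realizations gives a rough measure of how sensitive each strategy is to geologic uncertainty, which is what makes side-by-side comparison of scenarios possible.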
Core Operation Categories and Patterns
• Data Cleaning and Low Level Transformations
– Operations: Transformations to reduce effects of sensor/measurement artifacts. Transform sensor acquired measurements to domain specific variables.
– Data access patterns and computational complexity: Mainly local and regular data access patterns. Moderate computational complexity.
• Data Subsetting, Filtering, Subsampling
– Operations: Select portions of a dataset corresponding to regions in atlas and/or time intervals. Select portions of a dataset based on value ranges (e.g., regions with temperature larger than X degrees). Subsample data to reduce resolution and data size.
– Data access patterns and computational complexity: Local data access patterns as well as indexed access. Low to moderate, mainly data intensive computations.
• Spatio-temporal Mapping and Registration
– Operations: Map datasets to an atlas. Resolve data redundancy at tile boundaries to form mosaics. Create composite dataset from multiple spatially co-incident datasets. Create derived dataset from spatially co-incident datasets obtained at different times.
– Data access patterns and computational complexity: Irregular local and global data access patterns. Moderate to high computational complexity.
• Object Segmentation
– Operations: Segment “base level” objects such as nuclei, buildings, lakes. Extract features from “base level” objects.
– Data access patterns and computational complexity: Irregular, but primarily local, data access patterns. High computational complexity.
• Object Classification
– Operations: Classify “base level” objects through possibly iterative combination of clustering, machine learning and human input (active learning).
– Data access patterns and computational complexity: Irregular and global data access patterns. High computational complexity.
• Spatio-temporal Aggregation
– Operations: Construct “high level” objects composed of classified “base level” object aggregates, e.g., residential areas vs industrial complexes. Compute time-series aggregates over a given imaged area.
– Data access patterns and computational complexity: Primarily local with a crucial global component for aggregation. Moderate/high computation complexity.
• Change Detection, Comparison, and Quantification
– Operations: Quantify changes over time in domain specific low level variables, base level objects and high level objects. Construct “change objects” to describe changes in low level domain specific variables, base level and high level objects. Spatial queries for selecting and comparing segmented regions and objects.
– Data access patterns and computational complexity: Compute and data-intensive computations. Mixture of local and global data access patterns as well as indexed access.
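A few of the operations listed above can be illustrated on a small in-memory array. This is a sketch only, assuming a dataset laid out as a `(time, row, col)` NumPy array; the function names and the block-averaging and absolute-difference choices are assumptions for the example, and real datasets at this scale would be disk-resident and distributed.

```python
import numpy as np

def subset_region(data, t_range, row_range, col_range):
    """Select a time interval and a rectangular region from a (time, row, col) array."""
    (t0, t1), (r0, r1), (c0, c1) = t_range, row_range, col_range
    return data[t0:t1, r0:r1, c0:c1]

def filter_by_value(frame, threshold):
    """Return (row, col) indices of cells whose value exceeds the threshold."""
    return np.argwhere(frame > threshold)

def subsample(frame, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    rows, cols = frame.shape
    # Drop trailing rows/columns that do not fill a whole block.
    rows -= rows % factor
    cols -= cols % factor
    blocks = frame[:rows, :cols].reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

def change_mask(frame_a, frame_b, min_delta):
    """Flag cells whose value changed by more than min_delta between two times."""
    return np.abs(frame_b - frame_a) > min_delta
```

Subsetting and subsampling touch only local data, which matches the access pattern described for that category; the change mask compares spatially coincident cells obtained at different times.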
Challenges
• Spatio-temporal disk-resident, on-the-fly, dynamically
updated datasets
• Access and manipulate multiple datasets generated and
stored on multiple, distributed systems
• Analysis of raw data can generate millions to trillions of
features (e.g., millions of cells and nuclei in high resolution
tissue images) to be mined and compared
• Take advantage of hardware platforms for analysis
– Clusters containing hybrid CPU-GPU nodes
– Extreme scale machines consisting of hundreds of thousands of CPU
cores
– Systems with deep memory and storage hierarchies
– Cloud computing platforms
Using Hybrid CPU-GPU Systems
Data Structures: Region Templates
• Describe 2D/3D static and temporal regions.
• Provide a container for points, arrays, regions, and object
sets within a spatial and temporal bounding box.
• A region template can represent collections of spatial areas
and objects where these entities vary from one another in
size and shape; e.g. regions generated by segmenting cells
in microscopy images, man-made structures or hurricanes
in satellite imagery.
• Primary datasets are defined as point data elements and
arrays, and derived datasets as sets of regions and objects.
• Region templates may be related to one another in a
defined manner.
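The properties listed above suggest a container along the following lines. This is a hedged sketch, not the actual region template API: the class names `BoundingBox`, `SegmentedObject`, and `RegionTemplate`, their fields, and the containment check are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np

@dataclass(frozen=True)
class BoundingBox:
    """Spatial extent plus a time interval; all bounds inclusive."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    t_min: float
    t_max: float

    def contains(self, x, y, t):
        return (self.x_min <= x <= self.x_max
                and self.y_min <= y <= self.y_max
                and self.t_min <= t <= self.t_max)

@dataclass
class SegmentedObject:
    """A derived object (e.g. a nucleus outline) observed at time t."""
    label: str
    polygon: List[Tuple[float, float]]
    t: float

@dataclass
class RegionTemplate:
    """Container for primary data (arrays) and derived data (objects)."""
    name: str
    bbox: BoundingBox
    arrays: Dict[str, np.ndarray] = field(default_factory=dict)
    objects: List[SegmentedObject] = field(default_factory=list)
    # Named links to other templates, e.g. the same area at a later time.
    related: Dict[str, "RegionTemplate"] = field(default_factory=dict)

    def add_object(self, obj):
        """Add an object, rejecting any vertex outside the bounding box."""
        if not all(self.bbox.contains(x, y, obj.t) for x, y in obj.polygon):
            raise ValueError(f"{obj.label} lies outside {self.name}")
        self.objects.append(obj)
```

Objects of different sizes and shapes share one container because only the bounding box is fixed; each object carries its own polygon.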
Programming Abstractions and Runtime Middleware Services
• Programming abstractions
– Multi-level dataflow pipelines
– MapReduce style programs
– Spatial query capabilities
• I/O and Storage Services
– Indexing and metadata management for ensembles of datasets
– I/O support for retrieving data from multiple storage systems and for streaming data
– Query capabilities
• Execution Services
– Distributing and rearranging computations and data to minimize data movement
– Coordinated scheduling and mapping of analysis operations to heterogeneous and hybrid (CPU cores and GPUs) systems to increase overall application throughput
– Quality of service/data requirements
– Function variants
• Memory Management
– Careful management and staging of large data structures across memory hierarchies. Masking data movement costs with computation.
• Provenance Tracking, Fault-detection and tolerance
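Function variants and coordinated scheduling can be illustrated together. The sketch below assumes each operation offers one implementation per device kind along with an estimated run time, and it uses a simple greedy earliest-finish-time heuristic for independent tasks. The `Operation` class, the `schedule` function, and the cost estimates are hypothetical; the heuristic is a stand-in chosen for brevity, not the middleware's actual scheduling policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Operation:
    """An analysis operation with one function variant per device kind."""
    name: str
    variants: Dict[str, Callable]   # device kind ("cpu"/"gpu") -> implementation
    est_cost: Dict[str, float]      # device kind -> estimated run time

def schedule(ops: List[Operation], devices: Dict[str, str]) -> Tuple[list, float]:
    """Greedy mapping of independent operations to devices.

    devices maps a device id to its kind, e.g. {"cpu0": "cpu", "gpu0": "gpu"}.
    Each operation goes to the device that would finish it earliest.
    Returns the plan as (operation, device id) pairs and the makespan.
    """
    free_at = {d: 0.0 for d in devices}
    plan = []
    for op in ops:
        # Only devices for which the operation has a variant are candidates.
        candidates = [d for d, kind in devices.items() if kind in op.variants]
        best = min(candidates, key=lambda d: free_at[d] + op.est_cost[devices[d]])
        free_at[best] += op.est_cost[devices[best]]
        plan.append((op.name, best))
    return plan, max(free_at.values())
```

Note that an operation with a large GPU speedup takes the GPU, while one with a small speedup may still run on a CPU core if the GPU is busy, which is the throughput argument for scheduling across both device kinds rather than always preferring the GPU.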
End