Data Sciences at Sandia National Laboratories Sept, 2014 Steven Castillo, Ph.D. ISR Systems Engineering and Decision Support Presentation to University of Illinois at Urbana-Champaign Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Approved for UUR: SAND2014-17938 PE 1 Motivation Slide 2 4 2 4 4 2 9 24 39 1 3 8 Mission Context 7 • The analysis of data is a common point of success or failure • We see exponential growth in the size, complexity and information density of data far exceeding available human analysis capabilities. • This research challenge intersects most Sandia Mission Areas. Why Sandia? Envision, create the future Pathfinder, data-centric systems for high-consequence decision making: Sandia delivers sensors into several mission areas. Sandia has access to relevant data sets and analysts. We live the full data science cycle at the cutting edge as part of executing our broad national security missions. 2 Contributors to current limitations: • Only a small fraction of the data is ever examined by analysts. Workflows are labor intensive and devoid of effective computational tools. • Systems do not exploit the relationship discovery potential of the data or identify meaningful, defensible trends and patterns. 3 Intelligence Information Data Analyst (2014) Signal Increasing Value Decreasing Volume Analyst (2017) Physics Sensors • Increasing density • Increasing data rates • Increasing information density 4 4 Focus: High Consequence Decision Making Human Analyst Centric – Transforming Information into Actionable Intelligence Discovery and Disambiguation Patterns of Life: anomaly detection Patterns of Life: predictive analytics Robust Data Analysis: incomplete, vulnerable and uncertain data Intelligent Data Collection: tasking of sensors, optimal data sets Crosscutting Capabilities: Fundamental mathematics and science transformed into advanced mission capability Sandia World Class S&T: Research Foundations Supporting National Security Missions: Nuclear Weapons, Defense & Intelligence 5 Research Portfolio – An Investment in Next Generation Analytics for the Nation Current portfolio PANTHER Grand Challenge – Pixels to Intelligence: Pattern Analytics Counter Adversarial Data Analytics internal research & development Adversarial Modeling - internal research & development DARPA sponsored graph modeling Customer sponsored streaming/graph algorithms Customer funded cyber defense Portfolio focus on innovative R&D • Foster high-risk projects • Promote R&D impact on mission stakeholders 6 Contrast National Security to Google Google NS Data to Decisions Data Gathering Highly Cooperative/dense, homogeneous (Android, Chrome, Google Searches, etc.) Adversarial Environments/sparse/diver se (can’t see everything all the time) Data Analytics Environment Large, homogeneous computing environment (cloud) Diverse architectures, available throughput, bandwidth Human Analysis Consumer performing “buy” decisions Analyst extracting actionable intelligence Decision Consequences Small – millions of consumers making decisions related to shopping High – tactical and strategic, UQ and validation are critical 7 Grand Challenges – Internal R&D Laboratory Directed Research & Development (LDRD) Sandia initiates 1-2 GC LDRDs/ yr Cross laboratory, interdisciplinary High risk, urgent needs, low TRL (technology readiness level) Three year effort Create a Lasting National Technical Capability Required to assemble highly engaged external advisory board Cooperative development with partners to shape technical vision 8 PANTHER Grand Challenge Pattern Analytics to Support High Performance Exploitation and Reasoning: Crosscutting Innovation in Geospatial-Temporal Analysis PANTHER R&D: Scalable, temporal and geospatial relationship discovery for robust, mission- relevant pattern analysis. How is PANTHER different? Develop new mathematical approaches to this problem with a deep understanding of human capabilities and information landscape. Challenge Problem Categories: Signature Search – enable searches for signatures that are difficult to discretize and subject to interruptions in space and time. Motion & Trajectory Analysis –discover patterns in motion datasets at multiple semantic scales under conditions of intermittent data. 9 Panther Team - Interdisciplinary PI: Kristina Czuchlewski, PhD PM: Bill Hart, PhD Team Leads: Jim Chow, Ph.D., Randy Brost, Ph.D., Laura McNamara, Ph.D. and David Stracuzzi, Ph.D. 10 R&D Challenge Questions Where are chemical processing plants? Signature Search Where are active businesses? Did someone arrive in a car and enter a building? Which one(s)? Which aircraft flights are point-to-point? Trajectory Analysis Are any aircraft flying search patterns? Are any aircraft flying search patterns over sensitive locations? Is there an activity surge? Temporal Patterns Is an aircraft flight pattern departing from normal? What might it do next? What can we infer given limited data? For any of the above, what is the result confidence? Limited Data Uncertainty/Quality 11 Current Paradigm New Paradigm State of the Art Sandia R&D PANTHER Grand Challenge Approach Single frame or target recognition Statistical, pixellevel processing for specific tactical missions Identify patterns/ relationships and activities over multiple target classes and extend to strategic missions. Achieve 107 reduction in data volume in space & time. Spatial Single frame Specific areas (10s of km2) Achieve 104 single frame reduction in data presented. Query and exploit 100s of km2 Temporal Minutes of data, Weeks of real-time and offline. information, available in minutes Pattern based query of up to 1 year’s worth of data. Available in minutes. Example ATR, CCD, matched filters Statistical Normalized Coherence Object-based image analysis: features, entities, patterns and relationships. Ask questions that cannot currently be answered. Will it work? Limited, does not scale. Limited, scales for specific missions. Unproven. Scalable & extensible to other domains. 12 FY14 Highlights Unsupervised computational methods for detecting outliers (with no a priori knowledge) within a TB-scale database… in under 30 minutes. New geometric feature vector enables comparisons. Result: 44 seconds on our Netezza DB machine to compute a trajectory segmentation from points. Result: 20 minutes of wall-clock time to turn 1.2 billion points into 15 million trajectories. Geospatial feature extraction and classification for signature relationship and temporal trending searches. Pixel statistics extracted via new superpixel segmentation algorithm. Unique temporal attributes of coherent changes exploited for labeling. 13 FY14 Highlights, cont. R2 E4 E1 3 B7 E10 G4 E37 P3 R7 6 E2 E34 G3 E32 E30 8 E2 R5 E19 E16 E18 E20 E27 E14 E25 1 E3 E36 E6 G2 E13 B6 E29 P2 R6 B3 B2 E17 E23 B5 E5 R4 E24 E21 B4 E9 E8 2 E1 Eye-tracking experiments are enabling visual cognition/ search models – and eventual user interface design(s). Pattern match quality, statistical and probabilistic approaches are under investigation for characterizing uncertainty in geospatial temporal graph representations. E2 G1 E3 OP1 E12 E22 Multi-source data search under one framework Representation and search under heterogeneous temporal conditions (ephemeral and activity). Complex sensor feature data “compressed” for analysis with pointers back to original sensor source. Intermittency/ interruption nodes implemented. P1 R1 5 E1 E11 R3 4 E1 New, efficient graph algorithm formulation developed to enable geospatial-temporal topological search complexity. E1 B1 E7 E33 B8 E35 P4 Shadow Roof Bright 14 GeoSpatial Semantic Graph Representations of Features & Activity A Graph includes activity: @ t=1, the graph includes objects with location From t=1 to t=2, the graph encodes change Nodes for activity events.1 Node attributes include time observed. No persistence expected. Spatial and temporal relationship edges.1 t=3 t=2 A A t=1 A 15 Signature Search A signature (over space/ time) encodes a desired question. For example, “Where are buildings with nearby grass, pavement, and dirt? Graph template: Building Grass Pavement Dirt 16 Matches Graph search finds all matches to signature template. In this case all red nodes with adjacent green, grey, and tan nodes. New approach to graph search, polynomial time complexity Searches are saved A t=3 t=2 A A t=1 A 17 Advantages Efficient representations in time (only store change). Relationship, change, and temporal analysis over multiple times and heterogeneous spatial ensembles in the same query. Change detection Activity characterization Efficient search algorithms – computation not limited by brittle relational databases. Potential to take full advantage of graph topology search advances, enhanced by geospatial-temporal semantics. Feature-based analysis Multi-modality, in a single search representation. Sensor agnostic – PANTHER emphasizes SAR. 18 Extracting Statistical Distributions Using Superpixel Segmentation Superpixel Segmentation of SAR Image • Superpixel Segmentation • Divides image into compact regions containing pixels similar in spatial proximity and intensity • Derives from large body of research in optical image processing community extended to SAR imagery • PANTHER Approach • Superpixel segmentation algorithms enable high quality segmentations and efficient execution • De-noising advances enable utilization of novel image processing techniques. 19 Initial Automated Static Feature Extraction Result Exploit ~21 days worth of change via coherence Result: automatically extract paved roads 20 Query algorithms must handle disconnections & interruptions Static and ephemeral features for time slice A:* Data semantics: Building Fence Dirt Road Gravel Desert Shadow Low Vegetation Human Tracks Vehicle Tracks Vehicle B F DR G D S LV HT VT V Did someone arrive by car and visit a building? Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity. 21 Example: Connected Signature Result from star-graph algorithm:* Correct Correct Correct Correct Why wasn’t this found? Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity. 22 Result of Disconnected Signature Search Result from star-graph algorithm:* Correct Correct Correct Correct Correct Note: Based on hand-annotated primitive features. * Note: Hand-edited for explanation clarity. 23 Site Activity Analysis Diversity of Problems Power Plant Search Tank Complex Search Activity Analysis -- Interrupted Signature Construction Analysis All of these were solved by the same code. 24 Discovery of Flight Patterns Holding Pattern Mapping Avoid Collections of geometric descriptions can describe a trajectory. Extensions: impact of time. Forgot Something 25 Big Feature Space Advantage Clustering, leading to unsupervised learning techniques Previous examples showed searching the space for a specific pattern or a specific volume of the feature space But, with clustering, the computer can group the different patterns in the feature space without knowing a priori what they are. Perhaps most importantly, many clustering algorithms specifically identify outliers in the feature space that correspond to odd behaviors Forgot Something, Revisited Did not specify “find this,” only told routine to “make groups of similar flights.” This was one of many clusters that had distinctive shapes 26 Discovery of Odd flights Clustering done based on geometric features Many clusters found, but what remains is… Note: we have ~5M points/day, ~1GB/day, currently >300GB Represents approximately 700 out of a total of 50,000 flights from one day 27 Better knowledge & deeper insights from Big Data in minutes, not months; over months, not hours or minutes; covering hundreds, not 10s of km2 28 Technical Areas for Collaboration Graph Analytics Tensor Analysis Computational Geometry Digital Signal and Image Processing Machine Learning Scalable and Highly Distributed Architectures Human Performance and Cognitive Understanding Machine Feature Identification in Imagery (EO, SAR, LIDAR) Uncertainty Quantification and Propagation – from Sensor to Answer Robust Data Analysis Intelligent Data Collection 29 University Collaborations PhD Interns: Colorado State University, Utah State University Faculty have clearances and clear expertise in critical areas Students can gain clearances Year-round appointments give flexibility for Sandia work, university requirements Win-win-win: Sandia gains valuable technical support in critical challenge areas, Student gets a degree/publications/potential future employment, University and professor fulfill mission and gain a strong supporter. Critical Skills Master’s Program (CSMP): University of Illinois Similar to old One-Year-on-Campus-Program NDA’s: USU, CSU, UIUC (in progress) 30