the talk, "Data Sciences at Sandia National Laboratories."

advertisement
Data Sciences at Sandia
National Laboratories
Sept, 2014
Steven Castillo, Ph.D.
ISR Systems Engineering and Decision Support
Presentation to
University of Illinois at Urbana-Champaign
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin
Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Approved for UUR: SAND2014-17938 PE
1
Motivation Slide
2
4
2
4
4
2
9
24
39
1
3
8
Mission Context
7
• The analysis of data is a common point of success or failure
• We see exponential growth in the size, complexity and information
density of data far exceeding available human analysis capabilities.
• This research challenge intersects most Sandia Mission Areas.
Why Sandia?
 Envision, create the future
 Pathfinder, data-centric systems for
high-consequence decision making:
 Sandia delivers sensors into several mission areas.
 Sandia has access to relevant data sets and analysts.
 We live the full data science cycle at the cutting edge as part of executing
our broad national security missions.
2
Contributors to current limitations:
• Only a small fraction of the data is ever examined by
analysts. Workflows are labor intensive and devoid of
effective computational tools.
• Systems do not exploit the relationship discovery potential
of the data or identify meaningful, defensible trends and
patterns.
3
Intelligence
Information
Data
Analyst (2014)
Signal
Increasing Value
Decreasing Volume
Analyst (2017)
Physics
Sensors
• Increasing density
• Increasing data rates
• Increasing information density
4
4
Focus: High Consequence Decision Making
Human Analyst Centric – Transforming Information into
Actionable Intelligence





Discovery and Disambiguation
Patterns of Life: anomaly detection
Patterns of Life: predictive analytics
Robust Data Analysis: incomplete, vulnerable and uncertain data
Intelligent Data Collection: tasking of sensors, optimal data sets
Crosscutting Capabilities: Fundamental mathematics and science
transformed into advanced mission capability
Sandia World Class S&T: Research Foundations
Supporting National Security Missions: Nuclear Weapons, Defense &
Intelligence
5
Research Portfolio – An Investment in Next
Generation Analytics for the Nation
 Current portfolio
 PANTHER Grand Challenge – Pixels to
Intelligence: Pattern Analytics
 Counter Adversarial Data Analytics internal research & development
 Adversarial Modeling - internal
research & development
 DARPA sponsored graph modeling
 Customer sponsored streaming/graph
algorithms
 Customer funded cyber defense
Portfolio focus on
innovative R&D
• Foster high-risk
projects
• Promote R&D
impact on mission
stakeholders
6
Contrast National
Security to Google
Google
NS Data to Decisions
Data Gathering
Highly Cooperative/dense,
homogeneous (Android,
Chrome, Google
Searches, etc.)
Adversarial
Environments/sparse/diver
se (can’t see everything all
the time)
Data Analytics
Environment
Large, homogeneous
computing environment
(cloud)
Diverse architectures,
available throughput,
bandwidth
Human Analysis
Consumer performing
“buy” decisions
Analyst extracting
actionable intelligence
Decision Consequences
Small – millions of
consumers making
decisions related to
shopping
High – tactical and
strategic, UQ and
validation are critical
7
Grand Challenges – Internal R&D
 Laboratory Directed Research &
Development (LDRD)
 Sandia initiates 1-2 GC LDRDs/ yr
 Cross laboratory, interdisciplinary
 High risk, urgent needs, low TRL (technology
readiness level)
 Three year effort
 Create a Lasting National Technical
Capability
 Required to assemble highly engaged external
advisory board
 Cooperative development with partners to
shape technical vision
8
PANTHER Grand Challenge
Pattern Analytics to Support High Performance Exploitation and
Reasoning: Crosscutting Innovation in Geospatial-Temporal Analysis
PANTHER R&D: Scalable, temporal and geospatial
relationship discovery for robust, mission- relevant
pattern analysis.
How is PANTHER different? Develop new
mathematical approaches to this problem with a deep
understanding of human capabilities and information
landscape.
Challenge Problem Categories:
Signature Search – enable searches for signatures that
are difficult to discretize and subject to interruptions in
space and time.
Motion & Trajectory Analysis –discover patterns in
motion datasets at multiple semantic scales under
conditions of intermittent data.
9
Panther Team - Interdisciplinary
 PI: Kristina Czuchlewski, PhD
 PM: Bill Hart, PhD
 Team Leads: Jim Chow, Ph.D., Randy Brost, Ph.D., Laura
McNamara, Ph.D. and David Stracuzzi, Ph.D.
10
R&D Challenge Questions
 Where are chemical processing plants?
Signature Search
 Where are active businesses?
 Did someone arrive in a car and enter a building? Which one(s)?
 Which aircraft flights are point-to-point?
Trajectory Analysis
 Are any aircraft flying search patterns?
 Are any aircraft flying search patterns over sensitive locations?
 Is there an activity surge?
Temporal Patterns
 Is an aircraft flight pattern departing from normal? What might it do next?
 What can we infer given limited data?
 For any of the above, what is the result confidence?
Limited Data
Uncertainty/Quality
11
Current Paradigm  New Paradigm
State of the Art
Sandia R&D
PANTHER Grand Challenge
Approach
Single frame or
target recognition
Statistical, pixellevel processing for
specific tactical
missions
Identify patterns/ relationships and
activities over multiple target classes and
extend to strategic missions.
Achieve 107 reduction in data volume in
space & time.
Spatial
Single frame
Specific areas (10s
of km2)
Achieve 104 single frame reduction in
data presented.
Query and exploit 100s of km2
Temporal
Minutes of data,
Weeks of
real-time and offline. information,
available in minutes
Pattern based query of up to 1 year’s
worth of data.
Available in minutes.
Example
ATR, CCD,
matched filters
Statistical
Normalized
Coherence
Object-based image analysis: features,
entities, patterns and relationships. Ask
questions that cannot currently be
answered.
Will it
work?
Limited, does not
scale.
Limited, scales for
specific missions.
Unproven. Scalable & extensible to other
domains.
12
FY14 Highlights
 Unsupervised computational methods for
detecting outliers (with no a priori
knowledge) within a TB-scale database… in
under 30 minutes.
 New geometric feature vector enables
comparisons.
 Result: 44 seconds on our Netezza DB
machine to compute a trajectory
segmentation from points.
 Result: 20 minutes of wall-clock time to turn
1.2 billion points into 15 million trajectories.
 Geospatial feature extraction and
classification for signature relationship and
temporal trending searches.
 Pixel statistics extracted via new superpixel
segmentation algorithm.
 Unique temporal attributes of coherent
changes exploited for labeling.
13
FY14 Highlights, cont.
R2
E4
E1
3
B7
E10
G4
E37
P3
R7
6
E2
E34
G3
E32
E30
8
E2
R5
E19
E16
E18
E20
E27
E14
E25
1
E3
E36
E6
G2
E13
B6
E29
P2
R6
B3
B2
E17
E23
B5
E5
R4
E24
E21 B4
E9
E8
2
E1
 Eye-tracking experiments are enabling visual
cognition/ search models – and eventual user
interface design(s).
 Pattern match quality, statistical and probabilistic
approaches are under investigation for characterizing
uncertainty in geospatial temporal graph
representations.
E2

G1
E3
OP1
E12
E22

Multi-source data search under one framework
Representation and search under heterogeneous
temporal conditions (ephemeral and activity).
Complex sensor feature data “compressed” for analysis
with pointers back to original sensor source.
Intermittency/ interruption nodes implemented.
P1
R1
5
E1


E11
R3
4
E1
 New, efficient graph algorithm formulation
developed to enable geospatial-temporal topological
search complexity.
E1
B1
E7
E33
B8
E35
P4
Shadow
Roof
Bright
14
GeoSpatial Semantic Graph
Representations of Features & Activity
A
 Graph includes activity:
 @ t=1, the graph includes objects with
location
 From t=1 to t=2, the graph encodes change
 Nodes for activity events.1
 Node attributes include time observed.
 No persistence expected.
 Spatial and temporal relationship edges.1
t=3
t=2
A
A
t=1
A
15
Signature Search
 A signature (over space/ time) encodes a
desired question.
 For example, “Where are buildings with
nearby grass, pavement, and dirt?
 Graph template:
Building
Grass
Pavement
Dirt
16
Matches
 Graph search finds all matches to
signature template.
 In this case all red nodes with adjacent
green, grey, and tan nodes.
 New approach to graph search,
polynomial time complexity
 Searches are saved
A
t=3
t=2
A
A
t=1
A
17
Advantages
 Efficient representations in time (only store change).
 Relationship, change, and temporal analysis over multiple times and
heterogeneous spatial ensembles in the same query.
 Change detection
 Activity characterization
 Efficient search algorithms – computation not limited by brittle relational
databases.
 Potential to take full advantage of graph topology search advances,
enhanced by geospatial-temporal semantics.
 Feature-based analysis
 Multi-modality, in a single search representation.
 Sensor agnostic – PANTHER emphasizes SAR.
18
Extracting Statistical Distributions Using
Superpixel Segmentation
Superpixel Segmentation of
SAR Image
• Superpixel Segmentation
• Divides image into compact regions
containing pixels similar in spatial
proximity and intensity
• Derives from large body of research in
optical image processing community
extended to SAR imagery
• PANTHER Approach
• Superpixel segmentation algorithms
enable high quality segmentations and
efficient execution
• De-noising advances enable utilization
of novel image processing techniques.
19
Initial Automated Static Feature
Extraction Result
Exploit ~21 days worth of change via coherence
Result: automatically extract paved roads
20
Query algorithms must handle
disconnections & interruptions
Static and ephemeral features for time slice A:*
Data semantics:
Building
Fence
Dirt Road
Gravel
Desert
Shadow
Low Vegetation
Human Tracks
Vehicle Tracks
Vehicle
B
F
DR
G
D
S
LV
HT
VT
V
Did someone arrive by car and visit a building?
Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity. 21
Example: Connected Signature
Result from star-graph algorithm:*
Correct
Correct
Correct
Correct
Why wasn’t this found?
Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity.
22
Result of Disconnected Signature Search
Result from star-graph algorithm:*
Correct
Correct
Correct
Correct
Correct
Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity.
23
Site Activity Analysis
Diversity of Problems
Power Plant Search
Tank Complex Search
Activity Analysis -- Interrupted Signature
Construction Analysis
All of these were solved by the same code.
24
Discovery of Flight Patterns
Holding Pattern
Mapping
Avoid
Collections of
geometric
descriptions
can describe a
trajectory.
Extensions:
impact of time.
Forgot Something
25
Big Feature Space Advantage




Clustering, leading to
unsupervised learning
techniques
Previous examples showed
searching the space for a
specific pattern or a specific
volume of the feature space
But, with clustering, the
computer can group the
different patterns in the
feature space without knowing
a priori what they are.
Perhaps most importantly,
many clustering algorithms
specifically identify outliers in
the feature space that
correspond to odd behaviors
Forgot Something, Revisited
Did not specify “find this,” only told routine to
“make groups of similar flights.”
This was one of many clusters that had
distinctive shapes
26
Discovery of Odd flights
Clustering done based
on geometric features
Many clusters found,
but what remains is…
Note: we have ~5M
points/day,
~1GB/day, currently
>300GB
Represents approximately 700 out of a total of
50,000 flights from one day
27
Better knowledge & deeper insights
from Big Data
in minutes, not months;
over months, not hours or minutes;
covering hundreds, not 10s of km2
28
Technical Areas for Collaboration









Graph Analytics
Tensor Analysis
Computational Geometry
Digital Signal and Image Processing
Machine Learning
Scalable and Highly Distributed Architectures
Human Performance and Cognitive Understanding
Machine Feature Identification in Imagery (EO, SAR, LIDAR)
Uncertainty Quantification and Propagation – from Sensor to
Answer
 Robust Data Analysis
 Intelligent Data Collection
29
University Collaborations
 PhD Interns: Colorado State University, Utah State University
 Faculty have clearances and clear expertise in critical areas
 Students can gain clearances
 Year-round appointments give flexibility for Sandia work, university
requirements
 Win-win-win: Sandia gains valuable technical support in critical
challenge areas, Student gets a degree/publications/potential future
employment, University and professor fulfill mission and gain a strong
supporter.
 Critical Skills Master’s Program (CSMP): University of Illinois
 Similar to old One-Year-on-Campus-Program
 NDA’s: USU, CSU, UIUC (in progress)
30
Download