D4M-1
Jeremy Kepner
MIT Lincoln Laboratory
3 October 2012
This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
Acknowledgements
• Nicholas Arcolano
• Michelle Beard
• Bob Bond
• Josh Haines
• Matthew Schmidt
• Ben Miller
• Benjamin O’Gwynn
• Tamara Yu
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
• Dylan Hutchinson
Outline
• Introduction
• Theory
• Results
• Summary
Example Applications of Graph Analytics
ISR
• Graphs represent entities and relationships detected through multi-INT sources
• 1,000s – 1,000,000s tracks and locations
• GOAL: Identify anomalous patterns of life
Social
• Graphs represent relationships between individuals or documents
• 10,000s – 10,000,000s individuals and interactions
• GOAL: Identify hidden social networks
Cyber
• Graphs represent communication patterns of computers on a network
• 1,000,000s – 1,000,000,000s network events
• GOAL: Detect cyber attacks or malicious software
• Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets
Four Ecosystems Dominate
Cloud Computing
Enterprise
- Interactive
- On-demand
- Elastic
Big Compute
- High performance
- Parallel Languages
- Scientific computing
Big Data
- Java
- Map/Reduce
- Easy admin
DBMS
- Indexing
- Search
- Security
• Each ecosystem is at the center of a multi-$B market
• Pros/cons of each are numerous; diverging hardware/software
• Some missions can exist wholly in one ecosystem; some can’t
Four Ecosystems Dominate
Cloud Computing
[Figure: the four-ecosystem diagram (Enterprise, Big Compute, Big Data, DBMS) with LLGrid and MapReduce labels added]
• LLGrid MapReduce provides a map/reduce interface in a big compute environment
• D4M provides an interactive parallel scientific computing environment to databases
Big Data + Big Compute Challenge
Database Worldview
“It’s the data!”
Delivering data is the end
Shared Data, Separate Compute
Supercomputing Worldview
“It’s the computer!”
Delivering data is the start
Shared Compute, Separate Data
• Database and supercomputing views are fundamentally different
• Have never coexisted; do not know how to coexist
• Big Data “Analytics” are forcing them together
• Current standard practice duplicates hardware and data
Big Data + Big Compute Stack
Novel Analytics for: Text, Cyber, Bio (Weak Signatures, Noisy Data, Dynamics)
High Level Composable API: D4M (“Databases for Matlab”) (Array Algebra)
Distributed Database: Accumulo (triple store) (Distributed Database/Distributed File System)
High Performance Computing: LLGrid + Hadoop (Interactive Supercomputing)
• Combining Big Compute and Big Data enables entirely new domains
High Level Language: D4M
http://www.mit.edu/~kepner/D4M
D4M: Dynamic Distributed Dimensional Data Model
[Figure: Distributed Database, Associative Arrays, Numerical Computing Environment; example query over Alice, Bob, Cathy, David, Earl]
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
Outline
• Introduction
• Theory
– Associative Arrays
– Incidence Matrix
• Results
• Summary
What are Spreadsheets and Big Tables?
[Figure: Spreadsheets; Big Tables]
• Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
• Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)
• Simultaneous diverse data: strings, dates, integers, reals, …
• Simultaneous diverse uses: matrices, functions, hash tables, databases, …
• No formal mathematical basis; Zero papers in AMA or SIAM
D4M Key Concept:
Associative Arrays Unify Four Abstractions
• Extends associative arrays to 2D and mixed data types
A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0
• Key innovation: 2D is 1-to-1 with triple store
('alice ','bob ','cited ') or ('alice ','bob ',47.0)
[Figure: associative array A, the product A^T x, and a graph over alice, bob, carl with edge label cited]
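The 1-to-1 correspondence between a 2D associative array and a triple store can be sketched in a few lines. The following is a minimal illustration in Python rather than D4M/MATLAB; the helper names `from_triples` and `to_triples` and the second triple (alice, carl) are assumptions made for the example and are not part of the D4M API.

```python
# Minimal sketch: a 2D associative array as a dict keyed by (row, column).
# Helper names are hypothetical; this is not the D4M API.

def from_triples(triples):
    """Build a 2D associative array from (row, column, value) triples."""
    return {(row, col): val for row, col, val in triples}

def to_triples(assoc):
    """Recover the triples, sorted by their (row, column) key."""
    triples = [(row, col, val) for (row, col), val in assoc.items()]
    return sorted(triples, key=lambda t: t[:2])

# Values may be strings or numbers, as on the slide.
triples = [("alice", "bob", "cited"), ("alice", "carl", 47.0)]
A = from_triples(triples)
round_trip = to_triples(A)

print(A[("alice", "bob")])
print(round_trip)
```

Because each (row, column) key holds exactly one value, converting to triples and back loses nothing, which is what lets the same object be viewed as a table, a sparse matrix, or a labeled graph.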
Composable Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B A - B A & B A|B A*B
• Enables composable query operations via array indexing
A('alice bob ',:) A('alice ',:) A('al* ',:)
A('alice : bob ',:) A(1:2,:) A == 47.0
• Simple to implement in a library (~2000 lines) in programming environments with: first-class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
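Closure is what makes the operations above chain: each one takes associative arrays and returns an associative array. A rough Python sketch of the idea follows, assuming dict-based arrays with numeric values; the function names `add`, `both`, and `rows` are hypothetical stand-ins for A + B, A & B, and row indexing, and the semantics are simplified relative to D4M.

```python
# Sketch of closure: every operation returns another associative array
# (here a dict keyed by (row, column)), so results compose freely.

def add(a, b):
    """A + B: union of keys, summing values where both have an entry."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0) + val
    return out

def both(a, b):
    """A & B: keep only keys present in both arrays (value set to 1)."""
    return {key: 1 for key in a if key in b}

def rows(a, pred):
    """Row query: entries whose row key satisfies the predicate."""
    return {(r, c): v for (r, c), v in a.items() if pred(r)}

A = {("alice", "bob"): 1, ("bob", "carl"): 2}
B = {("alice", "bob"): 10, ("carl", "alice"): 5}

# Compose a mathematical operation with a query, in the spirit of
# A('al* ',:) and A('alice : bob ',:) applied to A + B.
prefix_query = rows(add(A, B), lambda r: r.startswith("al"))
range_query = rows(add(A, B), lambda r: "alice" <= r <= "bob")
common = both(A, B)

print(prefix_query)
print(range_query)
print(common)
```

The design point is that no operation ever returns a different kind of object, so a query result can be fed straight into arithmetic and vice versa.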
Associative Array Definitions
• Keys and values are from the infinite strict totally ordered set $\mathbb{S}$
• Associative array $A(\mathbf{k}) : \mathbb{S}^d \rightarrow \mathbb{S}$, $\mathbf{k} = (k^1, \ldots, k^d)$, is a partial function from d keys (typically 2) to 1 value, where $A(\mathbf{k}_i) = v_i$ and $\emptyset$ otherwise
• Binary operations on associative arrays $A_3 = A_1 \oplus A_2$, where $\oplus = \cup_{f()}$ or $\cap_{f()}$, have the properties
– If $A_1(\mathbf{k}_i) = v_1$ and $A_2(\mathbf{k}_i) = v_2$, then $A_3(\mathbf{k}_i)$ is $v_1 \cup_{f()} v_2 = f(v_1, v_2)$ or $v_1 \cap_{f()} v_2 = f(v_1, v_2)$
– If $A_1(\mathbf{k}_i) = v$ or $\emptyset$ and $A_2(\mathbf{k}_i) = \emptyset$ or $v$, then $A_3(\mathbf{k}_i)$ is $v \cup_{f()} \emptyset = v$ or $v \cap_{f()} \emptyset = \emptyset$
• High level usage dictated by these definitions
• Deeper algebraic properties set by the collision function f()
• Frequent switching between “algebras” (how spreadsheets are used)
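The two binary operations in the definitions can be illustrated with a short sketch. This is an assumed Python rendering, not D4M code: `union` and `intersection` are hypothetical names, a missing key plays the role of the empty value, and the collision function f is passed in so that the "algebra" can be switched per call.

```python
# Sketch of union and intersection with a pluggable collision function f.
import operator

def union(a1, a2, f):
    """Keys from either array; f resolves keys present in both."""
    out = dict(a1)
    for key, v2 in a2.items():
        out[key] = f(out[key], v2) if key in out else v2
    return out

def intersection(a1, a2, f):
    """Only keys present in both arrays; f combines the two values."""
    return {key: f(v1, a2[key]) for key, v1 in a1.items() if key in a2}

A1 = {("alice", "bob"): 3, ("bob", "carl"): 1}
A2 = {("alice", "bob"): 4, ("carl", "alice"): 2}

# Switching f switches the algebra without changing the operations.
sum_union = union(A1, A2, operator.add)
max_union = union(A1, A2, max)
min_intersection = intersection(A1, A2, min)

# Values only need a total order, so strings work with min/max too.
S1 = {("alice", "bob"): "cited"}
S2 = {("alice", "bob"): "quoted"}
string_union = union(S1, S2, max)

print(sum_union, max_union, min_intersection, string_union)
```

A key held by only one array passes through a union unchanged and is dropped by an intersection, matching the second property above.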
Theory Questions
• Associative arrays can be constructed from a few definitions
• Similar to linear algebra, but applicable to a wider range of data
• Key questions
– Which linear algebra properties do apply to associative arrays (intuitive)
– Which linear algebra properties do not apply to associative arrays (watch out)
– Which associative array properties do not apply to linear algebra (new)
[Figure: overlap of Associative Arrays and Linear Algebra, with regions labeled new, intuitive, and watch out]
References
• Book: “Graph Algorithms in the Language of Linear Algebra”
• Editors: Kepner (MIT-LL) and Gilbert (UCSB)
• Contributors:
– Bader (Ga Tech)
– Bliss (MIT-LL)
– Bond (MIT-LL)
– Dunlavy (Sandia)
– Faloutsos (CMU)
– Fineman (CMU)
– Gilbert (UCSB)
– Heitsch (Ga Tech)
– Hendrickson (Sandia)
– Kegelmeyer (Sandia)
– Kepner (MIT-LL)
– Kolda (Sandia)
– Leskovec (CMU)
– Madduri (Ga Tech)
– Mohindra (MIT-LL)
– Nguyen (MIT)
– Rader (MIT-LL)
– Reinhardt (Microsoft)
– Robinson (MIT-LL)
– Shah (UCSB)
Outline
• Introduction
• Theory
– Associative Arrays
– Incidence Matrix
• Results
• Summary
Digraphs are Black & White
The World is Color
Artist: Ann Pibal; Painting: “XCRS”
• 5 Edge Colors: Blue, Silver, Green, Orange, Pink
• 20 Vertices: V1 through V20
• 1 Isolated Standard Edge: P4
• 12 Multi Edges
• 18 Hyper Edges (figure labels: P5, P8, O5, P3, P7, P6)
• 27 Edge Orderings (figure labels: P5, P8, O5, P3, P7, P6)
– O5 < P3,P6,P7,P8
– O5 < B1,S1,G1,O1,O2,P1
– O5 < B2,S2,G2,O3,O4,P2 < P7,P8
• 52 Standard Multi Edges (figure labels: P5x2, P8x2, O5x5, P3x3, P7x2, P6x2)
Summary Observations
• Standard edge representation fragments hyper edges
– Information is lost
• Digraph representation compresses multi-edges
– Information is lost
• Matrix representation drops edge labels
– Information is lost
• Standard graph representation drops edge order
– Information is lost
• Need edge representation that preserves information
Solution: Incidence Matrix
Edge  Color   Order
B1    Blue    2
S1    Silver  2
G1    Green   2
O1    Orange  2
O2    Orange  2
P1    Pink    2
B2    Blue    2
S2    Silver  2
G2    Green   2
O3    Orange  2
O4    Orange  2
O5    Orange  1
P2    Pink    2
P3    Pink    2
P4    Pink    2
P5    Pink    2
P6    Pink    2
P7    Pink    3
P8    Pink    3
[Columns V01 through V20 hold a 1 for each vertex an edge touches. B1, S1, G1, O1, O2, and P1 each touch 3 vertices; B2 and S2 each touch 5. The column positions of the entries are not recoverable here.]
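To see why an incidence matrix keeps what the other representations drop, consider a small sketch with one row per edge. The edge names, colors, and orders below are taken from the table above, but the vertex memberships are invented for illustration (the real ones are not recoverable from the slide), and the helper names are hypothetical.

```python
# Sketch: one row per edge; columns carry edge attributes plus one
# column per vertex the edge touches. Vertex memberships are made up.
E = {
    "B1": {"Color": "Blue", "Order": 2, "V01": 1, "V02": 1, "V03": 1},
    "O5": {"Color": "Orange", "Order": 1, "V04": 1, "V05": 1},
    "P7": {"Color": "Pink", "Order": 3, "V04": 1, "V05": 1},
}

def vertices(edge):
    """Vertices touched by an edge; more than two means a hyper edge."""
    return sorted(col for col in E[edge] if col.startswith("V"))

def edges_touching(vertex):
    """All edges incident on a vertex; multi-edges stay distinct rows."""
    return sorted(name for name, row in E.items() if row.get(vertex))

def drawn_before(e1, e2):
    """Edge ordering is just another column, so it is preserved."""
    return E[e1]["Order"] < E[e2]["Order"]

print(vertices("B1"))
print(edges_touching("V04"))
print(drawn_before("O5", "P7"))
```

In this layout a hyper edge is a single row with three or more vertex columns, two edges over the same vertices remain two rows, and color and order are ordinary columns, so none of the four losses listed in the summary observations occurs.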
Outline
• Introduction
• Theory
• Results
– Network monitoring example
– Bioinformatics example
• Summary
Graph Construction Using D4M:
Explode Schema
[Pipeline: Raw Data, CSV Files, Distributed Database, Assoc. Arrays]
Dense Table
log_id  src_ip       server_ip
001     128.0.0.1    208.29.69.138
002     192.168.1.2  157.166.255.18
003     128.0.0.1    74.125.224.72
Use log_id values as row indices; create columns for each unique type/value pair
Exploded Table
            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
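The explode step can be sketched as a small function over the dense records. This is an illustrative Python version, not the D4M implementation; the function name `explode` and the dict-of-records input format are assumptions.

```python
# Sketch of the explode schema: each field|value pair becomes its own
# column, and the chosen key field becomes the row index.

def explode(records, key_field):
    """Return a sparse table {(row, column): 1} from dense records."""
    exploded = {}
    for rec in records:
        row = f"{key_field}|{rec[key_field]}"
        for field, value in rec.items():
            if field != key_field:
                exploded[(row, f"{field}|{value}")] = 1
    return exploded

dense = [
    {"log_id": "001", "src_ip": "128.0.0.1", "server_ip": "208.29.69.138"},
    {"log_id": "002", "src_ip": "192.168.1.2", "server_ip": "157.166.255.18"},
    {"log_id": "003", "src_ip": "128.0.0.1", "server_ip": "74.125.224.72"},
]

E = explode(dense, "log_id")
columns = sorted({col for _, col in E})

print(len(E))
print(columns)
```

Only the nonzero entries are stored, so the zeros shown in the exploded table cost nothing, and a new field or value simply adds a new column key.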
Graph Construction Using D4M:
Storing Exploded Data as Triples
[Pipeline: Raw Data, CSV Files, Assoc. Arrays, Distributed Database]
Exploded Table
            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
D4M stores the triple data representing both the exploded table and its transpose
Table Triples
Row         Column                    Value
log_id|001  src_ip|128.0.0.1          1
log_id|001  server_ip|208.29.69.138   1
log_id|002  src_ip|192.168.1.2        1
log_id|002  server_ip|157.166.255.18  1
log_id|003  src_ip|128.0.0.1          1
log_id|003  server_ip|74.125.224.72   1
Table Transpose Triples
Row                       Column      Value
server_ip|157.166.255.18  log_id|002  1
server_ip|208.29.69.138   log_id|001  1
server_ip|74.125.224.72   log_id|003  1
src_ip|128.0.0.1          log_id|001  1
src_ip|128.0.0.1          log_id|003  1
src_ip|192.168.1.2        log_id|002  1
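One way to read the table/transpose pair: a store sorted by row key answers row lookups quickly, so keeping the transpose turns a column lookup into a row lookup as well. The sketch below assumes that reading and models each table as a sorted Python list; `scan` is a hypothetical helper, not an Accumulo or D4M call.

```python
# Sketch: store triples sorted by row, plus the transposed copy, so
# that lookups by either dimension become a sorted-prefix scan.
import bisect

table = sorted([
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|001", "server_ip|208.29.69.138", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|002", "server_ip|157.166.255.18", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
    ("log_id|003", "server_ip|74.125.224.72", 1),
])
transpose = sorted((col, row, val) for row, col, val in table)

def scan(sorted_triples, row_key):
    """Return the column keys stored under row_key using binary search."""
    start = bisect.bisect_left(sorted_triples, (row_key,))
    cols = []
    for row, col, _ in sorted_triples[start:]:
        if row != row_key:
            break
        cols.append(col)
    return cols

logs_for_source = scan(transpose, "src_ip|128.0.0.1")
fields_for_log = scan(table, "log_id|002")

print(logs_for_source)
print(fields_for_log)
```

The cost is storing every entry twice; the benefit is that "which logs mention this IP" and "what does this log contain" are both answered by the same kind of scan.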
Graph Construction Using D4M:
Construct Associative Arrays
[Pipeline: Raw Data, CSV Files, Distributed Database, Assoc. Arrays]
D4M Query #1
keys = T(:, ['time_stamp|10/May/2011:00:00:00,:,' ...
             'time_stamp|13/May/2011:23:59:59,']);
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
D4M Query #2
data = T(Row(keys), :);
('log_id|001','server_ip|208.29.69.138',1)
('log_id|001','src_ip|128.0.0.1',1)
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
...
('log_id|002','server_ip|157.166.255.18',1)
('log_id|002','src_ip|192.168.1.2',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
...
('log_id|003','server_ip|74.125.224.72',1)
('log_id|003','src_ip|128.0.0.1',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
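The two queries can be mimicked on an in-memory table to show what each returns. This is a Python sketch under stated assumptions: `col_range`, `row_keys`, and `select_rows` are hypothetical stand-ins for the column range query, Row(), and row selection, and the log_id|004 record is invented to give the range something to exclude. Plain string comparison works here only because all timestamps fall in the same month.

```python
# Sketch of Query #1 (column range) followed by Query #2 (row select).
T = {
    ("log_id|001", "server_ip|208.29.69.138"): 1,
    ("log_id|001", "src_ip|128.0.0.1"): 1,
    ("log_id|001", "time_stamp|11/May/2011:09:52:53"): 1,
    ("log_id|002", "server_ip|157.166.255.18"): 1,
    ("log_id|002", "src_ip|192.168.1.2"): 1,
    ("log_id|002", "time_stamp|12/May/2011:13:24:11"): 1,
    ("log_id|003", "server_ip|74.125.224.72"): 1,
    ("log_id|003", "src_ip|128.0.0.1"): 1,
    ("log_id|003", "time_stamp|13/May/2011:11:05:12"): 1,
    # Hypothetical record outside the queried time window.
    ("log_id|004", "src_ip|10.0.0.9"): 1,
    ("log_id|004", "time_stamp|20/May/2011:08:00:00"): 1,
}

def col_range(t, lo, hi):
    """Entries whose column key lies in the closed range [lo, hi]."""
    return {k: v for k, v in t.items() if lo <= k[1] <= hi}

def row_keys(a):
    """Sorted unique row keys of an associative array."""
    return sorted({row for row, _ in a})

def select_rows(t, rows):
    """All entries of t whose row key is in rows."""
    wanted = set(rows)
    return {k: v for k, v in t.items() if k[0] in wanted}

keys = col_range(T, "time_stamp|10/May/2011:00:00:00",
                 "time_stamp|13/May/2011:23:59:59")
data = select_rows(T, row_keys(keys))

print(row_keys(keys))
print(len(data))
```

Query #1 finds which log entries fall in the time window; Query #2 then pulls every field of those entries, which is the input the graph construction step needs.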
Associative Array Algebra
G = data(:,'src_ip|*').' * data(:,'server_ip|*');
('src_ip|128.0.0.1','server_ip|208.29.69.138',1)
('src_ip|128.0.0.1','server_ip|74.125.224.72',1)
('src_ip|192.168.1.2','server_ip|157.166.255.18',1)
...
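The algebra line multiplies the transposed source-IP columns by the server-IP columns, which correlates the two through their shared log_id rows. A small Python sketch of that product, using dicts in place of D4M associative arrays, is shown below; `cols_starting`, `transpose`, and `matmul` are hypothetical helper names.

```python
# Sketch of G = src' * server on sparse dict-based arrays.
data = {
    ("log_id|001", "src_ip|128.0.0.1"): 1,
    ("log_id|001", "server_ip|208.29.69.138"): 1,
    ("log_id|002", "src_ip|192.168.1.2"): 1,
    ("log_id|002", "server_ip|157.166.255.18"): 1,
    ("log_id|003", "src_ip|128.0.0.1"): 1,
    ("log_id|003", "server_ip|74.125.224.72"): 1,
}

def cols_starting(a, prefix):
    """Sub-array of columns whose key starts with prefix."""
    return {k: v for k, v in a.items() if k[1].startswith(prefix)}

def transpose(a):
    """Swap row and column keys."""
    return {(c, r): v for (r, c), v in a.items()}

def matmul(a, b):
    """Sparse product: sum a[i,k] * b[k,j] over shared inner keys k."""
    b_by_row = {}
    for (k, j), bv in b.items():
        b_by_row.setdefault(k, []).append((j, bv))
    out = {}
    for (i, k), av in a.items():
        for j, bv in b_by_row.get(k, []):
            out[(i, j)] = out.get((i, j), 0) + av * bv
    return out

src = cols_starting(data, "src_ip|")
server = cols_starting(data, "server_ip|")
G = matmul(transpose(src), server)

print(sorted(G.items()))
```

Each entry of G counts the log records in which a given source contacted a given server, so G is the weighted adjacency matrix of the source-to-server graph.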
Adj(G);
• Graphs can be constructed with minimal effort using D4M queries and associative array algebra
Accumulo Ingestion Scalability Study
LLGrid MapReduce With A Python Application
Accumulo Database: 1 Master + 7 Tablet servers
4 Mil e/s
Data #1: 5 GB in 200 files
Data #2: 30 GB in 1000 files
Outline
• Introduction
• Theory
• Results
– Network monitoring example
– Bioinformatics example
• Summary
Relative Cost per DNA Sequence
[Figure labels: Big Data; Energy Efficient; High Volume Sequencer; Portable Sequencer]
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012.
Example Disease Outbreak
May-July 2011: Virulent E. coli outbreak in Germany
[Figure labels: diarrhea, kidney, deaths; events marked: outbreak identified, Spanish cucumbers implicated, DNA sequence released, sprouts identified. Source: www.rki.de EHEC final report]
Conclusions:
• Identification of the E. coli source came too late to have substantial impact on illnesses
• Publishing sequence data allowed the broad community to fully characterize the pathogen
• Sequencing and crowdsourced analysis showed promising potential but were still too slow
Sequence Matching Graph
Sparse Matrix Multiply in D4M
[Figure: sparse matrix multiply $A_1 A_2'$, where $A_1$ is the RNA Reference Set and $A_2$ is the Collected Sample; axes labeled sequence word (10mer) and unknown sequence ID]
• Associative arrays provide a natural framework for sequence matching
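The matching step can be sketched as a sparse product over shared 10mers. The Python below is an illustration only: the sequences are short made-up strings, and `kmers`, `kmer_array`, and `match` are hypothetical names rather than D4M functions. Each array maps (sequence ID, 10mer) to 1, and the product of the reference array with the transposed sample array counts the 10mers each pair has in common.

```python
# Sketch: sequence matching as A1 * A2' over shared 10mer columns.

def kmers(seq, k=10):
    """Set of all length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_array(seqs, k=10):
    """Associative array {(sequence ID, word): 1}."""
    return {(sid, w): 1 for sid, s in seqs.items() for w in kmers(s, k)}

def match(a1, a2):
    """A1 * A2': number of shared words per (reference, sample) pair."""
    samples_by_word = {}
    for sid, word in a2:
        samples_by_word.setdefault(word, []).append(sid)
    out = {}
    for rid, word in a1:
        for sid in samples_by_word.get(word, []):
            out[(rid, sid)] = out.get((rid, sid), 0) + 1
    return out

# Made-up sequences for illustration.
reference = {"ref1": "ACGTACGTACGTTT", "ref2": "GGGGCCCCAAAATTTT"}
sample = {"unk1": "TTACGTACGTACGA"}

A1 = kmer_array(reference)
A2 = kmer_array(sample)
hits = match(A1, A2)

print(hits)
```

Pairs with no shared 10mer never appear in the result, so the output stays sparse even when the reference set is large.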
Database Automatically Computes Reference 10mer Distribution
[Figure labels: 0.5%, 5%, 50%]
• Using the 10mer distribution, the reference 10mers that maximally differentiate sample sequences can be selected quickly, eliminating most 10mers
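One plausible reading of this selection step is to keep only the 10mers that occur in a small fraction of reference sequences, since common words say little about which sequence a sample came from. The sketch below assumes that reading and computes the distribution in memory rather than in the database; `word_fraction`, `select_words`, the placeholder word names, and the 0.5 threshold are all illustrative assumptions.

```python
# Sketch: filter a {(sequence ID, word): 1} array by word frequency.

def word_fraction(a):
    """Fraction of sequences that contain each word."""
    seq_ids = {sid for sid, _ in a}
    counts = {}
    for _, word in a:
        counts[word] = counts.get(word, 0) + 1
    return {word: n / len(seq_ids) for word, n in counts.items()}

def select_words(a, max_fraction):
    """Keep only entries whose word appears in at most max_fraction of sequences."""
    frac = word_fraction(a)
    return {k: v for k, v in a.items() if frac[k[1]] <= max_fraction}

# Placeholder words stand in for real 10mers.
A = {
    ("r1", "w_common"): 1, ("r2", "w_common"): 1,
    ("r3", "w_common"): 1, ("r4", "w_common"): 1,
    ("r1", "w_rare"): 1,
}

fractions = word_fraction(A)
selective = select_words(A, 0.5)

print(fractions)
print(selective)
```

Filtering before the sparse multiply shrinks both operands, which is where most of the work is saved.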
Leveraging “Big Data” Technologies for High Speed Sequence Matching
[Figure: BLAST, D4M, and D4M + Triple Store plotted against code volume (lines), with ticks at 100, 10000, 1000000; the other axis has ticks at 10, 100, 1000, 10000; annotation: 100x smaller]
• High performance triple store database trades computations for lookups
• Used Apache Accumulo database to accelerate comparison by 100x
• Used Lincoln D4M software to reduce code size by 100x
Summary
• Big data is found across a wide range of areas
– Document analysis
– Computer network analysis
– DNA Sequencing
• Currently there is a gap in big data analysis tools for algorithm developers
• D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation