Slides - MIT Database Group

advertisement

D4M-1

Transforming Big Data with D4M

Jeremy Kepner

MIT Lincoln Laboratory

3 October 2012

This work is sponsored by the Department of the Air Force under Air Force Contract

#FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

Nicholas Arcolano

Michelle Beard

Bob Bond

Josh Haines

Matthew Schmidt

Ben Miller

Benjamin O’Gwynn

Tamara Yu

Bill Arcand

Bill Bergeron

Acknowledgements

David Bestor

Chansup Byun

Matt Hubbell

Pete Michaleas

Julie Mullen

Andy Prout

Albert Reuther

Tony Rosa

Charles Yee

Dylan Hutchinson

D4M-2

D4M-3

Introduction

Theory

Results

Summary

Outline

Example Applications of Graph Analytics

ISR Social Cyber

• Graphs represent entities and relationships detected through multi-INT sources

• 1,000s – 1,000,000s tracks and locations

• GOAL: Identify anomalous patterns of life

• Graphs represent relationships between individuals or documents

• 10,000s – 10,000,000s individual and interactions

• GOAL: Identify hidden social networks

• Graphs represent communication patterns of computers on a network

• 1,000,000s – 1,000,000,000s network events

• GOAL: Detect cyber attacks or malicious software

Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets

D4M-4

- Interactive

- On-demand

- Elastic

Enterprise

Four Ecosystems Dominate

Cloud Computing

Big Compute

- High performance

- Parallel Languages

- Scientific computing

- Java

- Map/Reduce

- Easy admin

- Indexing

- Search

- Security

Big Data DBMS

Each ecosystem is at the center of a multi-$B market

Pros/cons of each are numerous; diverging hardware/software

• Some missions can exist wholly in one ecosystem; some can’t

D4M-5

Four Ecosystems Dominate

Cloud Computing

Enterprise LLGrid Big Compute

- Interactive

- On-demand

- Elastic

- High performance

- Parallel Languages

- Scientific computing

MapReduce

- Java

- Map/Reduce

- Easy admin

- Indexing

- Search

- Security

Big Data DBMS

LLGrid MapReduce provides map/reduce interface in a big compute environment

D4M provides an interactive parallel scientific computing environment to databases

D4M-6

Big Data + Big Compute Challenge

Database Worldview

“It’s the data!”

Delivering data is the end

Supercomputing Worldview

“It’s the computer!”

Delivering data is the start

Shared Data Shared Compute

Separate Compute Separate Data

Database and supercomputing views are fundamentally different

Have never coexisted; do not know how to coexist

• Big Data “Analytics” are forcing them together

Current standard practice duplicates hardware and data

D4M-7

Big Data + Big Compute Stack

Novel Analytics for:

Text, Cyber, Bio

High Level Composable API:

D4M (“Databases for Matlab”)

Distributed Database:

Accumulo (triple store)

Weak Signatures,

Noisy Data,

Dynamics

A

B

C

Array

Algebra

E

Distributed

Database/

Distributed File

System

High Performance Computing:

LLGrid + Hadoop

Interactive

Supercomputing

D4M-8

Combining Big Compute and Big Data enables entirely new domains

High Level Language: D4M http://www.mit.edu/~kepner/D4M

Distributed Database

D4M

Dynamic

Distributed

Dimensional

Data

Model

Associative Arrays

Numerical Computing Environment

A

C

B

Query:

Alice

Bob

Cathy

David

Earl

E

D

A D4M query returns a sparse matrix or a graph…

…for statistical signal processing or graph analysis in

MATLAB

D4M-9

D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

D4M-10

Outline

Introduction

Theory

– Associate Arrays

– Incidence Matrix

Results

Summary

Big Tables

What are Spreadsheets and Big Tables?

Spreadsheets

Spreadsheets are the most commonly used analytical structure on Earth

(100M users/day?)

• Big Tables (Google, Amazon, …) store most of the analyzed data in the world

(Exabytes?)

• Simultaneous diverse data: strings, dates, integers, reals , …

• Simultaneous diverse uses: matrices, functions, hash tables, databases, …

• No formal mathematical basis; Zero papers in AMA or SIAM

D4M-11

D4M Key Concept:

Associative Arrays Unify Four Abstractions

Extends associative arrays to 2D and mixed data types

A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0

Key innovation: 2D is 1-to-1 with triple store

('alice ','bob ','cited ') or ('alice ','bob ',47.0)

A T x A T x cited bob bob carl alice  cited carl alice

D4M-12

Composable Associative Arrays

Key innovation: mathematical closure

– All associative array operations return associative arrays

Enables composable mathematical operations

A + B A - B A & B A|B A*B

Enables composable query operations via array indexing

A('alice bob ',:) A('alice ',:) A('al* ',:)

A('alice : bob ',:) A(1:2,:) A == 47.0

Simple to implement in a library (~2000 lines) in programming environments with: 1 st class support of 2D arrays, operator overloading, sparse linear algebra

Complex queries with ~50x less effort than Java/SQL

Naturally leads to high performance parallel implementation

D4M-13

Associative Array Definitions

Keys and values are from the infinite strict totally ordered set

Associative array A( k ) : d

, k =(k 1 ,…,k d ) , is a partial function from d keys (typically 2) to 1 value , where

A( k i

) = v i and

 otherwise

Binary operations on associative arrays A

3 where

=

 f() or

 f()

, have the properties

– If A

1

( k i

) = v

1 and A

2

(k i

) = v

2

, then A

3

( k i

) is v

1

 f() v

2

= f(v

1

,v

2

) or

= A

1 v

1

 f()

A

2

, v

2

= f(v

1

,v

2

)

– If

A

1

( k i

) = v or

 and A

2

( k i

) =

 or v , then A

3

( k i

) is v

 f()

= v or v

 f()

=

High level usage dictated by these definitions

Deeper algebraic properties set by the collision function f()

Frequent switching between “algebras” (how spreadsheets are used)

D4M-14

Theory Questions

Associative arrays can be constructed from a few definitions

Similar to linear algebra, but applicable to a wider range of data

Key questions

– Which linear algebra properties do apply to associative arrays (intuitive)

– Which linear algebra properties do not apply to associative arrays

(watch out)

– Which associative array properties do not apply to linear algebra (new)

Associative

Arrays new intuitive

Linear

Algebra watch out

D4M-15

References

Book: “Graph Algorithms in the Language of Linear Algebra”

Editors: Kepner (MIT-LL) and Gilbert (UCSB)

Contributors:

– Bader (Ga Tech)

– Bliss (MIT-LL)

– Bond (MIT-LL)

– Dunlavy (Sandia)

Faloutsos (CMU)

Fineman (CMU)

Gilbert (USCB)

Heitsch (Ga Tech)

Hendrickson (Sandia)

Kegelmeyer (Sandia)

– Kepner (MIT-LL)

– Kolda (Sandia)

– Leskovec (CMU)

– Madduri (Ga Tech)

– Mohindra (MIT-LL)

Nguyen (MIT)

Radar (MIT-LL)

Reinhardt (Microsoft)

Robinson (MIT-LL)

Shah (USCB)

D4M-16

D4M-17

Outline

Introduction

Theory

– Associate Arrays

– Incidence Matrix

Results

Summary

D4M-18

Digraphs are Black & White

The World is Color

D4M-19

Artist: Ann Pibal; Painting: “XCRS”

Blue

Silver

Green

Orange

Pink

5 Edge Colors

D4M-20

Artist: Ann Pibal; Painting: “XCRS”

V12 V14

V13

20 Vertices

V3 V17 V8 V19

V7

V9 V11 V2 V16 V6

D4M-21

V10

V5

V1 V15 V4 V18

Artist: Ann Pibal; Painting: “XCRS”

V20

1 Isolated Standard Edge

D4M-22

P4

Artist: Ann Pibal; Painting: “XCRS”

12 Multi Edges

D4M-23

Artist: Ann Pibal; Painting: “XCRS”

18 Hyper Edges

P5

P8

D4M-24

O5

P3

P7

P6

Artist: Ann Pibal; Painting: “XCRS”

D4M-25

27 Edge Orderings

O5 < P3,P6,P7,P8

O5 < B1,S1,G1,O1,O2,P1

O5 < B2,S2,G2,O3,O4,P2 < P7,P8

P5

P8

O5

P3

P7

P6

Artist: Ann Pibal; Painting: “XCRS”

52 Standard Multi Edges

P5x2

P8x2

D4M-26

O5x5

P3x3

P7x2

P6x2

Artist: Ann Pibal; Painting: “XCRS”

D4M-27

Summary Observations

Standard edge representation fragments hyper edges

– Information is lost

Digraph representation compresses multi-edges

– Information is lost

Matrix representation drops edge labels

– Information is lost

Standard graph representation drops edge order

– Information is lost

Need edge representation that preserves information

Artist: Ann Pibal; Painting: “XCRS”

Solution: Incidence Matrix

D4M-28

Edge Color Order V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20

B1 Blue 2 1 1 1

S1 Silver 2 1 1 1

G1 Green 2 1 1 1

O1 Orange 2 1 1 1

O2 Orange 2 1 1 1

P1 Pink 2 1 1 1

B2

S2

Blue

Silver

2

2

1 1 1 1 1

1 1 1 1 1

G2

P2

P3

P5

Green

O3 Orange

Pink

Pink

Pink

2

2

O4 Orange 2

2

O5 Orange 1

2

P4 Pink 2

2

P6 Pink 2

1

1

1

1 1 1 1 1

1 1 1 1 1

1 1 1 1 1

1 1 1 1 1

1 1 1

1 1

1

1

1

1

1

1

1

1 1

1

P7 Pink 3

P8 Pink 3

1

1

1

1

1

1

Artist: Ann Pibal; Painting: “XCRS”

D4M-29

Outline

Introduction

Theory

Results

– Network monitoring example

– Bioinformatics example

Summary

Graph Construction Using D4M:

Explode Schema

Raw

Data

Use as row indices

CSV

Files

Distributed

Database

Dense Table log_id

001

002

003 src_ip

128.0.0.1

192.168.1.2

128.0.0.1

server_ip

208.29.69.138

157.166.255.18

74.125.224.72

Assoc.

Arrays

Create columns for each unique type/value pair log_id|001 log_id|002 log_id|003 src_ip|128.0.0.1

src_ip|192.168.1.2

server_ip|157.166.255.18

1 0 0

0

1

1

0

1

0 server_ip|208.29.69.138

1

0

0 server_ip|74.125.224.72

0

0

1

Exploded Table

D4M-30

Graph Construction Using D4M:

Storing Exploded Data as Triples

Raw

Data

CSV

Files

Assoc.

Arrays

Distributed

Database

Exploded Table log_id|001 log_id|002 log_id|003 src_ip|128.0.0.1

src_ip|192.168.1.2

server_ip|157.166.255.18

1 0 0

0

1

1

0

1

0 server_ip|208.29.69.138

1

0

0 server_ip|74.125.224.72

0

0

1

Row log_id|001 log_id|001 log_id|002 log_id|002 log_id|003 log_id|003

D4M stores the triple data representing both the exploded table and its transpose

Table Triples

Column src_ip|128.0.0.1

server_ip|208.29.69.138

src_ip|192.168.1.2

server_ip|157.166.255.18

src_ip|128.0.0.1

server_ip|74.125.224.72

Value

1

1

1

1

1

1

Table Transpose Triples

Row server_ip|157.166.255.18

server_ip|208.29.69.138

server_ip|74.125.224.72

src_ip|128.0.0.1

src_ip|128.0.0.1

src_ip|192.168.1.2

Column log_id|002 log_id|001 log_id|003 log_id|001 log_id|003 log_id|002

Value

1

1

1

1

1

1

D4M-31

D4M-32

Graph Construction Using D4M:

Construct Associative Arrays

Raw

Data

CSV

Files

Distributed

Database

D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ...

’time_stamp|13/May/2011:23:59:59’,);

(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)

(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)

(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)

...

Assoc.

Arrays

Graph Construction Using D4M:

Construct Associative Arrays

Raw

Data

CSV

Files

Distributed

Database

D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ...

’time_stamp|13/May/2011:23:59:59’,);

Assoc.

Arrays

D4M Query #2 data = T(Row(keys), :);

(‘log_id|001’,‘server_ip|208.29.69.138’,1)

(‘log_id|001’,‘src_ip|128.0.0.1’,1)

(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)

...

(‘log_id|002’,‘server_ip|157.166.255.18’,1)

(‘log_id|002’,‘src_ip|192.168.1.2’,1)

(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)

...

(‘log_id|003’,‘server_ip|74.125.224.72’,1)

(‘log_id|003’,‘src_ip|128.0.0.1’,1)

(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)

...

D4M-33

Graph Construction Using D4M:

Construct Associative Arrays

D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ...

’time_stamp|13/May/2011:23:59:59’,);

D4M Query #2 data = T(Row(keys), :);

Associative Array Algebra

G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);

(‘src_ip|128.0.0.1’,‘server_ip|208.29.69.138’,1)

(‘src_ip|128.0.0.1’,‘server_ip|74.125.224.72’,1)

(‘src_ip|192.168.1.2’,‘server_ip|157.166.255.18’,1)

...

D4M-34

D4M-35

Raw

Data

Graph Construction Using D4M:

Construct Associative Arrays

CSV

Files

Assoc.

Arrays

Distributed

Database

D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ...

’time_stamp|13/May/2011:23:59:59’,);

D4M Query #2 data = T(Row(keys), :);

Associative Array Algebra

G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);

Adj(G);

Graphs can be constructed with minimal effort using D4M queries and associative array algebra

D4M-36

Accumulo Ingestion Scalability Study

LLGrid MapReduce With A Python Application

Accumulo Database: 1 Master + 7 Tablet servers

4 Mil e/s

Data #1:

5 GB of 200 files

Data #2:

30 GB of 1000 files

D4M-37

Outline

Introduction

Theory

Results

– Network monitoring example

– Bioinformatics example

Summary

Relative Cost per DNA Sequence

Big Data

Energy Efficient

High Volume Sequencer

Portable

Sequencer

D4M-38

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome

Sequencing Program Available at: www.genome.gov/sequencingcosts. Accessed

03/08/2012

Example Disease Outbreak

May-July 2011 - Virulent E. Coli Outbreak Germany diarrhea kidney

Outbreak identified

Spanish

Cucumbers implicated

DNA

Sequence released

Sprouts

Identified

Deaths www.rki.de EHEC final report

Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses

Publishing sequence data allowed for broad community to fully characterize pathogen

Sequencing and crowd source analysis showed promising potential -> Still too slow

D4M-39

Sequence Matching  Graph 

Sparse Matrix Multiply in D4M

RNA Reference Set Collected Sample

A

1

A

2

A

1

A

2

' sequence word (10mer) sequence word (10mer) unknown sequence ID

• Associative arrays provide a natural framework for sequence matching

D4M-40

Database Automatically Computes

Reference 10mer Distribution

0.5% 5%

50%

• Using 10mer distribution can quickly select reference 10mers that maximally differentiate sample sequences and eliminate most 10mers

D4M-41

Leveraging “Big Data” Technologies for High

Speed Sequence Matching

D4M

10000

BLAST

100x smaller

1000

100

D4M +

Triple Store

10

100 10000 1000000 code volume (lines)

• High performance triple store database trades computations for lookups

• Used Apache Accumulo database to accelerate comparison by 100x

• Used Lincoln D4M software to reduce code size by 100x

D4M-42

Summary

Big data is found across a wide range of areas

– Document analysis

– Computer network analysis

– DNA Sequencing

Currently there is a gap in big data analysis tools for algorithm developers

D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation

D4M-43

Download