Path Processing using
Solid State Storage
Manos Athanassoulis, DIAS, EPFL*
Mustafa Canim, IBM Watson Research Labs
Kenneth Ross, IBM Watson Research Labs, Columbia University
Bishwaranjan Bhattacharjee, IBM Watson Research Labs
*work done during an internship at IBM.
August 2012
© 2012 IBM Corporation
Why Path Processing?
 Applications use linkage information
 Social
 Scientific
 Government
 Financial
 Knowledge
 Watson (Jeopardy Champ)
 Graph processing not enough
 Link type modeled by RDF
Why Solid State Storage (SSS)?
 Increasing capacity
 Exponential increase
 Follows Moore’s law
 Read performance
 Orders of magnitude faster than disks
 Random read performance
 Crucial for path processing
 New technologies
 Flash already mature
 Phase Change Memory (PCM)
 … more technologies are coming
Path processing
1) Cannot prefetch
2) Retrieve-data-then-follow-link
3) A lot of useless data are retrieved
How can Solid State Storage help?
Path processing (and Solid State Storage)
1) Small access latency
2) Read mostly useful data
3) Efficient random IO accesses
4) Can we do something better?
Build SSS-aware systems
In the rest of the talk …
 RDF data model and systems
 Solid State Storage for Path Processing
 Technology
 Flash vs PCM
 Storing and managing RDF data over Solid State Storage
 Conclusions
Resource Description Framework (RDF) meta-data model
Data is represented as statements, each consisting of a triple
Statement: <Subject, Predicate, Object>
Each statement describes a property of a subject:
<“IBM”, “is-a”, “Corporation”>
or a connection between two objects:
<“Manos”, “interned-at”, “IBM”>
or a value of a Property of a Subject:
<“Manos”, “born-in”, “1984”>
The notation is more complex:
• Subjects are Uniform Resource Identifiers (URIs)
• Predicates are URIs
• Objects are either URIs or literals
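The three example statements can be written down as plain triples. The sketch below is only an illustration: the `Triple` type and the `objects_of` helper are hypothetical names, and short strings stand in for the full URIs a real RDF store would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # a URI in real RDF; a short name here for readability
    predicate: str  # a URI in real RDF
    obj: str        # a URI or a literal


# The three statements from the slide.
triples = [
    Triple("IBM", "is-a", "Corporation"),
    Triple("Manos", "interned-at", "IBM"),
    Triple("Manos", "born-in", "1984"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of(triples, "Manos", "interned-at"))
```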
RDF data management
 Two alternatives are used to store data
 Relational RDF storage
• Use existing relational stores
• Create relational tables
• Basic approach: a triple store
  • One big table with three columns
 Native RDF storage
• Tailored to the needs of the specific workload
• No underlying system assumed
Can we take the best of both worlds?
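A minimal sketch of the basic relational approach, assuming SQLite as the relational store (the slides do not name one). The table name `triples` and the helper `follow` are hypothetical. Each hop of a path becomes one more lookup on the same three-column table, which is why access latency matters for this kind of query.

```python
import sqlite3

# One big table with three columns, as in the basic triple-store approach.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("IBM", "is-a", "Corporation"),
        ("Manos", "interned-at", "IBM"),
        ("Manos", "born-in", "1984"),
    ],
)


def follow(conn, start, predicates):
    """Follow a chain of predicates from `start`; one lookup per hop and node."""
    current = {start}
    for predicate in predicates:
        reached = set()
        for subject in current:
            rows = conn.execute(
                "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
                (subject, predicate),
            )
            reached.update(row[0] for row in rows)
        current = reached
    return current


print(follow(conn, "Manos", ["interned-at", "is-a"]))
```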
Outline
 RDF data model and systems
 Solid State Storage for Path Processing
 Technology
 Flash vs PCM
 Storing and managing RDF data over Solid State Storage
 Conclusions
Solid State Storage facts
 We have access to a PCI-based PCM prototype (compared with fusionIO)
4K accesses              PCM prototype*   fusionIO
Read Latency (SW+HW)     36µs             72µs
  STDEV                  3%               60%
Write Latency (SW+HW)    386µs            241µs
  STDEV                  11%              20%
 PCM prototype vs Flash state-of-the-art
4K accesses              PCM prototype*   FusionIO
Read BW                  800MB/s          700MB/s
Read latency (HW)        20µs (4KB)       50µs (512B)
Write BW                 550MB/s          40MB/s
Write latency (HW)       250µs (4KB)      <200µs (4KB)
Endurance                100K cycles      1M cycles
Write Cap (TB/GB)        1000             590
*Very early Micron PCM prototype
Exploiting Solid State Storage for path processing
 Path processing involves link-following queries
• Access latency is critical
 Solid State Storage is tailored for path processing:
• Orders of magnitude lower read latency than traditional storage
• Very fast random-read performance
 PCM is expected to outperform Flash in read performance
 Next in this talk:
• PCM vs Flash when running link-following queries
• Storing and managing RDF data on Solid State Storage
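Why latency dominates can be seen with a back-of-the-envelope calculation. The sketch below assumes a single thread, no caching, and reads that cannot overlap because each one depends on the previous result. It uses the 4K read latencies (SW+HW) reported on the Solid State Storage facts slide. The 100-hop query length and the function name are illustrative choices.

```python
# 4K read latency (SW+HW) in microseconds, from the Solid State Storage facts slide.
READ_LATENCY_US = {"PCM prototype": 36, "fusionIO": 72}


def dependent_read_time_us(hops: int, latency_us: float) -> float:
    """Lower bound on query time when every read must wait for the previous one."""
    return hops * latency_us


HOPS = 100  # illustrative query length
for device, latency in READ_LATENCY_US.items():
    total_ms = dependent_read_time_us(HOPS, latency) / 1000
    print(f"{device}: {total_ms:.1f} ms for {HOPS} dependent reads")
```

Under these assumptions the per-query time scales linearly with the device latency, so halving the latency halves the time of a strictly sequential path query.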
PCM vs Flash in path processing
 Prototype implementation of link-following queries
 Workload: given a randomly generated graph, execute link-following queries of variable length without buffering
 Graph generation
• 5GB of synthetic data with a random number of edges per vertex (between 3 and 30)
 Querying parameters
• Number of threads (1, 2, 4, 8, 16, 32, 64, 96, 128, 192)
• Page size (4K, 8K, 16K, 32K)
• Length of the query (2, 4, 10, 100 accesses per query)
 Hypothesis: PCM can offer important performance improvements
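The workload can be sketched in a few lines. This is not the prototype used in the experiments: it is an in-memory stand-in in which every list lookup represents one unbuffered page read, and the function names, the seed, and the tiny graph size are assumptions made for illustration. Self-loops and duplicate edges are allowed here for simplicity.

```python
import random


def generate_graph(num_vertices, min_edges=3, max_edges=30, seed=0):
    """Random graph: each vertex gets between min_edges and max_edges out-edges."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_vertices)
         for _ in range(rng.randint(min_edges, max_edges))]
        for _ in range(num_vertices)
    ]


def link_following_query(graph, start, accesses, rng):
    """Visit `accesses` vertices, each chosen from the edges of the previous one."""
    visited = []
    vertex = start
    for _ in range(accesses):
        edges = graph[vertex]       # stands for one dependent page read
        visited.append(vertex)
        vertex = rng.choice(edges)  # the next read depends on this result
    return visited


graph = generate_graph(1000)
path = link_following_query(graph, start=0, accesses=100, rng=random.Random(1))
print(len(path), "accesses")
```

The next vertex is only known after the current one has been read, which is why such queries cannot be prefetched.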
PCM vs Flash
Query length: 100 hops
PCM performs consistently better for smaller page granularities
An RDF repository for Solid State Storage
PYTHIA
Building an SSS-aware RDF repository
 We focused on building a graph-based RDF repository
 We need to design a new system which:
• Takes into account the graph structure of the data
• Supports any RDF-based query
 We introduce Pythia, a new RDF repository, which uses:
• The notion of RDF-tuple
• New internal structures
• New data layout
RDF-tuple
<Subject>,
<Predicate1>, {<Object1_1>, <Object1_2>, …},
<Predicate2>, {<Object2_1>, <Object2_2>, …},
…
<PredicateN>, {<ObjectN_1>, <ObjectN_2>, …},
The RDF-tuple design:
• allows us to locate the most important information about a Subject within a single page.
• allows us to avoid repeating redundant information (Subject and Predicate resources)
  • This is further optimized by the URL Dictionary
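A minimal sketch of how flat triples collapse into RDF-tuples. The helper name `build_rdf_tuples` and the nested-dictionary representation are assumptions for illustration. Pythia's on-page format is different and additionally replaces resources with dictionary IDs.

```python
from collections import defaultdict


def build_rdf_tuples(triples):
    """Group triples as subject -> predicate -> list of objects."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    # Convert to plain dicts so the result is easy to print and compare.
    return {s: dict(preds) for s, preds in grouped.items()}


triples = [
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
    ("IBM", "is-a", "Corporation"),
]
rdf_tuples = build_rdf_tuples(triples)
print(rdf_tuples["Manos"])
```

The subject is stored once per tuple and each predicate once per subject, which is the redundancy the slide refers to.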
Pythia
[Figure: Pythia architecture, split across DRAM and SSS. Components shown: Query Engine; Literals Dictionary; URL Dictionary; Hash Index with Main storage (S, P, O); Hash Index with Aux storage (O, P, S); Repository for Very Large Objects.]
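One way to read the two stores: the main storage answers lookups that start from a subject, and the aux storage answers lookups that start from an object. The sketch below is an assumption about how such a pair could be used, not Pythia's implementation. Python dictionaries stand in for the hash indexes and the on-SSS pages, and the predicate names and sample data are made up.

```python
from collections import defaultdict


def build_stores(triples):
    """Build a subject-ordered (S, P, O) map and an object-ordered (O, P, S) map."""
    main = defaultdict(lambda: defaultdict(set))  # S -> P -> {O}
    aux = defaultdict(lambda: defaultdict(set))   # O -> P -> {S}
    for s, p, o in triples:
        main[s][p].add(o)
        aux[o][p].add(s)
    return main, aux


def subjects_with(aux, conditions):
    """Subjects that satisfy every (predicate, object) condition."""
    result = None
    for predicate, obj in conditions:
        matches = aux.get(obj, {}).get(predicate, set())
        result = set(matches) if result is None else result & matches
    return result if result is not None else set()


# Hypothetical sample data.
triples = [
    ("Nikos", "has-gender", "male"),
    ("Nikos", "is-citizen-of", "Greece"),
    ("Maria", "has-gender", "female"),
    ("Maria", "is-citizen-of", "Greece"),
    ("Hans", "has-gender", "male"),
    ("Hans", "is-citizen-of", "Germany"),
]
main, aux = build_stores(triples)
print(subjects_with(aux, [("has-gender", "male"), ("is-citizen-of", "Greece")]))
```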
Data layout on Pythia
[Figure: page and tuple layout. Labels shown: Tuple Metadata; Tuple 0 to Tuple 3; Subject Resource; Predicates dictionary IDs; Objects: (if literal) Literal dictionary ID, (else) Object Resource and pageID, tupleID. Field names shown: LEN, Sptr, nofP, Optr, dicID, nofO, local, ORpt, pID, tID. Resources shown: <Subject>, <Object1_1>, <Object1_2>, <Object2_1>, …]
Storing Yago2 using Pythia
 Yago2 is a semantic knowledge base, introduced by the Max Planck Institute in 2007, derived from Wikipedia, WordNet, and GeoNames (currently ~10M entries, 460M facts).
Yago2 in Pythia
 Initial data: 2.3GB
 Main DB files: 1.3GB
 Large objects: 192MB
• Can be aggressively decreased with page-level compression (tuples will move to main file as well)
 Indexes: 121MB (hash-based, in memory)
 Dictionaries: 569MB
• Possible optimization: take into account the type of literal (now string)
 More than 99% of the SPO tuples can fit in a single 4K page
Evaluating Pythia (Setup & Dataset)
 Prototype C++ implementation
 System setup
• 24-core Intel XEON X560 with Linux x86_64 (2.6.32-28)
• 32GB of memory
• 12GB PCM card (Micron prototype card)
• 74GB Flash card (fusionIO)
 Workload: Yago2
 Queries: a mix of 6 queries with randomized parameters
How often can you ask Pythia?
How fast does Pythia answer?
Pythia vs RDF-3X
 RDF-3X is the de facto research state of the art
 Data is kept in a virtual table and accessed through compressed indexes
• 6 indexes (all permutations of S, P, O): SPO, SOP, PSO, POS, OSP, OPS
• 3 aggregate indexes: (S, Count), (P, Count), (O, Count)
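The index set can be illustrated with a small, uncompressed, in-memory sketch. This is not RDF-3X code: sorted Python lists stand in for its compressed indexes, `collections.Counter` stands in for the aggregate indexes, and the helper name `build_index` is hypothetical.

```python
from collections import Counter
from itertools import permutations

# All six orderings of subject (S), predicate (P), object (O).
orderings = ["".join(p) for p in permutations("SPO")]
print(orderings)

POSITION = {"S": 0, "P": 1, "O": 2}


def build_index(triples, ordering):
    """Return the triples re-ordered by `ordering` and sorted, e.g. 'OPS'."""
    return sorted(tuple(t[POSITION[c]] for c in ordering) for t in triples)


triples = [
    ("IBM", "is-a", "Corporation"),
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
]
indexes = {ordering: build_index(triples, ordering) for ordering in orderings}

# Aggregate index (S, Count): number of triples per subject.
subject_counts = Counter(s for s, _, _ in triples)
print(indexes["OPS"][0], subject_counts["Manos"])
```

With all six orderings available, any triple pattern with bound positions can be answered by a range scan on an index whose prefix matches the bound positions.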
Pythia vs RDF-3X
 Q1: Find all male citizens of Greece.
 Q2: Find all OECD member economies that Switzerland deals with.
 Q3: Find all mafia films that Al Pacino acted in.
 Size on disk for Yago2: raw data 2.3GB
 Pythia: 2.2GB (no compression)
• 1.5GB db files (on disk)
• 0.7GB dictionaries/indexes (loaded in memory during startup)
 RDF-3X: 2.2GB (aggressive compression)
• 2.2GB in a single file (on disk)
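The Pythia totals are consistent with the component sizes listed on the "Storing Yago2 using Pythia" slide. The check below assumes decimal units (1GB = 1000MB), which the slides do not state, and the grouping of components into two dictionaries is only for this calculation.

```python
MB_PER_GB = 1000  # assumption: decimal units

# Component sizes from the "Storing Yago2 using Pythia" slide, in MB.
db_files_mb = {"main DB files": 1300, "large objects": 192}
in_memory_mb = {"indexes": 121, "dictionaries": 569}


def total_gb(parts_mb):
    """Sum the parts and round to one decimal place, as on the slides."""
    return round(sum(parts_mb.values()) / MB_PER_GB, 1)


db_files_gb = total_gb(db_files_mb)
memory_gb = total_gb(in_memory_mb)
overall_gb = total_gb({**db_files_mb, **in_memory_mb})
print(db_files_gb, memory_gb, overall_gb)
```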
Conclusions
 Solid State Storage is naturally tailored for path processing
 PCM, Flash and more new technologies
 PCM’s comparative advantage over Flash is lower read latency
 1.5x-2.5x speedup in a workload with dependent reads
 Pythia: A solid-state-storage-aware path-processing system
 1.5x – 2.5x higher bandwidth on PCM compared to Flash
 1.5x – 2.0x lower response times on PCM compared to Flash
 Competitive against state-of-the-art (RDF-3X)
Thank you!
Pythia (Greek: Πυθία; IPA pɪθiːɑː), commonly known as the Oracle of
Delphi, was the priestess at the Temple of Apollo at Delphi, located
on the slopes of Mount Parnassus, delivering prophecies.