SPARQL Basic Graph Pattern Processing with Iterative MapReduce
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-04-26
Presented by Jaeseok Myung
MapReduce
• MapReduce is easily accessible
  - The Hadoop project provides an open-source MR implementation
• MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
  - Programming Model
    - Map(k, v) -> list(k', v')
    - Reduce(k', list(v')) -> list(v'')
• Useful for Massive Data Processing
Center for E-Business Technology
Copyright © 2010 by CEBT
MDAC 2010 – 2/23
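The Map(k, v) and Reduce(k', list(v')) signatures from the MapReduce slide can be illustrated with a small single-process simulation. This is only a sketch: `run_mapreduce`, `count_predicates_map` and `count_predicates_reduce` are hypothetical names chosen for this example, and the in-memory dictionary merely stands in for the shuffle phase that a real framework such as Hadoop performs across machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in a single process (illustration only)."""
    groups = defaultdict(list)
    # Map phase: each input (k, v) yields zero or more (k', v') pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: collect intermediate values under their key k'.
            groups[out_key].append(out_value)
    # Reduce phase: each (k', list(v')) yields zero or more outputs v''.
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


def count_predicates_map(_line_no, triple):
    # Emit (predicate, 1) for every RDF triple (s, p, o).
    _s, p, _o = triple
    yield p, 1


def count_predicates_reduce(predicate, counts):
    # Sum the ones emitted for this predicate.
    yield predicate, sum(counts)


triples = [
    (0, ("<Prof0>", "rdf:type", "ub:Professor")),
    (1, ("<Prof0>", "ub:name", '"Professor0"')),
    (2, ("<Dept0>", "rdf:type", "ub:Department")),
]
result = run_mapreduce(triples, count_predicates_map, count_predicates_reduce)
print(result)
```

The example counts how often each predicate occurs, which is the RDF analogue of the usual word-count illustration.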
MR & Cloud Computing
• MapReduce is a kind of platform
  - MapReduce utilizes a number of commodity machines
  - There can be a number of applications using MapReduce
[Figure: several applications (App.) running on top of a shared MapReduce layer]
RDF Data Warehouse using MapReduce
• Data Warehouse using MapReduce
  - Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analyses
  - Hive, CloudBase
    - Data warehousing solutions built on top of Hadoop
  - Advantages
    - Scalability
    - Extensibility
    - Fault-tolerance
• My Research Interest
  - RDF Data Warehouse using MapReduce
Why RDF Data Warehouse?
• Flexible Data Model
  - The underlying structure of any expression in RDF is a collection of triples (s, p, o)
• Data Integration
  - RDB-to-RDF (intra)
  - Linked Open Data (inter)
  - Incremental Integration
• Inference
  - We can discover some knowledge from what we already know
  - A goal of data analyses
Approaches & Advantages
[Figure: approaches to building a data warehouse. Conventional DW solutions (centralized, before the cloud) are set against an RDF data warehouse (distributed and parallel via MR, cloud computing). Labels in the figure: support tools; simple; fast; performance optimization; flexibility, integration, inference; complexity; large-scale data analyses; scalability, extensibility, fault-tolerance.]
SPARQL BGP Processing with MapReduce
• Both RDF and MapReduce can benefit a data warehouse
  - RDF is a data model
    - Flexibility, Integration, Inference
  - MapReduce is a programming model
    - Scalability, Extensibility, Fault-tolerance
• It has been difficult to create synergy because only a few algorithms connect the data model and the framework
  - We should focus on an MR algorithm that manipulates RDF datasets
  - A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing
SPARQL Basic Graph Pattern
• SPARQL is a query language for RDF datasets
• A Basic Graph Pattern (BGP) is a set of triple patterns
  - Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor.         TP#1
  ?x ub:worksFor <Department0>.     TP#2
  ?x ub:name ?y1.                   TP#3
  ?x ub:emailAddress ?y2.           TP#4
  ?x ub:telephone ?y3               TP#5
}
(The five triple patterns inside the braces form the BGP.)

• BGP processing is important
  - Most SPARQL queries have one or more BGPs
  - BGPs require expensive join operations among triple patterns
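What it means for an RDF triple to satisfy a triple pattern can be sketched in a few lines of Python. The function name `match_pattern` and the convention that variables are strings starting with "?" are assumptions made for this illustration; they are not taken from the slides.

```python
def match_pattern(pattern, triple):
    """Return variable bindings if `triple` satisfies `pattern`, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable matches anything, but a repeated variable
            # must bind to the same value every time it occurs.
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match the triple's term exactly.
            return None
    return bindings


tp1 = ("?x", "rdf:type", "ub:Professor")
tp3 = ("?x", "ub:name", "?y1")

print(match_pattern(tp1, ("<Prof0>", "rdf:type", "ub:Professor")))
print(match_pattern(tp3, ("<Prof0>", "ub:name", '"Professor0"')))
print(match_pattern(tp1, ("<Dept0>", "rdf:type", "ub:Department")))
```

The first two calls return bindings for ?x (and ?y1); the last returns None because the object constant differs. Joining the bindings produced by different patterns on their shared variables is the expensive part of BGP processing.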
SPARQL BGP Processing with MapReduce
• Two Operations
  - MR-Selection
    - Extracts RDF triples which satisfy at least one triple pattern
  - MR-Join
    - Merges selected triples

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}

[Figure: input triples pass through MR-Selection and then MR-Join]
Input triples:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name “Professor0”
  <Prof0> ub:email “prof0@email.com”
  <Prof0> ub:telephone “000-0000-0000”
  <Dept0> rdf:type ub:Department
  …
After MR-Join (merged on <Prof0>):
  <Prof0> rdf:type ub:Professor, ub:worksFor <Dept0>, ub:name “Professor0”, ub:email “prof0@email.com”, ub:telephone “000-0000-0000”
MR-Selection
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}

public void map() {
    Read a triple (s, p, o)
    // example, s: Prof0 p: rdf:type o: ub:Professor
    for each (triple pattern in a given query) {
        if (input triple satisfies a triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (# of the satisfied triple pattern)
            output (key, value)
        }
    }
}

public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in a list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
    }
}
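The MR-Selection pseudocode can be approximated by the following single-process Python sketch. It is an interpretation rather than the presented implementation: bindings are kept as tuples of (variable, value) pairs instead of the `[x]Prof0` string encoding, only three of the five triple patterns are included to keep the example short, the `ub:worksFor` triple uses the query's `<Department0>` constant so that it matches, and the names `match`, `selection_map`, `selection_reduce` and `mr_selection` are hypothetical.

```python
from collections import defaultdict

# Triple patterns keyed by their number; terms starting with "?" are variables.
QUERY = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
}


def match(pattern, triple):
    """Return a tuple of (variable, value) pairs, or None if there is no match."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if bindings.setdefault(term[1:], value) != value:
                return None
        elif term != value:
            return None
    return tuple(bindings.items())


def selection_map(triple):
    # Emit (bindings, tp_number) for every pattern the triple satisfies.
    for tp_number, pattern in QUERY.items():
        bindings = match(pattern, triple)
        if bindings is not None:
            yield bindings, tp_number


def selection_reduce(bindings, tp_numbers):
    # Re-key each binding by the pattern it satisfies, e.g. "<3>x|y1".
    variables = "|".join(var for var, _ in bindings)
    for tp_number in tp_numbers:
        yield f"<{tp_number}>{variables}", bindings


def mr_selection(triples):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for triple in triples:
        for key, value in selection_map(triple):
            groups[key].append(value)
    return [pair for key, values in groups.items()
            for pair in selection_reduce(key, values)]


triples = [
    ("<Prof0>", "rdf:type", "ub:Professor"),
    ("<Prof0>", "ub:worksFor", "<Department0>"),
    ("<Prof0>", "ub:name", '"Professor0"'),
    ("<Dept0>", "rdf:type", "ub:Department"),
]
selected = mr_selection(triples)
for key, bindings in selected:
    print(key, bindings)
```

The `<Dept0>` triple satisfies no pattern and is dropped; the remaining triples come out keyed by pattern number and variable list.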
MR-Selection
• Conceptually, the MR-Selection algorithm produces temporary tables which satisfy each triple pattern
[Figure: one temporary table per triple pattern: tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3)]
  - A result table has variable names, just as a relational table has attribute names
  - It also has values for the variable names, as does the relational table
• The result table will be used for the next MR-Join operation if necessary
MR-Join: Map
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}
• BGP Analyzer examines a given query before execution and provides join-keys to the map function
  - Join-key (shared variable): ?x
[Figure: the Mapper groups the selected triples by the value of the join-key variable]
Input:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name “Professor0”
  <Prof0> ub:email “prof0@email.com”
  <Prof0> ub:telephone “000-0000-0000”
  <Prof1> ub:email “prof1@email.com”
  <Prof1> ub:telephone “111-1111-1111”
Output, grouped by values of the join-key variable:
  <Prof0>: the five <Prof0> triples
  <Prof1>: the two <Prof1> triples
MR-Join: Map
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}

public void map() {
    read input from MR-Selection
    // example input (<1>x, [x]Prof0)
    // example input (<3>x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and corresponding tp_numbers
    to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers=(1, 2, 3, 4, 5)
    for each (join-key determined by BGP Analyzer) {
        if (input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = <tp>1</tp>[x]Prof0 (# of the satisfied triple pattern, variable name, value)
            // value = <tp>3</tp>[x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
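A possible reading of the MR-Join map step in Python is sketched below. It assumes that MR-Selection records have already been parsed into (tp_number, bindings) pairs rather than the `<1>x` / `[x]Prof0` strings, and that the BGP Analyzer output is available as a dictionary from join-key variable to pattern numbers; `JOIN_KEYS`, `join_map` and the `<Prof1>` record are made up for this example.

```python
from collections import defaultdict

# Hypothetical BGP Analyzer output: join-key variable -> related pattern numbers.
JOIN_KEYS = {"x": {1, 2, 3}}


def join_map(record, join_keys):
    """Re-key one MR-Selection record by the value of each relevant join-key."""
    tp_number, bindings = record
    values = dict(bindings)
    for variable, tp_numbers in join_keys.items():
        # The record is "related to the join-key" if its pattern takes part
        # in the join and it actually binds the join-key variable.
        if tp_number in tp_numbers and variable in values:
            # key = (variable, value); value = (tp_number, bindings)
            yield (variable, values[variable]), (tp_number, bindings)


records = [
    (1, (("x", "<Prof0>"),)),
    (3, (("x", "<Prof0>"), ("y1", '"Professor0"'))),
    (3, (("x", "<Prof1>"), ("y1", '"Professor1"'))),
]

grouped = defaultdict(list)  # stands in for the shuffle phase
for record in records:
    for key, value in join_map(record, JOIN_KEYS):
        grouped[key].append(value)

for key, values in grouped.items():
    print(key, [tp for tp, _ in values])
```

After the shuffle, every record that shares a value for ?x lands in the same group, which is what lets a single reducer call see all candidate rows for that value.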
MR-Join: Reduce
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}
• BGP Analyzer can provide the triple pattern numbers related to the join-key variable by examining a given query
  - Triple pattern numbers related to the join-key variable <x>: 1, 2, 3, 4, 5
[Figure: the Reducer applies these constraints for join-key variable x to the grouped triples]
Input:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name “Professor0”
  <Prof0> ub:email “prof0@email.com”
  <Prof0> ub:telephone “000-0000-0000”
  <Prof1> ub:email “prof1@email.com”
  <Prof1> ub:telephone “111-1111-1111”
Output:
  <Prof0> rdf:type ub:Professor, ub:worksFor <Dept0>, ub:name “Professor0”, ub:email “prof0@email.com”, ub:telephone “000-0000-0000”
MR-Join: Reduce
public void reduce() {
    read input from the Map function
    // example input ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
    get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers=(1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
    } // H will be used for checking whether the input satisfies all related tps.
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
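The reduce step above can be sketched in Python as follows, under the same assumptions as before: values arrive as (tp_number, bindings) pairs, and `required_tps` plays the role of the pattern numbers supplied by the BGP Analyzer. The consistency check on non-key variables is an addition of this sketch, not something stated on the slide, and `join_reduce` is a hypothetical name.

```python
from itertools import product


def join_reduce(key, values, required_tps):
    """Join all bindings that share one join-key value (`key` is kept for symmetry)."""
    table = {}  # the temporary hashtable H: tp_number -> list of bindings
    for tp_number, bindings in values:
        table.setdefault(tp_number, []).append(dict(bindings))
    # Emit nothing unless every pattern to be joined has at least one row.
    if not required_tps.issubset(table):
        return
    ordered = sorted(required_tps)
    # Cartesian product among the rows of each pattern.
    for combo in product(*(table[tp] for tp in ordered)):
        merged = {}
        consistent = True
        for bindings in combo:
            for var, val in bindings.items():
                # Assumption of this sketch: rows that disagree on any
                # shared variable are not joined.
                if merged.setdefault(var, val) != val:
                    consistent = False
        if consistent:
            yield tuple(ordered), merged


prof0 = [
    (1, (("x", "<Prof0>"),)),
    (3, (("x", "<Prof0>"), ("y1", '"Professor0"'))),
]
prof1 = [(3, (("x", "<Prof1>"), ("y1", '"Professor1"')))]

joined_prof0 = list(join_reduce(("x", "<Prof0>"), prof0, {1, 3}))
joined_prof1 = list(join_reduce(("x", "<Prof1>"), prof1, {1, 3}))
print(joined_prof0)
print(joined_prof1)
```

`<Prof0>` has rows for both required patterns and yields one merged row; `<Prof1>` lacks a row for pattern 1, so nothing is emitted for it.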
Join-key Selection Strategies
• BGP Analyzer provides join-key variables by analyzing a query
  - How to select join-key variables?
  - If a BGP has a shared variable
    - We can easily select the variable
  - If a BGP has two or more shared variables
    - We applied two heuristics to select join-key variables
    - Greedy Selection
      - Select a join-key according to the number of related triple patterns
    - Multiple Selection
      - Select join-keys until every triple pattern participates in an MR-Join operation
      - Utilize the distributed and parallel system architecture
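One way the two heuristics could fit together is sketched below. The slides do not spell out the exact procedure, so this is an interpretation: the greedy step picks the shared variable related to the most not-yet-covered triple patterns, and the selection repeats so that several join-keys can be processed in the same MR-Join job. The names `shared_variables` and `select_join_keys`, the alphabetical tie-break, the `ub:alias` pattern and the second query are assumptions of this example.

```python
def shared_variables(query):
    """Map each variable used by two or more patterns to those pattern numbers."""
    usage = {}
    for tp_number, pattern in query.items():
        for term in pattern:
            if term.startswith("?"):
                usage.setdefault(term, set()).add(tp_number)
    return {var: tps for var, tps in usage.items() if len(tps) >= 2}


def select_join_keys(query):
    """Choose join-keys for one MR-Join iteration (greedy, then repeated)."""
    candidates = shared_variables(query)
    uncovered = set(query)
    selected = {}
    while uncovered:
        # Greedy Selection: prefer the variable related to the most
        # uncovered patterns (ties broken by variable name).
        best = max(candidates,
                   key=lambda v: (len(candidates[v] & uncovered), v),
                   default=None)
        if best is None or len(candidates[best] & uncovered) < 2:
            break  # nothing left that joins at least two patterns
        # Multiple Selection: keep adding join-keys over disjoint patterns.
        selected[best] = candidates[best] & uncovered
        uncovered -= selected[best]
    return selected


professor_query = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}
two_stars = {
    1: ("?a", "p", "?b"), 2: ("?a", "q", "?c"),
    3: ("?d", "r", "?e"), 4: ("?d", "s", "?f"),
}
print(select_join_keys(professor_query))
print(select_join_keys(two_stars))
```

For the six-pattern query only ?x is selected in the first iteration, covering patterns 1 to 5; pattern 6 has to wait for a later iteration. The second query yields two independent join-keys that can be handled in the same job.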
SPARQL BGP Processing with MR
• Advantages
  - MapReduce can benefit from the multi-way join technique
    - If triple patterns share a variable, MR can join them all at once
    - It is not unusual that a BGP has several triple patterns sharing the same variable because RDF has a fixed simple data model

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3
}

[Figure (a): a chain of 2-way joins. tp1 ⋈ tp2 gives (x); joining tp3 gives (x, y1); joining tp4 gives (x, y1, y2); joining tp5 gives (x, y1, y2, y3).]
[Figure (b): a single multi-way join over tp1, tp2, tp3, tp4 and tp5 gives (x, y1, y2, y3) directly.]
SPARQL BGP Processing with MR
• Disadvantages
  - If we have two or more shared variables, we need expensive MR iterations
  - The triple patterns in a query cannot all be covered by a single variable

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  1 ?x rdf:type ub:Professor.
  2 ?x ub:worksFor <Department0>.
  3 ?x ub:name ?y1.
  4 ?x ub:emailAddress ?y2.
  5 ?x ub:telephone ?y3.
  6 ?y2 ub:alias ?y4
}

[Figure: a multi-way join of tp1 to tp5 on x gives (x, y1, y2, y3); a second join with tp6 (y2, y4) is then required.]
  - If we have two shared variables, MR iterations cannot be avoided
  - To reduce unnecessary MR iterations, join-key selection strategies should be applied
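Why a second shared variable forces another iteration can be checked with a small counting sketch. It models each triple pattern only by its set of variables and, as a simplifying assumption, joins on one greedily chosen variable per MR-Join iteration; `count_join_iterations` is a hypothetical helper, not part of the presented system.

```python
def count_join_iterations(patterns):
    """Count MR-Join iterations needed when one join-key is used per iteration."""
    relations = [frozenset(p) for p in patterns]
    iterations = 0
    while len(relations) > 1:
        # For each variable, record which intermediate relations contain it.
        usage = {}
        for index, variables in enumerate(relations):
            for var in variables:
                usage.setdefault(var, []).append(index)
        # Greedy choice: the variable shared by the most relations.
        best = max(usage, key=lambda v: (len(usage[v]), v), default=None)
        if best is None or len(usage[best]) < 2:
            raise ValueError("remaining relations share no variable")
        # A multi-way join merges every relation containing the join-key.
        joined = frozenset().union(*(relations[i] for i in usage[best]))
        relations = [r for i, r in enumerate(relations)
                     if i not in usage[best]] + [joined]
        iterations += 1
    return iterations


# Variable sets of the five-pattern query and of the six-pattern variant.
five_patterns = [{"x"}, {"x"}, {"x", "y1"}, {"x", "y2"}, {"x", "y3"}]
six_patterns = five_patterns + [{"y2", "y4"}]

print(count_join_iterations(five_patterns))
print(count_join_iterations(six_patterns))
```

Under these assumptions the five-pattern query needs a single iteration, while adding the pattern over ?y2 and ?y4 raises the count to two.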
Experiment
• Environment
  - LUBM Dataset
  - Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS
• The effect of multi-way join
  - The multi-way join technique reduces the execution time by joining several triple patterns at once
  - Some queries do not show a significant difference because they are too simple to take advantage of multi-way join

Query   2-way     Multi-way   Diff.
Q1      123.391   86.423      36.968
Q2      181.583   104.035     77.548
Q3      69.773    67.214      2.559
Q4      256.591   126.474     130.117
Q5      75.533    74.163      1.37
Q6      44.198    44.526      -0.328
Q7      205.636   135.047     70.589
Q8      232.551   140.414     92.137
Q9      256.031   152.747     103.284
Q10     68.834    73.337      -4.503
Q11     66.834    63.557      3.277
Q12     112.802   86.117      26.685
Q13     73.369    72.825      0.544
Q14     47.092    42.156      4.936
Experiment
• Scalability
  - As the number of machines increases, the average execution time decreases
    - The MR algorithm creates a sufficient number of reducers, so we can utilize a number of machines
  - As we increase the data size, the algorithm shows scalable execution time
Issues & Future Work – Indexing
• Execution Time of MR-Selection and each MR-Join Iteration
  - MR-Selection can be a bottleneck because it takes about 40 seconds
• The underlying storage structure is important
  - N-triple format -> HBase, Partitioning
  - Building an index needs a significant amount of loading time
Issues & Future Work – Pipelining
• Hadoop’s MR implementation materializes intermediate results into the file system
  - This takes a long time because of disk I/O
• Pipelining
  - Allows tasks and jobs to send and receive data without disk I/O
    - Some implementations have become available
      - Hadoop Online Prototype (http://code.google.com/p/hop/)
      - CGL-MapReduce (eScience 2008)
Conclusion
• There still remain many issues
  - This work is still in progress
• Conclusion
  - RDF Data Warehouse using MapReduce
    - RDF: Flexibility, Integration, Inference
    - MapReduce: Scalability, Extensibility, Fault-tolerance
  - SPARQL Processing with MapReduce
    - Synergy effects between RDF and MapReduce
    - Issues
      - System Architecture
      - Loading (Indexing), Pipelining, Encoding, …