Towards Scalable RDF Graph Analytics
on MapReduce
Padmashree Ravindra
Vikas V. Deshpande
Kemafor Anyanwu
{pravind2, vvdeshpa, kogan}@ncsu.edu
COUL - semantic
COmpUting research Lab
Introduction
Growing interest in exploiting RDF data
for decision-making
 Requires support for analytical-style querying
- More complex than traditional SPJ queries
- Often include multiple groupings and / or aggregations
- Next release of SPARQL expected to include such constructs
e.g. : Sales (cust, prod, price, loc, month, year)
* For each prod, count for each month of 2008 the sales that were
between the previous month’s avg sale and the next month’s avg sale
(prev_avg_sale, next_avg_sale)
Prod Month Count
Prod1 Feb 3
* Example from [1]
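The query above needs two groupings over the same relation: one to get the monthly averages, and one to count the sales that fall between the neighbouring averages. A minimal Python sketch of that logic follows. The Sales rows are made up for illustration, months are plain integers, and "between" is read as an inclusive range regardless of which neighbouring average is larger; none of these choices come from the slides.

```python
from collections import defaultdict

# Made-up Sales rows: (cust, prod, price, loc, month, year).
sales = [
    ("c1", "Prod1", 10.0, "NC", 1, 2008),
    ("c2", "Prod1", 20.0, "NC", 1, 2008),   # January average = 15
    ("c1", "Prod1", 12.0, "NC", 2, 2008),
    ("c3", "Prod1", 16.0, "VA", 2, 2008),
    ("c4", "Prod1", 18.0, "VA", 2, 2008),
    ("c5", "Prod1", 24.0, "NC", 2, 2008),
    ("c2", "Prod1", 20.0, "NC", 3, 2008),
    ("c3", "Prod1", 30.0, "VA", 3, 2008),   # March average = 25
]


def monthly_avg(rows, year):
    """First grouping: average price per (prod, month)."""
    sums = defaultdict(lambda: [0.0, 0])
    for _cust, prod, price, _loc, month, yr in rows:
        if yr == year:
            sums[(prod, month)][0] += price
            sums[(prod, month)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def count_between_neighbours(rows, year):
    """Second grouping: per (prod, month), count the sales whose price lies
    between the previous and the next month's average sale."""
    avg = monthly_avg(rows, year)
    counts = defaultdict(int)
    for _cust, prod, price, _loc, month, yr in rows:
        prev_avg = avg.get((prod, month - 1))
        next_avg = avg.get((prod, month + 1))
        if yr != year or prev_avg is None or next_avg is None:
            continue  # no neighbouring month to compare against
        low, high = sorted((prev_avg, next_avg))
        if low <= price <= high:
            counts[(prod, month)] += 1
    return dict(counts)


result = count_between_neighbours(sales, 2008)
```

With this sample data only February has both neighbours, and three of its four sales fall in the range.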
Analytical Query Processing
 Traditional OLAP techniques
 Requires star / snowflake schema
 Enterprise-scale
 But Semantic Web data (RDF)
 Semi-structured (labeled graphs)
Absence of star-like schema
 Billion triple data sets
Goal : Exploit MapReduce-based frameworks to
develop a scalable, cost-effective platform for
Semantic Web analytics.
MapReduce-based Data Processing
 High-level dataflow languages - Pig Latin,
DryadLINQ, HiveQL, JAQL
 Hybrid approach - HadoopDB [5]
 MapReduce in RDF processing
Graph pattern queries [8], [9]
 Graph closure computation [10]
 RAPID [6]
 Succinct expression of complex queries
 Optimize multiple groupings / aggregations
RDF data model
Statements (triples)
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL url1
R1 avgDuration 97
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
UV1 userAgent SCOPE
UV1 cCode VNM
UV1 iCode VNM-KH
UV1 sKeyword comets
UV1 avgTime 3
Graph representation
Groups = Stars (Rankings, UserVisits)
Traditional Querying of RDF
 Graph pattern matching
E.g. Get details about all pages visited by particular users
between “1979/12/01” and “1979/12/30”
[Figure: SPARQL query and its matching graph pattern]
Example Analytical Query on RDF data
Compute the average pageRank and total adRevenue
for all pages visited by a particular srcIP with
visitDate between 1979/12/01 and 1979/12/30
Pattern matching
- Star sub graphs – Rankings, UserVisits
- Join between the stars
Grouping based on value of srcIP property
Aggregation on value of pageRank and adRevenue
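The three stages listed above (pattern matching with a join between the stars, grouping, aggregation) can be sketched end to end in plain Python. This is only an in-memory illustration, not the MapReduce plan: each star is assumed to be already assembled into a dict, every row other than R1 and UV1 is made up, pageURL is assumed unique per page, and dates are compared as `YYYY/MM/DD` strings.

```python
from collections import defaultdict

# One dict per star sub graph. Only R1 and UV1 mirror the slides.
rankings = [
    {"sub": "R1", "pageRank": 11, "pageURL": "url1"},
    {"sub": "R2", "pageRank": 27, "pageURL": "url2"},
]
user_visits = [
    {"sub": "UV1", "srcIP": "158.112.27.3", "destURL": "url1",
     "adRevenue": 339.08142, "visitDate": "1979/12/12"},
    {"sub": "UV2", "srcIP": "158.112.27.3", "destURL": "url2",
     "adRevenue": 100.0, "visitDate": "1979/12/20"},
    {"sub": "UV3", "srcIP": "159.222.21.9", "destURL": "url1",
     "adRevenue": 50.0, "visitDate": "1980/02/02"},
]


def analyze(rankings, user_visits, start, end):
    """Return {srcIP: (average pageRank, total adRevenue)}."""
    by_url = {r["pageURL"]: r for r in rankings}
    groups = defaultdict(list)
    for uv in user_visits:
        if not (start <= uv["visitDate"] <= end):
            continue                        # value-based filter on visitDate
        page = by_url.get(uv["destURL"])    # join: destURL = pageURL
        if page is None:
            continue
        groups[uv["srcIP"]].append((page["pageRank"], uv["adRevenue"]))  # grouping
    return {
        ip: (sum(pr for pr, _ in rows) / len(rows),   # aggregation: AVG
             sum(rev for _, rev in rows))             # aggregation: SUM
        for ip, rows in groups.items()
    }


summary = analyze(rankings, user_visits, "1979/12/01", "1979/12/30")
```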
Pig : Data Processing
- Express data processing tasks using high-level query primitives
  (usability, code reuse, automatic optimization)
Pig Latin data model : atom, tuple, bag (nesting)
Operators : LOAD, STORE, JOIN, GROUP BY,
COGROUP, FOREACH, SPLIT, aggr. functions
 Extensibility support via UDFs
Operators compile into MapReduce jobs
Equijoin on REL A (column 0) and REL B (column 1):
JOIN A by $0, B by $1;
Partition REL A using values in age column ($1):
SPLIT A into minors IF $1 < 18,
majors IF $1 >= 18;
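To make the semantics of the two operators concrete, here is a small Python sketch of what JOIN and SPLIT compute on in-memory lists of tuples. The function names `equijoin` and `split` and the sample rows are hypothetical, and the nested loop says nothing about how Pig actually executes a join.

```python
def equijoin(rel_a, col_a, rel_b, col_b):
    """Concatenate every pair of tuples whose join columns are equal."""
    return [a + b for a in rel_a for b in rel_b if a[col_a] == b[col_b]]


def split(rel, conditions):
    """Route each tuple to every output whose condition it satisfies.
    A tuple may land in several outputs, or in none."""
    outputs = {name: [] for name in conditions}
    for row in rel:
        for name, condition in conditions.items():
            if condition(row):
                outputs[name].append(row)
    return outputs


# Made-up rows: (name, age).
people = [("ann", 17), ("bob", 18), ("cy", 42)]
parts = split(people, {
    "minors": lambda row: row[1] < 18,
    "majors": lambda row: row[1] >= 18,
})

# Made-up relations joined as in "JOIN A by $0, B by $1".
rel_a = [("P1", "x"), ("P2", "y")]
rel_b = [("C1", "P1"), ("C2", "P1")]
joined = equijoin(rel_a, 0, rel_b, 1)
```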
Compiling Pig Latin’s JOIN to MapReduce
JOIN A by $1, B by $0;
REL A
$0 $1
C1 P1
C1 P2
C2 P1
REL B
$0 $1
P1 18
P2 25
map: Annotate based on $1 (join key)
reduce: Package tuples
Reducer 1 (P1)
$0 $1 $2 $3
C1 P1 P1 18
C2 P1 P1 18
Reducer 2 (P2)
$0 $1 $2 $3
C1 P2 P2 25
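The map and reduce steps of this join can be simulated in a few lines of Python. The sketch below uses the REL A and REL B rows from the slide; the `shuffle` helper stands in for Hadoop's partition-and-sort step, and the "A"/"B" source tags are an assumed encoding of the annotation, not Pig's internal format.

```python
from collections import defaultdict
from itertools import product

rel_a = [("C1", "P1"), ("C1", "P2"), ("C2", "P1")]
rel_b = [("P1", 18), ("P2", 25)]


def map_phase(rel_a, rel_b):
    """Annotate every tuple with its join key and a tag naming its source."""
    for row in rel_a:
        yield row[1], ("A", row)   # A joins on $1
    for row in rel_b:
        yield row[0], ("B", row)   # B joins on $0


def shuffle(pairs):
    """Group the annotated tuples by join key (one group per reducer call)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Package the tuples of one key by source, then pair them up."""
    left = [row for tag, row in values if tag == "A"]
    right = [row for tag, row in values if tag == "B"]
    return [a + b for a, b in product(left, right)]


joined = []
for key, values in sorted(shuffle(map_phase(rel_a, rel_b)).items()):
    joined.extend(reduce_phase(values))
```

The reducer for key P1 produces two output tuples and the reducer for key P2 produces one, matching the table above.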
Pattern Matching in Pig : Approach 1
Rankings star pattern: R1 with type Ranking, pageRank 11, pageURL url1
RankingsStarPattern =
JOIN triples1 BY Sub,
triples2 BY Sub,
triples3 BY Sub;
Triple store (triples1, triples2 and triples3 are copies of the same relation)
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
Issues
Rankings star pattern = 3-way self-join
UserVisits star pattern = 5-way self-join
- Self-joins on very large relations -> high I/O costs
- Generate meaningless tuples -> additional filtering step
e.g. (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
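The cost of the self-join is easy to see on the five sample triples. The sketch below builds the 3-way self-join on subject by brute force, which is only viable on toy data, and then applies the extra filtering step; the helper names are hypothetical.

```python
from itertools import product

triples = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "url1"),
    ("UV1", "type", "UserVisits"),
    ("UV1", "srcIP", "158.112.27.3"),
]


def self_join_on_subject(rel, ways):
    """n-way self-join: keep combinations whose triples share one subject."""
    return [combo for combo in product(rel, repeat=ways)
            if len({triple[0] for triple in combo}) == 1]


def is_rankings_star(combo):
    """The additional filtering step: one triple per required property."""
    return (combo[0][1:] == ("type", "Ranking")
            and combo[1][1] == "pageRank"
            and combo[2][1] == "pageURL")


joined = self_join_on_subject(triples, 3)
stars = [combo for combo in joined if is_rankings_star(combo)]
meaningless = (triples[0], triples[0], triples[0])
```

On this input the join yields 35 combinations (27 for R1, 8 for UV1), of which exactly one is the Rankings star.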
Approach 2: Vertical Partitioning
LOAD all the RDF triples
SPLIT (one relation per property)
typeRanking
Sub Prop Obj
R1 type Ranking
R2 type Ranking
pageRank
Sub Prop Obj
R1 pageRank 11
R2 pageRank 27
pageURL
Sub Prop Obj
R1 pageURL url1
R2 pageURL url2
typeUV
Sub Prop Obj
UV1 type UserVisits
UV2 type UserVisits
srcIP
Sub Prop Obj
UV1 srcIP 158.112.27.3
UV2 srcIP 159.222.21.9
destURL
Sub Prop Obj
UV1 destURL url1
UV2 destURL url1
adRev
Sub Prop Obj
UV1 adRev 339.08142
UV2 adRev 330.51248
visitDate
Sub Prop Obj
UV1 visitDate 1979/12/12
UV2 visitDate 1980/02/02
visitDate (after Filter)
Sub Prop Obj
UV1 visitDate 1979/12/12
UV4 visitDate 1979/12/02
Ranking = JOIN
(compute Star Pattern)
UserVisits = JOIN
(compute Star Pattern)
JOIN between Ranking, UserVisits
GROUP BY srcIP
FOREACH group GENERATE aggregations
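A rough Python sketch of the vertical-partitioning flow: split the triples into one partition per property, then rebuild a star by joining the partitions on subject. It assumes every property is single-valued per subject, and the partition naming (`type` plus the class name) is a hypothetical convention chosen for this example.

```python
from collections import defaultdict

triples = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1"),
    ("R2", "type", "Ranking"), ("R2", "pageRank", "27"), ("R2", "pageURL", "url2"),
    ("UV1", "type", "UserVisits"), ("UV1", "visitDate", "1979/12/12"),
    ("UV2", "type", "UserVisits"), ("UV2", "visitDate", "1980/02/02"),
]


def vertical_partition(triples):
    """SPLIT: one subject -> object mapping per property (type is split by class)."""
    parts = defaultdict(dict)
    for sub, prop, obj in triples:
        name = "type" + obj if prop == "type" else prop
        parts[name][sub] = obj
    return parts


def star_join(parts, names):
    """JOIN the named partitions on subject to compute one star pattern."""
    subjects = set(parts[names[0]])
    for name in names[1:]:
        subjects &= set(parts[name])
    return {sub: {name: parts[name][sub] for name in names}
            for sub in sorted(subjects)}


parts = vertical_partition(triples)
ranking = star_join(parts, ["typeRanking", "pageRank", "pageURL"])
```

Each extra property in a star adds another partition to join, which is where the intermediate relations in this approach come from.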
Approach 2: Vertical Partitioning
Issues
- SPLIT : Concurrent sub flows
  Risk of disk spills -> I/O costs
- Structure of intermediate relations
Compilation to MapReduce Jobs
Step 1 : Pattern Matching
Rankings: FILTER, JOIN (map1, reduce1)
UserVisits: FILTER, JOIN (map2, reduce2)
JOIN between the stars (map3, reduce3)
Step 2 : Grouping
GROUP BY (map4)
Step 3 : Aggregation
FOREACH (reduce4)
Our Approach : RAPID+
- Goal : Minimize I/O costs
- Strategy:
  - Concurrent computation of star patterns
    using grouping-based algorithm
  - Can improve efficiency using operator coalescing and look-ahead processing
Concurrent Star Pattern Matching
- Use grouping-based algorithm on a triple storage model
  - GROUP BY Subject
- More efficient with prior filtering of irrelevant triples
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL url1
R1 avgDuration 97
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
UV1 userAgent SCOPE
UV1 cCode VNM
UV1 iCode VNM-KH
UV1 sKeyword comets
UV1 avgTime 3
Compute the average pageRank and total
adRevenue for all pageURLs visited by a
particular srcIP with visitDate between
1979/12/01 and 1979/12/30
Filter irrelevant properties
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
Concurrent Star Pattern Matching -2
Filter irrelevant triples by coalescing LOAD
and FILTER operators
Using Pig Latin: LOAD -> FILTER (map1)
Our Approach (Operator Coalescing): loadFilter (map1)
input = LOAD '\data' using
loadFilter ( pageRank,
pageURL, type:Ranking,
destURL, adRevenue, srcIP,
visitDate, type:UserVisits );
Savings by Coalescing:
- Context switching
- Parameter passing
- Multiple handling of same data
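The idea behind coalescing can be illustrated in Python by comparing a two-operator pipeline with a single combined pass. This is a sketch of the idea only, not the loadFilter implementation: the tab-separated line format, the function names and the rule that `type` triples are matched on their object are all assumptions.

```python
def keep(triple, props, types):
    """True if the triple is relevant: a wanted property or a wanted type."""
    _sub, prop, obj = triple
    if prop == "type":
        return obj in types
    return prop in props


def load_then_filter(lines, props, types):
    """Two operators: LOAD materializes every triple, FILTER rescans them."""
    loaded = [tuple(line.split("\t")) for line in lines]
    return [triple for triple in loaded if keep(triple, props, types)]


def load_filter(lines, props, types):
    """Coalesced operator: parse and test each line in a single pass."""
    for line in lines:
        triple = tuple(line.split("\t"))
        if keep(triple, props, types):
            yield triple


lines = [
    "R1\ttype\tRanking",
    "R1\tpageRank\t11",
    "R1\tavgDuration\t97",
    "UV1\ttype\tUserVisits",
    "UV1\tsrcIP\t158.112.27.3",
    "UV1\tuserAgent\tSCOPE",
]
props = {"pageRank", "pageURL", "destURL", "adRevenue", "srcIP", "visitDate"}
types = {"Ranking", "UserVisits"}
relevant = list(load_filter(lines, props, types))
```

Both versions return the same triples; the coalesced one never holds the irrelevant triples in an intermediate collection.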
Grouping-based Pattern Matching
starSubgraphs = GROUP input BY $0;
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
GROUP BY Subject
BUT heterogeneous bags
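A minimal Python sketch of what grouping by subject produces, using the filtered triples from the slide. One pass yields all star sub graphs at once, but the resulting bags do not share a schema. For brevity the sketch drops the subject from the grouped tuples, which is a simplification made here.

```python
from collections import defaultdict

triples = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "url1"),
    ("UV1", "type", "UserVisits"),
    ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"),
    ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
]


def group_by_subject(triples):
    """One bag of (property, object) pairs per subject."""
    bags = defaultdict(list)
    for sub, prop, obj in triples:
        bags[sub].append((prop, obj))
    return dict(bags)


star_subgraphs = group_by_subject(triples)

# The set of properties differs from bag to bag: the bags are heterogeneous.
shapes = {sub: sorted(prop for prop, _ in bag)
          for sub, bag in star_subgraphs.items()}
```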
Filtering the Groups
BUT all possible sub patterns computed
Filter non-matching sub patterns
- Structure-based filtering:
  eliminate sub graphs with missing properties
  (e.g. missing srcIP)
- Value-based filtering:
  validate each sub graph against filter condition
  (e.g. visitDate between 1979/12/01 and 1979/12/30)
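The two filters can be sketched as predicates over the bags produced by the grouping step. In the Python sketch below the required property sets come from the example query, UV2 and UV3 are made-up sub graphs added to exercise each filter, and dates are compared as `YYYY/MM/DD` strings.

```python
RANKINGS = {"type", "pageRank", "pageURL"}
USER_VISITS = {"type", "srcIP", "destURL", "adRevenue", "visitDate"}

bags = {
    "R1": [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "url1")],
    "UV1": [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
            ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")],
    # Made up: srcIP is missing, so the structure check should reject it.
    "UV2": [("type", "UserVisits"), ("destURL", "url1"),
            ("adRevenue", "330.51248"), ("visitDate", "1979/12/15")],
    # Made up: complete, but visitDate is outside the range.
    "UV3": [("type", "UserVisits"), ("srcIP", "159.222.21.9"), ("destURL", "url2"),
            ("adRevenue", "12.5"), ("visitDate", "1980/02/02")],
}


def structure_ok(bag, required):
    """Structure-based filtering: every required property must be present."""
    return required <= {prop for prop, _ in bag}


def value_ok(bag, start, end):
    """Value-based filtering: any visitDate in the bag must be in range."""
    dates = [obj for prop, obj in bag if prop == "visitDate"]
    return all(start <= date <= end for date in dates)


def filter_groups(bags, start, end):
    kept = {}
    for sub, bag in bags.items():
        types = {obj for prop, obj in bag if prop == "type"}
        required = RANKINGS if "Ranking" in types else USER_VISITS
        if structure_ok(bag, required) and value_ok(bag, start, end):
            kept[sub] = bag
    return kept


matching = filter_groups(bags, "1979/12/01", "1979/12/30")
```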
Joining the Stars : Look-ahead Processing
Star Pattern Matching Cycle
map: Annotate based on Subject
Group by Subject
reduce: Process each bag
- Structure-based and value-based filtering
- Look-ahead: Annotate based on value of join prop
Next Cycle (Joining the Stars)
map: No repeated processing
(process each bag and annotate based on value of join property is already done by the look-ahead)
reduce: Join between the star sub graphs
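A small Python sketch of the look-ahead idea: the reduce step of the star-matching cycle already emits each surviving star keyed by its join value and stripped of properties that later steps do not need, so the next cycle can group on that key without reprocessing the bag. The `JOIN_PROP` and `KEEP` tables and the function names are assumptions made for this example, and properties are assumed single-valued.

```python
from collections import defaultdict

# Which property each star joins on, and which properties later steps need.
JOIN_PROP = {"Ranking": "pageURL", "UserVisits": "destURL"}
KEEP = {"Ranking": {"pageRank"}, "UserVisits": {"srcIP", "adRevenue"}}


def reduce_star(bag):
    """End of cycle 1: look ahead and annotate the star with its join key."""
    props = dict(bag)
    kind = props["type"]
    join_key = props[JOIN_PROP[kind]]
    pruned = {p: v for p, v in props.items() if p in KEEP[kind]}
    return join_key, (kind, pruned)


def join_stars(annotated):
    """Cycle 2: group by the precomputed key, then pair Rankings with UserVisits."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, (kind, star) in annotated:
        groups[key][kind].append(star)
    return [{**ranking, **visit}
            for key in sorted(groups)
            for ranking in groups[key]["Ranking"]
            for visit in groups[key]["UserVisits"]]


r1_bag = [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "url1")]
uv1_bag = [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
           ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")]

annotated = [reduce_star(r1_bag), reduce_star(uv1_bag)]
joined = join_stars(annotated)
```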
Example : Look-ahead Processing
Star Pattern Matching -> Joining the Stars
Structure-based filtering
Value-based filtering
Look-Ahead - Annotate bag based on join key
Join between the star sub graphs
Eliminate properties irrelevant for future processing (join and filter prop)
-> Minimize size of intermediate results
Comparison : Pig vs RAPID+
Pig Approach: Multiple map-reduce cycles
- N star sub graphs -> N cycles
RAPID+: Single cycle
- N star sub graphs -> 1 cycle
Pig Approach: Potential for increased I/O
(i) Disk spills (SPLIT operator)
(ii) Materialization of several intermediate results due to sequential computation of star patterns
RAPID+: Minimized I/O
(i) Filtering in triple storage model + load-filter coalescing
(ii) Concurrent computation of star patterns (single intermediate result)
Pig Approach: Would require advanced optimization techniques
- Introduce project operator to eliminate unneeded columns
RAPID+: Smaller intermediate result sizes
- Eliminate tuples and columns not necessary in future steps of processing
Pig Approach: Not applicable
RAPID+: Minimize repeated tuple handling by look-ahead processing
Case Study
 Setup: 5-node / 20-node Hadoop clusters
on NCSU’s Virtual Computing Lab [13]
 Dataset: Synthetic benchmark data set [4]
 Tasks: Baseline case
 Task A (PM) – basic pattern matching
(2 star patterns and a join between the stars)
 Task B (PM+GA) – pattern matching with
grouping and aggregation (two look-ahead
processing opportunities)
Experimental Results
[Chart: Cost Analysis for Task A (PM), 5-node cluster]
[Chart: Cost Analysis for Task B (PM+GA), 5-node cluster]
Experimental Results
[Chart: Scalability Study, 5-node vs 20-node clusters; labels: 1.8GB per node, 2.8GB per node]
Conclusion and Ongoing work
Promising results even for baseline case
Further opportunities for improvement
 First-class operators vs UDFs
 Exploit combiners during aggregations
 More efficient data structures for processing
bags
 Further look-ahead optimizations during
multiple groupings and aggregations
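As an illustration of the combiner item above: an average cannot be combined directly, but it can be carried as a (sum, count) pair that is merged map-side and finished in the reducer. The Python sketch below shows that general technique on made-up pageRank values; it is not taken from RAPID+.

```python
def combine(values):
    """Map-side combiner: collapse local values into one (sum, count) pair."""
    values = list(values)
    return (sum(values), len(values))


def reduce_avg(partials):
    """Reducer: merge the partial pairs and finish the average."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Made-up pageRank values for one srcIP, spread over two map tasks.
partial_1 = combine([11, 27])
partial_2 = combine([40])
avg_page_rank = reduce_avg([partial_1, partial_2])
```

Only two small pairs cross the network instead of three raw values; the saving grows with the number of tuples per group.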
References
[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. “The MD-join: an operator for Complex OLAP”. ICDE
2001, 108–121
[2] J. Dean and S. Ghemawat. “MapReduce : Simplified Data Processing on Large Clusters”. In Proc. Of
OSDI'04, 2004
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. “Pig Latin: a not-so-foreign language for data
processing”. In Proc. of ACM SIGMOD 2008, p. 1099-1110
[4] A.Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of
Approaches to Large-Scale Data Analysis", In Proc. of SIGMOD 2009
[5] Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A.: HadoopDB: An Architectural Hybrid of
MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
[6] Sridhar, R., Ravindra, P., Anyanwu, K.:RAPID: Enabling scalable ad-hoc analytics on the semantic web.
ISWC 2009
[7] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., and Currey, J.: DryadLINQ: A
system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008
[8] A. Newman, Y. Li, J. Hunter. Scalable Semantics – The Silver Lining of Cloud Computing. eScience, 2008.
IEEE Fourth International Conference on eScience '08. 2008
[9] Newman, A., Hunter, J., Li, Y-F., Bouton, C., Davis, M.: A Scale-Out RDF Molecule Store for
Distributed Processing of Biomedical Data. HCLS'08 at WWW 2008.
[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen, "Scalable Distributed Reasoning using MapReduce,"
in Proceedings of the ISWC ‘09, 2009
[11] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using
Vertical Partitioning. VLDB 2007
[12] Prud'hommeaux, E., Seaborne, A.: SPARQL query language for RDF. Technical report, World Wide Web
Consortium (2005) http://www.w3.org/TR/rdf-sparql-quer
[13] VCL Setup at NC State University, https://vcl.ncsu.edu/
[14] HiveQL, http://hadoop.apache.org/hive/
[15] JAQL, http://code.google.com/p/jaql
[16] RDF, http://www.w3.org/RDF/
Thank You!