Towards Scalable RDF Graph Analytics on MapReduce

Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu
{pravind2, vvdeshpa, kogan}@ncsu.edu
COUL - semantic COmpUting research Lab

Introduction
- Growing interest in exploiting RDF data for decision-making
- Requires support for analytical-style querying
  - More complex than traditional SPJ (select-project-join) queries
  - Often include multiple groupings and/or aggregations
  - Next release of SPARQL expected to include such constructs
- Example, over Sales (Cust, prod, price, loc, month, year) *:
  For each prod and each month of 2008, count the sales that were between the previous month's avg sale and the next month's avg sale, i.e. within (prev_avg_sale, next_avg_sale)

  Prod   Month  Count
  Prod1  Feb    3

  * Example from [1]

Analytical Query Processing
- Traditional OLAP techniques
  - Require a star / snowflake schema
  - Enterprise-scale
- But Semantic Web data (RDF) is different
  - Semi-structured (labeled graphs)
  - Absence of star-like schema
  - Billion-triple data sets
Goal: Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.

MapReduce-based Data Processing
- High-level dataflow languages: Pig Latin, DryadLINQ, HiveQL, JAQL
- Hybrid approach: HadoopDB [5]
- MapReduce in RDF processing
  - Graph pattern queries [8], [9]
  - Graph closure computation [10]
- RAPID [6]
  - Succinct expression of complex queries
  - Optimizes multiple groupings / aggregations

RDF data model
- Statements (triples): Sub, Prop, Obj
- Graph representation

  Sub  Prop         Obj
  R1   type         Ranking
  R1   pageRank     11
  R1   pageURL      url1
  R1   avgDuration  97
  UV1  type         UserVisits
  UV1  srcIP        158.112.27.3
  UV1  destURL      url1
  UV1  adRevenue    339.08142
  UV1  visitDate    1979/12/12
  UV1  userAgent    SCOPE
  UV1  cCode        VNM
  UV1  iCode        VNM-KH
  UV1  sKeyword     comets
  UV1  avgTime      3

- Groups = stars: the triples of R1 form a Rankings star, the triples of UV1 form a UserVisits star

Traditional Querying of RDF
- Graph pattern matching, e.g.:
Get details about all pages visited by particular users between "1979/12/01" and "1979/12/30"
[Figure: SPARQL query and the matching graph pattern]

Example Analytical Query on RDF data
Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
- Pattern matching
  - Star sub graphs: Rankings, UserVisits
  - Join between the stars
- Grouping based on value of srcIP property
- Aggregation on value of pageRank and adRevenue

Pig : Data Processing
- Express data processing tasks using high-level query primitives
  - Usability, code reuse, automatic optimization
- Pig Latin data model: atom, tuple, bag (nesting)
- Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggr. functions
- Extensibility support via UDFs
- Operators compile into MapReduce jobs
- Equijoin on REL A (column 0) and REL B (column 1):
    JOIN A by $0, B by $1;
- Partition REL A using values in the age column ($1):
    SPLIT A into minors IF $1 < 18, majors IF $1 >= 18;

Compiling Pig Latin's JOIN to MapReduce
    JOIN A by $1, B by $0;

  REL A
  $0  $1
  C1  P1
  C1  P2
  C2  P1

  REL B
  $0  $1
  P1  18
  P2  25

- map: annotate each tuple based on its join key (A.$1, B.$0)
- reduce: package the tuples that share a join key
  - Reducer 1 (key P1): (C1, P1, P1, 18), (C2, P1, P1, 18)
  - Reducer 2 (key P2): (C1, P2, P2, 25)

  Result
  $0  $1  $2  $3
  C1  P1  P1  18
  C2  P1  P1  18
  C1  P2  P2  25

Pattern Matching in Pig : Approach 1
    RankingsStarPattern = JOIN triples1 ON Sub, triples2 ON Sub, triples3 ON Sub;
- Target star (Rankings): R1 with type Ranking, pageRank 11, pageURL url1
- triples1, triples2 and triples3 are three copies of the triple store:

  Sub  Prop      Obj
  R1   type      Ranking
  R1   pageRank  11
  R1   pageURL   url1
  UV1  type      UserVisits
  UV1  srcIP     158.112.27.3

Issues
- Rankings star pattern = 3-way self-join
- UserVisits star pattern = 5-way self-join
- Self-joins on very large relations lead to high I/O costs
- They generate meaningless tuples, which require an additional filtering step, e.g. (R1, type,
Ranking, R1, type, Ranking, R1, type, Ranking)

Approach 2: Vertical Partitioning
Dataflow:
- LOAD all the RDF triples
- SPLIT the triples into one relation per property
- FILTER on visitDate
- Ranking = JOIN (compute Star Pattern)
- UserVisits = JOIN (compute Star Pattern)
- JOIN between Ranking, UserVisits
- GROUP BY srcIP
- FOREACH group GENERATE aggregations

Relations produced by SPLIT, shown as (Sub, Prop, Obj) triples:
- typeRanking: (R1, type, Ranking), (R2, type, Ranking)
- pageURL: (R1, pageURL, url1), (R2, pageURL, url2)
- pageRank: (R1, pageRank, 11), (R2, pageRank, 27)
- typeUV: (UV1, type, UserVisits), (UV2, type, UserVisits)
- destURL: (UV1, destURL, url1), (UV2, destURL, url1)
- srcIP: (UV1, srcIP, 158.112.27.3), (UV2, srcIP, 159.222.21.9)
- adRev: (UV1, adRev, 339.08142), (UV2, adRev, 330.51248)
- visitDate: (UV1, visitDate, 1979/12/12), (UV2, visitDate, 1980/02/02)
- visitDate after FILTER: (UV1, visitDate, 1979/12/12), (UV4, visitDate, 1979/12/02)

Issues
- SPLIT: concurrent sub flows carry a risk of disk spills, which raises I/O costs
- Structure of intermediate relations

Compilation to MapReduce Jobs
- map1 / reduce1: FILTER, JOIN (Rankings)
- map2 / reduce2: FILTER, JOIN (UserVisits)
- map3 / reduce3: JOIN
- map4 / reduce4: GROUP BY, FOREACH
Step 1: Pattern Matching. Step 2: Grouping. Step 3: Aggregation.

Our Approach : RAPID+
Goal: Minimize I/O costs
Strategy:
- Concurrent computation of star patterns using a grouping-based algorithm
- Efficiency can be improved using operator coalescing and look-ahead processing

Concurrent Star Pattern Matching
- Use a grouping-based algorithm on a triple storage model:
  GROUP BY Subject
- More efficient with prior filtering of irrelevant triples

  Sub  Prop         Obj
  R1   type         Ranking
  R1   pageRank     11
  R1   pageURL      url1
  R1   avgDuration  97
  UV1  type         UserVisits
  UV1  srcIP        158.112.27.3
  UV1  destURL      url1
  UV1  adRevenue    339.08142
  UV1  visitDate    1979/12/12
  UV1  userAgent    SCOPE
  UV1  cCode        VNM
  UV1  iCode        VNM-KH
  UV1  sKeyword     comets
  UV1  avgTime      3

  (R1 rows form the Ranking group, UV1 rows form the UserVisits group)

Query: Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

After filtering irrelevant properties:

  Sub  Prop       Obj
  R1   type       Ranking
  R1   pageRank   11
  R1   pageURL    url1
  UV1  type       UserVisits
  UV1  srcIP      158.112.27.3
  UV1  destURL    url1
  UV1  adRevenue  339.08142
  UV1  visitDate  1979/12/12

Concurrent Star Pattern Matching - 2
- Filter irrelevant triples by coalescing the LOAD and FILTER operators
- Using Pig Latin: map1 runs LOAD, then FILTER
- Our approach (operator coalescing): map1 runs a single loadFilter operator
    input = LOAD '\data' using loadFilter (pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits)
- Savings by coalescing:
  - Context switching
  - Parameter passing
  - Multiple handling of the same data

Grouping-based Pattern Matching
    starSubgraphs = GROUP input BY $0;
- GROUP BY Subject over the eight filtered triples listed above yields one bag per subject (R1, UV1)
- BUT the bags are heterogeneous

Filtering the Groups
- BUT all possible sub patterns are computed, so non-matching sub patterns must be filtered out
- Structure-based filtering: eliminate sub graphs with missing properties (e.g. missing srcIP)
- Value-based filtering: validate each sub graph against the filter condition (visitDate between 1979/12/01 and 1979/12/30)

Joining the Stars : Look-ahead Processing
Star Pattern Matching cycle
- map: annotate based on Subject
- Group by Subject
- reduce: process each bag
  - Structure-based and value-based filtering
  - Annotate based on value of join prop
Next cycle
(Joining the Stars)
- map: annotate based on value of join property; each bag was already processed and annotated in the previous reduce, so there is no repeated processing
- reduce: join between the star sub graphs

Example : Look-ahead Processing
Star Pattern Matching
- Structure-based filtering
- Value-based filtering
- Look-ahead: annotate bag based on join key
- Eliminate properties irrelevant for future processing (keep join and filter props)
- Minimize size of intermediate results
Joining the Stars
- Join between the star sub graphs

Comparison : Pig vs RAPID+
- Map-reduce cycles
  - Pig approach: multiple cycles, N star sub graphs need N cycles
  - RAPID+: single cycle, N star sub graphs need 1 cycle
- I/O
  - Pig approach: potential for increased I/O through (i) disk spills (SPLIT operator) and (ii) materialization of several intermediate results due to sequential computation of star patterns
  - RAPID+: minimized I/O through (i) filtering in the triple storage model plus load-filter coalescing and (ii) concurrent computation of star patterns (single intermediate result)
- Intermediate result sizes
  - Pig approach: would require advanced optimization techniques, e.g. introducing a project operator to eliminate unneeded columns
  - RAPID+: smaller intermediate results, since tuples and columns not necessary in future processing steps are eliminated
- Repeated tuple handling
  - Pig approach: not applicable
  - RAPID+: minimized by look-ahead processing

Case Study
- Setup: 5-node / 20-node Hadoop clusters on NCSU's Virtual Computing Lab [13]
- Dataset: synthetic benchmark data set [4]
- Tasks (baseline case):
  - Task A (PM): basic pattern matching (2 star patterns and a join between the stars)
  - Task B (PM+GA): pattern matching with grouping and aggregation (two look-ahead processing opportunities)

Experimental Results
[Figure: Cost analysis for Task A (PM), 5-node cluster]
[Figure: Cost analysis for Task B (PM+GA), 5-node cluster]

Experimental Results
Scalability Study: 5-node vs 20-node clusters
[Figure: scalability charts for 1.8GB per node and 2.8GB per node]

Conclusion and Ongoing work
- Promising results even for the baseline case
- Further opportunities for improvement
  - First-class operators vs UDFs
  - Exploit combiners during aggregations
  - More efficient data structures for processing bags
  - Further look-ahead optimizations
    during multiple groupings and aggregations

References
[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. "The MD-join: an operator for Complex OLAP". ICDE 2001, pp. 108-121
[2] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In Proc. of OSDI 2004
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing". In Proc. of ACM SIGMOD 2008, pp. 1099-1110
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis". In Proc. of SIGMOD 2009
[5] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, and A. Silberschatz. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads". VLDB 2009
[6] R. Sridhar, P. Ravindra, and K. Anyanwu. "RAPID: Enabling scalable ad-hoc analytics on the semantic web". ISWC 2009
[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language". OSDI 2008
[8] A. Newman, Y. Li, and J. Hunter. "Scalable Semantics - The Silver Lining of Cloud Computing". IEEE Fourth International Conference on eScience, 2008
[9] A. Newman, J. Hunter, Y.-F. Li, C. Bouton, and M. Davis. "A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data". HCLS'08 at WWW 2008
[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. "Scalable Distributed Reasoning using MapReduce". In Proc. of ISWC 2009
[11] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. "Scalable Semantic Web Data Management Using Vertical Partitioning". VLDB 2007
[12] E. Prud'hommeaux and A. Seaborne. "SPARQL query language for RDF".
    Technical report, World Wide Web Consortium (2005), http://www.w3.org/TR/rdf-sparql-quer
[13] VCL Setup at NC State University, https://vcl.ncsu.edu/
[14] HiveQL, http://hadoop.apache.org/hive/
[15] JAQL, http://code.google.com/p/jaql
[16] RDF, http://www.w3.org/RDF/

Thank You!
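
Backup: grouping-based star pattern matching, sketched in Python

The steps described on the RAPID+ slides (coalesced load and filter, GROUP BY Subject, structure-based and value-based filtering, join between the stars, grouping and aggregation) can be illustrated with a small in-memory emulation. This is only a sketch and not the RAPID+ implementation: it is plain Python rather than Pig Latin or MapReduce, the function names (load_filter, group_by_subject, match_stars, analyze) are hypothetical, each property is assumed to be single-valued per subject, and the join condition destURL = pageURL is an assumption based on the example data.

```python
from collections import defaultdict
from statistics import mean

# Sample (Sub, Prop, Obj) triples taken from the slides' example table.
TRIPLES = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "url1"),
    ("R1", "avgDuration", "97"),
    ("UV1", "type", "UserVisits"),
    ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"),
    ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
    ("UV1", "userAgent", "SCOPE"),
]

# Properties each star pattern must contain (used for structure-based filtering).
RANKING_PROPS = {"type", "pageRank", "pageURL"}
VISIT_PROPS = {"type", "srcIP", "destURL", "adRevenue", "visitDate"}
RELEVANT_PROPS = RANKING_PROPS | VISIT_PROPS


def load_filter(triples):
    """Emulate the coalesced LOAD + FILTER: keep only relevant properties."""
    return [t for t in triples if t[1] in RELEVANT_PROPS]


def group_by_subject(triples):
    """Emulate GROUP BY Subject: one prop -> value dict (a star) per subject."""
    stars = defaultdict(dict)
    for sub, prop, obj in triples:
        stars[sub][prop] = obj  # assumption: single-valued properties
    return stars


def match_stars(stars, lo="1979/12/01", hi="1979/12/30"):
    """Apply structure-based, then value-based filtering to each bag."""
    rankings, visits = [], []
    for star in stars.values():
        if star.get("type") == "Ranking" and RANKING_PROPS.issubset(star):
            rankings.append(star)
        elif star.get("type") == "UserVisits" and VISIT_PROPS.issubset(star):
            # YYYY/MM/DD strings compare correctly in lexicographic order.
            if lo <= star["visitDate"] <= hi:
                visits.append(star)
    return rankings, visits


def analyze(triples):
    """Join the stars, group by srcIP, return (avg pageRank, total adRevenue)."""
    rankings, visits = match_stars(group_by_subject(load_filter(triples)))
    by_url = defaultdict(list)  # index Rankings stars by the join key
    for r in rankings:
        by_url[r["pageURL"]].append(r)
    grouped = defaultdict(list)
    for v in visits:
        for r in by_url.get(v["destURL"], []):  # assumed join: destURL = pageURL
            grouped[v["srcIP"]].append((float(r["pageRank"]), float(v["adRevenue"])))
    return {
        ip: (mean(rank for rank, _ in rows), sum(rev for _, rev in rows))
        for ip, rows in grouped.items()
    }
```

Calling analyze(TRIPLES) returns a dictionary keyed by srcIP. A UserVisits subject that lacks one of the required properties (for example a missing srcIP) is dropped by the structure-based check before any join work is done, which mirrors the filtering order on the slides.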