SPARQL Basic Graph Pattern Processing with Iterative MapReduce
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-04-26
Presented by Jaeseok Myung

MapReduce
• MapReduce is easily accessible
  – The Hadoop project provides an open-source MR implementation
• MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
  – Programming model
      Map(k, v) -> list(k', v')
      Reduce(k', list(v')) -> list(v'')
• Useful for massive data processing

Center for E-Business Technology Copyright 2010 by CEBT MDAC 2010 – 2/23

MR & Cloud Computing
• MapReduce is a kind of platform
  – MapReduce utilizes a number of commodity machines
  – There can be a number of applications using MapReduce
[Figure: several applications (App.) running on top of MapReduce]

RDF Data Warehouse using MapReduce
• Data Warehouse using MapReduce
  – Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analyses
  – Hive, CloudBase: data warehousing solutions built on top of Hadoop
  – Advantages: scalability, extensibility, fault tolerance
• My research interest
  – RDF Data Warehouse using MapReduce

Why RDF Data Warehouse?
• Flexible data model
  – The underlying structure of any expression in RDF is a collection of triples (s, p, o)
• Data integration
  – RDB-to-RDF (intra)
  – Linked Open Data (inter)
  – Incremental integration
• Inference
  – We can discover some knowledge from what we already know
  – A goal of data analyses

Approaches & Advantages
[Figure: building a data warehouse. Conventional DW solutions (centralized, before the cloud) are compared with an RDF data warehouse (distributed & parallel via MR, cloud computing). Labels in the figure: support tools; simple; fast; performance optimization; flexibility, integration, inference; complexity; large-scale data analyses; scalability, extensibility, fault tolerance.]

SPARQL BGP Processing with MapReduce
• Both RDF and MapReduce can benefit a data warehouse
  – RDF is a data model: flexibility, integration, inference
  – MapReduce is a programming model: scalability, extensibility, fault tolerance
• It has been difficult to create synergy because only a few algorithms connect the data model and the framework
• We should focus on an MR algorithm that manipulates RDF datasets
  – A MapReduce algorithm for SPARQL Basic Graph Pattern processing

SPARQL Basic Graph Pattern
• SPARQL is a query language for RDF datasets
• A Basic Graph Pattern (BGP) is a set of triple patterns
  – Triple patterns are similar to RDF triples (s, p, o), except that each of the subject, predicate and object can be a variable
  – In the example query below, the five triple patterns inside WHERE { } form one BGP

    SELECT ?x ?y1 ?y2 ?y3
    WHERE {
      ?x rdf:type ub:Professor .            # TP#1
      ?x ub:worksFor <Department0> .        # TP#2
      ?x ub:name ?y1 .                      # TP#3
      ?x ub:emailAddress ?y2 .              # TP#4
      ?x ub:telephone ?y3                   # TP#5
    }

• BGP processing is important
  – Most SPARQL queries have one or more BGPs
  – BGPs require expensive join operations among triple patterns

SPARQL BGP Processing with MapReduce
• Two operations
  – MR-Selection: extracts RDF triples which satisfy at least one triple pattern
  – MR-Join: merges selected triples
• Example (using the query above)

    Input triples:
      <Prof0> rdf:type ub:Professor
      <Prof0> ub:worksFor <Dept0>
      <Prof0> ub:name "Professor0"
      <Prof0> ub:email "prof0@email.com"
      <Prof0> ub:telephone "000-0000-0000"
      <Dept0> rdf:type ub:Department
      …

    After MR-Selection and MR-Join:
      <Prof0>  rdf:type      ub:Professor
               ub:worksFor   <Dept0>
               ub:name       "Professor0"
               ub:email      "prof0@email.com"
               ub:telephone  "000-0000-0000"

MR-Selection

    public void map() {
      read a triple (s, p, o)
      // example: s = Prof0, p = rdf:type, o = ub:Professor
      for each (triple pattern in the given query) {
        if (input triple satisfies the triple pattern) {
          make a key and a value
          // key = [x]Prof0 (variable name, value)
          // value = 1 (number of the satisfied triple pattern)
          output (key, value)
        }
      }
    }

    public void reduce() {
      read input from the map function
      // input format: (key, list(satisfied tp_numbers))
      for each (value in the list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
      }
    }

MR-Selection
• Conceptually, the MR-Selection algorithm produces one temporary table per triple pattern, holding the variable bindings which satisfy that pattern

    Triple pattern   Columns
    tp1              x
    tp2              x
    tp3              x, y1
    tp4              x, y2
    tp5              x, y3

• A result table has variable names, just as a relational table has attribute names
• It also has values for the variable names, as does the relational table
• The result table will be used for the next MR-Join operation if necessary

MR-Join: Map
• The BGP Analyzer examines a given query before execution and provides join-keys to the map function
  – Join-key (shared variable): ?x
  – Values of the join-key variable: <Prof0>, <Prof1>
• The mapper groups the selected triples by the value of the join-key variable

    <Prof0>  rdf:type ub:Professor
             ub:worksFor <Dept0>
             ub:name "Professor0"
             ub:email "prof0@email.com"
             ub:telephone "000-0000-0000"
    <Prof1>  ub:email "prof1@email.com"
             ub:telephone "111-1111-1111"

MR-Join: Map

    public void map() {
      read input from MR-Selection
      // example input: (<1>x, [x]Prof0)
      // example input: (<3>x|y1, [x]Prof0|[y1]Professor0)
      get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
      // example: join-key = x, tp_numbers = (1, 2, 3, 4, 5)
      for each (join-key determined by the BGP Analyzer) {
        if (input is related to the join-key) {
          make a key and a value
          // key = [x]Prof0 (variable name, value)
          // value = <tp>1</tp>[x]Prof0 (number of the satisfied triple pattern, variable name, value)
          // value = <tp>3</tp>[x]Prof0|[y1]Professor0
          output (key, value)
        }
      }
    }

MR-Join: Reduce
• The BGP Analyzer can provide the triple pattern numbers related to the join-key variable by examining a given query
  – Constraint for join-key variable x: triple patterns 1, 2, 3, 4, 5
• The reducer keeps only join-key values that satisfy every related triple pattern, so <Prof1> (email and telephone only) is dropped

    <Prof0>  rdf:type ub:Professor
             ub:worksFor <Dept0>
             ub:name "Professor0"
             ub:email "prof0@email.com"
             ub:telephone "000-0000-0000"

MR-Join: Reduce

    public void reduce() {
      read input from the map function
      // example input: ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
      get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
      // example: join-key = x, tp_numbers = (1, 2, 3, 4, 5)
      create a temporary hashtable H
      for each (value in values) {
        add an element to H
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
      }
      // H is used to check whether the input satisfies all related tps
      if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among the values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
      }
    }

Join-key Selection Strategies
• The BGP Analyzer provides join-key variables by analyzing a query
• How to select join-key variables?
  – If a BGP has one shared variable, we can simply select that variable
  – If a BGP has two or more shared variables, we apply two heuristics to select join-key variables
• Greedy Selection
  – Select a join-key according to the number of related triple patterns
• Multiple Selection
  – Select join-keys until every triple pattern participates in an MR-Join operation
  – Utilize the distributed and parallel system architecture

SPARQL BGP Processing with MR: Advantages
• MapReduce can benefit from the multi-way join technique
  – If triple patterns share a variable, MR can join them all at once
  – It is not unusual for a BGP to have several triple patterns sharing the same variable, because RDF has a fixed, simple data model
[Figure: two join plans for the example query.
 (a) A chain of two-way joins: tp1 ⋈ tp2 gives (x); joining tp3 gives (x, y1); joining tp4 gives (x, y1, y2); joining tp5 gives (x, y1, y2, y3).
 (b) A single multi-way join of tp1, tp2, tp3, tp4 and tp5 on x gives (x, y1, y2, y3).]

SPARQL BGP Processing with MR: Disadvantages
• If we have two or more shared variables, we need expensive MR iterations
  – The triple patterns in a query cannot all be covered by a single variable

    SELECT ?x ?y1 ?y2 ?y3
    WHERE {
      ?x rdf:type ub:Professor .            # 1
      ?x ub:worksFor <Department0> .        # 2
      ?x ub:name ?y1 .                      # 3
      ?x ub:emailAddress ?y2 .              # 4
      ?x ub:telephone ?y3 .                 # 5
      ?y2 ub:alias ?y4                      # 6
    }

[Figure: a multi-way join of tp1 to tp5 on x gives (x, y1, y2, y3); a second join then combines that result with tp6 (y2, y4).]
• If we have two shared variables, MR iterations cannot be avoided
• To reduce unnecessary MR iterations, join-key selection strategies should be applied

Experiment
• Environment
  – LUBM dataset
  – Amazon EC2, Cloudera's Hadoop Distribution, Amazon EBS
• The effect of multi-way join
  – The multi-way join technique reduces the execution time by joining several triple patterns at once
  – Some queries do not show a significant difference because they are too simple to take advantage of multi-way join

    Execution time per query (Diff. = 2-way minus multi-way)

    Query   2-way     Multi-way   Diff.
    Q1      123.391    86.423      36.968
    Q2      181.583   104.035      77.548
    Q3       69.773    67.214       2.559
    Q4      256.591   126.474     130.117
    Q5       75.533    74.163       1.370
    Q6       44.198    44.526      -0.328
    Q7      205.636   135.047      70.589
    Q8      232.551   140.414      92.137
    Q9      256.031   152.747     103.284
    Q10      68.834    73.337      -4.503
    Q11      66.834    63.557       3.277
    Q12     112.802    86.117      26.685
    Q13      73.369    72.825       0.544
    Q14      47.092    42.156       4.936

Experiment: Scalability
• As the number of machines increases, the average execution time decreases
  – The MR algorithm creates a sufficient number of reducers, so it can utilize a number of machines
• As the data size increases, the algorithm's execution time remains scalable

Issues & Future Work – Indexing
• Execution time of MR-Selection and each MR-Join iteration
  – MR-Selection can be a bottleneck because it takes about 40 seconds
• The underlying storage structure is important
  – N-triple format -> HBase, partitioning
  – Building an index needs a significant amount of loading time

Issues & Future Work – Pipelining
• Hadoop's MR implementation materializes intermediate results into the file system
  – This takes a long time because of disk I/O
• Pipelining
  – Allows data to be sent and received between tasks and between jobs without disk I/O
  – Some implementations have become available
      Hadoop Online Prototype (http://code.google.com/p/hop/)
      CGL-MapReduce (eScience 2008)

Conclusion
• There still remain many issues
  – This work is still in progress
• Conclusion
  – RDF Data Warehouse using MapReduce
      RDF: flexibility, integration, inference
      MapReduce: scalability, extensibility, fault tolerance
  – SPARQL Processing with MapReduce
      Synergy effects between RDF and MapReduce
  – Issues
      System architecture
      Loading (indexing), pipelining, encoding, …
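
The MR-Selection and MR-Join operations described in this talk can be illustrated with a minimal single-process sketch in plain Python. It is not the Hadoop implementation from the talk: the function names (match, mr_selection, mr_join) and the toy triples are assumptions made for illustration, the data uses the query's own names (ub:emailAddress, <Department0>) rather than the abbreviated ones in the figures, and the map and reduce phases are simulated with in-memory dictionaries.

```python
from collections import defaultdict
from itertools import product

# Triple patterns of the example query; terms starting with "?" are variables.
PATTERNS = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
}

# Toy data (assumed for illustration), written with the query's own names.
TRIPLES = [
    ("<Prof0>", "rdf:type", "ub:Professor"),
    ("<Prof0>", "ub:worksFor", "<Department0>"),
    ("<Prof0>", "ub:name", "Professor0"),
    ("<Prof0>", "ub:emailAddress", "prof0@email.com"),
    ("<Prof0>", "ub:telephone", "000-0000-0000"),
    ("<Prof1>", "ub:emailAddress", "prof1@email.com"),
    ("<Prof1>", "ub:telephone", "111-1111-1111"),
    ("<Department0>", "rdf:type", "ub:Department"),
]


def match(pattern, triple):
    """Return the variable bindings if the triple satisfies the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None  # one variable cannot take two different values
        elif term != value:
            return None
    return bindings


def mr_selection(triples, patterns):
    """Return {tp_number: [bindings, ...]}: one temporary table per triple pattern."""
    tables = defaultdict(list)
    for triple in triples:  # map side: test each triple against every pattern
        for tp_number, pattern in patterns.items():
            bindings = match(pattern, triple)
            if bindings is not None:
                tables[tp_number].append(bindings)
    return dict(tables)


def mr_join(tables, join_key, tp_numbers):
    """Join the tables of tp_numbers on join_key in one multi-way pass.

    Assumes the joined patterns share only the join-key variable.
    """
    # Map side: group rows by join-key value, remembering the pattern number.
    groups = defaultdict(lambda: defaultdict(list))
    for tp_number in tp_numbers:
        for row in tables.get(tp_number, []):
            groups[row[join_key]][tp_number].append(row)
    # Reduce side: emit only if every pattern to be joined is covered.
    results = []
    for rows_by_tp in groups.values():
        if set(rows_by_tp) != set(tp_numbers):
            continue
        # Cartesian product across the per-pattern rows for this key.
        for combination in product(*(rows_by_tp[n] for n in tp_numbers)):
            merged = {}
            for row in combination:
                merged.update(row)
            results.append(merged)
    return results


tables = mr_selection(TRIPLES, PATTERNS)
rows = mr_join(tables, "?x", [1, 2, 3, 4, 5])
print(rows)
```

With this toy data only <Prof0> survives the join, because <Prof1> has rows for patterns 4 and 5 but none for patterns 1, 2 and 3. This is the coverage check that the reduce pseudocode performs with its hashtable H.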
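
The greedy and multiple selection heuristics for join-keys might be combined as in the following sketch. This is one possible reading of the two heuristics, not the BGP Analyzer's actual logic: select_join_keys is a hypothetical helper, and breaking ties by variable name is an assumption added only to make the result deterministic.

```python
from collections import defaultdict

# The six-pattern query from the "Disadvantages" slide.
PATTERNS = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}


def variables(pattern):
    """Return the set of variables (terms starting with "?") in a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def select_join_keys(patterns):
    """Return [(join_key, [tp_numbers])] for one MR-Join iteration.

    Greedy: pick the variable related to the most unassigned patterns.
    Multiple: repeat until every pattern takes part in some join.
    """
    unassigned = set(patterns)
    plan = []
    while unassigned:
        coverage = defaultdict(set)
        for tp_number in unassigned:
            for var in variables(patterns[tp_number]):
                coverage[var].add(tp_number)
        if not coverage:
            break  # the remaining patterns contain no variables
        # sorted() fixes the order so that ties go to the smallest variable name.
        best = max(sorted(coverage), key=lambda var: len(coverage[var]))
        plan.append((best, sorted(coverage[best])))
        unassigned -= coverage[best]
    return plan


plan = select_join_keys(PATTERNS)
print(plan)
```

For the six-pattern query the sketch picks ?x for patterns 1 to 5 and then ?y2 for pattern 6. Combining those two groups on ?y2 still needs a further MR-Join iteration, which matches the disadvantage noted in the talk.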