QSX: Advanced Topics in Web Databases
Querying Social Graphs
Spring 2015

Instructor: Professor Wenfei Fan
Classes: 11:00-12:50, Wednesday, AP 2.07
Office Hours: Informatics Forum 5.23, 11:00-12:00, Thursday
TA: Ruizhe Huang, s1335233@sms.ed.ac.uk
Web: http://homepages.inf.ed.ac.uk/wenfei/qsx/home.html

Course format
Good news: there will be no exam!
Research seminar:
• Lectures: to provide background.
• Reviews/essays: research papers related to the topics.
Bad news: you have to study a number of research papers, and moreover, write reviews for a bunch of papers (8; down from 14 in 2012)
Worse: you have to do a tough project -- 40% – 45%
Furthermore, you have to write a report and present your project – 15%
Individual project: research or development and demo

Paper reviews – 40%
Read research papers listed at the end of lecture notes 2--8
Four sets of homework, starting from week 3; two reviews each set
Research papers: choose two each time and write two reviews
• 5% for each paper, and 10% for each homework
Deadlines:
• 11am, Wednesday, January 28, week 3
• 11am, Wednesday, February 11, week 5
• 11am, Wednesday, February 25, week 7
• 11am, Wednesday, March 11, week 9
Aim: understand a topic

Paper reviews – 40%
Research paper review: one page for each paper; say 10 marks
Summary (3 marks)
• A clear problem statement: input, question/output
• The need for this line of research: motivation; challenges
• A summary of the key ideas and contributions
Evaluation (4 marks)
• Criteria for the line of research (scalability, expressive power, applications)
• Evaluation based on your criteria; justify your evaluation
  – Strong points
  – Weak points
Possible extensions/revisions, for querying big graphs (3 marks)
How well you understand the line of research

Research project – 45%
Listed at the end of lecture notes 2--8
Individual project – start early!
Topics: nontrivial
Research project: apply the techniques you have learned
– Study a simple research problem
– Develop an algorithm to solve it
– Justify its correctness and give complexity analysis
– Conduct an experimental study to verify effectiveness
Example: incremental graph reachability
• Given a graph G and a pair (s, t) of nodes in G, find whether there is a path from s to t, in response to changes to G
• Is the incremental problem “bounded”? If so, develop a bounded incremental algorithm. Otherwise disprove it
The cost is decided by the size of changes, not by |G|
Something you can include in your CV

Development and demo – 45%
Development project: implementation of existing algorithms
– Pick a topic and an application domain
– Design a prototype system for the application
– Implement the system based on existing algorithms
– Verify that your system is useful in practice
Example: graph pattern matching by graph simulation
• Develop a MapReduce algorithm by revising algorithms for graph simulation
• Justify the correctness and scalability of your algorithm
• Implement a “system” based on your algorithm, e.g., for job hunters: given a (big) graph G and a pattern query Q, compute matches Q(G)
• Demonstrate that the system is fully functional, scalable and efficient
You are encouraged to come up with your own project

Research project reports
A research “paper” (10 pages, using LaTeX)
• Introduction: problem statement, motivation, contributions; justify the novelty and technical depth of your solution
• Related work: what has been done; the novelty of your work
• Your algorithm: example, explanation, analyses, justification
• Experimental study
• Conclusion: possible extensions
Evaluation:
– novelty (25%)
– technical depth, justification (25%)
– experimental study (25%)
– presentation (report; 25%)
Deadline: 11am, Wednesday, March 25th (week 11)

Development project reports
A technical report (10 pages, using LaTeX)
• Introduction: application domain, motivation, an overview of your prototype system
• Related work
• System: architecture; functionality; algorithms; criteria
  Justification: design? What is new about your system?
• Demonstration: based on your criteria; snapshots
• Conclusion: what can we get out of it? possible extensions
Evaluation:
– Design choices: your criteria and decision (25%)
– Completion: functionality (25%)
– Performance, evaluation (25%)
– Presentation (report; 25%)
Deadline: 11am, Wednesday, March 25th (week 11)

Presentation: 15%
Report your work to the class
• 10 minutes each, starting from week 9
• 7 minutes for presentation, and
• 3 minutes for Q&A – how well you understand the subject
Presentation
• Problem statement, motivation
• Your contributions (a few bullets)
• Technical solutions (research) / demonstration (demo)
• A quick summary of what you have learned from doing the project
Question handling: demonstrate that you have developed a good understanding of the line of work
Learn how to present your work

What is this course about?

Social networks modeled as graphs
Node: person
Edge: relationship
[Figure: example graph with nodes labeled B, A1, Am and W, and edges labeled supervise and report]
Labeled, directed graphs: Facebook, Twitter, LinkedIn, …

Graph queries
Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: a pattern and a graph, with nodes labeled B, A1, Am, AM, S, FW and W]
“Understanding the structure of drug trafficking organizations”

Querying graphs
Input: a query Q and a data graph G
Output: all the matches of Q in G
• subgraph isomorphism: a bijective function f on nodes: (u, u’) ∈ Q iff (f(u), f(u’)) ∈ G
• graph simulation: a binary relation S on nodes: for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G, such that (u’, v’) ∈ S
A departure from our familiar database queries

Flashback: Relational databases
What is a relation?
What is a relation schema? person(FN, LN, city, pid, status)
What is a relational schema?
What is an instance of a relation schema?
What is a relational database?
What are constraints?

FN      LN      city  pid    status
Mary    Smith   NYC   01234  single
Mary    Dupont  NYC   12035  married
Bob     Luth    EDI   09456  married
Robert  Luth    EDI   09433  married

Flashback: Relational databases
Structured data:
It has a highly regular structure
The structure is constrained by a schema (type + constraints)
Schema and instances:
A schema is a description of a particular collection of data.
Schemas in databases vs. types in programming languages.
An instance of a schema (database) is the collection of information stored in the database at a particular moment.
Instances of schemas vs. values of types
Specified by a schema (type + constraints)

Flashback: Relational queries
Name a few relational operators
– Projection: π_A R
– Selection: σ_C R
– Join: R1 ⋈_C R2
– Union: R1 ∪ R2
– Set difference: R1 − R2
– Group by and aggregate (max, min, count, average)
What is a conjunctive query? What is first-order logic?
What does it mean to say that a query language is relationally complete?
Relational queries: well defined (logic) and studied

Example: Graph Search (Facebook)
Find me restaurants in New York my friends have been to in 2014
• friend(pid1, pid2)
• person(pid, name, city)
• dine(pid, rid, dd, mm, yy)
A relational query
  select rid
  from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
  where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014
A simple conjunctive query

Database systems
A database is a collection of data, typically containing the information about one or more related organizations.
A database management system (DBMS) is a software package designed to store and manage databases.
Query languages, query processing techniques
Integrity constraints for the consistency of the data
Database views, updates
Secondary storage, indexing
Concurrency control, recovery, security
...
A mature subject for almost 50 years

XML
An XML document is modeled as a node-labeled ordered tree.
Element node: typically internal, with a name (tag) and children (subelements and attributes), e.g., student, name.
Attribute node: leaf with a name (tag) and text, e.g., @id.
Text node: leaf with text (string) but without a name.
[Figure: an XML tree rooted at db, with student elements (@id “123”; name with firstName “George” and lastName “Bush”; taking) and course elements (@cno “Eng 055”; title “Spelling”)]
Trees, possibly with a schema (XML Schema, DTD)

XML Queries
Q: Find titles and authors of all books published by Addison-Wesley after 1991.
  <answer>{
    for $book in /bib/book
    where $book/@year > 1991 and $book/publisher = 'Addison-Wesley'
    return <book>
             <title> {$book/title} </title>
             { for $author in $book/author
               return <author> {$author} </author> }
           </book>
  }</answer>
Well studied: XPath, XSLT, XQuery

XML constraints
• absolute: (//book, {title})
• relative: (//book, (chapter, {number}))
• relative: (//book/chapter, (section, {number}))
[Figure: an XML tree rooted at db, with book elements (titles “XML” and “SGML”), their chapter and section subelements, and number and text nodes]
Well studied and supported by XML Schema

However, when it comes to graphs …
Semistructured:
– No schema
– No constraints yet
No standard query languages
– A variety of queries used in practice
– Nontrivial
What is the complexity of the following problems?
– Subgraph isomorphism: NP-complete
– Simple path: given a graph G, a pair (s, t) of nodes in G, and a regular expression R, decide whether there exists a simple path from s to t that satisfies R.
Query optimization techniques, indexing, updates, …: preliminary
The study of graph queries is still in its infancy

Worse still, real-life graphs are big
• social scale: 100G (10^11)
• Web scale: 1T (10^12)
• brain scale: 100T (10^14)
Real-life scope (P. Burkhardt, et al., US National Security Agency, May 2013)
Facebook: more than 1.38 billion nodes, and over 140 billion links
We need new techniques to cope with the volume of big data

Challenges introduced by big data
Traditional computational complexity theory of almost 50 years:
• tractable: polynomial time computable (PTIME)
• intractable: NP-hard
• beyond: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data? How long does it take?
Using SSD of 6GB/s, a linear scan of a data set D would take
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data!

Why do we care about graphs?
social networks, knowledge graph, program diagrams, brain network, metabolic networks, cyber networks
Graph queries are essential for data analysis in emerging applications

Querying collaborative networks
To form a team for a software development project, we want to hire a software developer (S) such that
• there is a project manager PM who recommends both S and a software designer SD, and
• S and SD recommend each other
[Figure: pattern with nodes project manager, designer and developer]
Similarly,
• Web site classification
• social position detection
• headhunters
• recommendation systems for finding experts
• intelligence analysis
• even adolescent drug use
Social network analysis

Social media marketing
If x and x’ are friends living in the same area c, y is a restaurant in area c, and x’ likes y, then the chances are that x also likes y
[Figure: pattern over x, x’, y and c, with edges friend, like and live-in]
We can advertise restaurant y to person x
User-targeted advertising

Knowledge base expansion
Given a knowledge graph G and a newly extracted entity e with context C, decide whether we should expand G with e
[Figure: pattern with nodes song, artist, album and record, and edges by, in and name]
There are 410 songs named “yesterday” in Freebase, and among them 30 were recorded by the Beatles
If there exists an entity e’ in G that “matches” e, then don’t add the new entity e to G
• Does the name of a song uniquely determine the song?
• Add the name of recording artist?
Knowledge fusion, knowledge base disambiguation

POI recommendation
Given a set S of points in a space M, a point p in M and a positive integer k
Find top-k points in S that are closest to p
Applications:
POI recommendation
Pattern recognition
Clustering
Transportation network analysis
...
Neighborhood queries

Graph systems
Graph query engines
– Giraph (Pregel, Google)
– GraphLab, machine learning and data mining
– Neo4j, Neo Tech
– GraphX
– TAO: Facebook
Key-value stores (NoSQL)
– Trinity, Microsoft
– CloudGraph
RDF triple stores
– RDF3X
– YARS2
A number of graph database systems have been developed

Why study graph queries?
Prevalent use in traditional and emerging applications
• Transportation networks, intelligence analysis, biochemistry
• Social networks, cyber networks, knowledge bases, recommendation systems …
Benefits: prepare you for
• graduate study: current research and practical issues;
• the job market: skills/knowledge in need (Facebook/Twitter)
Connection with many other areas of computer science: algorithms, databases, distributed systems, Big Data (you may have heard this) …
Querying big graphs: $$$

What will be covered in this course?

Graph queries and algorithms
Graph search
• Reachability
• Regular reachability
PageRank
Nearest neighbors
Keyword search
Graph pattern matching
• subgraph isomorphism
• graph simulation
Querying both data and topological structures?
• revisions of graph simulation for social network analysis
Queries, complexity and algorithms (sequential and parallel)

Querying big graphs
Given a query Q and big graph G (TB, PB, EB), compute Q(G)
A number of techniques:
1. Distributed query processing
2. Query preserving data compression
3. Query answering using views
4. Bounded evaluability of queries
5. Bounded incremental evaluation
6. …
A linear scan of data of size PB may take days
Graph queries are expensive, some are intractable
Is it still feasible to query big graphs?

Parallel models for querying graphs
MapReduce
[Figure: MapReduce data flow: <k1, v1> pairs go to mappers, which emit <k2, v2> pairs; reducers output <k3, v3> pairs]
Beyond MapReduce:
Think like a vertex
• BSP
• Vertex-centric
Think like a graph
• GRAPE (partial evaluation + partitioned parallelism)
MapReduce and beyond: Models and algorithms

Query preserving graph compression
Making big graphs small
18 times faster on average for reachability queries

Answering queries using views
The cost of query processing: f(|G|, |Q|)
Can we compute Q(G) without accessing G, i.e., independent of |G|?
Q(G) = Q’(V(G))
Answering graph pattern queries on big social graphs:
V(G) is often much smaller than G (4% -- 12% on real-life data)
Improvement: 31 times faster for graph pattern matching
Regardless of how big G is – the cost is “independent” of G
The complexity is no longer a function of |G|

Incremental graph pattern matching
Incremental query answering
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
Incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
The updating cost that is inherent to the incremental problem itself
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Incremental graph simulation: bounded
Complexity analysis in terms of the size of changes

Bounded evaluability
Input: A class of queries
Question: Can we find, for any query Q in the class and any (possibly big) graph G, a subgraph G_Q of G such that |G_Q| ≤ M, and Q(G) = Q(G_Q)? Here M is independent of the size of G
60% of graph pattern queries are boundedly evaluable
Improvement: 4 orders of magnitude
Making the cost of computing Q(G) independent of |G|!
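
The incremental setting above (Input: Q, G, Q(G), ∆G; Output: ∆M) can be made concrete with a small sketch. It is an illustration under stated assumptions, not an algorithm from the course: the query is single-source reachability, ∆G is a single edge insertion, and the function names and the adjacency-set representation are hypothetical. Deletions are not handled, and the sketch makes no claim of boundedness in the formal sense; it only shows work being limited to the newly reachable nodes and their outgoing edges.

```python
from collections import defaultdict, deque


def reachable_from(adj, s):
    """Batch evaluation of Q(G): the set of nodes reachable from s (BFS)."""
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def insert_edge(adj, reached, u, v):
    """Apply the change ∆G = {(u, v)} and return ∆M, the newly reachable nodes.

    `reached` is the old answer Q(G) and is updated in place, so that
    afterwards it equals Q(G ⊕ ∆G). Only nodes that become reachable
    are explored; the rest of the graph is never touched.
    """
    adj[u].add(v)
    # The answer can only grow if u was reachable and v was not.
    if u not in reached or v in reached:
        return set()
    delta = {v}
    reached.add(v)
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in reached:
                reached.add(y)
                delta.add(y)
                queue.append(y)
    return delta


if __name__ == "__main__":
    # Hypothetical toy graph: a -> b and c -> d, with source a.
    graph = defaultdict(set)
    graph["a"].add("b")
    graph["c"].add("d")
    answer = reachable_from(graph, "a")
    print(sorted(insert_edge(graph, answer, "b", "c")))
```

Here `reached` plays the role of Q(G) and the returned set plays the role of ∆M, so Q(G⊕∆G) is obtained by adding ∆M to the old answer rather than by recomputing it from G.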

When exact answers are beyond reach
It may not be possible to compute exact answers. Is it still feasible to answer such queries on big data?
Yes, approximate query answering
When exact algorithms are infeasible, we find inexact algorithms with performance guarantees – can’t be too far!
• Query-driven approximation
• Data-driven approximation
Personalized social search: reduce graphs of PB size to GB (1.5 * 10^-6)
Yes, querying big graphs is feasible!

Summary and review
Why study graph queries?
What are the differences between graph queries and relational queries?
What are the main challenges for querying graphs?
Give examples of the following:
– graph search
– keyword search
– kNN join
– graph pattern matching
Name a few applications of graph queries
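
As one possible example for the graph pattern matching review item, the sketch below computes a graph simulation relation as defined on the “Querying graphs” slide: a pair (u, v) stays in S only if every pattern edge (u, u’) can be mapped to a graph edge (v, v’) with (u’, v’) in S. It rests on assumptions that the slides do not spell out: a pattern node can only be matched to a graph node with the same label, edge labels are ignored, and the answer is the largest such relation, found by a naive fixpoint that is not tuned for big graphs. The function name and the toy data are hypothetical.

```python
def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return the largest simulation relation as {pattern node: set of graph nodes}.

    q_nodes, g_nodes: dict mapping node id -> label.
    q_edges, g_edges: iterable of directed (source, target) pairs.
    """
    # Successor sets of the data graph.
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)

    # Assumption: candidates of u are the graph nodes carrying u's label.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}

    # Remove (u, v) whenever some pattern edge (u, u2) has no matching
    # graph edge (v, v2) with v2 in sim[u2]; repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim


if __name__ == "__main__":
    # Pattern from the collaborative network example: PM recommends S and SD,
    # and S and SD recommend each other.
    pattern_nodes = {"PM": "PM", "S": "S", "SD": "SD"}
    pattern_edges = {("PM", "S"), ("PM", "SD"), ("S", "SD"), ("SD", "S")}
    # Hypothetical data graph: s2 is recommended by pm1 but recommends nobody.
    data_nodes = {"pm1": "PM", "s1": "S", "s2": "S", "sd1": "SD"}
    data_edges = {("pm1", "s1"), ("pm1", "s2"), ("pm1", "sd1"),
                  ("s1", "sd1"), ("sd1", "s1")}
    print(max_simulation(pattern_nodes, pattern_edges, data_nodes, data_edges))
```

Unlike subgraph isomorphism, the result is a relation rather than a bijective function, so one pattern node may be matched to several graph nodes.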