Lecture note 1 - Informatics Homepages Server

QSX: Advanced Topics in Web Databases
Querying Social Graphs
Spring 2015
 Instructor: Professor Wenfei Fan
 Classes: 11:00-12:50, Wednesday, AP 2.07
 Office Hours: Informatics Forum 5.23, 11:00-12:00, Thursday
 TA: Ruizhe Huang, s1335233@sms.ed.ac.uk
 Web: http://homepages.inf.ed.ac.uk/wenfei/qsx/home.html
1
Course format
2
Course format
 Good news: there will be no exam!
 Research seminar:
• Lectures: to provide background.
• Reviews/essays: research papers related to the topics.
 Bad news: you have to study a number of research papers, and moreover, write reviews for a bunch of papers (8, down from 14 in 2012) – 40%
 Worse: you have to do a tough project – 45%
 Furthermore, you have to write a report and present your project – 15%
Individual project: research or development and demo
3
Paper reviews – 40%
 Read research papers listed at the end of lecture notes 2--8
 Four sets of homework, starting from week 3; two reviews in each set
 Research papers: choose two papers each time and write two reviews
• 5% for each paper, and 10% for each homework
 Deadlines:
• 11am, Wednesday, January 28, week 3
• 11am, Wednesday, February 11, week 5
• 11am, Wednesday, February 25, week 7
• 11am, Wednesday, March 11, week 9
understand a topic
4
Paper reviews – 40%
Research paper review: one page for each paper; say 10 marks
 Summary (3 marks):
• A clear problem statement: input, question/output
• The need for this line of research: motivation; challenges
• A summary of the key ideas and contributions
 Evaluation (4 marks):
• Criteria for the line of research (scalability, expressive power, applications)
• Evaluation based on your criteria; justify your evaluation
– Strong points
– Weak points
 Possible extensions/revisions, for querying big graphs (3 marks)
How well you understand the line of research
5
Research project – 45%
 Individual project – start early!
 Topics: nontrivial; listed at the end of lecture notes 2--8
 Research project:
Apply the techniques you have learned
– Study a simple research problem
– Develop an algorithm to solve it
– Justify its correctness and give complexity analysis
– Conduct an experimental study to verify effectiveness

Example: incremental graph reachability
• Given a graph G and a pair (s, t) of nodes in G, find whether there is a path from s to t, in response to changes to G
• Is the incremental problem “bounded”? If so, develop a bounded incremental algorithm. Otherwise disprove it
The cost is decided by the size of changes, not by |G|
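As a rough illustration of the incremental setting in the example above, the sketch below maintains the set of nodes reachable from a fixed source s while edges are inserted. It is only a minimal sketch under stated assumptions: the class name `IncrementalReachability` and its methods are hypothetical, only edge insertions are handled, and it makes no claim about whether the general problem (with deletions) is bounded.

```python
from collections import defaultdict, deque


class IncrementalReachability:
    """Maintains the nodes reachable from a fixed source under edge insertions.

    Hypothetical helper for illustration; edge deletions are NOT handled.
    """

    def __init__(self, source, edges=()):
        self.source = source
        self.adj = defaultdict(set)   # adjacency lists of the current graph
        self.reached = {source}       # nodes currently reachable from source
        for u, v in edges:
            self.insert_edge(u, v)

    def insert_edge(self, u, v):
        """Insert edge (u, v); return the set of nodes that became reachable."""
        self.adj[u].add(v)
        if u not in self.reached or v in self.reached:
            return set()              # the reachable set cannot change
        newly = {v}
        self.reached.add(v)
        queue = deque([v])
        while queue:                  # explore only the newly reachable part
            x = queue.popleft()
            for y in self.adj[x]:
                if y not in self.reached:
                    self.reached.add(y)
                    newly.add(y)
                    queue.append(y)
        return newly

    def reachable(self, t):
        return t in self.reached


r = IncrementalReachability("s", [("s", "a"), ("b", "t")])
print(r.reachable("t"))          # False: no path from s to t yet
print(r.insert_edge("a", "b"))   # the nodes b and t become reachable
print(r.reachable("t"))          # True
```

After an insertion, the traversal visits only nodes that were not reachable before, so the work depends on the affected part of the graph rather than on all of G.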
Something you can include in your CV
6
Development and demo – 45%
 Development project:
Implementation of existing algorithms
– Pick a topic and an application domain
– Design a prototype system for the application
– Implement the system based on existing algorithms
– Verify that your system is useful in practice

Example: Graph pattern matching by graph simulation
• Develop a MapReduce algorithm by revising algorithms for graph
simulation
• Justify the correctness and scalability of your algorithm
• Implement a “system” based on your algorithm, for, e.g., job
hunters: given a (big) graph G and a pattern query Q, compute
matches Q(G)
• Demonstrate that the system is fully functional, scalable and
efficient
You are encouraged to come up with your own project
7
Research project reports
 A research “paper” (10 pages, using LaTeX)
• Introduction: problem statement, motivation, contributions;
justify the novelty and technical depth of your solution
• Related work: what has been done; the novelty of your work
• Your algorithm: example, explanation, analyses, justification
• Experimental study
• Conclusion: possible extensions
 Evaluation:
– novelty (25%)
– technical depth, justification (25%)
– experimental study (25%),
– presentation (report; 25%)
Deadline: 11am, Wednesday, March 25th (week 11)
8
Development project reports
 A technical report (10 pages, using LaTeX)
• Introduction: application domain, motivation, an overview of
your prototype system
• Related work
• System: architecture; functionality; algorithms; criteria
Justification: design? What is new about your system?
• Demonstration: based on your criteria; snapshots
• Conclusion: what can we get out of it? possible extensions
 Evaluation:
– Design choices – your criteria and decision (25%),
– Completion: functionality (25%)
– Performance, evaluation (25%)
– Presentation (report; 25%)
Deadline: 11am, Wednesday, March 25th (week 11)
9
Presentation: 15%
 Report your work to the class –
• 10 minutes each, starting from week 9
• 7 minutes for presentation, and
• 3 minutes for Q&A – how well you understand the subject
 Presentation
• Problem statement, motivation
• Your contributions (a few bullets)
• Technical solutions (research) / demonstration (demo)
• A quick summary of what you have learned from doing the
project
 Question handling: demonstrate that you have developed a
good understanding of the line of work
Learn how to present your work
10
What is this course about?
11
Social networks modeled as graphs
Edge: relationship
Node: person
[Figure: a social graph with labels B, A1, …, Am, W, supervise and report]
Labeled, directed graphs: Facebook, Twitter, LinkedIn, …
12
Graph queries
Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: a pattern graph and a data graph, with node labels B, A1, …, Am, AM, S, FW and W]
“Understanding the structure of drug trafficking organizations”
13
Querying graphs
 Input: a query Q and a data graph G,
 Output: all the matches of Q in G.
• subgraph isomorphism: a bijective function f on nodes such that (u, u’) ∈ Q iff (f(u), f(u’)) ∈ G
• graph simulation: a binary relation S on nodes such that for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G with (u’, v’) ∈ S
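The graph simulation condition above can be computed by starting from all candidate pairs and repeatedly discarding pairs that violate it. The following is a minimal sketch, not the algorithm covered later in the course; the function name `graph_simulation`, the dictionary-based graph encoding, and the requirement that matched nodes carry the same label are assumptions made for illustration.

```python
def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return the largest relation S satisfying the simulation condition.

    q_nodes, g_nodes: dict mapping node -> label (label equality is assumed).
    q_edges, g_edges: sets of (source, target) pairs.
    Returns a set of (pattern node, data node) pairs, or an empty set when
    some pattern node has no match at all (also an assumption).
    """
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    q_succ = {u: set() for u in q_nodes}
    for u, u2 in q_edges:
        q_succ[u].add(u2)

    # Start with every label-compatible pair, then refine to a fixpoint.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for v in list(sim[u]):
                # Each pattern edge (u, u2) needs an edge (v, v2) with (u2, v2) in S.
                if not all(g_succ[v] & sim[u2] for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True
    if any(not sim[u] for u in q_nodes):
        return set()
    return {(u, v) for u in q_nodes for v in sim[u]}


pattern_nodes = {"a": "A", "b": "B"}
pattern_edges = {("a", "b")}
data_nodes = {1: "A", 2: "B", 3: "A"}
data_edges = {(1, 2)}
print(graph_simulation(pattern_nodes, pattern_edges, data_nodes, data_edges))
```

In this toy input, data node 3 has label A but no outgoing edge to a B node, so it is discarded, and the result pairs a with 1 and b with 2. Unlike subgraph isomorphism, the result is a relation rather than a function, so one pattern node may match many data nodes.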
A departure from our familiar database queries
14
Flashback: Relational databases
 What is a relation?
 What is a relation schema?
person(FN, LN, city, pid, status),
 What is a relational schema?
 What is an instance of a relation schema?
 What is a relational database?
 What are constraints?
FN      LN      city  pid    status
Mary    Smith   NYC   01234  single
Mary    Dupont  NYC   12035  married
Bob     Luth    EDI   09456  married
Robert  Luth    EDI   09433  married
15
Flashback: Relational databases
Structured data:
 It has a highly regular structure
 The structure is constrained by a schema (type + constraints)
Schema and instances:
 A schema is a description of a particular collection of data,
Schemas in database vs. types in programming language.
 An instance of a schema (database) is the collection of
information stored in the database at a particular moment.
Instances of schemas vs. values of types
Specified by a schema (type + constraints)
16
Flashback: Relational queries
 Name a few relational operators
– Projection: π_A R
– Selection: σ_C R
– Join: R1 ⋈_C R2
– Union: R1 ∪ R2
– Set difference: R1 − R2
– Group by and aggregate (max, min, count, average)
 What is a conjunctive query?
 What is first-order logic?
 What does it mean to say that a query language is relationally complete?
Relational queries: well defined (logic) and studied
17
Example: Graph Search (Facebook)


Find me restaurants in New York my friends have been to in 2014
• friend(pid1, pid2)
• person(pid, name, city)
• dine(pid, rid, dd, mm, yy)
A relational query
-- 'p0' stands for the pid of the person asking the question
select dine.rid
from friend, person, dine
where friend.pid1 = 'p0' and friend.pid2 = person.pid and
      friend.pid2 = dine.pid and person.city = 'NYC' and dine.yy = 2014
A simple conjunctive query
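To make the conjunctive query above concrete, here is a small sketch that evaluates it over in-memory tuples following the three relation schemas friend, person and dine. The function name `restaurants_visited_by_friends` and the sample tuples are hypothetical and serve only as an illustration.

```python
def restaurants_visited_by_friends(friend, person, dine, p0, city, year):
    """Evaluate the conjunctive query as selections followed by joins.

    friend: iterable of (pid1, pid2)
    person: iterable of (pid, name, city)
    dine:   iterable of (pid, rid, dd, mm, yy)
    """
    friends_of_p0 = {pid2 for pid1, pid2 in friend if pid1 == p0}     # pid1 = p0
    in_city = {pid for pid, _name, c in person if c == city}          # city = NYC
    candidates = friends_of_p0 & in_city                              # pid2 = person.pid
    return {rid for pid, rid, _dd, _mm, yy in dine
            if pid in candidates and yy == year}                      # pid2 = dine.pid, yy = 2014


# Made-up sample data, for illustration only.
friend = [("p0", "p1"), ("p0", "p2"), ("p3", "p1")]
person = [("p1", "Ann", "NYC"), ("p2", "Bob", "EDI")]
dine = [("p1", "r1", 3, 5, 2014), ("p1", "r2", 9, 9, 2013), ("p2", "r3", 1, 1, 2014)]
print(restaurants_visited_by_friends(friend, person, dine, "p0", "NYC", 2014))
```

Only r1 qualifies in this data: r2 was visited in 2013, and r3 was visited by a friend whose city is not NYC.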
18
Database systems
 A database is a collection of data, typically containing the
information about one or more related organizations.
 A database management system (DBMS) is a software
package designed to store and manage databases.
 Query languages, query processing techniques
 Integrity constraints for the consistency of the data
 Database views, updates
 Secondary storage, indexing
 Concurrency control, recovery, security
 ...
A mature subject for almost 50 years
19
XML
An XML document is modeled as a node-labeled ordered tree.
 Element node: typically internal, with a name (tag) and children
(subelements and attributes), e.g., student, name.
 Attribute node: leaf with a name (tag) and text, e.g., @id.
 Text node: leaf with text (string) but without a name.
[Figure: an XML tree rooted at db, with student elements (@id “123”, name with firstName “George” and lastName “Bush”, taking) and course elements (@cno “Eng 055”, title “Spelling”)]
trees, possibly with a schema (XML Schema, DTD)
20
XML Queries
Q: Find titles and authors of all books published by Addison-Wesley after 1991.
<answer>{
  for $book in /bib/book
  where $book/@year > 1991 and $book/publisher = "Addison-Wesley"
  return
    <book>
      <title>{ $book/title/text() }</title>
      { for $author in $book/author
        return <author>{ $author/text() }</author> }
    </book>
}</answer>
Well studied: XPath, XSLT, XQuery
21
XML constraints
• absolute: (//book, {title})
• relative: (//book, (chapter, {number}))
• relative: (//book/chapter, (section, {number}))
[Figure: an XML tree rooted at db, with book, title, chapter, section, number and text nodes]
Well studied and supported by XML Schema
22
However, when it comes to graphs …
 Semistructured:
– No schema
– No constraints yet
 No standard query languages
– A variety of queries used in practice
– Nontrivial
 What is the complexity of the following problems?
– Subgraph isomorphism: NP-complete
– Simple path: given a graph G, a pair (s, t) of nodes in G, and
a regular expression R, it is to decide whether there exists a
simple path from s to t that satisfies R.
 Query optimization techniques, indexing, updates, …: preliminary
The study of graph queries is still in its infancy
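To get a feel for why subgraph isomorphism is expensive, the brute-force sketch below tries every injective assignment of pattern nodes to data nodes. It is an illustration only, not a practical algorithm; the function name `subgraph_matches` is hypothetical, node labels are ignored, and a match is taken to mean that every pattern edge maps to a data edge.

```python
from itertools import permutations


def subgraph_matches(q_nodes, q_edges, g_nodes, g_edges):
    """Yield injective mappings f (as dicts) from pattern nodes to data nodes
    such that each pattern edge (u, u2) has an image edge (f(u), f(u2)).

    Brute force: with k pattern nodes and n data nodes it inspects
    n! / (n - k)! candidate assignments.
    """
    q_list = list(q_nodes)
    g_edge_set = set(g_edges)
    for image in permutations(g_nodes, len(q_list)):
        f = dict(zip(q_list, image))
        if all((f[u], f[u2]) in g_edge_set for u, u2 in q_edges):
            yield f


# A directed triangle pattern, and a data graph containing one such triangle.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
data_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
matches = list(subgraph_matches(["a", "b", "c"], triangle, [1, 2, 3, 4], data_edges))
print(len(matches))
```

The triangle is found three times, once per rotation of (a, b, c) onto (1, 2, 3). The number of candidate assignments grows very quickly with the sizes of the pattern and the data graph.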
23
Worse still, real-life graphs are big
• social scale: 100G (10^11)
• Web scale: 1T (10^12)
• brain scale: 100T (10^14)
(P. Burkhardt, et al., US National Security Agency, May 2013)
Real-life scope: Facebook has more than 1.38 billion nodes, and over 140 billion links
We need new techniques to cope with the volume of big data
24
Challenges introduced by big data

Traditional computational complexity theory of almost 50 years:
• tractable: polynomial time computable (PTIME)
• intractable: NP-hard
• beyond: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data? How long does it take?
Using SSD of 6G/s, a linear scan of a data set D would take
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data!
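The scan times quoted on this slide can be checked with a few lines of arithmetic. This sketch assumes that "6G/s" means 6 × 10^9 bytes per second and that a year has 365 days; both are assumptions, and the function name `scan_time_seconds` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def scan_time_seconds(data_bytes, bytes_per_second=6 * 10**9):
    """Time for one linear scan at a fixed read rate (assumed 6 * 10^9 B/s)."""
    return data_bytes / bytes_per_second


pb_days = scan_time_seconds(10**15) / SECONDS_PER_DAY          # 1 PB
eb_years = scan_time_seconds(10**18) / SECONDS_PER_DAY / 365   # 1 EB
print(f"1 PB: {pb_days:.1f} days")     # 1 PB: 1.9 days
print(f"1 EB: {eb_years:.2f} years")   # 1 EB: 5.28 years
```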
25
Why do we care about graphs?
social networks
knowledge graph
program diagrams
brain network
metabolic networks
cyber networks
Graph queries are essential for data analysis in emerging applications
26
Querying collaborative networks


To form a team for a software development project, we want to hire a software developer (S) such that
• there is a project manager PM who recommends both S and a software designer SD, and
• S and SD recommend each other
[Figure: a pattern with nodes project manager, designer and developer]
Similarly,
• Web site classification
• social position detection
• headhunters;
• recommendation systems for finding experts;
• intelligence analysis
• even adolescent drug use
Social network analysis
27
Social media marketing

If x and x’ are friends living in the same area c, y is a restaurant in area c, and x’ likes y, then the chances are that x also likes y
[Figure: a pattern over nodes x, x’, c and y, with edges labeled friend, live-in and like]
We can advertise restaurant y to person x
User-targeted advertising
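The rule on this slide can be read as a small pattern query over friend, live-in and like edges. The sketch below is one hypothetical way to evaluate it; the function name `advertising_targets`, the data layout, and the extra filter that skips people who already like y are all assumptions made for illustration.

```python
def advertising_targets(friend, lives_in, likes):
    """Return pairs (x, y): advertise restaurant y to person x.

    friend:   set of (x, x2) pairs; a symmetric friendship is assumed to be
              stored in both directions.
    lives_in: dict mapping a person or restaurant to its area.
    likes:    set of (person, restaurant) pairs.
    """
    targets = set()
    for x, x2 in friend:
        area = lives_in.get(x)
        if area is None or lives_in.get(x2) != area:
            continue                      # x and x2 must live in the same area c
        for person, y in likes:
            if (person == x2                      # x2 likes y
                    and lives_in.get(y) == area   # y is located in area c
                    and (x, y) not in likes):     # assumption: skip known likes
                targets.add((x, y))
    return targets


# Made-up sample data, for illustration only.
friend = {("ann", "bob"), ("bob", "ann")}
lives_in = {"ann": "c1", "bob": "c1", "cafe": "c1", "bar": "c2"}
likes = {("bob", "cafe"), ("bob", "bar")}
print(advertising_targets(friend, lives_in, likes))
```

Here only the cafe is suggested to ann: the bar is liked by her friend bob but is located in a different area.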
28
Knowledge base expansion

Given a knowledge graph G and a newly extracted entity e with context C, decide whether we should expand G with e
 If there exists an entity e’ in G that “matches” e, then don’t add new entity e to G
[Figure: a pattern over song, artist and album entities, with edges labeled by, in and record, and name attributes]
There are 410 songs named “yesterday” in Freebase, and among them 30 were recorded by the Beatles
•
Does the name of a song uniquely determine the song?
•
Add the name of recording artist?
Knowledge fusion, knowledge base disambiguation
29
POI recommendation

Given a set S of points in a
space M, a point p in M and a
positive integer k

Find top-k points in S that are
closest to p
Applications:
 POI recommendation
 Pattern recognition
 Clustering
 Transportation network analysis
 ...
Neighborhood queries
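The top-k problem stated on this slide has a simple baseline: scan S once and keep the k closest points. The sketch below assumes a Euclidean space and points given as coordinate tuples; the function name `top_k_nearest` is hypothetical, and real systems would typically use an index rather than a full scan.

```python
import heapq
import math


def top_k_nearest(points, p, k):
    """Return the k points of `points` closest to p, nearest first.

    Assumes Euclidean distance; one linear scan that keeps k candidates.
    """
    return heapq.nsmallest(k, points, key=lambda q: math.dist(p, q))


S = [(0, 0), (3, 4), (1, 1), (10, 10)]
print(top_k_nearest(S, (0, 0), 2))   # [(0, 0), (1, 1)]
```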
30
Graph systems
 Graph query engines
– Giraph (Pregel, Google)
– GraphLab, machine learning and data mining
– Neo4j, Neo Tech
– GraphX
– TAO: Facebook
 Key-value stores (NoSQL)
– Trinity, Microsoft
– CloudGraph
 RDF triple stores
– RDF3X
– YARS2
A number of graph database systems have been developed
31
Why study graph queries?
 Prevalent use in traditional and emerging applications
• Transportation networks, intelligence analysis, biochemistry
• Social networks, cyber networks, knowledge bases,
recommendation systems …
 Benefits: prepare you for
• graduate study: current research and practical issues;
• the job market (Facebook/Twitter): skills/knowledge in need
 Connection with many other areas of computer science:
algorithms, databases, distributed systems, Big Data …
You may have heard this
Querying big graphs: $$$
32
What will be covered in this course?
33
Graph queries and algorithms
 Graph search
• Reachability
• Regular reachability
 PageRank
 Nearest neighbors
 Keyword search
 Graph pattern matching
• subgraph isomorphism
• graph simulation
• revisions of graph simulation for social network analysis
Querying both data and topological structures?
Queries, complexity and algorithms (sequential and parallel)
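As a taste of the graph search queries listed above, regular reachability can be read as: is there a path from s to t whose edge labels spell a word in a given regular language? The sketch below searches the product of the graph and a deterministic automaton. It assumes edge-labeled graphs, a DFA supplied as a transition dictionary, and paths that need not be simple; the function name `regular_reachable` is hypothetical.

```python
from collections import deque


def regular_reachable(edges, s, t, dfa_start, dfa_accept, dfa_delta):
    """Breadth-first search over (graph node, DFA state) pairs.

    edges:      iterable of (u, label, v) triples.
    dfa_delta:  dict mapping (state, label) -> next state.
    dfa_accept: set of accepting states.
    """
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
    start = (s, dfa_start)
    seen = {start}
    queue = deque([start])
    while queue:
        node, state = queue.popleft()
        if node == t and state in dfa_accept:
            return True
        for label, nxt in adj.get(node, []):
            nxt_state = dfa_delta.get((state, label))
            if nxt_state is None:
                continue                  # the automaton rejects this label here
            pair = (nxt, nxt_state)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False


# Paths of the form friend+ likes, encoded as a three-state DFA.
delta = {(0, "friend"): 1, (1, "friend"): 1, (1, "likes"): 2}
graph = [("a", "friend", "b"), ("b", "friend", "c"), ("c", "likes", "r")]
print(regular_reachable(graph, "a", "r", 0, {2}, delta))   # True
print(regular_reachable(graph, "c", "r", 0, {2}, delta))   # False
```

Plain reachability is the special case in which the automaton accepts every label sequence.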
34
Querying big graphs
Given a query Q and big graph G, compute Q(G)
[Figure: Q(G), where G is of size TB, PB, EB]
• A linear scan of data of PB size may take days
• Graph queries are expensive, some are intractable
A number of techniques:
1. Distributed query processing
2. Query preserving data compression
3. Query answering using views
4. Bounded evaluability of queries
5. Bounded incremental evaluation
6. …
Is it still feasible to query big graphs?
35
Parallel models for querying graphs
MapReduce:
[Figure: input pairs <k1, v1> go to mappers, which emit <k2, v2>; reducers consume <k2, v2> and emit <k3, v3>]
Beyond MapReduce:
 Think like a vertex
• BSP
• Vertex-centric
 Think like a graph
• GRAPE (partial evaluation + partitioned parallelism)
MapReduce and beyond: Models and algorithms
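The MapReduce data flow on this slide can be imitated in a single process to see how the key-value pairs move. The sketch below is a toy simulation, not a distributed implementation; the names `map_reduce`, `in_degree_mapper` and `in_degree_reducer` are hypothetical, and counting in-degrees is just a small graph task chosen for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over <k1, v1> pairs, group by k2, then run reducer per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):       # map: <k1, v1> -> list of <k2, v2>
            groups[k2].append(v2)           # shuffle: group values by k2
    output = []
    for k2, values in groups.items():
        output.extend(reducer(k2, values))  # reduce: <k2, [v2]> -> list of <k3, v3>
    return output


def in_degree_mapper(node, successors):
    # Emit one count for every edge that points at a successor.
    for succ in successors:
        yield succ, 1


def in_degree_reducer(node, ones):
    yield node, sum(ones)


adjacency = [("a", ["b", "c"]), ("b", ["c"]), ("c", [])]
in_degrees = dict(map_reduce(adjacency, in_degree_mapper, in_degree_reducer))
print(in_degrees)   # {'b': 1, 'c': 2}
```

Node a receives no count because no edge points to it. A real deployment would run mappers and reducers on different machines, which is where the communication cost of the shuffle step comes from.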
36
Query preserving graph compression
Making big graphs small
18 times faster on average for reachability queries
37
Answering queries using views
The cost of query processing: f(|G|, |Q|)
Can we compute Q(G) without accessing G, i.e., independent of |G|?
Q(G) = Q’(V(G))
Answering graph pattern queries on big social graphs:
 Regardless of how big G is – the cost is “independent” of G
 V(G ) is often much smaller than G (4% -- 12% on real-life data)
Improvement: 31 times faster for graph pattern matching
The complexity is no longer a function of |G|
38
Incremental graph pattern matching
Incremental query answering
 Input: Q, G, Q(G), ∆G
 Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
 incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
The updating cost that is inherent to the incremental problem itself
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Incremental graph simulation: bounded
Complexity analysis in terms of the size of changes
39
Bounded evaluability


Input: A class 𝒬 of queries
Question: Can we find, for any query Q ∈ 𝒬 and any (possibly big) graph G, a subgraph GQ of G such that
• |GQ| ≤ M, and
• Q(G) = Q(GQ)?
[Figure: Q(G) = Q(GQ), with GQ independent of the size of G]
60% of graph pattern queries are boundedly evaluable
Improvement: 4 orders of magnitude
Making the cost of computing Q(G) independent of |G|!
40
When exact answers are beyond reach
It may not be possible to compute exact answers. Is it still
feasible to answer such queries on big data?

Yes, approximate query answering
When exact algorithms are infeasible, we find inexact
algorithms with performance guarantees – can’t be too far!
•
Query-driven approximation
•
Data-driven approximation

Personalized social search: reduce graphs of PB size to GB (1.5 * 10^-6)
Yes, querying big graphs is feasible!
41
Summary and review
 Why study graph queries?
 What are the differences between graph queries and relational
queries?
 What are the main challenges for querying graphs?
 Give examples of the following:
– graph search
– keyword search
– kNN join
– graph pattern matching
 Name a few applications of graph queries
42