Lecture note 5

TDD: Topics in Distributed Databases
Querying Big Data: Theory and Practice
 Theory
– Tractability revisited for querying big data
– Parallel scalability
– Bounded evaluability
 Techniques
– Parallel algorithms
– Bounded evaluability and access constraints
– Query-preserving compression
– Query answering using views
– Bounded incremental query processing
Fundamental question
To query big data, we have to determine whether it is feasible at all.
For a class Q of queries, can we find an algorithm T such that given
any Q in Q and any big dataset D, T efficiently computes the
answers Q(D) of Q in D within our available resources?
Is this feasible or not for Q?

 Tractability revisited for querying big data
 Parallel scalability
 Bounded evaluability
New theory for querying big data
BD-tractability
The good, the bad and the ugly

Traditional computational complexity theory of almost 50 years:
• The good: polynomial-time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data?
Using an SSD with a read rate of 6 GB/s, a linear scan of a dataset D would take
• 1.9 days when D is of 1 PB (10^15 bytes)
• 5.28 years when D is of 1 EB (10^18 bytes)
O(n) time is already beyond reach on big data in practice!
Polynomial-time queries become intractable on big data!
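As a quick sanity check of these numbers, a minimal back-of-the-envelope calculation (assuming the 6 GB/s scan rate above):

```python
# Back-of-the-envelope scan times for a single reader at 6 GB/s.
SCAN_RATE = 6e9  # bytes per second, the rate assumed above

for label, size in [("1 PB", 1e15), ("1 EB", 1e18)]:
    seconds = size / SCAN_RATE
    print(f"{label}: {seconds / 86400:.2f} days = {seconds / (86400 * 365):.2f} years")
# 1 PB: about 1.9 days; 1 EB: about 1929 days, i.e., roughly 5.28 years
```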
Complexity classes within P
Polynomial time algorithms are no longer tractable on big data.
So we may consider “smaller” complexity classes.
 NC (Nick’s class): highly parallel feasible
• parallel polylog time, i.e., log^k(n) parallel time
• polynomially many processors
BIG open problem: P = NC? (as hard as whether P = NP)
 L: O(log n) space
 NL: nondeterministic O(log n) space
 polylog-space: log^k(n) space
L ⊆ NL ⊆ polylog-space, NC ⊆ P
Too restrictive to include practical queries feasible on big data
Tractability revisited for queries on big data
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
 for any database D on which queries of Q are defined, D’ = Π(D), and
 for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC), i.e., in parallel log^k(|D|, |Q|) time
[Figure: D is preprocessed once into Π(D), on which Q1(Π(D)), Q2(Π(D)), … are evaluated]
Does it work? If a linear scan of D could be done in log(|D|) time:
 15 seconds when D is of 1 PB, instead of 1.99 days
 18 seconds when D is of 1 EB, rather than 5.28 years
BD-tractable queries are feasible on big data
BD-tractable queries
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
 for any database D on which queries of Q are defined, D’ = Π(D) (what is the maximum size of D’?), and
 for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC)
Preprocessing is a common practice of database people, carried out in parallel with more resources:
 a one-time process, offline, once for all queries in Q
 indices, compression, views, incremental computation, …
 it does not necessarily reduce the size of D
BDTQ0: the set of all BD-tractable query classes
What query classes are BD-tractable?
Boolean selection queries
 Input: A dataset D
 Query: Does there exist a tuple t in D such that t[A] = c?
Build a B+-tree on the A-column values in D. Then all such selection
queries can be answered in O(log(|D|)) time.
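A minimal sketch of this preprocess-then-search pattern, using a sorted array and binary search (Python's bisect) as a stand-in for the B+-tree; the relation and attribute are made up for illustration:

```python
from bisect import bisect_left

def preprocess(D, A):
    """PTIME preprocessing, done once: sort the A-column values of D
    (a stand-in for building a B+-tree on attribute A)."""
    return sorted(t[A] for t in D)

def exists(index, c):
    """Boolean selection: is there a tuple t in D with t[A] = c?  O(log |D|)."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c

# Hypothetical dataset
D = [{"A": 3, "B": "x"}, {"A": 7, "B": "y"}, {"A": 7, "B": "z"}]
idx = preprocess(D, "A")
print(exists(idx, 7), exists(idx, 5))  # True False
```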
Graph reachability queries (NL-complete)
 Input: A directed graph G
 Query: Does there exist a path from node s to t in G?
What else?
Relational algebra + set recursion on ordered relational databases
D. Suciu and V. Tannen: A query language for NC, PODS 1994
Some natural query classes are BD-tractable
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
Breadth-Depth Search (BDS): start at a node s and visit all its children, pushing them onto a stack in the reverse order induced by the vertex numbering; after all of s’s children are visited, continue with the node on top of the stack, which then plays the role of s.
 Input: An unordered graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
 Question: Is u visited before v in the breadth-depth search of G?
Is this problem (query class) BD-tractable, say with D empty and Q being the whole instance (G, (u, v))?
No. The problem is well known to be P-complete (P-complete: among the hardest problems in P under NC reductions)!
 We need PTIME to process each query (G, (u, v))!
 Preprocessing does not help us answer such queries.
Can we make it BD-tractable?
Make queries BD-tractable
Factorization: partition instances to identify a data part D for
preprocessing, and a query part Q for operations
Breadth-Depth Search (BDS)
 Input: An unordered graph G = (V, E) with a numbering on its
nodes, and a pair (u, v) of nodes in V
 Question: Is u visited before v in the breadth-depth search of G?
Factorization: D is G = (V, E), Q is (u, v)
 Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of the nodes of V in the order in which they are visited
 For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time
BDTQ: the set of all query classes that can be made BD-tractable after proper factorization
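A minimal sketch of this factorization (the adjacency-list representation, the start node, and the constant-time position map used in place of a binary search on M are assumptions for illustration):

```python
def bds_order(G, s):
    """Breadth-depth search from s.  G: dict node -> list of neighbours.
    Visit all children of the current node (in numbering order), push them
    onto a stack in reverse order, then continue from the top of the stack."""
    visited, M, stack = {s}, [s], [s]
    while stack:
        v = stack.pop()
        children = [w for w in sorted(G.get(v, [])) if w not in visited]
        for w in children:                 # visit all children of v now
            visited.add(w)
            M.append(w)
        stack.extend(reversed(children))   # reverse numbering order: smallest on top
    return M

def preprocess(G, s):
    """Data part: Pi(G) -- computed once, in PTIME."""
    return {v: i for i, v in enumerate(bds_order(G, s))}

def visited_before(pos, u, v):
    """Query part: is u visited before v?  Constant time with the position map."""
    return pos[u] < pos[v]

G = {1: [2, 3], 2: [4], 3: [4], 4: []}     # hypothetical numbered graph
pos = preprocess(G, 1)
print(visited_before(pos, 3, 4))           # True: the visit order is 1, 2, 3, 4
```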
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query
classes are tractable on big data.
Are we done yet? No, a number of questions come with any complexity class:
 Reductions: how to transform a problem into another in the class that we know how to solve, and hence make it BD-tractable? (Why do we need reductions? They are analogous to those for our familiar NP-complete problems.)
 Complete problems: is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the complexity class can be reduced? (Name one NP-complete problem that you know.)
 How large is BDTQ? How large is BDTQ0? Compared to P? To NC? (Why do we care?)
These questions are fundamental to any complexity class: P, NP, …
Reductions
Departing from our familiar polynomial-time reductions, we need reductions that are in NC and that deal with both the data D and the query Q: transformations for making queries BD-tractable, used to determine whether a query class is BD-tractable by transforming a given problem into one that we know how to solve.
 NC-factor reductions ≤NC: a pair of NC functions that allow refactorizations (repartitioning the data part and the query part), for BDTQ
 F-reductions ≤F: a pair of NC functions that do not allow refactorizations, for BDTQ0
Properties:
 transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then Q1 ≤NC Q3 (similarly for ≤F)
 compatibility:
• if Q1 ≤NC Q2 and Q2 is in BDTQ, then so is Q1
• if Q1 ≤F Q2 and Q2 is in BDTQ0, then so is Q1
Complete problems for BDTQ
 A query class Q is complete for BDTQ if Q is in BDTQ and, moreover, for any query class Q’ in BDTQ, Q’ ≤NC Q
 A query class Q is complete for BDTQ0 if Q is in BDTQ0 and, for any query class Q’ in BDTQ0, Q’ ≤F Q
Is there a complete problem for BDTQ?
Yes: there exists a natural query class that is complete for BDTQ, namely Breadth-Depth Search (BDS).
What does this tell us? BDS is both P-complete and BDTQ-complete!
Is there a complete problem for BDTQ0?
A query class Q is complete for BDTQ0 if Q is in BDTQ0 and, for any query class Q’ in BDTQ0, Q’ ≤F Q
 An open problem
 Unless P = NC, a query class complete for BDTQ0 would be a witness for P \ NC
 Whether P = NC is as hard as whether P = NP
 If we could find a complete problem for BDTQ0 and show that it is not in NC, then we would settle the big open question of whether P = NC
It is hard to find a complete problem for BDTQ0
Comparing with P and NC
How large is BDTQ? How large is BDTQ0?
 NC ⊆ BDTQ = P: all PTIME query classes can be made BD-tractable, but proper factorizations are needed to answer PTIME queries on big data
 Unless P = NC, NC ⊊ BDTQ0 ⊊ P: unless P = NC, not all PTIME query classes are BD-tractable, i.e., BDTQ0 is properly contained in P
[Figure: PTIME queries split into BD-tractable and not BD-tractable ones]
Not all polynomial-time queries are BD-tractable
Polynomial hierarchy revised
[Figure: the hierarchy revised for big data: parallel polylog time (NC) inside the BD-tractable queries, the BD-tractable and not BD-tractable queries inside P, and NP and beyond above P]
Tractability revisited for querying big data
What can we get from BD-tractability?
Guidelines for the following:
 What query classes are feasible on big data? (BDTQ0)
 What query classes can be made feasible to answer on big data? (BDTQ)
 How to determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Qc for BDTQ via ≤NC
 If so, how to answer the queries in Q?
• Identify factorizations (NC reductions) such that Q ≤NC Qc
• Compose the reduction with the algorithm for answering queries of Qc
Why we need to study theory for querying big data
Parallel scalability
Parallel query answering
BD-tractability is hard to achieve.
Parallel processing is widely used, given more resources
Using 10,000 SSDs of 6 GB/s each, a linear scan of D might take:
 1.9 days / 10,000 ≈ 16 seconds when D is of 1 PB (10^15 bytes)
 5.28 years / 10,000 ≈ 4.63 hours when D is of 1 EB (10^18 bytes)
Only ideally!
[Figure: a shared-nothing architecture with 10,000 processors (P), each with its own memory (M) and database (DB), connected by an interconnection network]
Parallel scalable: the more processors, the “better”? How do we define “better”?
Degree of parallelism -- speedup
Speedup: for a given task, TS/TL,
 TS: time taken by a traditional DBMS
 TL: time taken by a parallel system with more resources
 TS/TL: more resources mean proportionally less time for a given task
 Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system
[Figure: speed (throughput, 1/response time) vs. resources, with linear speedup as the reference line]
Question: can we do better than linear speedup?
Degree of parallelism -- scaleup
Scaleup: TS/TL
 A task Q, and a task QN, N times bigger than Q
 A DBMS MS, and a parallel DBMS ML, N times larger
 TS: time taken by MS to execute Q
 TL: time taken by ML to execute QN
 Linear scaleup: TL = TS, i.e., the time remains constant if the resources increase in proportion to the increase in problem size
[Figure: TS/TL vs. resources and problem size, with linear scaleup as the flat reference line]
Question: can we do better than linear scaleup?
Better than linear scaleup/speedup?
NO, it is even hard to achieve linear speedup/scaleup! (Give 3 reasons.)
 Startup costs: initializing each process
 Interference: competing for shared resources (network, disk, memory or even locks); think of blocking in MapReduce
 Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part
 Data shipment cost for shared-nothing architectures
In the real world, linear scaleup is too ideal to get!
A weaker criterion: the more processors are available, the less response time it takes.
Linear speedup is the best we can hope for -- it is optimal!
Parallel query answering
Given a big graph G, and n processors S1, …, Sn
G is partitioned into fragments (G1, …, Gn)
G is distributed to n processors: Gi is stored at Si
Parallel query answering
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Performance (what is it, and why do we need to worry about it?)
 Response time (aka parallel computation cost): the interval from the time when Q is submitted to the time when Q(G) is returned
 Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages
Performance guarantees: bounds on response time and data shipment
Parallel scalability
 Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
 Output: Q(G), the answer to Q in G
Complexity:
 t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
 T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors (including the cost of data shipment; k below is a constant)
Parallel scalable: if T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k), i.e., a polynomial reduction of the sequential cost
When G is big, we can still query G by adding more processors, if we can afford them
A distributed algorithm is useful if it is parallel scalable
Linear scalability
An algorithm T for answering a class Q of queries
 Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
 Output: Q(G), the answer to Q in G
Algorithm T is linearly scalable in
 computation if its parallel complexity is a function of |Q| and |G|/n (the more processors, the less response time), and in
 data shipment if the total amount of data shipped is a function of |Q| and n, independent of the size |G| of the big G
Is it always possible?
Querying big data by adding more processors
Graph pattern matching via graph simulation
 Input: a graph pattern Q and a graph G
 Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G such that (u, v) ∈ S, and
• for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G such that (u’, v’) ∈ S
Parallel scalable? Sequentially, graph simulation takes O((|V| + |VQ|)(|E| + |EQ|)) time.
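A minimal sequential sketch of graph simulation as a naive fixpoint computation (not the optimized O((|V| + |VQ|)(|E| + |EQ|)) algorithm); the graph representation and label-based match function are assumptions for illustration:

```python
def graph_simulation(Qnodes, Qedges, Gnodes, Gedges, match):
    """Return the maximum simulation relation S: start from all label-compatible
    pairs and repeatedly remove pairs that violate the edge condition."""
    gsucc = {v: set() for v in Gnodes}
    for v, w in Gedges:
        gsucc[v].add(w)
    S = {(u, v) for u in Qnodes for v in Gnodes if match(u, v)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            for (x, u2) in Qedges:
                # every pattern edge (u, u2) must be matched by some (v, w) with (u2, w) in S
                if x == u and not any((u2, w) in S for w in gsucc[v]):
                    S.discard((u, v))
                    changed = True
                    break
    return S

# Hypothetical labelled graphs; match tests label equality.
Qn, Qe = {"A", "B"}, {("A", "B")}
Gn, Ge = {1, 2, 3}, {(1, 2), (3, 2)}
labels = {1: "A", 2: "B", 3: "A"}
print(graph_simulation(Qn, Qe, Gn, Ge, lambda u, v: labels[v] == u))
# pairs ('A', 1), ('A', 3), ('B', 2)
```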
Impossibility
There exists NO algorithm for distributed graph simulation that is parallel scalable in either
 computation, or
 data shipment
Why? Consider a pattern of 2 nodes and a graph of 2n nodes distributed to n processors.
Possibility: when G is a tree, graph simulation is parallel scalable in both response time and data shipment
Nontrivial to develop parallel scalable algorithms
Weak parallel scalability
Algorithm T is weakly parallel scalable in
 computation if its parallel computation cost is a function of |Q|, |G|/n and |Ef|, and in
 data shipment if the total amount of data shipped is a function of |Q| and |Ef|
where Ef is the set of edges across different fragments.
Rationale: we can partition G as preprocessing, such that
 |Ef| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and
 when G grows, |Ef| does not increase substantially
So the cost is not a function of |G| in practice.
Doable: graph simulation is weakly parallel scalable
MRC: Scalability of MapReduce algorithms
Characterize scalable MapReduce algorithms in terms of disk usage,
memory usage, communication cost, CPU cost and rounds.
For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if
 Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total.
 Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total.
 Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) amount of data, O(|D|^(2-2ε)) in total.
 CPU: in each round, each machine takes time polynomial in |D|.
 The number of rounds: polylog in |D|, that is, log^k(|D|).
Note: the larger D is, the more machines are required, and the response time is still a polynomial in |D|.
MMC: a revision of MRC
For a constant ε > 0 and a dataset D, with n machines, a MapReduce algorithm is in MMC if
 Disk: each machine uses O(|D|/n) disk, O(|D|) in total.
 Memory: each machine uses O(|D|/n) memory, O(|D|) in total.
 Data shipment: in each round, each machine sends or receives O(|D|/n) amount of data, O(|D|) in total.
 CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine.
 The number of rounds: O(1), a constant number of rounds.
Speedup: O(Ts/n) time, i.e., the more machines are used, the less time is taken.
(What algorithms are in MRC? Recursive computation? Note that this is restricted to MapReduce.)
Compare with BD-tractability and parallel scalability.
Bounded evaluability
Scale independence
 Input: A class Q of queries
 Question: Can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M (a bound independent of the size of D), and
• Q(D) = Q(DQ)?
[Figure: Q(D) is computed as Q(DQ) for a small fraction DQ of D]
Particularly useful for
 a single dataset D, e.g., the social graph of Facebook
 a minimum DQ, i.e., the necessary amount of data for answering Q
Making the cost of computing Q(D) independent of |D|!
Facebook: Graph Search
 Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
Access constraints (real-life limits)
 Facebook: 5000 friends per person
 Each year has at most 366 days
 Each person dines at most once per day
 pid is a key for relation person
How many tuples do we need to access?
Bounded query evaluation
 Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
A query plan
 Fetch 5000 pid’s for friends of p0 -- 5000 friends per person
 For each pid, check whether she lives in NYC -- 5000 person tuples
 For each pid living in NYC, find the restaurants where they dined in 2013 -- 5000 * 366 tuples at most
Accessing 5000 + 5000 + 5000 * 366 tuples in total, in contrast to the full Facebook graph of more than 1.38 billion nodes and over 140 billion links
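A minimal sketch of this bounded query plan, with in-memory dictionaries standing in for the access-constraint indexes; the relation contents and p0 are hypothetical:

```python
friends_of = {"p0": ["p1", "p2", "p3"]}           # friend: pid1 -> (pid2, <= 5000)
city_of = {"p1": "NYC", "p2": "LA", "p3": "NYC"}  # person: pid -> (city, 1)
dined_at = {("p1", 2013): ["r1", "r2"],           # dine: (pid, yy) -> (rid, <= 366)
            ("p3", 2013): ["r2"]}

def restaurants_friends_visited(p0, year, city):
    """Accesses at most 5000 + 5000 + 5000 * 366 tuples, whatever the size of D."""
    answer = set()
    for pid in friends_of.get(p0, []):                     # <= 5000 friends
        if city_of.get(pid) == city:                       # 1 person tuple per friend
            answer.update(dined_at.get((pid, year), []))   # <= 366 dine tuples
    return answer

print(restaurants_friends_visited("p0", 2013, "NYC"))  # {'r1', 'r2'}
```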
Access constraints
Combining cardinality constraints and indexes
On a relation schema R: X → (Y, N)
 X, Y: sets of attributes of R
 for any X-value, there exist at most N distinct Y-values
 index on X for Y: given an X-value, find the relevant Y-values
Examples
 friend(pid1, pid2): pid1 → (pid2, 5000), since there are at most 5000 friends per person
 dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366), since each year has at most 366 days and each person dines at most once per day
 person(pid, name, city): pid → (city, 1), since pid is a key for relation person
Access schema: a set of access constraints
Finding access schema
On a relation schema R: X → (Y, N)
 Functional dependencies X → Y: X → (Y, 1)
 Keys X: X → (R, 1)
 Domain constraints, e.g., each year has at most 366 days
 Real-life bounds: 5000 friends per person (Facebook)
 The semantics of real-life data, e.g., accidents in the UK from 1975-2005:
• (dd, mm, yy) → (aid, 610): at most 610 accidents in a day
• aid → (vid, 192): at most 192 vehicles in an accident
 Discovery (how to find these?): an extension of functional dependency discovery, e.g., TANE
Bounded evaluability needs only a small number of access constraints
Bounded queries
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?
Examples (what are these?)
 The graph search query at Facebook
 All Boolean conjunctive queries are bounded
– Boolean: Q(D) is true or false
– Conjunctive: SPC, i.e., selection, projection, Cartesian product
But how to find DQ?
Boundedness: to decide whether it is possible to compute Q(D) by accessing a bounded amount of data at all
Boundedly evaluable queries
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D) = Q(DQ), and moreover,
• DQ can be identified (effectively found) in time determined by Q and A alone?
Examples
 The graph search query at Facebook
 All Boolean conjunctive queries are bounded, but are not necessarily effectively bounded!
If Q is boundedly evaluable, then for any big D we can efficiently compute Q(D) by accessing a bounded amount of data!
Deciding bounded evaluability
 Input: A query Q, an access schema A
 Question: Is Q boundedly evaluable under A?
 Conjunctive queries (SPC) with restricted query plans: yes, doable
• Characterization: sound and complete rules
• PTIME algorithms for checking effective boundedness and for generating query plans, in |Q| and |A|
 Relational algebra (SQL): undecidable. What can we do?
• Special cases
• Sufficient conditions
• e.g., parameterized queries in recommendation systems, even in SQL
Many practical queries are in fact boundedly evaluable!
Techniques for querying big data
An approach to querying big data
Given a query Q, an access schema A and a big dataset D:
1. Decide whether Q is effectively bounded under A
2. If so, generate a bounded query plan for Q
3. Otherwise, do one of the following:
① Extend the access schema, or instantiate some parameters of Q, to make Q effectively bounded
② Use other tricks to make D small (to be seen shortly)
③ Compute approximate query answers to Q in D
In experiments:
 77% of conjunctive queries are boundedly evaluable; efficiency: 9 seconds vs. 14 hours of MySQL
 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable; improvement: 4 orders of magnitude
Very effective for conjunctive queries
Bounded evaluability using views
 Input: A class Q of queries, a set V of views, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D) = Q’(DQ, V(D)) for some rewriting Q’ of Q using V, and
• DQ can be identified in time determined by Q, V, and A?
That is, access the views plus an additional bounded amount of data.
[Figure: Q(D) is computed as Q’(DQ, V(D)) for a small fraction DQ of D]
Query Q may not be boundedly evaluable, but may be boundedly evaluable with views!
Incremental bounded evaluability
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q, any dataset D, and any changes ∆D to D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D ⊕ ∆D) = Q(D) ⊕ Q(∆D, DQ), and
• DQ can be identified in time determined by Q and A?
That is, on top of the old output Q(D), access only an additional bounded amount of data.
[Figure: Q(D ⊕ ∆D) is computed as Q(D) ⊕ Q(DQ, ∆D), reusing the old output]
Query Q may not be boundedly evaluable, but may be incrementally boundedly evaluable!
Parallel query processing
Divide and conquer:
 partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
 upon receiving a query Q,
• evaluate Q(Gi) on the smaller fragments Gi, in parallel
• collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
[Figure: Q(G) is assembled from the partial answers Q(G1), Q(G2), …, Q(Gn)]
Graph pattern matching in GRAPE: 21 times faster than MapReduce
Parallel processing = partial evaluation + message passing (see the sketch below)
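A minimal sketch of the divide-and-conquer pattern, using a multiprocessing pool as a stand-in for the n sites and a toy selection query (for which the partial answers simply union together); GRAPE's partial evaluation and message passing are not modelled here:

```python
from multiprocessing import Pool

def evaluate(fragment):
    """Hypothetical query Q: select the even numbers of a fragment."""
    return [x for x in fragment if x % 2 == 0]

def parallel_answer(fragments):
    """Evaluate Q on every fragment in parallel, then assemble Q(G)."""
    with Pool(len(fragments)) as pool:
        partial = pool.map(evaluate, fragments)     # Q(G1), ..., Q(Gn)
    return [x for part in partial for x in part]    # assemble at the coordinator

if __name__ == "__main__":
    G = list(range(20))
    fragments = [G[i::4] for i in range(4)]         # partition G into 4 fragments
    print(parallel_answer(fragments))               # the even numbers of G
```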
Query preserving compression
The cost of query processing is f(|G|, |Q|): can we reduce the parameter |G|?
Query-preserving compression <R, P> for a class L of queries:
 For any data collection G, Gc = R(G) (compressing)
 For any Q in L, Q(G) = P(Q, Gc) (post-processing)
[Figure: G is compressed by R into Gc, and Q(G) is computed as Q(Gc)]
 In contrast to lossless compression, retain only the information relevant for answering queries in L: a better compression ratio!
 No need to restore the original graph G or to decompress the data: query preserving!
18 times faster on average for reachability queries
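One concrete instance of such a pair <R, P>, given here as an illustration rather than the scheme from the literature: for reachability queries, collapsing each strongly connected component (SCC) of G preserves every answer, so R builds the SCC condensation and P answers reach(s, t) on the smaller graph:

```python
def compress(G):
    """R: collapse each SCC of G (Kosaraju's algorithm).
    G: dict mapping each node to an iterable of its successors."""
    order, visited = [], set()

    def dfs(u):                      # first pass: record finishing order
        visited.add(u)
        for w in G.get(u, ()):
            if w not in visited:
                dfs(w)
        order.append(u)

    for u in list(G):
        if u not in visited:
            dfs(u)

    rev = {}                         # reverse graph
    for u, succs in G.items():
        for v in succs:
            rev.setdefault(v, []).append(u)

    scc_of = {}

    def assign(u, cid):              # second pass on the reverse graph
        scc_of[u] = cid
        for w in rev.get(u, ()):
            if w not in scc_of:
                assign(w, cid)

    cid = 0
    for u in reversed(order):
        if u not in scc_of:
            assign(u, cid)
            cid += 1

    Gc = {c: set() for c in range(cid)}   # condensed (compressed) graph
    for u, succs in G.items():
        for v in succs:
            if scc_of[u] != scc_of[v]:
                Gc[scc_of[u]].add(scc_of[v])
    return Gc, scc_of

def reachable(Gc, scc_of, s, t):
    """P: reach(s, t) holds in G iff scc(t) is reachable from scc(s) in Gc."""
    seen, stack = set(), [scc_of[s]]
    while stack:
        c = stack.pop()
        if c == scc_of[t]:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(Gc[c])
    return False

# Hypothetical graph: {1, 2, 3} form a cycle (one SCC), 4 hangs off it.
G = {1: [2], 2: [3], 3: [1, 4], 4: []}
Gc, scc_of = compress(G)
print(reachable(Gc, scc_of, 2, 4), reachable(Gc, scc_of, 4, 1))  # True False
```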
Answering queries using views
The cost of query processing is f(|G|, |Q|): can we compute Q(G) without accessing G, i.e., independently of |G|?
Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
 Q and Q’ are equivalent: for any G, Q(G) = Q’(G), and
 Q’ only accesses V(G)
[Figure: Q(G) is computed as Q’(V(G))]
V(G) is often much smaller than G (4%-12% on real-life data)
Improvement: 31 times faster for graph pattern matching
The complexity is no longer a function of |G|
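A minimal relational sketch of the idea: a view V1 materializes, for every person p, the NYC restaurants that p's friends visited in 2013; the earlier graph-search query is then rewritten to a selection Q' over V1(G) that never touches the base relations (relation names and data are hypothetical):

```python
def materialize_view(friend, person, dine):
    """V1(G): set of (p, rid) pairs -- computed once, offline."""
    city = dict(person)                        # pid -> city
    return {(p, rid)
            for (p, q) in friend
            for (q2, rid, yy) in dine
            if q2 == q and yy == 2013 and city.get(q) == "NYC"}

def rewritten_query(V1, p0):
    """Q'(V(G)): only accesses the view, never G itself."""
    return {rid for (p, rid) in V1 if p == p0}

friend = {("p0", "p1"), ("p0", "p2")}
person = {("p1", "NYC"), ("p2", "LA")}
dine = {("p1", "r1", 2013), ("p1", "r2", 2012), ("p2", "r3", 2013)}
V1 = materialize_view(friend, person, dine)
print(rewritten_query(V1, "p0"))  # {'r1'}
```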
Incremental query answering
 Real-life data is dynamic: it constantly changes (∆G), e.g., 5%/week for Web graphs
 Re-compute Q(G⊕∆G) starting from scratch?
 Changes ∆G are typically small
Compute Q(G) once, and then incrementally maintain it.
Incremental query processing:
 Input: Q, G, the old output Q(G), and the changes ∆G to the input
 Output: the changes ∆M to the output, such that Q(G⊕∆G) = Q(G) ⊕ ∆M (the new output)
When the changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G⊕∆G).
At least twice as fast for pattern matching, for changes up to 10%
Minimizing unnecessary recomputation (see the sketch below)
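A minimal sketch of incremental maintenance for a toy selection query: the update touches only the changed tuples ∆G, never the full dataset (the query and data are hypothetical):

```python
def eval_query(D):
    """Batch evaluation: all persons living in NYC."""
    return {pid for (pid, city) in D if city == "NYC"}

def incremental(answer, inserted, deleted):
    """Compute ∆M from ∆G = (inserted, deleted) and apply it to the old answer.
    Cost depends only on |∆G|, not on |D|."""
    plus = {pid for (pid, city) in inserted if city == "NYC"}
    minus = {pid for (pid, city) in deleted if city == "NYC"}
    return (answer | plus) - minus

D = {("p1", "NYC"), ("p2", "LA")}
ans = eval_query(D)                                           # {'p1'}
ans = incremental(ans, {("p3", "NYC")}, {("p1", "NYC")})      # apply ∆G
assert ans == eval_query({("p2", "LA"), ("p3", "NYC")})       # same as recomputing
```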
Complexity of incremental problems
Incremental query answering
 Input: Q, G, Q(G), ∆G
 Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of batch query processing: a function of |G| and |Q|
 Incremental algorithms: the cost is measured in |CHANGED|, that is, the size of the changes in
• the input: ∆G, and
• the output: ∆M
i.e., the updating cost that is inherent to the incremental problem itself
Bounded: is the cost expressible as f(|CHANGED|, |Q|)?
Incremental graph simulation: bounded
Complexity analysis in terms of the size of changes
A principled approach: Making big data small
 Boundedly evaluable queries
 Parallel query processing (MapReduce, GRAPE, etc.)
 Query-preserving compression: convert big data to small data
 Query answering using views: make big data small
 Bounded incremental query answering: depend on the size of the changes rather than the size of the original big data
 …
Including but not limited to graph queries
Yes, MapReduce is useful, but it is not the only way!
Combinations of these can do much better than MapReduce!
Summary and Review
 What is BD-tractability? Why do we care about it?
 What is parallel scalability? Name a few parallel scalable algorithms
 What is bounded evaluability? Why do we want to study it?
 How to make big data “small”?
 Is MapReduce the only way for querying big data? Can we do better than it?
 What is query-preserving data compression? Query answering using views? Bounded incremental query answering?
 If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?
Projects (1)
Prove or disprove that one of the following query classes (pick one of these) is
• BD-tractable,
• parallel scalable, or
• in MMC
If so, give an algorithm as a proof. Otherwise, prove the
impossibility but identify practical sub-classes that are scalable.
The query classes include
• Distance queries on graphs
• Graph pattern matching by subgraph isomorphism
• Graph pattern matching by graph simulation
• Subgraph isomorphism and graph simulation on trees
 Experimentally evaluate your algorithms

Both impossibility and possibility results are useful!
Projects (2)
Improve the performance of graph pattern matching via subgraph isomorphism, using one of the following approaches:
• query-preserving graph compression
• query answering using views
 Prove the correctness of your algorithm, give complexity analysis
and provide performance guarantees
 Experimentally evaluate your algorithm and demonstrate the
improvement

A research and development project
Projects (3)

It is known that graph pattern matching via graph simulation can
benefit from:
• query-preserving graph compression
• query answering using views
W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression,
SIGMOD, 2012. (query-preserving compression)
W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using
Views, ICDE 2014. (query answering using views)
Implement one of the algorithms
 Experimentally evaluate your algorithm and demonstrate the
improvement
 Bonus: can you combine the two approaches and verify the benefit?
A development project
Projects (4)
Find an application with a set of SPC (conjunctive) queries and a
dataset
 Identify access constraints on your dataset for your queries
 Implement an algorithm that, given a query in your class, decides whether the query is boundedly evaluable under your access constraints
 If so, generate a query plan to evaluate your queries by
accessing a bounded amount of data
 Experimentally evaluate your algorithm and demonstrate the
improvement

A development project
Projects (5)
 Write a survey on techniques for querying big data, covering
• parallel query processing,
• data compression
• query answering using views
• incremental query processing
• …
Survey:
• A set of 5-6 representative papers
• A set of criteria for evaluation
• Evaluate each model based on the criteria
• Make recommendation: what to use in different applications
Develop a good understanding on the topic
Reading: data quality
 W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012. (available upon request)
– Data consistency (Chapter 2)
– Entity resolution (record matching; Chapter 4)
– Information completeness (Chapter 5)
– Data currency (Chapter 6)
– Data accuracy (SIGMOD 2013 paper)
– Deducing the true values of objects in data fusion (Chapter 7)
Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html
1. M. Arenas, L. E. Bertossi, J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
2. Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
4. W. Fan and F. Geerts. Relative information completeness. PODS 2009.
5. Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.
6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.