Lecture note 5

TDD: Topics in Distributed Databases
Querying Big Data: Theory and Practice
 Theory
– Tractability revisited for querying big data
– Parallel scalability
– Bounded evaluability
 Techniques
– Parallel algorithms
– Bounded evaluability and access constraints
– Query-preserving compression
– Query answering using views
– Bounded incremental query processing
Fundamental question
To query big data, we have to determine whether it is feasible at all.
For a class Q of queries, can we find an algorithm T such that given
any Q in Q and any big dataset D, T efficiently computes the
answers Q(D) of Q in D within our available resources?
Is this feasible or not for Q?

 Tractability revisited for querying big data
 Parallel scalability
 Bounded evaluability
New theory for querying big data
BD-tractability
The good, the bad and the ugly

Traditional computational complexity theory of almost 50 years:
• The good: polynomial-time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data?
Using an SSD with a read rate of 6 GB/s, a linear scan of a dataset D would take
• 1.9 days when D is of 1 PB (10^15 bytes)
• 5.28 years when D is of 1 EB (10^18 bytes)
O(n) time is already beyond reach on big data in practice!
Polynomial-time queries become intractable on big data!
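As a quick sanity check of these numbers, a minimal back-of-the-envelope calculation (assuming the 6 GB/s scan rate above):

```python
# Back-of-the-envelope scan times for a single reader at 6 GB/s.
SCAN_RATE = 6e9  # bytes per second, the rate assumed above

for label, size in [("1 PB", 1e15), ("1 EB", 1e18)]:
    seconds = size / SCAN_RATE
    print(f"{label}: {seconds / 86400:.2f} days = {seconds / (86400 * 365):.2f} years")
# 1 PB: about 1.9 days; 1 EB: about 1929 days, i.e., roughly 5.28 years
```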
Complexity classes within P
Polynomial time algorithms are no longer tractable on big data.
So we may consider “smaller” complexity classes.
 NC (Nick’s class): highly parallel feasible
• parallel polylog time, i.e., log^k(n) parallel time
• polynomially many processors
BIG open problem: P = NC? (as hard as whether P = NP)
 L: O(log n) space
 NL: nondeterministic O(log n) space
 polylog-space: log^k(n) space
L ⊆ NL ⊆ polylog-space, NC ⊆ P
Too restrictive to include practical queries feasible on big data
Tractability revisited for queries on big data
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
 for any database D on which queries of Q are defined, D’ = Π(D), and
 for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC), i.e., in parallel log^k(|D|, |Q|) time
[Figure: D is preprocessed once into Π(D), on which Q1(Π(D)), Q2(Π(D)), … are evaluated]
Does it work? If a linear scan of D could be done in log(|D|) time:
 15 seconds when D is of 1 PB, instead of 1.99 days
 18 seconds when D is of 1 EB, rather than 5.28 years
BD-tractable queries are feasible on big data
BD-tractable queries
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
 for any database D on which queries of Q are defined, D’ = Π(D) (what is the maximum size of D’?), and
 for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC)
Preprocessing is a common practice of database people, carried out in parallel with more resources:
 a one-time process, offline, once for all queries in Q
 indices, compression, views, incremental computation, …
 it does not necessarily reduce the size of D
BDTQ0: the set of all BD-tractable query classes
What query classes are BD-tractable?
Boolean selection queries
 Input: A dataset D
 Query: Does there exist a tuple t in D such that t[A] = c?
Build a B+-tree on the A-column values in D. Then all such selection
queries can be answered in O(log(|D|)) time.
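A minimal sketch of this preprocess-then-search pattern, using a sorted array and binary search (Python's bisect) as a stand-in for the B+-tree; the relation and attribute are made up for illustration:

```python
from bisect import bisect_left

def preprocess(D, A):
    """PTIME preprocessing, done once: sort the A-column values of D
    (a stand-in for building a B+-tree on attribute A)."""
    return sorted(t[A] for t in D)

def exists(index, c):
    """Boolean selection: is there a tuple t in D with t[A] = c?  O(log |D|)."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c

# Hypothetical dataset
D = [{"A": 3, "B": "x"}, {"A": 7, "B": "y"}, {"A": 7, "B": "z"}]
idx = preprocess(D, "A")
print(exists(idx, 7), exists(idx, 5))  # True False
```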
Graph reachability queries (NL-complete)
 Input: A directed graph G
 Query: Does there exist a path from node s to t in G?
What else?
Relational algebra + set recursion on ordered relational databases
D. Suciu and V. Tannen: A query language for NC, PODS 1994
Some natural query classes are BD-tractable
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
Breadth-Depth Search (BDS): start at a node s and visit all its children, pushing them onto a stack in the reverse order induced by the vertex numbering; after all of s’s children are visited, continue with the node on top of the stack, which then plays the role of s.
 Input: An unordered graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
 Question: Is u visited before v in the breadth-depth search of G?
Is this problem (query class) BD-tractable, say with D empty and Q being the whole instance (G, (u, v))?
No. The problem is well known to be P-complete (P-complete: among the hardest problems in P under NC reductions)!
 We need PTIME to process each query (G, (u, v))!
 Preprocessing does not help us answer such queries.
Can we make it BD-tractable?
Make queries BD-tractable
Factorization: partition instances to identify a data part D for
preprocessing, and a query part Q for operations
Breadth-Depth Search (BDS)
 Input: An unordered graph G = (V, E) with a numbering on its
nodes, and a pair (u, v) of nodes in V
 Question: Is u visited before v in the breadth-depth search of G?
Factorization: D is G = (V, E), Q is (u, v)
 Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of the nodes of V in the order in which they are visited
 For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time
BDTQ: the set of all query classes that can be made BD-tractable after proper factorization
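A minimal sketch of this factorization (the adjacency-list representation, the start node, and the constant-time position map used in place of a binary search on M are assumptions for illustration):

```python
def bds_order(G, s):
    """Breadth-depth search from s.  G: dict node -> list of neighbours.
    Visit all children of the current node (in numbering order), push them
    onto a stack in reverse order, then continue from the top of the stack."""
    visited, M, stack = {s}, [s], [s]
    while stack:
        v = stack.pop()
        children = [w for w in sorted(G.get(v, [])) if w not in visited]
        for w in children:                 # visit all children of v now
            visited.add(w)
            M.append(w)
        stack.extend(reversed(children))   # reverse numbering order: smallest on top
    return M

def preprocess(G, s):
    """Data part: Pi(G) -- computed once, in PTIME."""
    return {v: i for i, v in enumerate(bds_order(G, s))}

def visited_before(pos, u, v):
    """Query part: is u visited before v?  Constant time with the position map."""
    return pos[u] < pos[v]

G = {1: [2, 3], 2: [4], 3: [4], 4: []}     # hypothetical numbered graph
pos = preprocess(G, 1)
print(visited_before(pos, 3, 4))           # True: the visit order is 1, 2, 3, 4
```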
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query
classes are tractable on big data.
Are we done yet? No, a number of questions come with any complexity class:
 Reductions: how to transform a problem into another in the class that we know how to solve, and hence make it BD-tractable? (Why do we need reductions? They are analogous to those for our familiar NP-complete problems.)
 Complete problems: is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the complexity class can be reduced? (Name one NP-complete problem that you know.)
 How large is BDTQ? How large is BDTQ0? Compared to P? To NC? (Why do we care?)
These questions are fundamental to any complexity class: P, NP, …
Reductions
Departing from our familiar polynomial-time reductions, we need reductions that are in NC and that deal with both the data D and the query Q: transformations for making queries BD-tractable, used to determine whether a query class is BD-tractable by transforming a given problem into one that we know how to solve.
 NC-factor reductions ≤NC: a pair of NC functions that allow refactorizations (repartitioning the data part and the query part), for BDTQ
 F-reductions ≤F: a pair of NC functions that do not allow refactorizations, for BDTQ0
Properties:
 transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then Q1 ≤NC Q3 (similarly for ≤F)
 compatibility:
• if Q1 ≤NC Q2 and Q2 is in BDTQ, then so is Q1
• if Q1 ≤F Q2 and Q2 is in BDTQ0, then so is Q1
Complete problems for BDTQ
 A query class Q is complete for BDTQ if Q is in BDTQ and, moreover, for any query class Q’ in BDTQ, Q’ ≤NC Q
 A query class Q is complete for BDTQ0 if Q is in BDTQ0 and, for any query class Q’ in BDTQ0, Q’ ≤F Q
Is there a complete problem for BDTQ?
Yes: there exists a natural query class that is complete for BDTQ, namely Breadth-Depth Search (BDS).
What does this tell us? BDS is both P-complete and BDTQ-complete!
Is there a complete problem for BDTQ0?
A query class Q is complete for BDTQ0 if Q is in BDTQ0 and, for any query class Q’ in BDTQ0, Q’ ≤F Q
 An open problem
 Unless P = NC, a query class complete for BDTQ0 would be a witness for P \ NC
 Whether P = NC is as hard as whether P = NP
 If we could find a complete problem for BDTQ0 and show that it is not in NC, then we would settle the big open question of whether P = NC
It is hard to find a complete problem for BDTQ0
Comparing with P and NC
How large is BDTQ? How large is BDTQ0?
 NC ⊆ BDTQ = P: all PTIME query classes can be made BD-tractable, but proper factorizations are needed to answer PTIME queries on big data
 Unless P = NC, NC ⊊ BDTQ0 ⊊ P: unless P = NC, not all PTIME query classes are BD-tractable, i.e., BDTQ0 is properly contained in P
[Figure: PTIME queries split into BD-tractable and not BD-tractable ones]
Not all polynomial-time queries are BD-tractable
Polynomial hierarchy revised
[Figure: the hierarchy revised for big data: parallel polylog time (NC) inside the BD-tractable queries, the BD-tractable and not BD-tractable queries inside P, and NP and beyond above P]
Tractability revisited for querying big data
What can we get from BD-tractability?
Guidelines for the following:
 What query classes are feasible on big data? (BDTQ0)
 What query classes can be made feasible to answer on big data? (BDTQ)
 How to determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Qc for BDTQ via ≤NC
 If so, how to answer the queries in Q?
• Identify factorizations (NC reductions) such that Q ≤NC Qc
• Compose the reduction with the algorithm for answering queries of Qc
Why we need to study theory for querying big data
Parallel scalability
Parallel query answering
BD-tractability is hard to achieve.
Parallel processing is widely used, given more resources
Using 10,000 SSDs of 6 GB/s each, a linear scan of D might take:
 1.9 days / 10,000 ≈ 16 seconds when D is of 1 PB (10^15 bytes)
 5.28 years / 10,000 ≈ 4.63 hours when D is of 1 EB (10^18 bytes)
Only ideally!
[Figure: a shared-nothing architecture with 10,000 processors (P), each with its own memory (M) and database (DB), connected by an interconnection network]
Parallel scalable: the more processors, the “better”? How do we define “better”?
Degree of parallelism -- speedup
Speedup: for a given task, TS/TL,
 TS: time taken by a traditional DBMS
 TL: time taken by a parallel system with more resources
 TS/TL: more resources mean proportionally less time for a given task
 Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system
[Figure: speed (throughput, 1/response time) vs. resources, with linear speedup as the reference line]
Question: can we do better than linear speedup?
Degree of parallelism -- scaleup
Scaleup: TS/TL
 A task Q, and a task QN, N times bigger than Q
 A DBMS MS, and a parallel DBMS ML, N times larger
 TS: time taken by MS to execute Q
 TL: time taken by ML to execute QN
 Linear scaleup: TL = TS, i.e., the time remains constant if the resources increase in proportion to the increase in problem size
[Figure: TS/TL vs. resources and problem size, with linear scaleup as the flat reference line]
Question: can we do better than linear scaleup?
Better than linear scaleup/speedup?
NO, it is even hard to achieve linear speedup/scaleup! (Give 3 reasons.)
 Startup costs: initializing each process
 Interference: competing for shared resources (network, disk, memory or even locks); think of blocking in MapReduce
 Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part
 Data shipment cost for shared-nothing architectures
In the real world, linear scaleup is too ideal to get!
A weaker criterion: the more processors are available, the less response time it takes.
Linear speedup is the best we can hope for -- it is optimal!
Parallel query answering
Given a big graph G, and n processors S1, …, Sn
G is partitioned into fragments (G1, …, Gn)
G is distributed to n processors: Gi is stored at Si
Parallel query answering
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Performance (what is it, and why do we need to worry about it?)
 Response time (aka parallel computation cost): the interval from the time when Q is submitted to the time when Q(G) is returned
 Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages
Performance guarantees: bounds on response time and data shipment
Parallel scalability
 Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
 Output: Q(G), the answer to Q in G
Complexity:
 t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
 T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors (including the cost of data shipment; k below is a constant)
Parallel scalable: if T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k), i.e., a polynomial reduction of the sequential cost
When G is big, we can still query G by adding more processors, if we can afford them
A distributed algorithm is useful if it is parallel scalable
Linear scalability
An algorithm T for answering a class Q of queries
 Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
 Output: Q(G), the answer to Q in G
Algorithm T is linearly scalable in
 computation if its parallel complexity is a function of |Q| and |G|/n (the more processors, the less response time), and in
 data shipment if the total amount of data shipped is a function of |Q| and n, independent of the size |G| of the big G
Is it always possible?
Querying big data by adding more processors
Graph pattern matching via graph simulation
 Input: a graph pattern Q and a graph G
 Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G such that (u, v) ∈ S, and
• for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G such that (u’, v’) ∈ S
Parallel scalable? Sequentially, graph simulation takes O((|V| + |VQ|)(|E| + |EQ|)) time.
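A minimal sequential sketch of graph simulation as a naive fixpoint computation (not the optimized O((|V| + |VQ|)(|E| + |EQ|)) algorithm); the graph representation and label-based match function are assumptions for illustration:

```python
def graph_simulation(Qnodes, Qedges, Gnodes, Gedges, match):
    """Return the maximum simulation relation S: start from all label-compatible
    pairs and repeatedly remove pairs that violate the edge condition."""
    gsucc = {v: set() for v in Gnodes}
    for v, w in Gedges:
        gsucc[v].add(w)
    S = {(u, v) for u in Qnodes for v in Gnodes if match(u, v)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(S):
            for (x, u2) in Qedges:
                # every pattern edge (u, u2) must be matched by some (v, w) with (u2, w) in S
                if x == u and not any((u2, w) in S for w in gsucc[v]):
                    S.discard((u, v))
                    changed = True
                    break
    return S

# Hypothetical labelled graphs; match tests label equality.
Qn, Qe = {"A", "B"}, {("A", "B")}
Gn, Ge = {1, 2, 3}, {(1, 2), (3, 2)}
labels = {1: "A", 2: "B", 3: "A"}
print(graph_simulation(Qn, Qe, Gn, Ge, lambda u, v: labels[v] == u))
# pairs ('A', 1), ('A', 3), ('B', 2)
```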
Impossibility
There exists NO algorithm for distributed graph simulation that is parallel scalable in either
 computation, or
 data shipment
Why? Consider a pattern of 2 nodes and a graph of 2n nodes distributed to n processors.
Possibility: when G is a tree, graph simulation is parallel scalable in both response time and data shipment
Nontrivial to develop parallel scalable algorithms
Weak parallel scalability
Algorithm T is weakly parallel scalable in
 computation if its parallel computation cost is a function of |Q|, |G|/n and |Ef|, and in
 data shipment if the total amount of data shipped is a function of |Q| and |Ef|
where Ef is the set of edges across different fragments.
Rationale: we can partition G as preprocessing, such that
 |Ef| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and
 when G grows, |Ef| does not increase substantially
So the cost is not a function of |G| in practice.
Doable: graph simulation is weakly parallel scalable
MRC: Scalability of MapReduce algorithms
Characterize scalable MapReduce algorithms in terms of disk usage,
memory usage, communication cost, CPU cost and rounds.
For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if
 Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total.
 Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total.
 Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) amount of data, O(|D|^(2-2ε)) in total.
 CPU: in each round, each machine takes time polynomial in |D|.
 The number of rounds: polylog in |D|, that is, log^k(|D|).
Note: the larger D is, the more machines are required, and the response time is still a polynomial in |D|.
MMC: a revision of MRC
For a constant ε > 0 and a dataset D, with n machines, a MapReduce algorithm is in MMC if
 Disk: each machine uses O(|D|/n) disk, O(|D|) in total.
 Memory: each machine uses O(|D|/n) memory, O(|D|) in total.
 Data shipment: in each round, each machine sends or receives O(|D|/n) amount of data, O(|D|) in total.
 CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine.
 The number of rounds: O(1), a constant number of rounds.
Speedup: O(Ts/n) time, i.e., the more machines are used, the less time is taken.
(What algorithms are in MRC? Recursive computation? Note that this is restricted to MapReduce.)
Compare with BD-tractability and parallel scalability.
Bounded evaluability
Scale independence
 Input: A class Q of queries
 Question: Can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M (a bound independent of the size of D), and
• Q(D) = Q(DQ)?
[Figure: Q(D) is computed as Q(DQ) for a small fraction DQ of D]
Particularly useful for
 a single dataset D, e.g., the social graph of Facebook
 a minimum DQ, i.e., the necessary amount of data for answering Q
Making the cost of computing Q(D) independent of |D|!
Facebook: Graph Search
 Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
Access constraints (real-life limits)
 Facebook: 5000 friends per person
 Each year has at most 366 days
 Each person dines at most once per day
 pid is a key for relation person
How many tuples do we need to access?
Bounded query evaluation
 Find me restaurants in New York my friends have been to in 2013
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
A query plan
 Fetch 5000 pid’s for friends of p0 -- 5000 friends per person
 For each pid, check whether she lives in NYC -- 5000 person tuples
 For each pid living in NYC, find the restaurants where they dined in 2013 -- 5000 * 366 tuples at most
Accessing 5000 + 5000 + 5000 * 366 tuples in total, in contrast to the full Facebook graph of more than 1.38 billion nodes and over 140 billion links
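A minimal sketch of this bounded query plan, with in-memory dictionaries standing in for the access-constraint indexes; the relation contents and p0 are hypothetical:

```python
friends_of = {"p0": ["p1", "p2", "p3"]}           # friend: pid1 -> (pid2, <= 5000)
city_of = {"p1": "NYC", "p2": "LA", "p3": "NYC"}  # person: pid -> (city, 1)
dined_at = {("p1", 2013): ["r1", "r2"],           # dine: (pid, yy) -> (rid, <= 366)
            ("p3", 2013): ["r2"]}

def restaurants_friends_visited(p0, year, city):
    """Accesses at most 5000 + 5000 + 5000 * 366 tuples, whatever the size of D."""
    answer = set()
    for pid in friends_of.get(p0, []):                     # <= 5000 friends
        if city_of.get(pid) == city:                       # 1 person tuple per friend
            answer.update(dined_at.get((pid, year), []))   # <= 366 dine tuples
    return answer

print(restaurants_friends_visited("p0", 2013, "NYC"))  # {'r1', 'r2'}
```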
Access constraints
Combining cardinality constraints and indexes
On a relation schema R: X → (Y, N)
 X, Y: sets of attributes of R
 for any X-value, there exist at most N distinct Y-values
 index on X for Y: given an X-value, find the relevant Y-values
Examples
 friend(pid1, pid2): pid1 → (pid2, 5000), since there are at most 5000 friends per person
 dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366), since each year has at most 366 days and each person dines at most once per day
 person(pid, name, city): pid → (city, 1), since pid is a key for relation person
Access schema: a set of access constraints
Finding access schema
On a relation schema R: X → (Y, N)
 Functional dependencies X → Y: X → (Y, 1)
 Keys X: X → (R, 1)
 Domain constraints, e.g., each year has at most 366 days
 Real-life bounds: 5000 friends per person (Facebook)
 The semantics of real-life data, e.g., accidents in the UK from 1975-2005:
• (dd, mm, yy) → (aid, 610): at most 610 accidents in a day
• aid → (vid, 192): at most 192 vehicles in an accident
 Discovery (how to find these?): an extension of functional dependency discovery, e.g., TANE
Bounded evaluability needs only a small number of access constraints
Bounded queries
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?
Examples (what are these?)
 The graph search query at Facebook
 All Boolean conjunctive queries are bounded
– Boolean: Q(D) is true or false
– Conjunctive: SPC, i.e., selection, projection, Cartesian product
But how to find DQ?
Boundedness: to decide whether it is possible to compute Q(D) by accessing a bounded amount of data at all
Boundedly evaluable queries
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D) = Q(DQ), and moreover,
• DQ can be identified (effectively found) in time determined by Q and A alone?
Examples
 The graph search query at Facebook
 All Boolean conjunctive queries are bounded, but are not necessarily effectively bounded!
If Q is boundedly evaluable, then for any big D we can efficiently compute Q(D) by accessing a bounded amount of data!
Deciding bounded evaluability
 Input: A query Q, an access schema A
 Question: Is Q boundedly evaluable under A?
 Conjunctive queries (SPC) with restricted query plans: yes, doable
• Characterization: sound and complete rules
• PTIME algorithms for checking effective boundedness and for generating query plans, in |Q| and |A|
 Relational algebra (SQL): undecidable. What can we do?
• Special cases
• Sufficient conditions
• e.g., parameterized queries in recommendation systems, even in SQL
Many practical queries are in fact boundedly evaluable!
Techniques for querying big data
An approach to querying big data
Given a query Q, an access schema A and a big dataset D:
1. Decide whether Q is effectively bounded under A
2. If so, generate a bounded query plan for Q
3. Otherwise, do one of the following:
① Extend the access schema, or instantiate some parameters of Q, to make Q effectively bounded
② Use other tricks to make D small (to be seen shortly)
③ Compute approximate query answers to Q in D
In experiments:
 77% of conjunctive queries are boundedly evaluable; efficiency: 9 seconds vs. 14 hours of MySQL
 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable; improvement: 4 orders of magnitude
Very effective for conjunctive queries
Bounded evaluability using views
 Input: A class Q of queries, a set V of views, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D) = Q’(DQ, V(D)) for some rewriting Q’ of Q using V, and
• DQ can be identified in time determined by Q, V, and A?
That is, access the views plus an additional bounded amount of data.
[Figure: Q(D) is computed as Q’(DQ, V(D)) for a small fraction DQ of D]
Query Q may not be boundedly evaluable, but may be boundedly evaluable with views!
Incremental bounded evaluability
 Input: A class Q of queries, an access schema A
 Question: Can we find, by using A, for any query Q ∈ Q, any dataset D, and any changes ∆D to D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D ⊕ ∆D) = Q(D) ⊕ Q(∆D, DQ), and
• DQ can be identified in time determined by Q and A?
That is, on top of the old output Q(D), access only an additional bounded amount of data.
[Figure: Q(D ⊕ ∆D) is computed as Q(D) ⊕ Q(DQ, ∆D), reusing the old output]
Query Q may not be boundedly evaluable, but may be incrementally boundedly evaluable!
Parallel query processing
Divide and conquer:
 partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
 upon receiving a query Q,
• evaluate Q(Gi) on the smaller fragments Gi, in parallel
• collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
[Figure: Q(G) is assembled from the partial answers Q(G1), Q(G2), …, Q(Gn)]
Graph pattern matching in GRAPE: 21 times faster than MapReduce
Parallel processing = partial evaluation + message passing (see the sketch below)
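A minimal sketch of the divide-and-conquer pattern, using a multiprocessing pool as a stand-in for the n sites and a toy selection query (for which the partial answers simply union together); GRAPE's partial evaluation and message passing are not modelled here:

```python
from multiprocessing import Pool

def evaluate(fragment):
    """Hypothetical query Q: select the even numbers of a fragment."""
    return [x for x in fragment if x % 2 == 0]

def parallel_answer(fragments):
    """Evaluate Q on every fragment in parallel, then assemble Q(G)."""
    with Pool(len(fragments)) as pool:
        partial = pool.map(evaluate, fragments)     # Q(G1), ..., Q(Gn)
    return [x for part in partial for x in part]    # assemble at the coordinator

if __name__ == "__main__":
    G = list(range(20))
    fragments = [G[i::4] for i in range(4)]         # partition G into 4 fragments
    print(parallel_answer(fragments))               # the even numbers of G
```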
Query preserving compression
The cost of query processing is f(|G|, |Q|): can we reduce the parameter |G|?
Query-preserving compression <R, P> for a class L of queries:
 For any data collection G, Gc = R(G) (compressing)
 For any Q in L, Q(G) = P(Q, Gc) (post-processing)
[Figure: G is compressed by R into Gc, and Q(G) is computed as Q(Gc)]
 In contrast to lossless compression, retain only the information relevant for answering queries in L: a better compression ratio!
 No need to restore the original graph G or to decompress the data: query preserving!
18 times faster on average for reachability queries
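One concrete instance of such a pair <R, P>, given here as an illustration rather than the scheme from the literature: for reachability queries, collapsing each strongly connected component (SCC) of G preserves every answer, so R builds the SCC condensation and P answers reach(s, t) on the smaller graph:

```python
def compress(G):
    """R: collapse each SCC of G (Kosaraju's algorithm).
    G: dict mapping each node to an iterable of its successors."""
    order, visited = [], set()

    def dfs(u):                      # first pass: record finishing order
        visited.add(u)
        for w in G.get(u, ()):
            if w not in visited:
                dfs(w)
        order.append(u)

    for u in list(G):
        if u not in visited:
            dfs(u)

    rev = {}                         # reverse graph
    for u, succs in G.items():
        for v in succs:
            rev.setdefault(v, []).append(u)

    scc_of = {}

    def assign(u, cid):              # second pass on the reverse graph
        scc_of[u] = cid
        for w in rev.get(u, ()):
            if w not in scc_of:
                assign(w, cid)

    cid = 0
    for u in reversed(order):
        if u not in scc_of:
            assign(u, cid)
            cid += 1

    Gc = {c: set() for c in range(cid)}   # condensed (compressed) graph
    for u, succs in G.items():
        for v in succs:
            if scc_of[u] != scc_of[v]:
                Gc[scc_of[u]].add(scc_of[v])
    return Gc, scc_of

def reachable(Gc, scc_of, s, t):
    """P: reach(s, t) holds in G iff scc(t) is reachable from scc(s) in Gc."""
    seen, stack = set(), [scc_of[s]]
    while stack:
        c = stack.pop()
        if c == scc_of[t]:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(Gc[c])
    return False

# Hypothetical graph: {1, 2, 3} form a cycle (one SCC), 4 hangs off it.
G = {1: [2], 2: [3], 3: [1, 4], 4: []}
Gc, scc_of = compress(G)
print(reachable(Gc, scc_of, 2, 4), reachable(Gc, scc_of, 4, 1))  # True False
```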
Answering queries using views
The cost of query processing is f(|G|, |Q|): can we compute Q(G) without accessing G, i.e., independently of |G|?
Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
 Q and Q’ are equivalent: for any G, Q(G) = Q’(G), and
 Q’ only accesses V(G)
[Figure: Q(G) is computed as Q’(V(G))]
V(G) is often much smaller than G (4%-12% on real-life data)
Improvement: 31 times faster for graph pattern matching
The complexity is no longer a function of |G|
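A minimal relational sketch of the idea: a view V1 materializes, for every person p, the NYC restaurants that p's friends visited in 2013; the earlier graph-search query is then rewritten to a selection Q' over V1(G) that never touches the base relations (relation names and data are hypothetical):

```python
def materialize_view(friend, person, dine):
    """V1(G): set of (p, rid) pairs -- computed once, offline."""
    city = dict(person)                        # pid -> city
    return {(p, rid)
            for (p, q) in friend
            for (q2, rid, yy) in dine
            if q2 == q and yy == 2013 and city.get(q) == "NYC"}

def rewritten_query(V1, p0):
    """Q'(V(G)): only accesses the view, never G itself."""
    return {rid for (p, rid) in V1 if p == p0}

friend = {("p0", "p1"), ("p0", "p2")}
person = {("p1", "NYC"), ("p2", "LA")}
dine = {("p1", "r1", 2013), ("p1", "r2", 2012), ("p2", "r3", 2013)}
V1 = materialize_view(friend, person, dine)
print(rewritten_query(V1, "p0"))  # {'r1'}
```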
Incremental query answering
 Real-life data is dynamic: it constantly changes (∆G), e.g., 5%/week for Web graphs
 Re-compute Q(G⊕∆G) starting from scratch?
 Changes ∆G are typically small
Compute Q(G) once, and then incrementally maintain it.
Incremental query processing:
 Input: Q, G, the old output Q(G), and the changes ∆G to the input
 Output: the changes ∆M to the output, such that Q(G⊕∆G) = Q(G) ⊕ ∆M (the new output)
When the changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G⊕∆G).
At least twice as fast for pattern matching, for changes up to 10%
Minimizing unnecessary recomputation (see the sketch below)
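A minimal sketch of incremental maintenance for a toy selection query: the update touches only the changed tuples ∆G, never the full dataset (the query and data are hypothetical):

```python
def eval_query(D):
    """Batch evaluation: all persons living in NYC."""
    return {pid for (pid, city) in D if city == "NYC"}

def incremental(answer, inserted, deleted):
    """Compute ∆M from ∆G = (inserted, deleted) and apply it to the old answer.
    Cost depends only on |∆G|, not on |D|."""
    plus = {pid for (pid, city) in inserted if city == "NYC"}
    minus = {pid for (pid, city) in deleted if city == "NYC"}
    return (answer | plus) - minus

D = {("p1", "NYC"), ("p2", "LA")}
ans = eval_query(D)                                           # {'p1'}
ans = incremental(ans, {("p3", "NYC")}, {("p1", "NYC")})      # apply ∆G
assert ans == eval_query({("p2", "LA"), ("p3", "NYC")})       # same as recomputing
```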
Complexity of incremental problems
Incremental query answering
 Input: Q, G, Q(G), ∆G
 Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of batch query processing: a function of |G| and |Q|
 Incremental algorithms: the cost is measured in |CHANGED|, that is, the size of the changes in
• the input: ∆G, and
• the output: ∆M
i.e., the updating cost that is inherent to the incremental problem itself
Bounded: is the cost expressible as f(|CHANGED|, |Q|)?
Incremental graph simulation: bounded
Complexity analysis in terms of the size of changes
A principled approach: Making big data small
 Boundedly evaluable queries
 Parallel query processing (MapReduce, GRAPE, etc.)
 Query-preserving compression: convert big data to small data
 Query answering using views: make big data small
 Bounded incremental query answering: depend on the size of the changes rather than the size of the original big data
 …
Including but not limited to graph queries
Yes, MapReduce is useful, but it is not the only way!
Combinations of these can do much better than MapReduce!
Summary and Review
 What is BD-tractability? Why do we care about it?
 What is parallel scalability? Name a few parallel scalable algorithms
 What is bounded evaluability? Why do we want to study it?
 How to make big data “small”?
 Is MapReduce the only way for querying big data? Can we do better than it?
 What is query-preserving data compression? Query answering using views? Bounded incremental query answering?
 If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?
Projects (1)
Prove or disprove that one of the following query classes (pick one of these) is
• BD-tractable,
• parallel scalable, or
• in MMC
If so, give an algorithm as a proof. Otherwise, prove the
impossibility but identify practical sub-classes that are scalable.
The query classes include
• Distance queries on graphs
• Graph pattern matching by subgraph isomorphism
• Graph pattern matching by graph simulation
• Subgraph isomorphism and graph simulation on trees
 Experimentally evaluate your algorithms

Both impossibility and possibility results are useful!
Projects (2)
Improve the performance of graph pattern matching via subgraph isomorphism, using one of the following approaches:
• query-preserving graph compression
• query answering using views
 Prove the correctness of your algorithm, give complexity analysis
and provide performance guarantees
 Experimentally evaluate your algorithm and demonstrate the
improvement

A research and development project
Projects (3)

It is known that graph pattern matching via graph simulation can
benefit from:
• query-preserving graph compression
• query answering using views
W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression,
SIGMOD, 2012. (query-preserving compression)
W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using
Views, ICDE 2014. (query answering using views)
Implement one of the algorithms
 Experimentally evaluate your algorithm and demonstrate the
improvement
 Bonus: can you combine the two approaches and verify the benefit?
A development project
Projects (4)
Find an application with a set of SPC (conjunctive) queries and a
dataset
 Identify access constraints on your dataset for your queries
 Implement an algorithm that, given a query in your class, decides whether the query is boundedly evaluable under your access constraints
 If so, generate a query plan to evaluate your queries by
accessing a bounded amount of data
 Experimentally evaluate your algorithm and demonstrate the
improvement

A development project
Projects (5)
 Write a survey on techniques for querying big data, covering
• parallel query processing,
• data compression
• query answering using views
• incremental query processing
• …
Survey:
• A set of 5-6 representative papers
• A set of criteria for evaluation
• Evaluate each model based on the criteria
• Make recommendation: what to use in different applications
Develop a good understanding on the topic
Reading: data quality
 W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012. (available upon request)
– Data consistency (Chapter 2)
– Entity resolution (record matching; Chapter 4)
– Information completeness (Chapter 5)
– Data currency (Chapter 6)
– Data accuracy (SIGMOD 2013 paper)
– Deducing the true values of objects in data fusion (Chapter 7)
Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html
1. M. Arenas, L. E. Bertossi, J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
2. Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
4. W. Fan and F. Geerts. Relative information completeness. PODS 2009.
5. Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.
6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.