TDD: Research Topics in Distributed Databases (2011)

Description: This course covers both basic technology and advanced research topics in connection with distributed data management.

Basic technology
o Parallel database management
o Distributed databases and distributed query evaluation

Advanced topics
o Schema matching, schema mapping and data exchange
o Integrating data from distributed sources
o Matching records from multiple data sources
o Cleaning integrated data
o Conditional constraints as data quality rules
o Cleaning distributed data: error detection and propagation of data-quality rules

Prerequisites: Prerequisite courses include Database Systems (DBS). Some background in theory is necessary. In particular, basic knowledge of logic and the theory of computation is essential. Undergraduates intending to take the course must either have a B or better in one of the 3rd year modules Computability and Intractability or Language Semantics and Implementation, or seek permission from the instructor.

Web page: http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html

Course format: This is a seminar course. Lectures provide background as needed. Students will be expected to read the related research papers or textbook chapters on the list below. Each student will also be expected to complete and present a course project.

Grading: In keeping with the research seminar nature of the course, there will be no exams. Instead, students are required to read research papers, write reviews, complete a project and present the project in class. Final grades will be determined as follows:
o Reviews: 28%
o Project: 57%
o Project report and presentation: 15%

Review: You should read all the papers on the list below. You should also write reviews of those papers and bring them to class when the papers are to be discussed. Each review should be about half a page, and should consist of a compelling mix of summary, key ideas, questions you have, flaws you find, and evaluations of quality.

Project: Projects will be developed during the class. A project can be either a comprehensive survey of a line of research, or some design and development work that deals in more depth with a topic encountered during the semester. A survey project is to be conducted individually; a design project is to be developed by a small team of at most three. The individual or team should write and present a final project report.

Tentative schedule and reading list (subject to change):

Week 1: Overview

Week 2: Parallel databases: architectures, query processing and optimization
o Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22.
o Database System Concepts, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)

Week 3: Distributed databases: architectures, replication, query processing and optimization
o Database Management Systems, Chapter 22.
o Database System Concepts, Part 6 (Parallel and Distributed Database Systems)

Week 4: Techniques for evaluating XPath queries on distributed XML documents (reviews are required for two of the following papers; select two out of three to review)
o Efficient Algorithms for Processing XPath Queries (http://www.dbai.tuwien.ac.at/research/xmltaskforce/vldb2002.pdf)
o Using partial evaluation in distributed query evaluation (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/vldb06.pdf)
o Distributed query evaluation with performance guarantees (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/sigmod07.pdf)

Week 5: Data exchange: schema mapping, schema matching and information preservation.
o A survey of approaches to automatic schema matching (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/schema-matching.pdf)
o Information preserving XML schema embedding (http://homepages.inf.ed.ac.uk/wenfei/papers/vldb05-mapping.pdf)
o Putting context into schema matching (http://homepages.inf.ed.ac.uk/wenfei/papers/vldb06-matching.pdf)

Week 6: Data integration.
o XML queries and algebra in the Enosys integration platform (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/Enosys.pdf)
o Capturing both types and constraints in data integration (http://homepages.inf.ed.ac.uk/wenfei/papers/sigmod03.pdf)
o Active XML primer (ftp://ftp.inria.fr/INRIA/Projects/gemo/gemo/GemoReport-307.pdf)

Week 7: Record matching.
o Duplicate record detection: A survey (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/tkde07.pdf)
o Reasoning about record matching rules (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/vldb09.pdf)
o Real-world data is dirty: Data cleansing and the merge/purge problem (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.6676)

Week 8: Data cleaning.
o Data cleaning: Problems and current approaches (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/cleaning.pdf)
o Consistent query answering in databases (http://portal.acm.org/citation.cfm?doid=1147376.1147391)
o A cost-based model and effective heuristic for repairing constraints by value modification (http://homepages.inf.ed.ac.uk/wenfei/papers/sigmod05.pdf)

Week 9: Conditional dependencies for capturing inconsistencies.
o Conditional functional dependencies for data cleaning (http://homepages.inf.ed.ac.uk/wenfei/papers/icde07-cfd.pdf)
o Extending dependencies with conditions (http://homepages.inf.ed.ac.uk/wenfei/papers/vldb07-a.pdf)
o Discovering conditional functional dependencies (http://homepages.inf.ed.ac.uk/fgeerts/pdf/icde09.pdf)

Week 10: Cleaning distributed data.
o Propagating functional dependencies with conditions (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/vldb08.pdf)
o Towards certain fixes with editing rules and master data (http://portal.acm.org/citation.cfm?id=1920867)
o Detecting inconsistencies in distributed data (http://homepages.inf.ed.ac.uk/wenfei/tdd/reading/icde10.pdf)

Week 11: Final project report and presentation.

Projects (subject to change): http://homepages.inf.ed.ac.uk/wenfei/tdd/project/project.html