TDD: Research Topics in Distributed Databases (2011)
This course covers both basic technology and advanced research topics in connection
with distributed data management.
• Basic technology
o Parallel database management
o Distributed databases and distributed query evaluation
• Advanced topics
o Schema matching, schema mapping and data exchange
o Integrating data from distributed sources
o Matching records from multiple data sources
o Cleaning integrated data
o Conditional constraints as data quality rules
o Cleaning distributed data: error detection and propagation of data-quality rules
Prerequisites: Prerequisite courses include Database Systems (DBS). Some background in theory is
necessary. In particular, basic knowledge of logic and the theory of computation is
essential. Undergraduates intending to take the course must either have a B or better in
one of the 3rd year modules Computability and Intractability or Language Semantics and
Implementation, or seek permission from the instructor.
Web page:
Course format: This is a seminar course. Lectures provide background as needed. Students will
be expected to read the related research papers or textbook chapters on the list below. Each student
will also be expected to complete and present a course project.
In keeping with the research seminar nature of the course, there will be no exams. Instead,
students are required to read research papers, write reviews, complete a project and
present the project in class. Final grades will be determined as follows:
• Reviews:
• Project:
• Project report and presentation:
You should read all the papers on the list below. You should also write reviews of those
papers and bring them to class when the papers are to be discussed. Each review should
be about half a page, and should consist of a compelling mix of summary, key ideas,
questions you have, flaws you find, and evaluations of quality.
Projects will be developed during the class. A project can be either a comprehensive
survey of a line of research, or a design and development effort that deals in more
depth with a topic encountered during the semester. A survey project is to be conducted
individually, and a design project is to be developed by a small team of at most three students.
The individual or team should write and present a final project report.
Tentative schedule and reading list (subject to change):
Week 1: Overview
Week 2: Parallel databases: architectures, query processing and optimization
o Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22.
o Database System Concepts, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6
(Parallel and Distributed Database Systems)
Week 3: Distributed databases: architectures, replication, query processing and optimization
o Database Management Systems, Chapter 22.
o Database System Concepts, Part 6 (Parallel and Distributed Database Systems)
Week 4: Techniques for evaluating XPath queries on distributed XML documents
(Reviews are required for two of the following three papers.)
o Efficient algorithms for processing XPath queries
o Using partial evaluation in distributed query evaluation
o Distributed query evaluation with performance guarantees
Week 5: Data exchange: schema mapping, schema matching and information preservation.
o A survey of approaches to automatic schema matching
o Information preserving XML schema embedding
o Putting context into schema matching
Week 6: Data integration.
o XML queries and algebra in the Enosys integration platform
o Capturing both types and constraints in data integration
o Active XML primer
Week 7: Record matching.
o Duplicate record detection: A survey
o Reasoning about record matching rules
o Real-world data is dirty: Data cleansing and the merge/purge problem
Week 8: Data cleaning.
o Data cleaning: Problems and current approaches
o Consistent query answering in databases
o A cost-based model and effective heuristic for repairing constraints by value modification
Week 9: Conditional dependencies for capturing inconsistencies.
o Conditional functional dependencies for data cleaning
o Extending dependencies with conditions
o Discovering conditional functional dependencies
Week 10: Cleaning distributed data.
o Propagating functional dependencies with conditions
o Towards certain fixes with editing rules and master data
o Detecting inconsistencies in distributed data
Week 11: Final project report and presentation.
Projects (subject to change)