Правительство Российской Федерации Федеральное государственное автономное учреждение высшего профессионального образования «Национальный исследовательский университет «Высшая школа экономики» Факультет Бизнес-информатика Отделение Программная инженерия Программа дисциплины "Базы данных" ( - на английском языке - ) для направления 231000.62 "Программная инженерия" подготовки бакалавра Автор программы доцент, к.т.н. А.Д. Брейман abreyman@hse.ru Рекомендована секцией УМС по бизнес-информатике Председатель ____________________ Одобрена на заседании кафедры Управления разработкой программного обеспечения (УРПО) Зав. кафедрой С.М. Авдошин « _____ » ___________________ 2013 г. « ____ » ______________________ 2013 г. Утверждена Ученым Советом факультета Бизнес-информатики Ученый секретарь __________________ « ____ » ___________________ 2013 г. Москва - 2013 Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 1 I. Summary Author of the Program: Assoc. Prof., PhD. Alexander D. Breyman (responsible lecturer) General Information about Training Course: The syllabus is prepared for teachers responsible for the course (or closely related disciplines), teaching assistants, students enrolled on the course as well as experts and statutory bodies carrying out assigned or regular accreditations in accordance with • educational standards of the State educational budget institution of the Higher Professional Education "State University – The Higher School of Economics" (HSE) categorized as "National Research University", • B.Sc. curriculum ("Software Engineering", area code 231000.62) 3rd year, 2013-2014 academic year; • B.Sc. curriculum ("Software Engineering", area code 231000.62) 4th year, 2013-2014 academic year; • Federal state educational standard of higher education in software engineering (Bachelor of Science degree) approved by the Ministry of Education and Science of RF (Russian Federation) Directive N544 on November 9th, 2009 (in Russian). The course is offered to students of the Bachelor Program «Software Engineering» (code 231000.62) in the School of Software Engineering, Faculty of Business Informatics of the National Research University - Higher School of Economics (HSE). This mandatory course belongs to professional cycle curricula unit (Б.3 unit/ Base module [Professional cycle disciplines 1.3] of 2013-2014 academic year's working syllabus) covered by the list of training courses of bachelor's program (3rd and 4th year of studies). It is a four module course, which is delivered in modules #3 and $3 of the third academic year and in modules #1 and #2 of the fourth academic year. Number of credits is 8 (4+4). Total course length is 288 academic hours. 3rd year part of course length is 144 academic hours including 80 auditory hours (40 Lecture (L) hours and 40 Practice (P) hours) and 64 Self-study (SS) hours. 4th year part of course length is 144 academic hours including 60 auditory hours (30 Lecture (L) hours and 30 Practice (P) hours) and 84 Self-study (SS) hours. Academic control forms are one home assignment, one test, and one written exam after module #3 of 3rd course. Prerequisites: The course is based on the knowledge of foundations of discrete mathematics (including logic and set theory), computer science, and computer programming. Students are assumed to understand basic data structures and their uses (lists, arrays), concepts of object-oriented programming (classes, inheritance), modular software design and method linkage, the design, implementation and testing of medium-sized problem solutions, to be familiar with some contexts in which databases are used. Course Objectives: The student should develop skills and understanding in: the design methodology for databases and verifying their structural correctness; implementing databases and applications software in the relational model as well as in map/reduce paradigm; using querying languages, primarily SQL, and other database supporting software; applying the theory behind various database models and query languages; Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 2 implementing security and integrity policies relating to databases; working in group settings to design and implement database projects. The objective of first part (first and second modules) of the course is to expose to students topics of (mostly) online transactional processing and relational database theory, database design and implementation, including entity-relationship data modeling, relational model, algebra and calculus, functional dependencies and normalization theory, relational query languages, including SQL. The objective of second part (third and fourth modules) of the course is to form professional competencies related to design and implementation of other kinds of databases, including data warehouses, online analytical data processing and big data management tools. Students will get a grasp on strengths and weaknesses of wide spectrum of approaches to data storage, search and retrieval, resulting in informed choice of database model. Abstract: First part of the course is intended to provide students with an understanding of the current theory and practice of database management systems. The course provides a solid technical overview of database management systems, using a current database products as a case studies. In addition to technical concerns, more general issues are emphasized. These include data independence, integrity, security, recovery, performance, database design principles, and database management systems internals. While first part of the course covering many of the core concepts behind database modeling, design and implementation, there are many other considerations that should be addressed for successful career in this field. The main objective of this course is to expand students’ view and introduce advanced topics, including data warehousing, online analytical data processing and big data management. The additional topics covered in this course will increase student’s proficiency in choosing appropriate technologies for projects. By the end of the course, students should have a solid grasp on classic database concepts and tools as well as on business intelligence and big data tools, which will prove to be invaluable as them progress further in computer science studies. This course studies different conceptual database models and their properties. The models that will be discussed are: Relational model for online transactional processing; Relational data warehouse; Multidimensional data warehouse; Online analytical processing; Map/reduce massive parallel data processing; Key/value, document and graph database models; Semistructured data and XML; Data stream processing. For these conceptual models the course will concentrate on the following points: Why was the database model introduced? Which of the shortcomings of other models does it address? What are the most important concepts and notions for the database model? How is the model implemented? Which are the main techniques? The importance of understanding the internals of a particular database model cannot be overemphasized as it is closely connected to its limitations. Training Objectives: After taking this course the student should have achieved the following objectives: Knows the data modelling concepts and has an understanding of relational data model. Can compose queries to relational databases using relational algebra, tuple relational calculus and SQL. Knows Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 3 methods of database design, including entity-relationship approach and normalization-based approach. Knows and is able to apply database application design and development methods usable for objectoriented program systems, including object-relational mappers, its advantages and disadvantages. Knows models and methods of internal organization of relational databases including file storage, indexing, query processing and transaction management issues. Knows the reference architectures of data warehouses and is aware of the basic functionality offered by available commercial and free data warehousing systems. Master methods and tools for creating analytical database solutions. Can choose dimensions for multidimensional database, group them into hierarchies and define aggregates. Knows principles of data integration and extract-transformload process design and implementation. Knows the map/reduce approach and its Hadoop implementation. Can create map/reduce-based solutions to large-scale massive data processing problems. Knows database models and tools of NoSQL class, including key-value stores, document databases and graph databases. Can implement systems and database applications using these tools. Knows XML document model, DTD and XML Schema. Can write and understand XPath and XQuery expressions. Knows data stream management systems, its data models and concepts. Can compose queries for executing in these systems. Students should be able to understand the language of studies models, choose and use appropriate models and programming languages, implement systems using chosen models, methods and tools. II. Topic-wise Curricula Plan № Title of the topic Total Hours Lectures Classroom hours Practice Self-study Module #3, 3rd year 1 2 3 4 5 6 7 Introduction. Data modeling. Relational Model. Relational Query Languages. SQL. Database Design: The E/R and UML Approaches. Relational Database Design. Module #3 totals 6 6 6 8 22 2 2 2 2 6 2 2 2 2 6 2 2 2 4 10 10 2 2 6 14 72 4 20 4 20 6 32 20 8 8 12 12 12 72 4 2 2 4 4 4 20 4 2 2 4 4 4 20 12 4 4 4 4 4 32 10 2 2 6 22 6 6 10 14 46 2 10 2 10 10 26 Module #4, 3rd year 8 9 10 11 12 13 Application Design and Development. Storage and File Structure. Indexing and Hashing. Query Processing. Transaction Management. Distributed and Parallel Databases. Module #4 totals Module #1, 4th year 14 15 16 Data Warehousing and Big Data Management. Data Warehouse Architectures and Models. Data Cleaning and Integration. Module #1 totals Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 4 Module #2, 4th year 17 18 19 20 Map/Reduce and Hadoop. NoSQL Databases. Semistructured data and XML. Data Stream Management Systems. Module #2 totals Total: 40 18 18 22 98 8 4 4 4 20 8 4 4 4 20 24 10 10 14 58 288 70 70 148 III. Base Textbook(s) and Recommended Readings Printed Sources: • Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. • Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. • Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. • Blaha M. Patterns of Data Modeling (Emerging Directions in Database Systems and Applications), CRC Press, 2010. — 261pp. • Kuate P.H. et al. NHybernate in Action, Manning Publications, 2009. — 400pp. • Golfarelli M., Rizzi S. Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill Osborne Media, 2009. — 480pp. • Inmon W. H., Krishnan K. Building the Unstructured Data Warehouse: Architecture, Analysis, and Design, Technics Publications, 2011. — 216pp. • Celko J. Joe Celko's Analytics and OLAP in SQL, Morgan Kaufmann, 2006. — 208pp. • Smith B.C., Clay C.R. Microsoft SQL Server 2008 MDX Step by Step, Microsoft Press, 2009. — 400pp. • Ben-Gan I. et al. Inside Microsoft SQL Server 2008: T-SQL Programming, Microsoft Press, 2009. — 832 pp. • Melton J., Buxton S. Querying XML: XQuery, XPath, and SQL/XML in context, Morgann Kaufmann, 2006. 848pp. • IV. Forms of Control Current control: attendance record, seminar-based knowledge control, group project control; - Intermediate control: written test by the end of Module 1, group project 1 presentation by the end of Module 2; written test by the end of Module 3, group project 2 presentation by the end of Module 4; - Final control: exam by the end of Module 4; - The final course grade is a sum of the following elements: 1) attendance record (A); 2) practice activities (S); 3) group projects (P1, P2); Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 5 4) written test (T); 5) exam (E). The overall and accumulated course grades Go and Ga (10-point scale each) are calculated as follows: Ga = 0.15A + 0.15S + 0.3(P1+P2) + 0.4T; Go = 0.5Ga + 0.5E. The overall and accumulated course grades Go and Ga (10-point scale each) include results achieved by students in their attendance record A, practice activities S, group project P, written test T and exam E; it is rounded up to an integer number of points. The rounding procedure accounts for students' practice activities during seminars. Intermediate assessment retakes are not allowed. Conversion of the concluding rounded grade (FE) to five-point scale grade is done in accordance with the following table: Summary Table : Correspondence of ten-point (10) to five-point (5) system's marks Ten-point scale [10] 1 2 3 4 5 6 7 8 9 10 - unsatisfactory - very bad - bad - satisfactory - quite satisfactory - good - very good - nearly excellent - excellent - brilliantly Five-point scale [5] Unsatisfactory (UnS) - 2 Satisfactory (S) - 3 Good (G) - 4 Excellent (E) - 5 V. Course Contents Тopic 1: Introduction. ♦ Outline: Course overview and logistics. History of data management approaches. Basic database system concepts. Database environment. Database users. Database development process. Database planning. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.1] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.1] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 6 Тopic 2: Data Modeling. ♦ Outline: Three level database architecture. Data model. Data independence. Inverted files. Early data models: hierarchical and network. Basic relational model concepts. Object-oriented model. Object-relational model. Semi-structured model. Semantic data models. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.1] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.1, 2.1] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 3: Relational Model. ♦ Outline: History of relational model. Advantages of relational model. Basic relational data structures. Mathematical and database relations. Relation schema. Relational database. Integrity constraints. Relation keys. Key constraint. Foreign key constraint. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.2] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.2.2] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 4: Relational Query Languages. ♦ Outline: Relational algebra. Relational algebra operations. Selection. Projection. Set operations. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 7 Renaming. Join, equijoin, antijoin, theta-join. Natural join. Division. Tuple relational calculus: atoms, formulas, queries. Domain relational calculus. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.6] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.5] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 5: Structured Query Language. ♦ Outline: SQL language history. SQL standards. Data definition and data manipulation (sub)languages. SQL data types. Table declaration. Primary keys, unique constraints, default values, nullable attributes. Check constraints. Foreign key constraints. Handling foreign key violations. Indexes. Database schema modifications. SQL query sublanguage: SELECT. Single-table queries. Filtering conditions. Logical operations IN, ALL, EXISTS. Join queries. Join types: cross, natural, inner, outer, self. Duplicates elimination. Set operations. Nested queries. Correlated nested queries. Aggregate functions. Grouping and group filtering. Query result sorting. INSERT. UPDATE . DELETE. Views: creation, use and updating. Triggers: creation, activation, execution. Multiple triggers. View materialization. SQL procedural extensions and dialects: T-SQL, PL/SQL, PgSQL. Stored procedures. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.3,4,5] Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 8 Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.6,7,8] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 6: Database Design: E/R and UML Approaches. ♦ Outline: Entity-relationship model. Entities and attributes. Entity types. Keys. E/R diagram. Relationships. Attributes and roles. Relationship type, degree and cardinality. Relationship participation constraints. Strong and weak entities. Entity type hierarchies. Specialization and generalization. Total and partial unions. Unified modeling language class diagram. Association and aggregation in UML. Generalization hierarchies in UML. Multiplicity indicators in UML. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.7] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.4] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 7: Relational Database Design. ♦ Outline: Objectives of normalization. Limitations of E/R design. Redundancy. Anomalies: insertion, deletion, update. Decomposition. Informal guidelines for relation design. Functional dependencies. Axioms of functional dependencies. Closure. Minimal cover of a set of dependencies. Desirable properties of decompositions: attributes preservation, dependency preservation, lossless join. First normal form. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 9 Full functional dependencies. Second normal form. Transitive dependency. Third normal form Boyce/Codd normal form (BCNF). Multivalued dependencies. Fourth normal form. Fifth normal form. Domain/Key normal form (DKNF). BCNF decomposition algorithm and its properties. Normalization drawbacks. Denormalization. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.8] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.3] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 8: Application Design and Development. ♦ Outline: Database access from programming languages. ODBC architecture. ODBC Drivers. ODBC connection strings. JDBC architecture. Connecting to DBMS. Preparing and executing queries. Using result sets and cursors. Handling exceptions. Transactions in JDBC. Object-relational mapping. Design patterns for data persistence. Active Record pattern. Data Mapper pattern. Hybernate. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.9] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.9] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 9: Storage and File Structure. ♦ Outline: Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 10 Memory hierarchy. Disk storage: physical disk structure, pages and blocks I/O time: seek latency, transfer time. Disk cache. RAID. SSD. SAN and NAS. Datafiles: blocks and extents. Block structure. Fixed and variable record formats. Large objects (LOBs). ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.10] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.13] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 10: Indexing and Hashing. ♦ Outline: Indexing concepts. B-tree index. B-tree insertions and deletions. Hash index. Hash functions. Extendible hashing. Bitmap index. Join index. GiST. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.11] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.14] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 11: Query processing. ♦ Outline: Query processing overview. Relational algebra translation. Query tree. Relational algebra equivalences. Heuristics for optimization. Cost-based optimization. Cost factors and estimation. External merge sort. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 11 Duplicate elimination. Implementing set operations. Sort-based and hash-based projection. Computing selection without indexes. Computing selection with clustered index. Computing selection with b-tree index. Computing selection with hash index. Computing joins: nested loops. Computing joins: block nested loops. Computing joins: sort-merge join. Computing joins: hash-join. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.12,13] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.15,16] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 12:Transaction management. ♦ Outline: Single-user and multi-user systems. Basic concepts of transactions. ACID properties. Isolation. Serial and interleaved execution. Schedules: serial, serializable. Methods to ensure serializability. Concurrency control. Optimistic and pessimistic concurrency. Two-phase locking. Locking and deadlocks. Implementing isolation levels with locks. Snapshot isolation. Timestamping. Atomicity and Durability. Write-ahead log. Redo and undo records. Recovery from crash. Checkpoints. Distributed transactions. Two-phase commit protocol. Replication. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.14,15,16] Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 12 Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.17,18,19] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 13: Distributed and parallel databases. ♦ Outline: Distributed database systems: advantages and drawbacks. Distributed database architectures. Software components and functions of distributed DBMS. Data placement. Transaction management for distributed DBMS. Locking protocols. Timestamping protocols. Commit protocols. Distributed recovery from failures. Distributed query processing. Data integration. Parallel database systems. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.17,18,19] Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.20] Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 14: Data Warehousing and Big Data Management. ♦ Outline: Role and purpose of a data warehouse. Components of a data warehouse. Data warehousing and OLAP. Semistructured data and XML. Big data concepts and notions. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.20,23,26] Franks B. Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics (Wiley and SAS Business Series), Wiley, 2012. — 336pp. Тopic 15: Data Warehouse Architecture and Models. ♦ Outline: Requirements to data management from decision support systems. Historical, summarized, integrated data. Statistical and analytical queries. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 13 Business intelligence applications. Three-tier architecture. Data warehouse, data mart. Online analytical processing (OLAP). Conceptual models for decision support. Multidimensional view on the data. Cross-tabulation. Data cubes. Operations with data cubes: roll-up, drill-down, pivot, slice & dice, select. Attribute hierarchies. Types of hierarchies. Query languages for supporting OLAP. SQL extensions: Group by cube, group by rollup. Analytic functions. Rank functions, window functions, lag/lead functions. Multidimensional expressions (MDX). Relational OLAP (ROLAP): Star schema, snowflake schema, snowflake constellation. Multi-dimensional OLAP (MOLAP): multicubes and hypercubes, sparse and dense dimensions. Indexing of dimensions: b-tree, bitmap, join indexes. Hybrid OLAP (HOLAP). Slowly changing dimensions. Temporal databases. Valid time, transaction time. Bitemporal data model. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch. 20] Golfarelli M., Rizzi S. Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill Osborne Media, 2009. — 480pp. Celko J. Joe Celko's Analytics and OLAP in SQL, Morgan Kaufmann, 2006. — 208pp. Smith B.C., Clay C.R. Microsoft SQL Server 2008 MDX Step by Step, Microsoft Press, 2009. — 400pp. Chaudhury S., Dayal U. An overview of data warehousing and OLAP technology. // SIGMOD Record, v.26 n.2, pp.507-508, 1997. Harinarayan V., Rajaraman A., and Ullman J. Implementing Data Cubes Efficiently. // In Proceedings of the 1996 ACM SIGMOD international conference on Management of data (SIGMOD '96), pp. 205-216, 1996. Jensen C., Pedersen T., Thomsen C. Multidimensional Databases and Data Warehousing, Morgan & Claypool Publishers, 2010. — 111p. Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.10.6,10.7] Rainardi V. Building a Data Warehouse: With Examples in SQL Server. Apress, 2008. — 541pp. Inmon W. H., Krishnan K. Building the Unstructured Data Warehouse: Architecture, Analysis, and Design, Technics Publications, 2011. — 216pp. Kimball R., Ross M. The Data Warehouse Toolkit, Wiley, 2002. — 447pp. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 14 Тopic 16: Data Cleaning and Integration. ♦ Outline: Data integration issues and stages. ETL process design and implementation. Error handling in ETL process. Metadata and ontologies in data integration. Data structures for ETL. Extracting data from external sources. Data profiling. Data cleaning. Data quality dimensions, issues and constraints. Typical cleaning checks. Conforming data. Data enrichment. Entity resolution. Data transformation. Loading data into warehouse. Bulk load. ♦ Core books (sources of information) Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. [Ch.21] Kimball R., Caserta J. The Data Warehouse ETL Toolkit, Wiley, 2004. — 491pp. Golfarelli M., Rizzi S. Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill Osborne Media, 2009. — 480pp. Rodrigues F., Coles M., Dye D. Pro SQL Server 2012 Integration Services, Apress, 2012. — 636 pp. Тopic 17: Map/Reduce and Hadoop. ♦ Outline: Large-scale massive data processing. Map/reduce paradigm. Hadoop map/reduce implementation. Hadoop distributed filesystem HDFS. Hadoop I/O. Developing hadoop applications. Managing job configuration. Mapper, Reducer, Combiner. Running job locally and on a cluster. Map/reduce design patterns. Pig. Pig Latin. Hive. HiveQL. HBase BigTable implementation. ♦ Core books (sources of information) White T. Hadoop: The Definitive Guide, 3rd edition, O’Reilly, 2012. — 657pp. Holmes A. Hadoop In Practice, Manning Publications, 2012. — 511pp. Miner D., Shook A. MapReduce Design Patterns, O’Reilly, 2012. — 232pp. Gates A. Programming Pig, O’Reilly, 2011. — 203pp. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 15 Тopic 18: NoSQL Databases. ♦ Outline: Hashing for distributed data storage. Consistent hashing. Dynamic hash tables (DHT). Vector clocks and conflict detection. Gossip protocol and hinted handoff. Merkle trees. Key/value store examples: Project Voldemort, Membase, Riak, Cassandra. Extended key/value (data structures) store: Redis. Document databases: MongoDB. Graph databases: Neo4j. ♦ Core books (sources of information) Tiwari S. Professional NoSQL, Wrox, 2011. — 384pp. Redmond E. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, Pragmatic Bookshelf, 2012. — 352pp. Robinson I. Graph Databases, O’Reilly, 2013. — 224pp. Тopic 19: SemistructuredData and XML. ♦ Outline: Semistructured data for human and machine consumption. XML: tags, elements, attributes, values. Well-formed and valid documents. Namespaces. XPath: axes, node-tests, predicates. XQuery: expressions, functions. FLWOR expressions. XQuery data model. DTD and XML Schema. Simple and complex types of XML Schema. ♦ Core books (sources of information) Silberschatz A., Korth H.F., Sudarshan S. Database System Concepts, 6th ed, McGraw-Hill, 2010. — 1376pp. [Ch.23] Melton J., Buxton S. Querying XML: XQuery, XPath, and SQL/XML in context, Morgann Kaufmann, 2006. 848pp. Ben-Gan I. et al. Inside Microsoft SQL Server 2008: T-SQL Programming, Microsoft Press, 2009. — 832 pp. Garcia-Molina H., Ullman J., Widom J. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2009. — 1248pp. Elmasri R., Navathe S.B. Fundamentals of Database Systems, 6th ed., Addison Wesley, 2010. — 1200 pp. Тopic 20: Data Stream Management Systems. ♦ Outline: Data stream example applications. Data stream systems: STREAM, Aurora. Data stream models: relational and XML. Window-based processing. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 16 STREAM CQL. Query semantics. Mapping Streams to Relations and vice versa. Aurora SQuAL. Operators. Query processing. STREAM: architecture, query plan, optimizations. Aurora: architecture, optimizations. Distributed stream processing: Borealis. \♦ Core books (sources of information) Golab L., Oszu M. T. Data Stream Management (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2010. — 80 pp. Arasu A., Babu S., Widom J. The CQL Continuous Query Language: Semantic Foundations and Query Execution, VLDB Journal, 15(2), 2006. pp.121-142. Arasu A., Babcock B., Babu S., Cieslewicz J., Datar M., Ito K., Motwani R., Srivastava U., Widom J. STREAM: The Stanford Data Stream Management System. // M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams, Springer, 2009. Cherniack M., Zdonik S. Stream-Oriented Query Languages and Operators, Encyclopedia of Database Systems, Springer, 2009. pp 2848-2854. VI. Assignment topics Home assignment for group project: The course includes two group projects (one in each year of study), compulsory to all students. Students will work in groups of up to 4 students on one of the suggested topics. First project: Build an relational database application using the techniques studied in the course. Second project: Identify a Big Data problem that you would like to work on. You should specify the following: Ask a difficult question. Identify data sources that you beleive might help answering the question. Build a data warehouse aimed at answering the question. Build OLAP solution aimed at answering the question. Build Hadoop (or NoSQL) solution aimed at answering the question. Compare built solutions, describe relative strengths and weaknesses. Written tests: The written tests is a testing assignment based on the covered topics. V. Topics for course final quality assessment Exam topics: Basic database system concepts. Database environment. Database planning. Three level database architecture. Basic relational model concepts. Object-oriented model. Object-relational model. Semi-structured model. Mathematical and database relations. Relation schema. Relational database. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 17 Integrity constraints. Relation keys. Key constraint. Foreign key constraint. Relational algebra operations. Selection. Projection. Relational algebra operations. Set operations. Relational algebra operations. Join, equijoin, antijoin, theta-join. Natural join. Relational algebra operations. Division. Tuple relational calculus: atoms, formulas, queries. SQL data types. SQL. Table declaration. Primary keys, unique constraints, default values, nullable attributes. SQL. Check constraints. SQL. Foreign key constraints. Handling foreign key violations. SQL. Database schema modifications. SQL. Single-table queries. Filtering conditions. Logical operations IN, ALL, EXISTS. SQL. Join queries. Join types: cross, natural, inner, outer, self. SQL. Duplicates elimination. SQL. Set operations. SQL. Nested queries. Correlated nested queries. SQL. Aggregate functions. SQL. Grouping and group filtering. SQL. Query result sorting. SQL. INSERT. SQL. UPDATE . SQL. DELETE. SQL. Views: creation, use and updating. SQL. Triggers: creation, activation, execution. Multiple triggers. SQL. View materialization. Stored procedures. Entity-relationship model. Entities and attributes. Entity types. Keys. Entity-relationship model. Relationships. Attributes and roles. Relationship type, degree and cardinality. Relationship participation constraints. Entity type hierarchies. Specialization and generalization. Total and partial unions. Unified modeling language class diagram. Association and aggregation in UML. Generalization hierarchies in UML. Multiplicity indicators in UML. Objectives of normalization. Limitations of E/R design. Data redundancy. Anomalies: insertion, deletion, update. Functional dependencies. Axioms of functional dependencies. Closure.Minimal cover of a set of dependencies. Desirable properties of decompositions: attributes preservation, dependency preservation, lossless join. First normal form. Full functional dependencies. Second normal form. Transitive dependency. Third normal form. Boyce/Codd normal form (BCNF). Multivalued dependencies. Fourth normal form. Fifth normal form. Domain/Key normal form (DKNF). Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 18 BCNF decomposition algorithm and its properties. Normalization drawbacks. JDBC architecture, connecting to DBMS, preparing and executing queries, using result sets and cursors. Handling exceptions in JDBC. Transactions in JDBC. Object-relational mapping. Design patterns for data persistence. Active Record pattern. Design patterns for data persistence. Data Mapper pattern. Hybernate. Datafiles: blocks and extents. Block structure. Fixed and variable record formats. Large objects (LOBs). B-tree index. B-tree insertions and deletions. Hash index. Hash functions.Extendible hashing. Bitmap index. Join index. GiST. Relational algebra translation. Query tree. Relational algebra equivalences. Cost-based optimization. Cost factors and estimation. External merge sort. Duplicate elimination. Implementing set operations. Sort-based and hash-based projection. Computing selection without indexes. Computing selection with clustered index. Computing selection with b-tree index. Computing selection with hash index. Computing joins: nested loops. Computing joins: block nested loops. Computing joins: sort-merge join. Computing joins: hash-join. Basic concepts of transactions. ACID properties. Schedules: serial, serializable. Methods to ensure serializability. Optimistic and pessimistic concurrency. Two-phase locking. Locking and deadlocks. Implementing isolation levels with locks. Snapshot isolation. Write-ahead log. Redo and undo records. Recovery from crash. Distributed transactions. Two-phase commit protocol. Distributed database architectures. Software components and functions of distributed DBMS. Transaction management for distributed DBMS. Distributed recovery from failures. Distributed query processing. Data integration. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 19 Parallel database systems. Object-relational mapper. Requirements to data management from decision support systems. Extract-transform-load process. Conceptual models for decision support. Multidimensional view on the data. Operations with data cubes: roll-up, drill-down, pivot, slice & dice, select. Query languages for supporting OLAP. SQL extensions: Group by cube, group by rollup. Multidimensional expressions (MDX). View materialization: optimal set of views, partial order on views, cost model, greedy algorithm. Relational OLAP (ROLAP): Star schema, snowflake schema, snowflake constellation. Multi-dimensional OLAP (MOLAP): multicubes and hypercubes, sparse and dense dimensions. Indexing of dimensions: bitmap indexes. Indexing of dimensions: join indexes. Hybrid OLAP (HOLAP). XML: tags, elements, attributes, values. Well-formed and valid documents. XPath: axes, node-tests, predicates. XQuery expressions. XQuery functions. FLWOR expressions. XQuery data model. DTD. XML Schema: simple and complex types. ETL process design and implementation. Data cleaning. Data quality dimensions, issues and constraints. Typical cleaning checks. Entity resolution. Data transformation. Loading data into warehouse. Bulk load. Map/reduce paradigm. Hadoop map/reduce implementation. Hadoop distributed filesystem HDFS. Map/reduce design patterns. Pig. Pig Latin. Hive. HiveQL. HBase BigTable implementation. Dynamic hash tables (DHT). Vector clocks and conflict detection. Gossip protocol and hinted handoff. Key/value stores: properties and usage. Extended key/value (data structures) store Redis: properties and usage. Document database MongoDB: properties and usage. Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 20 Graph database Neo4j: properties and usage. Data stream models: relational and XML. Window-based processing. STREAM CQL. Query semantics. Mapping Streams to Relations and vice versa. Aurora SQuAL. Operators. STREAM: architecture, query plan, optimizations. Aurora: architecture, optimizations. Distributed stream processing: Borealis. The author of program: _____________________ A.D.Breyman Course «Databases» Syllabus [ 2013-2014 ac. year ] --- page 21