Databases From A to Boyce Codd

What is a database?

It depends on your point of view. For Manovich, a database is a means of structuring information in which multiple trajectories of interaction with the information are possible. At this level of abstraction, a collection of discrete objects (e.g., a bunch of books) and a particular structured representation of those objects (e.g., a set of catalog records) are not so different. Manovich says exactly this.

In contrast, in the computational universe, the difference between a set of files and a set of relational tables is significant in terms of the operations that can be performed on each. For computational purposes, a database is a set of structured data; Brookshear adds the requirement that the data be structured in a multidimensional way, so that it can be presented “from a variety of perspectives.”

Are these databases? In what way?
• The books in my office.
• The PCL.
• YouTube.
• The Internet Movie Database.
• The Web.
What is useful about calling any of these things databases or not?

Even in the computational universe, a “database” can exist at multiple levels of abstraction: a conceptual model of a database (as presented through entity-relationship diagrams) can be implemented via different logical models (e.g., as different tables in a relational database, or potentially as objects in an object database, or...), and these can be stored and accessed differently (for example, distributed over many servers)...

Data modeling with ER diagrams

Entity-relationship diagrams (ERDs) depict a conceptual model (schema) of data by specifying entities (things), attributes (properties of the things), and relationships between the things. ERDs document semantics, not implementation. There are many forms of notation for ERDs (Chen, the optional reading, is an early one).
There is often no “right answer” when determining entities, attributes, and relationships; different entities might be defined, or entities might be defined as relationships instead. It can be difficult to predict the consequences of one modeling choice vs. another. Similar decisions and consequences occur when translating a conceptual data model to a logical model for implementation (and then in the actual implementation).

Let’s take a simple example to play with ERDs. We want to model the idea that students take courses and instructors teach courses. One way to do this is to have three entities: students, courses, and instructors.

Silly ER example #1

Now, we could also do this by having two entities: courses and people. In this model, people would have two roles, instructor and student (Chen has an example of a marriage as a relationship between two people).

Silly ER example #2

If we are interested in capturing other information about students and instructors besides their names, then we probably want to separate them (e.g., we might want to track how many credits students have and how many requirements they have completed, and these don’t apply to instructors). If there were overlap between students and instructors, we might have some redundancy going on. We might want to keep this in mind.

Back to silly ER example #1

Say we want to add the idea of grades to this model. Instructors assess student performance in courses by assigning them grades. How might we model grades? Are they entities? Are they attributes of some existing entity? Are they a relationship between entities? Let’s think about it.

Does this model work?

From ERD to database

Databases can be implemented in different ways. Brookshear describes relational databases and object-oriented databases.
In each case, the database would be implemented in a database management system (DBMS), which would hide the details of the actual data composition and storage from the application programs that use the data.

Relational databases

In a relational database, entities are described as rows in a table. The table is called a relation. The table columns are attributes. Each table row is also called a tuple.

A relation

Here is a relation that contains information about peer reviewers for an academic conference. There are some problems with this relation.

One immediate problem is that we have multiple values for a single attribute. That’s going to make it hard to figure out how many papers are really assigned to Barb Chen (is that one number or two numbers?), or to reassign papers, etc.

Revised relation

We can fix this by making additional tuples so that no cell has multiple values. But then we have some redundancy in the relation; every time we assign a new paper, we need to input that name, school, and city information as well. Redundancy can cause problems with data integrity.

Revised relations

To fix this redundancy problem, we need to split up the table into multiple tables and relate each table with a “foreign key” that links back to the original table. Then we can also include more information about the papers, which is probably necessary anyway.

There’s still a problem with that original table, though; the city attribute isn’t really about the reviewer, is it? It’s about the school. It might not be likely that schools will change cities, but it’s still good practice to keep all the attributes in a table about the entity specified by the primary key (in this case, the reviewer).

So that’s good. But there was still redundancy in those Paper and Expertise tables. Also, we have deletion problems... if a paper isn’t assigned to a reviewer, does that mean it can’t exist in the database?
How would we fix that?

Normalization

Wowee, that’s a lot of tables there! Do we really want to do that? Yes. And no. On the one hand, it’s good to minimize data redundancy, because redundancy can lead to problems with updating. On the other hand, performing lots of Join operations to put information together again can be inefficient from a performance perspective.

Database operations

Databases are powerful because we can reassemble data in myriad ways. Basic relational database operations include Select, Project, and Join.

Select extracts rows (tuples) from a relation. Project extracts columns (attribute values). Join combines multiple relations into a new relation.

Query languages

Database query languages, such as SQL, may implement the basic operations in different ways. A single SQL statement may perform all three operations: Join, Select, and Project.

Wait, what about objects?

Indeed, an object-oriented database is a different model from a relational one.

Object-oriented databases

Object-oriented databases can be:
• More flexible.
• Better integrated with applications.
However, there are many standard tools for managing relational databases, which can make them easier to administer.

Very large databases

For tremendous datasets such as Facebook’s, performance and storage become tremendously important. Facebook wrote its own system, Cassandra, that is optimized for distributed storage and speed of “massive” data. (This is a “NoSQL” database model.) “Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.”

So...what then?

Is learning MySQL stupid? No; but it’s important to remember where MySQL lies in the realm of “the database.” For example, conceptual models may be stable where logical models and implementation details change.
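
A sketch to play with

Going back to Select, Project, and Join: here is a minimal sketch of the three operations, written in Python with the standard-library sqlite3 module (an in-memory SQLite database, swapped in for MySQL so that nothing needs to be installed). The table names, column names, and all of the data are hypothetical; they only loosely follow the reviewer example, with the city kept in a separate school table and linked by a foreign key.

```python
import sqlite3

# Hypothetical schema and data, loosely based on the reviewer example.
# The city is stored with the school, not the reviewer; school_id in the
# reviewer table is the foreign key that links the two relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school (
        school_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        city      TEXT NOT NULL
    );
    CREATE TABLE reviewer (
        reviewer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        school_id   INTEGER NOT NULL REFERENCES school(school_id)
    );
    INSERT INTO school VALUES
        (1, 'School A', 'City X'),
        (2, 'School B', 'City Y');
    INSERT INTO reviewer VALUES
        (1, 'Reviewer One', 1),
        (2, 'Reviewer Two', 2),
        (3, 'Reviewer Three', 1);
""")

# Select: keep only the rows (tuples) that satisfy a condition.
selected = conn.execute(
    "SELECT * FROM reviewer WHERE school_id = 1 ORDER BY reviewer_id"
).fetchall()

# Project: keep only some of the columns (attributes).
projected = conn.execute(
    "SELECT name FROM reviewer ORDER BY reviewer_id"
).fetchall()

# One statement doing all three: Join the two relations on the foreign key,
# Select the rows for one city, and Project two columns.
joined = conn.execute(
    """
    SELECT reviewer.name, school.city
    FROM reviewer
    JOIN school ON reviewer.school_id = school.school_id
    WHERE school.city = 'City X'
    ORDER BY reviewer.reviewer_id
    """
).fetchall()

conn.close()

print(selected)
print(projected)
print(joined)
```

Note that the last query puts back together, on demand, the name-and-city pairing that the single wide table stored redundantly. That is the trade-off from the normalization discussion: less redundancy in storage, more Join work at query time.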