Lecture: Databases - Perspectives on Information

advertisement
Databases
From A to Boyce Codd
What is a database?
It depends on your point of view.
For Manovich, a database is a means of structuring
information in which multiple trajectories of
interaction with the information are possible. When
thinking about “the database” on this level of
abstraction, the difference between a collection of
discrete objects (e.g., a bunch of books) and a
particular structured representation of those objects
(e.g., a set of catalog records) are, for example, not
so different. Manovich says exactly this.
What is a database?
In contrast, in the computational universe, the
difference between a set of files and a set of
relational tables is significant, in terms of the
operations that can be performed on each.
For computational purposes, a database is a set
of structured data; Brookshear includes the
requirement that the data be structured in a
multidimensional way, so that it can be presented
“from a variety of perspectives.”
What is a database?
Are these databases? In what way?
• The books in my office.
• The PCL.
• YouTube.
• The Internet Movie Database.
• The Web.
What is useful about calling any of these things
databases or not?
What is a database?
Even in the computational universe, a “database”
can exist at multiple levels of abstraction: a
conceptual model of a database (as presented
through entity-relationship diagrams) can be
implemented via different logical models (e.g.,
as different tables in a relational database, or
potentially as objects in an object database, or...),
and these can be stored and accessed differently
(for example, distributed over many servers)...
Data modeling with ER diagrams
Entity-relationship diagrams depict a conceptual
model (schema) of data by specifying entities
(things), attributes (properties of the things) and
relationships between the things.
ERDs document semantics, not implementation.
There are many forms of notation for ERDs
(Chen, the optional reading, is an early one).
Data modeling with ER diagrams
There is often no “right answer” when determining
entities, attributes, and relationships; different
entities might be defined, or entities might be
defined as relationships instead. It can be difficult to
predict consequences of one modeling choice vs.
another.
Similar decisions and consequences occur when
translating a conceptual data model to a logical
model for implementation (and then in the actual
implementation).
Data modeling with ER diagrams
Let’s take a simple example to play with ERDs.
We want to model the idea that students take
courses and instructors teach courses.
One way to do this is to have three entities:
students, courses, and instructors.
Silly ER example #1
Data modeling with ER diagrams
Now, we could also do this by having two
entities: courses and people. In this model,
people would have two roles, instructor and
student (Chen has an example of a marriage as a
relationship between two people).
Silly ER example #2
Data modeling with ER diagrams
If we are interested in capturing information other
information about students and instructors besides
their names, then we probably want to separate them
(e.g., we might want to track how many credits
students have and how many requirements they
have completed, and these don’t apply to
instructors).
If there was overlap between students and
instructors, we might have some redundancy going
on. We might want to keep this in mind.
Back to silly ER example #1
Data modeling with ER diagrams
Say we want to add the idea of grades to this
model. Instructors assess student performance in
courses by assigning them grades.
How might we model grades? Are they entities?
Are they attributes of some existing entity? Are
they are relationship between entities?
Let’s think about it.
Does this model work?
From ERD to database
Databases can be implemented in different ways.
Brookshear describes relational databases and
object-oriented databases.
In each case, the database would be implemented
in a database management system (DBMS),
which would hide details of the actual data
composition and storage and such from
application programs that use the data.
Relational databases
In a relational database, entities are described as
rows in a table. The table is called a relation.
The table columns are attributes.
Each table row is also called a tuple.
A relation
Here is a relation that contains information about
peer reviewers for an academic conference.
There are some problems with this relation.
A relation
One immediate problem is that we have multiple
values for a single attribute. That’s going to
make it hard to figure out how many papers are
really assigned to Barb Chen (is that one number
or two numbers), or to reassign papers, etc.
Revised relation
We can fix this by making additional tuples so that no
cell has multiple values. But then we have some
redundancy in the relation; every time we assign a new
paper, we need to input that name, school and city
information as well. Redundancy can cause problems
with data integrity.
Revised relations
To fix this redundancy problem, we need to split
up the table into multiple tables and relate each
table with a “foreign key” that links back to the
original table.
Then we can also include more information
about the papers, which is probably necessary
anyway.
Revised relations
Revised relations
There’s still a problem with that original table,
though; the city attribute isn’t really about the
reviewer, is it? It’s about the school. It might not
be likely that schools will change cities, but it’s
still good practice to keep all the attributes in a
table about the entity specified by the primary
key (in this case, the reviewer).
Revised relations
Revised relations
So that’s good. But there was still redundancy in
those Paper and Expertise tables. Also we have
deletion problems...if a paper isn’t assigned to a
reviewer, does that mean it can’t exist in the
database? How would we fix that?
Revised relations
Normalization
Wowee, that’s a lot of tables there! Do we really
want to do that?
Yes. And no. On the one hand, it’s good to
minimize data redundancy, because this can lead
to problems with updating. On the other hand,
performing lots of Join operations to put
information together again can be inefficient,
from a performance perspective.
Database operations
Databases are powerful because we can
reassemble data in myriad ways.
Basic relational database operations include
Select, Project, and Join.
Database operations
Select extracts rows (tuples) from a relation.
Project extracts columns (attribute values).
Join combines multiple relations into a new
relation.
Query languages
Database query languages, such as SQL, may
implement the basic operations in different ways.
A SQL statement may perform all three
operations—Join, Select, and Project.
Wait, what about objects?
Indeed, an object-oriented database is a different
model than a relational one.
Object-oriented databases
Object-oriented databases can be:
• More flexible.
• Better integrated with applications.
However, there are many standard tools for
managing relational databases, which can make
them easier to administer.
Very large databases
For tremendous datasets such as Facebook,
performance and storage become tremendously
important. Facebook wrote its own system,
Cassandra, that is optimized for distributed storage
and speed of “massive” data. (This is a “NoSQL”
database model.)
“Cassandra does not support a full relational data
model; instead, it provides clients with a simple data
model that supports dynamic control over data
layout and format.”
So...what then?
Is learning MySQL stupid?
No; but it’s important to remember where
MySQL lies in the realm of “the database.” For
example, conceptual models may be stable
where logical models and implementation details
change.
Download