Lecture 1

Database Management Systems
CS 564
Lecture #1
(with some slides integrated from those of Raghu
Ramakrishnan, Jeff Ullman, Alon Halevy, and Dan Suciu.)
1
Yes. This is the Room for CS 564
• We moved from Humanities 1111
• All future lectures/discussions will be in this
room
• Please sit a bit closer to the screen, so that I
don’t have to shout
• Room doors are usually locked; I will unlock
15 minutes before each class
2
A Bit about Myself


Born in Vietnam
Grew up in a fishing village

Nice name:
AnHai Doan
“Nghe An” “Hai Phong”

Until my brother was born
as HaiAn Doan
3
Vietnam → Hungary → US

High school in Vietnam

Undergrad in Hungary
– had a lot of beer
– learned seven languages
– Hungarian, English, C, C++, Ada, Pascal, PL/I

When the Iron Curtain fell, back in 1993,
I was one of the first to reach the US to study
4
Wisconsin → Seattle → Illinois → Wisconsin

Masters at Wisconsin-Milwaukee

Ph.D. at Washington-Seattle
– where I failed to take “CS 564”

started at Univ of Illinois-Urbana
– with corn, cow, campus

In Madison since 2006
– where the four major food groups are
5
Random Comments from Students
• Take instruction seriously, … gave lots of
really excellent dating advice
• All in-class examples revolve around beer
• His accent is very annoying …
• His accent is great. It’s so hard to understand
that I’m forced to concentrate in lectures …
• His accent is a bonus feature of the class.
Prepared me to work in Silicon Valley
• I now love databases …When I own Oracle, I
will pay you back.
6
What is this Course about?
• Numerous applications must deal with a lot of
data
• They typically put data into a database
• The database will be managed by a system
called database management system
• Applications then interact with this system to
access and use the database
7
An Illustration
[Figure: App 1 and App 2 interact with the database management system, which manages DB 1, DB 2, and DB 3]
8
Questions
• What form should the data be in?
– way back in the 1970s, people suggested storing data in
tables
– so each database is a set of tables
Students
ID | First Name | Last Name
1 | Barack | Obama
2 | George | Bush

Addresses
ID | City | State
1 | Washington DC | Washington DC
2 | Dallas | TX
9
Questions
• What form should the data be in?
– each table can be thought of as a relation in the
mathematical sense
– so such a database is referred to as a relational DB
Students
ID | First Name | Last Name
1 | Barack | Obama
2 | George | Bush

Addresses
ID | City | State
1 | Washington DC | Washington DC
2 | Dallas | TX
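To make "relation in the mathematical sense" concrete: a relation is a set of tuples, each drawn from the cross product of the column domains. The sketch below encodes the two tables above as Python sets. The variable names, and the assumption that the two ID columns refer to the same person, are illustrative and not from the slides.

```python
# A relation is a set of tuples; the column names form its schema.
students_schema = ("ID", "First Name", "Last Name")
students = {
    (1, "Barack", "Obama"),
    (2, "George", "Bush"),
}

addresses_schema = ("ID", "City", "State")
addresses = {
    (1, "Washington DC", "Washington DC"),
    (2, "Dallas", "TX"),
}

# Sets ignore duplicates and ordering, just like mathematical relations.
students.add((1, "Barack", "Obama"))  # no effect: the tuple is already present

# Assumption for illustration: matching IDs in the two tables mean the same person.
# Pairing tuples on that column yields a new relation (a join).
lives_in = {
    (first, last, city)
    for (sid, first, last) in students
    for (aid, city, state) in addresses
    if sid == aid
}
print(sorted(lives_in))
```

Because the result of combining relations is again a relation, operations like this can be chained, which is what a query language builds on.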
10
So the management system is called
a relational database management
system (or RDBMS for short)
[Figure: App 1 and App 2 interact with the database management system, which manages DB 1, DB 2, and DB 3]
11
Since the 1970s, RDBMSs have been
studied intensively, and have taken
over the world
• They are now a cornerstone of the modern world
• Powering virtually all data-intensive apps
• $20B industry
• Bought island in Hawaii
• Since then new types of data have emerged
– that are not well suited to being modeled as
tables
12
• New types of database management systems
have also emerged
– eg NoSQL systems
• But RDBMSs remain foundational and
pervasive, and will be so in the future
• This class focuses on RDBMSs
– we will learn how to design a relational database
– how to store it in an RDBMS
– how to use an RDBMS
– look into the internals of RDBMS
13
• Lessons that you learn in this class will carry
over to newer types of database management
systems
• You will learn fundamentals of managing a
large amount of data
– critical as the world is becoming increasingly data
centric
• Good for you when you go applying for a job
– many jobs require knowing how to use RDBMSs
• It’s fun
14
• If you are interested in more data management
stuff
– CS 764: gory details about RDBMSs
– CS 784: newer types of data and how to manage them
(beyond RDBMSs)
15
Course Logistics
16
Prerequisite
• Must have data structure and algorithm background
– CS 367 is a must; CS 537 might be useful
• For the project
– a lot of programming will be required
– in a high-level language of your own choosing (or rather your
team’s choosing)
– could be Java, C, C++, Perl, Python, etc.
– must know how to build a Web based application or be willing to
learn
17
Textbook
– There is no ideal textbook, unfortunately
– Database Management Systems, by R. Ramakrishnan
and J. Gehrke, third edition
– Database Systems: The Complete Book, by Garcia-Molina, Ullman and Widom, second edition
– The best thing to do is to attend the lectures, make
notes, and read the lecture notes
– Consult the textbooks
– If you do this, you will be fine
18
Course Format
• For all students
– two 75-min lectures / week
– project: programming, 4-5 stages, may include some
basic homework questions
– a midterm and a final exam
• Attending lectures on Wed/Fri is important
• We also use the Mon slots occasionally for
make-up lectures
• So if you can’t make Monday 2:25-3:15, do not
take the class
19
• In fact, for next week I’m traveling on W and
F
• So we will have a make-up lecture on Monday,
Jan 26
20
Lectures
• Lecture slides in ppt format will be posted
shortly before or after the lecture
– are to complement the lectures
• Many issues discussed in the lectures will be
covered in the exams
– hence try to attend lectures regularly
• Will not cover ALL materials on the slides
– attending lectures will tell you which is covered and
which is not
21
Project
• Select an application that needs a database
• Build a database application from start to
finish
• Significant amount of programming
• Will be done in stages
– you will submit some work at the end of each stage
• May have to show a demo at semester end
22
Project Groups
• Project will be done in groups of 3-4 students
– a lot of work, difficult to design so that
one person can do all
– learn how to work in a group: valuable skills
– groups are like broccoli, they are good for you
• Try to form groups as soon as possible
– can start by posting requests on Piazza
• There will be a deadline later for forming
groups
• If you have not formed groups by then
– we will help assign you to groups
23
More on Grouping
• All group members receive the same grade
• If someone drops out, the rest pick up the
work
24
Exams
• Midterm & final
– will be announced shortly
– check dates and make sure no conflict!
• There may be some brief review before each
exam
• If you have conflicts
– do let us know in advance
• The Uncle problem
25
Tentative Grading Breakdown
• Midterm: 25%
• Final: 35%
• Project: 40%
• Will attempt to grade on an absolute scale as
much as possible
– not on a curve
26
Contacting the staff ...
27
Staff & Office Hours
• Instructor: AnHai Doan
• TAs:
– Avinaash Gupta
– Harneet Singh
• See class homepage for office hours, contact
information
28
Communications
• class homepage
– www.cs.wisc.edu/~anhai/courses/564-sp15
• mailing list: compsci564-1-s15@lists.wisc.edu
– vitally important!
– make sure to check it regularly for new announcements
• Piazza: will be set up shortly
• If you have a question/problem
– talk to people in your group first
– post your question on Piazza
– email TA
– go to office hours to talk to TA or instructor
29
Now onto database studies ...
30
At the Beginning
• A program typically consists of code + data
• Eg, need to sort 1000 numbers
– 2, 4, 6, 8, 1, 13, 9, ...
• Store these numbers in an array
• Write some code to sort
• Both code + data are stored in memory, and
mixed together
– this was typical of the sort programs you learned in CS 367
31
• Eventually people realized that
– the data part could be huge; maybe not sorting 1000
numbers, but 1 trillion numbers
– this posed serious problems: what happens if the
data doesn’t fit into memory?
– another issue is that many apps may want to access
and do the same thing with data
– should we write duplicate code for each of these
apps?
– maybe we should factor out common code
– thus the motivation for databases and DB
management systems
32
An Illustration
[Figure: App 1 and App 2 interact with the database management system, which manages DB 1, DB 2, and DB 3]
33
Another Motivating Example
• Suppose we want to store, manipulate, and
query information about:
– students
– courses
– professors
– who takes what, who teaches what
34
Application Requirements
• store the data for a long period of time
– large amounts (100s of GB)
– protect against crashes
– protect against unauthorized use
• allow users to query/update:
– who teaches “CS 367”
– enroll “Mary” in “CS 564”
35
• allow several (100s, 1000s) users to access the
data simultaneously
• allow administrators to change the schema
– add information about TAs
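As a rough illustration of the query, update, and schema-change requirements above, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical, and an in-memory database is used only so the example runs anywhere; a real application would use a persistent database.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teaches (professor TEXT, course TEXT);
    CREATE TABLE Enrolled (student TEXT, course TEXT);
    INSERT INTO Teaches VALUES ('Prof. A', 'CS 367'), ('Prof. B', 'CS 564');
""")

# Query: who teaches "CS 367"?
teachers = [row[0] for row in conn.execute(
    "SELECT professor FROM Teaches WHERE course = ?", ("CS 367",))]

# Update: enroll "Mary" in "CS 564".
conn.execute("INSERT INTO Enrolled VALUES (?, ?)", ("Mary", "CS 564"))

# Schema change: add information about TAs; existing rows are kept as they are.
conn.execute("ALTER TABLE Teaches ADD COLUMN ta TEXT")
conn.execute("UPDATE Teaches SET ta = 'TA X' WHERE course = 'CS 564'")
conn.commit()

# Column names after the schema change (field 1 of each PRAGMA row is the name).
columns = [info[1] for info in conn.execute("PRAGMA table_info(Teaches)")]
enrolled = conn.execute("SELECT student, course FROM Enrolled").fetchall()
print(teachers, enrolled, columns)
```

Note that the earlier query keeps working after the `ta` column is added, because it names the columns it needs instead of depending on the file layout.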
36
Trying Without a DBMS
• Why Direct Implementation Won’t Work:
• Storing data: file system is limited
– size less than 4GB (on 32-bit machines)
– when the system crashes we may lose data
– password-based authorization insufficient
• Query/update:
– need to write a new C++/Java program for every new
query
– need to worry about performance
37
• Concurrency: limited protection
– need to worry about interfering with other users
– need to offer different views to different users (e.g.
registrar, students, professors)
• Schema change:
– entails changing file formats
– need to rewrite virtually all applications
• Better let a database system handle it
38
What Can a DBMS Do for Us?
• Data Definition Language - DDL
• Data Manipulation Language - DML
– query language
• Storage management
• Transaction Management
– concurrency control
– recovery
• Think buying a plane ticket! Can you do it
without a DBMS?
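The plane-ticket question can be sketched in a few lines. The example below, using Python's built-in `sqlite3` module, shows DDL (defining tables), DML (inserting and updating rows), and a transaction that either sells a seat completely or leaves the database unchanged. The `Flights` and `Tickets` tables, the flight code, and the `book` helper are all hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema. The CHECK constraint forbids overselling.
conn.executescript("""
    CREATE TABLE Flights (
        flight TEXT PRIMARY KEY,
        seats_left INTEGER CHECK (seats_left >= 0)
    );
    CREATE TABLE Tickets (passenger TEXT, flight TEXT);
""")

# DML: populate the database with one flight that has a single seat left.
conn.execute("INSERT INTO Flights VALUES ('XY100', 1)")
conn.commit()

def book(passenger, flight):
    """Sell one seat atomically: both statements take effect, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO Tickets VALUES (?, ?)", (passenger, flight))
            cur = conn.execute(
                "UPDATE Flights SET seats_left = seats_left - 1 WHERE flight = ?",
                (flight,))
            if cur.rowcount == 0:
                raise sqlite3.IntegrityError("unknown flight")
        return True
    except sqlite3.IntegrityError:
        return False  # sold out or unknown flight; the ticket row was rolled back

first = book("Mary", "XY100")
second = book("Dan", "XY100")
tickets = conn.execute("SELECT passenger FROM Tickets").fetchall()
print(first, second, tickets)
```

The second booking fails the constraint, and the rollback also undoes the ticket row that had already been inserted. Writing that guarantee by hand on top of plain files, with many users at once, is the hard part a DBMS takes over.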
39
What Can a DBMS Do for Us?
• Automate a lot of boring/mundane operations
on data
– so that we don’t have to program over and over
– so that we can write complex data manipulations in
just a few lines and concentrate on app
logic
• Make execution very fast
– so that it scales up to very large data sets
• Make concurrent access/modification possible
– so that many users can use the data at the same time
40
Building an Application with a DBMS
• Requirements modeling (conceptual, pictures)
– Decide what entities should be part of the application
and how they should be linked.
• Schema design and implementation
– Decide on a set of tables, attributes.
– Define the tables in the database system.
– Populate database (insert tuples).
• Write application programs using the DBMS
– way easier now that the data management is taken
care of.
41
Conceptual Modeling
[Figure: entity-relationship diagram. Entities: Student, Course, Professor. Relationships: Takes, Advises, Teaches. Attribute labels: ssn, name, category, cid, quarter, address, field]
42
Schema Design and Implementation
• Tables:
Students:
SSN | Name | Category
123-45-6789 | Charles | undergrad
234-56-7890 | Dan | grad
… | … | …

Takes:
SSN | CID
123-45-6789 | CSE444
123-45-6789 | CSE444
234-56-7890 | CSE142
… | …

Courses:
CID | Name | Quarter
CSE444 | Databases | fall
CSE541 | Operating systems | winter
• Separates the logical view from the physical
view of the data.
43
Querying a Database
• Find all courses that “Mary” takes
• S(tructured) Q(uery) L(anguage)
select C.name
from Students S, Takes T, Courses C
where S.name = 'Mary' and
S.ssn = T.ssn and T.cid = C.cid
• Query processor figures out how to answer the
query efficiently.
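To try the query end to end, here is a minimal sketch using Python's built-in `sqlite3` module. The three tables follow the names used in the query, but the column types and the sample rows (including the made-up SSNs) are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO Students VALUES ('111-11-1111', 'Mary'), ('222-22-2222', 'Dan');
    INSERT INTO Courses  VALUES ('CSE444', 'Databases'),
                                ('CSE541', 'Operating systems');
    INSERT INTO Takes    VALUES ('111-11-1111', 'CSE444'),
                                ('222-22-2222', 'CSE541');
""")

# The query from the slide; the student name is passed as a parameter.
rows = conn.execute("""
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
""", ("Mary",)).fetchall()

courses = [name for (name,) in rows]
print(courses)
```

The query says only what result is wanted; the order in which the three tables are read and matched is left to the query processor.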
44
Query Optimization
Goal:
Declarative SQL query:
select C.name
from Students S, Takes T, Courses C
where S.name = 'Mary' and
S.ssn = T.ssn and T.cid = C.cid
Imperative query execution plan:
π sname
  ⋈ cid=cid
    ⋈ sid=sid
      σ name='Mary'
        Students
      Takes
    Courses
Plan: tree of Relational Algebra operators,
choice of algorithms at each operator
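To show what "imperative" means here, the sketch below evaluates one possible plan for the query by hand in plain Python: a selection on Students, two joins, then a projection. The sample data, the function names, and the choice of a nested-loop join are illustrative assumptions; a real query optimizer would compare several plans and join algorithms before picking one.

```python
# Made-up sample data: one dict per tuple.
students = [{"ssn": "111", "name": "Mary"}, {"ssn": "222", "name": "Dan"}]
takes = [{"ssn": "111", "cid": "CSE444"}, {"ssn": "222", "cid": "CSE541"}]
courses = [{"cid": "CSE444", "name": "Databases"},
           {"cid": "CSE541", "name": "Operating systems"}]

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return (r for r in rows if predicate(r))

def nested_loop_join(left, right, left_key, right_key):
    """Join: pair up rows whose keys match, using two nested loops."""
    right = list(right)  # materialized because it is scanned once per left row
    for l in left:
        for r in right:
            if left_key(l) == right_key(r):
                yield (l, r)

def project(rows, extract):
    """Projection: keep only the requested value from each row."""
    return (extract(r) for r in rows)

# Bottom of the plan: filter Students first so the joins see fewer rows.
marys = select(students, lambda s: s["name"] == "Mary")
# Join with Takes on ssn; each output row is a (student, take) pair.
s_t = nested_loop_join(marys, takes, lambda s: s["ssn"], lambda t: t["ssn"])
# Join with Courses on cid; each output row is ((student, take), course).
s_t_c = nested_loop_join(s_t, courses, lambda st: st[1]["cid"], lambda c: c["cid"])
# Top of the plan: project the course name.
result = list(project(s_t_c, lambda stc: stc[1]["name"]))
print(result)
```

Applying the selection before the joins, or swapping the nested loops for a hash-based join, would give the same answer with different cost; that choice is what the optimizer makes.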
45
Database Industry
• Relational databases are a great success of
theoretical ideas.
• Big DBMS companies are among the largest
software companies in the world.
• Oracle
• IBM (with DB2)
• Microsoft (SQL Server, Microsoft Access)
• Others
• $20B industry.
46
The Study of DBMS
• Several aspects:
– Modeling and design of databases
– Database programming: querying and update
operations
– Database implementation
• DBMS study cuts across many fields of
Computer Science: OS, languages, AI, Logic,
multimedia, theory...
47