Class 7 - Relational Databases
Exercise overview
Last week we explored the XML and JSON serialization standards and became familiar with the steps
behind validation and checking documents for well-formedness. We discovered that XML and JSON
serve related but different purposes and gained some familiarity with the issues of JSON and XML
serialization (e.g. JSON is more easily used in JavaScript but XML can preserve more detailed
metadata). Both JSON and XML followed the DOM data model in that they relied primarily on
hierarchical relationships and document-based serialization formats. This week we will look at a new
type of data model, the Relational Database (RDB).
We will learn the conceptual framework supporting RDBs and gain some familiarity with Microsoft
Access, a Relational Database Management System (RDBMS). This worksheet is
broken into three main sections: Getting access to Microsoft Access, Course Readings and Working
with Microsoft Access. All of our videos and examples are based on Microsoft Access 2013.
Instructions:
Work individually or in groups to complete the worksheet. When you get to a section that requires
you to select a resource to explore – pick one resource (please don’t always choose the first one!).
When asked to ‘discuss as a group’, consider your response and continue completing the worksheet.
We’re going to work with computer coding today and here’s an important note as you follow
the exercises. Computer code is shown on numbered lines and is enclosed in boxes. The
line numbers are simply a reference during instruction and should not be copied into
your program. For example, a line that reads 56. p { visibility:hidden; } should simply be typed in as
p { visibility:hidden; }
Metadata Standards and Web Services
Erik Mitchell
Page 1
Suggested readings
1. Mitchell, E. (2015). Chapter 6 in Metadata Standards and Web Services in Libraries, Archives, and Museums.
Libraries Unlimited. Santa Barbara, CA.
2. Vines, Rose. (2011). Databases from scratch I: Introduction. http://geekgirls.com/2011/09/databases-from-scratch-iintroduction/
3. Vines, Rose (2011). Databases from scratch II: Simple database design. http://geekgirls.com/2011/09/databasesfrom-scratch-ii-simple-database-design/
4. Vines, Rose. (2011). Databases from scratch III: The design process. http://geekgirls.com/2011/09/databases-fromscratch-iii-relational-design-process/
5. Note - Rose Vines also has tutorials IV-VII but they get a bit more complicated than we are going to.
Optional Readings
W3Schools. (2013). Introduction to SQL. http://www.w3schools.com/sql/sql
Reading discussion
Databases from Scratch sections 1-3 give us a brief introduction to why relational databases are
important. Read these sections and make sure you can answer the following questions.
Question 1. What are some examples of Relational Database applications?
Question 2. What types of tools do RDBMS programs typically include?
RDBMS applications commonly include tables, input forms, queries, and reporting functionalities.
Tables store your data and rely on relationships, data validation and data input to ensure the integrity
of the data relationships. Data input forms provide the user with an easy-to-use interface that
simplifies interaction with the database. Frequently these input forms are web-pages but many
RDBMS applications, including Microsoft Access, have built-in forms to make it easy to input and
work with data. Queries are a way of searching through a table or multiple tables using defined
criteria (e.g. show me all books published after 1990 that were written by Allen Downey). Queries
employ a search language called Structured Query Language (SQL) that is a generally accepted
standard in the relational database world. It is worth noting, however, that the exact syntax and
semantics of SQL vary between RDBMS platforms, making it difficult to migrate databases
between different RDBMSs. Finally, RDBMSs often support the creation of reports, which combine
the search and data combination features of queries with data layout, grouping and formatting
features. Reports are commonly designed with paper in mind but also support other output
formats, including PDF, XML, HTML and CSV.
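To make the idea of a query concrete, here is a minimal sketch of the "books published after 1990 that were written by Allen Downey" example. It uses Python's built-in sqlite3 module as a stand-in for Microsoft Access, which is an assumption made only so the example runs anywhere. The books table, its column names and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for an Access file (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books ("
    " book_id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author TEXT NOT NULL,"
    " year_published INTEGER NOT NULL)"
)

# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO books (title, author, year_published) VALUES (?, ?, ?)",
    [
        ("Sample Title A", "Allen Downey", 2012),
        ("Sample Title B", "Allen Downey", 1985),
        ("Sample Title C", "Another Author", 2001),
    ],
)


def books_by_author_after(conn, author, year):
    """Return titles by `author` published after `year`, oldest first."""
    rows = conn.execute(
        "SELECT title FROM books"
        " WHERE author = ? AND year_published > ?"
        " ORDER BY year_published",
        (author, year),
    )
    return [title for (title,) in rows]


recent_downey = books_by_author_after(conn, "Allen Downey", 1990)
print(recent_downey)
```

The WHERE clause carries the "defined criteria" of the query. In Access the same criteria would be entered in the query design grid or in SQL view, and the exact SQL syntax may differ from SQLite's.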
Question 3. From the readings, what are some of the key advantages of using a well-defined
database design?
Question 4. From the readings, what role does a UniqueID or Key serve?
Question 5. From the readings, what are the core elements of the database design process?
As our readings point out, database design is an iterative process and is often imperfect. The old
saying "the perfect is the enemy of the good" is very relevant in this situation as it can be very easy to
overdesign a database for uses that were not originally intended. In our exercise we will build on our
understanding of Dublin Core as it is represented in XML files and create a Relational Database
Structure that implements the DC elements in an RDBMS data model.
The building blocks of tables
Before we move on to our tour of Microsoft Access it is important that we understand the rules behind
an RDBMS table. These tables are the building blocks of data in our RDBMS and possess rules
similar to our encoding rules in XML. In RDBMSs, tables consist of data that is represented in rows
(e.g. 1 row = 1 record), described by fields represented in columns (e.g. 1 row has 20 fields and as
such 20 columns). Each field has a name (e.g. title, creator, date) and a data type. Data types
describe the data that is contained in the field and make it possible for the RDBMS to perform some
advanced operations. For example if you tell an RDBMS that a given field is a number then you
enable mathematical operations on that field. In contrast, if you define a field as a date then you get
access to certain formatting and sorting functions for dates.
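The effect of data types can be sketched in a few lines. The example below again assumes SQLite (through Python's sqlite3 module) rather than Access, and the items table and its rows are hypothetical. SQLite has no dedicated date type, so the sketch stores dates as ISO 8601 text (YYYY-MM-DD), which sorts chronologically; Access has its own Date/Time type that behaves differently in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " title TEXT,"
    " page_count INTEGER,"  # numeric field: arithmetic is meaningful
    " date_issued TEXT)"    # ISO 8601 text, an assumption for SQLite
)

# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("Item A", 120, "2001-05-14"),
        ("Item B", 80, "1998-11-02"),
        ("Item C", 200, "2010-01-30"),
    ],
)

# A numeric field supports mathematical operations such as SUM.
total_pages = conn.execute("SELECT SUM(page_count) FROM items").fetchone()[0]

# A date field supports chronological sorting...
titles_by_date = [
    title for (title,) in conn.execute("SELECT title FROM items ORDER BY date_issued")
]

# ...and date formatting functions, here extracting only the year.
years = [
    year
    for (year,) in conn.execute(
        "SELECT strftime('%Y', date_issued) FROM items ORDER BY date_issued"
    )
]

print(total_pages, titles_by_date, years)
```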
The figure below shows a generic table in Microsoft Access viewed in "Design View."
Figure 1 This image shows the field names and data types for each field
Figure 2 This image shows the datatypes available in Microsoft Access
In Relational Databases it is a convention and general best practice to establish an autonumbering
UniqueID/Primary Key for each table. Primary Keys are used to link tables together (e.g. recordid in
the records table links to recordid in the subjects table). Autonumbered UniqueIDs guarantee that
every record has a distinct key; once the relationships between tables are defined and enforced, the
database can use those keys to maintain referential integrity.
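One way to see keys and linked tables at work is the short sketch below. It assumes SQLite via Python's sqlite3 module instead of Access. The records and subjects tables follow the recordid example in the text, while the remaining column names and the sample values are hypothetical. Note that SQLite only enforces relationships after foreign key checking is switched on; in Access the comparable setting is chosen when a relationship is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.execute(
    "CREATE TABLE records ("
    " recordid INTEGER PRIMARY KEY AUTOINCREMENT,"  # autonumbered UniqueID
    " title TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE subjects ("
    " subjectid INTEGER PRIMARY KEY AUTOINCREMENT,"
    " recordid INTEGER NOT NULL REFERENCES records(recordid),"  # link to records
    " subject TEXT NOT NULL)"
)

# The database assigns the key; we never type it in ourselves.
cur = conn.execute("INSERT INTO records (title) VALUES (?)", ("Sample record",))
new_id = cur.lastrowid

# A subject that points at an existing record is accepted.
conn.execute(
    "INSERT INTO subjects (recordid, subject) VALUES (?, ?)",
    (new_id, "Sample subject"),
)

# A subject that points at a record that does not exist is rejected.
try:
    conn.execute(
        "INSERT INTO subjects (recordid, subject) VALUES (?, ?)",
        (999, "Orphan subject"),
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(new_id, orphan_rejected)
```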
Review this Webpage to understand the purpose of each datatype. http://office.microsoft.com/enus/access-help/introduction-to-data-types-and-field-properties-HA010233292.aspx#BM2
Microsoft Access
I highly recommend using Microsoft Access for this exercise but if you are feeling bold or already
have experience you can also use the RDBMS of your choice. Microsoft Access is a powerful
client-based RDBMS that has all of the features mentioned in the previous section of this worksheet.
Where you can use Microsoft Access
Microsoft Access is a Windows application that is available under the UMD software license.
Windows users
If you are a Windows user you may already have Microsoft Access installed on your machine. If not,
head to https://terpware.umd.edu/Windows/List/235 to download and install a free copy of Microsoft
Access 2013.
Macintosh users
If you use Macintosh you need to identify an alternative way of using Microsoft Access. You have two
options:
1. Use a Microsoft Windows machine that has Office 2010 or 2013. This may be at your public
library, at your workplace or at the UMD iSchool computer lab.
2. Install a virtualization environment on your OS X machine, install Windows in that
virtualization environment and install Office 2013 on your virtual machine (probably 3-4
hours of work). This sounds complicated but all of the software is either open source or
freely available via Terpware. In short, you need:
a. Virtualbox https://www.virtualbox.org/wiki/Downloads (your virtualization platform)
b. Microsoft Windows 7 (or 8 if you are brave)
https://terpware.umd.edu/Windows/Title/1817 (get the 64-bit version unless your Mac is
running Snow Leopard or earlier). Make sure you download this on your Mac so you
get the proper installer.
c. Microsoft Office 2013 - make sure you get the version that matches your operating
system (e.g. 32-bit or 64-bit).
Download and install the applications in the order listed above: install VirtualBox on your OS X
machine, create a virtual machine, map the ISO file you downloaded as a virtual CD drive (hint: you
need to rename the file you downloaded from .img to .iso), install Windows on the virtual machine,
then install Microsoft Office in that Windows VM environment. These videos and instructional links
may help - http://www.youtube.com/watch?v=Rag4LDoBUC0,
http://www.eos.ncsu.edu/soc/support/wom/vbox_install.php,
https://www.virtualbox.org/manual/ch01.html
A Tour of Microsoft Access
Before proceeding, watch the course video on Microsoft Access, making sure that you know how to:
Open Microsoft Access and create a new database
Create a new table, query and report
Toggle between design view and datasheet view
Open and edit the entity relationship model for your database
Save changes to your tables, queries and reports
Create your first database
In this exercise we will create a database, catalog some items into our database, query our database
and create a simple report from our database. We will use Dublin Core as our source of data.
Step 1:
Outline your data and make logical groupings of data. What data should go into a single
table, and what data should go into other tables? Using the 15 Dublin Core fields, outline a
relational table structure.
a. Use this dataset to catalog with:
http://digital.lib.umd.edu/oaicat/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_1903.1_5980
b. Browse the data and look for common fields, important features and metadata quirks.
c. Map the core 15 fields from Dublin Core into a table structure using your source dataset
as inspiration: contributor, coverage, creator, date, description, format, identifier,
language, publisher, relation, rights, source, subject, title, type
Step 2:
Break composite fields into their building blocks and assign data types and vocabularies to
data as needed. For each mapped DC field, fill out the table below, indicating how the field
will be broken apart. The Subject row is filled in as an example.
DC Element    RDBMS field / table         Datatype / syntax            RDBMS Authority Control
------------  --------------------------  ---------------------------  ------------------------------
Subject       Created a subjects table    Datatype is short text for   Could use LCSH or some other
              and a pivot table for       subject, autoid for          subject authority
              resource/subjects           subjectID
Contributor
Coverage
Creator
Date
Description
Format
Identifier
Language
Publisher
Relation
Rights
Source
Title
Type
Question 6. How did the process of breaking apart composite fields, assignment of data types
and association with vocabularies respect or violate DC conventions?
Question 7. Based on your experience what are some of the benefits and drawbacks of the
DCXML data model and the DC Relational Database model?
                                      Benefits            Drawbacks
------------------------------------  ------------------  ------------------
DC modeled in DOMXML
DC modeled in a Relational Database
Step 3:
Group data logically and establish key relationships (e.g. primary keys for tables, inclusion
of keys in linked tables).
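As a rough illustration of what "establishing key relationships" can look like, the sketch below models records and subjects with a pivot table between them, along the lines of the Subject example in Step 2. It is only one possible design, not the answer key. It assumes SQLite via Python's sqlite3 module rather than Access, and the table names, column names, the add_record helper and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the links between tables
conn.executescript("""
CREATE TABLE records (
    recordid INTEGER PRIMARY KEY AUTOINCREMENT,
    title    TEXT NOT NULL
);
CREATE TABLE subjects (
    subjectid INTEGER PRIMARY KEY AUTOINCREMENT,
    subject   TEXT NOT NULL UNIQUE          -- each heading is stored once
);
-- Pivot table: one row per (record, subject) pairing.
CREATE TABLE record_subjects (
    recordid  INTEGER NOT NULL REFERENCES records(recordid),
    subjectid INTEGER NOT NULL REFERENCES subjects(subjectid),
    PRIMARY KEY (recordid, subjectid)       -- blocks duplicate pairings
);
""")


def add_record(conn, title, subject_names):
    """Insert a record, reuse or create its subjects, and link them."""
    recordid = conn.execute(
        "INSERT INTO records (title) VALUES (?)", (title,)
    ).lastrowid
    for name in subject_names:
        # Only adds the subject if it is not already in the table.
        conn.execute("INSERT OR IGNORE INTO subjects (subject) VALUES (?)", (name,))
        (subjectid,) = conn.execute(
            "SELECT subjectid FROM subjects WHERE subject = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO record_subjects (recordid, subjectid) VALUES (?, ?)",
            (recordid, subjectid),
        )
    return recordid


# Hypothetical sample data: two records share one subject.
add_record(conn, "Sample record 1", ["Sample subject A", "Sample subject B"])
add_record(conn, "Sample record 2", ["Sample subject B"])

subject_count = conn.execute("SELECT COUNT(*) FROM subjects").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM record_subjects").fetchone()[0]
print(subject_count, link_count)
```

Two records use three subject links between them, yet only two subject rows are stored. That is the redundancy reduction a pivot table buys, at the cost of an extra table to maintain.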
Step 4:
Let's begin by creating our tables in MS Access. Follow the table definitions you created
and implement your tables and joins in Microsoft Access
a. In this step you should be creating the tables, fields, datatypes, and relationships as
demonstrated in our orientation video. If this process seems completely unclear I
recommend re-watching the video and checking out our example database that
contains a take on the database mapping. Because MS Access uses icons and
toolbars (often without text captions) it can be difficult to describe each step along the
way. Use the videos that go along with this worksheet for instruction when needed.
Question 8. How many tables did you create to accurately represent your data?
Question 9. What are some of the advantages and drawbacks of your approach (e.g. I decided to
put everything into one table so my data is very inflexible)?
Step 5:
Review design and re-group data to reduce redundancy. Ask questions about data design
and database flexibility.
Question 10. Did you make any modifications to your core design?
Step 6:
Create records in your database and evaluate data integrity. Can you efficiently create,
update and delete records?
Question 11. At the end how many tables did your DC database have?
Question 12. What were some of the key design issues you ran into and decisions you
made?
Question 13. How did you decide "enough was enough" on design? What limits did you
place on your database design based on these decisions?
Creating queries
At this point you should check your database against the example database provided in the course
materials. How similar were the two databases? Did you make similar choices about splitting out
subjects and formats? Did you go into more detail or less? Did you choose different data types?
In the following section we learn how to create queries that build on our core MS Access Database.
Let's start by grabbing the source database that has all of the records from our sample file cataloged
into it.
When you open the file, be sure to click "Enable Content" to be able to work with the database.
Step 7:
Let's begin by exploring the database. Explore each table and briefly describe the data in
each table.
Table                           Description
------------------------------  ------------------------------
Collection
DCFormats
DCFormats_Record_Pivot_Table
DCRecords
DCSubject_Record_Pivot_Table
DCSubjects
SourceData_DCRecords
As you can see in this database we chose to split the data into a core "DCRecords" table with a
parent "Collection" table to represent the header information in the OAI-PMH file. We then broke out
subjects and formats since they were the two multi-field examples in our DCRecords file. This
required that we create two pivot tables, one for formats and one for subjects.
Take a few minutes to explore the database further, looking for how data types, primary keys and
data values are used. This is a good time to check out video 2 for this week - the introduction to our
database.
Question 14. Can you design a table relationship that will connect the records with their
associated subjects?
Question 15. Can you design a table relationship that will connect the records with their
associated formats?
Let's create these two queries (hint - check the answer key to see what we are doing!)
Step 8:
Follow the course video to create these queries
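If it helps to see the shape of such a query in SQL, here is a hedged sketch of the records-to-subjects join. The three table names come from the example database, but the column names (RecordID, Title, SubjectID, Subject) and the sample rows are assumptions, and the sketch runs against SQLite through Python's sqlite3 module rather than Access. Access's own SQL view generally writes this with INNER JOIN and wraps nested joins in parentheses, so expect the syntax to differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names and rows below are hypothetical stand-ins.
CREATE TABLE DCRecords (RecordID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE DCSubjects (SubjectID INTEGER PRIMARY KEY, Subject TEXT);
CREATE TABLE DCSubject_Record_Pivot_Table (
    RecordID  INTEGER REFERENCES DCRecords(RecordID),
    SubjectID INTEGER REFERENCES DCSubjects(SubjectID)
);
INSERT INTO DCRecords VALUES (1, 'Sample record 1'), (2, 'Sample record 2');
INSERT INTO DCSubjects VALUES (1, 'Sample subject A'), (2, 'Sample subject B');
INSERT INTO DCSubject_Record_Pivot_Table VALUES (1, 1), (1, 2), (2, 2);
""")

# Walk from records, through the pivot table, to subjects.
RECORDS_WITH_SUBJECTS = """
SELECT r.Title, s.Subject
FROM DCRecords AS r
JOIN DCSubject_Record_Pivot_Table AS p ON p.RecordID = r.RecordID
JOIN DCSubjects AS s ON s.SubjectID = p.SubjectID
ORDER BY r.Title, s.Subject
"""

pairs = conn.execute(RECORDS_WITH_SUBJECTS).fetchall()
for title, subject in pairs:
    print(title, "|", subject)
```

The formats query would follow the same pattern, swapping in DCFormats and DCFormats_Record_Pivot_Table.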
Creating reports
Reports give us new ways of viewing data in databases. As we observed before, just about every
RDBMS has reporting functionality. In our database we are going to create two simple reports that
show the power behind databases. These two reports re-shuffle our metadata to create subject and
format indexes for our metadata.
Describing the report creation process is pretty complex so I recommend watching (and re-watching)
the videos. Use the example database as your guide and follow along with the video. This should
give you enough tools to get started.
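For readers who want to see the logic of a subject index outside of the Access report designer, the sketch below does the same re-shuffling by hand: it queries the data sorted by subject, then groups and lays out the titles under each heading. This is an illustration of the idea, not how Access builds reports. It assumes SQLite via Python's sqlite3 module, and the column names, sample rows and the subject_index helper are hypothetical.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names and rows below are hypothetical stand-ins.
CREATE TABLE DCRecords (RecordID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE DCSubjects (SubjectID INTEGER PRIMARY KEY, Subject TEXT);
CREATE TABLE DCSubject_Record_Pivot_Table (RecordID INTEGER, SubjectID INTEGER);
INSERT INTO DCRecords VALUES (1, 'Sample record 1'), (2, 'Sample record 2');
INSERT INTO DCSubjects VALUES (1, 'Sample subject A'), (2, 'Sample subject B');
INSERT INTO DCSubject_Record_Pivot_Table VALUES (1, 1), (1, 2), (2, 2);
""")

# The query half of a report: combine the tables and sort by subject.
rows = conn.execute("""
SELECT s.Subject, r.Title
FROM DCSubjects AS s
JOIN DCSubject_Record_Pivot_Table AS p ON p.SubjectID = s.SubjectID
JOIN DCRecords AS r ON r.RecordID = p.RecordID
ORDER BY s.Subject, r.Title
""").fetchall()


def subject_index(rows):
    """Format (subject, title) rows, already sorted by subject, as an index."""
    lines = []
    for subject, group in groupby(rows, key=lambda row: row[0]):
        lines.append(subject)                              # group heading
        lines.extend("    " + title for _, title in group)  # indented entries
    return "\n".join(lines)


# The layout half of a report: grouping and formatting.
report = subject_index(rows)
print(report)
```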
Summary
In this worksheet we explored the process of creating and working with RDBMSs. We became
intimately familiar with Microsoft Access and learned how to create tables, queries and reports. We
became familiar with datatypes in tables, primary key relationships, entity relationship modeling and
query syntax.