Class 7 - Relational Databases

Exercise overview

Last week we explored the XML and JSON serialization standards and became familiar with the steps behind validation and checking documents for well-formedness. We discovered that XML and JSON serve related but different purposes and gained some familiarity with the trade-offs of JSON and XML serialization (e.g. JSON is more easily used in JavaScript but XML can preserve more detailed metadata). Both JSON and XML follow the DOM data model in that they rely primarily on hierarchical relationships and document-based serialization formats.

This week we will look at a new type of data model, the Relational Database (RDB). We will learn the conceptual framework supporting RDBs and gain some familiarity with a Relational Database Management System (RDBMS) called Microsoft Access. This worksheet is broken into three main sections: Getting access to Microsoft Access, Course Readings, and Working with Microsoft Access. All of our videos and examples are based on Microsoft Access 2013.

Instructions: Work individually or in groups to complete the worksheet. When you get to a section that requires you to select a resource to explore, pick one resource (please don’t always choose the first one!). When asked to ‘discuss as a group’, consider your response and continue completing the worksheet.

We’re going to work with computer code today, so here is an important note as you follow the exercises. Computer code is shown on numbered lines and is enclosed in boxes. The line numbers are there only as a reference during instruction and should not be copied into your program. For example, a line that reads

56. p { visibility:hidden; }

should simply be typed in as

p { visibility:hidden; }

Metadata Standards and Web Services Erik Mitchell Page 1

Suggested readings

1. Mitchell, E. (2015). Chapter 6 in Metadata Standards and Web Services in Libraries, Archives, and Museums. Libraries Unlimited. Santa Barbara, CA.
2. Vines, R. (2011). Databases from scratch I: Introduction. http://geekgirls.com/2011/09/databases-from-scratch-i-introduction/
3. Vines, R. (2011). Databases from scratch II: Simple database design. http://geekgirls.com/2011/09/databases-from-scratch-ii-simple-database-design/
4. Vines, R. (2011). Databases from scratch III: The design process. http://geekgirls.com/2011/09/databases-from-scratch-iii-relational-design-process/

Note: Rose Vines also has tutorials IV-VII, but they go into more detail than we need.

Optional readings

W3Schools. (2013). Introduction to SQL. http://www.w3schools.com/sql/sql

Reading discussion

Databases from Scratch sections 1-3 give us a brief introduction to why relational databases are important. Read these sections and make sure you can answer the following questions.

Question 1. What are some examples of Relational Database applications?

Question 2. What types of tools do RDBMS programs typically include?

RDBMS applications commonly include tables, input forms, queries, and reporting functionality.

Tables store your data and rely on relationships, data validation and data input rules to ensure the integrity of the data.

Data input forms provide the user with an easy-to-use interface that simplifies interaction with the database. Frequently these input forms are web pages, but many RDBMS applications, including Microsoft Access, have built-in forms that make it easy to input and work with data.

Queries are a way of searching through one or more tables using defined criteria (e.g. show me all books published after 1990 that were written by Allen Downey). Queries are written in Structured Query Language (SQL), a generally accepted standard in the relational database world. It is worth noting, however, that the exact syntax and semantics of SQL vary between RDBMS platforms, which makes it difficult to migrate databases from one RDBMS to another.

Finally, RDBMSs often support the creation of reports, which combine the search and data combination features of queries with data layout, grouping and formatting features. Reports are commonly designed with paper in mind but also support other output formats, including PDF, XML, HTML and CSV.

Question 3. From the readings, what are some of the key advantages of using a well-defined database design?

Question 4. From the readings, what role does a UniqueID or Key serve?

Question 5. From the readings, what are the core elements of the database design process?

As our readings point out, database design is an iterative process and is often imperfect. The old saying "the perfect is the enemy of the good" is very relevant here, as it is easy to overdesign a database for uses that were never intended. In our exercise we will build on our understanding of Dublin Core as it is represented in XML files and create a relational database structure that implements the DC elements in an RDBMS data model.

The building blocks of tables

Before we move on to our tour of Microsoft Access, it is important that we understand the rules behind an RDBMS table. Tables are the building blocks of data in our RDBMS and follow rules much as XML follows encoding rules. In an RDBMS, a table consists of data represented in rows (1 row = 1 record), described by fields represented in columns (a record with 20 fields occupies 20 columns). Each field has a name (e.g. title, creator, date) and a data type. Data types describe the data contained in the field and make it possible for the RDBMS to perform some advanced operations. For example, if you tell an RDBMS that a given field is a number, you enable mathematical operations on that field. In contrast, if you define a field as a date, you get access to formatting and sorting functions specific to dates.
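The ideas above (named fields, data types, and a query with defined criteria) can also be tried outside of Access. What follows is only a minimal sketch using Python's built-in sqlite3 module, not Microsoft Access; the table name books, its field names, and the sample rows are all invented for illustration, and SQLite's type names (TEXT, INTEGER) differ from the data types Access offers.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each column (field) has a name and a data type.
# Table and field names are hypothetical, for illustration only.
conn.execute(
    """
    CREATE TABLE books (
        book_id  INTEGER PRIMARY KEY,  -- unique ID for each row (record)
        title    TEXT NOT NULL,
        creator  TEXT NOT NULL,
        pub_year INTEGER               -- numeric, so comparisons and math work
    )
    """
)

# Sample rows, invented for this sketch (not real bibliographic data).
rows = [
    ("Example Book A", "Allen Downey", 2012),
    ("Example Book B", "Allen Downey", 2011),
    ("Example Book C", "Another Author", 1985),
]
conn.executemany(
    "INSERT INTO books (title, creator, pub_year) VALUES (?, ?, ?)", rows
)

# A query with defined criteria: published after 1990, written by Allen Downey.
matches = conn.execute(
    "SELECT title, pub_year FROM books "
    "WHERE pub_year > ? AND creator = ? ORDER BY pub_year",
    (1990, "Allen Downey"),
).fetchall()

# Because pub_year is numeric, the database can also compute on it.
earliest = conn.execute("SELECT MIN(pub_year) FROM books").fetchone()[0]

print(matches)   # the two Allen Downey rows, oldest first
print(earliest)  # the smallest publication year in the table
```

The same SELECT statement would need small syntax adjustments in Access, which is one concrete instance of the SQL variation between platforms noted earlier.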
The figure below shows a generic table in Microsoft Access viewed in "Design View."

Figure 1: This image shows the field names and data types for each field.

Figure 2: This image shows the data types available in Microsoft Access.

In relational databases it is a convention and general best practice to establish an autonumbering UniqueID/Primary Key for each table. Primary Keys are used to link tables together (e.g. recordid in the records table links to recordid in the subjects table, where it is stored as a foreign key). Autonumbered UniqueIDs guarantee that every row can be identified unambiguously. Referential integrity is not automatic, however: it also requires that the relationships between tables be defined and enforced (in Access, by selecting "Enforce Referential Integrity" when you create a relationship).

Review this web page to understand the purpose of each data type: http://office.microsoft.com/en-us/access-help/introduction-to-data-types-and-field-properties-HA010233292.aspx#BM2

Microsoft Access

I highly recommend using Microsoft Access for this exercise, but if you are feeling bold or already have experience you can use the RDBMS of your choice. Microsoft Access is a powerful client-based RDBMS that has all of the features mentioned in the previous section of this worksheet.

Where you can use Microsoft Access

Microsoft Access is a Windows application that is available under the UMD software license.

Windows users

If you are a Windows user you may already have Microsoft Access installed on your machine. If not, head to https://terpware.umd.edu/Windows/List/235 to download and install a free copy of Microsoft Access 2013.

Macintosh users

If you use a Macintosh you need to identify an alternative way of using Microsoft Access. You have two options:

1. Use a Microsoft Windows machine that has Office 2010 or 2013. This may be at your public library, at your workplace or at the UMD iSchool computer lab.
2. Install a virtualization environment on your OS X machine, install Windows in that environment, and install Office 2013 on the resulting virtual machine (probably 3-4 hours of work). This sounds complicated, but all of the software is either open source or freely available via Terpware. In short, you need:
   a. VirtualBox https://www.virtualbox.org/wiki/Downloads (your virtualization platform)
   b. Microsoft Windows 7 (or 8 if you are brave) https://terpware.umd.edu/Windows/Title/1817 (get the 64 bit version unless your Mac is running Snow Leopard or earlier). Make sure you download this on your Mac so you get the proper installer.
   c. Microsoft Office 2013. Make sure you get the version that matches your operating system (e.g. 32 or 64 bit).

Download and install the applications in the order listed above: install VirtualBox on your OS X machine, create a virtual machine, map the ISO file you downloaded as a virtual CD drive (hint: you need to rename the file you downloaded from .img to .iso), install Windows on the virtual machine, then install Microsoft Office in that Windows VM environment. These videos and instructional links may help:

http://www.youtube.com/watch?v=Rag4LDoBUC0
http://www.eos.ncsu.edu/soc/support/wom/vbox_install.php
https://www.virtualbox.org/manual/ch01.html

A Tour of Microsoft Access

Before proceeding, watch the course video on Microsoft Access, making sure that you know how to:

- Open Microsoft Access and create a new database
- Create a new table, query and report
- Toggle between design view and datasheet view
- Open and edit the entity relationship model for your database
- Save changes to your tables, queries and reports

Create your first database

In this exercise we will create a database, catalog some items into our database, query our database and create a simple report from our database. We will use Dublin Core as our source of data.
Step 1: Outline your data and make logical groupings of data. What data should go into a single table, and what data should go into other tables? Using the 15 core Dublin Core fields, outline a relational table structure.

a. Use this dataset as your source for cataloging: http://digital.lib.umd.edu/oaicat/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_1903.1_5980
b. Browse the data and look for common fields, important features and metadata quirks.
c. Map the core 15 fields from Dublin Core into a table structure using your source dataset as inspiration: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, type.

Step 2: Break composite fields into their building blocks and assign data types and vocabularies to data as needed. For each mapped DC field, fill out the table below, indicating how the field will be broken apart. The Subject row is completed as an example.

DC Element | RDBMS field / table | Datatype / syntax | RDBMS Authority Control
Subject | Created a subjects table and a pivot table for resource/subjects | Datatype is short text for subject, autoid for subjectID | Could use LCSH or some other subject authority
Contributor | | |
Coverage | | |
Creator | | |
Date | | |
Description | | |
Format | | |
Identifier | | |
Language | | |
Publisher | | |
Relation | | |
Rights | | |
Source | | |
Title | | |
Type | | |

Question 6. How did the process of breaking apart composite fields, assigning data types and associating fields with vocabularies respect or violate DC conventions?

Question 7. Based on your experience, what are some of the benefits and drawbacks of the DCXML data model and the DC Relational Database model?

Model | Benefits | Drawbacks
DC modeled in DOMXML | |
DC modeled in a Relational Database | |

Step 3: Group data logically and establish key relationships (e.g. primary keys for tables, inclusion of keys in linked tables).

Step 4: Let's begin by creating our tables in MS Access. Follow the table definitions you created and implement your tables and joins in Microsoft Access.

a. In this step you should be creating the tables, fields, data types, and relationships as demonstrated in our orientation video. If this process seems completely unclear, I recommend re-watching the video and checking out our example database, which contains one take on the database mapping. Because MS Access uses icons and toolbars (often without text captions) it can be difficult to describe each step along the way. Use the videos that go along with this worksheet for instruction when needed.

Question 8. How many tables did you create to accurately represent your data?

Question 9. What are some of the advantages and drawbacks of your approach (e.g. I decided to put everything into one table, so my data is very inflexible)?

Step 5: Review your design and re-group data to reduce redundancy. Ask questions about data design and database flexibility.

Question 10. Did you make any modifications to your core design?

Step 6: Create records in your database and evaluate data integrity. Can you efficiently create, update and delete records?

Question 11. At the end, how many tables did your DC database have?

Question 12. What were some of the key design issues you ran into and decisions you made?

Question 13. How did you decide "enough was enough" on design? What limits did you place on your database design based on these decisions?

Creating queries

At this point you should check your database against the example database provided in the course materials. How similar were the two databases? Did you make similar choices about splitting out subjects and formats? Did you go into more detail or less? Did you choose different data types? In the following section we learn how to create queries that build on our core MS Access database.
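Before turning to the Access query designer, it may help to see the Subject mapping from Step 2 (a subjects table plus a pivot table linking records to subjects) written out as SQL. This is a minimal sketch using Python's built-in sqlite3 module rather than Access. The table names (records, subjects, record_subjects), field names and sample values are hypothetical and are not taken from the course's example database, so treat it as one possible design rather than the expected answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: two entity tables and one pivot (junction) table.
conn.executescript(
    """
    CREATE TABLE records (
        record_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE subjects (
        subject_id INTEGER PRIMARY KEY,
        subject    TEXT NOT NULL UNIQUE
    );
    -- One row per record/subject pairing; both columns are foreign keys.
    CREATE TABLE record_subjects (
        record_id  INTEGER NOT NULL REFERENCES records(record_id),
        subject_id INTEGER NOT NULL REFERENCES subjects(subject_id),
        PRIMARY KEY (record_id, subject_id)
    );
    """
)

# Sample data, invented for this sketch.
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "Sample record one"), (2, "Sample record two")])
conn.executemany("INSERT INTO subjects VALUES (?, ?)",
                 [(1, "Maps"), (2, "Photographs")])
conn.executemany("INSERT INTO record_subjects VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

# Join through the pivot table to list each record with its subjects.
pairs = conn.execute(
    """
    SELECT r.title, s.subject
    FROM records AS r
    JOIN record_subjects AS rs ON rs.record_id = r.record_id
    JOIN subjects AS s ON s.subject_id = rs.subject_id
    ORDER BY r.record_id, s.subject
    """
).fetchall()

# With enforcement on, a link to a subject that does not exist is rejected.
try:
    conn.execute("INSERT INTO record_subjects VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(pairs)
print(rejected)
```

Storing each subject once and linking it through the pivot table lets one record carry several subjects without repeating the subject text in every record, which is the kind of redundancy Step 5 asks you to look for.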
Let's start by grabbing the source database that has all of the records from our sample file cataloged into it. When you open the file, be sure to click "Enable Content" so that you can work with the database.

Step 7: Let's begin by exploring the database. Explore each table and briefly describe the data in each table.

Table | Description
Collection |
DCFormats |
DCFormats_Record_Pivot_Table |
DCRecords |
DCSubject_Record_Pivot_Table |
DCSubjects |
SourceData_DCRecords |

As you can see, in this database we chose to split the data into a core "DCRecords" table with a parent "Collection" table to represent the header information in the OAI-PMH file. We then broke out subjects and formats, since they were the two repeating (multi-valued) fields in our DCRecords file. This required that we create two pivot tables (also called junction or linking tables), one for formats and one for subjects. Take a few minutes to explore the database further, looking at how data types, primary keys and data values are used. This is a good time to check out video 2 for this week, the introduction to our database.

Question 14. Can you design a table relationship that will connect the records with their associated subjects?

Question 15. Can you design a table relationship that will connect the records with their associated formats?

Let's create these two queries (hint: check the answer key to see what we are doing!).

Step 8: Follow the course video to create these queries.

Creating reports

Reports give us new ways of viewing data in databases. As we observed before, just about every RDBMS has reporting functionality. In our database we are going to create two simple reports that show the power behind databases. These two reports re-shuffle our metadata to create subject and format indexes for our records. Describing the report creation process in text is pretty complex, so I recommend watching (and re-watching) the videos. Use the example database as your guide and follow along with the video. This should give you enough tools to get started.

Summary

In this worksheet we explored the process of creating and working with relational database management systems. We became intimately familiar with Microsoft Access and learned how to create tables, queries and reports. We became familiar with data types in tables, primary key relationships, entity relationship modeling and query syntax.