DATABASE SYSTEMS Lecture 21 Dr. Adam P. Anthony Overview Database Fundamentals System components Schemas Database Models The Relational Model Survey of Database Products Data Mining Social Issues What is a Database? Non-Technical: A computer program that stores data Examples of Databases using this definition? Technical: A computer program that stores structured data in an organized fashion, and provides the ability to grant limited, or restricted access to different users based on their data requirements. Examples of Databases using this definition? Anything ruled out that we came up with above? Database Systems Initially: each department maintained their own “database” Sometimes just a text or Excel file Problems with duplication? Problems with sharing information? Now: One database holds everything, and each department is given access only to the data they need Schemas Schema full description of all the database, including what information is provided and how it interacts Used to maintain and program the database Provides a nice overview of what data a business collects and (perhaps) how they use that data Subschema Partial description of only a few relevant items Anything in a Subschema will also appear in the database’s Schema Based on providing only what the user needs to do their job Provides basic security: payroll record example The Database Management System Software applications can be enhanced by working with a database Especially on the web! (Google, Amazon, etc.) Database Management System (DBMS): has all the necessary code for organizing, manipulating data When application needs data, it just asks the DBMS to provide it More on Database Management Systems Database Management System (DBMS): A software layer that manipulates a database in response to requests from applications Schemas and subschemas are created, managed by the DBMS Distributed Database: A database stored on multiple machines Can have a different subschema for each application that accesses the DBMS Data is shared between each machine over a network DBMS will mask this organizational detail from its users Data independence: The ability to change the organization of a database without changing the application software that uses it Database Models Bits and bytes are hard to talk about! Want to reflect the organization of our data without thinking about the complex programming techniques needed to keep our data organized A Database Model provides a conceptual view of the database that gives us a ‘general idea’ of how the data is organized Flat File Model (not really a DB) Hierarchical Model Object-Oriented Model Relational Model The Relational Database Model Pretty Simple, on the surface A Relation is just a rectangular table of data An Attribute is any column in that table A Tuple is any row in the table There is complexity not in the table, but in how two different tables can interact If two relations have a common Attribute, then we can combine tuples from the two tables to get a more complex piece of data! A relation containing employee information 9-10 Designing a Good Relational Database When we set up the DBMS, we could just put all our data in one relation, like a big Excel file! But that’s no fun! And, it can cause problems Redundancy: When the same data is repeated unnecessarily Deletion Anomalies: When deleting a tuple results in the loss of important information We can prevent some of these problems using a multirelation design! A relation containing redundancy 9-12 An employee database consisting of three relations 9-13 The Employee Database Schema Employee Empl ID Name Address SSN Job Job ID Job Title Skill Code Dept Assignment Empl ID Job ID Start Date Term Date Combining Data From Multiple Tables In which departments has G. Jerry Smith worked? Designing a Relational Database Given a data problem: What are the entities/objects in this problem? How do those entities interact? Make one relational table for each entity Make one relational table for each interaction Identifying/Avoiding Redundancy Sometimes it’s obvious in the design Two columns with same/similar names Sometimes you need to focus on how the data will be used Common perpetrators: Addresses—more than one person can live at same address Organizations—more than one person belongs to a single organization Owners—a person/company can own more than one asset Creators—a creator may have many creations In-Class Example (from book) Chapter Review, question 18 (similar to assigned problem 20) Design a relational database containing information about music performers, their recordings, and the composers of the music they recorded (Avoid redundancies similar to those in Figure 9.4) SQL (Structured Query Language) Developed at IBM as a part of their Relational DBMS project Basic language humans can use to express, with precision, the data that they want from each (separate) relational table, and how it should be combined: select EmplId, Dept from ASSIGNMENT, JOB where ASSIGNMENT.JobId = JOB.JobId and ASSIGNMENT.TermData = “*” Survey of Database Products MySQL Microsoft Access Microsoft SQLServer Oracle DB MySQL Free, Open-Source database product Little support from creators, minimal software for managing the system Must know a great deal about database design fundamentals in order to use successfully on a large scale Can be done, but requires expert knowledge Good Choice for: temporary data solutions Zero-budget scenarios (but only if you have expert users to run it!) Internet applications Hiring people to run the system costs money! Facebook?? Poor Choice for: Non-Expert users MASSIVE data problems (but Facebook??) Microsoft Access Cheap, business-oriented database product Emphasis on user-friendliness, ease of development Limits on amount of data that can be stored in a single database Data may be as complex as on any other DBMS, just limited in size Good choice for: Packaged with ‘data templates’ that might automatically fit your data needs, no design required! very nice application-development features that are popular in the business world Small business settings Non-Expert Users Temporary data solutions Poor choice for: Large, enterprise-level data solutions Sharing data with a large number of people Microsoft SQLServer and Oracle DB Large, expensive enterprise-level DBMS’s Distributed DBMS support Provides both expert-level and non-technical tools for interaction with DBMS SQLServer Good For: Everything Large more-so than Oracle (if you can afford them!) jobs Bad for: Low-budget scenarios (except for SQLServer Express!) Data Mining Data Mining = Buzz word that covers any action that seeks to detect a previously unknown pattern from a large data set Class Description: Encompasses many areas of Computer Science, Mathematics, Statistics Divide data set into pre-determined groups (e.g., Mexican food lovers and Italian food lovers) Characterize each group by identifying common patterns “What type of person enjoys Mexican food?” Class Discrimination: Start with an initial data set, generate a class description for each group Given a new (incomplete) item, predict which group it will fall into: “Will this new customer prefer Mexican or Italian food?” Data Mining, Continued Cluster Analysis: Similar to Class Description but now the groups aren’t predetermined! Must find the most distinct groups AND describe them at the same time! Much more difficult Association Analysis Find patters between groups! Big in market research “Are Mexican food lovers also lovers of imported beers?” Social Impact of Databases Problems Massive amounts of personal data are being collected Often without knowledge or meaningful consent of affected people Data merging produces new, more invasive information Errors are widely disseminated and hard to correct Remedies Existing legal remedies often difficult to apply Negative publicity may be more effective