Data Warehousing TSU, Math 586 Instructors Sean Perry, Micha

- SQL (pronounced “sequel”) stands for Structured Query Language.
- Individual vendors added syntax extensions, and many of these made their way into ANSI (American National Standards Institute) SQL, the standard that relational databases use today.
- SQL is used by most commercial database applications. There is no viable alternative language.
- The language evolves, but its basic functionality remains unchanged.

- A database is an organized collection of data.
- A database management system (DBMS) is software that allows the creation, retrieval, and manipulation of data.
- A relational database management system (RDBMS) provides this functionality within the context of relational database theory and the rules defined by Codd.

- Data is all around; you make use of it every day, for example when registering for a class or interacting with an ATM.
- Data independence: a user does not need to know on which hard drive and in which file a particular piece of information is stored.
- This provides data consistency and data integrity.

- A relational database stores data in tables. A table is essentially a two-dimensional matrix consisting of columns and rows.
- A table generally contains data about a single subject.
- Each table has a unique name that signifies the contents of the data.
- A database consists of many tables.

- Columns in a table organize the data further.
- A table consists of at least one column.
- Each column represents a single, low-level detail about a set of data.
- The name of each column is unique within a table and identifies the data you find in that column.

- Each row usually represents one unique set of data within a table.
- All of the columns of a row hold the respective data for that row.
- Each intersection of a column and a row in a table represents a value.

BOOK_ID  TITLE                PUBLISHER
1010     The Invisible Force  Literacy Circle
1011     Into The Sky         Prentice Hall

(Figure callouts: "Column" marks PUBLISH_DATE; "Row" marks the value 10/2008.)

- A missing value is said to be NULL.
- Blanks are not nulls.
- Nulls cannot be compared or evaluated because they are unknowns.
- Null is not greater than or less than null.
- null ≠ null

- Rows need to be uniquely identified within a table.
- Looking up a primary key value finds one and only one row in the table.
- A system-generated sequence number used as a key is called a synthetic or surrogate key.
- It is best to avoid any primary key that is subject to change.
- A table has only one primary key.
- A primary key made up of more than one column is called a composite or concatenated primary key.

- Rather than storing rows with repetitive data, the table can be normalized to remove duplication.
- A foreign key is a column (or set of columns) in one table that links to the primary key column(s) of another table, which provides more detailed information for the original table.

- Within the SQL language there are individual sublanguages.
- Data Manipulation Language (DML) commands allow you to query, insert, update, and delete data.
- Data Definition Language (DDL) allows you to create new database structures such as tables and to modify existing ones.
- Data Control Language (DCL) allows you to control access to the data.

- Go to page 8 in the book.
- Answer and discuss answers in class.

- Normalization eliminates redundancy in tables, avoiding future data manipulation problems.
- There are several different rules for minimizing duplication of data, called normal forms.
- There are many different normalization rules, but the five normal forms and the Boyce-Codd normal form (BCNF) are the most widely accepted.

First Normal Form (1NF)
- All repeating groups must be removed and placed in a new table.
- This design provides more flexibility with the data and less overhead.
- It minimizes the need to create multiple rows for one unique identifier.

Second Normal Form (2NF)
- It uses the First Normal Form rules, and in addition:
- All the nonkey columns must depend on the entire primary key.
- It applies only to composite primary keys.

Third Normal Form (3NF)
- It uses the Second Normal Form rules, and in addition:
- Every nonkey column must be a fact about the primary key.
- If an attribute depends on a nonkey column, then that attribute needs to be moved to a different table.
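The third normal form rule can be illustrated with a small sketch. It is not taken from the course material: it uses Python's built-in sqlite3 module so that it runs anywhere, and the table names book_flat, book, and publisher, the phone numbers, and the third book title are hypothetical. It starts from a table in which the publisher's name and phone number depend on publisher_id (a nonkey column) and moves those columns to their own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Not in 3NF: publisher_name and phone_no are facts about publisher_id
# (a nonkey column), not about the primary key book_id.
cur.execute("""
    CREATE TABLE book_flat (
        book_id        INTEGER PRIMARY KEY,
        title          TEXT NOT NULL,
        publisher_id   INTEGER NOT NULL,
        publisher_name TEXT NOT NULL,
        phone_no       TEXT
    )
""")
cur.executemany(
    "INSERT INTO book_flat VALUES (?, ?, ?, ?, ?)",
    [
        # Phone numbers and the third row are made up for illustration.
        (1010, "The Invisible Force", 1, "Literacy Circle", "555-0100"),
        (1011, "Into The Sky", 2, "Prentice Hall", "555-0101"),
        (1012, "Hypothetical Third Title", 2, "Prentice Hall", "555-0101"),
    ],
)

# 3NF: the publisher facts move to their own table, keyed by publisher_id.
cur.execute("""
    CREATE TABLE publisher (
        publisher_id   INTEGER PRIMARY KEY,
        publisher_name TEXT NOT NULL,
        phone_no       TEXT
    )
""")
cur.execute("""
    CREATE TABLE book (
        book_id      INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        publisher_id INTEGER NOT NULL REFERENCES publisher (publisher_id)
    )
""")
cur.execute("""
    INSERT INTO publisher
    SELECT DISTINCT publisher_id, publisher_name, phone_no FROM book_flat
""")
cur.execute("INSERT INTO book SELECT book_id, title, publisher_id FROM book_flat")
conn.commit()

# Each publisher is now stored once; a join recovers the original view.
publisher_rows = cur.execute("SELECT COUNT(*) FROM publisher").fetchone()[0]
joined = cur.execute("""
    SELECT b.book_id, b.title, p.publisher_name
    FROM book b JOIN publisher p ON p.publisher_id = b.publisher_id
    ORDER BY b.book_id
""").fetchall()
```

After the split, a change to a publisher's phone number touches one row in publisher instead of every book row for that publisher, which is the kind of data manipulation problem normalization is meant to avoid.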
2NF & 3NF
- Figure: the Book-Author table shows a violation of the second normal form.
- Figure: this table shows a violation of the third normal form.
- Figure: the Book and Publisher tables in third normal form.

- BCNF is an elaborate version of third normal form and deals with deletion anomalies.
- Fourth Normal Form tackles potential problems when three or more columns are part of the unique identifier, and their dependencies on each other.
- Fifth Normal Form splits the tables even further apart to eliminate all redundancy.

When two tables have common column(s), they are said to have a relationship. The cardinality of a relationship is the actual number of occurrences between them.
- (1:M) One-to-many relationship: the most common relationship. The shared column(s) of one table link to many rows in the other.
- (1:1) One-to-one relationship: the shared column(s) are unique in each table.
- (M:M) Many-to-many relationship: the shared column(s) of each table have multiple rows for each unique value.
- Following the rules of relational databases, an associative table or intersection table resolves the M:M problem.

- Database schema diagrams are used to graphically depict the relationships between tables.
- A “crow’s foot” on one end represents the 1:M relationship.

BOOK                 PUBLISHER
BOOK_ID              PUBLISHER_ID
TITLE                PUBLISHER_NAME
PUBLISHER_ID (FK)    PHONE_NO

- The cardinality expresses the ratio of a parent and child table from the perspective of the parent table. It describes how many rows you may find between the two tables for a given primary key value.
- The optionality of a relationship is whether or not a row is required (mandatory or optional). It shows whether one row in a table can exist without a row in the related table.
- In an identifying relationship, the primary key is propagated to the child entity as part of the child's primary key.
- In a nonidentifying relationship, the foreign key becomes one of the nonkey columns. Nulls are accepted in the foreign key column.
- Round edges on a diagram mean the relationship is identifying. Sharp edges mean it is nonidentifying.

- Requirements Analysis: Gather data requirements that identify the needs and wants of the users. One of the outputs of this phase is a list of individual data elements that need to be stored.
- Conceptual Data Model: Groups the major data elements from the requirements analysis into individual entities, with each individual data element referred to as an attribute. Unique identifiers, or candidate keys, that uniquely distinguish each row are determined. Noncritical attributes are not included in the model.
- Logical Data Model: Shows that all the entities, their respective attributes, and the relationships between entities represent the business requirements, without considering the technical issues. Descriptive names and documentation are required. The complete model is called the logical data model, or entity relationship diagram (ERD). In the end, entities are fully normalized, the unique identifier for each entity is determined, and any many-to-many relationships are resolved into associative entities.
- Physical Data Model: Also referred to as the database schema diagram, this is a graphical model of the physical design implementation of the database. There may be many physical models to choose from. This model uses different terminology: tables instead of entities, columns instead of attributes, etc.

- Entities are resolved to physical tables.
- Attributes become columns with specific data types and formats.
- Data integrity and consistency constraints are created, and physical storage parameters for individual tables are determined.
- Indexes are designed. They are database objects that facilitate speedy access to data with the help of a specific column or columns of a table. Indexes enhance query performance, but they also create overhead when performing deletes, inserts, and updates.

- Denormalization is the act of adding redundancy to the physical database design.
- Data designers or architects sometimes purposely add redundancy to their design to increase query performance. In some applications where massive amounts of detailed data are stored and summarized, denormalization is required.
- Data warehouse applications are database applications that benefit users who need to analyze large data sets from various angles and use this data for reporting and decision-making purposes.
- The primary purpose of a data warehouse is to query, report, and analyze data. Therefore, redundancy is encouraged and necessary for queries to perform efficiently.

- Go to page 27 in the book.
- Answer and discuss answers in class.

- Throughout the book and the course, the database for a school’s computer education program is used as a case study on which most exercises are based.
- It is important to familiarize yourself with the STUDENT case study diagram.
- The STUDENT table contains data about each individual student, such as his or her name, address, employer, and the date the student registered in the program.
- Data types are found next to each column name in the diagram. Each column can contain a different type of data. Some of the possible data types are as follows:

Data Type     Description
VARCHAR2(n)   Variable-length alphanumeric characters, with a maximum of n characters.
CHAR(n)       Fixed-length alphanumeric characters, n characters long. Any unused space is padded with blanks until n is reached.
NUMBER(x,y)   A numeric value with a maximum of x digits in total, y of which are decimal places.
DATE          Stores both date and time.

- The COURSE table lists all the available courses that a student may take. The table consists of a PREREQUISITE, a COURSE_NO, a DESCRIPTION, and a COST column. COURSE_NO is the primary key.
- The COURSE table represents a recursive or self-referencing relationship.
- A recursive relationship means a column within a table references another column within the same table.
- In the COURSE table, the PREREQUISITE column shows which COURSE_NO must be completed before the current COURSE_NO can be taken.
- Nulls are allowed here because the relationship is optional.

- The SECTION table includes all the individual sections a course may have. An individual course may have zero, one, or many sections, each of which can be taught in different rooms, at different times, and by different instructors. SECTION_ID is the primary key of the table. COURSE_NO is a foreign key. START_DATE_TIME shows the date and time the section meets, LOCATION lists the classroom, and CAPACITY shows the maximum number of students that may enroll.
- A natural key is a set of one or more columns that naturally and uniquely identify a row. A computer-generated sequence number, called a surrogate key, can be used as a substitute for the natural key.
- The INSTRUCTOR_ID column is another foreign key column in the SECTION table, linking to the INSTRUCTOR table.
- A SECTION must always have an INSTRUCTOR assigned; INSTRUCTOR_ID can never be null. An INSTRUCTOR, however, does not have to teach a section.
- A COURSE may have zero, one, or multiple sections.

- The INSTRUCTOR table lists information related to an individual instructor, such as name, address, phone, and zip code. The primary key is INSTRUCTOR_ID. The ZIP column is the foreign key column to the ZIPCODE table.
- Referential integrity does not allow deletion of a primary key value in a parent table if that value exists as a foreign key in a child table. Such a deletion would create orphan rows in the child table.

- The ENROLLMENT table is an intersection table between the STUDENT and SECTION tables, listing the students enrolled in the various sections. It has a composite primary key of STUDENT_ID and SECTION_ID. ENROLL_DATE contains the date the student registered, and FINAL_GRADE lists the student’s final grade. A STUDENT may be enrolled in zero, one, or many sections.
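The ENROLLMENT intersection table, its composite primary key, and the referential integrity rule described above can be tried out in a small sketch. It rests on several assumptions: it uses Python's built-in sqlite3 module rather than the course database, the STUDENT and SECTION tables are cut down to a couple of columns, the helper fails_integrity and all row values are hypothetical, and dates are stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE section (
        section_id INTEGER PRIMARY KEY,
        course_no  INTEGER NOT NULL
    );
    -- Intersection table: both parent keys are part of the composite key.
    CREATE TABLE enrollment (
        student_id  INTEGER NOT NULL REFERENCES student (student_id),
        section_id  INTEGER NOT NULL REFERENCES section (section_id),
        enroll_date TEXT NOT NULL,
        final_grade INTEGER,
        PRIMARY KEY (student_id, section_id)
    );
    INSERT INTO student VALUES (101, 'Hypothetical');
    INSERT INTO section VALUES (1, 25), (2, 25);
    INSERT INTO enrollment VALUES (101, 1, '2008-10-01', NULL);
    INSERT INTO enrollment VALUES (101, 2, '2008-10-01', NULL);
""")


def fails_integrity(sql, params=()):
    """Return True if the database rejects the statement with an integrity error."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


# The same student cannot be enrolled twice in the same section.
duplicate_rejected = fails_integrity(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?)", (101, 1, "2008-10-02", None)
)
# Deleting the parent row would leave orphan enrollment rows.
orphan_rejected = fails_integrity(
    "DELETE FROM student WHERE student_id = ?", (101,)
)
# An enrollment cannot point at a section that does not exist.
unknown_section_rejected = fails_integrity(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?)", (101, 99, "2008-10-02", None)
)
sections_for_student = conn.execute(
    "SELECT COUNT(*) FROM enrollment WHERE student_id = ?", (101,)
).fetchone()[0]
```

All three rejected statements leave the data unchanged, so the student is still enrolled in exactly two sections. Because STUDENT_ID and SECTION_ID are both foreign keys and part of the primary key, the two relationships in this sketch are identifying ones.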
- The GRADE_TYPE table is a lookup table for other tables as they relate to grade information. GRADE_TYPE_CODE is the primary key and lists the unique category of grade. DESCRIPTION describes the abbreviated code.
- The GRADE table lists all the grades related to the section in which a student is enrolled. GRADE_CODE_OCCURRENCE is a sequence number listing the order of the grade, while NUMERIC_GRADE lists the actual grade value. GRADE_CONVERSION is the numeric grade converted to a letter. The primary key columns are STUDENT_ID, SECTION_ID, GRADE_TYPE_CODE, and GRADE_CODE_OCCURRENCE.
- The GRADE_TYPE_WEIGHT table aids in the computation of the final grade a student receives for an individual section. Different sections can have final grades computed differently. The final grade is determined by using the individual grades of the student and section in the GRADE table in conjunction with this table. The primary key consists of the SECTION_ID and GRADE_TYPE_CODE columns. A GRADE_TYPE_CODE can exist zero, one, or many times.

- Go to page 41 in the book.
- Answer and discuss answers in class.
- A quiz will be given at the beginning of our next class.
