Introduction to Database
Lecture 2

© From notes by James D. Reye @ QUT

Topics of the Lecture
• The components of a relational database:
  2.1 An example database: the Beijing Olympic Games
  2.2 Tables
  2.3 Tables are Sets
  2.4 Relationships & Relational Maps
  2.5 Domains
  2.6 Missing data and “null” values
  2.7 Integrity Constraints

2.1 An example database: the Beijing Olympic Games

Tables in the Beijing Olympic Games Database
• Created to illustrate the ideas of relational databases
  – Only singles swimming events, and final results
  – Only 4 tables: Countries, Competitors, Events & Results
• Only partial views of the tables are shown here

• Countries: one row for each country.

  Countries
  CountryCode  CountryName
  AFG          Afghanistan
  AHO          Netherlands Antilles
  ALB          Albania
  ALG          Algeria
  AND          Andorra
  ANG          Angola
  ANT          Antigua and Barbuda
  ARG          Argentina
  ARM          Armenia
  ARU          Aruba
  ASA          American Samoa
  AUS          Australia
  AUT          Austria
  AZE          Azerbaijan
  BAH          Bahamas

• Competitors: one row for each competitor (“athlete”).

  Competitors
  CompetitorNum  GivenName  FamilyName   Gender  DateOfBirth  CountryCode
  210375         Leisel     Jones        female  1985-08-30   AUS
  210413         Andrew     Lauterstein  male    1987-05-22   AUS
  210444         Linda      Mackenzie    female  1983-12-14   AUS
  210491         Alice      Mills        female  1986-05-23   AUS
  210500         Kenrick    Monk         male    1988-01-01   AUS
  210512         Patrick    Murphy       male    1984-02-22   AUS
  210565         Adam       Pine         male    1976-02-28   AUS
  210588         Shayne     Reese        female  1982-09-15   AUS
  210592         Stephanie  Rice         female  1988-06-17   AUS
  210596         Brenton    Rickard      male    1983-10-19   AUS
  210638         Jessicah   Schipper     female  1986-11-19   AUS
  210639         Melanie    Schlanger    female  1986-08-31   AUS
  210674         Christian  Sprenger     male    1985-12-19   AUS
  210675         Nicholas   Sprenger     male    1985-05-14   AUS
  210680         Craig      Stevens      male    1980-07-23   AUS
  210690         Eamon      Sullivan     male    1985-08-30   AUS

• Events: one row for each swimming final event.

  Events
  EventId  EventGender  Distance  Style              DateHeld   StartTime
  SWM054   men          400       individual medley  10-Aug-08  10:03
  SWM014   men          400       freestyle          10-Aug-08  10:24
  SWW054   women        400       individual medley  10-Aug-08  10:42
  SWW411   women        400       freestyle relay    10-Aug-08  11:21
  SWW021   women        100       butterfly          11-Aug-08  10:24
  SWM031   men          100       breaststroke       11-Aug-08  10:30
  SWW014   women        400       freestyle          11-Aug-08  11:17
  SWM411   men          400       freestyle relay    11-Aug-08  11:26
  SWM012   men          200       freestyle          12-Aug-08  10:16
  SWW041   women        100       backstroke         12-Aug-08  10:23
  SWM041   men          100       backstroke         12-Aug-08  10:31
  SWW031   women        100       breaststroke       12-Aug-08  10:48
  SWW012   women        200       freestyle          13-Aug-08  10:14
  SWM022   men          200       butterfly          13-Aug-08  10:21
  SWW052   women        200       individual medley  13-Aug-08  11:12

• Results: one row for each competitor in each swimming final event.
  – So, if there were 8 competitors in each of 10 events, then there would be 80 rows in this table.

  Results (ElapsedTime is in seconds)
  EventId  CompetitorNum  Place  Lane  ElapsedTime  Note
  SWM054   221565         1      4     243.84       World Record
  SWM054   232061         2      5     246.16       European Record
  SWM054   221213         3      6     248.09
  SWM054   207531         4      7     252.16
  SWM054   207546         5      3     252.47
  SWM054   232071         6      2     252.84
  SWM054   211784         7      1     253.38
  SWM054   201943         8      8     255.40
  SWW010   202658         1      3     24.06        Olympic Record
  SWW010   222009         2      4     24.07        Americas Record
  SWW010   210153         3      5     24.17
  SWW010   244639         4      2     24.25
  SWW010   217815         5      6     24.26
  SWW010   247253         6      1     24.63
  SWW010   217807         7      7     24.65
  SWW010   217283         8      8     24.77

2.2 Tables (“relations”)
• All data is stored in tables (aka “relations”).
• Although the terms “table” and “relation” are often used to mean the same thing in a relational database, strictly speaking a relation is a (two-dimensional) table that obeys the following six rules:
  1. no two rows are exactly the same, i.e. no duplicate rows;
  2. the order of the rows has no effect on the meaning of the data;
  3. the order of the columns has no effect on the meaning of the data;
  4. each cell in the table contains only a single value, i.e. you can’t put a list of values in a single cell (such as a list of a person’s children);
  5. each column has a distinct name, i.e. a name which is not the same as another column in the same table;
  6. all the values in each column are sensible, given the column’s name, e.g. you can’t put phone numbers in a date-of-birth column.
• The Competitors table in section 2.1 is an example of a table that obeys these rules.

2.3 Tables are Sets
• Recall, a set is a collection of things that (typically) have something in common.
• E.g. a shopping list is a collection of items that we want to buy.
• Sets follow the normal mathematical rules that apply to every set:
  – no duplicates, i.e. a set cannot contain the same thing more than once. E.g. the list of numbers “1, 2, 2, 3” is not a set, but “1, 2, 3” is.
  – the order of the things has no effect on the meaning of the set. E.g. the set {John, Sue, Kila, Ben} is the same as the set {Ben, John, Sue, Kila}.
• When talking about databases, we say that a table is a set of rows. E.g., we can say that the Events table is a set of events.
• Because a table is a set of rows, the first two rules for tables (section 2.2) are essentially the same as the two rules given just above.

2.4 Relationships and Relational Maps
• A database is a collection of tables. But it’s not just an arbitrary collection of tables. There must be some kinds of relationships between the tables, i.e. they must have something in common.
• E.g., the Countries and Competitors tables each contain a column named CountryCode.
• This shared name is a good indication that the two columns both contain the same kind of data, and this is confirmed by looking at the actual values in each column, e.g. “AUS”, “CAN” and “NZL”.
• Knowing this allows us to match up information in one table with another. E.g. the CountryCode “AUS” in a Competitors row (section 2.1) matches the Countries row for Australia.
• For any database, having a diagram (relational map) showing these relationships makes it easier to understand the data in each table, as well as the overall structure of the database.
• Note that each relationship is drawn in a special way, with a straight line ( | ) at the “one” end and a crow’s foot (/|\) at the “many” end. This is called a one-to-many relationship, a very important concept in working with databases.
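Matching rows through the shared CountryCode column can be sketched in a few lines of Python. This is only an illustration, assuming a handful of rows copied from the partial tables in section 2.1; the variable names (`countries`, `competitors`, `matched`, `by_country`) are made up for the example and are not part of the lecture database.

```python
from collections import defaultdict

# A few rows copied from the partial tables in section 2.1.
countries = {            # CountryCode -> CountryName
    "AUS": "Australia",
    "AUT": "Austria",
}
competitors = [          # (CompetitorNum, GivenName, FamilyName, CountryCode)
    (210375, "Leisel", "Jones", "AUS"),
    (210413, "Andrew", "Lauterstein", "AUS"),
    (210592, "Stephanie", "Rice", "AUS"),
]

# Match each competitor to a country through the shared CountryCode column.
matched = [
    (given, family, countries[code])
    for (_num, given, family, code) in competitors
]

# Group the "many" side (competitors) under the "one" side (countries).
by_country = defaultdict(list)
for num, _given, _family, code in competitors:
    by_country[code].append(num)

for row in matched:
    print(row)
print(dict(by_country))
```

Each competitor matches exactly one country, while one country (here AUS) collects many competitors, which is the one-to-many shape drawn on the relational map.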
• If you go looking for them, you’ll find that one-to-many relationships are very common in the real world, e.g.
  – a university has many campuses;
  – a house has many inhabitants;
  – a CD has many tracks, and so on.
• Because most databases contain information about things in the real world, it is not surprising that one-to-many relationships occur between the rows in database tables. E.g. one row in the Events table (section 2.1) matches many rows in the Results table.

2.5 Domains
• As stated previously (section 2.2), all the values in each column must be sensible, given the column’s name.
  – The range of allowable values for a given column is called a domain.
• Domains may allow only a small number of values, e.g. a Gender domain may only allow the values “male” and “female”.
• Or, a domain may allow a large number of values, e.g. a StudentNumber domain may allow any number in the range 1 to 99,999,999.
• The concepts of columns and domains are closely related, but they are not the same. There are two major differences.
  – Firstly, a domain is a set of allowable values, whereas a column is a set of values that are actually being used (at any point in time).
  – Secondly, the column’s values are actually stored as part of each table in the database, whereas the domain’s values are specified by rules, rather than by storing each possible value.

2.6 Missing data and “null” values: a short introduction
• Sometimes, in the real world, when we come to insert a new row in a table, the values of one or more columns are “missing”, i.e. we don’t have values for them.
• Depending on the circumstances, we may want to:
  (a) use a default value, e.g. if an applicant for a new credit card doesn't ask for a specific credit limit, then they get a default limit;
  (b) record that a value is not applicable, e.g. some people don't have middle names, home phone numbers, etc.;
  (c) record that the value is unknown to the person entering the data, e.g. a person may forget to put their date of birth on a job application form.
• Apart from forgetfulness, there can be many reasons why information is unknown. For example:
  (i) it may relate to a future event, e.g. a competitor’s time in a race before the race has taken place;
  (ii) there may be insufficient resources (such as time or equipment) to obtain the information immediately, e.g. when a patient is admitted to an emergency ward, the cause of their illness may initially be unknown.
• Using a default value, case (a), is not a problem for databases, as we get a value to put in the database, even if it is just the default value.
• But when the value is inapplicable or unknown, cases (b) and (c), we don’t really have anything to put in the database.
• To handle this, relational databases allow the special value, null, to be used to indicate that information is missing.
• Null is not an ordinary value (like 0, 1, 2 or 3), but rather is used to show the lack of a specific value. For this reason, a null value cannot be displayed on the screen like an ordinary value. As a substitute, programs often display things such as “null”, “?” or “-”.
• Be aware that null values can be used in all types of columns, i.e. columns that contain numbers, names, dates, times, etc. (when allowed by the person who creates the database).
• Using null values has the advantages that:
  – it is a clear, uniform way of recording that data is missing; and
  – the DBMS is aware that the data is missing and so can process it appropriately.
• But, although null values are a significant improvement over reserved values (ordinary values set aside to mean “missing”), they introduce new complexities (as we will see later).
• For this reason, it is best to use null values sparingly, only allowing them in columns where they are definitely needed, rather than allowing them in every column.

2.7 Integrity Constraints
• Primary keys
• Foreign keys
• Referential Integrity

2.7.1 Primary Keys
• As mentioned previously (section 2.2), two rows in a table cannot be exactly the same, i.e. no duplicate rows.
• To ensure this, each table must have a primary key. A primary key consists of one or more columns such that no two rows have the same combination of values in those columns, e.g.

  Table Name   Table’s Primary Key
  Countries    CountryCode
  Competitors  CompetitorNum
  Events       EventId
  Results      EventId, CompetitorNum

• To avoid problems, there are actually three properties that a primary key must have:
  1. Uniqueness: the primary key has a different value for each row.
  2. Irreducibility: if a column is removed from the primary key, then the remaining columns do not have the uniqueness property.
  3. Entity Integrity: columns in a primary key cannot contain null values.
• Two other important things about primary keys:
  1. Being a primary key in one table does not mean that the same column is a primary key in another table.
     • E.g. CountryCode is a primary key in the Countries table, but is not a primary key in the Competitors table.
  2. The uniqueness property must hold for all current rows and any possible future rows.
– A primary key is called:
  • simple, if it consists of a single column;
  • composite, if it consists of two or more columns.
  Both types occur frequently in real-world databases.
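The uniqueness, irreducibility and entity integrity properties listed above can be checked mechanically on a sample of rows. The following is a minimal sketch, not a DBMS feature: the helper names (`is_unique`, `is_irreducible`, `has_no_nulls`) are invented for illustration, Python's `None` stands in for null, and the last sample row is hypothetical (it assumes competitor 221565 also swam in event SWM012, which the partial Results table does not show).

```python
def is_unique(rows, key):
    """Uniqueness: no two rows share the same values in the key columns."""
    seen = set()
    for row in rows:
        value = tuple(row[col] for col in key)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_irreducible(rows, key):
    """Irreducibility: dropping any one column from the key loses uniqueness."""
    if len(key) == 1:
        return True  # a simple key has no column to spare
    return all(
        not is_unique(rows, [col for col in key if col != dropped])
        for dropped in key
    )


def has_no_nulls(rows, key):
    """Entity integrity: no key column holds None (standing in for null)."""
    return all(row[col] is not None for row in rows for col in key)


results = [
    {"EventId": "SWM054", "CompetitorNum": 221565},
    {"EventId": "SWM054", "CompetitorNum": 232061},
    {"EventId": "SWW010", "CompetitorNum": 202658},
    {"EventId": "SWM012", "CompetitorNum": 221565},  # hypothetical extra row
]

composite_key = ["EventId", "CompetitorNum"]
print(is_unique(results, ["EventId"]))         # EventId alone repeats
print(is_unique(results, ["CompetitorNum"]))   # CompetitorNum alone repeats
print(is_unique(results, composite_key))
print(is_irreducible(results, composite_key))
print(has_no_nulls(results, composite_key))
```

A check like this only looks at the rows currently present. Without the hypothetical fourth row, CompetitorNum alone would appear unique in the sample, which is why the choice of a primary key has to consider any possible future rows as well.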
– A table only ever has one primary key. If there is more than one possibility, the choice is typically made by:
  • preferring narrower columns to wider ones, and
  • preferring simple primary keys to composite primary keys.
– The other possible primary keys are called alternate keys, and are used much like primary keys, e.g. to check that duplicates don’t occur. (However, they are not quite the same as primary keys.)

• Although seeing actual rows helps us to understand the tables, it is useful to have a more concise notation for listing the names of the tables and their columns.
• In relational notation, the four tables are:

  Countries (CountryCode, CountryName)
  Competitors (CompetitorNum, GivenName, FamilyName, Gender, DateOfBirth, CountryCode)
  Events (EventId, EventGender, Distance, Style, DateHeld, StartTime)
  Results (EventId, CompetitorNum, Place, Lane, ElapsedTime, Note)

2.7.2 Foreign Keys
• Look at the column names on the relational map for the Olympics database: all these column names are primary keys of the tables at the “one” end of the one-to-many relationships.
• There is a special term for the columns at the “many” end of these relationships. These are called foreign keys.
• A foreign key is a column (or group of columns) whose values match the values held in a primary key, e.g.

  Table Name   Table’s Foreign Keys  Referenced Primary Key
  Countries    none                  none
  Competitors  CountryCode           CountryCode in the Countries table
  Events       none                  none
  Results      (i) EventId;          (i) EventId in the Events table;
               (ii) CompetitorNum    (ii) CompetitorNum in the Competitors table

• Foreign keys are another vital part of relational databases, for two major reasons:
  1. “They are the glue that holds the database together.” (Chris Date)
  2. They prevent junk from getting into the database, by allowing “referential integrity” constraints (see next section), i.e. ensuring that the data in one table is consistent with the data in another table.
• Three other important things about foreign keys:
  1. They can be composite, just like primary keys. (There are no examples of this in the Olympics database, but there are some in the lab sessions.)
  2. While foreign keys usually have the same column names as the referenced primary key, this is not a hard and fast rule. The designer of the database can choose different names for the foreign key columns (at the risk of confusing the users of the database).
  3. While the foreign key and referenced primary key are usually in different tables, this is not always the case. For example, consider the table:

     Employees
     StaffNumber  StaffName         StaffNumberOfManager
     1            Susan Boss        –
     2            John Supervisor   1
     3            Joe Labourer      2
     4            Wendy Worker      2
     5            Mary Leader       1
     6            Peter Programmer  5
     7            Harry Hacker      5

     Here StaffNumberOfManager is a foreign key that references the primary key StaffNumber in the same table. Susan Boss has no manager, so her StaffNumberOfManager is null (displayed as “–”).

2.7.3 Referential Integrity
• Hand-in-hand with the idea of having foreign keys is the concept of referential integrity.
• A database has referential integrity if every foreign key value matches a value in the referenced primary key (or is null, where null values are allowed).
• As previously mentioned, referential integrity is important for preventing junk data entering the database,
  – e.g. preventing a Competitors row having “XYZ” as the CountryCode, because there is no such CountryCode in the Countries table.
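The “XYZ” example, together with the domain, null and primary key rules from sections 2.5 to 2.7.1, can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions, not the lecture's own schema: the column types, the choice of which columns are NOT NULL, the CHECK rule for the Gender domain, the helper `try_insert`, and the test rows numbered 999001 to 999003 are all made up for illustration. One SQLite-specific detail: foreign keys are only enforced after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Assumed column types and constraints; only the names come from the lecture.
conn.executescript("""
CREATE TABLE Countries (
    CountryCode TEXT NOT NULL PRIMARY KEY,
    CountryName TEXT NOT NULL
);
CREATE TABLE Competitors (
    CompetitorNum INTEGER NOT NULL PRIMARY KEY,
    GivenName     TEXT NOT NULL,
    FamilyName    TEXT NOT NULL,
    Gender        TEXT NOT NULL CHECK (Gender IN ('male', 'female')),  -- domain rule
    DateOfBirth   TEXT,                                                -- null allowed
    CountryCode   TEXT NOT NULL REFERENCES Countries (CountryCode)     -- foreign key
);
""")


def try_insert(sql):
    """Run one INSERT; return True if accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


conn.execute("INSERT INTO Countries VALUES ('AUS', 'Australia')")

# A real row from section 2.1: accepted.
ok_row = try_insert(
    "INSERT INTO Competitors VALUES "
    "(210375, 'Leisel', 'Jones', 'female', '1985-08-30', 'AUS')")
# Hypothetical rows below, each breaking (or testing) one rule.
bad_country = try_insert(   # no 'XYZ' in Countries: referential integrity
    "INSERT INTO Competitors VALUES "
    "(999001, 'Test', 'Swimmer', 'female', '1990-01-01', 'XYZ')")
unknown_dob = try_insert(   # unknown date of birth stored as null
    "INSERT INTO Competitors VALUES "
    "(999002, 'Test', 'Swimmer', 'male', NULL, 'AUS')")
bad_gender = try_insert(    # value outside the Gender domain
    "INSERT INTO Competitors VALUES "
    "(999003, 'Test', 'Swimmer', 'unknown', '1990-01-01', 'AUS')")
duplicate_key = try_insert(  # CompetitorNum 210375 already used: uniqueness
    "INSERT INTO Competitors VALUES "
    "(210375, 'Test', 'Swimmer', 'female', '1990-01-01', 'AUS')")

print(ok_row, bad_country, unknown_dob, bad_gender, duplicate_key)
```

Declaring the rules in the table definitions means the DBMS rejects bad rows itself, rather than relying on every program that inserts data to check them.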