Introduction to Information and Computer Science: Databases and SQL
Audio Transcript

Slide 1
Welcome to Introduction to Information and Computer Science: Databases and SQL. This is Lecture (a). The component, Introduction to Information and Computer Science, is a basic overview of computer architecture; data organization, representation and structure; structure of programming languages; networking and data communication. It also includes basic terminology of computing.

Slide 2
The objectives for Databases and SQL are to: define and describe the purpose of databases; define a relational database; describe data modeling and normalization; describe the structured query language (SQL); define the basic data operations for relational databases and how to implement them in SQL; design a simple relational database and create corresponding SQL commands; and examine the structure of a healthcare database component.

Slide 3
It is important to review some computer science basics in order to understand the details of information storage. Remember that for a computer, data consists of ones and zeros. In other words, every data value is represented by a combination of binary ones and zeros, or simply values of on and off, for example, the number 01000001 [zero-one-zero-zero-zero-zero-zero-one] on this slide. When a computer system examines this value it does not know what it represents. The meaning depends on the application’s knowledge of the underlying data. If the application is a text editor, it knows that this value represents a capital ‘A’ as defined by the American Standard Code for Information Interchange, better known as ASCII [as key]. On the other hand, in the context of data in a spreadsheet, the cell formatting may indicate that the stored binary number actually represents the number 65. That binary data may also be used in many other ways, including as a central processing unit instruction or as part of an audio or video file.
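The idea that one bit pattern can carry several meanings can be sketched in a few lines of Python. This is only an illustration of the point made on Slide 3; the variable names are hypothetical and are not part of the lecture.

```python
# The bit pattern from the slide, written as a binary literal.
bits = 0b01000001

# Interpreted as an unsigned integer, the pattern is the number 65.
as_number = bits

# Interpreted as an ASCII code, the same pattern is the capital letter 'A'.
as_text = bytes([bits]).decode("ascii")

print(format(bits, "08b"))  # 01000001
print(as_number)            # 65
print(as_text)              # A
```

The stored value never changes; only the interpretation chosen by the program does.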
Health IT Workforce Curriculum, Introduction to Information and Computer Science, Version 3.0 / Spring 2012, Databases and SQL, Lecture a. This material (Comp4_Unit6a) was developed by Oregon Health and Science University, funded by the Department of Health and Human Services, Office of the National Coordinator for Health Information Technology under Award Number IU24OC000015.

Slide 4
One very large component of computer systems is the management of data. Consider the information maintained on a personal computer: this might include programs, photos, music, videos, tax returns and class papers, just to name a few. Some files may remain unchanged; others might be modified over time, such as revisions to a class paper. Now consider an electronic healthcare record (EHR) system that may contain information for thousands or tens of thousands of patients. With this volume of information, it is important that the information be stored efficiently, be quickly accessible, and be easy to update.

Slide 5
Data can be stored electronically in different ways. The first way is to store it in a simple text file. Another is to store it in a spreadsheet, which is more powerful than a simple text file. Finally, data can be stored in databases, which is the topic of this unit. Before discussing databases, this lecture will provide information about the other options for data storage and when they are appropriate to use.

Slide 6
A file is a collection of information, or data, stored in a single electronic location. How that information is stored in files is also important. Files can contain text or data that is not readable by humans. If data is to be accessed by a person, then it needs to be human-readable; however, a computer system may use a different format; it just needs to know how to interpret the data. For example, an audio file and a text file contain information that is stored in different formats.
A text editor cannot edit an audio file, and a music player cannot play a text file. An audio file is not readable by humans, but its data can be interpreted by a music player and converted to the music that humans listen to.

Slide 7
Files are stored on file systems, which every computer has. Because of that, every computer has the ability to create, use and store files. Files can be easily shared; email and shared drives are some options for sharing. They are also used by many applications; for example, a PowerPoint presentation is stored in a file. Also, many scientific applications and instruments use input data files and/or generate output data files. For example, genomic data is often stored in large data files that are searched and parsed by special programs. On the other hand, files have limitations. The security of files is limited to that of the file system. Also, by default, multiple users cannot use the same file at the same time. Usually, one user can open the file for editing; any additional users open a read-only copy of the file. Finally, using files to store structured data with relationships can result in redundancy and inconsistency, as shown in the following example.

Slide 8
This slide shows a file that contains contact information for individuals and their organizations. It contains names, such as Bill Robeson, Walter Schmidt and Mary Small, and the corresponding organization names and addresses. There are only two different organizations in this small sample.
Slide 9
After reviewing the previous slide, answer the following questions: Do Bill and Albert work for the same company? What is the difference between Catherine’s and Walter’s information? If a computer application were looking at the data, would it be able to tell that there was an issue with the addresses for Catherine and Walter? Can you sort this list by last name? Could you sort a list of 10,000 contacts?

Slide 10
While it is easy to see that Bill and Albert work for the same company, note that in the file, Bill’s company name is Community Hospital Inc. and Albert’s is Community Hospital Incorporated. There is a similar issue with Catherine’s and Walter’s information. Catherine’s address is 14 12th Street, with “Street” spelled out. Walter has the same address, but in his case 14 12th St. uses the abbreviation for street. Humans can easily handle these variations in data and determine that they are the same. However, a computer system, even one with an artificial intelligence system, would have a significantly harder time determining that the companies and addresses are the same. And while sorting these few entries may be feasible just by looking at the list, sorting a file with 10,000 contacts would be extremely time-consuming without the use of technology.

Slide 11
A bigger challenge might be if “Community Hospital, Inc.” becomes “Community General”. Whether this change is made manually or with an automated system, every single instance of “Community Hospital” would have to be identified in the data. Additionally, every different representation of “Community Hospital”, for example, “Community Hospital, Inc.” and “Community Hospital Incorporated”, would also need to be located; and there is no guarantee that the word “Community” was spelled correctly in every instance.
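The difficulty of locating every representation of the same company or address can be sketched in Python. A program comparing the values exactly sees two different companies and two different addresses. The `normalize` helper and its `ABBREVIATIONS` table are hypothetical, added only to show how much hand-written cleanup even this tiny sample needs; they are not a technique from the lecture.

```python
# Values as they appear in the sample contact file.
bill_company = "Community Hospital Inc."
albert_company = "Community Hospital Incorporated"
catherine_address = "14 12th Street"
walter_address = "14 12th St."

# Exact comparison: the program treats each pair as two different values.
same_company = bill_company == albert_company
same_address = catherine_address == walter_address

# Hypothetical cleanup rules; every new spelling variant needs another entry.
ABBREVIATIONS = {"inc": "incorporated", "st": "street"}


def normalize(text):
    """Lowercase, drop commas and periods, and expand known abbreviations."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(same_company)  # False
print(same_address)  # False
print(normalize(bill_company) == normalize(albert_company))       # True
print(normalize(catherine_address) == normalize(walter_address))  # True
```

Rules like these only catch the variants someone thought of in advance; a misspelled “Community” would still go unmatched.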
Once all of the entries are identified, each one needs to be modified to correctly read “Community General”. If done manually, this still has the potential for human data-entry error and in a large system would be very time-consuming. If it is done using a simple automated search-and-replace function, it may not take as long, but it may result in unintended partial changes to other existing records, for example, Portland Community Hospital being changed to Portland Community General.

Slide 12
Spreadsheet applications were first developed for businesses to automate accounting tasks. Today, spreadsheets are widely used for storing, manipulating and presenting data. Modern spreadsheet applications perform calculations using predefined or user-created formulas. They provide features for easily sorting and filtering data and can even perform data analysis. Advanced spreadsheet users can create very complex calculations and relationships between data. Spreadsheets have become very powerful tools for data analysis and manipulation. However, they still have the same limitations as plain text files, as shown on the following slide.

Slide 13
Here is an example of an OpenOffice Calc spreadsheet. (Other spreadsheet applications include Microsoft Excel, IBM Lotus Symphony, Corel Quattro Pro, Apple Numbers and Google Docs spreadsheets.) The data is organized into numbered rows and lettered columns; column names can be provided in the first row.
The data itself does not look very different from the data in the simple text file; however, the menus above the data offer many options for manipulating and presenting it. Unlike with plain text files, we can sort the data very easily and quickly. Even so, spreadsheets have the same problem as the text file: some data (company name and address) is entered multiple times, which is inefficient and error-prone.

Slide 14
Since spreadsheets are just a special type of file, they have similar advantages and disadvantages. While spreadsheets require a special application such as Microsoft Excel, these applications are widely available. Spreadsheet applications provide powerful calculations and basic sorting and filtering. But like files, they have limited security, limited multiple-user access and a potential for redundant and inconsistent data. Spreadsheets are good for doing calculations on static snapshots of datasets, but they are not the best solution for long-term storage and access of data.

Slide 15
So what exactly is a database? A database is a structured data collection which is accessed electronically. In other words, it is information stored on a computer for access through a computer program. The text file used in this lecture that contained the contact information can be considered a very simple database: it contains organized, though not necessarily consistent, information that might be accessed through a text editor. A relational database is a database that maintains relationships between data elements. Relational databases are the focus of this unit.

Slide 16
The concept of a relational database was first published by E.F. Codd in “Communications of the ACM” in June 1970.
Codd held the view that users should not have to keep track of how the information is stored in a computer in order to use it. To quote Codd, “Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).” In other words, users should not have to know whether the binary bits discussed previously in this lecture represent a capital A or the number 65, or even what they are related to; rather, the system should keep track of that information once it is provided by the user. So a relational database is an organized collection of data accessible by electronic means where the information type and information relationships are maintained internally by the system itself.

Slide 17
A relational database consists of one or more tables defined by the database designer in a meaningful fashion. A table is a collection of information organized into rows and columns. Each table contains one or more rows of data. The data in a row is ordered by columns, and each column is of a known and specified type where the data and type are independent. The order of rows in the table is irrelevant, but the order of the columns in the row is significant.

Slide 18
A relational database has quite a number of advantages over files and spreadsheets. Database systems are designed to be highly secure; access to data can be precisely controlled. In addition, databases are designed to be accessed and modified by multiple users at the same time. Relationships between tables support organized data, which helps prevent data redundancy and inconsistency.
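As a rough sketch of how tables, typed columns and relationships reduce redundancy, the following Python example uses the standard-library `sqlite3` module with an in-memory database. The table and column names (`organization`, `contact`, `org_id`) are assumptions made for illustration; the lecture does not define a schema. The organization name is stored once, so the rename discussed earlier becomes a single change.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationship

# Each column has a declared type; the schema itself is hypothetical.
conn.execute("""
    CREATE TABLE organization (
        org_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE contact (
        contact_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        org_id     INTEGER NOT NULL REFERENCES organization(org_id)
    )""")

# The company name is entered exactly once.
conn.execute("INSERT INTO organization VALUES (1, 'Community Hospital, Inc.')")
conn.executemany(
    "INSERT INTO contact (first_name, org_id) VALUES (?, ?)",
    [("Bill", 1), ("Albert", 1)],
)

# Renaming the organization touches one row, not every contact.
conn.execute("UPDATE organization SET name = 'Community General' WHERE org_id = 1")

rows = conn.execute("""
    SELECT c.first_name, o.name
    FROM contact AS c
    JOIN organization AS o ON o.org_id = c.org_id
    ORDER BY c.first_name
""").fetchall()
print(rows)  # [('Albert', 'Community General'), ('Bill', 'Community General')]
```

Because both contacts point at the same organization row, they cannot end up with differently spelled company names.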
The highly optimized underlying data structures used by the relational database result in highly efficient and fast access. Because a database system is designed for the specific purpose of data organization, the basic operations of retrieving, adding, modifying and deleting data are more efficient than in general-purpose applications and storage such as spreadsheets and files. Furthermore, relationships and efficient access allow for complex queries and searches of data. On the other hand, databases are complex systems that require expertise to install, maintain and use. There are free, open-source databases, but commercially available databases can be very expensive. In comparison, files and spreadsheets are more widely available and easier to use. Also, data in databases is not as easily analyzed using complex data calculations. Instead, data is usually exported from databases into a spreadsheet or data file for statistical software.

Slide 19
This concludes Lecture (a) of Databases and SQL. There are several options for data storage, including files, spreadsheets and databases. Files and spreadsheets are widely available and are good for data computations. Databases are very secure and optimized systems for storing, accessing and modifying data over the long term. Multiple users can access and modify data at the same time. Furthermore, relationships are stored in a database along with the data, which allows for less data redundancy and inconsistency as well as for complex queries.

Slide 20
References slide. No audio.