Chapter 6: Databases and Information Management

6.1 Organizing Data in a Traditional File Environment
An effective information system provides users with accurate, timely and relevant information

File Organization Terms and Concepts
A computer system organizes data in a hierarchy that starts with bits and bytes and progresses to fields, records, files, and databases
o Bit: smallest unit of data a computer can handle
o Byte: group of bits that represents a single character, e.g. letter, number, symbol
o Field: grouping of characters into a word, a group of words, or a complete number
o Record: group of related fields, such as a student’s name, the course taken, date, grade, etc.
o File: group of records of the same type
o Database: group of related files, e.g. student database
o Entity: a person, place, thing or event about which we store and maintain information
o Attribute: each characteristic or quality describing a particular entity, e.g. Student_ID

Problems with the Traditional File Environment
Each functional area (e.g. accounting, finance, HR, etc.) developed its own systems & data files
Problems:
o Data redundancy and inconsistency
o Program-data dependence
o Inflexibility
o Poor data security
o Inability to share data among applications

Data Redundancy and Inconsistency
Data redundancy: presence of duplicate data in multiple data files, so that the same data are stored in more than one place or location
Data inconsistency: the same attribute may have different values
o E.g. the attribute ‘Date’ is not updated for the entity ‘COURSE’ across the different databases/systems
Additional confusion may result from using different coding systems to represent values for an attribute -> can’t integrate data from different sources

Program-Data Dependence
Program-data dependence: the coupling of data stored in files and the specific programs required to update and maintain those files, so that changes in programs require changes to the data
A change in the software program could change the data, which may then not be compatible with other programs

Lack of Flexibility
Traditional file systems can deliver scheduled reports but cannot respond to unanticipated information requests in a timely fashion
o It may be possible, but expensive

Poor Security
Little control or management of data leads to little control over access to and dissemination of data

Lack of Data Sharing and Availability
It is virtually impossible to access data in a timely manner because pieces of information in different files and different parts of the organization cannot be related to one another
o Different systems may show different values for the same piece of information

6.2 The Database Approach to Data Management
Another definition of database: a collection of data organized to serve many applications efficiently by centralizing the data and controlling redundant data
o Data appear to be stored in one location

Database Management Systems
Database management system (DBMS): software that permits an organization to centralize data, manage them efficiently, and provide access to the stored data by application programs.
DBMS retrieves information from the database and presents it to an application program
o With traditional data files it would be the other way around
DBMS separates the logical and physical views
o Logical view: data as they would be perceived by end users
o Physical view: how data are actually organized and structured on physical storage media
DBMS makes the physical database available for different logical views as required by users

How a DBMS Solves the Problems of the Traditional File Environment
Reduces data redundancy and inconsistency by minimizing isolated files in which the same data are repeated
Uncouples programs and data, enabling ad hoc data queries
Enables the organization to centrally manage data, their use and their security

Relational DBMSs
Relational DBMS: a type of logical database model that treats data as if they were stored in two-dimensional tables. It can relate data stored in one table to data in another as long as the two tables share a common data element.
A row of the table = record = tuple
Key field: field in a record that uniquely identifies instances of that record so that it can be retrieved, updated or sorted
Primary key: unique identifier for all the information in any row of the table
Foreign key: field in a database table that enables users to find related information in another database table
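A minimal sketch of how a primary key and a foreign key let a relational DBMS relate two tables through a shared data element. It uses Python's built-in sqlite3 module; the table names (student, enrolment), column names and sample rows are illustrative assumptions, not taken from the chapter.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: student_id is the primary key of student
# and appears again in enrolment as a foreign key.
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE enrolment (
    enrolment_id INTEGER PRIMARY KEY,
    student_id   INTEGER NOT NULL REFERENCES student(student_id),
    course       TEXT NOT NULL,
    grade        TEXT
);
INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO enrolment VALUES
    (10, 1, 'MIS101', 'A'),
    (11, 1, 'ACC200', 'B'),
    (12, 2, 'MIS101', 'C');
""")

# Relate the two tables through the shared student_id column.
rows = conn.execute(
    """
    SELECT s.name, e.course, e.grade
    FROM enrolment AS e
    JOIN student AS s ON s.student_id = e.student_id
    WHERE s.student_id = ?
    ORDER BY e.course
    """,
    (1,),
).fetchall()

print(rows)
```

The student's name is stored once, in the student table; the enrolment rows carry only the key, which is what keeps redundant copies of the name out of the database.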

Operations of a Relational DBMS
Relational database tables can be combined easily to deliver data required by users, provided that any two tables share a common data element
3 basic operations:
o Select
   Creates a subset consisting of all records in the file that meet the stated criteria
o Join
   Combines relational tables to provide the user with more information than is available in individual tables
o Project
   Creates a subset consisting of columns in a table, permitting the user to create new tables that contain only the information required

Object-Oriented DBMSs
Many newer applications require databases that can store and retrieve not only structured numbers and characters but also drawings, images, photographs, audio and full-motion video
o DBMSs that organize data into rows and columns are not well suited for this purpose
Object-oriented DBMS: stores the data and the procedures that act on those data as objects that can be automatically retrieved and shared
o Becoming popular because it can be used to manage multimedia components or Java applets in Web applications
o Slow in processing a large number of transactions
Object-relational DBMS: DBMS with capabilities of both object-oriented and relational DBMSs

Capabilities of Database Management Systems
1. Data definition: capability to specify the structure of the content of the database
   o Used to create database tables and to define the characteristics of the fields in each table
2. Data dictionary: automated or manual file that stores definitions of data elements and their characteristics
3. Data manipulation language

Querying and Reporting
Data manipulation language: DBMS’s specialized language that is used to add, change, delete and retrieve the data in the database
o Contains commands that permit end users and programming specialists to extract data from the database to satisfy information requests and develop applications
Structured Query Language (SQL): standard data manipulation language for relational database management systems
Microsoft Access and other DBMSs include capabilities for report generation, so that the data of interest can be displayed in a more structured and polished format than would be possible by just querying
o E.g. Crystal Reports (popular report generator)

Designing Databases

Normalization and Entity-Relationship Diagrams
Conceptual database design describes how data elements in the database are to be grouped
Design process identifies:
o Relationships among data elements
o Most efficient way of grouping data elements together to meet business information requirements
o Redundant data elements
o Groupings of data elements for specific application programs
Normalization: the process of creating small, stable yet flexible and adaptive data structures from complex groups of data
Repeating data groups: unnormalized data wherein there can be multiple records associated with multiple records from another table
Referential integrity: rules to ensure that relationships between linked database tables remain consistent
Entity-relationship diagram: a methodology for documenting databases, illustrating the relationships between the various entities in the database
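A minimal sketch of referential integrity as defined above, assuming two hypothetical normalized tables (supplier and part) in SQLite via Python's sqlite3 module. The table names, columns and sample values are illustrative assumptions. Note that SQLite enforces foreign keys only after the foreign_keys pragma is switched on; other DBMSs may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Normalized design (hypothetical): supplier details are stored once,
# and each part refers to its supplier through supplier_id.
conn.execute(
    "CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE part (
           part_id     INTEGER PRIMARY KEY,
           description TEXT NOT NULL,
           supplier_id INTEGER NOT NULL REFERENCES supplier(supplier_id)
       )"""
)
conn.execute("INSERT INTO supplier VALUES (1, 'Acme')")
conn.execute("INSERT INTO part VALUES (100, 'Door latch', 1)")


def try_insert_part(part_id, description, supplier_id):
    """Return True if the DBMS accepts the row, False if it breaks a rule."""
    try:
        conn.execute(
            "INSERT INTO part VALUES (?, ?, ?)",
            (part_id, description, supplier_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert_part(101, "Door seal", 1)   # supplier 1 exists
rejected = try_insert_part(102, "Mirror", 99)     # supplier 99 does not exist
part_count = conn.execute("SELECT COUNT(*) FROM part").fetchone()[0]

print(accepted, rejected, part_count)
```

The second insert is refused because it would create a part that points at a supplier with no matching row, which is the kind of inconsistency between linked tables that referential integrity rules exist to prevent.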
If the business does not get its data model right, the system will not be able to serve the business well -> the business ends up working with data that are inaccurate, incomplete or difficult to retrieve

Distributing Databases
Distributed database: database stored in more than one location
o Partitioned database
   Parts of the database are stored and maintained physically in one location and other parts are stored and maintained in other locations, so that each remote processor has the necessary data to serve its local area
o Duplicate database
   Duplicate the central database at all remote locations
Advantages:
o Reduce the vulnerability of a single, massive central site
o Increase service and responsiveness to local users, and often can run on smaller, less expensive computers
Disadvantages:
o Depart from central data standards and definitions
o Pose security problems by widely distributing access to sensitive data

6.3 Using Databases to Improve Business Performance and Decision Making

Data Warehouses
Getting concise, reliable information about current operations, trends and changes across the entire company can be a problem if the data sit in different parts of the organization

What is a Data Warehouse?
Data warehouse: database that stores current and historical data of potential interest to decision makers throughout the company
o Consolidates and standardizes information from different operational databases so that the information can be used for management analysis and decision making
o Concept: take data from internal & external data sources -> extract and transform -> data warehouse that serves as an information directory and allows data access & analyses

Data Marts
Data mart: subset of a data warehouse in which a summarized or highly focused portion of the organization’s data is placed in a separate database for a specific population of users
o Smaller, decentralized data warehouse
o E.g. sales & marketing data marts to deal with customer information

Business Intelligence, Multidimensional Data Analysis, and Data Mining
Business intelligence (BI): applications and technologies to help users make better business decisions
o Keep track of transactions
o Develop knowledge about customers, competitors and internal operations by finding patterns and insights
o Change decision-making behaviour to achieve higher profitability
o Database -> Data warehouse -> BI

Online Analytical Processing (OLAP)
OLAP: capability for manipulating and analyzing large volumes of data from multiple perspectives
o E.g. product, pricing, cost, region or time period
Enables users to obtain online answers to ad hoc questions fairly rapidly, even when the data are stored in very large databases, such as sales figures for multiple years
o Cube analogy

Data Mining
Data mining: analysis of large pools of data to find patterns and rules that can be used to guide decision making and predict future behaviour
o Associations
   Occurrences linked to a single event
o Sequences
   Events linked over time
o Classification
   Patterns that describe the group to which an item belongs by examining existing items that have been classified and by inferring a set of rules
o Clustering
   Similar to classification
   Find different groupings within data
o Forecasting
   Uses a series of existing values/patterns to forecast what other values will be
Perform high-level analyses of patterns or trends, but also provide more detail when needed
Predictive analysis: uses data-mining techniques, historical data and assumptions about future conditions to predict outcomes of events
o E.g. the probability a customer will respond to an offer or purchase a specific product

Text Mining and Web Mining
Unstructured data, most in the form of text files, are believed to account for more than 80% of an organization’s useful information (e.g. e-mails, transcripts, memos, etc.)
Text mining: discovery of patterns and relationships from large sets of unstructured data
o E.g. businesses might turn to text mining to analyze transcripts of calls to customer service centres to identify major service and repair issues
Web mining: discovery and analysis of useful patterns and information from the World Wide Web
o E.g. understand customer behaviour, evaluate a website’s effectiveness, quantify the success of marketing campaigns
o E.g. Google Trends, Google Insights
Web mining = searching for data patterns through content, structure & usage mining
o Web content mining
   Process of extracting knowledge from Web content
o Web structure mining
   Examines data related to the structure of a particular website
o Web usage mining
   Examines user interaction/behaviour data recorded by a Web server whenever requests for a website’s resources are received

Databases and the Web
Many companies now use the Web to make some of the information in their internal databases available to customers and business partners
o E.g. buying something online: after the user goes to the website, the Web browser software requests data from the organization’s database, communicated through HTML commands
Database server: a computer in a client/server environment that is responsible for running a DBMS to process SQL statements and perform data management tasks
Web browser -> Internet -> Web server -> Application server -> Database server -> Database
o Many back-end databases cannot interpret commands written in HTML
o Application server is the middleware working between the Web server & database server
   “Translator” of HTML to SQL
   Handles all application operations, including transaction processing and data access, between browser-based computers and a company’s back-end business applications or databases
   Takes requests from the Web server, runs the business logic to process transactions based on those requests, and provides connectivity to the organization’s back-end systems or databases
   Software for handling these operations could be a Common Gateway Interface (CGI) script
Advantages:
o Web browser software is much easier to use than proprietary query tools
o Few or no changes to the internal database
o Costs less to add a Web interface in front of a legacy system than to redesign and rebuild the system to improve user access
MySpace is a massive database of users’ personal information -> entirely new business

6.4 Managing Data Resources

Establishing an Information Policy
Information policy: the organization’s rules for sharing, disseminating, acquiring, standardizing, classifying and inventorying information.
o Specific procedures and accountabilities that identify:
   Which users and organizational units can SHARE information
   Where information can be DISTRIBUTED
   Who is responsible for UPDATING & MAINTAINING the information
Data administration: specific policies and procedures through which data can be managed as an organizational resource
o Developing information policy
o Planning for data
o Overseeing logical database design
o Data dictionary development
o Monitoring how information systems specialists and end-user groups use data
Data governance: the policies and processes for managing the availability, usability, integrity and security of the data employed in an enterprise, with special emphasis on promoting privacy, security, data quality and compliance with government regulations.
Database administration: a special organizational function for managing the organization’s data resources, concerned with information policy, data planning, maintenance of data dictionaries, and data quality standards.

Ensuring Data Quality
Inaccurate, untimely or inconsistent data leads to incorrect decisions, product recalls and financial losses
If a database is properly designed and enterprise-wide data standards established, duplicate or inconsistent data elements should be minimal
Many errors result from data input, e.g. misspellings or incorrect codes
Data quality audit: structured survey of the accuracy and level of completeness of the data in an information system
o Can be performed by surveying entire data files, samples from data files, or end users for their perceptions of data quality
Data cleansing: also known as data scrubbing, activities for detecting and correcting data in a database that are incorrect, incomplete, improperly formatted or redundant.
o Enforces consistency among different sets of data that originated in separate information systems
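
A minimal sketch of the data cleansing idea above, in plain Python. The record layout (name and email fields), the sample rows and the rules used here (trim and standardize formatting, reject rows with a missing name or an e-mail lacking "@", drop repeats of the same e-mail) are illustrative assumptions, not a prescribed procedure.

```python
def normalize(record):
    """Fix formatting only: collapse extra spaces, standardize letter case."""
    name = " ".join(record["name"].split()).title()
    email = record["email"].strip().lower()
    return {"name": name, "email": email}


def cleanse(records):
    """Split records into cleaned rows and rows rejected as incomplete or incorrect."""
    cleaned, rejected, seen_emails = [], [], set()
    for record in records:
        row = normalize(record)
        # Incomplete or incorrect: missing name, or e-mail without an "@".
        if not row["name"] or "@" not in row["email"]:
            rejected.append(record)
            continue
        # Redundant: the same e-mail has already been kept once.
        if row["email"] in seen_emails:
            continue
        seen_emails.add(row["email"])
        cleaned.append(row)
    return cleaned, rejected


# Hypothetical customer rows merged from two separate systems.
raw = [
    {"name": "  ana  lopez ", "email": "Ana.Lopez@Example.com "},
    {"name": "Ana Lopez", "email": "ana.lopez@example.com"},
    {"name": "Ben Wu", "email": "ben.wu.example.com"},
    {"name": "", "email": "nobody@example.com"},
]

cleaned, rejected = cleanse(raw)
print(cleaned)
print(len(rejected))
```

The first two rows describe the same customer with different formatting, so only one survives; the rejected rows are returned rather than silently discarded so that someone can correct them at the source.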