CIS 2200
Chapter 6
FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT
Copyright © 2017, 2014, 2011 Pearson Education, Inc. All Rights Reserved.

Databases and Information Management
An effective information system provides users with accurate, timely, and relevant information. Accurate information is free of errors. Information is timely when it is available to decision makers when it is needed. Information is relevant when it is useful and appropriate for the types of work and decisions that require it.
Copyright © 2020, 2018, 2016 Pearson Education, Inc. All Rights Reserved

Data Management Helps the Charlotte Hornets Learn More About Their Fans
Problem
◦ Large volumes of data in isolated databases
◦ Outdated data management technology
Solutions
◦ SAP HANA
◦ Data warehouse
◦ FanTracker
Illustrates the importance of data management for better decision making and customer analysis

File Organization Terms and Concepts
A bit represents the smallest unit of data a computer can handle. A group of bits, called a byte, represents a single character, which can be a letter, a number, or another symbol. A grouping of characters into a word, a group of words, or a complete number (such as a person’s name or age) is called a field. A group of related fields, such as the student’s name, the course taken, the date, and the grade, makes up a record; a group of records of the same type is called a file.
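The bit, byte, field, record, and file levels described above can be sketched in a few lines of Python. This is only an illustration: the student names, course code, and grades are invented sample values, and a Python dictionary and list stand in for a stored record and file.

```python
# Hypothetical illustration of the data hierarchy: bit -> byte -> field -> record -> file.

# A byte is a group of 8 bits that encodes a single character.
char = "A"
bits = format(ord(char), "08b")  # the 8 bits behind the character

# A field is a grouping of characters (a word, several words, or a complete number).
student_name = "John Doe"  # invented sample value

# A record is a group of related fields.
record = {"student_name": student_name, "course": "IS 101",
          "date": "2020-01-15", "grade": "B+"}

# A file is a group of records of the same type.
course_file = [
    record,
    {"student_name": "Jane Roe", "course": "IS 101",
     "date": "2020-01-15", "grade": "A"},
]

print(bits)
print(len(student_name.encode("ascii")), "bytes in the name field")
print(len(course_file), "records in the file")
```

Each level is built from the one below it, which is why a change to a field definition affects every record in the file.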
File Organization Terms and Concepts
• Database: Group of related files
• File: Group of records of same type
• Record: Group of related fields, describes an entity
• Field: Group of characters as word(s) or number(s)
• Entity: Person, place, or thing on which we store information
• Attribute: Each characteristic, or quality, describing an entity
For example, Student_ID, Course, Date, and Grade are attributes of the entity COURSE. The specific values that these attributes can have are found in the fields of the record describing the entity COURSE.

[Figure: The Data Hierarchy]

Problems with the Traditional File Environment
Files maintained separately by different departments
• Data redundancy
• Data inconsistency
• Program-data dependence
• Lack of flexibility
• Poor security
• Lack of data sharing and availability

Problems with the Traditional File Environment
Data redundancy is the presence of duplicate data in multiple data files, so that the same data are stored in more than one place. It occurs when different groups in an organization collect the same piece of data and store it independently of each other. Data redundancy wastes storage resources and also leads to data inconsistency, where the same attribute may have different values.

Program-data dependence refers to the coupling of data stored in files and the specific programs required to update and maintain those files, such that changes in programs require changes to the data.
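To make the link between redundancy and inconsistency concrete, here is a minimal sketch in which two departments keep their own copy of customer data. The file names, customer IDs, and phone numbers are all invented for illustration, and the function `find_inconsistencies` is a hypothetical helper, not part of any product.

```python
# Hypothetical files kept separately by two departments (all values invented).
sales_file = {
    "C001": {"name": "Ann Lee", "phone": "555-0101"},
    "C002": {"name": "Bob Ray", "phone": "555-0102"},
}
billing_file = {
    "C001": {"name": "Ann Lee", "phone": "555-0199"},  # phone was updated here only
    "C003": {"name": "Cy Dunn", "phone": "555-0103"},
}

def find_inconsistencies(file_a, file_b):
    """Return (customer_id, attribute, value_a, value_b) where redundant data disagree."""
    problems = []
    # Customers present in both files are stored redundantly.
    for cust_id in sorted(file_a.keys() & file_b.keys()):
        for attr in sorted(file_a[cust_id].keys() & file_b[cust_id].keys()):
            if file_a[cust_id][attr] != file_b[cust_id][attr]:
                problems.append(
                    (cust_id, attr, file_a[cust_id][attr], file_b[cust_id][attr]))
    return problems

print(find_inconsistencies(sales_file, billing_file))
```

Because customer C001 is stored twice and only one copy was updated, the same attribute now has two different values, which is exactly the inconsistency the slide describes.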
Lack of flexibility: A traditional file system can deliver routine scheduled reports after extensive programming efforts, but it cannot deliver ad hoc reports or respond to unanticipated information requirements in a timely fashion.

Poor security: Because there is little control or management of data, access to and dissemination of information may be out of control. Management might have no way of knowing who is accessing or even making changes to the organization’s data.

Problems with the Traditional File Environment
Lack of data sharing and availability: Because pieces of information in different files and different parts of the organization cannot be related to one another, it is virtually impossible for information to be shared or accessed in a timely manner. Information cannot flow freely across different functional areas or different parts of the organization.

[Figure: Traditional File Processing]

Database Management Systems
Database
Database technology cuts through many of the problems of traditional file organization. A more rigorous definition of a database is a collection of data organized to serve many applications efficiently by centralizing the data and controlling redundant data.
Database Management Systems
Database management system (DBMS)
A DBMS is software that enables an organization to centralize data, manage them efficiently, and provide access to the stored data by application programs.
The logical view presents data as they would be perceived by end users or business specialists, whereas the physical view shows how data are actually organized and structured on physical storage media.
◦ Controls redundancy
◦ Eliminates inconsistency
◦ Uncouples programs and data
◦ Enables organization to centrally manage data and data security

[Figure: Human Resources Database with Multiple Views]

Relational DBMS
Contemporary DBMS use different database models to keep track of entities, attributes, and relationships. The most popular type of DBMS today for PCs as well as for larger computers and mainframes is the relational DBMS. Relational databases represent data as two-dimensional tables (called relations). Tables may be referred to as files. Each table contains data on an entity and its attributes.

Relational DBMS
Table: grid of columns and rows
◦ Rows (tuples): Records for different entities
◦ Fields (columns): Represent attributes of the entity
◦ Key field: Field used to uniquely identify each record
◦ Primary key: Key field designated as the unique identifier for each row of the table
◦ Foreign key: Primary key used in second table as look-up field to identify records from original table
Each table in a relational database has one field that is designated as its primary key. This key field is the unique identifier for all the information in any row of the table, and this primary key cannot be duplicated.

[Figure: Relational Database Tables]
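The primary key and foreign key definitions above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the tables from the slide figure: the `department` and `employee` tables, their columns, and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,   -- primary key: unique identifier for each row
    dept_name TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    dept_id     INTEGER REFERENCES department(dept_id)  -- foreign key: look-up field
);
INSERT INTO department VALUES (10, 'Payroll');
INSERT INTO employee VALUES (1001, 'Pat Kim', 10);
""")

# A primary key cannot be duplicated: a second department 10 is rejected.
try:
    conn.execute("INSERT INTO department VALUES (10, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The foreign key in employee is used to look up the matching department row.
dept = conn.execute(
    "SELECT d.dept_name FROM employee AS e "
    "JOIN department AS d ON d.dept_id = e.dept_id "
    "WHERE e.employee_id = ?", (1001,)).fetchone()

print(duplicate_rejected, dept)
```

The employee row stores only the department number; the department name lives in one place and is found through the key.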
Operations of a Relational DBMS
Three basic operations used to develop useful sets of data
◦ SELECT: Creates subset of data of all records that meet stated criteria
◦ JOIN: Combines relational tables to provide user with more information than available in individual tables
◦ PROJECT: Creates subset of columns in table, creating tables with only the information specified

Capabilities of Database Management Systems
Data definition capability
A DBMS includes capabilities and tools for organizing, managing, and accessing the data in the database. The most important are its data definition language, data dictionary, and data manipulation language.
A DBMS has a data definition capability to specify the structure of the content of the database. It is used to create database tables and to define the characteristics of the fields in each table. This information about the database is documented in a data dictionary, an automated or manual file that stores definitions of data elements and their characteristics.

Capabilities of Database Management Systems
Querying and reporting
◦ Data manipulation language
A DBMS includes tools for accessing and manipulating information in databases. Most DBMS have a specialized language called a data manipulation language that is used to add, change, delete, and retrieve the data in the database. This language contains commands that permit end users and programming specialists to extract data from the database to satisfy information requests and develop applications.
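The select, project, and join operations, together with data definition and data manipulation, can be sketched with Python's built-in sqlite3 module. The `supplier` and `part` tables and every value in them are hypothetical and chosen only for illustration. Note that in SQL the relational select operation corresponds to the WHERE clause, while projection corresponds to the column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: create tables and define the characteristics of each field.
conn.executescript("""
CREATE TABLE supplier (supplier_number INTEGER PRIMARY KEY, supplier_name TEXT);
CREATE TABLE part (part_number INTEGER PRIMARY KEY, part_name TEXT,
                   unit_price REAL, supplier_number INTEGER);
""")

# Data manipulation: add rows (invented sample data).
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(1, "Acme Supply"), (2, "Best Parts")])
conn.executemany("INSERT INTO part VALUES (?, ?, ?, ?)",
                 [(137, "Door latch", 22.0, 1),
                  (150, "Door seal", 6.0, 2),
                  (152, "Compressor", 54.0, 1)])

# SELECT operation: the subset of rows that meet a stated criterion.
selected = conn.execute(
    "SELECT * FROM part WHERE unit_price > 20 ORDER BY part_number").fetchall()

# PROJECT operation: the subset of columns.
projected = conn.execute(
    "SELECT part_name, unit_price FROM part ORDER BY part_number").fetchall()

# JOIN operation: combine the two tables on the shared supplier_number field.
joined = conn.execute("""
    SELECT part.part_number, part.part_name, supplier.supplier_name
    FROM part JOIN supplier ON part.supplier_number = supplier.supplier_number
    WHERE part.part_number IN (137, 150)
    ORDER BY part.part_number
""").fetchall()

print(selected)
print(projected)
print(joined)
```

The last query uses all three operations at once: it joins the tables, selects two parts, and projects three columns.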
Capabilities of Database Management Systems
Structured Query Language (SQL)
Users of DBMS for large and midrange computers, such as DB2, Oracle, or SQL Server, employ SQL to retrieve the information they need from the database. Many DBMS also have report generation capabilities for creating polished reports (for example, Microsoft Access).

[Figure: Access Data Dictionary Features]

[Figure: Example of an SQL Query]

[Figure: An Access Query]

Designing Databases
Conceptual design vs. physical design
The database requires both a conceptual design and a physical design. The conceptual, or logical, design of a database is an abstract model of the database from a business perspective, whereas the physical design shows how the database is actually arranged on direct-access storage devices.

Designing Databases
Normalization
◦ The process of creating small, stable, yet flexible and adaptive data structures from complex groups of data
◦ Streamlines complex groupings of data to minimize redundant data elements and awkward many-to-many relationships
Referential integrity
◦ Rules used by RDBMS to ensure relationships between tables remain consistent
◦ When one table has a foreign key that points to another table, you may not add a record to the table with the foreign key unless there is a corresponding record in the linked table.
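The referential integrity rule above can be observed directly with Python's built-in sqlite3 module. This sketch assumes two hypothetical tables, `customer` and `orders`, with invented rows. One SQLite-specific detail: foreign key enforcement is off by default and must be switched on per connection with a PRAGMA statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)  -- foreign key
);
INSERT INTO customer VALUES (1, 'Dana Fox');
""")

def try_insert_order(order_id, customer_id):
    """Return True if the order is accepted, False if referential integrity rejects it."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert_order(500, 1))   # customer 1 exists in the linked table
print(try_insert_order(501, 99))  # there is no customer 99
```

The second insert fails because the foreign key value has no corresponding record in the linked table, so no orphaned order can be created.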
Designing Databases
Entity-relationship diagram
To create an efficient database, you must know what the relationships are among the various data elements, the types of data that will be stored, and how the organization will need to manage the data. Database designers document their data model with an entity-relationship diagram, which illustrates the relationships between the entities. A correct data model is essential for a system that serves the business well.

Non-Relational Databases and Databases in the Cloud
Non-relational database management systems use a more flexible data model and are designed for managing large data sets across many distributed machines and for easily scaling up or down. They are useful for accelerating simple queries against large volumes of structured and unstructured data, including web, social media, graphics, and other forms of data that are difficult to analyze with traditional SQL-based tools.
Nonrelational databases: “NoSQL”
◦ More flexible data model
◦ Data sets stored across distributed machines
◦ Easier to scale
◦ Handle large volumes of unstructured and structured data

Non-Relational Databases and Databases in the Cloud
Cloud-based data management services have special appeal for web-focused startups or small to medium-sized businesses seeking database capabilities at a lower cost than in-house database products.
Databases in the cloud
◦ Appeal to start-ups, smaller businesses
◦ Amazon Relational Database Service, Microsoft SQL Azure
◦ Private clouds

Distributed Databases
A distributed database is one that is stored in multiple physical locations. Parts or copies of the database are physically stored in one location, and other parts or copies are maintained in other locations.
Google’s Spanner distributed database makes it possible to store information across millions of machines in hundreds of data centers around the globe, with special time-keeping tools to synchronize the data precisely in all of its locations and ensure the data are always consistent.

Blockchain
Blockchain is a distributed database technology that enables firms and organizations to create and verify transactions on a network nearly instantaneously without a central authority. The system stores transactions as a distributed ledger among a network of computers. The information held in the database is continually reconciled by the computers in the network.
• Distributed ledgers in a peer-to-peer distributed database
• Maintains a growing list of records and transactions shared by all
• Encryption used to identify participants and transactions
• Used for financial transactions, supply chain, and medical records
• Foundation of Bitcoin and other cryptocurrencies

The Challenge of Big Data
Big data
Most data collected by organizations used to be transaction data that could easily fit into rows and columns of relational database management systems. We are now witnessing an explosion of data from web traffic, email messages, and social media content (tweets, status messages), as well as machine-generated data from sensors (used in smart meters, manufacturing sensors, and electrical meters) or from electronic trading systems.

The Challenge of Big Data
Big data
We now use the term big data to describe these data sets with volumes so huge that they are beyond the ability of typical DBMS to capture, store, and analyze.
Volumes too great for typical DBMS
◦ Petabytes, exabytes of data
Big data is often characterized by the “3Vs”: the extreme volume of data, the wide variety of data types and sources, and the velocity at which data must be processed.
Requires new tools and technologies to manage and analyze

Business Intelligence Infrastructure
A contemporary infrastructure for business intelligence has an array of tools for obtaining useful information from all the different types of data used by businesses today, including semistructured and unstructured big data in vast quantities. These capabilities include data warehouses and data marts, Hadoop, in-memory computing, and analytical platforms. Some of these capabilities are available as cloud services.

Business Intelligence Infrastructure
A data warehouse is a database that stores current and historical data of potential interest to decision makers throughout the company. The data originate in many core operational transaction systems, such as systems for sales, customer accounts, and manufacturing, and may include data from website transactions.
Data warehouse
◦ Stores current and historical data from many core operational transaction systems
◦ Consolidates and standardizes information for use across enterprise, but data cannot be altered
◦ Provides analysis and reporting tools

Business Intelligence Infrastructure
Data marts
A data mart is a subset of a data warehouse in which a summarized or highly focused portion of the organization’s data is placed in a separate database for a specific population of users.
◦ Subset of data warehouse
◦ Typically focus on single subject or line of business
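The relationship between a data warehouse and a data mart can be sketched with Python's built-in sqlite3 module. Everything here is a simplifying assumption: the `warehouse_sales` table and its rows are invented, and the mart is built as a separate table in the same in-memory database rather than as a truly separate database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table: current and historical sales from several systems.
CREATE TABLE warehouse_sales (
    sale_year INTEGER, region TEXT, product_line TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    (2019, 'East', 'Appliances', 1200.0),
    (2020, 'East', 'Appliances',  800.0),
    (2020, 'West', 'Appliances',  500.0),
    (2020, 'East', 'Furniture',   950.0);

-- Data mart: a summarized, focused subset for one line of business.
CREATE TABLE appliance_mart AS
    SELECT sale_year, region, SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE product_line = 'Appliances'
    GROUP BY sale_year, region;
""")

mart_rows = conn.execute(
    "SELECT * FROM appliance_mart ORDER BY sale_year, region").fetchall()
print(mart_rows)
```

Users of the appliance mart see only summarized appliance sales, while the warehouse keeps the full detail for the whole enterprise.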
Business Intelligence Infrastructure
Hadoop is an open source software framework managed by the Apache Software Foundation that enables distributed parallel processing of huge amounts of data across inexpensive computers. It breaks a big data problem down into sub-problems, distributes them among up to thousands of inexpensive computer processing nodes, and then combines the results into a smaller data set that is easier to analyze.
Key services
◦ Hadoop Distributed File System (HDFS): data storage
◦ MapReduce: breaks data into clusters for work
◦ HBase: NoSQL database

Business Intelligence Infrastructure
Another way of facilitating big data analysis is to use in-memory computing, which relies primarily on a computer’s main memory (RAM) for data storage. (Conventional DBMS use disk storage systems.) Users access data stored in system primary memory, thereby eliminating bottlenecks from retrieving and reading data in a traditional, disk-based database and dramatically shortening query response times.
In-memory computing
◦ Can reduce hours/days of processing to seconds
◦ Requires optimized hardware

Business Intelligence Infrastructure
Commercial database vendors have developed specialized high-speed analytic platforms using both relational and nonrelational technology that are optimized for analyzing large data sets. Analytic platforms feature preconfigured hardware-software systems that are specifically designed for query processing and analytics.
Analytic platforms
◦ High-speed platforms using both relational and nonrelational tools optimized for large datasets
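The divide, distribute, and combine pattern described for Hadoop earlier in this section can be imitated in plain Python. This is a single-machine sketch of the MapReduce idea only; it does not use Hadoop or any of its APIs, and the function names and input text are invented for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the intermediate values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one smaller result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already split into chunks (one per processing node).
chunks = ["big data big tools", "data warehouse", "big data"]

intermediate = []
for chunk in chunks:  # on a real cluster these map calls would run in parallel
    intermediate.extend(map_phase(chunk))

word_counts = reduce_phase(shuffle(intermediate))
print(word_counts)
```

Each chunk is processed independently, which is what allows the work to be spread across many inexpensive machines before the partial results are combined.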
Analytical Tools: Relationships, Patterns, Trends
Tools for consolidating, analyzing, and providing access to vast amounts of data to help users make better business decisions
Online Analytical Processing (OLAP)
OLAP supports multidimensional data analysis, enabling users to view the same data in different ways using multiple dimensions. Each aspect of information (product, pricing, cost, region, or time period) represents a different dimension.

Analytical Tools: Relationships, Patterns, Trends
Data mining is more discovery-driven than OLAP. Data mining provides insights into corporate data that cannot be obtained with OLAP by finding hidden patterns and relationships in large databases and inferring rules from them to predict future behavior. The patterns and rules are used to guide decision making and forecast the effect of those decisions.

Data Mining
Finds hidden patterns, relationships in datasets
◦ Example: customer buying patterns
Infers rules to predict future behavior
Types of information obtainable from data mining:
◦ Associations
◦ Sequences
◦ Classification
◦ Clustering
◦ Forecasting

Analytical Tools: Relationships, Patterns, Trends
Text mining
Unstructured data, most in the form of text files, is believed to account for more than 80 percent of useful organizational information and is one of the major sources of big data that firms want to analyze. Text mining tools are now available to help businesses analyze these data. Email, memos, call center transcripts, survey responses, legal cases, patent descriptions, and service reports are all valuable for finding patterns and trends that will help employees make better business decisions.
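One very basic text mining step, finding terms that recur across documents, can be sketched in plain Python. Commercial text mining tools do far more than this. The transcripts, the stop-word list, and the helper `term_counts` are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical call center notes (invented text).
transcripts = [
    "Customer reports the battery drains fast after the update.",
    "Battery problem again; customer asked for a refund.",
    "Screen cracked, customer requests refund or replacement.",
]

# Common words that carry little meaning on their own (assumed list).
STOP_WORDS = {"the", "a", "for", "or", "after", "again", "and"}

def term_counts(documents):
    """Count in how many documents each non-stop-word term appears."""
    counts = Counter()
    for doc in documents:
        terms = set(re.findall(r"[a-z]+", doc.lower())) - STOP_WORDS
        counts.update(terms)
    return counts

counts = term_counts(transcripts)
recurring = sorted(term for term, n in counts.items() if n >= 2)
print(recurring)
```

Terms that show up in several transcripts, such as a product component or a request type, point to patterns worth a closer look.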
Analytical Tools: Relationships, Patterns, Trends
Web mining
The web is another rich source of unstructured big data for revealing patterns, trends, and insights into customer behavior. The discovery and analysis of useful patterns and information from the World Wide Web is called web mining. Businesses might turn to web mining to help them understand customer behavior, evaluate the effectiveness of a particular website, or quantify the success of a marketing campaign.
◦ Web content mining
◦ Web structure mining
◦ Web usage mining

Databases and the Web
Many companies use the web to make some internal databases available to customers or partners.
In a client/server environment, the DBMS resides on a dedicated computer called a database server. The DBMS receives the SQL requests and provides the required data. Middleware transfers information from the organization’s internal database back to the web server for delivery in the form of a web page to the user.
Typical configuration includes:
◦ Web server
◦ Application server/middleware/CGI scripts
◦ Database server (hosting DBMS)

Databases and the Web
There are a number of advantages to using the web to access an organization’s internal databases. First, web browser software is much easier to use than proprietary query tools. Second, the web interface requires few or no changes to the internal database. It costs much less to add a web interface in front of a legacy system than to redesign and rebuild the system to improve user access.
Advantages of using the web for database access:
◦ Ease of use of browser software
◦ Web interface requires few or no changes to database
◦ Inexpensive to add web interface to system
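The web server, middleware, and database server configuration described earlier in this section can be mimicked with three small Python pieces. This is a simplified sketch in which all three tiers run in one process: an in-memory SQLite database stands in for the internal DBMS, and the `product` table, its row, and the function names are assumptions made for the example.

```python
import sqlite3
from html import escape

# Database server tier: an in-memory SQLite database stands in for the internal DBMS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
db.execute("INSERT INTO product VALUES ('A100', 'Desk lamp', 19.5)")

def middleware_lookup(sku):
    """Middleware tier: send an SQL request to the DBMS and return the matching row."""
    return db.execute(
        "SELECT name, price FROM product WHERE sku = ?", (sku,)).fetchone()

def handle_request(sku):
    """Web server tier: build the web page that is delivered to the user's browser."""
    row = middleware_lookup(sku)
    if row is None:
        return "<html><body><p>Product not found</p></body></html>"
    name, price = row
    return (f"<html><body><h1>{escape(name)}</h1>"
            f"<p>Price: ${price:.2f}</p></body></html>")

print(handle_request("A100"))
```

The database itself is unchanged by the web interface; only the two functions in front of it were added, which is the cost advantage the slide points to.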
Establishing an Information Policy
An information policy specifies the organization’s rules for sharing, disseminating, acquiring, standardizing, classifying, and inventorying information. Information policy lays out specific procedures and accountabilities, identifying which users and organizational units can share information, where information can be distributed, and who is responsible for updating and maintaining the information.

Establishing an Information Policy
Data administration: Responsible for the specific policies and procedures through which data can be managed as an organizational resource.
Data governance: Deals with the policies and processes for managing the availability, usability, integrity, and security of the data employed in an enterprise, with special emphasis on promoting privacy, security, data quality, and compliance with government regulations.
Database administration: A database design and management group within the corporate information systems division that is responsible for defining and organizing the structure and content of the database and for maintaining the database.

Ensuring Data Quality
More than 25 percent of critical data in Fortune 1000 company databases are inaccurate or incomplete.
Before a new database is in place, a firm must:
◦ Identify and correct faulty data
◦ Establish better routines for editing data once the database is in operation
Data quality audit: A structured survey of the accuracy and level of completeness of the data in an information system.
Data cleansing: Also known as data scrubbing, consists of activities for detecting and correcting data in a database that are incorrect, incomplete, improperly formatted, or redundant.
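A data quality audit followed by a simple cleansing pass can be sketched in plain Python. The rules here are assumptions chosen for the example, not a standard: a record counts as incomplete if the name or email is blank, as improperly formatted if the ZIP code is not five digits (a US-style rule), and as redundant if its email repeats an earlier one. All records are invented.

```python
import re

# Hypothetical customer records (invented values) with typical quality problems.
records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com", "zip": "28202"},
    {"id": 2, "name": "Bob Ray", "email": "", "zip": "28203"},                  # incomplete
    {"id": 3, "name": "Cy Dunn", "email": "cy@example.com", "zip": "2820"},     # bad format
    {"id": 4, "name": "ann lee ", "email": "ANN@example.com", "zip": "28202"},  # redundant
]

def audit(rows):
    """Return a dict mapping each problem type to the ids of the affected records."""
    issues = {"incomplete": [], "improperly_formatted": [], "redundant": []}
    seen_emails = set()
    for row in rows:
        email = row["email"].strip().lower()
        if not email or not row["name"].strip():
            issues["incomplete"].append(row["id"])
        if not re.fullmatch(r"\d{5}", row["zip"]):
            issues["improperly_formatted"].append(row["id"])
        if email and email in seen_emails:
            issues["redundant"].append(row["id"])
        seen_emails.add(email)
    return issues

def cleanse(rows):
    """Standardize name and email formatting and drop records flagged as redundant."""
    redundant = set(audit(rows)["redundant"])
    return [
        {**row, "name": row["name"].strip().title(),
         "email": row["email"].strip().lower()}
        for row in rows if row["id"] not in redundant
    ]

report = audit(records)
cleaned = cleanse(records)
print(report)
print(len(cleaned), "records after cleansing")
```

The audit only reports problems; the cleansing step fixes what can be fixed automatically and leaves the incomplete and badly formatted records for someone to correct.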