Uploaded by Lawrence Likhtenstein


CIS 2200
Chapter 6
FOUNDATIONS OF BUSINESS INTELLIGENCE:
DATABASES AND INFORMATION MANAGEMENT
Copyright © 2017, 2014, 2011 Pearson Education, Inc. All Rights Reserved.
Databases and Information
Management
An effective information system provides users with
accurate, timely, and relevant information. Accurate
information is free of errors. Information is timely when it is
available to decision makers when it is needed.
Information is relevant when it is useful and appropriate for
the types of work and decisions that require it.
Copyright © 2020, 2018, 2016 Pearson Education, Inc. All Rights Reserved
Data Management Helps the Charlotte
Hornets Learn More About Their Fans
Problem
◦ Large volumes of data in isolated databases
◦ Outdated data management technology
Solutions
◦ SAP HANA
◦ Data warehouse
◦ FanTracker
Illustrates the importance of data management for better
decision making and customer analysis
File Organization Terms and
Concepts
A bit represents the smallest unit of data a computer can
handle.
A group of bits, called a byte, represents a single character,
which can be a letter, a number, or another symbol. A grouping
of characters into a word, a group of words, or a complete
number (such as a person’s name or age) is called a field.
A group of related fields, such as the student’s name, the course
taken, the date, and the grade, comprises a record;
a group of records of the same type is called a file.
File Organization Terms and
Concepts
• Database: Group of related files
• File: Group of records of same type
• Record: Group of related fields, describes an entity
• Field: Group of characters as word(s) or number(s)
• Entity: Person, place, thing on which we store information
• Attribute: Each characteristic, or quality, describing entity
For example, Student_ID, Course, Date, and Grade are attributes
of the entity COURSE. The specific values that these attributes
can have are found in the fields of the record describing the
entity COURSE.
The Data Hierarchy
Problems with the Traditional File
Environment
Files maintained separately by different departments
• Data redundancy
• Data inconsistency
• Program-data dependence
• Lack of flexibility
• Poor security
• Lack of data sharing and availability
Problems with the Traditional File
Environment
Data redundancy is the presence of duplicate data in multiple
data files so that the same data are stored in more than one
place or location.
Data redundancy occurs when different groups in an
organization independently collect the same piece of data and
store it independently of each other.
Data redundancy wastes storage resources and also leads
to data inconsistency, where the same attribute may have
different values.
Problems with the Traditional File
Environment
Program-data dependence refers to the coupling of data stored
in files and the specific programs required to update and
maintain those files such that changes in programs require
changes to the data.
Lack of Flexibility: A traditional file system can deliver routine
scheduled reports after extensive programming efforts, but it
cannot deliver ad hoc reports or respond to unanticipated
information requirements in a timely fashion.
Poor Security: Because there is little control or management of
data, access to and dissemination of information may be out of
control. Management might have no way of knowing who is
accessing or even making changes to the organization’s data.
Problems with the Traditional File
Environment
Lack of Data Sharing and Availability
Because pieces of information in different files and different
parts of the organization cannot be related to one another, it is
virtually impossible for information to be shared or accessed in
a timely manner. Information cannot flow freely across different
functional areas or different parts of the organization.
Traditional File Processing
Database Management Systems
Database
Database technology cuts through many of the problems of
traditional file organization.
A more rigorous definition of a database is a collection of data
organized to serve many applications efficiently by centralizing the
data and controlling redundant data.
Database Management Systems
Database management system (DBMS)
The logical view presents data, as they would be perceived by end
users or business specialists, whereas the physical view shows how
data are actually organized and structured on physical storage media.
◦ Controls redundancy
◦ Eliminates inconsistency
◦ Uncouples programs and data
◦ Enables organization to centrally manage data and data security
Human Resources Database with
Multiple Views
Relational DBMS
Contemporary DBMS use different database models to keep
track of entities, attributes, and relationships.
The most popular type of DBMS today for PCs as well as for
larger computers and mainframes is the relational DBMS.
Relational databases represent data as two-dimensional tables
(called relations). Tables may be referred to as files. Each table
contains data on an entity and its attributes.
Relational DBMS
Table: grid of columns and rows
◦ Rows (tuples): Records for different entities
◦ Fields (columns): Represent attributes of an entity
◦ Key field: Field used to uniquely identify each record
◦ Primary key: The key field designated as the table's unique identifier
◦ Foreign key: Primary key used in second table as look-up
field to identify records from original table
Each table in a relational database has one field that is designated as
its primary key. This key field is the unique identifier for all the
information in any row of the table and this primary key cannot be
duplicated.
Relational Database Tables
Operations of a Relational DBMS
Three basic operations used to develop useful sets of data
◦ SELECT: Creates a subset of data consisting of all records that
meet stated criteria
◦ JOIN: Combines relational tables to provide the user with more
information than is available in individual tables
◦ PROJECT: Creates a subset of columns in a table, creating tables
with only the information specified
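The three operations can be tried out with a small sketch. It assumes Python's built-in sqlite3 module and two hypothetical tables, part and supplier, filled with made-up rows; none of these names or values come from the slides.

```python
import sqlite3

# In-memory database with two small, made-up tables (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (
        supplier_number INTEGER PRIMARY KEY,
        supplier_name   TEXT
    );
    CREATE TABLE part (
        part_number     INTEGER PRIMARY KEY,
        part_name       TEXT,
        unit_price      REAL,
        supplier_number INTEGER
    );
    INSERT INTO supplier VALUES (8259, 'Supplier A'), (8261, 'Supplier B');
    INSERT INTO part VALUES (137, 'Door latch',   22.0, 8259),
                            (150, 'Door molding',  6.0, 8261),
                            (152, 'Door lock',    31.0, 8259);
""")

# SELECT operation: keep only the rows (records) that meet a criterion.
selected = conn.execute(
    "SELECT * FROM part WHERE part_number IN (137, 150) ORDER BY part_number"
).fetchall()

# PROJECT operation: keep only some of the columns (fields).
projected = conn.execute(
    "SELECT part_number, part_name FROM part ORDER BY part_number"
).fetchall()

# JOIN operation: combine the two tables on the shared supplier_number field.
joined = conn.execute("""
    SELECT part.part_number, part.part_name, supplier.supplier_name
    FROM part
    JOIN supplier ON part.supplier_number = supplier.supplier_number
    ORDER BY part.part_number
""").fetchall()

print(selected)
print(projected)
print(joined)
```

Note that in SQL the keyword SELECT covers both the select and the project operation: the WHERE clause picks rows, and the column list picks columns.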
Capabilities of Database
Management Systems
Data definition capability
A DBMS includes capabilities and tools for organizing, managing,
and accessing the data in the database. The most important are
its data definition language, data dictionary, and data
manipulation language.
DBMS have a data definition capability to specify the structure
of the content of the database. It would be used to create
database tables and to define the characteristics of the fields in
each table. This information about the database would be
documented in a data dictionary. A data dictionary is an
automated or manual file that stores definitions of data
elements and their characteristics.
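As a rough illustration of data definition, the sketch below creates one table and then reads its field definitions back, which is similar in spirit to consulting a data dictionary. It assumes Python's sqlite3 module and a hypothetical student table; SQLite's PRAGMA table_info is only a stand-in for the richer dictionaries that commercial DBMS provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: specify the table structure and field characteristics.
# The student table and its fields are hypothetical.
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        grade      TEXT
    )
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
dictionary = [
    {
        "field": name,
        "type": col_type,
        "required": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, col_type, notnull, _default, pk
    in conn.execute("PRAGMA table_info(student)")
]

for entry in dictionary:
    print(entry)
```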
Capabilities of Database
Management Systems
Querying and reporting
◦ Data manipulation language
DBMS includes tools for accessing and manipulating information in
databases. Most DBMS have a specialized language called a data
manipulation language that is used to add, change, delete, and
retrieve the data in the database.
This language contains commands that permit end users and
programming specialists to extract data from the database to satisfy
information requests and develop applications.
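The four kinds of data manipulation named above (add, change, delete, retrieve) map onto the SQL statements INSERT, UPDATE, DELETE, and SELECT. The following is a minimal sketch, assuming Python's sqlite3 module and a hypothetical employee table with invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)

# Add: insert three made-up records.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ana", "Sales"), (2, "Ben", "Payroll"), (3, "Cho", "Sales")],
)

# Change: move employee 3 to another department.
conn.execute(
    "UPDATE employee SET department = ? WHERE employee_id = ?",
    ("Marketing", 3),
)

# Delete: remove employee 2.
conn.execute("DELETE FROM employee WHERE employee_id = ?", (2,))

# Retrieve: read back what is left.
rows = conn.execute(
    "SELECT employee_id, name, department FROM employee ORDER BY employee_id"
).fetchall()
print(rows)
```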
Capabilities of Database
Management Systems
Structured Query Language (SQL)
Users of DBMS for large and midrange computers, such as DB2, Oracle,
or SQL Server, would employ SQL to retrieve information they needed
from the database.
Many DBMS have report generation capabilities for creating polished
reports (for example, Microsoft Access).
Access Data Dictionary Features
Example of an SQL Query
An Access Query
Designing Databases
Conceptual design vs. physical design
The database requires both a conceptual design and a physical
design.
The conceptual, or logical, design of a database is an abstract
model of the database from a business perspective, whereas the
physical design shows how the database is actually arranged on
direct-access storage devices.
Designing Databases
Normalization
◦ The process of creating small, stable, yet flexible and adaptive
data structures from complex groups of data. Streamlining
complex groupings of data to minimize redundant data elements
and awkward many-to-many relationships
Referential integrity
◦ Rules used by RDBMS to ensure relationships between tables
remain consistent
◦ When one table has a foreign key that points to another table,
you may not add a record to the table with the foreign key unless
there is a corresponding record in the linked table.
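The referential integrity rule can be seen in action with a short sketch. It assumes Python's sqlite3 module (where foreign key enforcement must be switched on explicitly) and hypothetical supplier and part tables; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is turned on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE supplier ("
    "supplier_number INTEGER PRIMARY KEY, supplier_name TEXT)"
)
conn.execute("""
    CREATE TABLE part (
        part_number     INTEGER PRIMARY KEY,
        part_name       TEXT,
        supplier_number INTEGER NOT NULL
            REFERENCES supplier(supplier_number)
    )
""")

conn.execute("INSERT INTO supplier VALUES (8259, 'Example Supplier')")

# Allowed: supplier 8259 exists in the linked table.
conn.execute("INSERT INTO part VALUES (137, 'Door latch', 8259)")

# Refused: there is no supplier 9999, so the part record cannot be added.
try:
    conn.execute("INSERT INTO part VALUES (150, 'Door molding', 9999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

part_count = conn.execute("SELECT COUNT(*) FROM part").fetchone()[0]
print(rejected, part_count)
```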
Designing Databases
Entity-relationship diagram
To create an efficient database, you must know what the relationships
are among the various data elements, the types of data that will be
stored, and how the organization will need to manage the data.
Database designers document their data model with an
entity-relationship diagram. This diagram illustrates the relationships
between the entities. A correct data model is essential for a system
to serve the business well.
Non-Relational Databases and
Databases in the Cloud
Non-relational database management systems use a more
flexible data model and are designed for managing large data
sets across many distributed machines and for easily scaling up
or down. They are useful for accelerating simple queries against
large volumes of structured and unstructured data, including
web, social media, graphics, and other forms of data that are
difficult to analyze with traditional SQL-based tools.
Non-relational databases: “NoSQL”
◦ More flexible data model
◦ Data sets stored across distributed machines
◦ Easier to scale
◦ Handle large volumes of unstructured and structured data
Non-Relational Databases and
Databases in the Cloud
Cloud-based data management services have special appeal for
web-focused startups or small to medium-sized businesses seeking
database capabilities at a lower cost than in-house database products.
Databases in the cloud
◦ Appeal to start-ups, smaller businesses
◦ Amazon Relational Database Service, Microsoft SQL Azure
◦ Private clouds
Distributed Databases
A distributed database is one that is stored in multiple physical
locations. Parts or copies of the database are physically stored in
one location and other parts or copies are maintained in other
locations.
Google's Spanner distributed database, for example, makes it
possible to store information across millions of
machines in hundreds of data centers around the globe, with
special time-keeping tools to synchronize the data precisely in
all of its locations and ensure the data are always consistent.
Blockchain
Blockchain is a distributed database technology that enables
firms and organizations to create and verify transactions on a
network nearly instantaneously without a central authority.
The system stores transactions as a distributed ledger among a
network of computers. The information held in the database is
continually reconciled by the computers in the network.
• Distributed ledgers in a peer-to-peer distributed database
• Maintains a growing list of records and transactions shared by all
• Encryption used to identify participants and transactions
• Used for financial transactions, supply chain, and medical records
• Foundation of Bitcoin and other cryptocurrencies
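One idea behind a "growing list of records" that can be verified without a central authority is that each block stores a hash of the block before it. The sketch below shows only that hash-linking idea on a single machine, using Python's hashlib. It leaves out the peer-to-peer network, consensus, and the cryptographic identification of participants, and all function names and transactions are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, transactions, previous_hash):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "transactions": transactions,
         "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain, transactions):
    """Append a new block that points back at the current last block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "transactions": transactions,
        "previous_hash": previous_hash,
        "hash": block_hash(index, transactions, previous_hash),
    })


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    expected_previous = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != expected_previous:
            return False
        recomputed = block_hash(
            block["index"], block["transactions"], block["previous_hash"]
        )
        if block["hash"] != recomputed:
            return False
        expected_previous = block["hash"]
    return True


ledger = []
add_block(ledger, ["A pays B 5"])
add_block(ledger, ["B pays C 2"])
valid_before = is_valid(ledger)

# Tampering with an old record breaks the chain of hashes.
ledger[0]["transactions"] = ["A pays B 500"]
valid_after = is_valid(ledger)
print(valid_before, valid_after)
```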
The Challenge of Big Data
Big data
Most data collected by organizations used to be transaction data
that could easily fit into rows and columns of relational database
management systems.
We are now witnessing an explosion of data from web traffic,
email messages, and social media content (tweets, status
messages), as well as machine-generated data from sensors
(used in smart meters, manufacturing sensors, and electrical
meters) or from electronic trading systems.
The Challenge of Big Data
Big data
We now use the term big data to describe these data sets with
volumes so huge that they are beyond the ability of typical
DBMS to capture, store, and analyze.
Volumes too great for typical DBMS
◦ Petabytes, exabytes of data
Big data is often characterized by the “3Vs”: the
extreme volume of data, the wide variety of data types and
sources, and the velocity at which data must be processed.
Requires new tools and technologies to manage and analyze
Business Intelligence
Infrastructure
A contemporary infrastructure for business intelligence has an
array of tools for obtaining useful information from all the
different types of data used by businesses today, including semistructured and unstructured big data in vast quantities.
These capabilities include data warehouses and data marts,
Hadoop, in-memory computing, and analytical platforms. Some
of these capabilities are available as cloud services.
Business Intelligence
Infrastructure
A data warehouse is a database that stores current and
historical data of potential interest to decision makers
throughout the company. The data originate in many core
operational transaction systems, such as systems for sales,
customer accounts, and manufacturing, and may include data
from website transactions.
Data warehouse
◦ Stores current and historical data from many core
operational transaction systems
◦ Consolidates and standardizes information for use across
enterprise, but data cannot be altered
◦ Provides analysis and reporting tools
Business Intelligence
Infrastructure
Data marts
A data mart is a subset of a data warehouse in which a
summarized or highly focused portion of the organization’s data
is placed in a separate database for a specific population of
users.
◦ Subset of data warehouse
◦ Typically focus on single subject or line of business
Business Intelligence
Infrastructure
Hadoop is an open source software framework managed by the
Apache Software Foundation that enables distributed parallel
processing of huge amounts of data across inexpensive
computers.
It breaks a big data problem down into sub-problems,
distributes them among up to thousands of inexpensive
computer processing nodes, and then combines the result into a
smaller data set that is easier to analyze.
Key services
◦ Hadoop Distributed File System (HDFS): data storage
◦ MapReduce: breaks data into clusters for work
◦ HBase: NoSQL database
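The "break into sub-problems, then combine" pattern described above can be imitated on one machine with a word count. This is only a sketch of the map and reduce steps in plain Python; it does not use Hadoop or its API, and in a real cluster the map step would run in parallel on many nodes. The text chunks and function names are made up.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk independently of the others."""
    return Counter(chunk.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Each string stands in for a block of data held on a different node.
chunks = [
    "big data needs new tools",
    "new tools for big data",
    "data data data",
]

partial_counts = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partial_counts, Counter())
print(totals.most_common(3))
```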
Business Intelligence
Infrastructure
Another way of facilitating big data analysis is to use in-memory
computing, which relies primarily on a computer’s main
memory (RAM) for data storage. (Conventional DBMS use disk
storage systems.)
Users access data stored in system primary memory, thereby
eliminating bottlenecks from retrieving and reading data in a
traditional, disk-based database and dramatically shortening
query response times.
In-memory computing:
◦ Can reduce hours/days of processing to seconds
◦ Requires optimized hardware
Business Intelligence
Infrastructure
Commercial database vendors have developed specialized
high-speed analytic platforms using both relational and
non-relational technology that are optimized for analyzing large
data sets.
Analytic platforms feature preconfigured hardware-software
systems that are specifically designed for query processing and
analytics.
Analytic platforms:
◦ High-speed platforms using both relational and non-relational
tools optimized for large datasets
Analytical Tools: Relationships,
Patterns, Trends
Tools for consolidating, analyzing, and providing access to vast
amounts of data to help users make better business decisions
Online Analytical Processing (OLAP)
OLAP supports multidimensional data analysis, enabling users to
view the same data in different ways using multiple dimensions.
Each aspect of information—product, pricing, cost, region, or
time period—represents a different dimension.
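Viewing the same data along different dimensions can be sketched with a handful of made-up sales facts. The example below is plain Python rather than an actual OLAP product; the summarize helper, the field names, and the figures are all assumptions for illustration.

```python
from collections import defaultdict

# Made-up sales facts; each record has three dimensions and one measure.
sales = [
    {"product": "nuts",  "region": "East", "quarter": "Q1", "units": 50},
    {"product": "nuts",  "region": "West", "quarter": "Q1", "units": 30},
    {"product": "bolts", "region": "East", "quarter": "Q1", "units": 20},
    {"product": "bolts", "region": "East", "quarter": "Q2", "units": 40},
    {"product": "nuts",  "region": "East", "quarter": "Q2", "units": 10},
]


def summarize(facts, *dimensions):
    """Total the units sold for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[dimension] for dimension in dimensions)
        totals[key] += fact["units"]
    return dict(totals)


# The same data, viewed two different ways.
by_product = summarize(sales, "product")
by_region_quarter = summarize(sales, "region", "quarter")
print(by_product)
print(by_region_quarter)
```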
Analytical Tools: Relationships,
Patterns, Trends
Data mining is more discovery-driven. Data mining provides
insights into corporate data that cannot be obtained with OLAP
by finding hidden patterns and relationships in large databases
and inferring rules from them to predict future behavior.
The patterns and rules are used to guide decision making and
forecast the effect of those decisions.
Data Mining
Finds hidden patterns, relationships in datasets
◦ Example: customer buying patterns
Infers rules to predict future behavior
Types of information obtainable from data mining:
◦ Associations
◦ Sequences
◦ Classification
◦ Clustering
◦ Forecasting
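Of the types listed, associations are the easiest to show in a few lines. The sketch below measures how often two items are bought together (support) and how often one purchase comes with the other (confidence). The baskets are invented, and real data mining tools work on far larger data sets with more elaborate algorithms.

```python
# Made-up shopping baskets; each set is one customer transaction.
transactions = [
    {"chips", "cola"},
    {"chips", "cola", "salsa"},
    {"chips", "salsa"},
    {"cola"},
    {"chips", "cola"},
]


def count(itemset):
    """Number of baskets that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Share of all baskets that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the share that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)


chips_cola_support = support({"chips", "cola"})
chips_to_cola = confidence({"chips"}, {"cola"})
print(chips_cola_support, chips_to_cola)
```

In this toy data, three of the five baskets contain both items, and three of the four baskets with chips also contain cola.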
Analytical Tools: Relationships,
Patterns, Trends
Text mining
Unstructured data, most in the form of text files, is believed to
account for more than 80 percent of useful organizational
information and is one of the major sources of big data that
firms want to analyze.
Text mining tools are now available to help businesses analyze data
such as email, memos, call center transcripts, survey responses,
legal cases, patent descriptions, and service reports, all of which are
valuable for finding patterns and trends that will help employees
make better business decisions.
Analytical Tools: Relationships,
Patterns, Trends
Web mining
The web is another rich source of unstructured big data for revealing
patterns, trends, and insights into customer behavior. The discovery
and analysis of useful patterns and information from the World
Wide Web are called web mining.
Businesses might turn to web mining to help them understand
customer behavior, evaluate the effectiveness of a particular
website, or quantify the success of a marketing campaign.
◦ Web content mining
◦ Web structure mining
◦ Web usage mining
Databases and the Web
Many companies use the web to make some internal databases
available to customers or partners
In a client/server environment, the DBMS resides on a dedicated
computer called a database server. The DBMS receives the SQL
requests and provides the required data. Middleware transfers
information from the organization’s internal database back to
the web server for delivery in the form of a web page to the
user.
Typical configuration includes:
◦ Web server
◦ Application server/middleware/CGI scripts
◦ Database server (hosting DBMS)
Databases and the Web
There are a number of advantages to using the web to access an
organization’s internal databases. First, web browser software is
much easier to use than proprietary query tools.
Second, the web interface requires few or no changes to the
internal database. It costs much less to add a web interface in
front of a legacy system than to redesign and rebuild the system
to improve user access.
Advantages of using the web for database access:
◦ Ease of use of browser software
◦ Web interface requires few or no changes to database
◦ Inexpensive to add web interface to system
Establishing an Information Policy
An information policy specifies the organization’s rules for
sharing, disseminating, acquiring, standardizing, classifying, and
inventorying information.
Information policy lays out specific procedures and
accountabilities, identifying which users and organizational units
can share information, where information can be distributed,
and who is responsible for updating and maintaining the
information.
Establishing an Information Policy
Data administration is responsible for the specific policies and
procedures through which data can be managed as an
organizational resource.
Data governance deals with the policies and processes for
managing the availability, usability, integrity, and security of the
data employed in an enterprise with special emphasis on
promoting privacy, security, data quality, and compliance with
government regulations.
Database administration is performed by a database design and
management group within the corporate information systems division
that is responsible for defining and organizing the structure
and content of the database and maintaining the database.
Ensuring Data Quality
More than 25 percent of critical data in Fortune 1000 company
databases are inaccurate or incomplete
Before new database is in place, a firm must:
◦ Identify and correct faulty data
◦ Establish better routines for editing data once database in
operation
A data quality audit is a structured survey of the accuracy
and level of completeness of the data in an information system.
Data cleansing, also known as data scrubbing, consists of
activities for detecting and correcting data in a database that are
incorrect, incomplete, improperly formatted, or redundant.
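A small sketch can make the cleansing activities concrete: fix improper formatting, set aside incomplete records, and drop redundant ones. It is written in plain Python over made-up customer records; the field names, the standardize helper, and the rule for spotting duplicates are assumptions, not a standard method.

```python
# Made-up customer records with formatting problems, a gap, and a duplicate.
raw_records = [
    {"customer_id": "101", "name": "  alice SMITH ", "email": "Alice@Example.com"},
    {"customer_id": "101", "name": "Alice Smith", "email": "alice@example.com"},
    {"customer_id": "102", "name": "Bob Jones", "email": ""},
    {"customer_id": "103", "name": "carol diaz", "email": "carol@example.com "},
]


def standardize(record):
    """Fix improper formatting: stray spaces and inconsistent letter case."""
    return {
        "customer_id": record["customer_id"].strip(),
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


clean = []        # records that pass every check
incomplete = []   # records with a missing value, set aside for follow-up
seen = set()      # keys already accepted, used to spot redundant records

for record in map(standardize, raw_records):
    if not all(record.values()):
        incomplete.append(record)
        continue
    key = (record["customer_id"], record["email"])
    if key in seen:
        continue  # redundant copy of a record we already kept
    seen.add(key)
    clean.append(record)

print(clean)
print(incomplete)
```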