Slides - Zhangxi Lin's - Texas Tech University

advertisement
ISQS 6339, Business Intelligence
Data Warehousing
Zhangxi Lin
Texas Tech University
1
1
Outlines

So far students have learned
◦ Basic concepts of business intelligence
◦ The definition and importance of data warehouse

In this lecture, the following topics will be covered
◦ SQL Server 2008 data mart case study
 How to access data in a network directory
 How to access SQL Server 2008 on the Citrix Server
 How to load data from an Excel file to a database
◦ Data warehouse overview
◦ Data warehouse architecture
◦ Data integration
ISQS 6347, Data & Text Mining
2
Data Warehousing
Definitions and Concepts

Data warehouse
◦ Video – Overview of data warehouse 2’38”
A physical repository where relational data are
specially organized to provide enterprise-wide,
cleansed data in a standardized format
 Benefits of data warehouse 3’18”
3
Data mart

Definition
A localized data warehouse that stores only relevant data
to a department or even an individual
◦ Dependent data mart
A subset that is created directly from a data warehouse
◦ Independent data mart
A small data warehouse designed for a strategic business unit or a
department
4
Data Mart
- The IMW Case
IMW, standing for Internet Media Works!, is an ASP in
real estate information services. It is headquartered
in Austin, Texas. CEO is Gary Anderson.
Web page: http://www.inetworks.com
5
About IMW
Based in Austin, Texas, IMW (Internet Media Works!) is an
ASP, specialized mainly in web-based application development,
database integration, and web development and hosting for
all kinds of businesses.
 IMW has been more successful in selling its e-business
services for commercial real estate. Its services include lead
generation, real estate transaction management, property
listing, realtor membership management, real estate indices,
real estate auctions, etc., with COMMREX as a complete ebusiness solution.
 IMW used to have up to 6 full-time employees and a few
part-time employees.

6
ISQS 6347, Data & Text Mining
IMW’s Services
IMW’s Web-Based Application Services
Public User
Application
Services
Public User Support
Website Hosting
Services
Optional Website
Hosting Services
Core Membership
Database Services
Optional Membership
Database Services
Core Property Listing
Database Services
Optional Property Listing
Database Services
Networking and System Operation Services
Internet Service Provider’s Services
7
ISQS 6347, Data & Text Mining
Why need Data Mart?

Data mart complements the centralized data
warehousing based on UDM model, for the
situations where UDM cannot be used
◦
◦
◦
◦
Legacy databases
Data are from nondatabase sources
No physical connection the centralized data warehouse
Data are not clean
8
Data Mart Structures

Fact tables
◦ Measures

Dimension tables
◦ Dimensions and Hierarchies
◦ Attributes (or columns)

Dimensional modeling – Stars and Snowflakes
9
Measures




A numeric quantity expressing some of the
organization’s performance. The information
represented by this quantity is used to support or
evaluate the decision making and performance of
the organization.
A measure is also called a fact
The table holding measure information is called as a
fact table
Dimensions vs. Measures 2’38”
Commrex Real Estate Operational
Database
Users: property listors, webmaster, marketing manager of IMW
 Objective: Encourage realtors to use the online ASP services with

the best information services to increase IMW’s revenue.
Value Chain
◦ Listors create their account
◦ Listors post their real estate properties to the web-based database
services and pay listing fees
◦ Property buyers search the website-based database and buy
properties from listors. This is the incentive for listors to use the ASP
services
 Business Processes
◦ Listor sign up
◦ Listor account management
◦ Property data posting
◦ Property search
◦ Property database maintenance

11
11
IMW’s Database ERD Model
Property Listing Database
TransactionID
Membership Database
M:1
UserID
Property ID
PropID
Listor ID
M:M
Property Type
M:1
Listor ID

Listor Name

Property Type
Type Name
Address
Company ID
Subtype 1
City

Subtype 2

Chapter

UpdateDate
Feature
Subtype n


Legends
Primary Key
M:M
Functions
Specializations

Company ID
Comp Name

Address
Secondary Key

Telephone #
Link to a table

ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
12
Commrex Data Warehousing


Users: CEO of IMW, IMW business analyst, IMW marketing
manager
Analytic themes
◦ Fast retrieval of business key performance indicators (KPIs)
◦ Decision making on business promotions

Applications
◦ Geographic distribution of property listings
◦ Scorecard for main performance indicators
◦ Dashboard

Questions
◦
◦
◦
◦
How to model data warehouse?
What are required in data transformation and preprocessing?
Any missing dimension for data ware housing?
How to perform routine data warehouse updates – frequency,
timing, etc.
IMW’s Data Warehouse Dimensional Model
Property Listing Fact
Property Type
Dimension
Membership Dimension
Property ID
Listor ID
Listor ID

Listor Name
PropType
PropType

SubName
Address
Company ID
City


Chapter
…
UpdateDate
Legends
Primary Key
Features
Specializations


Year
Company ID
Quarter
Comp Name
Month
Date

Secondary Key
Functions
Company
Dimension

Address

Telephone #

Link to a table
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
14
Data Warehouse Overview
15
Data Warehousing
Characteristics

Basic characteristics of data warehousing
◦
◦
◦
◦

Subject oriented
Integrated
Time variant (time series)
Nonvolatile (not allow to change)
Others
◦
◦
◦
◦
◦
Web based
Relational/multidimensional
Client/server
Real-time
Include metadata
16
Data Warehousing
Process Overview

Data in DW are constantly accumulated.
◦ Organizations continuously collect data, information, and
knowledge at an increasingly accelerated rate and store them in
computerized systems

The number of users is constantly increasing.
◦ The number of users needing to access the information continues
to increase as a result of improved reliability and availability of
network access, especially the Internet

The organization using data warehouse relied on DW
more and more
17
Data Warehousing
More Concepts



Operational data stores (ODS)
A type of database often used as an interim area for a data
warehouse, especially for customer information files
Enterprise data warehouse (EDW)
A large-scale data warehouse used across the enterprise for
decision support. It integrates different sources of information
into a consolidated information system.
Metadata (Video 1’41”)
Data about data. In a data warehouse, metadata describe the
contents of a data warehouse and the manner of its use
◦ Syntactic metadata, structural metadata, and semantic metadata
18
Data Warehousing
Process Overview
19
Data Warehousing
Process Overview

The major components of a data warehousing
process
◦
◦
◦
◦
◦
◦
Data sources
Data extraction
Data loading
Comprehensive database
Metadata
Middleware tools
20
Data Warehouse
Architectures
21
Three Parts of Data Warehouse



The data warehouse that contains the data and
associated software
Data acquisition (back-end) software that extracts data
from legacy systems and external sources, consolidates
and summarizes them, and loads them into the data
warehouse
Client (front-end) software that allows users to access
and analyze data from the warehouse
22
Three-Tier Data Warehouse
23
Alternative Data Warehouse
Architectures (1)
24
Alternative Data Warehouse
Architectures (2)
25
Alternative Data Warehouse
Architectures (3)
26
Alternative Data Warehouse
Architectures (4)
27
Alternative Data Warehouse
Architectures (5)
28
Architectures Comparison
29
Teradata’s EDW
30
Hadoop – for BI in the Cloudera



Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment.
Hadoop makes it possible to run applications on
systems with thousands of nodes involving thousands
of terabytes.
Hadoop was inspired by Google's MapReduce, a
software framework in which anapplication is broken
down into numerous small parts. Doug Cutting,
Hadoop's creator, named the framework after his child's
stuffed toy elephant.
31
Apache Hadoop

The Apache Hadoop framework is
composed of the following modules :
◦ Hadoop Common - contains libraries and
utilities needed by other Hadoop modules
◦ Hadoop Distributed File System (HDFS).
◦ Hadoop YARN - a resource-management
platform responsible for managing compute
resources in clusters and using them for
scheduling of users' applications.
◦ Hadoop MapReduce - a programming
model for large scale data processing.
ISQS 6339, Data Mgmt & BI
32
MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
33
How Hadoop Operates
ISQS 6339, Data Mgmt & BI
34
Cloudera’s Hadoop System
35
Hadoop 2: Big data's big leap forward



The new Hadoop is the Apache Foundation's attempt to
create a whole new general framework for the way big
data can be stored, mined, and processed.
The biggest constraint on scale has been Hadoop’s job
handling. All jobs in Hadoop are run as batch processes
through a single daemon called JobTracker, which
creates a scalability and processing-speed bottleneck.
Hadoop 2 uses an entirely new job-processing
framework built using two daemons:
ResourceManager, which governs all jobs in the
system, and NodeManager, which runs on each
Hadoop node and keeps the ResourceManager
informed about what's happening on that node.
ISQS 6339, Data Mgmt & BI
36
MapReduce 2.0 – YARN
(Yet Another Resource Negotiator)
ISQS 6339, Data Mgmt & BI
37
Teradata Big Data Platform
2013-12-02
林漳希 @清华大学
38
Dell representation of the
Hadoop ecosystem
39
Nokia’s Big Data Architechture
2013-12-02
林漳希 @清华大学
40
Comparison between big data platform and
traditional BI platform
2013-12-02
林漳希 @清华大学
41
Resolving legacy problem – Dual platform
2013-12-02
林漳希 @清华大学
42
Ten factors that potentially affect
the architecture selection decision
1. Information interdependence
between organizational units
2. Upper management’s
information needs
3. Urgency of need for a data
warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data
warehouse prior to
implementation
7. Compatibility with existing
systems
8. Perceived ability of the in-house
IT staff
9. Technical issues
10. Social/political factors
43
Data Integration
44
Data Integration

Integration that comprises three major processes:
◦ data access,
◦ data federation, and
◦ change capture.


When these three processes are correctly implemented, data
can be accessed and made accessible to an array of ETL and
analysis tools and data warehousing environments
ETL Tools 4’56”
45
Data Integration
Enterprise application integration (EAI)
A technology that provides a vehicle for pushing data from source systems into a
data warehouse, including application functionality integration. Recently serviceoriented architecture (SOA) is applied
 Enterprise information integration (EII)
An evolving tool space that promises real-time data integration from a variety
of sources, such as relational databases, Web services, and multidimensional
databases
 Extraction, transformation, and load (ETL)
A data warehousing process that consists of extraction (i.e., reading data from a
database), transformation (i.e., converting the extracted data from its previous
form into the form in which it needs to be so that it can be placed into a data
warehouse or simply another database), and load (i.e., putting the data into the
data warehouse)

46
Transformation Tools: To purchase or to
Build in-House
Issues affect whether an organization will purchase data transformation
tools or build the transformation process itself
◦ Data transformation tools are expensive
◦ Data transformation tools may have a long learning curve
◦ It is difficult to measure how the IT organization is doing until it has
learned to use the data transformation tools
 Important criteria in selecting an ETL tool
◦ Ability to read from and write to an unlimited number of data source
architectures
◦ Automatic capturing and delivery of metadata
◦ A history of conforming to open standards
◦ An easy-to-use interface for the developer and the functional user

47
Open Source Software for Big Data
Oracle VM VirtualBox
 Cloudera Hadoop - Get Started With
Enterprise Hadoop
 Hortonworks Data Platform Hortonworks.com
 Google Hadoop Solutions - google.com
 Hadoop on Google Cloud Platform
 Hadoop & NoSQL - MarkLogic.com

48
Structure and Components of Business
Intelligence
MS SQL Server 2008
SSMS
SSIS
SSAS
BIDS
SSRS
SAS
EG
SAS
EM
49
Exercise 1 – Walk through data
warehousing process

Learning Objectives
◦ To gain a general impression how to use SQL Server 2008 to implement a
data mart

Tasks
◦
◦
◦
◦
◦

Create your database with SSMS, named as ISQS6339_lastname
Import data from Commrex_2011.xls
Use SSMS to create a ERD diagram
Create a SSAS project using BIDS
Define data source, data source view, and cube
Deliverable:
◦ One-page printout of the screenshot of the cube diagram
50
Download