Slides from Lecture 18 - Courses - University of California, Berkeley

Data Warehousing
University of California, Berkeley
School of Information Management
and Systems
SIMS 257: Database Management
IS 257 – Fall 2004
2004.11.15- SLIDE 1
Lecture Outline
• Review
– Application of Object Relational DBMS – the
Berkeley Environmental Digital Library
• Data Warehouses
• Introduction to Data Warehouses
• Data Warehousing
– (Based on lecture notes from Joachim
Hammer, University of Florida, and Joe
Hellerstein and Mike Stonebraker of UCB)
2004.11.15- SLIDE 2
A Digital Library Infrastructure Model
Originators
Index
Services
Repositories
Network
Users
2004.11.15- SLIDE 4
UC Berkeley Digital Library Project
• Focus: Work-centered digital information
services
• Testbed: Digital Library for the California
Environment
• Research: Technical agenda supporting
user-oriented access to large distributed
collections of diverse data types.
• Part of the NSF/NASA/DARPA Digital
Library Initiative (Phases 1 and 2)
2004.11.15- SLIDE 5
The Environmental Library - Contents
• As of late 2002, the collection represents
over one terabyte of data, including over
183,000 digital images, about 300,000
pages of environmental documents, and
over 2 million records in geographical and
botanical databases.
2004.11.15- SLIDE 6
Botanical Data:
• The CalFlora Database contains
taxonomical and distribution information
for more than 8000 native California
plants. The Occurrence Database includes
over 600,000 records of California plant
sightings from many federal, state, and
private sources. The botanical databases
are linked to the CalPhotos collection of
California plants, and are also linked to
external collections of data, maps, and
photos.
2004.11.15- SLIDE 7
Geographical Data:
• Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 TIGER
records of S.F. Bay Area streets along with the
70,000 records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data
represents maps and imagery that have been
processed for inclusion as layers in our GIS
Viewer. This includes Digital Ortho Quads and
DRG maps for the S.F. Bay Area.
2004.11.15- SLIDE 8
Documents:
• Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins,
and county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game (DFG),
San Diego Association of Governments (SANDAG), and
many other agencies. Among the most frequently
accessed documents are County General Plans for
every California county and a survey of 125 Sacramento
Delta fish species.
2004.11.15- SLIDE 9
Multivalent Documents
Valence: 2: The relative capacity to unite, react, or interact
(as with antigens or a biological substrate).
-- Webster’s 7th Collegiate Dictionary
[Figure: a multivalent document built on a Scanned Page Image, with an OCR Layer, OCR Mapping Layer, Table Layer, GIS Layer, and Cheshire Layer, alongside Network Protocols & Resources]
2004.11.15- SLIDE 10
GIS Viewer Example
http://elib.cs.berkeley.edu/annotations/gis/buildings.html
2004.11.15- SLIDE 14
Blobworld: use regions for retrieval
• We want to find general objects
→ Represent images based on coherent regions
2004.11.15- SLIDE 18
Overview
• Data Warehouses and Merging
Information Resources
• What is a Data Warehouse?
• History of Data Warehousing
• Types of Data and Their Uses
• Data Warehouse Architectures
• Data Warehousing Problems and Issues
2004.11.15- SLIDE 22
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
[Figure: heterogeneous sources: Personal Databases, Scientific Databases, Digital Libraries, World Wide Web]
Slide credit: J. Hammer
2004.11.15- SLIDE 23
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational
systems (vertical stove pipes)
• Result of application (user)-driven
development of operational systems
[Figure: vertical stove pipes for Sales Administration, Finance, Manufacturing, ..., built from separate systems such as Sales Planning, Suppliers, Num. Control, Stock Mngmt, Debt Mngmt, Inventory, ...]
Slide credit: J. Hammer
2004.11.15- SLIDE 24
Goal: Unified Access to Data
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
[Figure: an Integration System giving unified access to the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
Slide credit: J. Hammer
2004.11.15- SLIDE 25
The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: Clients query an Integration System (with Metadata), which reaches each Source through its own Wrapper]
Slide credit: J. Hammer
2004.11.15- SLIDE 26
Disadvantages of Query-Driven Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for
frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
Slide credit: J. Hammer
2004.11.15- SLIDE 27
The Warehousing Approach
• Information integrated in advance
• Stored in WH for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) fills it from each Source through an Extractor/Monitor]
Slide credit: J. Hammer
2004.11.15- SLIDE 28
Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at
sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
Slide credit: J. Hammer
2004.11.15- SLIDE 29
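The trade-off between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy sources, their rows, and the function names are hypothetical and do not come from the lecture.

```python
# Hypothetical toy sources; names and data are illustrative only.
SOURCES = {
    "sales_db": [{"item": "Computer", "qty": 2}, {"item": "Printer", "qty": 1}],
    "web_shop": [{"item": "Computer", "qty": 5}],
}

def fetch(source_name):
    """Stand-in for a wrapper call that contacts a (possibly slow) source."""
    return list(SOURCES[source_name])

def query_driven_total(item):
    """Lazy approach: contact every source at query time, then integrate."""
    return sum(row["qty"] for name in SOURCES for row in fetch(name)
               if row["item"] == item)

def load_warehouse():
    """Eager approach: integrate in advance into a summarized local store."""
    warehouse = {}
    for name in SOURCES:
        for row in fetch(name):
            warehouse[row["item"]] = warehouse.get(row["item"], 0) + row["qty"]
    return warehouse

warehouse = load_warehouse()                  # done once, ahead of queries
fresh = query_driven_total("Computer")        # computed from the sources
cached = warehouse["Computer"]                # no source access needed

# A later change at a source is visible to the lazy query only.
SOURCES["web_shop"].append({"item": "Computer", "qty": 1})
fresh_after = query_driven_total("Computer")  # reflects the new row
stale_after = warehouse["Computer"]           # unchanged until the next refresh
```

The warehouse answers without touching the sources, but its answer lags behind until the next load, which matches the "not necessarily most current information" caveat.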
Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large
numbers of sources
– Clients with unpredictable needs
Slide credit: J. Hammer
2004.11.15- SLIDE 30
Data Warehouse Evolution
[Figure: timeline with ticks at 1960, 1975, 1980, 1985, 1990, 1995, 2000]
• Eras: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management
• Milestones: Relational Databases, Company DWs, PC’s and Spreadsheets, End-user Interfaces, Data Replication Tools, 1st DW Article, DW Confs., “Building the DW” by Inmon (1992), Vendor DW Frameworks
Slide credit: J. Hammer
2004.11.15- SLIDE 31
What is a Data Warehouse?
“A Data Warehouse is a
– subject-oriented,
– integrated,
– time-variant,
– non-volatile
collection of data used in support of
management decision making
processes.”
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11
2004.11.15- SLIDE 32
DW Definition…
• Subject-Oriented:
– The data warehouse is organized around the
key subjects (or high-level entities) of the
enterprise. Major subjects include
• Customers
• Patients
• Students
• Products
• Etc.
2004.11.15- SLIDE 33
DW Definition…
• Integrated
– The data housed in the data warehouse are
defined using consistent
• Naming conventions
• Formats
• Encoding Structures
• Related Characteristics
2004.11.15- SLIDE 34
DW Definition…
• Time-variant
– The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business
2004.11.15- SLIDE 35
DW Definition…
• Non-volatile
– Data in the data warehouse are loaded and
refreshed from operational systems, but
cannot be updated by end-users
2004.11.15- SLIDE 36
What is a Data Warehouse?
A Practitioner’s Viewpoint
• “A data warehouse is simply a single,
complete, and consistent store of data
obtained from a variety of sources and
made available to end users in a way they
can understand and use it in a business
context.”
-- Barry Devlin, IBM Consultant
Slide credit: J. Hammer
2004.11.15- SLIDE 37
A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive decision
makers and analysts
2004.11.15- SLIDE 38
… Cont’d
• Large volume of data (GB, TB)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios
Slide credit: J. Hammer
2004.11.15- SLIDE 39
Warehouse is a Specialized DB
Standard DB
• Mostly updates
• Many small transactions
• MB - GB of data
• Current snapshot
• Index/hash on p.k.
• Raw data
• Thousands of users (e.g., clerical users)
Warehouse
• Mostly reads
• Queries are long and complex
• GB - TB of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users (e.g., decision-makers, analysts)
Slide credit: J. Hammer
2004.11.15- SLIDE 40
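The two workloads can be contrasted with a small sketch using Python's built-in sqlite3 module. The `sale` table and its rows are hypothetical; a real warehouse would be a separate, much larger system, so this only illustrates the shape of the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (id INTEGER PRIMARY KEY, item TEXT, "
             "year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sale (item, year, amount) VALUES (?, ?, ?)",
    [("Computer", 2003, 1200.0), ("Printer", 2003, 300.0),
     ("Computer", 2004, 1100.0), ("Computer", 2004, 1150.0)])

# Operational style: a small transaction touching one row via the primary key.
conn.execute("UPDATE sale SET amount = 1250.0 WHERE id = 1")

# Warehouse style: a read-only scan that summarizes history.
summary = conn.execute(
    "SELECT year, item, SUM(amount) FROM sale "
    "GROUP BY year, item ORDER BY year, item").fetchall()
```

The update benefits from the primary-key index, while the summary query has to read every row, which is why the two kinds of system are tuned differently.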
Summary
[Figure: architecture summary with the components Business Information Guide, Data Warehouse Catalog, Business Information Interface, Data Warehouse, Data Warehouse Population, Enterprise Modeling, and Operational Systems]
Slide credit: J. Hammer
2004.11.15- SLIDE 41
Warehousing and Industry
• Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• WalMart has largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7TB in warehouse
– 40-50GB per day
Slide credit: J. Hammer
2004.11.15- SLIDE 42
Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a textbook
Slide credit: J. Hammer
2004.11.15- SLIDE 43
Data Warehousing Architecture
2004.11.15- SLIDE 44
“Ingest”
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) ingests data through Extractor/Monitors attached to Source / File, Source / DB, ..., Source / External]
2004.11.15- SLIDE 45
Data Warehouse Architectures:
Conceptual View
• Single-layer
– Every data element is stored once only
– Virtual warehouse
[Figure: operational systems and informational systems both work on “Real-time data”]
• Two-layer
– Real-time + derived data
– Most commonly used approach in industry today
[Figure: operational systems work on Real-time data; informational systems work on Derived Data]
Slide credit: J. Hammer
2004.11.15- SLIDE 46
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived
data really requires two steps
[Figure: three layers: Real-time data (operational systems), Reconciled Data, and Derived Data (informational systems); annotations: View level, “Particular informational needs”; Physical Implementation of the Data Warehouse]
Slide credit: J. Hammer
2004.11.15- SLIDE 47
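The two steps can be sketched as a small Python pipeline. Everything here is an assumption made for illustration: the two operational systems, their field names, and the fixed exchange rate are hypothetical.

```python
# Real-time data as held by two hypothetical operational systems.
orders_eu = [{"cust": "ACME Ltd", "amount_eur": 100.0}]
orders_us = [{"customer": "Acme Ltd.", "amount_usd": 50.0}]
EUR_TO_USD = 1.30    # assumed fixed rate, for illustration only

def reconcile():
    """Step 1: one consistent schema, naming, and unit (reconciled data)."""
    rows = [{"customer": r["cust"].upper().rstrip("."),
             "amount_usd": r["amount_eur"] * EUR_TO_USD} for r in orders_eu]
    rows += [{"customer": r["customer"].upper().rstrip("."),
              "amount_usd": r["amount_usd"]} for r in orders_us]
    return rows

def derive(reconciled):
    """Step 2: summarize for a particular informational need (derived data)."""
    totals = {}
    for r in reconciled:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount_usd"]
    return totals

totals = derive(reconcile())
```

Keeping the reconciled layer separate means several derived views can be built from the same cleaned data without repeating the reconciliation work.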
Issues in Data Warehousing
• Warehouse Design
• Extraction
– Wrappers, monitors (change detectors)
• Integration
– Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)
Slide credit: J. Hammer
2004.11.15- SLIDE 48
Data Warehousing: Two Distinct Issues
• (1) How to get information into warehouse
– “Data warehousing”
• (2) What to do with data once it’s in
warehouse
– “Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)
Slide credit: J. Hammer
2004.11.15- SLIDE 49
Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”
Slide credit: J. Hammer
2004.11.15- SLIDE 50
Wrapper
• Converts data and queries from one data model to
another
• Extends query capabilities for sources with
limited capabilities
[Figure: a Wrapper sits in front of a Source and translates Queries and Data between Data Model A and Data Model B]
Slide credit: J. Hammer
2004.11.15- SLIDE 51
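A minimal wrapper sketch in Python, assuming a flat-file source that can only hand over its full contents. The class name, the file contents, and the `select` method are hypothetical; they only illustrate the two roles listed on the slide.

```python
import csv
import io

# Hypothetical flat-file source: it can only hand over its full contents.
FLAT_FILE = "item,clerk\nComputer,Mary\nPrinter,John\nMonitor,Mary\n"

class FlatFileWrapper:
    """Presents a CSV file as a relation of dict tuples."""

    def __init__(self, text):
        self._text = text

    def scan(self):
        # Data conversion: lines of text become attribute/value tuples.
        return list(csv.DictReader(io.StringIO(self._text)))

    def select(self, **conditions):
        # Query capability the source lacks: equality selection,
        # evaluated inside the wrapper after a full scan.
        return [row for row in self.scan()
                if all(row.get(attr) == value
                       for attr, value in conditions.items())]

sale = FlatFileWrapper(FLAT_FILE)
mary_sales = sale.select(clerk="Mary")
```

The integrator can now ask for "sales by Mary" even though the underlying file has no query interface of its own.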
Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation
[Figure: Definition → Wrapper Generator → Wrapper]
Slide credit: J. Hammer
2004.11.15- SLIDE 52
Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples
Slide credit: J. Hammer
2004.11.15- SLIDE 53
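The transformations listed above can be sketched with Python's struct module. The source record layout (a 4-byte big-endian integer followed by a NUL-terminated ASCII string) and the target field names are assumptions chosen for the example.

```python
import struct

def to_uniform(record_bytes, key, load_date):
    """Convert one source record to a uniform warehouse tuple.

    Assumed (hypothetical) source layout: a 4-byte big-endian integer
    followed by a NUL-terminated ASCII string.
    """
    (amount,) = struct.unpack(">i", record_bytes[:4])    # byte ordering
    name = record_bytes[4:].split(b"\x00", 1)[0].decode("ascii")  # terminator
    # Reorder attributes, add a key, and add a date so history can be kept.
    return {"key": key, "name": name, "amount": amount, "load_date": load_date}

raw_records = [
    struct.pack(">i", 300) + b"Printer\x00",
    struct.pack(">i", 1200) + b"Computer\x00",
]
rows = [to_uniform(raw, key=k, load_date="2004-11-15")
        for k, raw in enumerate(raw_records, start=1)]
rows.sort(key=lambda r: r["name"])                       # sort tuples
```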
Monitors
• Goal: Detect changes of interest and
propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps
Slide credit: J. Hammer
2004.11.15- SLIDE 54
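Of the techniques listed, snapshot comparison is the simplest to sketch. The Python below assumes each snapshot is a dict keyed by primary key; the function name and the sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts keyed by primary key) and report changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates

yesterday = {"Mary": 25, "John": 41, "Ann": 33}
today = {"Mary": 26, "John": 41, "Bob": 52}
inserts, deletes, updates = diff_snapshots(yesterday, today)
```

Snapshot comparison needs no cooperation from the source, but it has to read both snapshots in full, whereas triggers or a log sniffer see only the changes.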
Data Integration
• Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (wh updates)
– etc.
Slide credit: J. Hammer
2004.11.15- SLIDE 55
Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
Slide credit: J. Hammer
2004.11.15- SLIDE 56
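Duplicate detection for the "Jane Doe vs. Jane Q. Doe" case can be sketched with a simple normalization heuristic in Python. The rule used here (ignore case, punctuation, and single-letter middle initials) is an assumption for illustration; real cleansing tools use richer matching and can still produce false positives.

```python
import re

def name_key(name):
    """Normalize a name: lowercase, drop punctuation and middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    if len(tokens) > 2:
        # Keep first and last token; drop single-letter middle tokens.
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)

def find_duplicates(names):
    """Group names whose normalized keys collide."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]

dupes = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```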
Warehouse Maintenance
• Warehouse data ≈ materialized view
– Initial loading
– View maintenance
• View maintenance
Slide credit: J. Hammer
2004.11.15- SLIDE 57
Differs from Conventional View Maintenance...
• Warehouses may be highly aggregated
and summarized
• Warehouse views may be over history of
base data
• Process large batch updates
• Schema may evolve
Slide credit: J. Hammer
2004.11.15- SLIDE 58
Differs from Conventional View Maintenance...
• Base data doesn’t participate in view
maintenance
– Simply reports changes
– Loosely coupled
– Absence of locking, global transactions
– May not be queryable
Slide credit: J. Hammer
2004.11.15- SLIDE 59
Warehouse Maintenance Anomalies
• Materialized view maintenance in loosely
coupled, non-transactional environment
• Simple example
[Figure: the Data Warehouse holds the view Sold(item,clerk,age), defined as Sold = Sale ⋈ Emp; an Integrator receives changes from two sources: Sales, with Sale(item,clerk), and Comp., with Emp(clerk,age)]
Slide credit: J. Hammer
2004.11.15- SLIDE 60
Warehouse Maintenance Anomalies
[Figure: Data Warehouse with Sold(item,clerk,age); Integrator; sources Sales with Sale(item,clerk) and Comp. with Emp(clerk,age)]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
Slide credit: J. Hammer
2004.11.15- SLIDE 61
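The five steps can be replayed in a short Python simulation. It assumes the integrator handles each notification by joining the reported tuple with the current contents of the other source, which is what produces the duplicate; the variable and function names are hypothetical.

```python
def join(sale_rows, emp_rows):
    """Natural join of Sale(item, clerk) and Emp(clerk, age) on clerk."""
    return [(item, clerk, age)
            for (item, clerk) in sale_rows
            for (emp_clerk, age) in emp_rows if emp_clerk == clerk]

emp, sale, sold = [], [], []      # two sources and the warehouse view
pending = []                      # notifications not yet processed

# Steps 1 and 2: both sources change before the integrator reacts.
emp.append(("Mary", 25))
pending.append(("emp", ("Mary", 25)))
sale.append(("Computer", "Mary"))
pending.append(("sale", ("Computer", "Mary")))

# Steps 3 and 4: each notification is joined with the *current* state
# of the other source, which already contains the later update.
for source, row in pending:
    if source == "emp":
        sold.extend(join(sale, [row]))     # Sale joined with (Mary,25)
    else:
        sold.extend(join([row], emp))      # (Computer,Mary) joined with Emp

correct = join(sale, emp)                  # what the view should contain
```

After the loop, `sold` holds the tuple (Computer, Mary, 25) twice while `correct` holds it once: the sources and the integrator are not coordinated by a transaction, so both updates are counted against each other.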
Maintenance Anomaly - Solutions
• Incremental update algorithms (ECA,
Strobe, etc.)
• Research issues: Self-maintainable views
– What views are self-maintainable
– Store auxiliary views so original + auxiliary
views are self-maintainable
Slide credit: J. Hammer
2004.11.15- SLIDE 62
Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Emp
– If Emp.clerk is key and Sale.clerk is
foreign key (with ref. int.) then no effect
• Inserts into Sale
– Maintain auxiliary view:
Emp − π_{clerk,age}(Sold)
• Deletes from Emp
– Delete from Sold based on clerk
Slide credit: J. Hammer
2004.11.15- SLIDE 63
Self-Maintainability: Examples
• Deletes from Sale
– Delete from Sold based on {item,clerk}
– Unless age at time of sale is relevant
• Auxiliary views for self-maintainability
– Must themselves be self-maintainable
– One solution: all source data
– But want minimal set
Slide credit: J. Hammer
2004.11.15- SLIDE 64
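The four cases can be put together in one Python sketch of a self-maintainable Sold view. It assumes clerk is a key of Emp and Sale.clerk is a foreign key, and it stores the auxiliary view of employees that do not appear in Sold as a dict; the class and method names are hypothetical.

```python
class SoldView:
    """Warehouse view Sold = Sale join Emp, maintained without source queries."""

    def __init__(self):
        self.sold = set()    # (item, clerk, age)
        self.aux = {}        # clerk -> age for employees absent from Sold

    def _age_in_sold(self, clerk):
        return next((a for (_, c, a) in self.sold if c == clerk), None)

    def insert_emp(self, clerk, age):
        # A new employee has no sales yet (key + foreign key assumption):
        # Sold is unchanged, only the auxiliary view grows.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        age = self._age_in_sold(clerk)
        if age is None:
            age = self.aux.pop(clerk)      # clerk now appears in Sold
        self.sold.add((item, clerk, age))

    def delete_emp(self, clerk):
        self.sold = {t for t in self.sold if t[1] != clerk}
        self.aux.pop(clerk, None)

    def delete_sale(self, item, clerk):
        gone = {t for t in self.sold if t[0] == item and t[1] == clerk}
        self.sold -= gone
        if gone and self._age_in_sold(clerk) is None:
            self.aux[clerk] = next(iter(gone))[2]   # clerk left Sold again

view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
view.insert_sale("Printer", "Mary")
view.delete_sale("Computer", "Mary")
```

Every update is applied using only Sold and the auxiliary view, so the warehouse never has to query the Sales or Comp. sources.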
Partial Self-Maintainability
• Avoid (but don’t prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Sale
– Check if clerk already in Sold, go to source
if not
– Or replicate all clerks over age 30
– Or ...
Slide credit: J. Hammer
2004.11.15- SLIDE 65
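The first option, checking Sold before going to the source, can be sketched as follows. The `fetch_age` callback stands in for a query against the Emp source and is hypothetical, as is the sample data.

```python
def insert_sale(sold, item, clerk, fetch_age):
    """Add a sale to Sold, contacting the source only for unknown clerks."""
    for (_, known_clerk, age) in sold:
        if known_clerk == clerk:
            break                       # age already available in Sold
    else:
        age = fetch_age(clerk)          # fall back to the Emp source
    sold.append((item, clerk, age))

source_calls = []

def fetch_age(clerk):
    """Hypothetical source lookup that records each call."""
    source_calls.append(clerk)
    return {"Mary": 25, "John": 41}[clerk]

sold = []
insert_sale(sold, "Computer", "Mary", fetch_age)   # goes to the source
insert_sale(sold, "Printer", "Mary", fetch_age)    # answered from Sold
```

Only the first sale for a clerk costs a source access; later sales by the same clerk are handled entirely at the warehouse.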
Warehouse Specification (ideally)
[Figure: View Definitions enter a Warehouse Configuration Module, which supplies Integration rules to the Integrator and Change Detection Requirements to the Extractor/Monitors; also shown: Warehouse, Metadata]
Slide credit: J. Hammer
2004.11.15- SLIDE 66
Optimization
• Update filtering at extractor
– Similar to irrelevant updates in constraint and
view maintenance
• Multiple view maintenance
– If warehouse contains several views
– Exploit shared sub-views
Slide credit: J. Hammer
2004.11.15- SLIDE 67
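Update filtering at the extractor can be sketched as a relevance test in Python. The assumption here is that the warehouse view uses only the clerk and age attributes, so updates that touch other attributes need not be sent; the update format and names are hypothetical.

```python
VIEW_ATTRS = {"clerk", "age"}            # attributes the warehouse view uses

def is_relevant(update):
    """An update matters only if it touches an attribute used by the view."""
    return bool(VIEW_ATTRS.intersection(update["changed"]))

updates = [
    {"clerk": "Mary", "changed": {"phone": "555-0100"}},
    {"clerk": "Mary", "changed": {"age": 26}},
]
to_send = [u for u in updates if is_relevant(u)]
```

Dropping the phone update at the extractor saves both network traffic and integrator work, since it could never change the view.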
Additional Research Issues
• Historical views of non-historical data
• Expiring outdated information
• Crash recovery
• Addition and removal of information
sources
– Schema evolution
Slide credit: J. Hammer
2004.11.15- SLIDE 68
More Information on DW
• Agosta, Lou, The Essential Guide to Data
Warehousing. Prentice Hall PTR, 1999.
• Devlin, Barry, Data Warehouse, from
Architecture to Implementation. Addison-Wesley,
1997.
• Inmon, W.H., Building the Data Warehouse.
John Wiley, 1992.
• Widom, J., “Research Problems in Data
Warehousing.” Proc. of the 4th Intl. CIKM Conf.,
1995.
• Chaudhuri, S., Dayal, U., “An Overview of Data
Warehousing and OLAP Technology.” ACM
SIGMOD Record, March 1997.
2004.11.15- SLIDE 69