cse4701chap26and28and29

advertisement
Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e
Database System Architectures, Data
CSE
4701
Mining/Warehousing, Web DB
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818


A portion of these slides are being used with the permission of Dr. Ling Lui,
Associate Professor, College of Computing, Georgia Tech.
Remaining slides represent new material.
Chaps26.28.29-1
Classical and Distributed Architectures

CSE
4701


Classic/Centralized DBMS Dominated the
Commercial Market from 1970s Forward
Problems of this Approach
 Difficult to Scale w.r.t. Performance Gains
 If DB Overloaded, replace with a Faster Computer
 this can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a
Number of Issues
 Improved Performance
 Putting Data “Near” Location where it is Needed
 Replication of Data for Fault Tolerance
 Vertical and Horizontal Partitioning of DB Tuples
Chaps26.28.29-2
Common Features of Centralized DBMS

CSE
4701


Data Independence
 High-Level Representation via Conceptual and
External Schemas
 Physical Representation (Internal Schema) Hidden
Program Independence
 Multiple Applications can Share Data
 Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
 Single, Unique, Conceptual Schema
 Shared Database
 Almost No Data Redundancy
 Controlled Data Access Reduces Inconsistencies
 Programs Execute with Consistent Results
Chaps26.28.29-3
Common Features of Centralized DBMS

CSE
4701
Promote Sharing: Automatically Provided via CC
 No Longer Programmatic Issue
 Most DBMS Offer Locking for Key Shared Data
 Oracle Allows Locks on Data Item (Attributes)
 For Example, Controlling Access to Shared Identifier




Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement
of Data Consistency via Integrity Constraints/Rules
Data Resiliency
 Physical Integrity of Data in the Presence of Faults
and Errors
 Supported by DB Recovery
Data Security: Control Access for Authorized Users
Against Sensitive Data
Chaps26.28.29-4
Shared Nothing Architecture

CSE
4701



In this Architecture, Each DBMS
Operates Autonomously
There is No Sharing
 Three Separate DBMSs on
Three Different Computers
Applications/Clients Must Know
About the External Schemas of
all Three DBMSs for
 Database Retrieval
 Client Processing
Complicates Client
 Different DBMS Platforms
(Oracle, Sybase, Informix, ..)
 Different Access Modes
(Query, Embedded, ODBC)
 Difficult for SWE to Code
Chaps26.28.29-5
Difficulty in Access – Manage Multiple APIs

CSE
4701
Each Platform has a Different API
 API1 , API3 , …. , APIn
 An App Programmer Must Utilize All three APIs which
could differ by PL – C++, C, Java, REST, etc.
 Any interactions Across 3 DBs – must be
programmatically handled without DB Capabilities
API1
API2
APIn
Chaps26.28.29-6
NW Architecture with Centralized DB

CSE
4701
High-Speed NWs/WANs Spawned Centralized DB
Accessible Worldwide
 Clients at Any Site can Access Repository
 Data May be “Far” Away - Increased Access Time
 In Practice, Each Remote Site Needs only Portion
of the Data in DB1 and/or DB2
 Inefficient, no Replication w.r.t. Failure
Chaps26.28.29-7
Fully Distributed Architecture

CSE
4701



The Five Sites (Chicago, SF, LA, NY, Atlanta) each
have a “Portion” of the Database - its Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at
Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Chaps26.28.29-8
Distributed Database Concepts

CSE
4701


A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes a Unit of
execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as
 Collection of multiple logically related database
distributed over a computer network
 Distributed database management system as a
software system that manages a distributed
database while making the distribution transparent
to the user.
Chaps26.28.29-9
Goals of DDBMS

CSE
4701



Support User Distribution Across Multiple Sites
 Remote Access by Users Regardless of Location
 Distribution and Replication of Database Content
Provide Location Transparency
 Users Manipulate their Own Data
 Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
 Transaction Control Hides Distribution
 CC and Serializability - Must be Extended
Minimize Communications Cost
 Optimize Use of Network - a Critical Issue
 Distribute DB Design Supported by Partitioning
(Fragmentation) and Replication
Chaps26.28.29-10
Goals of DDBMS

CSE
4701


Improve Response Time for DB Access
 Use a More Sophisticated Load Control for
Transaction Processing
 However, Synchronization Across Sites May
Introduce Additional Overhead
System Availability
 Site Independence in the Presence of Site Failure
 Subset of Database is Always Available
 Replication can Keep All Data Available, Even
When Multiple Sites Fail
Modularity
 Incremental Growth with the Addition of Sites
 Dedicate Sites to Specific Tasks
Chaps26.28.29-11
Advantages of DDBMS

CSE
4701

There are Four Major Advantages
Transparency
 Distribution/NW Transparency
 User Doesn’t Know about NW Configuration (Location
Transparency)
 User can Find Object at any Site (Naming
Transparency)

Replication Transparency (see next PPT)
 User Doesn’t Know Location of Data
 Replicas are Transparently Accessible

Fragmentation Transparency
 Horizontal Fragmentation (Distribute by Row)
 Vertical Fragmentation (Distribute by Column)
Chaps26.28.29-12
Data Distribution and Replication
CSE
4701
Chaps26.28.29-13
Other Advantages of DDBMS

CSE
4701


Increased Reliability and Availability
 Reliability - System Always Running
 Availability - Data Always Present
 Achieved via Replication and Distribution
 Ability to Make Single Query for Entire DDBMS
Improved Performance
 Sites Able to Utilize Data that is Local for
Majority of Queries
Easier Expansion
 Improve Performance of Site by
 Upgrading Processor of Computer
 Adding Additional Disks
 Splitting a Site into Two or More Sites

Expansion over Time as Business Grows
Chaps26.28.29-14
Challenges of DDBMS

CSE
4701



Tracking Data - Meta Data More Complex
 Must Track Distribution (where is the Data)
 V & H Fragmentation (How is Data Split)
 Replication (Multiple Copies for Consistency)
Distributed Query Processing
 Optimization, Accessibility, etc., More Complex
 Block Analysis of Data Size Must also Now
Consider the NW Transmitting Time
Distributed Transaction Processing
 TP Potentially Spans Multiple Sites
 Submit Query to Multiple Sites
 Collect and Collate Results
Distributed Concurrency Control Across Nodes
Chaps26.28.29-15
Challenges of DDBMS

CSE
4701



Replicated Data Management
 TP Must Choose the Replica to Access
 Updates Must Modify All Replica Copies
Distributed Database Recovery
 Recovery of Individual Sites
 Recovery Across DDBMS
Security
 Local and Remote Authorization
 During TP, be Able to Verify Remote Privileges
Distributed Directory Management
 Meta-Data on Database - Local and Remote
 Must maintain Replicas of this - Every Site Tracks
the Meta-Data for All Sites
Chaps26.28.29-16
A Complete Schema with Keys ...
CSE
4701
Keys Allow us to
Establish Links
Between Relations
Chaps26.28.29-17
…and Corresponding DB Tables
CSE
4701
which Represent Tuples/Instances of Each Relation
A
S
C
null
W
B
null
null
1
4
5
5
Chaps26.28.29-18
…with Remaining DB Tables
CSE
4701
Chaps26.28.29-19
What is Fragmentation?

CSE
4701

Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
 Horizontal Fragmentation
 Given a Relation R with n Total Tuples, Spread Entire
Tuples Across Multiple Sites
 Each Site has a Subset of the n Tuples
 Essentially Fragmentation is a Selection

Vertical Fragmentation
 Given a Relation R with m Attributes and n Total
Tuples, Spread the Columns Across Multiple Sites
 Essentially Fragmentation is a Projection
 Not Generally Utilized in Practice

In Both Cases, Sites can Overlap for Replication
Chaps26.28.29-20
Horizontal Fragmentation

CSE
4701




A horizontal subset of a relation which contain those
of tuples which satisfy selection conditions.
Consider Employee relation with condition DNO = 5
All tuples satisfying this create a subset which will be
a horizontal fragment of Employee relation.
A selection condition may be composed of several
conditions connected by AND or OR.
Derived horizontal fragmentation:
 Partitioning of a primary relation to other
secondary relations which are related with Foreign
keys.
Chaps26.28.29-21
Horizontal Fragmentation

Site 2 Tracks All Information Related to Dept. 5
CSE
4701
Chaps26.28.29-22
Horizontal Fragmentation

CSE
4701

Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases,
if s/he Works on a Project for Both Departments
Chaps26.28.29-23
Refined Horizontal Fragmentation

CSE
4701


Further Fragment from Site
2 based on Dept. that
Employee Works in
Notice that G1 + G2 + G3 is
the Same as WORKS_ON5
there is no Overlap
Chaps26.28.29-24
Refined Horizontal Fragmentation

CSE
4701


Further Fragment from Site
3 based on Dept. that
Employee Works in
Notice that G4 + G5 + G6 is
the Same as WORKS_ON4
Note Some Fragments can
be Empty
Chaps26.28.29-25
Vertical Fragmentation

CSE
4701


Subset of a relation created via a subset of columns.
 A vertical fragment of a relation will contain
values of selected columns.
 There is no selection condition used in vertical
fragmentation.
 A strict vertical slice/partition
Consider the Employee relation.
 A vertical fragment of can be created by keeping
the values of Name, Bdate, Sex, and Address.
Since no condition for creating a vertical fragment
 Each fragment must include the primary key
attribute of the parent relation Employee.
 All vertical fragments of a relation are connected.
Chaps26.28.29-26
Vertical Fragmentation Example

CSE
4701

Partition the Employee Table as Below
Notice Each Vertical Fragment Needs Key Column
EmpDemo
EmpSupvrDept
Chaps26.28.29-27
Homogeneous DDBMS

CSE
4701
Homogeneous
 Identical Software (w.r.t. Database)
 One DB Product (e.g., Oracle, DB2, Sybase) is
Distributed and Available at All Sites
 Uniformity w.r.t. Administration, Maintenance,
Client Access, Users, Security, etc.
 Interaction by Programmatic Clients is Consistent
(e.g., JDBC or ODBC or REST API …)
Chaps26.28.29-28
Non-Federated Heterogeneous DDBMS

CSE
4701
Non-Federated Heterogeneous
 Different Software (w.r.t. Database)
 Multiple DB Products (e.g., Oracle at One Site,
Access another, Sybase, Informix, etc.)
 Replicated Administration (e.g., Users Needs
Accounts on Multiple Systems)
 Varied Programmatic Access - SWEs Must Know
All Platforms/Client Software Complicated
 Very Close to Shared Nothing Architecture
Chaps26.28.29-29
Federated DDBMS

CSE
4701

Federated
 Multiple DBMS
Platforms Overlaid
with a Global
Schema View
 Single External
Schema Combines
Schemas from all
Sites
Multiple Data Models
 Relational in one
Component DBS
 Object-Oriented in
another DBS
 Hierarchical in a
3rd DBS
Chaps26.28.29-30
Federated DBMS Issues

CSE
4701



Differences in Data Models
 Reconcile Relational vs. Object-Oriented Models
 Each Different Model has Different Capabilities
 These Differences Must be Addressed in Order to
Present a Federated Schema
Differences in Constraints
 Referential Integrity Constraints in Different DBSs
 Different Constraints on “Similar” Data
 Federated Schema Must Deal with these Conflicts
Differences in Query Languages
 SQL-89, SQL-92, SQL2, SQL3
 Specific Types in Different DBMS (Oracle Blobs )
Differences in Key Processing & Timestamping
Chaps26.28.29-31
Heterogeneous Distributed Database Systems

CSE
4701

Federated: Each site may run different database system but the
data access is managed through a single conceptual schema.
 The degree of local autonomy is minimum.
 Each site must adhere to a centralized access policy
 There may be a global schema.
Multi-database: There is no one conceptual global schema
 For data access a schema is constructed dynamically as
needed by the application software.
Object Unix Relational
Unix
Oriented Site 5
Site 1
Hierarchical
Window
Communications
Site 4
network
Object
Oriented
Network
DBMS
Site 3
Linux
Site 2
Linux
Relational
Chaps26.28.29-32
Query Processing in Distributed Databases
Issues
CSE
4701

Cost of transferring data (files and results) over the network.
 This cost is usually high so some optimization is necessary.
 Example relations: Employee at site 1 and Department at Site 2
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size =
106 bytes.
Fname
Minit
Lname
SSN
Bdate
Address
Sex
Salary
Superssn
Dno
– Department at Site 2. 100 rows. Row size = 35 bytes. Table size =
3,500 bytes.
Dname
Dnumber
Mgrssn
Mgrstartdate
 Q: For each employee, retrieve employee name and department
name Where the employee works.
 Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)
Chaps26.28.29-33
Query Processing in Distributed Databases

CSE
4701
Result
 The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
 Suppose each result tuple is 40 bytes long.
 The query is submitted at site 3 and the result is
sent to this site.
 Problem: Employee and Department relations are
not present at site 3.
Chaps26.28.29-34
Query Processing in Distributed Databases

CSE
4701
Strategies:
1. Transfer Employee and Department to site 3.
 Total transfer bytes = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
 Query result size = 40 * 10,000 = 400,000 bytes. Total
transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
 Total bytes transferred = 400,000 + 3500 = 403,500 bytes.

Optimization criteria: minimizing data transfer.

Preferred approach: strategy 3.
Chaps26.28.29-35
Query Processing in Distributed Databases

CSE
4701

Consider the query
 Q’: For each department, retrieve the department
name and the name of the department manager
Relational Algebra expression:
 Fname,Lname,Dname (Employee
Mgrssn = SSN
Department)
Chaps26.28.29-36
Query Processing in Distributed Databases

CSE
4701
Result of query has 100 tuples, assuming that every
department has a manager, the execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
 Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3. Query result size = 40 * 100 = 4000
bytes.
 Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
 Total transfer size = 4000 + 3500 = 7500 bytes.

Preferred strategy: Choose strategy 3.
Chaps26.28.29-37
Query Processing in Distributed Databases

CSE
4701
Now suppose the result site is 2. Possible strategies :
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
 Total transfer size = 1,000,000 bytes for both queries Q
and Q’.
2. Transfer Department relation to site 1, execute join
at site 1 and send the result back to site 2.
 Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.
Chaps26.28.29-38
DDBS Concurrency Control and Recovery

CSE
4701
Distributed Databases encounter a number of
concurrency control and recovery problems which are
not present in centralized databases, including:
 Dealing with multiple copies of data items
 How are they All Updated if Needed?

Failure of individual sites
 How are Queries Restarted or Rerouted?

Communication link failure
 Network Failure

Distributed commit
 How to Know All Updates Done at all Sites?

Distributed deadlock
 How to Detect and Recover?
Chaps26.28.29-39
Data Warehousing and Data Mining

CSE
4701

Data Warehousing
 Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
 Underlying Infrastructure in Support of Mining
 Provides Means to Interact with Multiple DBs
 OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
 Discovery of Information in a Vast Data Sets
 Search for Patterns and Common Features based
 Discover Information not Previously Known
 Medical Records Accessible Nationwide
 Research/Discover Cures for Rare Diseases

Relies on Knowledge Discovery in DBs (KDD)
Chaps26.28.29-40
What is Purpose of a Data Warehouse?

CSE
4701





Traditional databases are not optimized for data access but have
to balance the requirement of data access to ensure integrity
Most data warehouse users need only read access, but need the
access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes
from multiple databases and these analysis are recurrent and
predictable to be able to design software meet requirements.
Critical for tools that provide decision makers with information
to make decisions quickly and reliably based on historical data.
Aforementioned Charactereistics achieved by Data
Warehousing and Online analytical processing (OLAP)
W. H Inmon characterized a data warehouse as:
 “A subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management’s decisions.”
Chaps26.28.29-41
Data Warehousing and OLAP

CSE
4701


A Data Warehouse
 Database is Maintained Separately from an
Operational Database
 “A Subject-Oriented, Integrated, Time-Variant,
and Non-Volatile Collection of Data in Support for
Management’s Decision Making Process
[W.H.Inmon]”
OLAP (on-Line Analytical Processing)
 Analysis of Complex Data in the Warehouse
 Attempt to Attain “Value” through Analysis
 Relies on Trained and Adept Skilled Knowledge
Workers who Discover Information
Data Mart
 Organized Data for a Subset of an Organization
Chaps26.28.29-42
Conceptual Structure of Data Warehouse

CSE
4701
Data Warehouse processing involves
 Cleaning and reformatting of data
 OLAP
 Data Mining
Back Flushing
Data Warehouse
OLAP
Cleaning
Databases
Other Data Inputs
Data
Reformatting
Metadata
DSSI
EIS
Data
Mining
Updates/New Data
Chaps26.28.29-43
Building a Data Warehouse

CSE
4701
Option 1
 Leverage Existing
Repositories
 Collate and Collect
 May Not Capture All
Relevant Data

Option 2
 Start from Scratch
 Utilize Underlying
Corporate Data
Corporate
data warehouse
Option 1:
Consolidate Data Marts
Option 2:
Build from
scratch
Data Mart
...
Data Mart
Data Mart
Data Mart
Corporate data
Chaps26.28.29-44
Comparison with Traditional Databases

CSE
4701




Data Warehouses are mainly optimized for appropriate
data access
 Traditional databases are transactional
 Optimized for both access mechanisms and
integrity assurance measures.
Data warehouses emphasize historical data as their
support time-series and trend analysis.
Compared with transactional databases, data
warehouses are nonvolatile.
In transactional databases, transaction is the
mechanism change to the database.
In warehouse, data is relatively coarse grained and
refresh policy is carefully chosen, usually incremental.
Chaps26.28.29-45
Classification of Data Warehouses

CSE
4701

Generally, Data Warehouses are an order of
magnitude larger than the source databases.
The sheer volume of data is an issue, based on which
Data Warehouses could be classified as follows.
 Enterprise-wide data warehouses
 Huge projects requiring massive investment of time and
resources.

Virtual data warehouses
 Provide views of operational databases that are
materialized for efficient access.

Data marts
 Generally targeted to a subset of organization, such as a
department, and are more tightly focused.
Chaps26.28.29-46
Data Warehouse Characteristics

CSE
4701



Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
 Store of Integrated Data from Multiple Sources
 Processed into Multi-Dimensional Model
Warehouse Supports of
 Times Series and Trend Analysis
 “Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
 Doesn’t Dramatically Change Over Time
 Updates at Regular Intervals
 Specific Refresh Policy Regarding Some Data
Chaps26.28.29-47
Three Tier Architecture
CSE
4701
monitor
External data sources
OLAP Server
integrator
Summarization
report
Operational databases
Extraxt
Transform
Load
Refresh
serve
Data Warehouse
Query report
Data mining
metadata
Data marts
Chaps26.28.29-48
Data Modeling for Data Warehouses

CSE
4701

Traditional Databases generally deal with twodimensional data (similar to a spread sheet).
 However, querying performance in a multidimensional data storage model is much more
efficient.
Data warehouses can take advantage of this feature as
generally these are
 Non volatile
 The degree of predictability of the analysis that
will be performed on them is high.
Chaps26.28.29-49
What is a Multi-Dimensional Data Cube?

CSE
4701



Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
Aggregate Raw Data!
Chaps26.28.29-50
Multi-Dimensional Schemas

CSE
4701



Supporting Multi-Dimensional Schemas Requires
Two Types of Tables:
 Dimension Table: Tuples of Attributes for Each
Dimension
 Fact Table: Measured/Observed Variables with
Pointers into Dimension Table
Star Schema
 Characterizes Data Cubes by having a Single Fact
Table for Each Dimension
Snowflake Schema
 Dimension Tables from Star Schema are
Organized into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
Chaps26.28.29-51
Data Modeling for Data Warehouses

CSE
4701
Advantages of a multi-dimensional model
 Multi-dimensional models lend themselves readily
to hierarchical views in what is known as
 roll-up display &
 drill-down display.

The data can be directly queried in any
combination of dimensions, bypassing complex
database queries.
Chaps26.28.29-52
Data Warehouse Design

CSE
4701


Most of Data Warehouses use a Start Schema to
Represent Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Provides its Multidimensional Coordinates
and Stores Measures for those Coordinates
A Fact Table Connects All Dimension Tables with a
Multiple Join
 Each Tuple in Fact Table Represents the Content
of One Dimension
 Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimensional Tables
 Links Between the Fact Table and the Dimensional
Tables for a Shape Like a Star
Chaps26.28.29-53
Sample Fact Tables
CSE
4701
Chaps26.28.29-54
Example of Star Schema
CSE
4701
Product
Date
Date
Month
Year
Sale Fact Table
Date
ProductNo
ProdName
ProdDesc
Categoryu
Product
Store
Customer
Unit_Sales
Store
StoreID
City
State
Country
Region
Dollar_Sales
Customer
CustID
CustName
CustCity
CustCountry
Chaps26.28.29-55
A Second Example of Star Schema …
CSE
4701
Chaps26.28.29-56
and Corresponding Snowflake Schema
CSE
4701
Chaps26.28.29-57
Multi-dimensional Schemas

CSE
4701
Fact Constellation
 Fact constellation is a set of tables that share some
dimension tables.
 However, fact constellations limit the possible
queries for the warehouse.
Chaps26.28.29-58
Fact Table i2b2 (Integrating Biology &Bedside)
CSE
4701
Chaps26.28.29-59
Data Warehouse Issues

CSE
4701

Data Acquisition
 Extraction from Heterogeneous Sources
 Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
 Data Cleaning for Validity and Quality
is the Data as Expected w.r.t. Content? Value?
 Transition of Data into Data Model of Warehouse
 Loading of Data into the Warehouse
Other Issues Include:
 How Current is the Data? Frequency of Update?
 Availability of Warehouse? Dependencies of Data?
 Distribution, Replication, and Partitioning Needs?
 Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
Chaps26.28.29-60
OLAP Strategies

CSE
4701

OLAP Strategies
 Roll-Up: Summarization of Data
 Drill-Down: from the General to Specific (Details)
 Pivot: Cross Tabulate the Data Cubes
 Slice and Dice: Projection Operations Across
Dimensions
 Sorting: Ordering Result Sets
 Selection: Access by Value or Value Range
Implementation Issues
 Persistent with Infrequent Updates (Loading)
 Optimization for Performance on Queries is More
Complex - Across Multi-Dimensional Cubes
 Recovery Less Critical - Mostly Read Only
 Temporal Aspects of Data (Versions) Important
Chaps26.28.29-61
Knowledge Discovery

CSE
4701



Data Warehousing Requires Knowledge Discovery to
Organize/Extract Information Meaningfully
Knowledge Discovery
 Technology to Extract Interesting Knowledge
(Rules, Patterns, Regularities, Constraints) from a
Vast Data Set
 Process of Non-trivial Extraction of Implicit,
Previously Unknown, and Potentially Useful
Information from Large Collection of Data
Data Mining
 A Critical Step in the Knowledge Discovery
Process
 Extracts Implicit Information from Large Data Set
KDD: Knowledge Discovery and Data Mining
Chaps26.28.29-62
Steps in a KDD Process

CSE
4701








Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
 Choosing the Mining Method(s) and Algorithm(s)
 Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
 This is Not an Automated Process!
 Requires Significant Human Interaction!
Chaps26.28.29-63
Processing in a Data Warehouse

CSE
4701
Processing Types are Varied and Include:
 Roll-up: Data is summarized with increasing
generalization
 Drill-Down: Increasing levels of detail are
revealed
 Pivot: Cross tabulation is performed
 Slice and dice: Performing projection operations
on the dimensions.
 Sorting: Data is sorted by ordinal value.
 Selection: Data is available by value or range.
 Derived attributes: Attributes are computed by
operations on stored derived values.
Chaps26.28.29-64
On-Line Analytical Processing

CSE
4701

Data Cube
 A Multidimensonal Array
 Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product
Product
Store
Date
Sale
acron
Rolla,MO 7/3/99 325.24
budwiser LA,CA
5/22/99 833.92
large pants NY,NY
2/12/99 771.24
Pants
Diapers
Beer
Nuts
West
East
3’ diaper Cuba,MO 7/30/99 81.99
Region
Central
Mountain
South
Jan
Feb March April
Date
Chaps26.28.29-65
Examples of Data Mining

CSE
4701
The Slicing Action
 A Vertical or Horizontal Slice Across Entire Cube
Months
Slice
on city Atlanta
Products Sales
Products Sales
Months
Multi-Dimensional Data Cube
Chaps26.28.29-66
Examples of Data Mining

CSE
4701
The Dicing Action
 A Slide First Identifies on Dimension
 A Selection of Any Cube within the Slice which
Essentially Constrains All Three Dimensions
Months
Products Sales
Products Sales
Months
March 2000
Electronics
Atlanta
Dice on Electronics and Atlanta
Chaps26.28.29-67
Examples of Data Mining
Drill Down - Takes a Facet (e.g.,
Q1)
and Decomposes into Finer Detail
Jan Feb March
Products Sales
CSE
4701
Drill down
on Q1
Roll Up
on Location
(State, USA)
Roll Up: Combines Multiple Dimensions
From Individual Cities to State
Q1 Q2 Q3 Q4
Products Sales
Products Sales
Q1 Q2 Q3 Q4
Chaps26.28.29-68
Mining Other Types of Data

Analysis and Access Dramatically More Complicated!
CSE
4701
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
Chaps26.28.29-69
Advantages/Objectives of Data Mining

CSE
4701


Descriptive Mining
 Discover and Describe General Properties
 60% People who buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
 Infer Interesting Properties based on Available
Data
 People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
 Order from Chaos
 Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
 Impact on Marketing Strateg
Chaps26.28.29-70
Data Mining Methods

CSE
4701
Association
 Discover the Frequency of Items Occurring
Together in a Transaction or an Event
 Example
 80% Customers who Buy Milk also Buy Bread
Hence - Bread and Milk Adjacent in Supermarket
 50% of Customers Forget to Buy Milk/Soda/Drinks
Hence - Available at Register

Prediction
 Predicts Some Unknown or Missing Information
based on Available Data
 Example
 Forecast Sale Value of Electronic Products for Next
Quarter via Available Data from Past Three Quarters
Chaps26.28.29-71
Association Rules

CSE
4701



Motivated by Market Analysis
Rules of the Form
 Item1^Item2^…^ ItemkItemk+1 ^ … ^ Itemn
Example
 “Beer ^ Soft Drink  Pop Corn”
Problem: Discovering All Interesting Association
Rules in a Large Database is Difficult!
 Issues
 Interestingness
 Completeness
 Efficiency

Basic Measurement for Association Rules
 Support of the Rule
 Confidence of the Rule
Chaps26.28.29-72
Data Mining Methods

CSE
4701
Classification
 Determine the Class or Category of an Object
based on its Properties
 Example
 Classify Companies based on the Final Sale Results in
the Past Quarter

Clustering
 Organize a Set of Multi-dimensional Data Objects
in Groups to Minimize Inter-group Similarity is
and Maximize Intra-group Similarity
 Example
 Group Crime Locations to Find Distribution Patterns
Chaps26.28.29-73
Classification

CSE
4701



Classification is the process of learning a model that
is able to describe different classes of data.
Learning is supervised as the classes to be learned are
predetermined.
Learning is accomplished by using a training set of
pre-classified data.
The model produced is usually in the form of a
decision tree or a set of rules.
Chaps26.28.29-74
One Classification Example
CSE
4701

Rule extracted from the decision tree of Figure 28.7.
IF 50K > salary >= 20K
AND age >=25
THEN class is “yes”
Chaps26.28.29-75
Classification

CSE
4701


Two Stages
 Learning Stage: Construction of a Classification
Function or Model
 Classification Stage: Predication of Classes of
Objects Using the Function or Model
Tools for Classification
 Decision Tree
 Bayesian Network
 Neural Network
 Regression
Problem
 Given a Set of Objects whose Classes are Known
(Training Set), Derive a Classification Model
which can Correctly Classify Future Objects
Chaps26.28.29-76
An Example

Attributes

Class Attribute - Play/Don’t Play the Game
Training Set
 Values that Set the Condition for the Classification
 What are the Pattern Below?
CSE
4701

Attribute
Possible Values
outlook
sunny, overcast, rain
temperature continuous
humidity
continuous
windy
true, false
Outlook Temperature Humidity
sunny
85
85
overcast 83
78
sunny
80
90
sunny
72
95
sunny
72
70
…
…
…
Windy
false
false
true
false
false
…
Play
No
Yes
No
No
Yes
...
Chaps26.28.29-77
Data Mining Methods

CSE
4701
Summarization
 Characterization (Summarization) of General
Features of Objects in the Target Class
 Example
 Characterize People’s Buying Patterns on the Weekend
 Potential Impact on “Sale Items” & “When Sales Start”
 Department Stores with Bonus Coupons

Discrimination
 Comparison of General Features of Objects
Between a Target Class and a Contrasting Class
 Example
 Comparing Students in Engineering and in Art
 Attempt to Arrive at Commonalities/Differences
Chaps26.28.29-78
Summarization Technique

CSE
4701

Attribute-Oriented Induction
Generalization using Concert hierarchy (Taxonomy)
barcode category
14998
milk
brand
diaryland
content
size
Skim
2L
food
12998 mechanical MotorCraft valve 23a 12in
…
…
…
…
...
Milk
…
Skim milk … 2% milk
Category
milk
milk
…
Content Count
skim
2%
…
280
98
...
bread
White
whole
bread … wheat
Lucern … Dairyland
Wonder … Safeway
Chaps26.28.29-79
Building A Data Warehouse

CSE
4701

The builders of Data warehouse should take a broad
view of the anticipated use of the warehouse.
 The design should support ad-hoc querying
 An appropriate schema should be chosen that
reflects the anticipated usage.
The Design of a Data Warehouse involves following
steps.
 Acquisition of data for the warehouse.
 Ensuring that Data Storage meets the query
requirements efficiently.
 Giving full consideration to the environment in
which the data warehouse resides.
Chaps26.28.29-80
Building A Data Warehouse

CSE
4701
Acquisition of data for the warehouse
 The data must be extracted from multiple,
heterogeneous sources.
 Data must be formatted for consistency within the
warehouse.
 The data must be cleaned to ensure validity.
 Difficult to automate cleaning process.
 Back flushing, upgrading the data with cleaned data.


The data must be fitted into the data model of the
warehouse.
The data must be loaded into the warehouse.
 Proper design for refresh policy should be considered.
Chaps26.28.29-81
Building A Data Warehouse

CSE
4701






Storing the data according to the data model of the
warehouse
Creating and maintaining required data structures
Creating and maintaining appropriate access paths
Providing for time-variant data as new data are added
Supporting the updating of warehouse data.
Refreshing the data
Purging data
Chaps26.28.29-82
Why is Data Mining Popular?

CSE
4701
Technology Push
 Technology for Collecting Large Quantity of Data
 Bar Code, Scanners, Satellites, Cameras

Technology for Storing Large Collection of Data
 Databases, Data Warehouses
 Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web

Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
 Targeted Marketing by Age, Region, Income, etc.
 Exploiting User Preferences/Customized Shopping
Chaps26.28.29-83
Requirements & Challenges in Data Mining

CSE
4701



Security and Social
 What Information is Available to Mine?
 Preferences via Store Cards/Web Purchases
 What is Your Comfort Level with Trends?
User Interfaces and Visualization
 What Tools Must be Provided for End Users of
Data Mining Systems?
 How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
 Range from Real-Time for Some Queries to LongTerm for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
Chaps26.28.29-84
Data Mining Visualization

CSE
4701

Leverage Improving 3D Graphics and Increased PC
Processing Power for Displaying Results
Significant Research in Visualization w.r.t. Displaying
Multi-Dimensional Data
Chaps26.28.29-85
Successful Data Mining Applications

CSE
4701



Business Data Analysis and Decision Support
 Marketing, Customer Profiling, Market Analysis
and Management, Risk Analysis and Management
Fraud Detection
 Detecting Telephone Fraud, Automotive and
Health Insurance Fraud, Credit-card Fraud,
Suspicious Money Transactions (Money
Laundering)
Text Mining
 Message Filtering (Email, Newsgroups, Etc.)
 Newspaper Articles Analysis
Sports
 IBM Advanced Scout Analyzed NBA Game
Statistics (Shots Blocked, Assists and Fouls) to
Gain Competitive Advantage
Chaps26.28.29-86
Select Data Mining Products
CSE
4701
Chaps26.28.29-87
Databases on WWW

CSE
4701

Web has changed the way we do Business & Research
Facts:
 Industry Saw an Opportunity, knew it had to Move
Quickly to Capitalize
 Lots of Action, Lots of Money, Lots of Releases
 Line Between R&D is Very Narrow
 Many Researchers Moved to Industry (Trying to Return
Back to Academia)

Emergence of Java
 Java changed the way that Software was Designed,
Developed, and Utilized
 Particularly w.r.t. Web-Based Applications,
Database Interoperability, Web Architectures, etc.
 Emergence of Enterprise Computing
Chaps26.28.29-88
Internet and the Web

CSE
4701
A Major Opportunity for Business
 A Global Marketplace
 Business Across State and Country Boundaries

A Way of Extending Services
 Online Payment vs. VISA, Mastercard

A Medium for Creation of New Services
 Publishers, Travel Agents, Teller, Virtual Yellow
Pages, Online Auctions …

A Boon for Academia
 Research Interactions and Collaborations
 Free Software for Classroom/Research Usage
 Opportunities for Exploration of Technologies in
Student Projects
Chaps26.28.29-89
WWW: Three Market Segments
CSE
4701
Business to Business
Server


Corporate
Network
Server
Intranet




Decision
support
Mfg.. System
monitoring
corporate
repositories
Workgroups

Information sharing
Ordering info./status
Targeted electronic
commerce
Internet
Corporate
Server Network
Internet




Sales
Marketing
Information
Services
Server
Chaps26.28.29-90
Information Delivery Problems on the Net

CSE
4701


Everyone can Publish Information on the Web
Independently at Any Time
 Consequently, there is an Information Explosion
 Identifying Information Content More Difficult
There are too Many Search Engines but too Few
Capable of Returning High Quality Data
 Is this Still True?
Most Search Engines are Useful for Ad-hoc Searches
but Awkward for Tracking Changes
 Is this Still True?
Chaps26.28.29-91
Example Web Applications

CSE
4701


Scenario 1: World Wide Wait
 A Major Event is Underway and the Latest, Up-tothe Minute Results are Being Posted on the Web
 You Want to Monitor the Results for this
Important Event, so you Fire up your Trusty Web
Browser, Pointing at the Result Posting Site, and
Wait, and Wait, and Wait …
What is the Problem?
 The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics
of the Application and the Technology Used to
Implement the Application
Changed with Emergence of Mobile Computing?
Chaps26.28.29-92
Example Web Applications

CSE
4701


Scenario 2:
 Many Applications Today have the Need for
Tracking Changes in Local and Remote Data
Sources and Notifying Changes If Some Condition
Over the Data Source(s) is Met
 If You Want to Monitor the Changes on Web, You
Need to Fire Your Trusty Web Browser from Time
to Time, and Cache the Most Recent Result, and do
the Difference Manually Each Time You Poll the
Data Source(s) …
What is the Problem?
 Pure Pull is Not the Answer to All Problems
Changed with Emergence of Mobile Computing?
Chaps26.28.29-93
What is the Problem?

CSE
4701

Applications are Asymmetric but the Web is Not
 Computation Centric vs. Information Flow Centric
Type of Asymmetry
 Network Asymmetry
 Satellite, CATV, Mobile Clients, Etc.

Client to Server Ratio
 Too Many Clients can Swamp Servers

Data Volume
 Mouse and Key Click vs. Content Delivery

Update and Information Creation
 Clients Need to be Informed or Must Poll

What have we Seen re. Cell Networks Over Time?
Chaps26.28.29-94
Useful Solutions

CSE
4701

Combination/Interleave of Pull and Push Protocols
 User-initiated, Comprehensive Search-based
Information Delivery (Pull)
 Server-initiated Information Dissemination (Push)
Provide Support for a Variety of Data Delivery
Protocols, Frequencies, and Delivery Modes
 Information Delivery Frequencies
 Periodic, Conditional, Ad-Hoc


Information Delivery Modes
Information Delivery Protocols (IDP)
 Request/Respond, Polling, Publish/Subscribe,
Broadcast

Information Delivery Styles (IDS)
 Pull, Push, Hybrid
Chaps26.28.29-95
Information Delivery Frequencies

CSE
4701
Periodic
 Data is Delivered from a Server to Clients
Periodically
 Period can be Defined by System-default or by
Clients Using their Profiles
 Period can be Influenced by Client and Bandwidth






Mobile Device vs. PC w/Modem
PC w/DSL vs. PC w/Cable Modem
Multiple Mobile Devices of All Types
Streaming of Videos, Live Streaming of Events
Conditional (Aperiodic)
 Data is Delivered from a Server when Conditions
Installed by Clients in their Profiles are Satisfied
Ad-hoc (or Irregular)
Chaps26.28.29-96
Information Delivery Modes

CSE
4701

Uni-cast
 Data is Sent from a Data Source (a Single Server)
to Another Machine
1-to-n
 Data is Sent by a Single Data Source and Received
by Multiple Machines
 Multicast vs. Broadcast
 Multicast: Data is Sent to a Specific Set of Clients
 Broadcast: Sending Data Over a Medium which an
Unidentified or Unbounded Set of Clients can Listen
Chaps26.28.29-97
IDP: Request/Respond

CSE
4701


Semantics of Request/Respond
 Clients Send their Request to Servers to Ask the
Information of their Interest
 Servers Respond to the Client Request by
Delivering the Information Requested
 Client can Wait (Synchronous) or Not
Applications
 Most Database Systems and Web Search Engines
are Using the Request/Respond Protocol for
Client-Server Communication
What has Changed with Mobile Computing?
Chaps26.28.29-98
IDP: Programmed Polling vs. User Polling

CSE
4701


Semantics:
 Programmed Polling: a System Periodically
Sends Requests to Other Sites to Obtain Status
Information or Detect Changed Values
 User Polling: a User or Application Periodically
or Aperiodically Polls the Data Sites and Obtains
the Changes
Applications
 Programmed Polling: Save the Users from
having to Click, but does Nothing to Solve the
Scalability Problems Caused by the
Request/Respond Mechanism
What do Today’s Mobile Devices Use?
Chaps26.28.29-99
IDP: Publish/Subscribe

CSE
4701


Semantics: Servers Publish/Clients Subscribe
 Servers Publish Information Online
 Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery)
 Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic
 Danger: Subscriptions can Lead to Other
Unwanted Subscriptions
Applications
 Unicast: Database Triggers and Active Databases
 1-to-n: Online News Groups
How is this Utilized in Mobile Devices?
Chaps26.28.29-100
Information Delivery Styles

CSE
4701


Pull-Based System
 Transfer of Data from Server to Client is Initiated
by a Client Pull
 Clients Determine when to Get Information
 Potential for Information to be Old Unless Client
Periodically Pulls
Push-Based System
 Transfer of Data from Server to Client is Initiated
by a Server Push
 Clients may get Overloaded if Push is Too
Frequent
Hybrid
 Pull and Push Combined
 Pull First and then Push Continually
Chaps26.28.29-101
Summary: Pull vs. Push
CSE
4701
Request/
Respond
Pure Pull
Conditional Ad-hoc
Y
Pure Push
Hybrid
Publish/
Broadcast Periodic
Subscribe
Y
Y
Y
Y
Y
Y*
Y
Y
Y
Y
Y*
Chaps26.28.29-102
Design Options for Nodes

CSE
4701
Three Types of Nodes:
 Data Sources
 Provide Base Data which is to be Disseminated

Clients
 Who are the Net Consumers of the Information

Information Brokers
 Acquire Information from Other Data Sources, Add
Value to that Information and then Distribute this
Information to Other Consumers
 By Creating a Hierarchy of Brokers, Information
Delivery can be Tailored to the Need of Many Users

How has this Changed with Today’s Mobile
Computing?
Chaps26.28.29-103
The Next Big Challenge

CSE
4701

Interoperability
 Heterogeneous Distributed Databases
 Heterogeneous Distributed Systems
 Autonomous Applications
Scalability
 Rapid and Continuous Growth
 Amount of Data
 Variety of Data Types
 Dealing with personally identifiable information (PII)
and personal health information (PHI)
 Emergence of Fitness and Health Monitoring Apps
 Google Fit and Apple HealthKit
 New Apple ResearchKit for Medical Research
Chaps26.28.29-104
Interoperability: A Classic View
CSE
4701
Local
Schema
Simple Federation
Multiple Nested Federation
FDB Global
Schema
FDB Global
Schema 4
Federated
Integration
Federated
Integration
Local
Schema
Local
Schema
FDB 1
Local
Schema
Federation
FDB3
Federation
Chaps26.28.29-105
Java Client with Wrapper to Legacy Application
CSE
4701
Java Client
Java Application Code
WRAPPER
Mapping Classes
JAVA LAYER
Interactions Between Java Client
and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Java Client with C++/C Wrapper
NATIVE LAYER
Native Functions (C++)
RPC Client Stubs (C)
Legacy
Application
Network
Chaps26.28.29-106
COTS and Legacy Appls. to Java Clients
CSE
4701
COTS Application
Legacy Application
Java Application Code
Java Application Code
Native Functions that
Map to COTS Appl
NATIVE LAYER
Native Functions that
Map to Legacy Appl
NATIVE LAYER
JAVA LAYER
JAVA LAYER
Mapping Classes
JAVA NETWORK WRAPPER
Mapping Classes
JAVA NETWORK WRAPPER
Network
Java Client
Java Client
Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
Chaps26.28.29-107
Java Client to Legacy App via RDBS
CSE
4701
Transformed
Legacy Data
Java Client
Updated Data
Relational
Database
System(RDS)
Extract and
Generate Data
Transform and
Store Data
Legacy
Application
Chaps26.28.29-108
Database Interoperability in the Internet

CSE
4701

Technology
 Web/HTTP, JDBC/ODBC, CORBA (ORBs +
IIOP), XML, SOAP, REST API, WSDL
Architecture
Information Broker
•Mediator-Based Systems
•Agent-Based Systems
Chaps26.28.29-109
JDBC

CSE
4701

JDBC API Provides DB Access Protocols for Open,
Query, Close, etc.
Different Drivers for Different DB Platforms
JDBC API
Java
Application
Driver Manager
Driver
Oracle
Driver
Access
Driver
Driver
Sybase
Chaps26.28.29-110
Connecting a DB to the Web

CSE
4701
DBMS

CGI Script Invocation
or JDBC Invocation
Web Server
Internet

Web Server are
Stateless
DB Interactions Tend
to be Stateful
Invoking a CGI
Script on Each DB
Interaction is Very
Expensive, Mainly
Due to the Cost of
DB Open
Browser
Chaps26.28.29-111
Connecting More Efficiently

CSE
4701
DBMS
Helper
Processes
CGI Script
or JDBC
Invocation

Web Server
Internet

To Avoid Cost of
Opening Database, One
can Use Helper
Processes that Always
Keep Database Open
and Outlive Web
Connection
Newly Invoked CGI
Scripts Connect to a
Preexisting Helper
Process
System is Still Stateless
Browser
Chaps26.28.29-112
DB-Internet Architecture
CSE
4701
WWW Client
(Netscape)
WWW client
(Info. Explore)
WWW Client
(HotJava)
Internet
HTTP Server
DBWeb Gateway
DBWeb Gateway
DBWeb Gateway
DBWeb
Dispatcher
DBWeb Gateway
Chaps26.28.29-113
EJB Architecture
CSE
4701
Chaps26.28.29-114
Technology Push

CSE
4701


Computer/Communication Technology (Almost Free)
 Plenty of Affordable CPU, Memory, Disk,
Network Bandwidth
 Next Generation Internet: Gigabit Now
 Wireless: Ubiquitous, High Bandwidth
Information Growth
 Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors
 Disk Capacity on the Order of Peta-bytes
Small, Handy Devices to Access Information
The focus is to make information
available to users, in the right form, at
the right time, in the appropriate place.
Chaps26.28.29-115
Research Challenges

CSE
4701
Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together

Inherent Complexity:
 Coping with Latency (Sometimes
Unpredictable)
 Failure Detection and Recovery
(Partial Failure)
 Concurrency, Load Balancing,
Availability, Scale
 Service Partitioning
 Ordering of Distributed Events
“Accidental” Complexity:
 Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
 Autonomy: Change and Evolve
Autonomously
 Tool Deficiencies: Language Support
(Sockets,rpc), Debugging, Etc.
Chaps26.28.29-116
Infosphere
Problem: too many sources,too much information
CSE
4701
Internet:
Information Jungle
Infopipes
Clean, Reliable,
Timely Information,
Anywhere
Digital
Earth
Personalized
Filtering &
Info. Delivery
Sensors
Chaps26.28.29-117
Current State-of-Art – Has Mobile Changed This?
CSE
4701
Web
Server
Mainframe
Database
Server
Thin
Client
Chaps26.28.29-118
Infosphere Scenario – Where Does Mobile Fit?
CSE
4701
Infotaps &
Fat Clients
Sensors
Variety
of Servers
Many sources
Database
Server
Chaps26.28.29-119
Heterogeneity and Autonomy

CSE
4701
Heterogeneity:
 How Much can we Really Integrate?
 Syntactic Integration
 Different Formats and Models
 XML/JSON/RDF/OWL/SQL Query Languages

Semantic Interoperability
 Basic Research on Ontology, Etc.

Autonomy
 No Central DBA on the Net
 Independent Evolution of Schema and Content
 Interoperation is Voluntary
 Interface Technology DCOM: Microsoft Standard
 CORBA, Etc...
Chaps26.28.29-120
Security and Data Quality

CSE
4701
Security
 System Security in the Broad Sense
 Attacks: Penetrations, Denial of Service
 System (and Information) Survivability
 Security Fault Tolerance
 Replication for Performance, Availability, and
Survivability

Data Quality
 Web Data Quality Problems




Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise
Chaps26.28.29-121
Legacy Data Challenge

CSE
4701

Legacy Applications and Data
 Definition: Important and Difficult to Replace
 Typically, Mainframe Mission Critical Code
 Most are OLTP and Database Applications
Evolution of Legacy Databases
 Client-server Architectures
 Wrappers
 Expensive and Gradual in Any Case
Chaps26.28.29-122
Potential Value Added/Jumping on Bandwagon

CSE
4701




Sophisticated Query Capability
 Combining SQL with Keyword Queries
Consistent Updates
 Atomic Transactions and Beyond
But Everything has to be in a Database!
 Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
 Interoperable Query Processing
 Extended Transaction Updates
Commodities DB Software
 A Little Help is Still Good If it is Cheap
 Internet Facilitates Software Distribution
 Databases as Middleware
Chaps26.28.29-123
Concluding Remarks

CSE
4701


Four-Fold Objective
 Distributed Database Processing
 Data Warehouses
 Data Mining of Vast Information Repositories
 Web-Based Architectures for DB Interoperability
All Three are Tightly Related
 DDBMS can Improve Performance of Mining
Repositories as Backend Database Processors
 Web-Based Architectures Provide Access Means
for DDBMS or Mining
 Warehouses are Infrastructure to Facilitate Mining
Geographic Information Systems, Deductive DBMS,
Multi-Media DBMS, Mobile DBMS, Embedded/RealTime DBMS, etc.
Chaps26.28.29-124
Download