Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e Database System Architectures, Data CSE 4701 Mining/Warehousing, Web DB Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut 191 Auditorium Road, Box U-155 Storrs, CT 06269-3155 steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818 A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech. Remaining slides represent new material. Chaps26.28.29-1 Classical and Distributed Architectures CSE 4701 Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward Problems of this Approach Difficult to Scale w.r.t. Performance Gains If DB Overloaded, replace with a Faster Computer this can Only Go So Far - Disk Bottlenecks Distributed DBMS have Evolved to Address a Number of Issues Improved Performance Putting Data “Near” Location where it is Needed Replication of Data for Fault Tolerance Vertical and Horizontal Partitioning of DB Tuples Chaps26.28.29-2 Common Features of Centralized DBMS CSE 4701 Data Independence High-Level Representation via Conceptual and External Schemas Physical Representation (Internal Schema) Hidden Program Independence Multiple Applications can Share Data Views/External Schema Support this Capability Reduction of Program/Data Redundancy Single, Unique, Conceptual Schema Shared Database Almost No Data Redundancy Controlled Data Access Reduces Inconsistencies Programs Execute with Consistent Results Chaps26.28.29-3 Common Features of Centralized DBMS CSE 4701 Promote Sharing: Automatically Provided via CC No Longer Programmatic Issue Most DBMS Offer Locking for Key Shared Data Oracle Allows Locks on Data Item (Attributes) For Example, Controlling Access to Shared Identifier Coherent and Central DB Administration Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules Data Resiliency Physical Integrity of Data in the Presence of Faults and Errors Supported by DB Recovery Data Security: Control Access for Authorized Users Against Sensitive Data Chaps26.28.29-4 Shared Nothing Architecture CSE 4701 In this Architecture, Each DBMS Operates Autonomously There is No Sharing Three Separate DBMSs on Three Different Computers Applications/Clients Must Know About the External Schemas of all Three DBMSs for Database Retrieval Client Processing Complicates Client Different DBMS Platforms (Oracle, Sybase, Informix, ..) Different Access Modes (Query, Embedded, ODBC) Difficult for SWE to Code Chaps26.28.29-5 Difficulty in Access – Manage Multiple APIs CSE 4701 Each Platform has a Different API API1 , API3 , …. , APIn An App Programmer Must Utilize All three APIs which could differ by PL – C++, C, Java, REST, etc. Any interactions Across 3 DBs – must be programmatically handled without DB Capabilities API1 API2 APIn Chaps26.28.29-6 NW Architecture with Centralized DB CSE 4701 High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide Clients at Any Site can Access Repository Data May be “Far” Away - Increased Access Time In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2 Inefficient, no Replication w.r.t. 
Failure Chaps26.28.29-7 Fully Distributed Architecture CSE 4701 The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - its Distributed Replication is Possible for Fault Tolerance Queries at one Site May Need to Access Data at Another Site (e.g., for a Join) Increased Transaction Processing Complexity Chaps26.28.29-8 Distributed Database Concepts CSE 4701 A transaction can be executed by multiple networked computers in a unified manner. A distributed database (DDB) processes a Unit of execution (a transaction) in a distributed manner. A distributed database (DDB) can be defined as Collection of multiple logically related database distributed over a computer network Distributed database management system as a software system that manages a distributed database while making the distribution transparent to the user. Chaps26.28.29-9 Goals of DDBMS CSE 4701 Support User Distribution Across Multiple Sites Remote Access by Users Regardless of Location Distribution and Replication of Database Content Provide Location Transparency Users Manipulate their Own Data Non-Local Sites “Appear” Local to Any User Provide Transaction Control Akin to Centralized Case Transaction Control Hides Distribution CC and Serializability - Must be Extended Minimize Communications Cost Optimize Use of Network - a Critical Issue Distribute DB Design Supported by Partitioning (Fragmentation) and Replication Chaps26.28.29-10 Goals of DDBMS CSE 4701 Improve Response Time for DB Access Use a More Sophisticated Load Control for Transaction Processing However, Synchronization Across Sites May Introduce Additional Overhead System Availability Site Independence in the Presence of Site Failure Subset of Database is Always Available Replication can Keep All Data Available, Even When Multiple Sites Fail Modularity Incremental Growth with the Addition of Sites Dedicate Sites to Specific Tasks Chaps26.28.29-11 Advantages of DDBMS CSE 4701 There are Four Major Advantages Transparency Distribution/NW Transparency User Doesn’t Know about NW Configuration (Location Transparency) User can Find Object at any Site (Naming Transparency) Replication Transparency (see next PPT) User Doesn’t Know Location of Data Replicas are Transparently Accessible Fragmentation Transparency Horizontal Fragmentation (Distribute by Row) Vertical Fragmentation (Distribute by Column) Chaps26.28.29-12 Data Distribution and Replication CSE 4701 Chaps26.28.29-13 Other Advantages of DDBMS CSE 4701 Increased Reliability and Availability Reliability - System Always Running Availability - Data Always Present Achieved via Replication and Distribution Ability to Make Single Query for Entire DDBMS Improved Performance Sites Able to Utilize Data that is Local for Majority of Queries Easier Expansion Improve Performance of Site by Upgrading Processor of Computer Adding Additional Disks Splitting a Site into Two or More Sites Expansion over Time as Business Grows Chaps26.28.29-14 Challenges of DDBMS CSE 4701 Tracking Data - Meta Data More Complex Must Track Distribution (where is the Data) V & H Fragmentation (How is Data Split) Replication (Multiple Copies for Consistency) Distributed Query Processing Optimization, Accessibility, etc., More Complex Block Analysis of Data Size Must also Now Consider the NW Transmitting Time Distributed Transaction Processing TP Potentially Spans Multiple Sites Submit Query to Multiple Sites Collect and Collate Results Distributed Concurrency Control Across Nodes Chaps26.28.29-15 Challenges of DDBMS CSE 4701 
Replicated Data Management TP Must Choose the Replica to Access Updates Must Modify All Replica Copies Distributed Database Recovery Recovery of Individual Sites Recovery Across DDBMS Security Local and Remote Authorization During TP, be Able to Verify Remote Privileges Distributed Directory Management Meta-Data on Database - Local and Remote Must maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites Chaps26.28.29-16 A Complete Schema with Keys ... CSE 4701 Keys Allow us to Establish Links Between Relations Chaps26.28.29-17 …and Corresponding DB Tables CSE 4701 which Represent Tuples/Instances of Each Relation A S C null W B null null 1 4 5 5 Chaps26.28.29-18 …with Remaining DB Tables CSE 4701 Chaps26.28.29-19 What is Fragmentation? CSE 4701 Fragmentation Divides a DB Across Multiple Sites Two Types of Fragmentation Horizontal Fragmentation Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites Each Site has a Subset of the n Tuples Essentially Fragmentation is a Selection Vertical Fragmentation Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites Essentially Fragmentation is a Projection Not Generally Utilized in Practice In Both Cases, Sites can Overlap for Replication Chaps26.28.29-20 Horizontal Fragmentation CSE 4701 A horizontal subset of a relation which contain those of tuples which satisfy selection conditions. Consider Employee relation with condition DNO = 5 All tuples satisfying this create a subset which will be a horizontal fragment of Employee relation. A selection condition may be composed of several conditions connected by AND or OR. Derived horizontal fragmentation: Partitioning of a primary relation to other secondary relations which are related with Foreign keys. Chaps26.28.29-21 Horizontal Fragmentation Site 2 Tracks All Information Related to Dept. 5 CSE 4701 Chaps26.28.29-22 Horizontal Fragmentation CSE 4701 Site 3 Tracks All Information Related to Dept. 4 Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments Chaps26.28.29-23 Refined Horizontal Fragmentation CSE 4701 Further Fragment from Site 2 based on Dept. that Employee Works in Notice that G1 + G2 + G3 is the Same as WORKS_ON5 there is no Overlap Chaps26.28.29-24 Refined Horizontal Fragmentation CSE 4701 Further Fragment from Site 3 based on Dept. that Employee Works in Notice that G4 + G5 + G6 is the Same as WORKS_ON4 Note Some Fragments can be Empty Chaps26.28.29-25 Vertical Fragmentation CSE 4701 Subset of a relation created via a subset of columns. A vertical fragment of a relation will contain values of selected columns. There is no selection condition used in vertical fragmentation. A strict vertical slice/partition Consider the Employee relation. A vertical fragment of can be created by keeping the values of Name, Bdate, Sex, and Address. Since no condition for creating a vertical fragment Each fragment must include the primary key attribute of the parent relation Employee. All vertical fragments of a relation are connected. Chaps26.28.29-26 Vertical Fragmentation Example CSE 4701 Partition the Employee Table as Below Notice Each Vertical Fragment Needs Key Column EmpDemo EmpSupvrDept Chaps26.28.29-27 Homogeneous DDBMS CSE 4701 Homogeneous Identical Software (w.r.t. Database) One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc. 
Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …) Chaps26.28.29-28
Non-Federated Heterogeneous DDBMS CSE 4701 Non-Federated Heterogeneous Different Software (w.r.t. Database) Multiple DB Products (e.g., Oracle at One Site, Access at Another, Sybase, Informix, etc.) Replicated Administration (e.g., Users Need Accounts on Multiple Systems) Varied Programmatic Access - SWEs Must Know All Platforms/Client Software Complicated Very Close to Shared Nothing Architecture Chaps26.28.29-29
Federated DDBMS CSE 4701 Federated Multiple DBMS Platforms Overlaid with a Global Schema View Single External Schema Combines Schemas from all Sites Multiple Data Models Relational in one Component DBS Object-Oriented in another DBS Hierarchical in a 3rd DBS Chaps26.28.29-30
Federated DBMS Issues CSE 4701 Differences in Data Models Reconcile Relational vs. Object-Oriented Models Each Different Model has Different Capabilities These Differences Must be Addressed in Order to Present a Federated Schema Differences in Constraints Referential Integrity Constraints in Different DBSs Different Constraints on “Similar” Data Federated Schema Must Deal with these Conflicts Differences in Query Languages SQL-89, SQL-92, SQL2, SQL3 Specific Types in Different DBMS (Oracle BLOBs) Differences in Key Processing & Timestamping Chaps26.28.29-31
Heterogeneous Distributed Database Systems CSE 4701 Federated: Each site may run a different database system, but data access is managed through a single conceptual schema. The degree of local autonomy is minimal. Each site must adhere to a centralized access policy. There may be a global schema. Multi-database: There is no single conceptual global schema. For data access, a schema is constructed dynamically as needed by the application software. [Figure: five sites (relational, object-oriented, hierarchical, and network DBMSs on Unix, Linux, and Windows hosts) connected by a communications network.] Chaps26.28.29-32
Query Processing in Distributed Databases Issues CSE 4701 Cost of transferring data (files and results) over the network. This cost is usually high, so some optimization is necessary. Example relations: Employee at site 1 and Department at site 2 – Employee at site 1: 10,000 rows, row size = 100 bytes, table size = 10^6 bytes. Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno) – Department at site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes. Department(Dname, Dnumber, Mgrssn, Mgrstartdate) Q: For each employee, retrieve the employee name and the department name where the employee works. Q: π Fname,Lname,Dname (Employee ⋈ Dno=Dnumber Department) Chaps26.28.29-33
Query Processing in Distributed Databases CSE 4701 Result The result of this query will have 10,000 tuples, assuming that every employee is related to a department. Suppose each result tuple is 40 bytes long. The query is submitted at site 3 and the result is sent to this site. Problem: Employee and Department relations are not present at site 3. Chaps26.28.29-34
Query Processing in Distributed Databases CSE 4701 Strategies: 1. Transfer Employee and Department to site 3. Total transfer bytes = 1,000,000 + 3,500 = 1,003,500 bytes. 2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3. Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes. 3. Transfer the Department relation to site 1, execute the join at site 1, and send the result to site 3. Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
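As a quick sanity check on these numbers, a minimal sketch (written for this handout; the class and constant names are illustrative, and the sizes are the ones assumed on the slides) that tallies the bytes shipped under each strategy for Q:

```java
// Back-of-the-envelope comparison of the three execution strategies for Q.
public class TransferCost {
    static final long EMP_BYTES  = 10_000L * 100;  // Employee: 10,000 rows x 100 bytes = 1,000,000
    static final long DEPT_BYTES = 100L * 35;      // Department: 100 rows x 35 bytes = 3,500
    static final long Q_RESULT   = 10_000L * 40;   // 10,000 result tuples x 40 bytes = 400,000

    public static void main(String[] args) {
        // Strategy 1: ship both relations to the result site (site 3)
        System.out.println("S1 = " + (EMP_BYTES + DEPT_BYTES) + " bytes"); // 1,003,500
        // Strategy 2: ship Employee to site 2, join there, ship the result to site 3
        System.out.println("S2 = " + (EMP_BYTES + Q_RESULT) + " bytes");   // 1,400,000
        // Strategy 3: ship Department to site 1, join there, ship the result to site 3
        System.out.println("S3 = " + (DEPT_BYTES + Q_RESULT) + " bytes");  // 403,500
    }
}
```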
Optimization criteria: minimizing data transfer. Preferred approach: strategy 3. Chaps26.28.29-35
Query Processing in Distributed Databases CSE 4701 Consider the query Q’: For each department, retrieve the department name and the name of the department manager. Relational algebra expression: Q’: π Fname,Lname,Dname (Employee ⋈ Mgrssn=SSN Department) Chaps26.28.29-36
Query Processing in Distributed Databases CSE 4701 The result of Q’ has 100 tuples, assuming that every department has a manager. The execution strategies are: 1. Transfer Employee and Department to the result site and perform the join at site 3. Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes. 2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3. Query result size = 40 * 100 = 4,000 bytes. Total transfer size = 4,000 + 1,000,000 = 1,004,000 bytes. 3. Transfer the Department relation to site 1, execute the join at site 1, and send the result to site 3. Total transfer size = 4,000 + 3,500 = 7,500 bytes. Preferred strategy: Choose strategy 3. Chaps26.28.29-37
Query Processing in Distributed Databases CSE 4701 Now suppose the result site is site 2. Possible strategies: 1. Transfer the Employee relation to site 2, execute the query, and present the result to the user at site 2. Total transfer size = 1,000,000 bytes for both queries Q and Q’. 2. Transfer the Department relation to site 1, execute the join at site 1, and send the result back to site 2. Total transfer size for Q = 400,000 + 3,500 = 403,500 bytes and for Q’ = 4,000 + 3,500 = 7,500 bytes. Chaps26.28.29-38
DDBS Concurrency Control and Recovery CSE 4701 Distributed databases encounter a number of concurrency control and recovery problems which are not present in centralized databases, including: Dealing with multiple copies of data items How are they All Updated if Needed? Failure of individual sites How are Queries Restarted or Rerouted? Communication link failure Network Failure Distributed commit How to Know All Updates Done at all Sites? Distributed deadlock How to Detect and Recover? Chaps26.28.29-39
Data Warehousing and Data Mining CSE 4701 Data Warehousing Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making Underlying Infrastructure in Support of Mining Provides Means to Interact with Multiple DBs OLAP (On-Line Analytical Processing) vs. OLTP Data Mining Discovery of Information in Vast Data Sets Search for Patterns and Common Features Discover Information not Previously Known Medical Records Accessible Nationwide Research/Discover Cures for Rare Diseases Relies on Knowledge Discovery in DBs (KDD) Chaps26.28.29-40
What is Purpose of a Data Warehouse? CSE 4701 Traditional databases are not optimized purely for data access; they must balance the requirements of data access against ensuring integrity. Most data warehouse users need only read access, but need that access to be fast over a large volume of data. Most of the data required for data warehouse analysis comes from multiple databases, and these analyses are recurrent and predictable, which makes it possible to design software that meets the requirements. Critical for tools that provide decision makers with information to make decisions quickly and reliably based on historical data. The aforementioned characteristics are achieved by Data Warehousing and Online Analytical Processing (OLAP). W.
H Inmon characterized a data warehouse as: “A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions.” Chaps26.28.29-41 Data Warehousing and OLAP CSE 4701 A Data Warehouse Database is Maintained Separately from an Operational Database “A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support for Management’s Decision Making Process [W.H.Inmon]” OLAP (on-Line Analytical Processing) Analysis of Complex Data in the Warehouse Attempt to Attain “Value” through Analysis Relies on Trained and Adept Skilled Knowledge Workers who Discover Information Data Mart Organized Data for a Subset of an Organization Chaps26.28.29-42 Conceptual Structure of Data Warehouse CSE 4701 Data Warehouse processing involves Cleaning and reformatting of data OLAP Data Mining Back Flushing Data Warehouse OLAP Cleaning Databases Other Data Inputs Data Reformatting Metadata DSSI EIS Data Mining Updates/New Data Chaps26.28.29-43 Building a Data Warehouse CSE 4701 Option 1 Leverage Existing Repositories Collate and Collect May Not Capture All Relevant Data Option 2 Start from Scratch Utilize Underlying Corporate Data Corporate data warehouse Option 1: Consolidate Data Marts Option 2: Build from scratch Data Mart ... Data Mart Data Mart Data Mart Corporate data Chaps26.28.29-44 Comparison with Traditional Databases CSE 4701 Data Warehouses are mainly optimized for appropriate data access Traditional databases are transactional Optimized for both access mechanisms and integrity assurance measures. Data warehouses emphasize historical data as their support time-series and trend analysis. Compared with transactional databases, data warehouses are nonvolatile. In transactional databases, transaction is the mechanism change to the database. In warehouse, data is relatively coarse grained and refresh policy is carefully chosen, usually incremental. Chaps26.28.29-45 Classification of Data Warehouses CSE 4701 Generally, Data Warehouses are an order of magnitude larger than the source databases. The sheer volume of data is an issue, based on which Data Warehouses could be classified as follows. Enterprise-wide data warehouses Huge projects requiring massive investment of time and resources. Virtual data warehouses Provide views of operational databases that are materialized for efficient access. Data marts Generally targeted to a subset of organization, such as a department, and are more tightly focused. Chaps26.28.29-46 Data Warehouse Characteristics CSE 4701 Utilizes a “Multi-Dimensional” Data Model Warehouse Comprised of Store of Integrated Data from Multiple Sources Processed into Multi-Dimensional Model Warehouse Supports of Times Series and Trend Analysis “Super-Excel” Integrated with DB Technologies Data is Less Volatile than Regular DB Doesn’t Dramatically Change Over Time Updates at Regular Intervals Specific Refresh Policy Regarding Some Data Chaps26.28.29-47 Three Tier Architecture CSE 4701 monitor External data sources OLAP Server integrator Summarization report Operational databases Extraxt Transform Load Refresh serve Data Warehouse Query report Data mining metadata Data marts Chaps26.28.29-48 Data Modeling for Data Warehouses CSE 4701 Traditional Databases generally deal with twodimensional data (similar to a spread sheet). However, querying performance in a multidimensional data storage model is much more efficient. 
Data warehouses can take advantage of this feature, as generally they are non-volatile and the degree of predictability of the analysis that will be performed on them is high. Chaps26.28.29-49
What is a Multi-Dimensional Data Cube? CSE 4701 Representation of Information in Two or More Dimensions Typical Two-Dimensional - Spreadsheet In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful Aggregate Raw Data! Chaps26.28.29-50
Multi-Dimensional Schemas CSE 4701 Supporting Multi-Dimensional Schemas Requires Two Types of Tables: Dimension Table: Tuples of Attributes for Each Dimension Fact Table: Measured/Observed Variables with Pointers into the Dimension Tables Star Schema Characterizes Data Cubes by having a Fact Table with a Single Table for Each Dimension Snowflake Schema Dimension Tables from the Star Schema are Organized into a Hierarchy via Normalization Both Represent Storage Structures for Cubes Chaps26.28.29-51
Data Modeling for Data Warehouses CSE 4701 Advantages of a multi-dimensional model Multi-dimensional models lend themselves readily to hierarchical views in what is known as roll-up display & drill-down display. The data can be directly queried in any combination of dimensions, bypassing complex database queries. Chaps26.28.29-52
Data Warehouse Design CSE 4701 Most Data Warehouses use a Star Schema to Represent the Multi-Dimensional Data Model Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates and Stores Measures for those Coordinates A Fact Table Connects All Dimension Tables with a Multiple Join Each Tuple in the Fact Table Represents the Content of One Dimension Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimension Tables Links Between the Fact Table and the Dimension Tables Form a Shape Like a Star Chaps26.28.29-53
Sample Fact Tables CSE 4701 Chaps26.28.29-54
Example of Star Schema CSE 4701 [Figure: Sale fact table (Date, Product, Store, Customer, Unit_Sales, Dollar_Sales) linked to four dimension tables - Date (Date, Month, Year), Product (ProductNo, ProdName, ProdDesc, Category), Store (StoreID, City, State, Country, Region), and Customer (CustID, CustName, CustCity, CustCountry).] Chaps26.28.29-55
A Second Example of Star Schema … CSE 4701 Chaps26.28.29-56
and Corresponding Snowflake Schema CSE 4701 Chaps26.28.29-57
Multi-dimensional Schemas CSE 4701 Fact Constellation Fact constellation is a set of tables that share some dimension tables. However, fact constellations limit the possible queries for the warehouse. Chaps26.28.29-58
Fact Table i2b2 (Integrating Biology & Bedside) CSE 4701 Chaps26.28.29-59
Data Warehouse Issues CSE 4701 Data Acquisition Extraction from Heterogeneous Sources Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent Data Cleaning for Validity and Quality Is the Data as Expected w.r.t. Content? Value? Transition of Data into the Data Model of the Warehouse Loading of Data into the Warehouse Other Issues Include: How Current is the Data? Frequency of Update? Availability of Warehouse? Dependencies of Data? Distribution, Replication, and Partitioning Needs? Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?
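To make the star-schema layout above concrete, a hedged sketch of the DDL for the Sale example, issued through JDBC; the JDBC URL, credentials, and column types are placeholders rather than a prescribed design:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative star schema modeled on the Sale example: one fact table whose
// foreign keys point at the dimension tables.
public class StarSchemaSetup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; any JDBC-accessible DBMS would do.
        try (Connection c = DriverManager.getConnection("jdbc:warehouse-url", "user", "pw");
             Statement s = c.createStatement()) {
            s.executeUpdate("CREATE TABLE ProductDim (ProductNo INT PRIMARY KEY, "
                          + "ProdName VARCHAR(40), ProdDesc VARCHAR(200), Category VARCHAR(20))");
            s.executeUpdate("CREATE TABLE StoreDim (StoreID INT PRIMARY KEY, "
                          + "City VARCHAR(30), State VARCHAR(20), Country VARCHAR(30), Region VARCHAR(20))");
            s.executeUpdate("CREATE TABLE CustomerDim (CustID INT PRIMARY KEY, "
                          + "CustName VARCHAR(40), CustCity VARCHAR(30), CustCountry VARCHAR(30))");
            s.executeUpdate("CREATE TABLE DateDim (DateKey DATE PRIMARY KEY, "
                          + "SaleMonth INT, SaleYear INT)");
            // The fact table holds the measures plus one pointer (FK) per dimension.
            s.executeUpdate("CREATE TABLE SaleFact ("
                          + "DateKey DATE REFERENCES DateDim, "
                          + "ProductNo INT REFERENCES ProductDim, "
                          + "StoreID INT REFERENCES StoreDim, "
                          + "CustID INT REFERENCES CustomerDim, "
                          + "Unit_Sales INT, Dollar_Sales DECIMAL(12,2))");
        }
    }
}
```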
Chaps26.28.29-60 OLAP Strategies CSE 4701 OLAP Strategies Roll-Up: Summarization of Data Drill-Down: from the General to Specific (Details) Pivot: Cross Tabulate the Data Cubes Slice and Dice: Projection Operations Across Dimensions Sorting: Ordering Result Sets Selection: Access by Value or Value Range Implementation Issues Persistent with Infrequent Updates (Loading) Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes Recovery Less Critical - Mostly Read Only Temporal Aspects of Data (Versions) Important Chaps26.28.29-61 Knowledge Discovery CSE 4701 Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully Knowledge Discovery Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from Large Collection of Data Data Mining A Critical Step in the Knowledge Discovery Process Extracts Implicit Information from Large Data Set KDD: Knowledge Discovery and Data Mining Chaps26.28.29-62 Steps in a KDD Process CSE 4701 Learning the Application Domain (goals) Gathering and Integrating Data Data Cleaning Data Integration Data Transformation/Consolidation Data Mining Choosing the Mining Method(s) and Algorithm(s) Mining: Search for Patterns or Rules of Interest Analysis and Evaluation of the Mining Results Use of Discovered Knowledge in Decision Making Important Caveats This is Not an Automated Process! Requires Significant Human Interaction! Chaps26.28.29-63 Processing in a Data Warehouse CSE 4701 Processing Types are Varied and Include: Roll-up: Data is summarized with increasing generalization Drill-Down: Increasing levels of detail are revealed Pivot: Cross tabulation is performed Slice and dice: Performing projection operations on the dimensions. Sorting: Data is sorted by ordinal value. Selection: Data is available by value or range. Derived attributes: Attributes are computed by operations on stored derived values. Chaps26.28.29-64 On-Line Analytical Processing CSE 4701 Data Cube A Multidimensonal Array Each Attribute is a Dimension In Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date Product Product Store Date Sale acron Rolla,MO 7/3/99 325.24 budwiser LA,CA 5/22/99 833.92 large pants NY,NY 2/12/99 771.24 Pants Diapers Beer Nuts West East 3’ diaper Cuba,MO 7/30/99 81.99 Region Central Mountain South Jan Feb March April Date Chaps26.28.29-65 Examples of Data Mining CSE 4701 The Slicing Action A Vertical or Horizontal Slice Across Entire Cube Months Slice on city Atlanta Products Sales Products Sales Months Multi-Dimensional Data Cube Chaps26.28.29-66 Examples of Data Mining CSE 4701 The Dicing Action A Slide First Identifies on Dimension A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions Months Products Sales Products Sales Months March 2000 Electronics Atlanta Dice on Electronics and Atlanta Chaps26.28.29-67 Examples of Data Mining Drill Down - Takes a Facet (e.g., Q1) and Decomposes into Finer Detail Jan Feb March Products Sales CSE 4701 Drill down on Q1 Roll Up on Location (State, USA) Roll Up: Combines Multiple Dimensions From Individual Cities to State Q1 Q2 Q3 Q4 Products Sales Products Sales Q1 Q2 Q3 Q4 Chaps26.28.29-68 Mining Other Types of Data Analysis and Access Dramatically More Complicated! 
CSE 4701 Spatial databases Multimedia databases World Wide Web Time series data Geographical and Satellite Data Chaps26.28.29-69
Advantages/Objectives of Data Mining CSE 4701 Descriptive Mining Discover and Describe General Properties 60% of People who buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months Predictive Mining Infer Interesting Properties based on Available Data People who Buy Beer on Friday usually also Buy Nuts or Chips Result of Mining Order from Chaos Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc. Impact on Marketing Strategy Chaps26.28.29-70
Data Mining Methods CSE 4701 Association Discover the Frequency of Items Occurring Together in a Transaction or an Event Example 80% of Customers who Buy Milk also Buy Bread Hence - Bread and Milk Adjacent in Supermarket 50% of Customers Forget to Buy Milk/Soda/Drinks Hence - Available at Register Prediction Predicts Some Unknown or Missing Information based on Available Data Example Forecast Sale Value of Electronic Products for Next Quarter via Available Data from Past Three Quarters Chaps26.28.29-71
Association Rules CSE 4701 Motivated by Market Analysis Rules of the Form Item1 ∧ Item2 ∧ … ∧ Itemk → Itemk+1 ∧ … ∧ Itemn Example “Beer ∧ Soft Drink → Popcorn” Problem: Discovering All Interesting Association Rules in a Large Database is Difficult! Issues Interestingness Completeness Efficiency Basic Measurements for Association Rules: Support of the Rule, Confidence of the Rule Chaps26.28.29-72
Data Mining Methods CSE 4701 Classification Determine the Class or Category of an Object based on its Properties Example Classify Companies based on the Final Sale Results in the Past Quarter Clustering Organize a Set of Multi-dimensional Data Objects in Groups to Minimize Inter-group Similarity and Maximize Intra-group Similarity Example Group Crime Locations to Find Distribution Patterns Chaps26.28.29-73
Classification CSE 4701 Classification is the process of learning a model that is able to describe different classes of data. Learning is supervised, as the classes to be learned are predetermined. Learning is accomplished by using a training set of pre-classified data. The model produced is usually in the form of a decision tree or a set of rules. Chaps26.28.29-74
One Classification Example CSE 4701 Rule extracted from the decision tree of Figure 28.7: IF 50K > salary >= 20K AND age >= 25 THEN class is “yes” Chaps26.28.29-75
Classification CSE 4701 Two Stages Learning Stage: Construction of a Classification Function or Model Classification Stage: Prediction of Classes of Objects Using the Function or Model Tools for Classification Decision Tree Bayesian Network Neural Network Regression Problem Given a Set of Objects whose Classes are Known (Training Set), Derive a Classification Model which can Correctly Classify Future Objects Chaps26.28.29-76
An Example CSE 4701 Attributes Class Attribute - Play/Don’t Play the Game Training Set Values that Set the Condition for the Classification What are the Patterns Below? Attributes and possible values: outlook: sunny, overcast, rain; temperature: continuous; humidity: continuous; windy: true, false. Training set (Outlook, Temperature, Humidity, Windy, Play): (sunny, 85, 85, false, No); (overcast, 83, 78, false, Yes); (sunny, 80, 90, true, No); (sunny, 72, 95, false, No); (sunny, 72, 70, false, Yes); …
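Before moving on to summarization, a minimal sketch of the two association-rule measures named a few slides back (support and confidence), computed over a toy set of market-basket transactions; the class name and the data are invented for illustration:

```java
import java.util.List;
import java.util.Set;

// support(X -> Y)    = fraction of all transactions containing X union Y
// confidence(X -> Y) = fraction of transactions containing X that also contain Y
public class RuleMeasures {
    static double support(List<Set<String>> txns, Set<String> x, Set<String> y) {
        long both = txns.stream()
                        .filter(t -> t.containsAll(x) && t.containsAll(y))
                        .count();
        return (double) both / txns.size();
    }

    static double confidence(List<Set<String>> txns, Set<String> x, Set<String> y) {
        long hasX = txns.stream().filter(t -> t.containsAll(x)).count();
        long both = txns.stream()
                        .filter(t -> t.containsAll(x) && t.containsAll(y))
                        .count();
        return hasX == 0 ? 0.0 : (double) both / hasX;
    }

    public static void main(String[] args) {
        // Toy transactions, echoing the "Beer AND Soft Drink -> Popcorn" example.
        List<Set<String>> txns = List.of(
            Set.of("beer", "soft drink", "popcorn"),
            Set.of("beer", "soft drink"),
            Set.of("beer", "popcorn", "nuts"),
            Set.of("milk", "bread"));
        Set<String> x = Set.of("beer", "soft drink");
        Set<String> y = Set.of("popcorn");
        System.out.println("support    = " + support(txns, x, y));     // 0.25
        System.out.println("confidence = " + confidence(txns, x, y));  // 0.5
    }
}
```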
Chaps26.28.29-77 Data Mining Methods CSE 4701 Summarization Characterization (Summarization) of General Features of Objects in the Target Class Example Characterize People’s Buying Patterns on the Weekend Potential Impact on “Sale Items” & “When Sales Start” Department Stores with Bonus Coupons Discrimination Comparison of General Features of Objects Between a Target Class and a Contrasting Class Example Comparing Students in Engineering and in Art Attempt to Arrive at Commonalities/Differences Chaps26.28.29-78 Summarization Technique CSE 4701 Attribute-Oriented Induction Generalization using Concert hierarchy (Taxonomy) barcode category 14998 milk brand diaryland content size Skim 2L food 12998 mechanical MotorCraft valve 23a 12in … … … … ... Milk … Skim milk … 2% milk Category milk milk … Content Count skim 2% … 280 98 ... bread White whole bread … wheat Lucern … Dairyland Wonder … Safeway Chaps26.28.29-79 Building A Data Warehouse CSE 4701 The builders of Data warehouse should take a broad view of the anticipated use of the warehouse. The design should support ad-hoc querying An appropriate schema should be chosen that reflects the anticipated usage. The Design of a Data Warehouse involves following steps. Acquisition of data for the warehouse. Ensuring that Data Storage meets the query requirements efficiently. Giving full consideration to the environment in which the data warehouse resides. Chaps26.28.29-80 Building A Data Warehouse CSE 4701 Acquisition of data for the warehouse The data must be extracted from multiple, heterogeneous sources. Data must be formatted for consistency within the warehouse. The data must be cleaned to ensure validity. Difficult to automate cleaning process. Back flushing, upgrading the data with cleaned data. The data must be fitted into the data model of the warehouse. The data must be loaded into the warehouse. Proper design for refresh policy should be considered. Chaps26.28.29-81 Building A Data Warehouse CSE 4701 Storing the data according to the data model of the warehouse Creating and maintaining required data structures Creating and maintaining appropriate access paths Providing for time-variant data as new data are added Supporting the updating of warehouse data. Refreshing the data Purging data Chaps26.28.29-82 Why is Data Mining Popular? CSE 4701 Technology Push Technology for Collecting Large Quantity of Data Bar Code, Scanners, Satellites, Cameras Technology for Storing Large Collection of Data Databases, Data Warehouses Variety of Data Repositories, such as Virtual Worlds, Digital Media, World Wide Web Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances Targeted Marketing by Age, Region, Income, etc. Exploiting User Preferences/Customized Shopping Chaps26.28.29-83 Requirements & Challenges in Data Mining CSE 4701 Security and Social What Information is Available to Mine? Preferences via Store Cards/Web Purchases What is Your Comfort Level with Trends? User Interfaces and Visualization What Tools Must be Provided for End Users of Data Mining Systems? How are Results for Multi-Dimensional Data Displayed? Performance Guarantees Range from Real-Time for Some Queries to LongTerm for Other Queries Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets Chaps26.28.29-84 Data Mining Visualization CSE 4701 Leverage Improving 3D Graphics and Increased PC Processing Power for Displaying Results Significant Research in Visualization w.r.t. 
Displaying Multi-Dimensional Data Chaps26.28.29-85 Successful Data Mining Applications CSE 4701 Business Data Analysis and Decision Support Marketing, Customer Profiling, Market Analysis and Management, Risk Analysis and Management Fraud Detection Detecting Telephone Fraud, Automotive and Health Insurance Fraud, Credit-card Fraud, Suspicious Money Transactions (Money Laundering) Text Mining Message Filtering (Email, Newsgroups, Etc.) Newspaper Articles Analysis Sports IBM Advanced Scout Analyzed NBA Game Statistics (Shots Blocked, Assists and Fouls) to Gain Competitive Advantage Chaps26.28.29-86 Select Data Mining Products CSE 4701 Chaps26.28.29-87 Databases on WWW CSE 4701 Web has changed the way we do Business & Research Facts: Industry Saw an Opportunity, knew it had to Move Quickly to Capitalize Lots of Action, Lots of Money, Lots of Releases Line Between R&D is Very Narrow Many Researchers Moved to Industry (Trying to Return Back to Academia) Emergence of Java Java changed the way that Software was Designed, Developed, and Utilized Particularly w.r.t. Web-Based Applications, Database Interoperability, Web Architectures, etc. Emergence of Enterprise Computing Chaps26.28.29-88 Internet and the Web CSE 4701 A Major Opportunity for Business A Global Marketplace Business Across State and Country Boundaries A Way of Extending Services Online Payment vs. VISA, Mastercard A Medium for Creation of New Services Publishers, Travel Agents, Teller, Virtual Yellow Pages, Online Auctions … A Boon for Academia Research Interactions and Collaborations Free Software for Classroom/Research Usage Opportunities for Exploration of Technologies in Student Projects Chaps26.28.29-89 WWW: Three Market Segments CSE 4701 Business to Business Server Corporate Network Server Intranet Decision support Mfg.. System monitoring corporate repositories Workgroups Information sharing Ordering info./status Targeted electronic commerce Internet Corporate Server Network Internet Sales Marketing Information Services Server Chaps26.28.29-90 Information Delivery Problems on the Net CSE 4701 Everyone can Publish Information on the Web Independently at Any Time Consequently, there is an Information Explosion Identifying Information Content More Difficult There are too Many Search Engines but too Few Capable of Returning High Quality Data Is this Still True? Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes Is this Still True? Chaps26.28.29-91 Example Web Applications CSE 4701 Scenario 1: World Wide Wait A Major Event is Underway and the Latest, Up-tothe Minute Results are Being Posted on the Web You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Pointing at the Result Posting Site, and Wait, and Wait, and Wait … What is the Problem? The Scalability Problems are the Result of a Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application Changed with Emergence of Mobile Computing? Chaps26.28.29-92 Example Web Applications CSE 4701 Scenario 2: Many Applications Today have the Need for Tracking Changes in Local and Remote Data Sources and Notifying Changes If Some Condition Over the Data Source(s) is Met If You Want to Monitor the Changes on Web, You Need to Fire Your Trusty Web Browser from Time to Time, and Cache the Most Recent Result, and do the Difference Manually Each Time You Poll the Data Source(s) … What is the Problem? 
Pure Pull is Not the Answer to All Problems Changed with Emergence of Mobile Computing? Chaps26.28.29-93 What is the Problem? CSE 4701 Applications are Asymmetric but the Web is Not Computation Centric vs. Information Flow Centric Type of Asymmetry Network Asymmetry Satellite, CATV, Mobile Clients, Etc. Client to Server Ratio Too Many Clients can Swamp Servers Data Volume Mouse and Key Click vs. Content Delivery Update and Information Creation Clients Need to be Informed or Must Poll What have we Seen re. Cell Networks Over Time? Chaps26.28.29-94 Useful Solutions CSE 4701 Combination/Interleave of Pull and Push Protocols User-initiated, Comprehensive Search-based Information Delivery (Pull) Server-initiated Information Dissemination (Push) Provide Support for a Variety of Data Delivery Protocols, Frequencies, and Delivery Modes Information Delivery Frequencies Periodic, Conditional, Ad-Hoc Information Delivery Modes Information Delivery Protocols (IDP) Request/Respond, Polling, Publish/Subscribe, Broadcast Information Delivery Styles (IDS) Pull, Push, Hybrid Chaps26.28.29-95 Information Delivery Frequencies CSE 4701 Periodic Data is Delivered from a Server to Clients Periodically Period can be Defined by System-default or by Clients Using their Profiles Period can be Influenced by Client and Bandwidth Mobile Device vs. PC w/Modem PC w/DSL vs. PC w/Cable Modem Multiple Mobile Devices of All Types Streaming of Videos, Live Streaming of Events Conditional (Aperiodic) Data is Delivered from a Server when Conditions Installed by Clients in their Profiles are Satisfied Ad-hoc (or Irregular) Chaps26.28.29-96 Information Delivery Modes CSE 4701 Uni-cast Data is Sent from a Data Source (a Single Server) to Another Machine 1-to-n Data is Sent by a Single Data Source and Received by Multiple Machines Multicast vs. Broadcast Multicast: Data is Sent to a Specific Set of Clients Broadcast: Sending Data Over a Medium which an Unidentified or Unbounded Set of Clients can Listen Chaps26.28.29-97 IDP: Request/Respond CSE 4701 Semantics of Request/Respond Clients Send their Request to Servers to Ask the Information of their Interest Servers Respond to the Client Request by Delivering the Information Requested Client can Wait (Synchronous) or Not Applications Most Database Systems and Web Search Engines are Using the Request/Respond Protocol for Client-Server Communication What has Changed with Mobile Computing? Chaps26.28.29-98 IDP: Programmed Polling vs. User Polling CSE 4701 Semantics: Programmed Polling: a System Periodically Sends Requests to Other Sites to Obtain Status Information or Detect Changed Values User Polling: a User or Application Periodically or Aperiodically Polls the Data Sites and Obtains the Changes Applications Programmed Polling: Save the Users from having to Click, but does Nothing to Solve the Scalability Problems Caused by the Request/Respond Mechanism What do Today’s Mobile Devices Use? Chaps26.28.29-99 IDP: Publish/Subscribe CSE 4701 Semantics: Servers Publish/Clients Subscribe Servers Publish Information Online Clients Subscribe to the Information of Interest (Subscription-based Information Delivery) Data Flow is Initiated by the Data Sources (Servers) and is Aperiodic Danger: Subscriptions can Lead to Other Unwanted Subscriptions Applications Unicast: Database Triggers and Active Databases 1-to-n: Online News Groups How is this Utilized in Mobile Devices? 
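A minimal in-process sketch of the publish/subscribe protocol just described: clients register interest in a topic and the data source pushes each update to them as it occurs, rather than being polled. The broker, topic, and message here are invented for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Server-initiated, aperiodic delivery: the publisher drives the data flow.
public class TinyBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    // Client subscribes to the information of interest.
    public void subscribe(String topic, Consumer<String> client) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(client);
    }

    // Data source publishes; every current subscriber is pushed the update.
    public void publish(String topic, String update) {
        subscribers.getOrDefault(topic, List.of())
                   .forEach(client -> client.accept(update));
    }

    public static void main(String[] args) {
        TinyBroker broker = new TinyBroker();
        broker.subscribe("scores", msg -> System.out.println("client A got: " + msg));
        broker.subscribe("scores", msg -> System.out.println("client B got: " + msg));
        broker.publish("scores", "halftime update");  // pushed to both subscribers
    }
}
```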
Chaps26.28.29-100 Information Delivery Styles CSE 4701 Pull-Based System Transfer of Data from Server to Client is Initiated by a Client Pull Clients Determine when to Get Information Potential for Information to be Old Unless Client Periodically Pulls Push-Based System Transfer of Data from Server to Client is Initiated by a Server Push Clients may get Overloaded if Push is Too Frequent Hybrid Pull and Push Combined Pull First and then Push Continually Chaps26.28.29-101 Summary: Pull vs. Push CSE 4701 Request/ Respond Pure Pull Conditional Ad-hoc Y Pure Push Hybrid Publish/ Broadcast Periodic Subscribe Y Y Y Y Y Y* Y Y Y Y Y* Chaps26.28.29-102 Design Options for Nodes CSE 4701 Three Types of Nodes: Data Sources Provide Base Data which is to be Disseminated Clients Who are the Net Consumers of the Information Information Brokers Acquire Information from Other Data Sources, Add Value to that Information and then Distribute this Information to Other Consumers By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Need of Many Users How has this Changed with Today’s Mobile Computing? Chaps26.28.29-103 The Next Big Challenge CSE 4701 Interoperability Heterogeneous Distributed Databases Heterogeneous Distributed Systems Autonomous Applications Scalability Rapid and Continuous Growth Amount of Data Variety of Data Types Dealing with personally identifiable information (PII) and personal health information (PHI) Emergence of Fitness and Health Monitoring Apps Google Fit and Apple HealthKit New Apple ResearchKit for Medical Research Chaps26.28.29-104 Interoperability: A Classic View CSE 4701 Local Schema Simple Federation Multiple Nested Federation FDB Global Schema FDB Global Schema 4 Federated Integration Federated Integration Local Schema Local Schema FDB 1 Local Schema Federation FDB3 Federation Chaps26.28.29-105 Java Client with Wrapper to Legacy Application CSE 4701 Java Client Java Application Code WRAPPER Mapping Classes JAVA LAYER Interactions Between Java Client and Legacy Appl. via C and RPC C is the Medium of Info. Exchange Java Client with C++/C Wrapper NATIVE LAYER Native Functions (C++) RPC Client Stubs (C) Legacy Application Network Chaps26.28.29-106 COTS and Legacy Appls. to Java Clients CSE 4701 COTS Application Legacy Application Java Application Code Java Application Code Native Functions that Map to COTS Appl NATIVE LAYER Native Functions that Map to Legacy Appl NATIVE LAYER JAVA LAYER JAVA LAYER Mapping Classes JAVA NETWORK WRAPPER Mapping Classes JAVA NETWORK WRAPPER Network Java Client Java Client Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers Chaps26.28.29-107 Java Client to Legacy App via RDBS CSE 4701 Transformed Legacy Data Java Client Updated Data Relational Database System(RDS) Extract and Generate Data Transform and Store Data Legacy Application Chaps26.28.29-108 Database Interoperability in the Internet CSE 4701 Technology Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL Architecture Information Broker •Mediator-Based Systems •Agent-Based Systems Chaps26.28.29-109 JDBC CSE 4701 JDBC API Provides DB Access Protocols for Open, Query, Close, etc. 
Different Drivers for Different DB Platforms JDBC API Java Application Driver Manager Driver Oracle Driver Access Driver Driver Sybase Chaps26.28.29-110 Connecting a DB to the Web CSE 4701 DBMS CGI Script Invocation or JDBC Invocation Web Server Internet Web Server are Stateless DB Interactions Tend to be Stateful Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open Browser Chaps26.28.29-111 Connecting More Efficiently CSE 4701 DBMS Helper Processes CGI Script or JDBC Invocation Web Server Internet To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection Newly Invoked CGI Scripts Connect to a Preexisting Helper Process System is Still Stateless Browser Chaps26.28.29-112 DB-Internet Architecture CSE 4701 WWW Client (Netscape) WWW client (Info. Explore) WWW Client (HotJava) Internet HTTP Server DBWeb Gateway DBWeb Gateway DBWeb Gateway DBWeb Dispatcher DBWeb Gateway Chaps26.28.29-113 EJB Architecture CSE 4701 Chaps26.28.29-114 Technology Push CSE 4701 Computer/Communication Technology (Almost Free) Plenty of Affordable CPU, Memory, Disk, Network Bandwidth Next Generation Internet: Gigabit Now Wireless: Ubiquitous, High Bandwidth Information Growth Massively Parallel Generation of Information on the Internet and from New Generation of Sensors Disk Capacity on the Order of Peta-bytes Small, Handy Devices to Access Information The focus is to make information available to users, in the right form, at the right time, in the appropriate place. Chaps26.28.29-115 Research Challenges CSE 4701 Ubiquitous/Pervasive Many computers and information appliances everywhere, networked together Inherent Complexity: Coping with Latency (Sometimes Unpredictable) Failure Detection and Recovery (Partial Failure) Concurrency, Load Balancing, Availability, Scale Service Partitioning Ordering of Distributed Events “Accidental” Complexity: Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades. Autonomy: Change and Evolve Autonomously Tool Deficiencies: Language Support (Sockets,rpc), Debugging, Etc. Chaps26.28.29-116 Infosphere Problem: too many sources,too much information CSE 4701 Internet: Information Jungle Infopipes Clean, Reliable, Timely Information, Anywhere Digital Earth Personalized Filtering & Info. Delivery Sensors Chaps26.28.29-117 Current State-of-Art – Has Mobile Changed This? CSE 4701 Web Server Mainframe Database Server Thin Client Chaps26.28.29-118 Infosphere Scenario – Where Does Mobile Fit? CSE 4701 Infotaps & Fat Clients Sensors Variety of Servers Many sources Database Server Chaps26.28.29-119 Heterogeneity and Autonomy CSE 4701 Heterogeneity: How Much can we Really Integrate? Syntactic Integration Different Formats and Models XML/JSON/RDF/OWL/SQL Query Languages Semantic Interoperability Basic Research on Ontology, Etc. Autonomy No Central DBA on the Net Independent Evolution of Schema and Content Interoperation is Voluntary Interface Technology DCOM: Microsoft Standard CORBA, Etc... 
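Relating the JDBC slide above to code, a hedged open/query/close sketch against the Employee table used earlier in the query-processing examples; the driver URL and credentials are placeholders and would change per platform (Oracle, Sybase, etc.), while the client code itself stays the same:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// The DriverManager picks the registered driver that matches the URL, so the
// same open/query/close pattern works across DB platforms.
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/orcl";   // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "user", "pw");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT Fname, Lname FROM Employee WHERE Dno = ?")) {
            ps.setInt(1, 5);   // employees of department 5, as in the fragmentation example
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("Fname") + " " + rs.getString("Lname"));
                }
            }
        }   // try-with-resources closes the statement and connection
    }
}
```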
Chaps26.28.29-120 Security and Data Quality CSE 4701 Security System Security in the Broad Sense Attacks: Penetrations, Denial of Service System (and Information) Survivability Security Fault Tolerance Replication for Performance, Availability, and Survivability Data Quality Web Data Quality Problems Local Updates with Global Effects Unchecked Redundancy (Mutual Copying) Registration of Unchecked Information Spam on the Rise Chaps26.28.29-121 Legacy Data Challenge CSE 4701 Legacy Applications and Data Definition: Important and Difficult to Replace Typically, Mainframe Mission Critical Code Most are OLTP and Database Applications Evolution of Legacy Databases Client-server Architectures Wrappers Expensive and Gradual in Any Case Chaps26.28.29-122 Potential Value Added/Jumping on Bandwagon CSE 4701 Sophisticated Query Capability Combining SQL with Keyword Queries Consistent Updates Atomic Transactions and Beyond But Everything has to be in a Database! Only If we Stick with Classic DB Assumptions Relaxing DB Assumptions Interoperable Query Processing Extended Transaction Updates Commodities DB Software A Little Help is Still Good If it is Cheap Internet Facilitates Software Distribution Databases as Middleware Chaps26.28.29-123 Concluding Remarks CSE 4701 Four-Fold Objective Distributed Database Processing Data Warehouses Data Mining of Vast Information Repositories Web-Based Architectures for DB Interoperability All Three are Tightly Related DDBMS can Improve Performance of Mining Repositories as Backend Database Processors Web-Based Architectures Provide Access Means for DDBMS or Mining Warehouses are Infrastructure to Facilitate Mining Geographic Information Systems, Deductive DBMS, Multi-Media DBMS, Mobile DBMS, Embedded/RealTime DBMS, etc. Chaps26.28.29-124