Data Warehousing and Data Mining in e-Government

Inmon's definition
A data warehouse is a
- subject-oriented,
- integrated,
- time-variant,
- nonvolatile
collection of data in support of management's decision-making process.

Subject-oriented
- A data warehouse is organized around subjects such as sales, product, and customer.
- It focuses on modeling and analysis of data for decision makers.
- It excludes data that is not useful in the decision support process.

Integrated
- A data warehouse is constructed by integrating multiple heterogeneous sources.
- Data preprocessing is applied to ensure consistency.
[Figure: RDBMS, legacy system, and flat file sources pass through data processing and data transformation into the data warehouse.]

Time-variant
- Provides information from a historical perspective, e.g. the past 5-10 years.
- Every key structure contains, either implicitly or explicitly, an element of time.

Nonvolatile
- Data once recorded cannot be updated.
- A data warehouse requires only two operations in data accessing:
  - initial loading of data
  - access of data
[Figure: load and access operations on the data warehouse.]

Operational v/s Information System

Features                     Operational                            Information
Characteristics              Operational processing                 Informational processing
Orientation                  Transaction                            Analysis
User                         Clerk, DBA, database professional      Knowledge workers
Function                     Day-to-day operation                   Decision support
Data                         Current                                Historical
View                         Detailed, flat relational              Summarized, multidimensional
DB design                    Application oriented                   Subject oriented
Unit of work                 Short, simple transaction              Complex query
Access                       Read/write                             Mostly read
Focus                        Data in                                Information out
Number of records accessed   Tens                                   Millions
Number of users              Thousands                              Hundreds
DB size                      100 MB to GB                           100 GB to TB
Priority                     High performance, high availability    High flexibility, end-user autonomy
Metric                       Transaction throughput                 Query throughput

Data Warehousing Architecture
[Figure: Data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the data warehouse (reconciled data) and data marts, supported by monitoring & administration and a metadata repository. OLAP servers serve the tools: analysis, query/reporting, and data mining.]

Data Warehouse Architecture
- Data warehouse server: almost always a relational DBMS, rarely flat files
- OLAP servers: support and operate on multi-dimensional data structures
- Clients:
  - Query and reporting tools
  - Analysis tools
  - Data mining tools

Building Data Warehouse
- Data selection
- Data preprocessing
  - Fill missing values
  - Remove inconsistency
- Data transformation & integration
- Data loading
Data in the warehouse is stored in the form of fact tables and dimension tables.

Case Study
- Afco Foods & Beverages is a new company which produces dairy, bread and meat products, with its production unit located at Baroda.
- Their products are sold in the North, North West and Western regions of India.
- They have sales units at Mumbai, Pune, Ahmedabad, Delhi and Baroda.
- The President of the company wants sales information.

Sales Information
Report: The number of units sold.
  113

Report: The number of units sold over time

January   February   March   April
14        41         33      25

Sales Information
Report: The number of items sold for each product with time
Columns: Jan, Feb, Mar, Apr

Wheat Bread   6    17   8
Cheese        6    16   6
Swiss Rolls   8    25   21

Sales Information
Report: The number of items sold in each city for each product with time
Columns: Jan, Feb, Mar, Apr

City     Product       Units
Mumbai   Wheat Bread   3    10   7    8
Mumbai   Cheese        3    16   6
Mumbai   Swiss Rolls   4    16   6
Pune     Wheat Bread   3
Pune     Cheese        3
Pune     Swiss Rolls   4    9    15

Sales Information
Report: The number of items sold and income in each region for each product with time.
Columns: Jan, Feb, Mar, Apr; each entry is Rs / U (rupees / units).

City     Product       Rs / U
Mumbai   Wheat Bread   7.44 / 3    24.80 / 10   17.36 / 7    21.20 / 8
Mumbai   Cheese        7.95 / 3    42.40 / 16   15.90 / 6
Mumbai   Swiss Rolls   7.32 / 4    29.98 / 16   10.98 / 6
Pune     Wheat Bread   7.44 / 3
Pune     Cheese        7.95 / 3
Pune     Swiss Rolls   7.32 / 4    16.47 / 9    27.45 / 15

Sales Measures & Dimensions
- Measures: units sold, amount.
- Dimensions: product, time, region.
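To make the split between measures and dimensions concrete, here is a minimal Python sketch that is not part of the original slides. It assumes each sales record carries the three dimensions (product, time, region) plus the two measures (units, amount), and that a report is simply the measures summed over a chosen subset of the dimensions. The function name roll_up and the sample rows are illustrative assumptions; the figures echo the case-study reports, but the month attached to each row is assumed.

```python
from collections import defaultdict

# Illustrative records: (product, month, city, units, rupees).
# The figures echo the case-study reports; the month assigned to each
# row is an assumption made for this sketch.
SALES = [
    ("Cheese", "Jan", "Mumbai", 3, 7.95),
    ("Swiss Rolls", "Jan", "Mumbai", 4, 7.32),
    ("Cheese", "Jan", "Pune", 3, 7.95),
    ("Swiss Rolls", "Jan", "Pune", 4, 7.32),
    ("Cheese", "Feb", "Mumbai", 16, 42.40),
]

# Position of each dimension inside a record.
DIMENSIONS = ("product", "month", "city")


def roll_up(records, group_by):
    """Sum the measures (units, rupees) per combination of the chosen dimensions."""
    idx = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(lambda: [0, 0.0])
    for rec in records:
        key = tuple(rec[i] for i in idx)
        totals[key][0] += rec[3]   # measure: units sold
        totals[key][1] += rec[4]   # measure: amount in rupees
    return {key: (units, round(rupees, 2)) for key, (units, rupees) in totals.items()}


if __name__ == "__main__":
    print(roll_up(SALES, ["month"]))            # units and amount over time
    print(roll_up(SALES, ["product", "city"]))  # per product and city
    print(roll_up(SALES, []))                   # grand total
```

Grouping by fewer dimensions gives a more summarized report, and grouping by none gives the single grand total, which is how the case-study reports relate to one another.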
Sales Data Warehouse Model
Fact Table

City     Product       Month      Units   Rupees
Mumbai   Wheat Bread   January    3       7.95
Mumbai   Cheese        January    4       7.32
Pune     Wheat Bread   January    3       7.95
Pune     Cheese        January    4       7.32
Mumbai   Swiss Rolls   February   16      42.40

Sales Data Warehouse Model

City_ID   Prod_ID   Month      Units   Rupees
1         589       1/1/1998   3       7.95
1         1218      1/1/1998   4       7.32
2         589       1/1/1998   3       7.95
2         1218      1/1/1998   4       7.32
1         589       2/1/1998   16      42.40

Sales Data Warehouse Model
Product Dimension Tables

Prod_ID   Product_Name      Product_Category_ID
589       Wheat Bread       1
590       White Bread       1
288       Coconut Cookies   2

Product_Category_ID   Product_Category
1                     Bread
2                     Cookies

Sales Data Warehouse Model
Region Dimension Table

City_ID   City     Region      Country
1         Mumbai   West        India
2         Pune     NorthWest   India

Sales Data Warehouse Model
[Figure: the Sales Fact table linked to the Time, Region, and Product dimensions, with Product linked to Product Category.]

Online Analytical Processing (OLAP)
It enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
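As a hedged illustration of how the fact and dimension tables above support different views of the same data, the sketch below loads the rows shown on the slides into an in-memory SQLite database and answers two questions with joins. It is a minimal sketch, not the deck's implementation: the table and column names (sales_fact, region_dim, product_dim) and the helper functions are assumptions, and the Month values are stored as text exactly as written. Prod_ID 1218 has no row in the product dimension table shown, so the product query uses a LEFT JOIN and returns no name for it.

```python
import sqlite3


def build_warehouse():
    """Create the fact and dimension tables and load the rows from the slides."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE region_dim (city_id INTEGER PRIMARY KEY, city TEXT,
                                 region TEXT, country TEXT);
        CREATE TABLE product_dim (prod_id INTEGER PRIMARY KEY, product_name TEXT,
                                  product_category_id INTEGER);
        CREATE TABLE sales_fact (city_id INTEGER, prod_id INTEGER, month TEXT,
                                 units INTEGER, rupees REAL);
    """)
    conn.executemany("INSERT INTO region_dim VALUES (?, ?, ?, ?)", [
        (1, "Mumbai", "West", "India"),
        (2, "Pune", "NorthWest", "India"),
    ])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)", [
        (589, "Wheat Bread", 1),
        (590, "White Bread", 1),
        (288, "Coconut Cookies", 2),
    ])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
        (1, 589, "1/1/1998", 3, 7.95),
        (1, 1218, "1/1/1998", 4, 7.32),
        (2, 589, "1/1/1998", 3, 7.95),
        (2, 1218, "1/1/1998", 4, 7.32),
        (1, 589, "2/1/1998", 16, 42.40),
    ])
    return conn


def sales_by_region(conn):
    """Roll the fact rows up to the Region level of the region dimension."""
    return conn.execute("""
        SELECT r.region, SUM(f.units), ROUND(SUM(f.rupees), 2)
        FROM sales_fact AS f
        JOIN region_dim AS r ON f.city_id = r.city_id
        GROUP BY r.region
        ORDER BY r.region
    """).fetchall()


def units_by_product(conn):
    """Total units per product; the name is NULL when the dimension row is missing."""
    return conn.execute("""
        SELECT f.prod_id, p.product_name, SUM(f.units)
        FROM sales_fact AS f
        LEFT JOIN product_dim AS p ON f.prod_id = p.prod_id
        GROUP BY f.prod_id, p.product_name
        ORDER BY f.prod_id
    """).fetchall()


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(sales_by_region(warehouse))
    print(units_by_product(warehouse))
```

The fact table holds only keys and measures; descriptive attributes such as Region live in the dimension tables, so a new view of the data is a different join and GROUP BY rather than a new table.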
[Figure: data warehouse viewed along the Product and Time dimensions.]

OLAP Cube

City     Product       Time    Units   Dollars
All      All           All     113     251.26
Mumbai   All           All     64      146.07
Mumbai   White Bread   All     38      98.49
Mumbai   Wheat Bread   All     13      32.24
Mumbai   Wheat Bread   Qtr1    3       7.44
Mumbai   Wheat Bread   March   3       7.44

OLAP Operations
Drill Down
[Figure: Product dimension against Time, moving from Category (e.g. Electrical Appliance) to Sub Category (e.g. Kitchen) to Product (e.g. Toaster).]

OLAP Operations
Drill Up
[Figure: Product dimension against Time, moving from Product (e.g. Toaster) to Sub Category (e.g. Kitchen) to Category (e.g. Electrical Appliance).]

OLAP Operations
Slice and Dice
[Figure: Product against Time, restricted to Product = Toaster.]

OLAP Operations
Pivot
[Figure: the Product, Time, and Region axes rotated to give a different view.]

OLAP Server
- An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.
- OLAP servers available are:
  - MOLAP server
  - ROLAP server
  - HOLAP server

Presentation
[Figure: Product and Time data pass through a reporting tool to produce a report.]

Data Warehousing includes
- Building the data warehouse
- Online analytical processing (OLAP)
- Presentation
[Figure: RDBMS and flat file sources pass through cleaning, selection & integration into the warehouse & OLAP server, then through presentation to the client.]

Need for Data Warehousing
- Industry has a huge amount of operational data.
- Knowledge workers want to turn this data into useful information.
- This information is used by them to support strategic decision making.

Need for Data Warehousing (contd..)
- It is a platform for consolidated historical data for analysis.
- It stores data of good quality so that knowledge workers can make correct decisions.

Need for Data Warehousing (contd..)
From a business perspective
- it is the latest marketing weapon
- it helps to keep customers by learning more about their needs
- it is a valuable tool in today's competitive, fast-evolving world
Data Warehousing Tools
- Data warehouse
  - SQL Server 2000 DTS
  - Oracle 8i Warehouse Builder
- OLAP tools
  - SQL Server Analysis Services
  - Oracle Express Server
- Reporting tools
  - MS Excel Pivot Chart
  - VB Applications

Purpose of the DW
- Make information accessible
- Make information consistent

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

Knowledge Discovery (KDD) Process
- Data mining: core of the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining and Business Intelligence
[Figure: layered pyramid with increasing potential to support business decisions, listed here from top to bottom.]
- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, and Other Disciplines.]

Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Multi-Dimensional View of Data Mining
- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Multimedia database
  - Text databases
  - The World-Wide Web

Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy

Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- It includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining

Data Warehousing and Data Mining in Government
- Data warehousing and data mining technologies have extensive potential application in government, in areas such as agriculture, rural development, health and energy, and in the national activities of government.

National Data Warehouses
- Census data
  - A data warehouse can be built from this database, upon which OLAP techniques can be applied.
  - Data mining can also be performed for analysis and knowledge discovery.
- Prices of essential commodities

Applications of Data Warehousing and Data Mining in e-Government
- Agriculture
- Rural development
- Health
- Planning
- Education
- Commerce and trade
- Tourism