Introduction to Data Mining
MIT-652 Data Mining Applications
Thimaporn Phetkaew
School of Informatics, Walailak University
MIT-652: DM 1: Introduction to Data Mining

Introduction
- Motivation: Why data mining?
- What is data mining?
- Data mining: On what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Data mining task primitives
- Major issues in data mining
- Summary

Motivation: “Necessity is the Mother of Invention”
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing (OLAP)
  - Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: Data mining and data warehousing, multimedia databases, and Web databases
- 2000s: Data mining and its applications, Web technology (XML, data integration) and global information systems

What Is Data Mining?
- Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
- Alternative names
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything “data mining”?
  - Simple search and query processing
  - Expert systems

Potential Data Mining Applications
- Database analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (news group, email, documents) and Web mining
  - Bioinformatics and bio-data analysis

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
- [Figure: KDD pipeline. Databases -> data cleaning and data integration -> data warehouse -> data selection and data transformation -> task-relevant data -> data mining -> pattern evaluation]

Steps of a KDD Process
- Data cleaning: to remove noise and incomplete data
- Data integration: multiple data sources may be combined
- Data selection: data relevant to the analysis task are retrieved from the database or repository
- Data transformation: data are transformed into forms appropriate for mining
- Data mining: intelligent methods are applied in order to extract data patterns
- Pattern evaluation: to identify the truly interesting patterns representing knowledge
- Knowledge presentation: where visualization and knowledge
  representation techniques are used to present the mined knowledge to the user

Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
- Layers, top to bottom:
  - Decision making
  - Data presentation: visualization techniques
  - Data mining: information discovery
  - Data exploration: statistical summary, querying, and reporting
  - Data preprocessing/integration, data warehouses
  - Data sources: paper, files, Web documents, scientific experiments, database systems
- Roles, top to bottom: end user, business analyst, data analyst, DBA

Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge base
- Database or data warehouse server
- Data cleaning, integration, and selection
- Data sources: database, data warehouse, World-Wide Web, other information repositories

Data Mining: On What Kind of Data?
- Relational databases
- Transactional databases
- Data warehouses
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases
  - Multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Relational Databases
- A collection of tables, each consisting of a set of tuples (records or rows)
- Each tuple has a set of attributes (columns or fields)
- A tuple is identified by a unique key
- SQL query example: “Show me total sales of last year grouped by branch”
- Data mining can search for trends and data patterns, e.g.
  - Detect deviations, such as items whose sales are far from those expected compared to last year
  - Predict the credit risk of new customers based on their income, age, and credit history

Transactional Databases
- Consist of a file where each record represents a transaction (TransID + ItemList, such as items purchased in a store)
- Example query: How many transactions include item #13?
- Data mining: market basket analysis identifies frequent itemsets to help maximize sales
  - Knowing that printers are commonly purchased together with computers, a store can discount an expensive printer model with the purchase of selected computers (hoping to sell more expensive printers)
  - Shelve diapers, beer, and potato chips near each other to increase sales

Data Warehouses
- A repository of information collected from multiple sources, organized under a unified schema at a single site in order to facilitate management decision making
- Provides a multidimensional view of data and allows precomputation and fast access to summarized data
- Data warehouse systems are well suited for On-line Analytical Processing (OLAP)
- Data mining uses data warehouse tools to support data analysis
  - OLAP operations such as drill-down and roll-up

Object-Oriented Databases
- Based on the OO programming paradigm
- Data and code are encapsulated into a single unit
- Each entity is considered an object, associated with
  - A set of variables that describe the object (attributes in the E-R model)
  - A set of messages the object uses to communicate with other objects
  - A set of methods, each of which returns a value in response to a received message
- Superclass/subclass with inheritance benefits information sharing

Object-Relational Databases
- Extend the basic relational model by adding the power to handle complex data types, class hierarchies, and object inheritance
- Data mining in OO and OR systems shares some similarities with
  relational data mining
  - Techniques need to be developed for handling complex object structures, class and subclass hierarchies, property inheritance, and methods and procedures

Spatial Databases
- Include geographic (map) databases, medical image databases, and satellite image databases
- Applications: forestry and ecology planning, vehicle navigation, and providing public service information on the location of telephone and electric cables, pipes, and sewage systems
- Data mining uncovers patterns describing
  - Characteristics of houses located near a park
  - Climate of mountainous areas located at various altitudes

Time-series and Temporal Databases
- A time-series database stores sequences of values that change with time, e.g. stock exchange data
- A temporal database stores relational data that include time-related attributes
- Data mining finds characteristics of object evolution or the trend of changes for objects in the database, e.g.
  - Aid in scheduling bank tellers according to the volume of customer traffic
  - Uncover trends that could help investment strategy planning (when is the best time to purchase XXX stock?)

Text Databases
- Contain word descriptions for objects, such as library databases, articles, product specifications, and bug reports
- Data mining may uncover keyword or content associations and the clustering behavior of text objects, e.g.
  search engines cluster documents according to the words they contain

Multimedia Databases
- Store text, image, audio, and video data
- Applications: content-based picture retrieval, voice mail systems, speech-based user interfaces
- Data mining requires integrated storage and search techniques that support real-time retrieval of continuous media data

Heterogeneous and Legacy Databases
- A heterogeneous database consists of a set of interconnected, autonomous component databases
- A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or OO databases, spreadsheets, or file systems
- Data mining may provide an interesting solution to the information exchange problem by transforming the given data into higher, more generalized, conceptual levels

World Wide Web
- A huge, widely distributed information repository made available by the Internet
- Data mining: mining path traversal patterns to understand user access patterns
  - Helps provide efficient access between highly correlated objects (improves system design)
  - Leads to better marketing decisions, e.g. placing advertisements
  - Provides better customer classification and behavior analysis

Data Mining Functionalities
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association (correlation and causality)
  - Multi-dimensional vs.
    single-dimensional association
    - age(X, "20..29") ^ income(X, "20..29K") -> buys(X, "PC") [support = 20%, confidence = 60%]
    - contains(X, "computer") -> contains(X, "software") [10%, 75%]
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered noise or an exception, but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining: e.g., digital camera -> large SD memory
  - Periodicity analysis
  - Similarity-based analysis

Are All the “Discovered” Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, information science, algorithms, visualization, and other disciplines]

Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining: characterizes general properties of the data in the database, e.g.
    association rules
  - Predictive data mining: performs inference on current data in order to make predictions
- Different views, different classifications
  - Data view: kinds of databases to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted
- Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

What Defines a Data Mining Task?
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization of discovered patterns

Task-Relevant Data
- Database or data warehouse name
- Database tables or data cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria

Types of Knowledge to Be Mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks

Types of Knowledge to Be Mined: Metapatterns
- Metapatterns: templates that all discovered patterns must match
- For example, a metarule for association rules:
  - P(X: customer, W) ^ Q(X, Y) ⇒ buys(X, Z)
- Rules that match this metarule:
  - age(X, "30..39") ^ income(X, "40K..49K") ⇒ buys(X, "VCR") [2.2%, 60%]
  - occupation(X, "student") ^ age(X, "20..29") ⇒ buys(X, "Computer") [1.4%, 70%]

Background Knowledge: Concept Hierarchies
- Schema hierarchy
  - E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged, {60-89} = senior
- Operation-derived hierarchy
  - E.g., email address: name@wu.ac.th
  - login-name < department < university < country
- Rule-based hierarchy
  - low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

Measurements of Pattern Interestingness
- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, contributes new information, e.g., exceptions

Visualization of Discovered Patterns
- Different backgrounds/usages may require different forms of representation
  - E.g., rules, tables, crosstabs, pie/bar charts, etc.
- Concept hierarchy is also important
  - Discovered knowledge might be more understandable when represented at a high level of abstraction
  - Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representations: association, classification, clustering, etc.

Major Issues in Data Mining
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, and incremental mining methods

Summary
- Database technology: evolved from primitive file processing to the development of database management systems
- Data mining: discovering interesting patterns from large amounts of data in a variety of information repositories
- A KDD process: includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Pattern interestingness: easily understood, valid (with some degree of certainty), useful, novel
- Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process
- Data mining systems: can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted
- Five primitives for specification of a data mining task
  - Task-relevant data (i.e., the data set to be mined)
  - Kind of knowledge to be mined
  - Background knowledge
  - Interestingness measures
  - Knowledge presentation and visualization techniques to be used for displaying the discovered patterns
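
Worked example: support and confidence
The bracketed figures attached to the association rules in these slides, such as [support = 20%, confidence = 60%], can be computed from a transactional database roughly as sketched below. This is only an illustration and is not part of the course material: the toy transactions and the helper names `support` and `confidence` are assumptions. Confidence follows the formula given under certainty, P(A|B) = n(A and B) / n(B), with B taken as the rule's left-hand side.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions (ItemList only; TransID omitted).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"computer", "software", "printer"}),
    frozenset({"computer", "software"}),
    frozenset({"computer", "printer"}),
    frozenset({"software"}),
    frozenset({"diapers", "beer"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """n(antecedent and consequent) / n(antecedent)."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if both <= t)
    if n_lhs == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return n_both / n_lhs


# Rule: contains(X, "computer") -> contains(X, "software")
rule_support = support({"computer", "software"}, TRANSACTIONS)
rule_confidence = confidence({"computer"}, {"software"}, TRANSACTIONS)
print(f"computer -> software [{rule_support:.0%}, {rule_confidence:.0%}]")
```

On this toy data the rule has 40% support (2 of 5 transactions contain both items) and about 67% confidence (2 of the 3 transactions that contain a computer also contain software). Support is an objective measure of utility and confidence an objective measure of certainty, in the sense used in the interestingness slides.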