DATA MINING

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
  UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning

Database Processing vs. Data Mining Processing
Database processing:
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database
Data mining processing:
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)

Data Mining Models and Tasks
[Figure: data mining models and tasks]

Basic Data Mining Tasks
• Classification maps data into predefined groups or classes.
  – Supervised learning
  – Pattern recognition
  – Prediction
• Regression is used to map a data item to a real-valued prediction variable.
• Clustering groups similar data together into clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning

Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential patterns.
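
The contrast drawn on the Query Examples slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the baskets in `TRANSACTIONS`, the function names, and the `min_confidence` threshold are all hypothetical and do not come from the slides. The first function answers a precise database-style query and returns a subset of the data. The second answers the mining-style question "which items are frequently purchased with milk?" by counting co-occurrences, which is the basic idea behind association rules.

```python
from collections import Counter

# Hypothetical market-basket data (not from the slides).
TRANSACTIONS = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]


def customers_who_bought(item, transactions):
    """Database-style query: a precise predicate whose result is a subset of the data."""
    return [i for i, basket in enumerate(transactions) if item in basket]


def frequently_bought_with(item, transactions, min_confidence=0.5):
    """Mining-style question: which other items co-occur with `item` often enough?

    Returns {other_item: confidence}, where confidence is the fraction of
    baskets containing `item` that also contain `other_item`.
    """
    with_item = [basket for basket in transactions if item in basket]
    if not with_item:
        return {}
    counts = Counter(other for basket in with_item for other in basket if other != item)
    return {
        other: count / len(with_item)
        for other, count in counts.items()
        if count / len(with_item) >= min_confidence
    }


if __name__ == "__main__":
    print(customers_who_bought("milk", TRANSACTIONS))
    print(frequently_bought_with("milk", TRANSACTIONS))
```

Note that the second result is not a subset of the stored rows: it is derived information, and how much of it is returned depends on a threshold the analyst chooses.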

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

KDD Process Ex: Web Log
• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs
  – Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns
  – Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization

Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

Social Implications of DM
• Privacy
• Profiling
• Unauthorized use

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use

Classification vs. Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
• Prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit approval
  – Target marketing

March 24, 2016  Data Mining: Concepts and Techniques

Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
  – Estimate accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • Accuracy rate is the percentage of test set samples that are correctly classified by the model
    • Test set is independent of training set, otherwise over-fitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction
[Figure: Training Data, Classification Algorithms, Classifier (Model)]

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction
[Figure: Classifier, Testing Data, Unseen Data]

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)  Tenured?
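
The two-step process on the preceding slides can be traced with a short Python sketch. The rows below are copied from the Training Data and Testing Data tables. The rule is hard-coded from the Model Construction slide rather than learned by an algorithm, so this illustrates only the model usage step (accuracy estimation and classifying an unseen tuple). The function and variable names are assumptions made for the example.

```python
# Rows copied from the slides: (name, rank, years, tenured).
TRAINING = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
TESTING = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(rows):
    """Fraction of rows whose known label matches the rule's output."""
    correct = sum(1 for _, rank, years, label in rows if classify(rank, years) == label)
    return correct / len(rows)


if __name__ == "__main__":
    print("training accuracy:", accuracy(TRAINING))
    print("test accuracy:", accuracy(TESTING))
    # Unseen tuple from the slide: (Jeff, Professor, 4)
    print("Jeff tenured?", classify("Professor", 4))
```

The rule fits every training row but misclassifies one test row (Merlisa), which is why accuracy is estimated on a test set that is independent of the training set.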

Ex2: Illustrating Classification Task
[Figure: Training Set, Learning algorithm, Induction, Learn Model, Model, Apply Model, Deduction, Test Set]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues: Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data

Issues: Evaluating Classification Methods
• Accuracy
  – classifier accuracy: predicting class label
  – predictor accuracy: guessing value of predicted attributes
• Speed
  – time to construct the model (training time)
  – time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Related Concepts Outline
Goal: Examine some areas which are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching

Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about “data mining”.
DM: Similarity measures; Mine text/Web data.

IR Query Result Measures and Classification
[Figure: IR query result measures compared with classification]

Dimensional Modeling
• View data in a hierarchical manner, more as business executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes; axis for modeling data.
• Facts: data stored
• Ex:
  – Dimensions: products, locations, date
  – Facts: quantity, unit price
DM: May view data as dimensional.

Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25

Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.

Cube view of Data
[Figure: cube view of data]

Aggregation Hierarchies
[Figure: aggregation hierarchies]

Data Warehousing
• “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
• Operational Data: Data used in the day-to-day needs of a company.
• Informational Data: Supports other functions such as planning and forecasting.
• Data mining tools often access data warehouses rather than operational data.
DM: May access data in warehouse.

Operational vs. Informational
             Operational Data    Data Warehouse
Application  OLTP                OLAP
Use          Precise Queries     Ad Hoc
Temporal     Snapshot            Historical
Modification Dynamic             Static
Orientation  Application         Business
Data         Operational Values  Integrated
Size         Gigabits            Terabits
Level        Detailed            Summarized
Access       Often               Less Often
Response     Few Seconds         Minutes
Data Schema  Relational          Star/Snowflake

OLAP
• Online Analytic Processing (OLAP): provides more complex queries than OLTP.
• OnLine Transaction Processing (OLTP): traditional database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
  – Slice: examine sub-cube.
  – Dice: rotate cube to look at another dimension.
  – Roll Up/Drill Down
DM: May use OLAP queries.

OLAP Operations
[Figure: Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]
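
Roll up, drill down, and slice can be sketched over flat sales rows in Python. The rows below are transcribed from the Relational View of Data slide as best they can be read from the slide text, so treat the individual values as an assumption. The `Sale` tuple and the function names are hypothetical. Rolling up here means aggregating away dimensions (location and date) to get totals per product. Drilling down keeps more dimensions, and slicing fixes one dimension to a single value.

```python
from collections import defaultdict, namedtuple

Sale = namedtuple("Sale", "prod_id loc_id date quantity unit_price")

# Transcribed from the "Relational View of Data" slide (values are a best reading).
SALES = [
    Sale(123, "Dallas", "022900", 5, 25),
    Sale(123, "Houston", "020100", 10, 20),
    Sale(150, "Dallas", "031500", 1, 100),
    Sale(150, "Dallas", "031500", 5, 95),
    Sale(150, "Fort Worth", "021000", 5, 80),
    Sale(150, "Chicago", "012000", 20, 75),
    Sale(200, "Seattle", "030100", 5, 50),
    Sale(300, "Rochester", "021500", 200, 5),
    Sale(500, "Bradenton", "022000", 15, 20),
    Sale(500, "Chicago", "012000", 10, 25),
]


def aggregate_quantity(rows, dimensions):
    """Sum the quantity fact over every dimension not listed in `dimensions`.

    Fewer dimensions gives a roll up; more dimensions gives a drill down.
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(getattr(row, dim) for dim in dimensions)
        totals[key] += row.quantity
    return dict(totals)


def slice_by_location(rows, loc_id):
    """Slice: fix the location dimension to one value and keep the sub-cube."""
    return [row for row in rows if row.loc_id == loc_id]


if __name__ == "__main__":
    # Roll up: total quantity per product, across all locations and dates.
    print(aggregate_quantity(SALES, ["prod_id"]))
    # Drill down: total quantity per (product, location).
    print(aggregate_quantity(SALES, ["prod_id", "loc_id"]))
    # Slice on Dallas, then roll up within the slice.
    print(aggregate_quantity(slice_by_location(SALES, "Dallas"), ["prod_id"]))
```

In SQL the same roll up would be a `GROUP BY` with an aggregate, which is what the slide means by "SQL uses aggregation".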