CIS4930 Introduction to Data Mining
Introduction
Peixiang Zhao
Tallahassee, Florida, 2016

Welcome to CIS4930
• Course Website:
  – http://www.cs.fsu.edu/~zhao/cis4930/main.html
  – Everything about the course can be found here
    • Syllabus, announcements, policies, schedules, slides, assignments, resources…
  – Make sure you check the course website periodically
• Please read the class syllabus, policies, and lecture schedule; ask now if you have questions

Teaching Staff
• Instructor: Peixiang Zhao
  – Research interests
    • Generally, data and information science, including database systems and data mining
    • Specifically, graph data, information network analysis, large-scale data-intensive computation and analytics
  – Brief history
    • Illinois (Ph.D. from UIUC)
    • Florida (Assistant Professor at FSU since Aug. 2012)
• TA:
  – Yongjiang Liang (liang@cs.fsu.edu)
  – Office hours: Tuesday 10am – 11am

Prerequisites
• Must know how to program, and have a background in data structures and algorithms
  – COP3330: Object-oriented Programming
  – COP4530: Data structures and algorithms
  – Knowledge of probability theory, statistics, and linear algebra

Textbook
• Data Mining: Concepts and Techniques, 3rd edition
  – Jiawei Han, Micheline Kamber, Jian Pei
• References
  – Introduction to Data Mining
  – Data Mining: The Textbook
  – The Elements of Statistical Learning
  – Pattern Recognition and Machine Learning

Course Format
• Two 75-min lectures/week
  – Lecture slides complement the lectures; they do not substitute for the textbook
• Four homework assignments (40%)
  – Written assignments and machine problems
    • Datasets or software might be provided
  – Individual work
  – Due right before class starts on the due date
  – No late homework will be accepted
• One midterm (15%) and one final (40%)
  – Check the dates and make sure there is no conflict!
• Quizzes (5%)

You Tell Me
• Why Are You Taking this Course?
  – https://www.youtube.com/watch?v=vbb-AjiXyh0
  – https://www.youtube.com/watch?v=1i6uESo98Yo
  – Data mining tops LinkedIn’s list of the “hottest skills of 2014”
  – Data scientist: the sexiest job of the 21st century (Harvard Business Review)
  – Data scientist: 2015’s hottest profession (Mashable)

Why Data Mining?
• Big Data
• However, we are drowning in data, but starving for knowledge!
  – There is often information “hidden” in the data that is not readily evident
  – Human analysts may take weeks to discover useful information
  – Much of the data is never analyzed at all

What is Data Mining?
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
  – Data to be mined
    • Relational databases, data warehouses
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data
    • Graphs, social networks and multi-linked data
    • Spatial data and spatiotemporal data
    • Multimedia data
    • Text data
    • WWW data
  – Knowledge to be obtained
    • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis

The Goal: Decision Support
• Typical procedure
  – Data → Knowledge → Action/Decision → Goal
• Examples
  – Netflix collects user ratings of movies → What types of movies you will like → Recommend new movies to you → Users stay with Netflix
  – Gene sequences of cancer patients → Which genes lead to cancer? → Appropriate treatment → Save lives
  – Road traffic → Which road is likely to be congested?
    → Suggest better routes to drivers → Save time and energy

Example: Association Rule Mining
• Data
  – A set of transactions, each of which consists of a set of items
• Association rules
  – A set of rules that characterize associations between items

Market-Basket transactions
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Example: Classification
• Process
  – Construct models (functions) based on training data with known class labels
  – Describe and distinguish classes or concepts for future prediction
  – Predict the class labels of testing data, whose labels are unknown
• Applications
  – Spam identification
  – Treatment prediction
  – Document categorization
  – …

Ads Targeting
• Training: features and class labels
• A classifier: f(x) = y, from features to class labels
• Testing

Fraud Detection
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: Training Set → Learn Classifier → Model; Test Set → Model]

Example: Clustering
• Goal
  – Find groups of objects such that the objects in a group are similar to one another and different from the objects in other groups

Example: Outlier Detection
• Outliers (Anomalies)
  – Global: observations inconsistent with the rest of the dataset
  – Local:
    • Observations inconsistent with their neighborhoods
    • A local instability or discontinuity
• Applications
  – Fraud/intrusion detection
  – Customized marketing
  – Weather prediction

One person’s noise could be another person’s signal.
  - Edward Ng

Data Mining Tasks
• Prediction Methods: Use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
• Description Methods: Find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data Mining: Confluence of Multiple Disciplines
• Data Mining (center of the diagram), surrounded by:
  – Machine Learning
  – Pattern Recognition
  – Statistics
  – Applications
  – Algorithm
  – Database Technology
  – Visualization
  – High-Performance Computing

The Top 10 Data Mining Algorithms
1. C4.5: classification
2. K-Means: clustering
3. SVM: classification
4. Apriori: association analysis
5. EM: statistical learning
6. PageRank: link mining
7. AdaBoost: bagging and boosting
8. kNN: classification
9. Naive Bayes: classification
10. CART: classification

Questions
Any questions? Please feel free to raise your hands.
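
The association rule example from earlier in the lecture can be checked by hand or with a few lines of code. The sketch below is only an illustration, not the course's reference implementation: it scores the two discovered rules on the five market-basket transactions using support and confidence, the usual rule-quality measures, which these slides do not define and which are assumed here. The names `TRANSACTIONS`, `support`, and `confidence` are hypothetical, and the pairing of item sets with TIDs follows the order in which they appear on the slide.

```python
from typing import Iterable, List, Set

# The five market-basket transactions from the slide (TID 1..5, order assumed).
TRANSACTIONS: List[Set[str]] = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= t)  # `<=` is the subset test
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    ante = set(antecedent)
    both = ante | set(consequent)
    covered = sum(1 for t in transactions if ante <= t)
    if covered == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return sum(1 for t in transactions if both <= t) / covered


if __name__ == "__main__":
    rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
    for ante, cons in rules:
        s = support(ante | cons, TRANSACTIONS)
        c = confidence(ante, cons, TRANSACTIONS)
        print(f"{sorted(ante)} --> {sorted(cons)}: support={s:.2f}, confidence={c:.2f}")
```

Under these assumptions, {Milk} --> {Coke} holds in 3 of the 5 transactions (support 0.6) and in 3 of the 4 transactions that contain Milk (confidence 0.75); {Diaper, Milk} --> {Beer} has support 0.4 and confidence 2/3. An algorithm such as Apriori searches for all rules above chosen thresholds instead of scoring a fixed list as done here.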