Data Mining

CIS4930 Introduction to Data Mining
Introduction
Peixiang Zhao
Tallahassee, Florida, 2016

Welcome to CIS4930
• Course Website:
  – http://www.cs.fsu.edu/~zhao/cis4930/main.html
  – Everything about the course can be found here
    • Syllabus, announcements, policies, schedules, slides, assignments, resources…
  – Make sure you check the course website periodically
• Please read the class syllabus, policies, and lecture schedule; ask now if you have questions

Teaching Staff
• Instructor: Peixiang Zhao
  – Research interests
    • Generally, data and information science, including database systems and data mining
    • Specifically, graph data, information network analysis, large-scale data-intensive computation and analytics
  – Brief history
    • Illinois (Ph.D. from UIUC)
    • Florida (Assistant professor at FSU since Aug. 2012)
• TA:
  – Yongjiang Liang (liang@cs.fsu.edu)
  – Office hours: Tuesday 10am – 11am

Prerequisites
• Must know how to program, and have a background in data structures and algorithms
  – COP3330: Object-oriented Programming
  – COP4530: Data structures and algorithms
  – Knowledge of probability theory, statistics, and linear algebra

Textbook
• Data Mining: Concepts and Techniques, 3rd edition
  – Jiawei Han, Micheline Kamber, Jian Pei
• References
  – Introduction to Data Mining
  – Data Mining: The Textbook
  – The Elements of Statistical Learning
  – Pattern Recognition and Machine Learning

Course Format
• Two 75-min lectures/week
  – Lecture slides are used to complement the lectures, not to substitute for the textbook
• Four homework assignments (40%)
  – Written assignments and machine problems
    • Datasets or software might be provided
  – Individual work
  – Due right before class starts on the due date
  – No late homework will be accepted
• One midterm (15%) and one final (40%)
  – Check the dates and make sure there is no conflict!
• Quizzes (5%)

You Tell Me
• Why Are You Taking this Course?
  – https://www.youtube.com/watch?v=vbb-AjiXyh0
  – https://www.youtube.com/watch?v=1i6uESo98Yo
  – Data mining tops LinkedIn’s list of the “hottest skills of 2014”
  – Data scientist: the sexiest job of the 21st century (Harvard Business Review)
  – Data scientist: 2015’s hottest profession (Mashable)

Why Data Mining?
• Big Data
• However, we are drowning in data, but starving for knowledge!
  – There is often information “hidden” in the data that is not readily evident
  – Human analysts may take weeks to discover useful information
  – Much of the data is never analyzed at all

What is Data Mining
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
  – Data to be mined
    • Relational databases, data warehouses; Data streams and sensor data; Time-series data, temporal data, sequence data; Graphs, social networks and multi-linked data; Spatial data and spatiotemporal data; Multimedia data; Text data; WWW data
  – Knowledge to be obtained
    • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis

The Goal: Decision Support
• Typical procedure
  – Data → Knowledge → Action/Decision → Goal
• Examples
  – Netflix collects user ratings of movies → What types of movies you will like → Recommend new movies to you → Users stay with Netflix
  – Gene sequences of cancer patients → Which genes lead to cancer? → Appropriate treatment → Save lives
  – Road traffic → Which road is likely to be congested? → Suggest better routes to drivers → Save time and energy

Example: Association Rule Mining
• Data
  – A set of transactions, each of which consists of a set of items
• Association rules
  – A set of rules that characterize associations between items

Market-Basket transactions
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Example: Classification
• Process
  – Construct models (functions) based on training data with known class labels
  – Describe and distinguish classes or concepts for future prediction
  – Predict class labels for testing data whose labels are unknown
• Applications
  – Spam identification
  – Treatment prediction
  – Document categorization
  – ……

Ads Targeting
[Figure: features and class labels; training a classifier f(x) = y that maps features → class labels; testing]

Fraud Detection

Training Set
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

[Figure: Training Set → Learn Classifier → Model, which is then applied to the Test Set]

Example: Clustering
• Goal
  – Finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups

Example: Outlier Detection
• Outliers (Anomalies)
  – Global: observations inconsistent with the rest of the dataset
  – Local:
    • Observations inconsistent with their neighborhoods
    • A local instability or discontinuity
• Applications
  – Fraud/intrusion detection
  – Customized marketing
  – Weather prediction
“One person’s noise could be another person’s signal.” - Edward Ng

Data Mining Tasks
• Prediction Methods: Use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
• Description Methods: Find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Applications, Algorithm, Pattern Recognition, Database Technology, Statistics, Visualization, and High-Performance Computing]

The Top 10 Data Mining Algorithms
1. C4.5: classification
2. K-Means: clustering
3. SVM: classification
4. Apriori: association analysis
5. EM: statistical learning
6. PageRank: link mining
7. AdaBoost: bagging and boosting
8. kNN: classification
9. Naive Bayes: classification
10. CART: classification

Questions
Any questions? Please feel free to raise your hand.
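
Worked example: scoring the association rules

The Association Rule Mining slide lists five market-basket transactions and two discovered rules, but does not say how a rule is scored. The sketch below assumes the two standard measures, support (the fraction of transactions containing every item in the rule) and confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The function names `support` and `confidence` are illustrative only and are not taken from the course materials.

```python
from fractions import Fraction

# The five market-basket transactions from the slide (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


if __name__ == "__main__":
    rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
    for lhs, rhs in rules:
        s = support(lhs | rhs)
        c = confidence(lhs, rhs)
        print(sorted(lhs), "-->", sorted(rhs), "support:", s, "confidence:", c)
```

Under these assumed definitions, {Milk} --> {Coke} has support 3/5 and confidence 3/4, and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3. Exact fractions are used instead of floats so the values can be checked by hand against the transaction table.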
