Data Mining

CIS4930
Introduction to Data Mining
Introduction
Peixiang Zhao
Tallahassee, Florida, 2016
Welcome to CIS4930
• Course Website:
– http://www.cs.fsu.edu/~zhao/cis4930/main.html
– Everything about the course can be found here
• Syllabus, announcements, policies, schedules, slides,
assignments, resources, …
– Make sure you check the course website periodically
• Please read the class syllabus, policies, and lecture
schedule; ask now if you have questions
Teaching Staff
• Instructor: Peixiang Zhao
– Research interest
• Generally, data and information science including database
systems and data mining
• Specifically, graph data, information network analysis, large-scale
data-intensive computation and analytics
– Brief history
• Illinois (Ph.D. from UIUC)
• Florida (Assistant professor at FSU since Aug. 2012)
• TA:
– Yongjiang Liang (liang@cs.fsu.edu)
– Office hours: Tuesday 10am – 11am
Prerequisite
• Must know how to program, and have data structure
and algorithm background
– COP3330: Object-oriented Programming
– COP4530: Data structures and algorithms
– Knowledge of probability theory, statistics, and linear algebra
Textbook
• Data Mining: Concepts and Techniques. 3rd edition
– Jiawei Han, Micheline Kamber, Jian Pei
• References
– Introduction to Data Mining
– Data Mining: The Textbook
– The Elements of Statistical Learning
– Pattern Recognition and Machine Learning
Course Format
• Two 75-min lectures/week
– Lecture slides are used to complement the lectures, not to substitute for the
textbook
• Four homework assignments (40%)
– Written assignments and machine problems
• Datasets or software might be provided
– Individual work
– Due right before the class starts on the due date
– No late homework will be accepted
• One midterm (15%) and one final (40%)
– Check the dates and make sure you have no conflicts!
• Quizzes (5%)
You Tell Me
• Why Are You Taking this Course?
– https://www.youtube.com/watch?v=vbb-AjiXyh0
– https://www.youtube.com/watch?v=1i6uESo98Yo
– Data mining tops LinkedIn’s list of the “hottest skills of 2014”
– Data scientist: the sexiest job of the 21st century (Harvard Business Review)
– Data scientist: 2015’s hottest profession (Mashable)
Why Data Mining?
• Big Data
• However, we are drowning in data, but starving for
knowledge!
– There is often information “hidden” in the data that is not readily evident
– Human analysts may take weeks to discover useful information
– Much of the data is never analyzed at all
What is Data Mining?
• Non-trivial extraction of implicit, previously unknown,
and potentially useful information from data
– a.k.a. KDD (knowledge discovery in databases)
– Data to be mined
• Relational databases, data warehouses; Data streams and sensor
data; Time-series data, temporal data, sequence data; Graphs,
social networks and multi-linked data; Spatial data and
spatiotemporal data; Multimedia data; Text data; WWW data
– Knowledge to be obtained
• Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis
The Goal: Decision Support
• Typical procedure
– Data --> Knowledge --> Action/Decision --> Goal
• Examples
– Netflix collects user ratings of movies --> What types of movies you
will like --> Recommend new movies to you --> Users stay with Netflix
– Gene sequences of cancer patients --> Which genes lead to cancer? -->
Appropriate treatment --> Save life
– Road traffic --> Which road is likely to be congested? --> Suggest better
routes to drivers --> Save time and energy
Example: Association Rule Mining
• Data
– A set of transactions, each of which consists of a set of items
• Association rules
– A set of rules that characterize associations between items
Market-Basket transactions
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
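As a rough illustration of how such rules can be scored, the sketch below computes the two standard measures, support and confidence, for the two rules above over the five market-basket transactions. The function names `support` and `confidence` are chosen here for illustration and are not from the slides, which do not say how the rules were discovered.

```python
# The five market-basket transactions from the slide, one set per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Score the two rules listed on the slide.
rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.2f}, confidence={c:.2f}")
```

An algorithm such as Apriori would search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores rules that are already given.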
Example: Classification
• Process
– Construct models (functions) based on training data with known
class labels
– Describe and distinguish classes or concepts for future prediction
– Predict testing data with unknown class labels
• Applications
– Spam identification
– Treatment prediction
– Document categorization
– ……
Ads Targeting
[Figure: training data (features and class labels) is used to learn a classifier f(x) = y: features --> class labels, which is then applied to testing data]
Fraud Detection
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set
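To make the train-then-predict flow concrete, here is a minimal sketch that labels the test records with a 1-nearest-neighbor classifier. The slides do not say which classifier is used, so the choice of 1-NN, the `distance` function (one unit per categorical mismatch plus the income gap in units of 100K), and the names `training`, `test`, and `predict` are all assumptions made for illustration.

```python
# Training records from the slide: (refund, marital status, income in K, cheat).
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the slide: the "Cheat" label is unknown.
test = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def distance(a, b):
    """Assumed mixed distance: 1 per categorical mismatch, plus the
    income difference measured in units of 100K."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100.0


def predict(record, training):
    """Return the class label of the closest training record (1-NN)."""
    nearest = min(training, key=lambda row: distance(record, row))
    return nearest[3]


predictions = [predict(record, training) for record in test]
for record, label in zip(test, predictions):
    print(record, "-->", label)
```

The predictions depend heavily on the assumed distance function; a different weighting of income against the categorical attributes, or a different classifier such as a decision tree, could label the same test records differently.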
Example: Clustering
• Goal
– Finding groups of objects such that the objects in a group will be
similar to one another and different from the objects in other
groups
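One common way to pursue this goal is k-means, which alternates between assigning each object to its nearest centroid and moving each centroid to the mean of its group. The sketch below is a minimal version under stated assumptions: the sample points are made up for illustration, and seeding the centroids with the first k points is a simplification, not a recommended initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=100):
    """Minimal k-means: returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Objects that share a label are close to the same centroid and far from the other one, which is the "similar within a group, different across groups" property described above.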
Example: Outlier Detection
• Outliers (Anomalies)
– Global: observations inconsistent with rest of the dataset
– Local:
• Observations inconsistent with their neighborhoods
• A local instability or discontinuity
• Applications
– Fraud/intrusion detection
– Customized marketing
– Weather prediction
One person’s noise could be another person’s signal.
- Edward Ng
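A global outlier in the sense above can be sketched with a simple z-score test: flag any observation that lies more than a chosen number of standard deviations from the mean of the whole dataset. The threshold of 2.0, the function name `zscore_outliers`, and the sample readings are assumptions for illustration; the slides do not prescribe a particular detection method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(readings))
```

This test only captures global outliers. A local outlier, one that is unusual only relative to its neighborhood, would need a neighborhood-based measure instead.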
Data Mining Tasks
• Prediction Methods: Use some variables to predict
unknown or future values of other variables
– Classification
– Regression
– Outlier detection
• Description Methods: Find human-interpretable
patterns that describe the data
– Clustering
– Association rule mining
Data Mining: Confluence of Multiple Disciplines
• Data mining sits at the intersection of:
– Machine Learning
– Pattern Recognition
– Statistics
– Database Technology
– Algorithm
– Visualization
– High-Performance Computing
– Applications
The Top 10 Data Mining Algorithms
1. C4.5: classification
2. K-Means: clustering
3. SVM: classification
4. Apriori: association analysis
5. EM: statistical learning
6. PageRank: link mining
7. AdaBoost: bagging and boosting
8. kNN: classification
9. Naive Bayes: classification
10. CART: classification
Questions
Any questions?
Please feel free to raise your hands.