CIS4930 Introduction to Data Mining Final Review Tallahassee, Florida, 2016

CIS4930
Introduction to Data Mining
Final Review
Peixiang Zhao
Tallahassee, Florida, 2016
Final Exam
• Time: Wednesday 4/27/2016, 5:30pm-7:30pm
– Plan your time well
• Venue: LOV 301, in-class exam
• Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
– Plan your strategy well
• No calculators or other electronic devices
– Laptops, iPads, smartphones, etc. are prohibited
• Any form of cheating on the examination will result
in a zero grade, and will be reported to the university
Final Exam
• Bring your FSU ID to attend the final exam
• 40% of your final score
• Coverage
– All materials taught in the class AND in the textbook, starting
from Introduction, to Clustering
Format
• One set of true/false questions with brief answers
– e.g., k-Means can be used to find clusters of arbitrary shape
– Answer: False. Because ……
• Short-answer questions
– e.g., What are the key differences between decision-tree-based
classification and kNN classification?
• Several more questions
– e.g., Compute frequent itemsets and strong association rules
• 100 points
• I believe you have enough time (120 minutes)
Final Exam
• How to do well on the exam?
– Review the materials carefully and make sure you understand them
• Both in slides and in the textbook
– Reexamine the homework and make sure you can work out the
solutions independently
– Discuss with your peers
– Discuss with the TA and me
• Monday: 2pm-4pm
– Relax
What is Data Mining
• Non-trivial extraction of implicit, previously unknown,
and potentially useful information from data
– a.k.a. KDD (knowledge discovery in databases)
• Typical procedure
– Data → Knowledge → Action/Decision → Goal
• Representative Examples
– Frequent pattern & association rule mining
– Classification
– Clustering
– Outlier detection
Data Mining Tasks
• Prediction Methods: Use some variables to predict
unknown or future values of other variables
– Classification
– Regression
– Outlier detection
• Description Methods: Find human-interpretable
patterns that describe the data
– Clustering
– Association rule mining
Data
• Types of attributes
– Nominal, ordinal, interval, ratio
– Discrete, continuous
• Basic statistics
– Mean, median, mode
– Quantiles: Q1, Q3; IQR
– Variance; standard deviation
• Visualization tools
– Boxplot
– Histogram
– Q-Q plot
– Scatter plot
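As a quick illustration of the basic statistics listed above, the following sketch uses Python's standard `statistics` module on a made-up sample. Quartile conventions differ between textbooks and tools; the values here rely on the module's default "exclusive" method, which is an assumption and may not match the convention used in class.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample, not from the course

mean = statistics.mean(data)          # arithmetic mean
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode([1, 2, 2, 3])  # most frequent value in a second toy list

# Quartiles Q1, Q2 (median), Q3 using the default "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range

variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

print(mean, median, mode, (q1, q3), iqr, variance, std_dev)
```

A boxplot draws exactly these quantities: the box spans Q1 to Q3, the line inside marks the median, and points farther than 1.5 × IQR from the box are commonly flagged as outliers.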
Similarity
• Proximity measure for binary attributes
– Contingency table; symmetric, asymmetric measures; Jaccard
coefficient
• Minkowski distance
– Metric
– Manhattan, Euclidean, supremum distance
• Cosine similarity
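The measures on this slide can be sketched in a few lines of Python. The function names and sample vectors below are illustrative assumptions, and the binary measures assume 0/1 vectors of equal length with at least one non-zero position.

```python
import math

def binary_counts(x, y):
    """Return (f11, f00, mismatches) for two 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f00, len(x) - f11 - f00

def smc(x, y):
    """Simple matching coefficient (symmetric: 0-0 matches count)."""
    f11, f00, _ = binary_counts(x, y)
    return (f11 + f00) / len(x)

def jaccard(x, y):
    """Jaccard coefficient (asymmetric: 0-0 matches are ignored)."""
    f11, _, mismatches = binary_counts(x, y)
    return f11 / (f11 + mismatches)

def minkowski(x, y, p):
    """Minkowski distance; p=1 Manhattan, p=2 Euclidean, p=inf supremum."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

x, y = [1, 0, 0, 1, 1], [1, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))            # 0.6 0.5
print(minkowski((0, 0), (3, 4), 1))        # 7
print(minkowski((0, 0), (3, 4), math.inf)) # 4
```

Minkowski distances are metrics (non-negativity, symmetry, triangle inequality), whereas cosine similarity is a similarity measure that ignores vector magnitude.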
Data Preprocessing
• Data quality
• Major tasks in data preprocessing
– Cleaning, integration, reduction, transformation, discretization
• Cleaning noisy data
– Binning, regression, clustering, human inspection
• Handling redundancy in data integration
– Correlation analysis
• χ² (chi-square) test
• Covariance analysis
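To make the two redundancy checks concrete, here is a minimal sketch: a χ² statistic for a contingency table of two nominal attributes, and a population covariance for two numeric attributes. The counts and values are hypothetical, and the function names are assumptions for illustration.

```python
def chi_square(observed):
    """Chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    return stat

def covariance(x, y):
    """Population covariance of two equal-length numeric attributes."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

# Hypothetical 2x2 table: rows = attribute A yes/no, columns = attribute B yes/no.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))               # 507.94
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))    # 2.5
```

A large χ² value relative to the critical value for the table's degrees of freedom suggests the attributes are correlated; a positive covariance means the two numeric attributes tend to rise together.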
Data Preprocessing
• Data reduction
– Dimensionality reduction
• Curse of dimensionality
• PCA vs. SVD
• Feature selection
– Numerosity reduction
• Regression
• Histogram, clustering, sampling
– Data compression
• Data transformation
– Normalization
– Discretization
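The two transformation steps can be sketched as follows: min-max and z-score normalization, plus equal-width binning as one simple discretization scheme. The helper names and sample values are assumptions; z-score here uses the population standard deviation, and both normalizers assume the values are not all identical.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(min_max([10, 20, 30]))               # [0.0, 0.5, 1.0]
print(z_score([10, 20, 30]))
print(equal_width_bins([0, 4, 5, 10], 2))  # [0, 0, 1, 1]
```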
Frequent Pattern Mining
• Definition
– Frequent itemsets
• Closed itemsets
• Maximal itemsets
– Association rules
• Support, confidence
• Complexity
– The overall search space formulated as a lattice
• Methods
– Apriori
– FP-Growth
– Eclat
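The definitions above can be checked by brute force on a toy database. The sketch below enumerates every itemset in the lattice (exponential, so only suitable for tiny examples), then derives the closed and maximal frequent itemsets and one rule confidence. The transactions and the threshold are made up for illustration.

```python
from itertools import combinations

# Hypothetical transactional database and absolute minimum support.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
min_support = 2

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

# Walk the whole itemset lattice level by level.
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(combo)
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Closed: no proper superset has the same support.
closed = {x for x, sx in frequent.items()
          if not any(x < y and sx == sy for y, sy in frequent.items())}
# Maximal: no proper superset is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

# Confidence of the rule {a} -> {b} = support({a, b}) / support({a}).
confidence_a_b = support({"a", "b"}) / support({"a"})

print(len(frequent), sorted(map(sorted, closed)), sorted(map(sorted, maximal)))
print(confidence_a_b)  # 0.75
```

On this data there are 7 frequent itemsets, 3 closed ones ({a}, {a, b}, {a, b, c}) and a single maximal one ({a, b, c}), which illustrates that maximal ⊆ closed ⊆ frequent.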
Apriori
• The downward closure property
– Or anti-monotone property of support
• Apriori algorithm
– Candidate generation
• Self-join
– Frequency counting
• Hash tree
• Further improvement
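A minimal sketch of the Apriori loop is shown below, assuming items are comparable (for example strings) and support is an absolute count. It uses a plain scan for frequency counting rather than a hash tree, and the function name and toy market-basket data are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Frequency counting by a full scan (a hash tree would speed this up).
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singletons = {frozenset([i]) for t in transactions for i in t}
    level = {c: s for c, s in count(singletons).items() if s >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                # Self-join: merge two (k-1)-itemsets sharing their first k-2 items.
                if a[:-1] == b[:-1]:
                    c = frozenset(a) | frozenset(b)
                    # Downward closure: every (k-1)-subset must be frequent.
                    if all(frozenset(s) in level for s in combinations(c, k - 1)):
                        candidates.add(c)
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(level)
        k += 1
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
print(len(result))  # 8
```

With a minimum support of 3, the only 3-item candidate that survives pruning is {bread, diapers, milk}, and it is then rejected by counting because it occurs in only two baskets.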
FP-Growth
• Major philosophy
– Grow long patterns from short ones using local frequent items only
• FP-tree
– Augmented prefix tree
– Properties
• Completeness and non-redundancy
• FP-growth algorithm
– Progressive subspace projection
– Early termination condition
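The sketch below covers only the data-structure half of FP-growth: building the FP-tree and reading off the conditional pattern base of one item. The recursive mining step is omitted. Class and function names are assumptions, and ties in item frequency are broken alphabetically, which is one possible convention.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Return (root, header); header maps each frequent item to its tree nodes."""
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None)
    header = {i: [] for i in freq}
    for t in transactions:
        # Keep frequent items only, ordered by descending frequency.
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1  # shared prefixes are counted, not duplicated
            node = child
    return root, header

def conditional_pattern_base(header, item):
    """Prefix paths leading to `item`, each with the count of that item node."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        base.append((path[::-1], node.count))
    return base

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
root, header = build_fp_tree(baskets, min_support=3)
print(conditional_pattern_base(header, "beer"))
```

FP-growth would next build a conditional FP-tree from this pattern base and recurse, which is the progressive subspace projection named on the slide.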
ECLAT
• Vertical representation of transactional DB
– Tid-lists
• Algorithm
– DFS-like
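A compact sketch of the idea: convert the database to tid-lists, then extend itemsets depth-first by intersecting tid-lists, so support is just the size of the intersection. The function name and toy data are assumptions; items are assumed to be sortable.

```python
def eclat(transactions, min_support):
    """Return {tuple of items: support count} for all frequent itemsets."""
    # Vertical representation: item -> set of transaction ids containing it.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, candidates):
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend with later items by intersecting tid-lists.
            extensions = []
            for other, other_tids in candidates[idx + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    extensions.append((other, shared))
            if extensions:
                dfs(itemset, extensions)

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_support),
                   key=lambda pair: pair[0])
    dfs((), start)
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = eclat(baskets, min_support=3)
print(result[("beer", "diapers")])  # 3
```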
Association Rules
• The number of association rules can be exponentially
large!
• Algorithm
• Pattern evaluation
– Is confidence always an interesting measure for association
analysis?
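One way to see why confidence alone can mislead is to compare it with lift, a commonly used alternative measure. The counts below are hypothetical and the function name is an assumption.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for a rule antecedent -> consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    # Lift compares confidence with the consequent's baseline frequency.
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Hypothetical survey of 1000 people: 200 drink tea, 800 drink coffee,
# 150 drink both. Rule under test: tea -> coffee.
support, confidence, lift = rule_metrics(1000, 200, 800, 150)
print(support, confidence, lift)
```

Here the rule has a confidence of 0.75, yet 80% of all people drink coffee anyway, so the lift is below 1: knowing that someone drinks tea makes coffee slightly less likely, not more.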
Classification
• Problem definition
– Training & Test
• Classification models
– Decision tree: Gini index, information gain, error rate
– Naïve Bayes
– kNN
– SVM
• Ensemble Methods
– Bagging
– Boosting
• Model Evaluation
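The three decision-tree impurity measures named above can be computed from the class counts at a node. The following is a sketch with assumed function names and made-up counts; information gain is taken as parent entropy minus the weighted entropy of the children.

```python
import math

def gini(counts):
    """Gini index: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def error_rate(counts):
    """Classification error: 1 - proportion of the majority class."""
    return 1 - max(counts) / sum(counts)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

print(gini([3, 3]), entropy([3, 3]), error_rate([3, 3]))  # 0.5 1.0 0.5
print(information_gain([3, 3], [[3, 0], [0, 3]]))         # 1.0
```

All three measures are maximal for an evenly mixed node and zero for a pure node; a split is preferred when it reduces impurity the most.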
Clustering
• Definition
• Types of clustering
• Methods
– K-means
– Hierarchical clustering
– DBSCAN
– Graph based clustering
• Impossibility theorem for clustering
• Cluster validity
• Semi-supervised clustering
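As a reminder of how the first method works, here is a minimal sketch of the basic k-means iteration (assign each point to its nearest centroid, then recompute centroids). The initial centroids are passed in explicitly so the run is deterministic; the function name, points, and starting centroids are assumptions for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means on tuples of coordinates; returns (centroids, clusters)."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[j]   # keep an empty cluster's centroid
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Because every point is assigned to the nearest centroid, k-means favors compact, roughly globular clusters; density-based methods such as DBSCAN are the usual choice when clusters have arbitrary shapes.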