CIS4930 Introduction to Data Mining Midterm Review Tallahassee, Florida, 2016

CIS4930
Introduction to Data Mining
Midterm Review
Peixiang Zhao
Tallahassee, Florida, 2016
Midterm Exam
• Time: Wednesday 3/2/2016, 5:15pm - 6:30pm
– Plan your time well
• Venue: LOV 301, in-class exam
• Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
– Plan your strategy well
• No calculators or other electronic devices
– Laptops, iPads, smartphones, etc. are prohibited
• Any form of cheating on the examination will result
in a zero grade, and will be reported to the university
Midterm Exam
• 15% of your final score
• Format
1. True/False questions with explanations
2. Short-answer questions: testing for basic concepts
• Make your answers clear and succinct
• Example 1: What is the difference between Apriori and FP-Growth?
• Example 2: Compute the Manhattan distance between data points
• Coverage
– From “Introduction” to “Frequent Pattern Mining”
Midterm Exam
• How to do well in the midterm exam?
– Review the materials carefully and make sure you understand them
• Both in slides and in the textbook
– Reexamine the homework and make sure you can work out the
solutions independently
– Discuss with your peer students
– Discuss with the TA and me
– Relax
What is Data Mining
• Non-trivial extraction of implicit, previously unknown,
and potentially useful information from data
– a.k.a. KDD (knowledge discovery in databases)
• Typical procedure
– Data → Knowledge → Action/Decision → Goal
• Representative Examples
– Frequent pattern & association rule mining
– Classification
– Clustering
– Outlier detection
Data Mining Tasks
• Prediction Methods: Use some variables to predict
unknown or future values of other variables
– Classification
– Regression
– Outlier detection
• Description Methods: Find human-interpretable
patterns that describe the data
– Clustering
– Association rule mining
Data
• Types of attributes
– Nominal, ordinal, interval, ratio
– Discrete, continuous
• Basic statistics
– Mean, median, mode
– Quantiles: Q1, Q3; IQR
– Variance; standard deviation
• Visualization tools
– Boxplot
– Histogram
– Q-Q plot
– Scatter plot
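The basic statistics listed above can be checked with a short sketch. The sample values are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption; other textbooks and tools use slightly different quartile rules. The 1.5 × IQR rule shown is the usual boxplot convention for flagging outliers.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12, 30]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
# method="inclusive" is one convention among several.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance and standard deviation (divide by n - 1).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Boxplot convention: values beyond 1.5 * IQR from Q1 / Q3 are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(mean, median, mode, (q1, q3), iqr, outliers)
```

Note how the single large value pulls the mean well above the median, while the median and IQR are barely affected.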
Similarity
• Proximity measure for binary attributes
– Contingency table; symmetric, asymmetric measures; Jaccard
coefficient
• Minkowski distance
– Metric
– Manhattan, Euclidean, supremum distance
• Cosine similarity
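A minimal sketch of the measures on this slide, using plain Python. The function names and the sample points are illustrative assumptions. The Jaccard function follows the contingency-table form q / (q + r + s), which ignores 0-0 matches and is therefore suited to asymmetric binary attributes.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: q / (q + r + s)."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # only x is 1
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # only y is 1
    return q / (q + r + s)

# Hypothetical data points.
p1, p2 = (1, 2), (4, 6)
manhattan = minkowski(p1, p2, 1)         # |1-4| + |2-6|
euclidean = minkowski(p1, p2, 2)         # sqrt(3^2 + 4^2)
supremum = minkowski(p1, p2, math.inf)   # max(3, 4)

binary_sim = jaccard((1, 0, 1, 1, 0), (1, 1, 0, 1, 0))
```

Manhattan, Euclidean, and supremum distance are the h = 1, h = 2, and h → ∞ cases of the same formula. Cosine similarity is not a Minkowski distance; it compares direction rather than magnitude.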
Data Preprocessing
• Data quality
• Major tasks in data preprocessing
– Cleaning, integration, reduction, transformation, discretization
• Cleaning noisy data
– Binning, regression, clustering, human inspection
• Handling redundancy in data integration
– Correlation analysis
• χ² (chi-square) test
• Covariance analysis
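The two redundancy checks above can be sketched as follows: the χ² statistic for nominal attributes and covariance for numeric attributes. The counts and values are hypothetical, and the covariance here divides by n (population form); a sample covariance would divide by n - 1.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def covariance(xs, ys):
    """Population covariance: mean of products of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical 2x2 table: rows = attribute A (yes / no), columns = attribute B.
observed = [[250, 200],
            [50, 1000]]
chi2 = chi_square(observed)

# Hypothetical paired numeric observations.
cov = covariance([2, 3, 5, 4, 6], [5, 8, 10, 11, 14])

print(round(chi2, 2), cov)
```

A large χ² value relative to the critical value for the table's degrees of freedom suggests the two nominal attributes are correlated. A positive covariance means the two numeric attributes tend to rise together.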
Data Preprocessing
• Data reduction
– Dimensionality reduction
• Curse of dimensionality
• PCA vs. SVD
• Feature selection
– Numerosity reduction
• Regression
• Histogram, clustering, sampling
– Data compression
Principal Component Analysis (PCA)
• Motivation and objective
– The direction with the largest projected variance is called the first
principal component
– The orthogonal direction that captures the second largest
projected variance is called the second principal component
– and so on…
• General procedure
– Preprocessing
– Compute the covariance matrix
– Derive eigenvectors for projection
• Relationship between PCA and SVD
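The general procedure can be sketched with NumPy, assuming it is installed. The data is randomly generated for illustration. The last lines show one way to state the PCA/SVD relationship: for centered data, the right singular vectors match the covariance eigenvectors up to sign, and the squared singular values divided by n - 1 match the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data with correlated attributes.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

# 1. Preprocessing: center each attribute at zero.
Xc = X - X.mean(axis=0)

# 2. Covariance matrix (rows are observations, columns are attributes).
cov = np.cov(Xc, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order,
#    so reorder to put the largest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component.
projected = Xc @ eigvecs[:, :1]

# Relationship to SVD of the centered data: Xc = U S Vt.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = S ** 2 / (len(Xc) - 1)

print(eigvals, svd_eigvals)
```

The variance of the projected data equals the largest eigenvalue, which is exactly the "largest projected variance" in the definition of the first principal component.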
Numerosity Reduction
• Parametric method
– Regression
• Non-parametric method
– Histogram
• Equal-width
• Equal-frequency
– Sampling
• Simple random sampling, sampling without replacement, stratified sampling
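A small sketch of the non-parametric methods above. The values, the bucket counts, and the helper names are illustrative assumptions. The equal-frequency helper assumes the number of values is divisible by the number of buckets.

```python
import random
from collections import Counter

# Hypothetical attribute values.
values = [1, 1, 5, 5, 5, 5, 8, 8, 10, 10,
          12, 14, 14, 15, 15, 18, 20, 21, 25, 30]

def equal_width_histogram(data, k):
    """Count values in k buckets of equal width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for v in data:
        idx = min(int((v - lo) / width), k - 1)  # maximum goes in last bucket
        counts[idx] += 1
    return counts

def equal_frequency_buckets(data, k):
    """Split sorted values into k buckets holding the same number of values."""
    ordered = sorted(data)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from every stratum defined by key(record)."""
    strata = {}
    for rec in records:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample

rng = random.Random(42)                 # fixed seed for repeatability
without_replacement = rng.sample(values, 5)
with_replacement = rng.choices(values, k=5)

# Hypothetical records: (stratum label, record id).
records = [("young", i) for i in range(8)] + [("senior", i) for i in range(2)]
stratified = stratified_sample(records, key=lambda r: r[0], fraction=0.5, rng=rng)
strata_sizes = Counter(label for label, _ in stratified)

width_counts = equal_width_histogram(values, 3)
depth_buckets = equal_frequency_buckets(values, 4)
```

Equal-width buckets can be very unevenly filled when the data is skewed, while equal-frequency buckets adapt their boundaries to the data. Stratified sampling keeps small groups represented, which a simple random sample may miss.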
Data Transformation
• Normalization
– Min-max
– Z-score
– Decimal scaling
• Discretization
– Binning
• Equal-width
• Equal-depth
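The three normalization methods can be sketched as below. The function names and sample values are hypothetical. The z-score version uses the population standard deviation, and min-max assumes the attribute is not constant (otherwise the range is zero).

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max |v| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical income values.
incomes = [12000, 73600, 98000]
scaled = min_max(incomes)
standardized = z_score(incomes)
decimal = decimal_scaling([-986, 917])

print([round(v, 3) for v in scaled], decimal)
```

Min-max preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value sets the range. Z-score is the usual choice when the minimum and maximum are unknown or unreliable.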
Frequent Pattern Mining
• Definition
– Frequent itemsets
• Closed itemsets
• Maximal itemsets
– Association rules
• Support, confidence
• Complexity
– The overall search space formulated as a lattice
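The definitions above can be made concrete with a brute-force sketch over a tiny, hypothetical transaction database. It enumerates the whole itemset lattice, which is only feasible because there are six items; the exponential size of that lattice is exactly the complexity issue noted on the slide.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support_count(lhs | rhs) / support_count(lhs)

items = sorted(set().union(*transactions))
min_sup = 3  # minimum support count (an assumed threshold)

# Enumerate every non-empty itemset in the lattice.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        count = support_count(itemset)
        if count >= min_sup:
            frequent[itemset] = count

# Closed: no proper superset has the same support.
closed = {s for s, c in frequent.items()
          if not any(s < t and c == frequent[t] for t in frequent)}
# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

rule_conf = confidence(frozenset({"milk", "diaper"}), frozenset({"beer"}))
```

Every maximal frequent itemset is also closed, but not the other way around. Here {beer} is frequent yet not closed, because {beer, diaper} has the same support.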
Apriori
• The downward closure property
– Or anti-monotone property of support
• Apriori algorithm
– Candidate generation
• Self-join
– Frequency counting
• Hash tree
• Further improvement
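A compact sketch of the Apriori loop, under a few stated simplifications: items are assumed to be sortable strings, and frequency counting scans the transactions directly instead of using a hash tree. The function name and the sample database are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 2
    while level:
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        # Self-join: merge two (k-1)-itemsets that share their first k-2 items.
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] != b[:-1]:
                    break  # sorted order: later itemsets cannot share the prefix
                cand = frozenset(a) | frozenset(b)
                # Prune (downward closure): every (k-1)-subset must be frequent.
                if all(frozenset(sub) in level
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        # Frequency counting by a plain scan (a hash tree would speed this up).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical database, minimum support count 2.
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(db, 2)
```

The pruning step is where the downward closure property pays off: a candidate with any infrequent subset is discarded without scanning the database.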
FP-Growth
• Major philosophy
– grow long patterns from short ones using local frequent items
only
• FP-tree
– Augmented prefix tree
– Properties
• Completeness and non-redundancy
• FP-growth algorithm
– Progressive subspace projection
– Early termination condition
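A minimal sketch of the FP-tree and the recursive projection step. The class and function names are hypothetical, and the single-path early termination is left out for brevity: the recursion simply stops when a conditional pattern base is empty. Transactions are passed as (items, count) pairs so that conditional pattern bases can reuse the same builder.

```python
from collections import defaultdict

class Node:
    """FP-tree node: an item, a count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree; return (root, header table, item frequencies)."""
    freq = defaultdict(int)
    for items, count in weighted_transactions:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    header = defaultdict(list)  # item -> all tree nodes holding that item
    root = Node(None, None)
    for items, count in weighted_transactions:
        # Keep only frequent items, most frequent first (ties by name).
        ordered = sorted((i for i in items if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return root, header, freq

def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Return {frozenset: support count}, growing patterns from the suffix."""
    _, header, freq = build_tree(weighted_transactions, min_sup)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = freq[item]
        # Conditional pattern base: the prefix path above each node of item.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Project onto the item's subspace and mine it recursively.
        patterns.update(fp_growth(base, min_sup, pattern))
    return patterns

# Hypothetical database, minimum support count 2.
db = [(["a", "c", "d"], 1), (["b", "c", "e"], 1),
      (["a", "b", "c", "e"], 1), (["b", "e"], 1)]
patterns = fp_growth(db, 2)
```

Unlike Apriori, no candidate itemsets are generated: each recursive call only sees the items that are locally frequent in its conditional pattern base.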
ECLAT
• Vertical representation of transactional DB
– Tid-lists
• Algorithm
– DFS-like
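A short sketch of the vertical approach: each item maps to the set of transaction ids that contain it, and the support of a longer itemset is the size of the intersection of tid-lists. The function name and database are illustrative assumptions.

```python
def eclat(transactions, min_sup):
    """Return {frozenset: support count} using tid-list intersections."""
    # Vertical representation: item -> set of transaction ids.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, candidates):
        # candidates: (item, tid-set) pairs that can extend the prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            extensions = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # tid-list of itemset + other
                if len(shared) >= min_sup:
                    extensions.append((other, shared))
            dfs(itemset, extensions)  # go deeper before moving to siblings

    start = sorted(((item, tids) for item, tids in tidlists.items()
                    if len(tids) >= min_sup), key=lambda pair: pair[0])
    dfs(frozenset(), start)
    return frequent

# Hypothetical database, minimum support count 2.
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = eclat(db, 2)
```

Support counting needs no database scan after the tid-lists are built, at the cost of keeping the intermediate tid-sets in memory.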
Association Rules
• The number of association rules can be exponentially
large!
• Algorithm
• Pattern evaluation
– Is confidence always an interesting measure for association
analysis?
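A worked sketch of why confidence alone can mislead. The counts are hypothetical, and lift is used here as one commonly cited alternative measure; the slides do not say which measures the course covers. The rule-count function uses the standard closed form 3^d - 2^(d+1) + 1 for the number of possible rules over d items.

```python
def rule_count(d):
    """Number of possible association rules over d items."""
    return 3 ** d - 2 ** (d + 1) + 1

def rule_measures(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence, and lift of the rule lhs -> rhs from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_total)  # confidence relative to P(rhs)
    return support, confidence, lift

# Hypothetical survey of 1000 people: 200 drink tea, 800 drink coffee,
# 150 drink both. Rule under test: tea -> coffee.
support, confidence, lift = rule_measures(1000, 200, 800, 150)

print(rule_count(6), support, confidence, round(lift, 4))
```

The rule has 75% confidence, yet 80% of all people drink coffee anyway, so knowing that someone drinks tea actually lowers the chance that they drink coffee (lift below 1). High confidence is therefore not always interesting.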