CIS4930 Introduction to Data Mining
Midterm Review
Peixiang Zhao
Tallahassee, Florida, 2016

Midterm Exam
• Time: Wednesday 3/2/2016, 5:15pm – 6:30pm
  – Plan your time well
• Venue: LOV 301, in-class exam
• Closed book, closed note, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
• No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
• Any form of cheating on the examination will result in a zero grade, and will be reported to the university

Midterm Exam
• 15% of your final score
• Format
  1. True/False questions with explanations
  2. Short-answer questions: testing for basic concepts
     • Make your answers clear and succinct
     • Example 1: What is the difference between Apriori and FP-Growth?
     • Example 2: Compute the Manhattan distance between data points
• Coverage
  – From “Introduction” to “Frequent Pattern Mining”

Midterm Exam
• How to do well in the midterm exam?
  – Review the materials carefully and make sure you understand them
     • Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your peer students
  – Discuss with the TA and me
  – Relax

What is Data Mining
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
• Typical procedure
  – Data → Knowledge → Action/Decision → Goal
• Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

Data Mining Tasks
• Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
• Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data
• Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
• Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
• Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot

Similarity
• Proximity measures for binary attributes
  – Contingency table; symmetric and asymmetric measures; Jaccard coefficient
• Minkowski distance
  – Metric properties
  – Manhattan, Euclidean, supremum distance
• Cosine similarity
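Since Example 2 on the exam-format slide asks for a Manhattan distance computation, here is a minimal worked sketch of the proximity measures reviewed above. Python and the two sample points p and q are illustrative assumptions only; the slides do not prescribe a language or specific data.

# Sketch of the proximity measures above: Manhattan, Euclidean,
# supremum distance, and cosine similarity. Sample points are made up.
import math

def minkowski(x, y, h):
    """Minkowski distance of order h: (sum_i |x_i - y_i|^h)^(1/h)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """L-infinity (supremum) distance: max_i |x_i - y_i|."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

p, q = (1, 2, 0, 3), (4, 0, 1, 1)
print("Manhattan (h=1):", minkowski(p, q, 1))   # 3 + 2 + 1 + 2 = 8
print("Euclidean (h=2):", minkowski(p, q, 2))   # sqrt(9 + 4 + 1 + 4) = sqrt(18)
print("Supremum:", supremum(p, q))              # max(3, 2, 1, 2) = 3
print("Cosine similarity:", cosine_similarity(p, q))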
Data Preprocessing
• Data quality
• Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
• Cleaning noisy data
  – Binning, regression, clustering, human inspection
• Handling redundancy in data integration
  – Correlation analysis
     • χ² (chi-square) test
     • Covariance analysis

Data Preprocessing
• Data reduction
  – Dimensionality reduction
     • Curse of dimensionality
     • PCA vs. SVD
     • Feature selection
  – Numerosity reduction
     • Regression
     • Histogram, clustering, sampling
  – Data compression

Principal Component Analysis (PCA)
• Motivation and objective
  – The direction with the largest projected variance is called the first principal component
  – The orthogonal direction that captures the second largest projected variance is called the second principal component
  – and so on…
• General procedure
  – Preprocessing
  – Compute the covariance matrix
  – Derive eigenvectors for projection
• Relationship between PCA and SVD

Numerosity Reduction
• Parametric method
  – Regression
• Non-parametric methods
  – Histogram
     • Equal-width
     • Equal-frequency
  – Sampling
     • Simple random sampling, sampling w/o replacement, stratified sampling

Data Transformation
• Normalization
  – Min-max
  – Z-score
  – Decimal scaling
• Discretization
  – Binning
     • Equal-width
     • Equal-depth

Frequent Pattern Mining
• Definitions
  – Frequent itemsets
     • Closed itemsets
     • Maximal itemsets
  – Association rules
     • Support, confidence
• Complexity
  – The overall search space formulated as a lattice

Apriori
• The downward closure property
  – Or: the anti-monotone property of support
• Apriori algorithm
  – Candidate generation
     • Self-join
  – Frequency counting
     • Hash tree
• Further improvements

FP-Growth
• Major philosophy
  – Grow long patterns from short ones using local frequent items only
• FP-tree
  – Augmented prefix tree
  – Properties
     • Completeness and non-redundancy
• FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition

ECLAT
• Vertical representation of the transactional DB
  – Tid-lists
• Algorithm
  – DFS-like

Association Rules
• The number of association rules can be exponentially large!
• Rule generation algorithm
• Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
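To make the support/confidence review from the frequent-pattern slides concrete, here is a minimal Python sketch over a small transaction database. The five transactions, the item names, and the minimum-support threshold are made-up assumptions for illustration only; lift is shown as one standard alternative to confidence for pattern evaluation.

# Sketch: support, confidence, and lift for association rules
# over a small, made-up transaction database (illustration only).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """conf(A => B) = support(A u B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """lift(A => B) = conf(A => B) / support(B); 1 indicates independence."""
    return confidence(antecedent, consequent) / support(consequent)

# Downward closure: every subset of a frequent itemset is frequent,
# so Apriori can prune any candidate that has an infrequent subset.
min_sup = 0.6
frequent_pairs = [set(c) for c in combinations(["bread", "milk", "diapers", "beer"], 2)
                  if support(c) >= min_sup]

print("Frequent 2-itemsets:", frequent_pairs)
print("conf({diapers} => {beer}) =", confidence({"diapers"}, {"beer"}))   # 0.6 / 0.8 = 0.75
print("lift({diapers} => {beer}) =", lift({"diapers"}, {"beer"}))         # 0.75 / 0.6 = 1.25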