CIS4930 Introduction to Data Mining
Final Review
Peixiang Zhao
Tallahassee, Florida, 2016

Final Exam
• Time: Wednesday 4/27/2016, 5:30pm-7:30pm
  – Plan your time well
• Venue: LOV 301, in-class exam
• Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
• No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
• Any form of cheating on the examination will result in a zero grade and will be reported to the university

Final Exam
• Bring your FSU ID to attend the final exam
• 40% of your final score
• Coverage
  – All materials taught in class AND in the textbook, from Introduction through Clustering

Format
• One set of true/false questions with brief answers
  – e.g., k-Means can be used to cluster datasets with any arbitrary shape
  – Answer: False. Because ……
• Short-answer questions
  – e.g., What are the key differences between decision-tree-based classification and kNN classification?
• Several more questions
  – e.g., Compute frequent itemsets and strong association rules
• 100 points
• I believe you have enough time (120 minutes)

Final Exam
• How to do well in the exam?
  – Review the materials carefully and make sure you understand them
    • Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your fellow students
  – Discuss with the TA and me
    • Monday: 2pm-4pm
  – Relax

What is Data Mining
• Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
• Typical procedure
  – Data → Knowledge → Action/Decision → Goal
• Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

Data Mining Tasks
• Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
• Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data
• Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
• Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
• Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot

Similarity
• Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
• Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
• Cosine similarity

Data Preprocessing
• Data quality
• Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
• Cleaning noisy data
  – Binning, regression, clustering, human inspection
• Handling redundancy in data integration
  – Correlation analysis
    • χ² (chi-square) test
    • Covariance analysis

Data Preprocessing
• Data reduction
  – Dimensionality reduction
    • Curse of dimensionality
    • PCA vs. SVD
    • Feature selection
  – Numerosity reduction
    • Regression
    • Histogram, clustering, sampling
  – Data compression
• Data transformation
  – Normalization
  – Discretization

Frequent Pattern Mining
• Definition
  – Frequent itemsets
    • Closed itemsets
    • Maximal itemsets
  – Association rules
    • Support, confidence
• Complexity
  – The overall search space is formulated as a lattice
• Methods
  – Apriori
  – FP-Growth
  – ECLAT

Apriori
• The downward closure property
  – Or the anti-monotone property of support
• Apriori algorithm
  – Candidate generation
    • Self-join
  – Frequency counting
    • Hash tree
• Further improvement

FP-Growth
• Major philosophy
  – Grow long patterns from short ones using local frequent items only
• FP-tree
  – Augmented prefix tree
  – Properties
    • Completeness and non-redundancy
• FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition

ECLAT
• Vertical representation of the transactional DB
  – Tid-lists
• Algorithm
  – DFS-like

Association Rules
• The number of association rules can be exponentially large!
• Algorithm
• Pattern evaluation
  – Is confidence always an interesting measure for association analysis?

Classification
• Problem definition
  – Training & test
• Classification models
  – Decision tree: Gini index, information gain, error rate
  – Naïve Bayes
  – kNN
  – SVM
• Ensemble methods
  – Bagging
  – Boosting
• Model evaluation

Clustering
• Definition
• Types of clustering
• Methods
  – K-means
  – Hierarchical clustering
  – DBSCAN
  – Graph-based clustering
• Impossibility theorem for clustering
• Cluster validity
• Semi-supervised clustering
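
Worked example for the Similarity slide. The measures listed there can be sketched in a few lines of plain Python. This is a minimal illustration, not course-provided code: the function names and the sample vectors are assumptions made for review purposes. The Jaccard version treats the attributes as asymmetric binary, so 0-0 matches are ignored.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient for two equal-length 0/1 vectors (0-0 matches ignored)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0

def minkowski(a, b, h):
    """Minkowski distance of order h; h=1 Manhattan, h=2 Euclidean, h=inf supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

def cosine(a, b):
    """Cosine similarity between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample vectors for illustration only
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))
print(minkowski((1, 2), (3, 5), 1), minkowski((1, 2), (3, 5), math.inf))
print(cosine([1, 2], [2, 4]))
```

Note that cosine similarity depends only on direction: the vectors (1, 2) and (2, 4) are maximally similar even though their Euclidean distance is nonzero.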
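
Worked example for the Apriori and Association Rules slides. The sketch below follows the level-wise idea: generate size-k candidates by joining frequent (k-1)-itemsets, prune any candidate with an infrequent subset (downward closure), then count support. It is a simplified illustration under stated assumptions: support is an absolute count, counting is a plain scan rather than a hash tree, and the function names and the five-transaction toy dataset are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]  # join step
                # prune step: every (k-1)-subset must itself be frequent
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def strong_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]  # support(itemset) / support(lhs)
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules

# Hypothetical toy market-basket data
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent = apriori(baskets, min_support=3)
rules = strong_rules(frequent, min_confidence=0.8)
print(len(frequent), rules)
```

With a minimum support count of 3, the only 3-itemset that survives pruning is {bread, milk, diaper}, and it appears in just two baskets, so the search stops at size 2. The lookup `frequent[lhs]` is safe because every subset of a frequent itemset is itself frequent.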
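
Worked example for the Classification slide. The three node-impurity measures named for decision trees (Gini index, entropy-based information gain, error rate) can be compared on the same labels. This is a minimal sketch with hypothetical function names and made-up labels, intended only as a way to check hand calculations.

```python
import math
from collections import Counter

def class_probs(labels):
    """Relative frequency of each class among the labels at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_probs(labels))

def error_rate(labels):
    """Classification error: 1 - probability of the majority class."""
    return 1.0 - max(class_probs(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 positive and 5 negative records
node = ["+"] * 5 + ["-"] * 5
perfect_split = [["+"] * 5, ["-"] * 5]
print(gini(node), entropy(node), error_rate(node))
print(information_gain(node, perfect_split))
```

For a two-class node, all three measures peak at an even class split and drop to zero when the node is pure, which is why a split into two pure children recovers the full parent entropy as gain.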