Exploratory Data Mining and Data Preparation
Data Mining, Fall 2003

The Data Mining Process
A cycle of phases around the data: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Exploratory Data Mining
A preliminary process based on:
• Data summaries: attribute means, attribute variation, attribute relationships
• Visualization

Summary Statistics
(annotated Weka screenshot) Select an attribute and inspect its summary statistics and visualization. Possible problems to watch for:
• Many missing values (16% in the example)
• No examples of one value
In the example, the selected attribute appears to be a good predictor of the class.

Exploratory DM Process
For each attribute:
• Look at the data summaries; identify potential problems and decide whether an action needs to be taken (this may require collecting more data)
• Visualize the distribution; identify potential problems (e.g., one dominant attribute value, an even distribution, etc.)
• Evaluate the usefulness of the attribute

Weka Filters
Weka has many filters that are helpful in preprocessing the data:
• Attribute filters: add, remove, or transform attributes
• Instance filters: add, remove, or transform instances
The process: choose a filter from the drop-down menu, edit its parameters (if any), and apply it.

Data Preprocessing
• Data cleaning: missing values, noisy or inconsistent data
• Data integration/transformation
• Data reduction: dimensionality reduction, data compression, numerosity reduction
• Discretization

Data Cleaning
• Missing values: Weka reports the percentage of missing values; the ReplaceMissingValues filter can fill them in (a sketch of what it does appears at the end of this part)
• Noisy data: caused by uncertainty or errors; Weka reports the number of unique values; useful filters include RemoveMisclassified and MergeTwoValues

Data Transformation
Why transform data?
• To combine attributes: for example, the ratio of two attributes might be more useful than keeping them separate
• To normalize data: having attributes on the same approximate scale helps many data mining algorithms (hence better models)
• To simplify data: for example, working with discrete data is often more intuitive and helps the algorithms (hence better models)

Weka Filters
The data transformation filters in Weka include Add, AddExpression, MakeIndicator, NumericTransform, Normalize, and Standardize.

Discretization
Discretization reduces the number of values of a continuous attribute. Why?
• Some methods can only use nominal data, e.g., the ID3 and Apriori algorithms in Weka
• It is helpful if the data needs to be sorted frequently (e.g., when constructing a decision tree)

Unsupervised Discretization
Unsupervised discretization does not take the classes into account. Two schemes are equal-interval binning and equal-frequency binning (sketched at the end of this part). The running example is the temperature attribute of the weather data:

Temperature  64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play         yes no  yes yes yes no  no  yes yes yes no  yes yes no

Supervised Discretization
• Takes the classification into account, using entropy to measure information gain (sketched at the end of this part)
• Goal: discretize into "pure" intervals
• There is usually no way to get completely pure intervals. In the figure, candidate cut points on the temperature data are labeled A-F; for example, one cut separates 1 yes from 8 yes & 5 no, and another separates 9 yes & 4 no from 1 no.

Error-Based Discretization
• Count the number of misclassifications: the majority class of an interval determines its prediction, and the instances that differ from it count as errors
• The number of intervals must be restricted, since error counting alone favors making every value its own interval
• Complexity: brute force takes exponential time; dynamic programming finds the best split in linear time
• Downside: it cannot generate adjacent intervals with the same label
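The slides above refer to Weka's ReplaceMissingValues filter, which substitutes the attribute mean for missing numeric values and the mode for missing nominal ones. Here is a minimal Python/pandas sketch of that behavior (the toy data frame is made up; this is an illustration, not Weka's code):

```python
import pandas as pd

def replace_missing(df):
    """Fill in missing values: column mean for numeric attributes,
    column mode for nominal ones (what Weka's ReplaceMissingValues does)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({"temperature": [64, None, 72],
                   "outlook": ["sunny", None, "rainy"]})
print(replace_missing(df))
```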
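To make equal-interval and equal-frequency binning concrete, a small numpy sketch on the temperature data above; the number of bins, k = 3, is an arbitrary choice:

```python
import numpy as np

# Temperature values from the weather data above.
temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
k = 3  # number of bins (arbitrary for this sketch)

# Equal-interval binning: k bins of identical width over [min, max],
# regardless of how the values are distributed.
width_edges = np.linspace(temps.min(), temps.max(), k + 1)
width_bins = np.digitize(temps, width_edges[1:-1])

# Equal-frequency binning: cut points at quantiles, so each bin
# receives roughly the same number of values.
freq_edges = np.quantile(temps, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(temps, freq_edges[1:-1])

print("equal-width edges:    ", width_edges)
print("equal-frequency edges:", freq_edges)
print("width bins:", width_bins, " frequency bins:", freq_bins)
```

Equal-interval bins have identical width and may end up unbalanced; equal-frequency bins adapt their edges so that each holds roughly the same number of values.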
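For supervised discretization, the entropy criterion can be sketched as follows: score every candidate cut point by the weighted class entropy of the two intervals it creates and keep the best. Methods such as Fayyad and Irani's apply this recursively with a stopping criterion, which is omitted in this sketch:

```python
import numpy as np

def entropy(labels):
    """Class entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut(values, labels):
    """Return the cut point minimizing the weighted class entropy of
    the two resulting intervals. `values` must be sorted, with
    `labels` aligned to them."""
    n, best = len(values), (None, np.inf)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # cannot cut between identical values
        cut = (values[i - 1] + values[i]) / 2
        score = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if score < best[1]:
            best = (cut, score)
    return best

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
play = np.array(list("ynyyynnyyynyyn"))  # y/n class labels from the table
print(best_cut(temps, play))
```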
Weka Filter
(screenshot of the filter in Weka)

Attribute Selection
Before inducing a model we almost always do input engineering. The most useful part of this is attribute selection (also called feature selection):
• Select relevant attributes
• Remove redundant and/or irrelevant attributes
Why?

Reasons for Attribute Selection
• Simpler model: more transparent and easier to interpret
• Faster model induction (but what about the overall time?)
• Structural knowledge: knowing which attributes are important may be inherently important to the application
• What about the accuracy?

Attribute Selection Methods
What is evaluated, and how?

Evaluation method     Single attributes   Subsets of attributes
Independent           filters             filters
Learning algorithm    -                   wrappers

Filters
A filter results in either:
• A ranked list of attributes: typical when each attribute is evaluated individually; you must then select how many to keep
• A selected subset of attributes: found by forward selection, best-first search, or random search such as a genetic algorithm

Filter Evaluation Examples
• Information gain (sketched below)
• Gain ratio
• Relief
• Correlation: high correlation with the class attribute, low correlation with the other attributes

Wrappers
Wrappers "wrap around" the learning algorithm and must therefore always evaluate subsets of attributes. They return the best subset of attributes, must be applied anew for each learning algorithm, and use the same search methods as before. The loop (sketched below):
1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., by accuracy)
4. Stop if the stopping criterion is met; otherwise repeat
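The information-gain filter mentioned above can be sketched as follows for a single nominal attribute; on the outlook attribute of the toy weather data the gain works out to about 0.247 bits:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(attribute, labels):
    """Information gain of a nominal attribute w.r.t. the class:
    class entropy minus the expected entropy after splitting."""
    gain = entropy(labels)
    for v in np.unique(attribute):
        mask = attribute == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Outlook attribute of the toy weather data:
# sunny 2 yes / 3 no, overcast 4 yes, rainy 3 yes / 2 no.
outlook = np.array(["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5)
play = np.array(list("nnnyy" + "yyyy" + "yyynn"))
print(round(info_gain(outlook, play), 3))  # ~0.247
```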
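And a sketch of the wrapper loop, here instantiated with greedy forward selection and cross-validated accuracy as the evaluation step. The use of scikit-learn and the iris data is an assumption for illustration; the slides describe the loop independently of any toolkit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_forward_selection(X, y, model):
    """Greedy forward selection wrapped around `model`: keep adding the
    attribute that most improves cross-validated accuracy; stop when no
    candidate improves it (the 'Stop?' test in the loop above)."""
    selected, best_acc = [], 0.0
    while len(selected) < X.shape[1]:
        scored = [
            (cross_val_score(model, X[:, selected + [f]], y, cv=5).mean(), f)
            for f in range(X.shape[1]) if f not in selected
        ]
        acc, f = max(scored)
        if acc <= best_acc:
            break  # no remaining attribute improves the model
        selected.append(f)
        best_acc = acc
    return selected, best_acc

X, y = load_iris(return_X_y=True)
print(wrapper_forward_selection(X, y, GaussianNB()))
```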
How does it help?
• Naïve Bayes
• Instance-based learning
• Decision tree induction

Scalability
• Data mining mostly uses well-developed techniques (AI, statistics, optimization)
• The key difference: very large databases
• How do we deal with scalability problems?
• Scalability: the capability of handling an increased load in a way that does not affect performance adversely

Massive Datasets
• Very large data sets: millions+ of instances, hundreds+ of attributes
• Scalability in space: the data set cannot be kept in memory (forcing, e.g., processing one instance at a time)
• Scalability in time: the learning time can be very long. How does the time depend on the input, i.e., on the number of attributes and the number of instances?

Two Approaches
• Increased computational power: only works if the algorithms can be sped up, and the computing power must be available
• Adapted algorithms: automatically scale down the problem so that it always has approximately the same difficulty

Computational Complexity
We want to design algorithms with good computational complexity. (The figure plots running time against the number of instances/attributes for exponential, polynomial, linear, and logarithmic growth.)

Example: Big-Oh Notation
Let n be the number of instances and m the number of attributes. Going once through all the instances has complexity O(n). Further examples:
• Polynomial complexity: O(mn²)
• Linear complexity: O(m + n)
• Exponential complexity: O(2ⁿ)

Classification
• A problem for which no polynomial-time algorithm exists is called NP-complete (informally)
• Finding the optimal decision tree is an example of an NP-complete problem
• However, ID3 and C4.5 are polynomial-time algorithms: heuristics that construct solutions to a difficult problem, "efficient" from a computational-complexity standpoint but still facing a scalability problem

Decision Tree Algorithms
• Traditional decision tree algorithms assume the training set is kept in memory; swapping in and out of main and cache memory is expensive
• Solution: partition the data into subsets, build a classifier on each subset, and combine the classifiers
• This is not as accurate as a single classifier built from all the data

Other Classification Examples
• Instance-based learning: goes through the instances one at a time and compares each with the new instance; polynomial complexity O(mn), but the response time may be slow
• Naïve Bayes: polynomial complexity, but stores a very large model

Data Reduction
Another approach is to reduce the size of the data before applying a learning algorithm (preprocessing). Some strategies:
• Dimensionality reduction
• Data compression
• Numerosity reduction

Dimensionality Reduction
• Remove irrelevant, weakly relevant, and redundant attributes (attribute selection)
• Many methods are available, e.g., forward selection, backward elimination, and genetic algorithm search
• Often yields a much smaller problem, usually with little degradation in predictive performance, and sometimes even better performance

Data Compression
Also aims at dimensionality reduction, by transforming the data into a smaller space. Principal Component Analysis (sketched below):
• Normalize the data
• Compute c orthonormal vectors, the principal components, that provide a basis for the normalized data
• Sort them in order of decreasing significance
• Eliminate the weaker components

PCA: Example
(figure)

Numerosity Reduction
Replace the data with an alternative, smaller representation such as a histogram. For example, the 52 values

1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

reduce to three bin counts: 13 values in 1-10, 25 in 11-20, and 14 in 21-30 (computed below).

Other Numerosity Reduction
• Clustering: data objects (instances) in the same cluster can be treated as the same instance; this requires a scalable clustering algorithm
• Sampling: randomly select a subset of the instances to be used

Sampling Techniques
Different kinds of samples (see the sketch below):
• Sample without replacement
• Sample with replacement
• Cluster sample
• Stratified sample
The complexity of sampling is actually sublinear: O(s), where s is the number of sampled instances and s ≪ n.
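The four PCA steps listed above, sketched with numpy (the random input matrix and the choice of c = 2 retained components are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 instances, 5 attributes

# 1. Normalize the data (zero mean, unit variance per attribute).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute orthonormal vectors (principal components) that form a
#    basis for the normalized data, via the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# 3. Sort the components by decreasing significance (eigenvalue).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Eliminate the weaker components: keep only the top c.
c = 2
X_reduced = Z @ eigvecs[:, :c]
print(X_reduced.shape)                   # (100, 2)
```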
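The histogram reduction can be checked with numpy; the bin edges are chosen to reproduce the 1-10 / 11-20 / 21-30 bins:

```python
import numpy as np

values = np.array([1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
                   15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
                   20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30])

# Replace the 52 raw values with three bin counts.
counts, edges = np.histogram(values, bins=[1, 11, 21, 31])
print(counts)  # [13 25 14]
```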
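A sketch of three of the sampling techniques (the data, class labels, and sample size s = 50 are made up). The stratified sample draws from each class in proportion to its frequency, so rare classes are not lost:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)                  # stand-in for a large data set
labels = rng.integers(0, 2, size=1000)  # class labels, for stratification
s = 50                                  # sample size, s << n

without_repl = rng.choice(data, size=s, replace=False)
with_repl = rng.choice(data, size=s, replace=True)

# Stratified sample: draw from each class in proportion to its frequency.
stratified = np.concatenate([
    rng.choice(data[labels == c],
               size=int(s * (labels == c).mean()),
               replace=False)
    for c in np.unique(labels)
])
print(len(without_repl), len(with_repl), len(stratified))
```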
Weka Filters
• The PrincipalComponents filter is found under the Attribute Selection tab
• We have already seen the filters for discretizing data
• The Resample filter randomly samples a given percentage of the data; if you specify the same seed, you get the same sample again
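A small Python analogue of the Resample behavior described above; this is not Weka's implementation, it just demonstrates that a fixed seed makes the sample reproducible (sampling with replacement is an assumption of this sketch):

```python
import numpy as np

def resample(data, percent, seed):
    """Randomly sample `percent`% of the data, with replacement;
    the same seed always yields the same sample."""
    rng = np.random.default_rng(seed)
    size = int(len(data) * percent / 100)
    return rng.choice(data, size=size, replace=True)

data = np.arange(100)
a = resample(data, 10, seed=1)
b = resample(data, 10, seed=1)
print(np.array_equal(a, b))  # True: same seed, same sample
```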