DATA MINING ASSIGNMENT – WEEK 5
Total Score 114 out of 120

Score 20 out of 20

1. For test data to be classified correctly, the training dataset of a supervised learner model should be randomly selected via stratified sampling, so that the sample's class distribution is representative of the distribution in the entire dataset (a short stratified-split sketch appears after question 3). An examination of training instance typicality scores identifies the most typical instances, whose attribute composition and values can be compared to those of the overall population to determine whether the training instances are representative of it.

Good

Training set data must be proportionate to the test set data: models built with training data that does not represent the set of all possible domain instances can lead to misclassified test data (215).
Redundant attributes must be addressed: redundant attributes can certainly wreak havoc upon output results (228).
Missing values must be addressed: in the data preprocessing phase, decisions need to be made about how to handle missing or null attribute values on an entry-by-entry basis (155).
Input must be meaningful and error minimized: we have a variety of error-checking and process-simplifying mechanisms at our disposal (129).

Score 20 out of 20

2. It is important to tune the number of clusters into which the unsupervised clustering model partitions the training data, so that the number matches as closely as possible the natural structure of the data from which the test data will be drawn. This optimizes the accuracy of the results generated by the model, since the clustering algorithm attempts to identify clusters such that the between-cluster variation is large compared to the within-cluster variation (a sketch of tuning K this way appears after question 3).

Good.

The main goal of unsupervised clustering is to create K clusters where the entries within each cluster are very similar, but the clusters themselves are different from one another.

Score 16 out of 20

3. For optimal performance, the K-means algorithm requires the data to be normalized to standardize the scale of effect each attribute has on the results, so that no particular attribute or subset of attributes dominates the analysis. Min-max normalization transforms all attribute values to fall within the range 0 to 1; Z-score standardization can then be used to detect outliers (values whose standardized score is greater than 3 or less than -3). The algorithm also has several limitations:

A. The number of clusters must be estimated in Step 1 of the algorithm, and this is not always possible.
B. The initial cluster centers are picked at random and will most likely be suboptimal (see the initialization sketch after this answer).
C. The ability to try and judge alternative clusterings is severely hampered as the number of points and pairs increases.
D. Our choice of initial cluster centers will be reflected in our final cluster centers.
E. Even outside the problems of center choice and combinatorial explosion, finding an optimal solution requires computing and comparing alternative clusterings, which may be impractical.
F. The K-means algorithm “only works with real valued data”.
G. The K-means algorithm tends to work “best when the clusters that exist in the data are of approximately equal size”.
H. Neither attribute significance nor the clusters themselves can be fully explained using the K-means algorithm.
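For question 1, stratified sampling is easy to demonstrate. A minimal sketch, assuming scikit-learn and its bundled iris data (neither is named in the assignment): the stratify argument keeps the class proportions of both splits in line with the full dataset.

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print("train:", Counter(y_train))   # 35 of each of the 3 classes
print("test: ", Counter(y_test))    # 15 of each of the 3 classes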
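For question 2, one common way to tune the number of clusters is to compare the within-cluster variation (sum of squared errors) across candidate values of K and look for the point of diminishing returns. A sketch on made-up synthetic data, again assuming scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# synthetic data: three natural clusters, attributes on very different scales
centers = [(0, 0), (5, 500), (10, 1000)]
X = np.vstack([rng.normal(loc=c, scale=(1, 100), size=(60, 2)) for c in centers])
X = MinMaxScaler().fit_transform(X)   # rescale so neither attribute dominates

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # within-cluster SSE; look for the elbow (k=3 here)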
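Points B and D of question 3 can be observed directly: running K-means with a single random initialization under different seeds tends to settle in different local optima. A toy sketch (the data and seeds are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((200, 2))   # data with no strong cluster structure

for seed in range(3):
    # one initialization per run; the final within-cluster SSE varies by seed
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print("seed", seed, "-> SSE", round(km.inertia_, 4))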
Score 20 out of 20

4. Genetic algorithms are plagued by some issues that affect performance. Genetic algorithms are not limited in the types of data they can accept: as long as a given problem can be represented as a string of bits of fixed length, it can be handled. Although many problems can be encoded this way, the specific characteristics of the encoding affect the goodness of the results generated; by coding problems properly, solutions to many problems can be reached. However, a genetic algorithm may not generate the optimum solution, because it can prematurely settle on a merely reasonable one.

Good

To get around this problem, a GA can be combined with other techniques, for example following the GA with a pass that changes each bit in the solution in turn to determine whether a better solution exists. A GA works by evolving successive generations of genomes that approximate more and more closely the fitness criteria established in the fitness function, with the goal of maximizing the fitness of the genomes in the population. The problem of overly fast convergence to a less-than-optimum solution suggests that the search is being limited.

Good

To get around this, the probabilities for crossover and mutation can be set high initially, then slowly decreased from one generation to the next (a toy sketch of this appears after question 5).

Good

Score 20 out of 20

5. To be useful for data mining, data may need to be cleaned and transformed. Certain data mining techniques do not function well in the presence of outliers and may deliver unstable results. Identification of outliers can be accomplished by examining histograms or scatter plots of the attributes.

Good

In some data mining algorithms, large differences in the ranges of attributes lead to a tendency for the attribute with the greater range to have excessive influence on the results. In such cases numerical attributes should be normalized, to standardize the scale of effect each attribute has on the results.

Good

There are two common techniques for rescaling the values of selected attributes. Min-max normalization restricts values to the range 0 to 1 by taking how much greater the attribute value is than the minimum value and scaling this difference by the range. Z-score standardization takes the difference between the attribute value and the attribute mean and scales that difference by the standard deviation of the attribute values. Z-score standardization can also be used to identify outliers: values farther than 3 standard deviations from the mean, i.e., with a Z-score greater than 3 or less than -3.

Good

Missing data can be handled by replacing the missing value with a value substituted according to various criteria: a) the attribute mean or mode, b) a value generated at random from the observed attribute distribution, or c) some constant. Data misclassifications can be identified by examining the frequency distribution of categorical attributes for inconsistencies.

Good
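The remedies named in question 4 can be sketched on a toy problem. Everything below is an illustrative assumption rather than the assignment's setup: the OneMax fitness function (count the 1 bits), truncation selection, a mutation rate that starts high and decays each generation (the crossover probability is left fixed for brevity), and a final bit-flip pass over the best genome.

import random

random.seed(0)
L, POP, GENS = 24, 30, 40

def fitness(genome):
    # OneMax toy fitness: count of 1 bits (stand-in for a real fitness function)
    return sum(genome)

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
p_mut = 0.20                        # start high to keep the search broad

for gen in range(GENS):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:POP // 2]      # keep the fitter half (truncation selection)
    children = []
    while len(survivors) + len(children) < POP:
        a, b = random.sample(survivors, 2)
        cut = random.randrange(1, L)                  # one-point crossover
        child = [bit ^ (random.random() < p_mut)      # mutate each bit
                 for bit in a[:cut] + b[cut:]]
        children.append(child)
    pop = survivors + children
    p_mut = max(0.01, p_mut * 0.9)  # decay mutation from generation to generation

best = max(pop, key=fitness)
for i in range(L):                  # final bit-flip pass, as described above
    trial = best[:]
    trial[i] ^= 1
    if fitness(trial) > fitness(best):
        best = trial
print("best fitness:", fitness(best), "of", L)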
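The transformations in question 5 reduce to two one-line formulas: min-max is (v - min) / (max - min) and Z-score is (v - mean) / sd. A minimal sketch, assuming pandas and a made-up income column with one missing entry:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [28000.0, 41000.0, np.nan, 350000.0, 39000.0]})

# Min-max normalization: (v - min) / (max - min) maps values into [0, 1]
lo, hi = df["income"].min(), df["income"].max()
df["income_minmax"] = (df["income"] - lo) / (hi - lo)

# Z-score standardization: (v - mean) / sd; |z| > 3 flags candidate outliers
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["outlier"] = z.abs() > 3

# Missing values: replace with the attribute mean (the mode or a constant also work)
df["income_filled"] = df["income"].fillna(df["income"].mean())
print(df)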
Score 18 out of 20

6. Neural networks can be used to identify clusters of records that are similar to each other, but unfortunately they 1) do not explain how the records are similar (Good), 2) work best when all of the input and output values are between 0 and 1, and 3) may converge on an inferior solution (Good).

An application of neural networks would be useful in an effort to identify bank customers who are good candidates for home equity loans. Customer data includes attributes such as the appraised value of the home, the amount of credit available, the amount of credit granted, age, marital status, and household income. Since the inputs to a neural network must range in value between 0 and 1, transformation of the input data is required. Continuous values, like home value or income, must be transformed by subtracting the lower bound of the range from the value and dividing the result by the size of the range (the difference between the highest and lowest values for that attribute in the data set). For categorical attributes, like marital status, arbitrary fractions between 0 and 1 are assigned to each of the categories (both transformations are sketched after this answer).

Because neural networks produce continuous values, the output from a network can be difficult to interpret for categorical results. The best way to calibrate the output is to run the network over a test set, separate from the training set, and use the results from the test set to calibrate the output of the network to categories. In some cases the network can have a separate output for each category.

Neural networks do not produce easily understood rules that explain how they arrive at a given result. Even though a neural network cannot produce explicit rules, sensitivity analysis can be used to explain which inputs are more important than others. This analysis can be performed inside the network, using the errors generated from backpropagation. Finally, a neural network may converge on an inferior solution (a local rather than a global optimum) for any given training set; the test set can be used to determine when a model performs well enough to be used on unknown data. Also, ANNs can be overtrained to the point of working well on the training data but poorly on the test data.
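The input transformations in question 6 are straightforward to sketch. The column names and category fractions below are illustrative assumptions based on the attributes named in the answer, assuming pandas:

import pandas as pd

df = pd.DataFrame({
    "home_value": [180000, 420000, 250000],
    "income":     [52000, 140000, 76000],
    "marital":    ["single", "married", "divorced"],
})

# continuous inputs: (v - min) / (max - min) puts each attribute in [0, 1]
for col in ("home_value", "income"):
    lo, hi = df[col].min(), df[col].max()
    df[col + "_in"] = (df[col] - lo) / (hi - lo)

# categorical inputs: assign an arbitrary fraction between 0 and 1 per category
fractions = {"single": 0.0, "married": 0.5, "divorced": 1.0}
df["marital_in"] = df["marital"].map(fractions)
print(df)

Note that assigning fractions imposes an artificial ordering on the categories; the alternative mentioned in the answer, a separate network output (or input) per category, avoids this.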