Total Score 110 out of 130

DATA MINING ASSIGNMENT – WEEK 9

Score 50 out of 50

1. Databases need to undergo preprocessing to be useful for data mining. Dirty data can confuse the data mining procedure, resulting in unreliable output. Data cleaning includes smoothing noisy data, filling in missing values, identifying and removing outliers, and resolving inconsistencies.

NOISY DATA: The problem of noisy data must be addressed in order to minimize its negative effect on the overall accuracy of the rules generated or models created by data mining applications. Noise is a component in the training set that should not be characterized by the modeling tool. Good Noise is random error in attribute values due to duplicate records, incorrect values, outliers or values not consistent with common sense. Identifying outliers is important because they may represent errors in data entry. Good Even if the outlier is a valid data point, certain modeling techniques are sensitive to the presence of outliers and may produce unstable results. Ways to minimize the impact of noise include
a.) Identification and elimination of outliers for a numeric variable by examining a histogram of the variable, or examining scatter plots to identify outliers on more than one variable, Good
b.) Identification of outliers using Z-score standardization, which reveals values farther than 3 standard deviations from the mean, Good
and c.) Application of standardization techniques to standardize the scale of effect each variable has on the result when analyzing data in which the variables have ranges that vary greatly from each other. Good

MISSING DATA: Missing values, if left untreated, can have serious consequences. Missing values should be replaced because
a.) some data mining techniques cannot deal with missing values and exclude the entire instance, Good
b.) the default replacement assigned by the data mining tool, if inappropriate, may introduce distortion, Good
and c.) most default replacement methods discard the information contained in the missing-value patterns. Good
There are methods that can be used to estimate an appropriate value to replace a missing value. Capturing the variability that exists in a data set, in the form of ratios between various values, can be used to infer appropriate missing values that least corrupt the patterns in the original data set. Good Choices of replacement values include constants specified by the analyst, the field mean for numerical variables or the mode for categorical variables, or a value generated at random from the observed distribution of the variable. Good

DATA NORMALIZATION AND SCALING: Some data mining tools require the range of the input variables to be normalized, and most techniques benefit from it. Variables tend to have ranges that vary greatly from each other. For some data mining algorithms, such differences in range lead to a tendency for the variable with the greater range to have too much influence on the results. Good Therefore, data miners should normalize their numerical variables to standardize the scale of effect each variable has on the results. Techniques for normalization include
a.) Min-max normalization, where it is determined how much greater the field value is than the minimum value min(X) and this difference is scaled by the range, or X* = (X - min(X)) / (max(X) - min(X)), Good
b.) Z-score standardization, where the difference between the field value and the field mean is computed and scaled by the standard deviation of the field values, or X* = (X - mean(X)) / SD(X), resulting in transformed values with a mean of 0 and a standard deviation of 1, Good
and c.) Decimal scaling, where each field value is divided by 10 raised to a sufficiently large power to scale all values to the range between -1 and 1. Good
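As an illustration of the missing-value replacement and Z-score outlier identification described in the NOISY DATA and MISSING DATA sections above, here is a minimal Python sketch. The field name, its values and the choice of the field mean as the replacement are assumptions made up for this example.

import statistics

# A hypothetical numeric field with two missing values (None) and one
# suspicious entry; all numbers are invented for illustration.
income = [42000, 51000, None, 38000, 47500, 45500, 52000, None,
          44000, 49500, 41000, 46500, 53500, 39500, 48000, 50500,
          43500, 1200000, 45000, 47000]

# Replace missing values with the field mean (one of the choices listed above).
observed = [x for x in income if x is not None]
field_mean = statistics.mean(observed)
filled = [x if x is not None else field_mean for x in income]

# Identify outliers with Z-score standardization: values farther than
# 3 standard deviations from the mean are flagged for review.
mean = statistics.mean(filled)
sd = statistics.stdev(filled)
outliers = [x for x in filled if abs((x - mean) / sd) > 3]

print("mean used for imputation:", round(field_mean, 2))
print("values flagged as outliers:", outliers)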
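The three normalization techniques just listed can also be sketched directly from their formulas; the field values below are invented for illustration.

import statistics

def min_max(values):
    # X* = (X - min(X)) / (max(X) - min(X)): rescales the field to [0, 1].
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    # X* = (X - mean(X)) / SD(X): transformed field has mean 0 and SD 1.
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(x - mean) / sd for x in values]

def decimal_scaling(values):
    # Divide by 10^j, with j just large enough that every value lies in (-1, 1).
    j = len(str(int(max(abs(x) for x in values))))
    return [x / (10 ** j) for x in values]

ages = [23, 35, 41, 29, 62, 18, 54]   # invented field values
print(min_max(ages))
print(z_score(ages))
print(decimal_scaling(ages))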
ATTRIBUTE AND INSTANCE SELECTION: Data modeling requires at least three sets of data: a training set, a test set and an execution set. Each set selected needs to be representative of the main data set in terms of the distribution of characteristics among instances. Good Feature enhancement may require a concentration of instances exhibiting some particular feature. Such a concentration can only be made if a subset of the data is extracted from the main data set. Good So there is a need to decide how large a data set is required to be an accurate reflection of all patterns in the data. It is critical that each subset is representative of the composition of attributes found in the main data set, to avoid introducing bias that reduces the accuracy of the results of a data mining session. Good

DATA TYPE CONVERSION: Data mining tools vary in the type of data that they can process. Some can only process numerical data (e.g., neural networks, linear regression) and others only categorical data (e.g., some decision-tree algorithms). Good To accommodate the requirements of specific data mining techniques, data transformation must take place to convert categorical data to numerical data and vice versa. Good

Score 60 out of 80

2. A decision tree is a predictive model that can be viewed as a tree, with each branch representing a classification question and the leaves representing partitions of the data set with their classifications. Decision-tree methods are best suited for problems that require the classification of records or the prediction of outcomes. When the objective is to assign each record to one of a few categories, decision trees are the best choice. Decision trees are less appropriate for estimation, where the objective is to predict the value of a continuous variable. Decision trees have problems processing time series data unless the data is manipulated or presented in a way that makes the trends apparent. Some decision-tree algorithms are limited in that they can only deal with binary-valued target classes (e.g., yes/no, accept/reject). Others are able to assign records to an arbitrary number of classes, but produce errors when the number of training examples per class gets small, which can happen when a tree has many levels and/or many branches per node. Decision trees can be expensive to train because at each node, each candidate splitting field must be sorted before its best split can be found. Good

Association rules take the form “If antecedent, then consequent”, along with the measures of support and confidence associated with the rule. Association rule techniques such as market basket analysis are the best choice when there is an undirected data mining problem that consists of well-defined items that group together in interesting ways. They work best when all items have about the same frequency in the data. The computations required to generate association rules grow exponentially with the number of items and the complexity of the rules being considered. The most difficult problem when applying this method comes with determining the right set of items to use in the analysis. Good
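The points made under ATTRIBUTE AND INSTANCE SELECTION, DATA TYPE CONVERSION and the decision tree discussion above can be combined into one small sketch: categorical fields are converted to numeric codes, a stratified (representative) training/test split is drawn, and a decision tree is fit. The records, column meanings and class labels are all invented, and scikit-learn is assumed to be available.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented records: (income_band, owns_home, age) -> responded to offer
records = [
    ("low", "no", 23, "no"),     ("low", "no", 31, "no"),
    ("low", "yes", 45, "no"),    ("medium", "no", 29, "no"),
    ("medium", "no", 38, "yes"), ("medium", "yes", 52, "yes"),
    ("high", "yes", 41, "yes"),  ("high", "no", 36, "yes"),
    ("low", "yes", 27, "no"),    ("medium", "yes", 61, "yes"),
    ("high", "yes", 48, "yes"),  ("high", "no", 55, "yes"),
    ("low", "no", 33, "no"),     ("medium", "no", 26, "no"),
    ("high", "yes", 59, "yes"),  ("low", "yes", 44, "no"),
]

# DATA TYPE CONVERSION: map categorical values onto numeric codes.
income_code = {"low": 0, "medium": 1, "high": 2}
yes_no = {"no": 0, "yes": 1}
X = [[income_code[inc], yes_no[own], age] for inc, own, age, _ in records]
y = [yes_no[label] for *_, label in records]

# INSTANCE SELECTION: stratify so both subsets keep the class composition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test-set accuracy:", tree.score(X_test, y_test))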
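To make the support and confidence measures of an association rule concrete, here is a short sketch that counts them for one candidate rule over a few invented market-basket transactions; the rule tested is an arbitrary assumption chosen for illustration.

# Invented transactions, each a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk", "butter"},
]

# Candidate rule: "If bread, then butter" (assumed for illustration).
antecedent, consequent = {"bread"}, {"butter"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
has_antecedent = sum(1 for t in transactions if antecedent <= t)

support = both / n                  # share of all transactions containing both sides
confidence = both / has_antecedent  # share of antecedent transactions that also contain the consequent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")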
The k-means algorithm is an algorithm for finding clusters in data by following these steps:
1.) Identification of how many clusters k the data should be partitioned into,
2.) random assignment of k records to be the initial cluster center locations,
3.) Identification of the nearest cluster center for each record,
4.) For each of the k clusters, identification of the cluster centroid and update of the location of each cluster center to the new value of the centroid, and
5.) repetition of steps 3 and 4 until convergence or termination, when the centroids no longer change.
The k-means method is best used to identify patterns of grouping, or clusters, in a large, complex data set with many variables and a lot of internal structure. In the k-means method, the original choice of a value for k determines the number of clusters that will be found. If this number does not match the natural structure of the data, the technique won't obtain good results. It is difficult to interpret the resulting clusters of k-means because when you don't know what you are looking for, you may not recognize it when you find it. Clusters identified may not have any practical value. Good

Linear regression is a method of prediction or estimation in which a model is created that maps values from predictors in such a way that the lowest error occurs in making predictions. The relationship between a predictor and a prediction can be mapped in a two-dimensional space, with the records plotted using the prediction values along the Y axis and the predictor values along the X axis. The linear regression model can be viewed as the line that minimizes the error between the actual value and the corresponding point on the line (i.e., the prediction). Of the many lines that could be drawn through the data, the one that minimizes the distance between the line and the data points is the one chosen for the predictive model. The regression model is represented by an output attribute (i.e., the prediction) whose value is determined by a linear sum of weighted input attribute (i.e., predictor) values. Linear regression is very sensitive to anomalous fluctuations in the data, such as outliers, and is seriously affected by co-linearity in the input variables. It cannot deal with missing data, and many of the standard default methods of replacing missing values do more harm than good to the resulting model. Because it fits straight lines, it is limited to additive effects and handles interactions between predictors poorly. Good

Logistic regression is a nonlinear regression technique that associates a conditional probability with a data instance. Logistic regression functions similarly to the linear regression model described above, except that the prediction values undergo a logarithmic (logit) transformation for the model to generate good results. Logistic regression is applied to create predictive models when the problem involves the prediction of a response that can only be yes or no. Good

Bayes classifier is a statistical classifier that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayes classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. Bayesian methods are a way of starting with one set of evidence (i.e., multiple-variable data) and arriving at an estimate of the outcome probabilities given the evidence. In order to discover the actual probability of an outcome given some multivariate evidence, all of the variables are required to be independent of each other. The algorithm can deal with all variable types, but only by converting continuous variables into categorical values. This technique easily incorporates domain knowledge as nodes: nodes can be created and their probabilities learned from the data, or created with their probability values set from the domain knowledge. Bayesian networks can present insights as well as predictions. Good
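Minimal sketches of the four techniques discussed above follow. First, the k-means steps listed in the answer, run on a handful of invented two-dimensional records with k assumed to be 2.

import math
import random

def kmeans(points, k, max_iter=100):
    # Step 2: pick k records at random as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each record to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: move each center to the centroid of its cluster.
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

random.seed(0)  # fixed seed so the sketch is repeatable
# Step 1: k = 2 clusters, chosen to match the invented data below.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centers, clusters = kmeans(points, k=2)
print("centers:", centers)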
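Next, the "line that minimizes the error" idea of linear regression, shown with the closed-form least-squares slope and intercept for a single predictor; the data points are invented.

# Ordinary least-squares fit of the line y = a + b*x for one predictor,
# i.e., the line that minimizes the sum of squared errors.
xs = [1, 2, 3, 4, 5, 6]                  # predictor values (X axis)
ys = [2.1, 4.3, 6.2, 7.9, 10.1, 12.2]    # prediction values (Y axis)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x                  # intercept

fitted = [a + b * x for x in xs]
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
print(f"y = {a:.2f} + {b:.2f}*x, sum of squared errors = {sse:.3f}")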
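For logistic regression, a sketch of predicting a yes/no response with scikit-learn; the library, the single predictor and the labels are assumptions made for illustration.

from sklearn.linear_model import LogisticRegression

# Invented single predictor (e.g., hours of product use) and a yes/no response.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]            # 0 = "no", 1 = "yes"

model = LogisticRegression().fit(X, y)

# The model returns a conditional probability for each class, not just a label.
print(model.predict_proba([[4.5]]))     # P(no), P(yes) for a mid-range value
print(model.predict([[2], [7]]))        # hard yes/no predictions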
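Finally, a sketch of the naive Bayes computation: class-conditional probabilities are multiplied together as if the attributes were independent, which is exactly the assumption noted above. The records and attributes are invented.

from collections import Counter, defaultdict

# Invented categorical training records: (outlook, windy) -> play
train = [
    ("sunny", "no", "yes"), ("sunny", "yes", "no"), ("rainy", "yes", "no"),
    ("overcast", "no", "yes"), ("rainy", "no", "yes"), ("sunny", "no", "yes"),
    ("overcast", "yes", "yes"), ("rainy", "yes", "no"),
]

class_counts = Counter(label for *_, label in train)
# Count each attribute value per (attribute position, class).
value_counts = defaultdict(Counter)
for *attrs, label in train:
    for i, v in enumerate(attrs):
        value_counts[(i, label)][v] += 1

def posterior(attrs):
    # P(class | attrs) is proportional to P(class) times the product of
    # P(attr_i | class), assuming the attributes are independent.
    scores = {}
    for label, n_label in class_counts.items():
        p = n_label / len(train)
        for i, v in enumerate(attrs):
            p *= value_counts[(i, label)][v] / n_label
        scores[label] = p
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

print(posterior(("sunny", "yes")))      # class membership probabilities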
Neural Networks ???

Genetic Algorithms ???