Total Score 130 out of 130
Karen Cegelski
Mid Term Week 5
CSIS 5420
July 26, 2005

Score 50 out of 50

1. (50 Points) Data pre-processing and conditioning is one of the key factors that determine whether a data mining project will be a success. For each of the following topics, describe the effect this issue can have on our data mining session and what techniques we can use to counter this problem.

a. Noisy data

In a large database, many of the attribute values may be inexact or incorrect. This may be attributed to the instruments measuring the values or to human error when entering the data. Sometimes some of the values in the training set are altered from what they should have been, which may result in one or more entries in the database conflicting with rules already established. The system may then regard these extreme values as noise and either ignore them, or it may take the values into account, possibly changing correct patterns. The problem is that one never knows whether the extreme values are correct, and the challenge is finding the best method to handle these "weird" values. Good

b. Missing data

Data values can be missing because they were not measured, not answered, or were unknown or lost. Data mining methods vary in the way they treat missing values: the missing values can be ignored, records containing missing data may be omitted, missing values may be replaced with the mode or the mean, or an inference can be made from the existing values. The data needs to be formatted, sampled, adapted, and sometimes transformed for the data mining algorithm in order to deal with the missing data. Identifying variables that explain the cause of missing data can help to mitigate the bias. Good

c. Data normalization and scaling

Data normalization is a common data transformation method that involves changing numeric values so that they fall within a specified range.
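A minimal sketch of this range transformation (min-max normalization); the attribute values below are hypothetical:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:  # all values identical: map everything to new_min
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

# Hypothetical attribute values (e.g., ages) rescaled into [0, 1]
ages = [25, 40, 55, 70]
print(min_max_normalize(ages))  # roughly [0.0, 0.33, 0.67, 1.0]
```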
A classifier such as a neural network works best when the numerical data is scaled to a range between 0 and 1. This method is appealing with distance-based classifiers, because by normalizing the attribute values, attributes with a wide range of values are less likely to outweigh attributes with smaller ranges. Good

Decimal scaling divides each numerical value by the same power of 10. If the values range between –1000 and 1000, the range can be changed to –1 and 1 by dividing each value by 1000. Good

d. Data type conversion

Data mining tools, including neural networks and some statistical methods, cannot process categorical data. Changing the categorical data to a numeric equivalent is a common data transformation method. Some data mining techniques are likewise unable to process numeric data in its original form. Most decision tree algorithms discretize numeric values by sorting the data and considering alternative binary splits of the data items. Good

e. Attribute and instance selection

The ability to deal with large volumes of data varies among classifiers. Some data mining algorithms are unable to analyze data containing a large number of attributes, while others have trouble with a large number of instances. Differentiating between relevant and irrelevant attributes is another problem for some data mining algorithms. The number of irrelevant attributes directly affects the number of training instances needed to build an accurate supervised learner model. We must make decisions about which attributes and instances to use when building the data mining models in order to overcome these problems. The problem with some of these algorithms is that they are complex: generating and testing all possible models for any dataset containing more than a few attributes is not possible. One technique that can be used is to eliminate attributes. With some classifiers, such as neural networks and nearest neighbor classifiers, attribute selection needs to take place before the data mining process can begin.
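One common pre-mining filter of this kind is to flag pairs of highly correlated input attributes so that one of each pair can be dropped; a minimal sketch, with hypothetical attribute names and data:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def redundant_pairs(attributes, threshold=0.9):
    """Pairs of attribute names whose |correlation| exceeds the threshold;
    one attribute of each such pair is a candidate for elimination."""
    names = list(attributes)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(attributes[a], attributes[b])) > threshold]

# Hypothetical data: height_cm and height_in carry the same information
data = {"height_cm": [150, 160, 170, 180],
        "height_in": [59.1, 63.0, 66.9, 70.9],
        "income":    [30, 80, 45, 60]}
print(redundant_pairs(data))  # -> [('height_cm', 'height_in')]
```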
Some steps that can be taken to assist in determining which attributes to eliminate are:

1) Input attributes highly correlated with other input attributes are redundant. The data mining tool will build a better model when only one attribute from a set of highly correlated attributes is designated as an input value.

2) When using categorical data, any attribute containing value v with a domain predictability score greater than a chosen threshold can be considered for elimination.

3) When using supervised learning, numerical attribute significance can be determined by comparing class mean and standard deviation scores. Good

Attributes with little predictive power can sometimes be combined with other attributes to create new attributes with a high degree of predictive capability. A few transformations commonly used to create new attributes are:

1) Create a new attribute where each value represents a ratio of the value of one attribute divided by the value of a second attribute.

2) Create a new attribute whose values are differences between the values of two existing attributes.

3) Create a new attribute with values computed as the percent increase or decrease of two current attributes. Good

During the training phase of supervised learning, the data is often randomly chosen from a pool of instances. Instances are selected to guarantee representation from each concept class. Instance typicality scores can be used to choose the best set of representative instances from each class. Classification accuracy is best achieved when the training sets for all types of classifiers contain an overweighted selection of highly and moderately typical training instances. Good

Score 80 out of 80

2. (80 points) We've covered several data mining techniques in this course.
For each technique identified below, describe the technique, identify which problems it is best suited for, identify which problems it has difficulties with, and describe any issues or limitations of the technique.

a. Decision trees

The most popular structure for supervised data mining is the decision tree. A common algorithm for building a decision tree selects a subset of instances from the training data and constructs an initial tree. The accuracy of the tree is then tested against the remaining training instances. Incorrectly classified instances are added to the current set of training data and the process is repeated. Minimizing the number of tree levels and tree nodes maximizes data generalization. Decision trees are easy to understand, map nicely to a set of production rules, and can be successfully applied to real problems. They make no prior assumptions about the data and can model datasets containing both numerical and categorical data. Good

All data mining algorithms have some issues, and decision trees are no exception. Output attributes must be categorical, and multiple output attributes are not permitted. Slight variations in the training data can result in different attribute selections at each choice point within the tree, making the decision tree algorithm unstable. Trees created from numeric datasets can be complex, as attribute splits for numeric data are typically binary. Good

b. Association rules

Association rules assist in finding relationships in large databases. They are unlike traditional production rules in that an attribute that is a precondition in one rule may appear as a consequent in another rule. Association rule generators allow the consequent of a rule to contain one or several attribute values. Because association rules are more complex, special techniques are required to generate them efficiently.
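The two standard measures used to evaluate such rules, support and confidence, can be sketched over a hypothetical set of market-basket transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent): support of the whole rule
    divided by support of the antecedent alone."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Hypothetical market-basket data
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "eggs"}]
print(support(baskets, {"milk", "bread"}))       # -> 0.5
print(confidence(baskets, {"milk"}, {"bread"}))  # 2/3: bread in 2 of 3 milk baskets
```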
Good

Using rule confidence and support assists in discovering which associations are likely to be interesting from a marketing perspective. Good

Caution should be exercised in the interpretation of association rules, because some discovered relationships might turn out to be trivial. Good

c. K-Means algorithm

The K-Means algorithm is a statistical unsupervised clustering technique. It requires all input attributes to be numeric and requires the user to decide how many clusters are to be discovered. The algorithm starts by randomly choosing one data point to represent each cluster. Next, each data instance is placed in the cluster to which it is most similar. New cluster centers are then computed, and the process continues until the cluster centers no longer change. The K-Means algorithm is easy to implement and understand. It is, however, not guaranteed to converge to a globally optimal solution. Good

It lacks the ability to explain what has been formed and is not able to tell which attributes are significant in determining the formed clusters. Even with these limitations, the K-Means algorithm is among the most widely used clustering techniques. Good

d. Linear regression

Linear regression is a favorite for estimation and prediction problems. It attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Good

It is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. It is a poor choice when the outcome is binary. Good

The value restriction placed on the dependent variable is not observed by the regression equation. This happens because linear regression produces a straight-line function, and the values of the dependent variable are unbounded in both the positive and negative directions. Good

e. Logistic regression

Logistic regression is a good choice for problems having a binary outcome.
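A minimal sketch of why: the logistic (sigmoid) function squeezes an unbounded linear combination into the (0, 1) range, so its output can be read as a class probability; the coefficients below are hypothetical:

```python
import math

def linear_output(x, w=0.8, b=-2.0):
    """Unbounded straight-line output: unsuitable as a probability."""
    return w * x + b

def logistic_output(x, w=0.8, b=-2.0):
    """Sigmoid of the same linear combination: always in (0, 1)."""
    z = linear_output(x, w, b)
    return 1.0 / (1.0 + math.exp(-z))

for x in (-10, 0, 10):
    print(x, linear_output(x), round(logistic_output(x), 4))
# linear_output ranges over (-inf, inf); logistic_output stays inside (0, 1)
```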
Good

It is a nonlinear regression technique that associates a conditional probability value with each data instance. The regression equation created limits the values of the output attribute to between 0 and 1, which allows the output values to represent a probability of class membership. Good

f. Bayes classifier

The Bayes classifier is a simple and powerful supervised classification technique. All input attributes are assumed to be of equal importance and independent of one another. Good

The Bayes classifier works well in practice even though these assumptions are likely to be false. It can be applied to datasets that contain both categorical and numeric data. Unlike many statistical classifiers, the Bayes classifier can be applied to datasets containing missing information. Good

g. Neural networks

A neural network is a set of interconnected nodes designed to imitate the function of the human brain. Neural networks are quite popular in the data mining community, as they have been successfully applied to problems across several disciplines. They can be constructed for both supervised learning and unsupervised clustering. The input values must always be numeric. Good

The first phase of the neural network is known as the learning phase. During this phase, the input values associated with each instance enter the network at the input layer; there is one input layer node for each input attribute contained in the data. Using the input values together with the network connection weights, the neural network computes the output for each instance. The output for each instance is then compared with the desired network output. Training is completed after a certain number of iterations or when the network converges to a predetermined minimum error rate. Good

During the second phase, the network weights are fixed and the network is used to compute output values for new instances.

h. Genetic algorithms

Genetic algorithms apply the theory of evolution to inductive learning.
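A minimal sketch of this evolutionary approach, under simple illustrative assumptions: fixed-length bit strings, a toy count-the-ones fitness function, survival of the fitter half, and single-point crossover with occasional mutation:

```python
import random

def fitness(individual):
    """Toy fitness function: count of 1-bits in the individual."""
    return sum(individual)

def evolve(pop_size=20, length=12, generations=30, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]      # fitter half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)   # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:           # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children           # replace deleted elements
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))  # best individual found and its fitness
```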
The technique can be supervised or unsupervised and is generally used for problems that cannot be solved using traditional techniques. Good

The standard method applies a fitness function to a set of data elements in order to determine which elements survive from one generation to the next. New instances are created from the surviving elements to replace those that did not survive. This technique can be used in conjunction with other learning techniques. Good

Very good overview!