Total Score: 130 out of 130
Nancy McDonald
CSIS 5420 Final Exam: Week 9

Score: 50 out of 50

1. Describe the effect the following have on data mining and discuss what can be done to counter the problem.

a. Noise: random errors in attribute values

Concerns: duplicate records, invalid attribute values.

Duplicate records can skew the results of a data mining operation because one value may appear more than once, giving it unfair weight in determining the results of a mining session. Another problem with duplicates arises when the mining session produces a list of results for a user to process and the processing costs money: the user incurs the unwarranted expense of the duplicate entries. Automated tools should filter out duplicate records when the data is moved from the operational environment to the warehouse.
Good

Invalid attribute values may cause the data mining process to ignore the attribute value for an entry, so the entry is not categorized correctly. This can leave the data an inaccurate representation of the population to be mined. Invalid attribute values should also be addressed in the preprocessing stage, before the data is stored in a warehouse. This can be done by simple scrubbing routines coded to look for invalid value types or out-of-range values. Data mining tools can handle errors in categorical data by checking frequency values or predictability scores for attribute values; values with a score near zero may be error candidates. Class means and standard deviation scores can also flag error candidates.
Good

b. Missing Data

Concerns: The problem with missing data is that crucial information for a data entry is lost and therefore unaccounted for in the final analysis. Again, the data may not accurately reflect the population if missing values are not counted in the final results. Missing data may also reflect a real situation; for example, a missing salary may mean the person is unemployed, but if the data mining tool is not set up to handle this case, the person's entry is misrepresented.
Good

Missing data can be handled during preprocessing by:
1. Discarding records with missing values – good if these are a small percentage of the total instances.
2. Replacing missing values with the class mean – reasonable for real-valued (numerical) data. (A sketch of this option follows at the end of this part.)
3. Replacing missing attribute values with values found in other, highly similar instances – usable for categorical or real-valued data.
Good

Missing data can be handled during data mining by:
1. Ignoring missing values (neural networks, Bayes classifier).
2. Treating missing values as equal comparisons (not good with very noisy data, where dissimilar instances may appear alike).
3. Treating missing values as unequal comparisons (ESX – a pessimistic approach).
Good
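A minimal sketch of preprocessing option 2 (class-mean replacement), assuming the data sits in a pandas DataFrame; the `class` and `salary` column names are invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy data: one numeric attribute with missing values, one class label.
df = pd.DataFrame({
    "class":  ["a", "a", "a", "b", "b", "b"],
    "salary": [40000, np.nan, 50000, 90000, 85000, np.nan],
})

# Option 1: discard records with missing values
# (reasonable when they are a small share of all instances).
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its class.
df["salary"] = df.groupby("class")["salary"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```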
c. Data Normalization and Scaling

Normalization: changing numeric values so that they fall within a specified range.
[Decimal] Scaling: dividing all numbers by the same power of 10.

Concerns: Why do this? Classifiers such as neural networks do a better job with numerical values scaled to a range between 0 and 1. Distance-based classifiers also do better with normalization, so that attributes with a wide range of values are less likely to outweigh attributes with smaller ranges.
Good

Techniques:
Decimal scaling: divide each numerical value by the same power of 10.
Good
Min-max normalization (when the maximum and minimum values are known): puts attribute values between 0 and 1, with 0 equal to the minimum and 1 equal to the maximum: v' = (v - min) / (max - min).
Good
Normalization using Z-scores: convert each value to a standard score using the attribute mean and the attribute standard deviation, v' = (v - mean) / sd; good when the maximum and minimum are unknown.
Good
Logarithmic normalization: replace the set of attribute values with their logarithms, which scales down the range of values without loss of information (the transform is invertible, so no precision is lost).
Good
(Sketches of these techniques, and of the conversion issue in part d, follow at the end of part d.)

d. Data Type Conversion

Why? Categorical data must be converted to numeric form because some data mining tools (e.g., neural networks) cannot process categorical data. Also, some data mining tools cannot process numeric data in its original form (decision tree algorithms).
Good

How? Numeric data can be converted to categorical ranges. For example, convert discrete ages ranging from 1 to 99 to the categories under10, under20, under30, etc., so that the user can view results grouped by age.
Good

Categorical data can be converted to numbers: e.g., if hair color = blonde then attribute value = 1, if hair color = brunette then attribute value = 2, if hair color = red then attribute value = 3. A problem with this conversion is that brunette may appear closer to blonde than red does, only because it is assigned a number (2) closer to blonde's number (1).
Good
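Picking up the normalization techniques from part c above, a minimal NumPy sketch of all four; the sample values are invented:

```python
import numpy as np

values = np.array([2.0, 30.0, 150.0, 996.0])

# Decimal scaling: divide by the same power of 10
# (the smallest power that brings every value below 1).
power = np.ceil(np.log10(np.abs(values).max()))
decimal_scaled = values / 10 ** power

# Min-max normalization: map [min, max] onto [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: standard scores via mean and standard deviation.
z_scores = (values - values.mean()) / values.std()

# Logarithmic normalization: compress the range without losing
# information (the transform is invertible for positive values).
log_scaled = np.log(values)
```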
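And for the ordering problem noted in part d (brunette = 2 looking artificially "closer" to blonde = 1 than red = 3 does), one common remedy is one-hot encoding, sketched here with pandas; the `hair_color` column is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["blonde", "brunette", "red"]})

# One-hot encoding: one 0/1 column per category, so every pair of
# distinct colors is equally far apart and no artificial order is
# introduced by the numeric codes 1, 2, 3.
encoded = pd.get_dummies(df, columns=["hair_color"])
print(encoded)
```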
e. Attribute and Instance Selection

Concerns: Some data mining algorithms have trouble with a large number of instances. Other algorithms have problems analyzing data containing more than a few attributes. Many algorithms are unable to differentiate between relevant and irrelevant attributes, which matters because the number of training instances needed to build an accurate supervised learner model is directly affected by the number of irrelevant attributes in the data. For example, neural networks and nearest neighbor algorithms give equal weight to all attributes during model building.
Good

Techniques:
Instance selection: instead of randomly selecting data for the training phase of supervised learning, instance-based classifiers save a subset of representative instances from each class. A new instance is classified by comparing its attribute values to those of the saved instances. The accuracy of this method depends on how well the chosen, "saved" instances represent each class; instance typicality scores are used to choose the best set of representatives. This technique can also be used with unsupervised clustering by computing a typicality score for each instance relative to all domain instances. Clusters become better defined when the most atypical domain instances are eliminated. Once well-defined clusters have been formed, the atypical instances can be presented and the model will either form new clusters or place them in existing clusters.
Good

Attribute selection: Some classifiers include attribute selection techniques as part of the modeling process (and are thus less likely to suffer from attributes with little predictive value). For other processes, one can do the following:

Eliminate attributes that are not predictive of class membership by removing all but one attribute from each set of highly correlated attributes (i.e., remove the redundant ones). (A code sketch of this appears after part 2b below.)
Good
If a value for a categorical attribute exceeds a domain predictability threshold, most domain instances have this value, so the attribute does not help differentiate between instances; eliminate it.
Good
Eliminate attributes that have low attribute significance scores.
Good
Create new attributes that are a meaningful combination of existing attributes (more meaningful than the originals were by themselves, e.g., ratios, differences, percent increase/decrease).
Good

Score: 80 out of 80

2. Describe the following data mining techniques, identify problems for which each is best suited, identify problems with which it has difficulties, and describe issues or limitations.

a. Decision Trees

Description: A supervised learning technique in which a decision tree is built, then fine-tuned, from training data. Each branch (or link) represents a unique value for a chosen attribute. Sub-trees are added when instances do not completely follow a path to a terminal node.
Good

Best Suited For: Supervised learning where the training data represents the general population and the user knows which attributes differentiate instances between classes. Decision trees are also good for creating a set of production rules for further processing. (A short sketch follows after part 2b.)
Good

Difficulties With: Trees created from numeric data sets may become too complex when attributes take many distinct values, because splits on numeric attributes are binary.
Good

Issues/Limitations: Output values must be categorical, and multiple output attributes are not allowed. Slight variations in the training data can change the attribute selected at a branch (choice point) in the tree, and this choice affects all descendant subtrees.
Good

b. Association Rules

Description: "Affinity analysis" is the process of determining which things go together; association rule generation produces a set of rules to define these associations. Each rule has an associated confidence value to aid the user in analyzing the rules.
Good

Best Suited For: Market basket analysis, i.e., determining which attributes go together, with the resulting rules used to shape marketing strategy. Also, the user is not restricted to choosing a single dependent variable.
Good

Difficulties With: Attributes that are not discrete-valued; numeric attributes must first be converted to categories.
Good

Issues/Limitations: Discovered relationships may turn out to be trivial. With a high volume of data, the user may need to set a coverage criterion to weed out irrelevant associations, then adjust it up or down and repeat the mining process to get the desired quantity and quality of rules. (A sketch of coverage and confidence follows below.)
Good
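As promised in question 1e, a minimal sketch of eliminating all but one attribute from a set of highly correlated attributes, assuming a pandas DataFrame of numeric attributes and an arbitrary 0.9 threshold:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep only one attribute from each set of highly correlated attributes."""
    corr = df.corr().abs()
    to_drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        if a in to_drop:
            continue  # already marked redundant; don't use it to drop others
        for b in cols[i + 1:]:
            if b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)  # b is nearly redundant with a
    return df.drop(columns=list(to_drop))

# Toy data: x2 is almost a copy of x1, so it gets removed.
df = pd.DataFrame({"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8.1], "x3": [5, 1, 4, 2]})
print(drop_correlated(df).columns.tolist())  # ['x1', 'x3']
```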
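For the decision tree answer in 2a, a minimal sketch using scikit-learn (assuming it is installed); the toy attributes and labels are invented. The printed output shows how the learned splits read as production rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: two numeric input attributes, a categorical output.
X = [[25, 40000], [30, 60000], [45, 80000], [50, 30000]]
y = ["no", "yes", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned binary splits can be printed as production rules.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[40, 70000]]))
```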
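For the association rules in 2b, a minimal sketch computing the coverage (support) and confidence of one candidate rule over invented market baskets:

```python
# Toy market baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "butter"},
]

# Candidate rule: milk -> butter.
n = len(baskets)
milk = sum("milk" in b for b in baskets)
both = sum({"milk", "butter"} <= b for b in baskets)

support = both / n        # coverage criterion: how often the pair occurs at all
confidence = both / milk  # how often butter appears given milk is present
print(f"support={support:.2f} confidence={confidence:.2f}")
```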
c. K-Means Algorithm

Description: A statistical [unsupervised] clustering technique. The user sets the number of clusters; the algorithm selects that many instances at random to be the initial cluster centers; the remaining instances are then assigned to the closest cluster using the Euclidean distance formula. The algorithm next calculates a new mean for each cluster and repeats the process of assigning instances to clusters until the new means equal the previous means or a threshold has been met. (A sketch of this loop follows after part 2f.)
Good

Best Suited For: Classification tasks; unsupervised clustering of numerical data, to see how instances group together.
Good

Difficulties With: If the data forms clusters of unequal size, K-means will probably not find the optimal solution.
Good

Issues/Limitations: Categorical data is either ignored or must be converted to numerical form, and the user must be careful that the conversion does not inadvertently make one categorical value appear "closer" to another at the expense of a third (see data type conversion in problem 1). If a poor choice is made for the number of clusters, the data may not form significantly distinct clusters, and the user may have to run the algorithm several times to test different values for the number of clusters. Using several irrelevant attributes may cause less than optimal results. Finally, K-means does not explain the nature of the formed clusters; the user must interpret what has been found. (Supervised data mining tools can help here.)
Good

d. Linear Regression

Description: A supervised learning tool that models the dependent (output) variable as a linear combination of one or more independent (input) variables. The output of the model is an equation of the form

    dependent variable = a(1)x(1) + a(2)x(2) + ... + a(n)x(n) + c

where x(i) is an independent attribute, a(i) is its coefficient, and c is a constant determined by the algorithm.
Good

Best Suited For: Supervised learning on numeric data when the relationship between the dependent and independent variables is approximately linear.
Good

Difficulties With: Data that lacks linear relationships between the dependent and independent variables.
Good

Issues/Limitations: Categorical data must be converted to numerical form. Appropriate only if the data can be modeled with a straight-line function. Also, the values of the output variable are unbounded in both the positive and negative directions, which can be a problem when the observed outcome is restricted to two values (0 or 1).
Good

e. Logistic Regression

Description: A non-linear supervised learning technique that associates a probability score with each data instance. The model transforms the unbounded linear regression output described above into output values between 0 and 1. This is done using the base of natural logarithms, e: p = 1 / (1 + e^-(linear output)). (A sketch of this transform follows after part 2f.)
Good

Best Suited For: Producing the probability of the occurrence or non-occurrence of a measured event.
Good

Difficulties With: Data that lacks linear relationships between the dependent and independent variables.
Good

Issues/Limitations: Categorical data must be converted to numerical form. Appropriate only if the log odds of the outcome can be modeled as a linear function of the inputs.
Good

f. Bayes Classifier

Description: A supervised learning technique that assumes all input attributes are of equal importance and independent of one another. It produces the conditional probability of a hypothesis being true given the evidence (determined by the input attributes).
Good

Best Suited For: Testing hypotheses about an output given the input (evidence).
Good

Difficulties With: Numerical attribute values; the user must know ahead of time how the data is distributed.
Good

Issues/Limitations: When the count for an attribute value can equal 0 (e.g., there are no females), a small constant must be added to the equation so the zero count does not zero out the calculation or cause division by 0 (see the sketch below). When dealing with numerical attributes, the probability density function representing the distribution of the data must be known; values such as the class mean and standard deviation are then used to determine the conditional probability.
Good
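As noted in 2c, a minimal NumPy sketch of the K-means loop (random initial centers, Euclidean assignment, recomputed means until they stop changing); the points are invented, and the sketch assumes no cluster empties out during the loop:

```python
import numpy as np

def k_means(data: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Pick k instances at random as the initial cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    while True:
        # Assign each instance to the closest center (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster's mean (assumes every cluster is non-empty).
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the new means equal the previous means.
        if np.allclose(new_centers, centers):
            return labels
        centers = new_centers

points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
print(k_means(points, k=2))
```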
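For 2d and 2e together, a small sketch of how the unbounded linear output is transformed to a probability between 0 and 1 using the base of natural logarithms, e; the coefficients are invented:

```python
import numpy as np

# Linear model: y = a1*x1 + a2*x2 + c (coefficients are illustrative).
a, c = np.array([0.8, -1.5]), 0.3
x = np.array([2.0, 1.0])
linear_output = a @ x + c          # unbounded in both directions

# Logistic transform: maps any real value to a probability in (0, 1).
probability = 1.0 / (1.0 + np.exp(-linear_output))
print(linear_output, probability)
```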
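For the zero-count issue in 2f, a sketch of a conditional probability estimate with a small constant k folded in so a zero count cannot zero out or break the calculation; the counts, k, and prior are invented:

```python
# P(value | class) for a categorical attribute, softened by a small
# constant k so that a zero count (e.g., no females in the class)
# still yields a small nonzero probability.
def conditional_prob(value_count: int, class_count: int,
                     k: float = 1.0, prior: float = 0.5) -> float:
    return (value_count + k * prior) / (class_count + k)

print(conditional_prob(0, 10))  # zero count, but the estimate is not 0
print(conditional_prob(7, 10))
```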
g. Neural Networks

Description: A set of interconnected nodes designed to imitate the functioning of the human brain. There are one or more levels of nodes, and weights are assigned to each path between nodes. These weights are adjusted as errors between the desired and computed outcomes are propagated back through the network. Networks can be built for supervised learning or for unsupervised clustering.
Good

Best Suited For: Predicting numeric or continuous outcomes. They also handle data sets with large amounts of noisy data well, and they are good for applications that require a time element to be included in the data.
Good

Difficulties With: All input values must be numeric, so categorical data may require special handling.
Good

Issues/Limitations: Neural networks lack the ability to explain their behavior; they act like a black box in that inputs go in one end and outputs come out the other, but the processing inside the hidden layers is not visible. Algorithms exist that try to extract rules from neural networks (using the weighted links), but these have not been very successful. Classifiers such as neural networks do a better job with numerical values scaled to a range between 0 and 1, so the user may want to normalize, scale, and/or convert the data; when converting categorical data to numbers, however, watch that some values don't artificially appear "closer" to one value than another (described in question 1 under data type conversion). Neural network algorithms are not guaranteed to converge to an optimal solution; the user can address this by manipulating the learning parameters. Neural networks can also overtrain, so that they work well on training data but do poorly on test data (the user can monitor this by consistently measuring test set performance). (A minimal training sketch follows at the end of the exam.)
Good

h. Genetic Algorithms

Description: Genetic algorithms are based on Darwinian principles of natural selection. They can be developed for supervised learning and for unsupervised clustering. Basically, there is a fitness function: if an instance "passes" the fitness test, it remains in the set of elements; if it fails, it becomes a candidate for modification, e.g., crossover and/or mutation. The altered version is then passed back into the fitness function, and if it passes, it stays in the set of elements. The process repeats until a termination condition is satisfied. (A sketch of this loop follows at the end of the exam.)
Good

Best Suited For: Problems that are difficult to solve using conventional methods: scheduling problems, network routing problems, and financial marketing. The method is also useful when the data contains many irrelevant attributes, since these get eliminated by the fitness function.
Good

Difficulties With: Attribute values that are not suitable for genetic alteration. Fitness functions involving several calculations, which can be computationally expensive.
Good

Issues/Limitations: Do not replace rejected instances with instances that have already been selected; otherwise the solution will tend to be specialized instead of generalized. The solution may also be a local optimization rather than a global one; there is no guarantee that a solution is a global optimum. Transforming data into a form suitable for the genetic algorithm may take some work. The algorithms can explain themselves only to the extent that the fitness function is understandable.
Good

Very good overview!
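Returning to the neural network answer in 2g, a minimal single-neuron sketch of the core mechanism (path weights adjusted by propagating the error between desired and computed output back through the model); the task, data, and learning rate are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: learn OR from 0/1 inputs with a single sigmoid neuron.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

w = rng.normal(size=2)
b = 0.0
lr = 0.5  # learning rate (one of the tunable learning parameters)

for _ in range(2000):
    out = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass
    err = y - out                              # desired minus computed
    grad = err * out * (1.0 - out)             # sigmoid derivative term
    w += lr * X.T @ grad                       # adjust the path weights
    b += lr * grad.sum()

print(np.round(out, 2))  # close to [0, 1, 1, 1]
```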
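And for the genetic algorithm loop in 2h, a minimal sketch of the fitness test / crossover / mutation cycle with a fixed-generation termination condition; the bit-string encoding and fitness function are invented:

```python
import random

random.seed(1)
GENES = 8

def fitness(bits):          # invented fitness: count of 1-bits
    return sum(bits)

def crossover(a, b):        # single-point crossover
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(bits):           # flip one random bit
    i = random.randrange(GENES)
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

# Initial population of random bit strings.
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(10)]

for _ in range(50):                      # termination: fixed generation count
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:5]                  # instances that pass the fitness test
    # Failed instances are replaced by altered versions of survivors,
    # not by copies of already-selected instances (keeps the search general).
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(5)]
    pop = survivors + children

print(max(fitness(p) for p in pop))  # best fitness found
```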