
Total Score 130 out of 130
Nancy McDonald
CSIS 5420
Final Exam: Week 9
Score 50 out of 50
1. Describe the effect the following have on data mining and discuss what can be
done to counter the problem.
a. Noise: random errors in attribute values
Concerns: duplicate records, invalid attribute values
Duplicate records can skew the results of a data mining operation because one value
may appear more than once, giving it an unfair weight in determining the results of a
mining session. Duplicates are also a problem when the mining session produces a list of
results for the user to process and that processing costs money: the user then incurs the
unwarranted expense of the duplicate entries. Automated tools should filter out duplicate
records when the data is moved from the operational environment to the warehouse. Good
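For illustration, a minimal sketch of such a filter, assuming pandas and an invented customer table (neither the library nor the column names come from the answer above):
```python
import pandas as pd

# Hypothetical operational extract; column names are illustrative only.
records = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "salary": [52000, 48000, 48000, 61000],
})

# Drop duplicate rows before loading the warehouse, keeping the first
# occurrence so each entity contributes exactly one record to the mining set.
deduped = records.drop_duplicates(subset="customer_id", keep="first")
print(deduped)
```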
Invalid attribute values may cause the data mining process to ignore the attribute
value for an entry, so the entry is not categorized correctly. This may lead to data that
does not accurately represent the population to be mined. Invalid attribute values should
also be addressed in the preprocessing stage, before the data is stored in a warehouse.
This can be done with simple scrubbing routines coded to look for invalid value types or
values out of range. Data mining tools can handle errors in categorical data by checking
frequency values or predictability scores for attribute values; those with a score near
zero may be error candidates. Class mean and standard deviation scores can also flag
error candidates. Good
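A sketch of a simple scrubbing routine of this kind; the column names, legal range, and legal category set are all invented for illustration:
```python
import pandas as pd

data = pd.DataFrame({
    "age": [34, 210, 45],                    # 210 is out of range
    "hair_color": ["blonde", "blu", "red"],  # "blu" is an invalid value
})

valid_colors = {"blonde", "brunette", "red"}

# Flag values outside the legal range or outside the legal category set.
bad_age = ~data["age"].between(1, 120)
bad_color = ~data["hair_color"].isin(valid_colors)
print(data[bad_age | bad_color])   # error candidates for review or repair
```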
b. Missing Data
Concerns: The problem with missing data is that crucial information for a data entry is
lost and therefore unaccounted for in the final analysis. Again, the data may not
accurately reflect the population if missing values are not counted in the final results.
Missing data may also carry meaning in itself: a missing salary may mean the person is
unemployed, but if the data mining tool is not set up to handle this case, the person's
entry is misrepresented. Good
Missing data can be handled during preprocessing by the following (the first two
strategies are sketched in code after these lists):
1. Discarding records with missing values – good if this is a small percentage of
the total instances.
2. Replace missing values with the class mean for real-valued data – reasonable
for numerical data.
3. Replace missing attribute value with values found within other highly similar
instances – can be used for categorical or real-valued data. Good
Missing data can be handled during data mining by
1. Ignore missing values (neural networks, Bayes classifier)
2. Treat missing values as equal comparisons (not good with very noisy data –
dissimilar instances may appear alike)
3. Treat missing values as unequal comparisons (ESX – a pessimistic approach).
Good
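A minimal sketch of the first two preprocessing strategies, assuming pandas and an invented salary table; the third strategy (substituting values from highly similar instances) needs a distance measure and is omitted here:
```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "salary": [50000.0, None, 42000.0, None],
})

# 1. Discard records with missing values (fine if they are a small fraction).
dropped = df.dropna()

# 2. Replace missing real values with the class mean.
df["salary"] = df.groupby("class")["salary"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```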
c. Data Normalization and Scaling
Normalization: changing numeric values so that they fall within a specified range.
[Decimal] Scaling: dividing all numbers by the same power of 10.
Concerns: Why do this?
Classifiers such as neural networks do a better job with numerical values scaled to a
range between 0 and 1. Also, distance-based classifiers do better with normalization so
that attributes with a wide range of values are less likely to outweigh attributes with
smaller ranges. Good
Techniques:
Decimal scaling: divide each numerical value by the same power of 10. Good
Min-Max normalization (used when the maximum and minimum values are known): puts
attribute values between 0 and 1, with 0 equal to the minimum and 1 equal to the
maximum. Good
Normalization using Z scores: convert each value to a standard score using the attribute
mean and the attribute standard deviation; good when the maximum and minimum are unknown.
Good
Logarithmic Normalization: replaces each attribute value with its logarithm, which
scales down the range of values without loss of information (i.e., without loss of
precision). Good
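For illustration, a sketch of the four techniques on an invented array, using NumPy (a library choice assumed here, not named in the answer):
```python
import numpy as np

x = np.array([120.0, 455.0, 980.0, 310.0])   # invented attribute values

# Decimal scaling: divide by a power of 10 (commonly the smallest one that
# maps every value into [-1, 1]; here 10^3).
decimal_scaled = x / 10 ** int(np.ceil(np.log10(np.abs(x).max())))

# Min-max normalization: minimum maps to 0, maximum maps to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: useful when max and min are unknown.
z_scores = (x - x.mean()) / x.std()

# Logarithmic normalization: compresses the range without losing information
# (the original values can be recovered by exponentiating).
log_scaled = np.log(x)
```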
d. Data Type Conversion
Why?
Convert categorical data to numeric because some data mining tools (e.g., neural
networks) cannot process categorical data. Also, some data mining tools (e.g., decision
tree algorithms) cannot process numeric data in its original form. Good
How?
Can convert numeric data to a categorical range. For example, convert discrete ages
ranging from 1 to 99 to categories: under 10, under 20, under 30, etc., so that the user
can view results grouped by age. Good
Convert categorical data to a number: e.g., if hair color = blonde then attribute value = 1,
if hair color = brunette then attribute value = 2, if hair color = red then attribute value = 3.
A problem with this conversion is that brunette may appear closer to blonde than red does,
only because it is assigned a number (2) closer to blonde's number (1). Good
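Sketches of both conversions, using pandas (an assumed library choice); the one-hot encoding shown for hair color is one common way to sidestep the artificial "closeness" problem just described, added here for illustration rather than taken from the answer:
```python
import pandas as pd

# Numeric -> categorical: bin discrete ages into labeled ranges.
ages = pd.Series([7, 19, 26, 84])
age_groups = pd.cut(ages, bins=[0, 10, 20, 30, 100], right=False,
                    labels=["under 10", "under 20", "under 30", "30 and over"])

# Categorical -> numeric: one-hot encoding gives each color its own 0/1
# column, so no color is artificially "closer" to another.
hair = pd.DataFrame({"hair_color": ["blonde", "brunette", "red"]})
encoded = pd.get_dummies(hair, columns=["hair_color"])
print(age_groups, encoded, sep="\n")
```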
e. Attribute and Instance Selection
Concerns:
Some data mining algorithms have trouble with a large number of instances. Other
algorithms have problems analyzing data containing more than a few attributes.
Many algorithms are unable to differentiate between relevant and irrelevant attributes.
This poses a problem because the number of training instances needed to build an
accurate supervised learner model is directly affected by the number of irrelevant
attributes in the data. For example, neural networks and nearest neighbor algorithms give
equal weight to all attributes during model building. Good
Techniques:
Instance Selection: instead of randomly selecting data used for the training phase of
supervised learning, instance-based classifiers save a subset of representative instances
from each class. A new instance is classified by comparing its attributes to the values of
saved instances. The accuracy of this method depends upon how well the chosen,
“saved” instances represent each class. Instance typicality scores are used to choose the
best set of representative instances. One can use this technique with unsupervised
clustering also by computing a typicality score for each instance relative to all domain
instances. Clusters are improved (well-defined) by eliminating the most atypical domain
instances. Once well-defined clusters have been formed, these atypical instances can be
presented and the model will either form new clusters or place them in existing clusters.
Good
Attribute Selection:
Some classifiers have attribute selection techniques included as part of the model-building
process (and are thus less likely to suffer from the effects of attributes with little
predictive value). For other processes, one can do the following:
Eliminate attributes that are not predictive of class membership by eliminating all
but one attribute from a set of highly correlated attributes (i.e., remove the redundant
ones; sketched in code after this list). Good
If a value for a categorical attribute exceeds a domain predictability threshold, this
means that most domain instances will have this value and therefore the attribute doesn’t
help differentiate between instances. So eliminate this attribute. Good
Eliminate attributes that have low attribute significance scores. Good
Create new attributes that are a meaningful combination of existing attributes
(more meaningful than the original attributes were by themselves, e.g., ratios, differences,
percent increase/decrease.) Good
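A sketch of the correlation-based elimination from the first item above; the 0.95 threshold, the column names, and the data are invented for illustration:
```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 170, 180, 175],
    "height_in": [63, 67, 71, 69],           # nearly redundant with height_cm
    "salary":    [40000, 61000, 45000, 58000],
})

# Keep only one attribute from each highly correlated pair.
corr = df.corr().abs()
to_drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.95 and b not in to_drop:
            to_drop.add(b)
reduced = df.drop(columns=sorted(to_drop))
print(reduced.columns.tolist())   # height_in is eliminated
```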
Score 80 out of 80
2. Describe the following data mining techniques, identify problems for which it is
best suited, identify problems with which it has difficulties and describe issues or
limitations.
a. Decision trees
Description: A supervised learning technique in which a decision tree is built from
training data and then fine-tuned. Each branch (or link) represents a unique value for a
chosen attribute. Sub-trees are added when instances don't completely follow a path to
the end of a tree node. Good
Best Suited For: Supervised learning in which you have a set of training data that
represents the general population, and where the user knows what attributes differentiate
instances between classes. Decision trees are good for creating a set of production rules
for further processing. Good
Difficulties With: Trees created from numeric data sets may become overly complex when
attributes have many distinct values, because the splits in the tree are only binary. Good
Issues/Limitations: Output values must be categorical.
Multiple output attributes are not allowed.
Slight variations in training data can result in different selections at a branch (choice
point) in the tree. This attribute choice can affect all descendent subtrees. Good
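For illustration, a minimal sketch using scikit-learn (an assumed library choice, not named in the answer) that builds a tree from invented data and dumps it as a set of production rules:
```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [income_bracket, has_credit_card] -> buys (yes/no).
X = [[0, 1], [1, 1], [2, 0], [2, 1], [0, 0]]
y = ["no", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The tree can be dumped as a set of production rules for further processing.
print(export_text(tree, feature_names=["income_bracket", "has_credit_card"]))
```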
b. Association Rules
Description: “Affinity Analysis” which is the process of determining which things go
together and then generating a set of rules to define these associations. Each rule has an
associated confidence value to aid the user in analyzing the rules. Good
Best Suited For: Market basket analysis - determining which attributes go together. The
association rules are used to determine marketing strategy. Also, the user is not restricted
by having to choose a single dependent variable. Good
Difficulties With: Discrete-valued attributes. Good
Issues/Limitations:
Discovered relationships may turn out to be trivial.
If there is a high volume of data, the user may need to set a coverage criterion to weed
out irrelevant associations. This criterion must then be adjusted up/down and the mining
process repeated to get the desired amount/quality of association rules. Good
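A sketch of the confidence and coverage (support) computations behind such rules, on invented market-basket data:
```python
# Invented transactions for a market-basket example.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item (the coverage criterion)."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Rule: milk -> bread.  Confidence = support(milk & bread) / support(milk).
antecedent, consequent = {"milk"}, {"bread"}
conf = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f} confidence={conf:.2f}")
```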
c. K-Means Algorithm
Description: A statistical [unsupervised] clustering technique in which the user sets the
number of clusters; the algorithm selects that many instances at random to be the initial
cluster centers; then the remaining instances are assigned to the closest cluster using
the Euclidean distance formula. The algorithm then calculates a new mean for each cluster
and repeats the process of assigning instances to clusters until the new means equal the
previous means or until a threshold has been met. Good
Best Suited For: Classification tasks.
Unsupervised clustering of numerical data – looking to see how instances group together.
Good
Difficulties With: If the data creates clusters of unequal size, K-means will probably not
find the optimal solution. Good
Issues/Limitations: Categorical data is either ignored or must be converted to numerical
form. The user must be careful when converting so as not to inadvertently make one
categorical value appear “closer” to another value at the expense of a third value. (see
data type conversion in problem 1).
If a poor choice is made for the number of clusters to be formed, the data may not form
significantly distinct clusters. The user may have to run the algorithm several times to
test different values for the number of clusters.
Using several irrelevant attributes may cause less than optimal results.
K-means doesn’t explain the nature of the formed clusters and the user must interpret
what has been found. (Supervised data mining tools can help here…) Good
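A sketch of the loop described above in plain NumPy, with invented 2-D data; a library such as scikit-learn's KMeans would do the same job:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # invented numeric instances
k = 3
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers

for _ in range(100):
    # Assign each instance to the closest center (Euclidean distance).
    labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
    # Recompute each cluster mean (assumes no cluster empties out; fine for
    # a sketch).
    new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centers, centers):   # stop when the means stop changing
        break
    centers = new_centers
```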
d. Linear Regression
Description: Supervised learning tool that models the dependent (output) variable as a
linear combination of one or more independent (input) variables. The output of the model
can be used to create an equation of the form: dependent variable = a(1)x(1) + a(2)x(2) +
… + a(n)x(n) + c, where x(i) is an independent attribute, a(i) is its coefficient, and c
is a constant determined by the algorithm. Good
Best Suited For: Supervised learning of numeric data when the relationship between the
dependent and independent variables is approximately linear. Good
Difficulties With: Data that lacks linear relationships between the dependent and
independent variables. Good
Issues/Limitations:
Categorical data must be converted to numerical.
Appropriate only if data can be modeled with a straight-line function.
Also, the values of the output variable are unbounded in both the positive and negative
directions, which can cause a problem if the observed outcome is restricted to two values
(0 or 1). Good
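For illustration, a sketch of fitting the coefficients a(i) and the constant c by least squares on invented data (NumPy assumed):
```python
import numpy as np

# Invented data: y is roughly 2*x1 + 3*x2 + 5, plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 5 + rng.normal(scale=0.1, size=50)

# Append a column of ones so the constant c is fit alongside a(1), a(2).
A = np.hstack([X, np.ones((50, 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # approximately [2, 3, 5]
```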
e. Logistic Regression
Description: A non-linear supervised learning technique that associates a probability
score with each data instance. The model transforms the unbounded output values of the
linear regression described above into output values between 0 and 1. This is done using
the base of the natural logarithm, e. Good
Best Suited For: Producing the probability of the occurrence or non-occurrence of a
measured event. Good
Difficulties With: Data that lacks linear relationships between the dependent and
independent variables. Good
Issues/Limitations:
Categorical data must be converted to numerical.
Appropriate only if data can be modeled with a straight-line function. Good
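A minimal sketch of the transformation: the unbounded linear output is mapped into (0, 1) by the logistic function, which is built on e:
```python
import math

def logistic(linear_output):
    """Map an unbounded linear-regression output to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_output))

for z in (-5.0, 0.0, 5.0):
    print(z, round(logistic(z), 3))   # -> 0.007, 0.5, 0.993
```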
f. Bayes Classifier
Description: A supervised learning technique that assumes all input values are of equal
importance and independent of one another. It produces the conditional probability of a
hypothesis being true given the evidence (determined by the input attributes). Good
Best Suited For: Testing hypothesis about an output given the input (evidence). Good
Difficulties With: Numerical attribute values – the user must know how the data is
distributed ahead of time. Good
Issues/Limitations:
When the count for an attribute value equals 0 (e.g., there are no females), you must
use a small constant in the equation to ensure that you don't try to divide by 0.
When dealing with numerical attributes, you must know the probability density function
representing the distribution of data. Then you use values such as class mean and
standard deviation to determine the conditional probability. Good
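A sketch of the zero-count fix: a small constant k (here 1) is added to each count so that no conditional probability collapses to 0; the training data and attribute names are invented:
```python
from collections import Counter

# Invented categorical training data: (sex, class_label).
train = [("male", "buys"), ("male", "buys"), ("male", "no"), ("female", "no")]

def cond_prob(value, label, k=1.0):
    """P(sex=value | class=label), smoothed with a small constant k."""
    in_class = [s for s, c in train if c == label]
    counts = Counter(in_class)
    n_values = 2                     # number of possible sex values
    return (counts[value] + k) / (len(in_class) + k * n_values)

# Without k this would be 0/2 = 0 (no females in class "buys").
print(cond_prob("female", "buys"))   # (0 + 1) / (2 + 2) = 0.25
```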
g. Neural Networks
Description: A set of interconnected nodes designed to imitate the functioning of the
human brain. There are one or more levels of nodes and there are weights assigned to
each path to these nodes. These weights are adjusted when errors between the desired and
computed outcomes are propagated back through the network. These networks can be
built for supervised learning or for unsupervised clustering. Good
Best Suited For: Predicting numeric or continuous outcomes. They also handle data sets
with large amounts of noisy data well.
Neural networks are good for handling applications that require a time element to be
included in the data. Good
Difficulties With: All input values must be numeric therefore categorical data may
require special handling. Good
Issues/Limitations:
Neural networks lack the ability to explain their behavior: they act like a black box, in
that inputs go in one end and outputs come out the other, but we don't see the process
inside the hidden layers. There are algorithms that try to create rules from neural
networks (by using the weighted links), but these have not been very successful.
Classifiers such as neural networks do a better job with numerical values scaled to a
range between 0 and 1, therefore the user may want to normalize, scale and/or convert
data. However, you have to watch when converting categorical data to numerical that
some values don’t appear “closer” to one than another. (Described in question 1 under
data type conversion).
Neural network algorithms are not guaranteed to converge to an optimal solution – the
user can deal with this by manipulating learning parameters.
Neural networks can overtrain so that they work well on training data but do poorly on
test data. (The user can monitor this by consistently measuring test set performance.)
Good
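For illustration, a minimal sketch with scikit-learn's MLPClassifier (an assumed library choice, not named in the answer): inputs are scaled to [0, 1] first, and train versus test accuracy is compared to watch for overtraining:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)          # scale inputs to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
# Comparing train vs. test accuracy helps detect overtraining.
print(net.score(X_tr, y_tr), net.score(X_te, y_te))
```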
h. Genetic Algorithms
Description: Genetic algorithms are based upon Darwinian principles of natural selection.
They can be developed for supervised learning and unsupervised clustering.
Basically, there is a fitness function. If an instance “passes” the fitness test, it remains in
the set of elements. If it fails, it then becomes a candidate for modification, e.g.
crossover and/or mutation. This altered version is then passed back into the fitness
function. If the modified instance passes the function, it stays in the set of elements. The
process repeats until a termination condition is satisfied. Good
Best Suited For: Problems that are difficult to solve using conventional methods:
scheduling problems, network routing problems, and financial marketing.
This method is also useful if the data contains a lot of irrelevant attributes – these will get
eliminated by the fitness function. Good
Difficulties With: Attribute values that are not suitable for genetic altering.
Fitness functions with several calculations – this can be computationally expensive.
Good
Issues/Limitations:
Don’t replace rejected instances with instances that have already been selected, otherwise
your solution will tend to be specialized instead of generalized.
It is possible that the solution will be a local optimization as opposed to a global
optimization - there is no guarantee that a solution is a global optimization.
Transforming data to a form suitable for the genetic algorithm may take some work.
The algorithms can explain themselves only to the point where the fitness function is
understandable. Good
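A skeleton of the loop described above, with an invented fitness function (maximize the number of 1-bits in a string); the population size, rates, and termination condition are illustrative choices:
```python
import random

random.seed(0)
TARGET_LEN = 12

def fitness(bits):
    return sum(bits)                 # invented fitness: count of 1-bits

def mutate(bits):
    bits = bits[:]
    bits[random.randrange(len(bits))] ^= 1   # flip one bit
    return bits

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(TARGET_LEN)]
              for _ in range(20)]

for generation in range(200):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == TARGET_LEN:   # termination condition
        break
    survivors = population[:10]                # instances that pass the test
    # Failing instances are replaced by crossed-over and mutated versions,
    # which are then passed back through the fitness test next generation.
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]
    population = survivors + children

print(generation, population[0])
```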
Very good overview!