
Total Score 130 out of 130
Karen Cegelski
Mid Term Week 5
CSIS 5420
July 26, 2005
Score 50 out of 50
1. (50 Points) Data pre-processing and conditioning is one of the key factors
that determine whether a data mining project will be a success. For each of the
following topics, describe the effect this issue can have on our data mining
session and what techniques we can use to counter this problem.
a. Noisy data
In a large database, many of the attribute values may be inexact
or incorrect. This may be attributed to the instruments measuring
the values or to human error when entering the data. Sometimes
some of the values in the training set are altered from what they
should have been. This may result in one or more entries in the
database conflicting with rules already established. The system
may then regard these extreme values as noise and either ignore
them or take them into account, possibly changing correct
patterns. The problem is that one never knows whether the extreme
values are correct, and the challenge is finding the best method
to handle these “weird” values. Good
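To make this concrete, one simple technique for spotting such "weird" values is to flag instances that fall several standard deviations from the attribute mean. The sketch below is a minimal illustration; the readings and the two-deviation cutoff are invented, and a flagged value is only a candidate for noise, not a proven error.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` standard
    deviations from the mean: noise candidates, not proven errors."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

readings = [9.8, 10.1, 10.0, 9.9, 42.0, 10.2]  # invented sensor data
print(flag_outliers(readings))                  # -> [4], the suspect entry
```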
b. Missing data
Data values can be missing because they were not measured, not
answered, or were unknown or lost. Data mining methods vary in
the way they treat missing values. The missing values can be
ignored, records containing missing data may be omitted, missing
values may be replaced with the mode or the mean, or an
inference can be made from the existing values. The data needs
to be formatted, sampled, adapted, and sometimes transformed
for the data mining algorithm in order to deal with the missing
data. Identifying variables that explain the cause of missing data
can help to mitigate the bias. Good
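As a minimal sketch of the mode/mean replacement mentioned above, the records and attribute names below are hypothetical; missing numeric values get the column mean and missing categorical values get the mode.

```python
from statistics import mean, mode

records = [{"age": 34,   "color": "red"},
           {"age": None, "color": "blue"},
           {"age": 28,   "color": None},
           {"age": 45,   "color": "red"}]

ages   = [r["age"]   for r in records if r["age"]   is not None]
colors = [r["color"] for r in records if r["color"] is not None]
age_fill, color_fill = mean(ages), mode(colors)  # mean for numeric, mode for categorical

for r in records:
    if r["age"] is None:
        r["age"] = age_fill      # the mean of the known ages
    if r["color"] is None:
        r["color"] = color_fill  # "red", the most common known color
```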
c. Data normalization and scaling
Data normalization is a common data transformation method that
involves changing numeric values so that they fall within a
specified range. A classifier such as a neural network works best
when the numerical data is scaled to a range between 0 and 1.
This method is appealing with distance-based classifiers, because
by normalizing the attribute values, attributes with a wide range of
values are less likely to outweigh attributes with smaller ranges.
Good
Decimal scaling divides each numerical value by the same power
of 10. If the values range between –1000 and 1000, the range can
be changed to between –1 and 1 by dividing each value by 1000. Good
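Both transformations are easy to state in code. The sketch below assumes simple lists of numbers; the sample values are invented.

```python
def min_max(values, lo=0.0, hi=1.0):
    """Rescale values linearly so they fall within [lo, hi]."""
    v_min, v_max = min(values), max(values)
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every value into [-1, 1]."""
    power = 1
    while max(abs(v) for v in values) / power > 1:
        power *= 10
    return [v / power for v in values]

print(min_max([10, 20, 40]))             # [0.0, 0.333..., 1.0]
print(decimal_scale([-1000, 250, 999]))  # [-1.0, 0.25, 0.999]
```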
d. Data type conversion
Some data mining tools, including neural networks and some
statistical methods, cannot process categorical data, so changing
categorical data to a numeric equivalent is a common data
transformation method. Conversely, some data mining techniques are
unable to process numeric data in its original form. Most decision
tree algorithms discretize numeric values by sorting the data and
considering alternative binary splits of the data items. Good
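One common way to give categorical data a numeric equivalent is one-hot encoding, where each distinct category becomes its own 0/1 indicator. A minimal sketch with invented attribute values:

```python
def one_hot(values):
    """Map each distinct category to its own 0/1 indicator column."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "blue", "red", "green"]
print(one_hot(colors))
# columns are (blue, green, red):
# [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```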
e. Attribute and instance selection
The ability to deal with large volumes of data varies among
classifiers. Some data mining algorithms are unable to analyze
data containing too few attributes, while others have trouble
with a large number of instances. Differentiating between
relevant and irrelevant attributes is another problem for some
data mining algorithms, and the number of irrelevant attributes
directly affects the number of training instances needed to build
an accurate supervised learner model. To overcome these problems,
we must make decisions about which attributes and instances to
use when building the data mining models. The difficulty is that
attribute and instance selection is itself complex: generating
and testing all possible models for any dataset containing more
than a few attributes is not possible.
One technique that can be used is to eliminate attributes. With
some classifiers such as neural networks and nearest neighbor
classifiers, attribute selection needs to take place before the data
mining process can begin. Some steps that can be taken to assist
in determining which attributes to eliminate are:
1) Input attributes highly correlated with other input attributes
are redundant. The data mining tool will build a better model
when only one attribute from a set of highly correlated
attributes is designated as an input value (see the sketch after
this list).
2) When using categorical data, any attribute containing a value
v with a domain predictability score greater than a chosen
threshold can be considered for elimination.
3) When using supervised learning, numerical attribute
significance can be determined by comparing class mean and
standard deviation scores. Good
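As a minimal sketch of step 1, the Pearson correlation between two input attributes can flag redundant pairs; the attribute values and the 0.9 cutoff below are invented.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two attribute columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

a = [1.0, 2.0, 3.0, 4.0]
b = [2.1, 3.9, 6.2, 8.0]  # nearly 2*a, so highly correlated with a
if abs(pearson(a, b)) > 0.9:
    print("a and b are redundant; designate only one as an input attribute")
```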
Attributes with little predictive power can sometimes be combined
with other attributes to create new attributes with a high degree
of predictive capability. A few transformations commonly used to
create new attributes (sketched in code below) are:
1) Create a new attribute where each value represents a ratio
of the value of one attribute divided by the value of a
second attribute.
2) Create a new attribute whose values are differences
between the values of two existing attributes.
3) Create a new attribute with values computed as the percent
increase or decrease of two current attributes. Good
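A small sketch of the three transformations just listed, using two hypothetical attributes (cost and revenue) per instance:

```python
instances = [{"cost": 80.0, "revenue": 100.0},
             {"cost": 50.0, "revenue": 60.0}]

for x in instances:
    x["ratio"] = x["revenue"] / x["cost"]                           # 1) ratio
    x["margin"] = x["revenue"] - x["cost"]                          # 2) difference
    x["pct_change"] = (x["revenue"] - x["cost"]) / x["cost"] * 100  # 3) percent change
```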
During the training phase of supervised learning, the data is
often randomly chosen from a pool of instances, and instances are
selected to guarantee representation from each concept class.
Instance typicality scores can be used to choose the best set of
representative instances from each class. Classification accuracy
is best achieved when the training sets contain an overweighted
selection of highly and moderately typical training instances. Good
Score 80 out of 80
2. (80 points) We've covered several data mining techniques in this course. For
each technique identified below, describe the technique, identify which
problems it is best suited for, identify which problems it has difficulties with, and
describe any issues or limitations of the technique.
a. Decision trees
Decision trees are the most popular structure for supervised data
mining. A common algorithm for building a decision tree selects a
subset of instances from the training data to construct an
initial tree; the accuracy of the tree is then tested on the
remaining training instances. Incorrectly classified instances
are added to the current set of training data and the process is
repeated. Minimizing the number of tree levels and tree nodes
maximizes data generalization. Decision trees are easy to
understand, map nicely to a set of production rules, and can be
successfully applied to real problems. They make no prior
assumptions about the data and can be used with datasets
containing both numerical and categorical data. Good
All data mining algorithms have some issues, and decision trees
are no exception. Output attributes must be categorical, and
multiple output attributes are not permitted. Slight variations
in the training data can result in different attribute selections
at each choice point within the tree, making the decision tree
algorithm unstable. Trees created from numeric datasets can be
complex, as attribute splits for numeric data are typically
binary. Good
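To illustrate what happens at each choice point, the sketch below scores one candidate categorical split by information gain, the entropy-based measure many tree algorithms use; the four-instance weather dataset is invented.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Drop in class entropy achieved by splitting rows on attr."""
    gain = entropy([r[target] for r in rows])
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

data = [{"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "no"},
        {"outlook": "rain",  "play": "yes"},
        {"outlook": "rain",  "play": "yes"}]
print(information_gain(data, "outlook", "play"))  # 1.0: a perfect split
```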
b. Association rules
Association rules assist in finding relationships in large databases.
They are unlike traditional production rules in that an attribute that
is a precondition in one rule may appear as a consequent in
another rule. Association rule generators allow for the consequent
of a rule to contain one or several attribute values. Because
association rules are more complex, special techniques are
required in order to generate them efficiently. Good Using rule
confidence and support assists in discovering which associations
are likely to be interesting from a marketing perspective. Good
Caution should be exercised in the interpretation of association
rules because some discovered relationships might turn out to be
trivial. Good
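Rule support and confidence are simple ratios, as this sketch over hypothetical market-basket transactions shows:

```python
transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"bread"}, {"milk", "eggs"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction with the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: milk -> bread
print(support({"milk", "bread"}))       # 0.5  (2 of 4 transactions)
print(confidence({"milk"}, {"bread"}))  # 0.666... (2 of 3 milk baskets)
```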
c. K-Means algorithm
The K-Means algorithm is a statistical unsupervised clustering
technique. It requires all numeric input attributes, and the user
must decide how many clusters are to be discovered. The algorithm
starts by randomly choosing one data point to represent each
cluster. Next, each data instance is placed in the cluster to
which it is most similar. The process continues developing new
cluster centers until the cluster centers do not change. The
K-Means algorithm is easy to implement and understand. It is,
however, not guaranteed to converge to a globally optimal
solution. Good It lacks the ability to explain what has been
found and cannot tell which attributes are significant in
determining the formed clusters. Even with these limitations, the
K-Means algorithm is among the most widely used clustering
techniques. Good
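The loop just described is short enough to sketch directly. This toy version handles one numeric attribute and invented data; real implementations use multiple dimensions and more careful seeding.

```python
import random

def k_means(points, k, iterations=100):
    centers = random.sample(points, k)  # one random data point per cluster
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # place each instance in the most similar (closest) cluster
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]  # recompute cluster centers
        if new == centers:                       # stop once centers do not change
            return new
        centers = new
    return centers

print(sorted(k_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2)))  # ~[1.0, 9.5]
```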
d. Linear regression
Linear regression is a favorite for estimation and prediction
problems. It attempts to model the variation in a dependent
variable as a linear combination of one or more independent
variables. Good It is an appropriate data mining strategy when the
relationship between the dependent and independent variables is
nearly linear. It is a poor choice when the outcome is binary.
Good The value restriction placed on the dependent variable is
not observed by the regression equation. This happens because
linear regression produces a straight-line function and the values
of the dependent variable are unbounded in both the positive and
negative direction. Good
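For a single independent variable, the least-squares line has a closed form, sketched here on made-up data:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

xs, ys = [1, 2, 3, 4], [2.1, 4.0, 5.9, 8.1]
b0, b1 = fit_line(xs, ys)
print(round(b0, 2), round(b1, 2))  # ~0.05 and ~1.99: nearly y = 2x
```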
e. Logistic regression
Logistic regression is a good choice for problems having a binary
outcome. Good It is a nonlinear regression technique that
associates a conditional probability value with each data instance.
A created regression equation limits the values of the output
attribute to values between 0 and 1. Doing this allows the output
values to represent a probability of class membership. Good
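The key idea fits in a few lines: a linear score is pushed through the logistic (sigmoid) function so the output lands in (0, 1). The weights below are invented, standing in for coefficients a real fit would estimate.

```python
from math import exp

def predict(x, w0=-4.0, w1=2.0):
    """Conditional probability of class 1 given x, via the logistic function."""
    return 1 / (1 + exp(-(w0 + w1 * x)))

print(predict(1.0))  # ~0.12: probably class 0
print(predict(3.0))  # ~0.88: probably class 1
```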
f. Bayes classifier
Bayes classifier is a simple and powerful supervised classification
technique. All input attributes are assumed to be of equal
importance and independent of one another. Good Bayes
classifier works well in practice even if these assumptions are
likely to be false. It can be applied to datasets that contain both
categorical and numeric data. Unlike many statistical classifiers,
Bayes classifier can be applied to datasets containing missing
information. Good
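A minimal naive Bayes sketch for categorical data follows; the tiny training set is invented, and each attribute is treated as independent given the class, exactly the assumption described above.

```python
from collections import Counter

train = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
         ({"outlook": "rain"}, "yes"), ({"outlook": "rain"}, "yes"),
         ({"outlook": "sunny"}, "yes")]

def classify(instance):
    classes = Counter(label for _, label in train)
    best, best_p = None, -1.0
    for c, n in classes.items():
        p = n / len(train)                    # prior P(class)
        for attr, value in instance.items():  # times P(value | class) per attribute
            matches = sum(1 for x, l in train if l == c and x.get(attr) == value)
            p *= matches / n
        if p > best_p:
            best, best_p = c, p
    return best

print(classify({"outlook": "sunny"}))  # "no": 2/5 * 2/2 beats 3/5 * 1/3
```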
g. Neural networks
A neural network is a set of interconnected nodes designed to
imitate the function of the human brain. Neural networks are
quite popular in the data mining community, as they have been
successfully applied to problems across several disciplines. They
can be constructed for both supervised learning and unsupervised
clustering. The
input values must always be numeric. Good The first phase of the
neural network is known as the learning phase. During this phase,
the input values associated with each instance enter the network
at the input layer. For each input attribute contained in the data
there is one input layer node. Using the input values together
with the network connection weights, the neural network computes
the output for each instance. The output for each instance is then
compared with the desired network output. Training is completed
after a certain number of iterations or when the network
converges to a predetermined minimum error rate. Good During
the second phase, the network weights are fixed and the network
is then used to compute output values for the new instances.
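The forward computation in the second phase can be sketched compactly. The weights below are invented; in a real network they would have been set during the learning phase.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def forward(inputs, hidden_w, output_w):
    # one input-layer node per attribute; each hidden node combines the
    # input values with its connection weights, then the output node does
    # the same with the hidden activations
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in hidden_w]
    return sigmoid(sum(w * h for w, h in zip(output_w, hidden)))

x = [0.4, 0.9]                         # one instance, already scaled to [0, 1]
hidden_w = [[0.5, -0.3], [0.8, 0.2]]   # two hidden nodes, two weights each
output_w = [1.2, -0.7]
print(forward(x, hidden_w, output_w))  # the network's computed output
```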
h. Genetic algorithms
Genetic algorithms apply the theory of evolution to inductive
learning. They can be supervised or unsupervised and are generally
used for problems that cannot be solved using traditional
techniques. Good The standard method applies a fitness function
to a set of data elements in order to determine which elements
survive from one generation to the next. New instances are
created from the surviving elements to replace those that are
deleted, as the sketch below illustrates. This technique can be
used in conjunction with
other learning techniques. Good
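A toy sketch of one generation of this method: score elements with a fitness function, keep the fittest half, and fill the gap with crossover children built from the survivors. The bit-string encoding and the count-of-ones fitness are invented for illustration; real applications add mutation.

```python
import random

def fitness(element):
    return sum(element)  # more 1-bits -> fitter (an invented objective)

def next_generation(population):
    population.sort(key=fitness, reverse=True)
    survivors = population[: len(population) // 2]  # fittest half survive
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = random.sample(survivors, 2)
        cut = random.randrange(1, len(a))           # single-point crossover
        children.append(a[:cut] + b[cut:])
    return survivors + children

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(6)]
for _ in range(20):
    pop = next_generation(pop)
print(max(fitness(e) for e in pop))  # climbs toward 8 (may stall without mutation)
```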
Very good overview!