Hammond - DATA MINING ASSIGNMENT

Total Score 114 out of 120
DATA MINING ASSIGNMENT – WEEK 5
Score 20 out of 20
1. For any test data to be correctly classified, the training dataset of the supervised
learner model should be randomly selected via stratified sampling so as to result
in a sample with a distribution of classes that is representative of the distribution
in the entire dataset. An examination of the training instance typicality scores
identifies the most typical instances whose attribute composition and values can
be compared to that of the overall population to determine if the training instances
are representative of it. Good
Training set data must be proportionate to the test set data: models built with
training data that does not represent the set of all possible domain instances can
lead to misclassified test data (215).
Redundant attributes must be addressed: redundant attributes can certainly wreak
havoc upon output results (228).
Missing values must be addressed: in the data preprocessing phase decisions
need to be made as to how to handle missing or null attribute values on an entry
by entry level (155).
Input must be meaningful and error minimized: we have a variety of error checking
and process simplifying mechanisms at our disposal (129).
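The stratified sampling described in answer 1 can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `stratified_sample`, the 90/10 class split, and the 50% sampling fraction are all assumptions made up for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical dataset: 90 "no" instances and 10 "yes" instances.
data = [("no", i) for i in range(90)] + [("yes", i) for i in range(10)]
train = stratified_sample(data, label_of=lambda r: r[0], fraction=0.5)
counts = Counter(label for label, _ in train)
```

Because each class is sampled on its own, the 9-to-1 class ratio of the full dataset carries over to the training sample.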
Score 20 out of 20
2. It is important to tune the number of clusters that the unsupervised clustering
model partitions the training data into so that the number matches as closely as
possible the natural structure of the data from which the test data will be drawn.
As a result, the accuracy of the results generated by the model will be optimized
as the clustering algorithm attempts to identify clusters such that the between-cluster variation is large compared to the within-cluster variation. Good. The
main goal of unsupervised clustering is to create K clusters where
the entries within each cluster are very similar, but the clusters are
different from one another.
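One way to make the between-cluster versus within-cluster comparison concrete is to compute both as sums of squared distances. The sketch below assumes squared Euclidean distance and uses made-up 2-D points; the helper names are hypothetical.

```python
def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variation(clusters):
    """Return (between, within) sums of squares for a list of clusters."""
    all_points = [p for c in clusters for p in c]
    grand = mean(all_points)
    # Within: distance of each point to its own cluster center.
    within = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    # Between: distance of each cluster center to the overall center,
    # weighted by cluster size.
    between = sum(len(c) * sq_dist(mean(c), grand) for c in clusters)
    return between, within

# Hypothetical data with two obvious groups.
a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
b = [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
good_between, good_within = variation([a, b])
# A poor partition that mixes the two groups.
bad_between, bad_within = variation([a[:2] + b[:2], a[2:] + b[2:]])
```

The partition that follows the natural structure of the data yields a far larger between-to-within ratio than the mixed one, which is the property a clustering algorithm tries to achieve.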
Score 16 out of 20
3. For optimal performance, the K-Means algorithm requires the data to be
normalized to standardize the scale of effect each attribute has on the results so
that no particular attribute or subset of attributes dominates the analysis. Min-max
normalization transforms all attribute values to fall within a range from 0 to 1.
Z-score standardization instead rescales each attribute to a mean of 0 and a
standard deviation of 1, and can also be used to detect outliers.
A. The number of clusters must be estimated in Step 1 of the algorithm
and this is not always possible.
B. The initial cluster centers are picked at random and most likely will be
suboptimal.
C. The ability to try and judge alternatives is severely hampered as the
number of points and pairs increases.
D. Our choice of initial cluster centers will be reflected in our final cluster
centers.
E. Even outside of the problems of center choice and combinatorial
explosion we see that optimal solutions require alternative computations,
which may be impractical.
F. The K-Means algorithm “only works with real valued data”.
G. The K-means algorithm tends to work “best when the clusters that exist
in the data are of approximately equal size”.
H. Both attribute significance and clusters themselves cannot be fully
explained using the K-Means algorithm.
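Point D above (the choice of initial centers is reflected in the final centers) can be seen in a small one-dimensional sketch. The data and starting centers are invented for illustration, and this is a bare-bones version of the algorithm rather than a reference implementation.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain K-Means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

# Hypothetical data: three natural groups, but K is set to 2.
points = [0, 1, 10, 11, 20, 21]
from_low = kmeans_1d(points, [0, 1])     # initial centers at the low end
from_high = kmeans_1d(points, [20, 21])  # initial centers at the high end
```

Starting at the low end, the middle group is merged with the high group; starting at the high end, it is merged with the low group. Both runs converge, but to different final centers.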
Score 20 out of 20
4. Genetic algorithms are plagued with some issues that affect performance. Genetic
algorithms are not limited in the types of data that they need. As long as a given
problem can be represented as a string of bits of a fixed length, it can be handled.
Although many problems can be encoded, the specific characteristics of the
encoding affect the quality of the results generated. By properly coding
problems, solutions to many problems can be reached. But genetic algorithms
may not generate the optimum solution because they can prematurely settle on a
reasonable solution. Good To get around this problem, a GA can be combined with
other techniques, followed by changing each bit in the solution to determine
if a better solution exists. A GA works by evolving successive generations of
genomes that approximate more and more closely the fitness criteria established
in the fitness function with the goal of maximizing the fitness of the genomes in
the population. The problem of overly fast convergence to a less than optimum
solution suggests that the search is being limited. Good To get around this, the
various probabilities for crossover and mutation can be set high initially, then
slowly decreased from one generation to the next. Good
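The ideas in answer 4 can be sketched in a few lines: fixed-length bit-string genomes, crossover and mutation probabilities that start high and decay each generation, and a final pass that changes each bit to see whether a better solution exists. Everything specific here is an assumption for illustration: the toy fitness function (count of 1 bits), tournament selection, single-point crossover, and all parameter values.

```python
import random

def fitness(genome):
    # Toy fitness chosen for illustration: the number of 1 bits.
    return sum(genome)

def tournament(pop, rng):
    # Pick two genomes at random and keep the fitter one.
    a, b = rng.choice(pop), rng.choice(pop)
    return a if fitness(a) >= fitness(b) else b

def evolve(length=20, pop_size=30, generations=40,
           crossover_rate=0.9, mutation_rate=0.2, decay=0.95, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop, rng), tournament(pop, rng)
            child = list(p1)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Flip each bit with the current mutation probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)
        # Start high, then slowly lower both probabilities each generation.
        crossover_rate *= decay
        mutation_rate *= decay
    return best

def refine(genome):
    # Change each bit in turn and keep the change only if fitness improves.
    result = list(genome)
    for i in range(len(result)):
        trial = list(result)
        trial[i] ^= 1
        if fitness(trial) > fitness(result):
            result = trial
    return result

best = evolve()
polished = refine(best)
```

For this toy fitness function the bit-by-bit refinement always reaches the optimum; on harder problems it only guarantees that no single-bit change improves the result.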
Score 20 out of 20
5. To be useful for data mining, data may need to be cleaned and transformed.
Certain data mining techniques do not function well in the presence of outliers
and may deliver unstable results. Identification of outliers can be accomplished
by examining histograms of attributes or scatter plots. Good In some data mining
algorithms, large differences in the ranges of attributes will lead to the tendency
for the attribute with the greater range to have excessive influence on the results.
In such cases numerical attributes should be normalized, to standardize the scale
of effect each attribute has on the results. Good There are two common techniques
for rescaling selected attributes. Min-Max normalization restricts values to a
range from 0 to 1 by seeing how much greater the attribute value is
than the minimum value and scaling this difference by the range. Z-score
standardization takes the difference between the attribute value and the attribute
mean value and scales the difference by the standard deviation of the attribute
values; the result is not confined to the range from 0 to 1. Z-score
standardization can also be used to identify outliers, which are
farther than 3 standard deviations from the mean and so have a Z-score
that is either greater than 3 or less than -3. Good Missing data
can be handled by replacing the missing value with a value substituted according
to various criteria: a.) the attribute mean or mode, b.) a value generated at random
from the observed attribute distribution, or c.) some constant. Data
misclassifications can be identified by examining the frequency distribution of
categorical attributes for inconsistencies. Good
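The transformations in answer 5 are short enough to write out. The sketch below uses only the Python standard library; the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, as are the function names and the sample values.

```python
from statistics import mean, pstdev

def min_max(values):
    # (value - minimum) / range, giving results between 0 and 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    # (value - mean) / standard deviation.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def outliers(values, threshold=3.0):
    # Values whose Z-score is greater than 3 or less than -3.
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

def fill_missing_with_mean(values):
    # Replace missing entries (None) with the mean of the known entries.
    known = [v for v in values if v is not None]
    mu = mean(known)
    return [mu if v is None else v for v in values]

# Hypothetical attribute values.
scaled = min_max([2, 4, 6, 10])
standardized = z_scores([2, 4, 6])
flagged = outliers([10] * 19 + [100])
filled = fill_missing_with_mean([1.0, None, 3.0])
```

Note that `min_max` would divide by zero for a constant attribute, and `z_scores` likewise when the standard deviation is 0; a production version would need to handle those cases.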
Score 18 out of 20
6. Neural networks can be used for identifying clusters of records that are similar to
each other but unfortunately, 1.) don’t explain how they are similar Good, 2.)
work best when all of the input and output values are between 0 and 1, and 3.)
may converge on an inferior solution. Good Application of neural networks
would be useful in an effort to identify bank customers who are good candidates
for home equity loans. Customer data includes attributes such as appraised value
of a home, amount of credit available, amount of credit granted, age, marital
status, and household income. Since the inputs to a neural network must range in
value between 0 and 1, transformation of the input data is required. Continuous
values, like home value or income must be transformed by subtracting the lower
bound of the range from the value and dividing the results by the size of the range
(the difference between the highest and the lowest value for that attribute in the
data set). For categorical attributes, like marital status, arbitrary fractions between
0 and 1 are assigned to each of the categories. Because neural networks produce
continuous values, the output from a network can be difficult to interpret for
categorical results. The best way to calibrate the output is to run the network over
a test set, separate from the training set, and use the results from the test set to
calibrate the output of the network to categories. In some cases the network can
have a separate output for each category. Neural networks do not produce easily
understood rules that explain how they arrive at a given result. Even though
neural networks cannot produce explicit rules, sensitivity analysis can be used to
explain which inputs are more important than others. This analysis can be
performed inside the network, by using the errors generated from
backpropagation. Finally, neural networks may converge on an inferior
solution for a given training set. The test set can be used to determine
when a model performs well enough to be used on unknown data.
Also, ANNs can be overtrained to the point of working well on the
training data but poorly on test data.
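The input transformation and output calibration described in answer 6 can be sketched as follows. All names, attribute bounds, the fractions assigned to marital status, and the test-set outputs are hypothetical; the cutoff search is one simple way to calibrate a continuous output to two categories, not the only one.

```python
def scale_continuous(value, low, high):
    # Subtract the lower bound and divide by the size of the range.
    return (value - low) / (high - low)

# Arbitrary fractions between 0 and 1 assigned to each category (assumed).
MARITAL_STATUS = {"single": 0.0, "married": 0.5, "divorced": 1.0}

def encode_customer(customer, bounds):
    """Turn one customer record into network inputs between 0 and 1."""
    return [
        scale_continuous(customer["home_value"], *bounds["home_value"]),
        scale_continuous(customer["income"], *bounds["income"]),
        MARITAL_STATUS[customer["marital_status"]],
    ]

def best_cutoff(test_outputs, test_labels):
    """Pick the output cutoff that classifies the most test cases correctly."""
    def correct(cut):
        return sum((o >= cut) == y for o, y in zip(test_outputs, test_labels))
    return max(sorted(set(test_outputs)), key=correct)

# Hypothetical bounds taken from the lowest and highest values in the data set.
bounds = {"home_value": (50000, 450000), "income": (20000, 140000)}
customer = {"home_value": 250000, "income": 80000, "marital_status": "married"}
inputs = encode_customer(customer, bounds)

# Hypothetical network outputs on a test set, with the known categories.
cutoff = best_cutoff([0.2, 0.4, 0.6, 0.9], [False, False, True, True])
```

Outputs at or above the chosen cutoff are read as good candidates, and outputs below it as poor ones. The cutoff is chosen on the test set, separate from the training set, as the answer recommends.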