Hammond - DATA MINING ASSIGNMENT

Total Score 110 out of 130
DATA MINING ASSIGNMENT – WEEK 9
Score 50 out of 50
1. Databases need to undergo preprocessing to be useful for data mining. Dirty data
can cause confusion for the data mining procedure, resulting in unreliable output.
Data cleaning includes smoothing noisy data, filling in missing values, identifying
and removing outliers, and resolving inconsistencies.
NOISY DATA: The problem of noisy data must be addressed in order to minimize its
negative effect on the overall accuracy of rules generated or models created by data
mining applications. Noise is a component in the training set that should not be
characterized by the modeling tool. Good Noise is random error in attribute values
due to duplicate records, incorrect values, outliers or values not consistent with
common sense. Identifying outliers is important because they may represent errors in
data entry. Good Even if the outlier is a valid data point, certain modeling techniques
are sensitive to the presence of outliers and may produce unstable results. Ways to
minimize the impact of noise include a.) Identification and elimination of outliers for
a numeric variable by examining a histogram of the variable or examining scatter
plots to identify outliers on more than one variable, Good b.) Identification of
outliers using Z-score standardization which reveals values farther than 3 standard
deviations from the mean Good and c.) Application of standardization techniques to
standardize the scale of effect each variable has on the result when analyzing data in
which the variables have ranges that vary greatly from each other. Good
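A minimal Python sketch of point b.) could look as follows, assuming the field values are held in a plain list (the function name and the threshold parameter are illustrative, not part of the assignment):

import statistics

def zscore_outliers(values, threshold=3.0):
    # Z-score standardization: how many standard deviations each value lies from the mean
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]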
MISSING DATA: Missing values, if left untreated, can have serious consequences.
Missing values should be replaced because a.) some data mining techniques cannot
deal with missing values and exclude the entire instance, Good b.) the default
replacement, if inappropriate, assigned by the data mining tool may introduce
distortion, Good and c.) most default replacement methods discard the information
contained in the missing-value patterns. Good There are methods that can be used to
estimate an appropriate value to replace a missing value. Capturing the variability that
exists in a data set in the form of ratios between various values can be used to infer
appropriate missing values that least corrupt the patterns in the original data set.
Good Choices of replacement values include constants specified by the analyst, the
field mean for numerical variables or mode for categorical variables, or a value
generated at random from the variable distribution observed. Good
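A short Python sketch of these replacement choices, assuming each field is a list in which missing values appear as None (the function names are illustrative):

import random
import statistics

def fill_with_mean(values):
    # numerical field: replace None with the mean of the observed values
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def fill_with_mode(values):
    # categorical field: replace None with the most frequent observed category
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]

def fill_at_random(values):
    # replace None with a value drawn at random from the observed distribution
    observed = [v for v in values if v is not None]
    return [random.choice(observed) if v is None else v for v in values]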
DATA NORMALIZATION AND SCALING: Some data mining tools require the
range of the input variables to be normalized and most techniques benefit from it.
Variables tend to have ranges that vary greatly from each other. For some data
mining algorithms, such differences in the ranges will lead to a tendency for the
variable with greater range to have too much influence on the results. Good
Therefore, data miners should normalize their numerical variables, to standardize the
scale of effect each variable has on the results. Techniques for normalization include
a.) min-max normalization, where it is determined how much greater the field value is
than the minimum value min(X) and this difference is scaled by the range, or
X* = (X – min(X)) / (max(X) – min(X)), so that all transformed values fall between 0 and 1,
Good b.) Z-score standardization, where the difference between the field value and the
field mean is computed and scaled by the standard deviation of the field values, or
X* = (X – mean(X)) / SD(X), so that most transformed values fall roughly between –3 and 3,
Good and c.) decimal scaling, where each field value is divided by 10 raised to a
sufficiently large power to scale all values to a range between –1 and 1. Good
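A brief Python sketch of the three techniques, assuming the field values are held in a plain list (the function names are illustrative):

import math
import statistics

def min_max(values):
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]          # results fall in [0, 1]

def z_score(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]               # most results fall roughly in [-3, 3]

def decimal_scale(values):
    # divide by 10 raised to a power large enough that every result is below 1 in magnitude
    power = math.floor(math.log10(max(abs(x) for x in values))) + 1
    return [x / 10 ** power for x in values]               # results fall in (-1, 1)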
ATTRIBUTE AND INSTANCE SELECTION: Data modeling requires at least three
sets of data: a training set, a test set, and an execution set. Each set selected needs to
be representative of the main data set in terms of the distribution of characteristics
among instances. Good Feature enhancement may require a concentration of
instances exhibiting some particular feature. Such a concentration can only be made if
a subset of the data is extracted from the main data set. Good So there is a need to
decide how large a data set is required to be an accurate reflection of all patterns in
the data. It is critical that each subset is representative of the composition of
attributes found in the main data set to avoid the introduction of bias that reduces the
accuracy of the results from a data mining session. Good
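One possible Python sketch of such a representative split, assuming each record's class label can be obtained through a caller-supplied label_of function and that a 60/20/20 split is wanted (both assumptions are illustrative):

import random
from collections import defaultdict

def stratified_split(records, label_of, fractions=(0.6, 0.2, 0.2)):
    # group records by class so each subset mirrors the class distribution of the main data set
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)
    training, test, execution = [], [], []
    for group in by_class.values():
        random.shuffle(group)
        a = int(fractions[0] * len(group))
        b = a + int(fractions[1] * len(group))
        training.extend(group[:a])
        test.extend(group[a:b])
        execution.extend(group[b:])
    return training, test, execution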
DATA TYPE CONVERSION: Data mining tools vary in the type of data that they
can process. Some can only process numerical data (e.g., neural networks, linear
regression) and others only categorical data (e.g., decision tree algorithm). Good To
accommodate the requirements of specific data mining techniques, data
transformation must take place to convert categorical data to numerical data and vice
versa. Good
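Two small Python sketches of such conversions, assuming plain lists of field values (the function names and bin labels are illustrative): one-hot encoding turns a categorical field into numerical indicator columns, and equal-width binning turns a numerical field into categories.

def one_hot(values):
    # categorical -> numerical: one 0/1 indicator column per category
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def equal_width_bins(values, labels=("low", "medium", "high")):
    # numerical -> categorical: assign each value to one of len(labels) equal-width bins
    k = len(labels)
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [labels[min(int((x - lo) / width), k - 1)] for x in values]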
Score 60 out of 80
2. A decision tree is a predictive model that can be viewed as a tree with each
branch representing a classification question, and the leaves representing
partitions of the data set with their classifications. Decision-tree methods are best
suited for the solution of problems that require the classification of records or
prediction of outcomes. When the objective is to assign each record to one of a
few categories, decision trees are the best choice. Decision trees are less
appropriate for estimation where the objective is to predict the value of a
continuous variable. Decision trees have problems processing time-series data
unless the data is manipulated or presented in a way that makes the trends apparent.
Some decision-tree algorithms are limited in that they can only deal with
binary-valued target classes (e.g., yes/no, accept/reject). Others are able to assign records
to an arbitrary number of classes, but produce errors when the number of training
examples per class gets small which can happen when a tree has many levels
and/or branches per node. Decision trees are computationally expensive to train because at each
node, each candidate splitting field must be sorted before its best split can be
found. Good
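A small Python sketch of that sorting step, assuming a numeric splitting field and class labels held in parallel lists (the Gini impurity criterion and the function names are illustrative choices):

def gini(labels):
    # impurity of a set of class labels
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    # sort the candidate splitting field, then score every threshold between adjacent values
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs[:i]]
        right = [y for x, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score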
Association rules take the form “If antecedent, then consequent”, along with the
measures of support and confidence associated with the rule. Association rules, or
techniques such as market basket analysis, are the best choice when there is an
undirected data mining problem that consists of well-defined items that group
together in interesting ways. They work best when all items have about the same
frequency in the data. Computations required to generate association rules grow
exponentially with the number of items and the complexity of the rules being
considered. The most difficult problem when applying this method comes with
determining the right set of items to use in the analysis. Good
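A minimal Python sketch of support and confidence, assuming transactions are represented as sets of items (the grocery items are made up for the example):

def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support of the whole rule divided by support of the antecedent alone
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(confidence({"bread"}, {"milk"}, transactions))   # "If bread, then milk"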
The k-means algorithm is an algorithm for finding clusters in data by following these steps:
1.) identification of how many clusters k the data should be partitioned into, 2.)
random assignment of k records to be the initial cluster center locations, 3.)
identification of the nearest cluster center for each record, 4.) for each of the k
clusters, identification of the cluster centroid and update of the location of each
cluster center to the new value of the centroid, and 5.) repetition of steps 3 and 4 until
convergence or termination, when the centroids no longer change. The k-means
method is best used to identify patterns of grouping or clusters in a large, complex
data set with many variables and a lot of internal structure. In the k-means method,
the original choice of a value for k determines the number of clusters that will be
found. If this number does not match the natural structure of the data, the technique
won’t obtain good results. It is difficult to interpret the resulting clusters of k-means
because when you don’t know what you are looking for, you may not recognize it
when you find it. Clusters identified may not have any practical value. Good
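A compact Python sketch of the five steps, assuming records are tuples of numeric values and using squared Euclidean distance (an illustrative choice):

import random

def kmeans(points, k, max_iters=100):
    # step 1: k is chosen by the analyst; step 2: k records become the initial centers
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # step 3: assign each record to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # step 4: move each center to the centroid of its cluster
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centers.append(centers[i])   # keep an empty cluster's center unchanged
        # step 5: repeat until the centroids no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters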
Linear regression is a method of prediction or estimation in which a model is created
that maps values from predictors in such a way that the lowest error occurs in making
predictions. The relationship between a predictor and a prediction can be mapped on a
two-dimensional space, with the records plotted for the prediction values along the Y
axis and the predictor values along the X axis. The linear regression model can be
viewed as the line that minimizes the error between the actual prediction value
and the point on the line (i.e., the prediction). Of the many lines that could be drawn
through the data, the one that minimizes the distance between the line and the data
points is the one that is chosen for the predictive model. The regression model is
represented by an output attribute (i.e., prediction) whose value is determined by a
linear sum of weighted input attribute (i.e., predictor) values. Linear regression is
very sensitive to anomalous fluctuations in the data, such as outliers and is seriously
affected by co-linearity in the input variables. It cannot deal with missing data, and
many of the standard default methods of replacing missing values do more harm than
good to the resulting model. Because it fits straight lines, it assumes additive effects
and has difficulty capturing interactions between the input variables. Good
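A small Python sketch of the single-predictor case, fitting the least-squares line y = w0 + w1*x (the function and variable names are illustrative):

def simple_linear_regression(xs, ys):
    # least-squares line y = w0 + w1 * x that minimizes the squared error
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1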
Logistic regression is a nonlinear regression technique that associates a conditional
probability with a data instance. Logistic regression functions similarly to the above
described linear regression model except that the predictions are passed through a
log-odds (logit) transformation so that the model produces probabilities between 0 and 1. Logistic
regression is applied to create predictive models when the problem involves the
prediction of a response that can only be yes or no. Good
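A minimal Python sketch of how such a model turns a weighted linear sum into a yes-probability, assuming the weights and bias have already been learned (the names are illustrative):

import math

def logistic(z):
    # squashes the weighted linear sum into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def probability_of_yes(weights, bias, record):
    z = bias + sum(w * x for w, x in zip(weights, record))
    return logistic(z)   # P(response = yes | record)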
Bayes classifier is a statistical classifier that can predict class membership
probabilities, such as the probability that a given sample belongs to a particular class.
Bayes classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. Bayesian methods are a way of
starting with a set of evidence (i.e., multivariate data) and arriving at an
estimate of the outcome probabilities given that evidence. In order
to discover the actual probability of an outcome given some multivariate evidence, all
of the variables are assumed to be independent of each other. The algorithm can deal
with all variable types but only by converting continuous variables into categorical
values. This technique easily incorporates domain knowledge: nodes can be created
and their probabilities learned from the data, or created with their probability
values set directly from domain knowledge. Bayesian networks can present insights as
well as predictions. Good
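A short Python sketch of a naive Bayes classifier over categorical attributes, assuming records are tuples of categorical values (the Laplace smoothing used for values unseen with a class is an illustrative choice):

from collections import defaultdict

def train_naive_bayes(records, labels):
    # count class priors and, per class, the frequency of each attribute value
    priors = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    value_sets = defaultdict(set)
    for record, label in zip(records, labels):
        priors[label] += 1
        for i, value in enumerate(record):
            counts[(label, i)][value] += 1
            value_sets[i].add(value)
    return priors, counts, value_sets, len(labels)

def classify(record, priors, counts, value_sets, n):
    best_label, best_score = None, -1.0
    for label, class_count in priors.items():
        score = class_count / n                      # prior P(class)
        for i, value in enumerate(record):
            # P(value | class), assuming attribute independence; Laplace smoothing
            # handles values never seen with this class in the training data
            score *= (counts[(label, i)][value] + 1) / (class_count + len(value_sets[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label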
Neural Networks ???
Genetic Algorithms ???