Review 2

COSC 6335
Review 2 on Thursday, October 31
Dr. Eick
1. Decision Trees/Classification
a) Compute the GINI-gain (GINI before the split minus GINI after the split) for the
following decision tree split [4] (just giving the formula is fine!):
A node with class distribution (5,3,2) is split into two nodes with
distributions (4,0,2) and (1,3,0).
GINI-gain = GINIbefore - GINIafter = G(5/10,3/10,2/10) - (0.6*G(2/3,0,1/3) + 0.4*G(1/4,3/4,0))
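As a worked check, here is a minimal Python sketch that evaluates this formula
numerically (the gini helper is written out for illustration, not taken from a library):

def gini(*p):
    # GINI index of a node with class proportions p: 1 minus the sum of squared proportions
    return 1.0 - sum(q * q for q in p)

# parent (5,3,2); children (4,0,2) with weight 6/10 and (1,3,0) with weight 4/10
gini_before = gini(5/10, 3/10, 2/10)
gini_after = 0.6 * gini(2/3, 0, 1/3) + 0.4 * gini(1/4, 3/4, 0)
print(gini_before - gini_after)  # GINI gain, approximately 0.203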
b) Assume you learn a decision tree for a dataset that solely contains numerical
attributes. What can be said about the shape of the decision boundaries that the learnt
decision tree model employs? [2]
Axis-parallel lines or line segments (axis-parallel hyperplanes if there are more
than two attributes)
c) Are decision trees capable of modeling the 'either-or' operator; for example,
IF EITHER A OR B THEN class1 ELSE class2?
Give a reason for your answer! [3]
Solution given on the whiteboard during lecture!
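The whiteboard solution is not reproduced here; as an illustrative sketch, the answer
is yes: a tree of depth 2 suffices, testing A at the root and testing B in each of the
two branches. A quick scikit-learn check on the exclusive-or truth table (the 0/1
encoding and the class labels below are assumptions made for the demo):

from sklearn.tree import DecisionTreeClassifier

# exclusive-or truth table: class1 iff exactly one of A, B holds
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["class2", "class1", "class1", "class2"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X))   # reproduces y exactly
print(clf.get_depth())  # 2: test A at the root, then B in each branch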
d) What are the characteristics of over-fitting when learning decision trees? What can be
done to deal with over-fitting? [3]
Overfitting: training error is low [0.5], testing error is not optimal [0.5],
the model is too complex: the decision tree has too many nodes [1]
What can be done to deal with it?
1. Increase the degree of pruning in the decision tree learning
algorithm to obtain smaller decision trees [2]
2. Increase the number of training examples [1]
Other answers might exist which might deserve some credit!
e) Are decision trees suitable for classification problems involving continuous attributes
when classes have multi-modal (http://en.wikipedia.org/wiki/Multimodal)
distributions? Give reasons for your answer!
Yes, decision trees can learn disjunctive concepts and can deal with multi-modal
classes, as follows: each path in the decision tree to a leaf node identifies a patch
in the attribute space where the decision tree model predicts the class of that leaf
node; multi-modal distributions for a class C can be captured by using multiple
patches with leaf nodes that predict C.
f) Most machine learning approaches use training sets, test sets and validation sets to
derive models. Describe the role each of the three sets plays! [4]
Training set: used to learn the model [1.5]
Test set: used to evaluate the model, particularly its accuracy [1.5]
Validation set: used to determine the “best” input parameter(s) for the algorithm
which learns the model; e.g. parameters which control the degree of pruning of a
decision tree learning algorithm. [2]
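A minimal sketch of how the three sets work together (the iris data, the 60/20/20
split, and the ccp_alpha pruning grid are illustrative assumptions, not part of the
review):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 60% training, 20% validation, 20% test (proportions chosen for illustration)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# the training set learns candidate models; the validation set picks the
# pruning strength (the model-selection parameter)
best_model = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in (0.0, 0.01, 0.05)),
    key=lambda m: m.score(X_val, y_val),
)

# the test set is used exactly once, to estimate the accuracy of the final model
print(best_model.score(X_test, y_test))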
2. APRIORI
a) What is the APRIORI property?
XYi(X)≥i(Y)
b) Assume the APRIORI algorithm identified the following 7 frequent 4-itemsets that
satisfy a user-given support threshold: abcd, acde, acdf, acdg, adfg, bcde, and
bcdf. What initial candidate 5-itemsets are created by the APRIORI algorithm, and
which of those survive subset pruning?
acdef, acdeg, acdfg, bcdef (each formed by joining two frequent 4-itemsets that
agree on their first three items)
All four candidate 5-itemsets are pruned; each has at least one 4-subset that is
not frequent (e.g. acdef is pruned because adef is not frequent).
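A self-contained Python sketch of this join-and-prune step, using the seven
4-itemsets from the question (itemsets are encoded as sorted strings for brevity):

from itertools import combinations

frequent4 = {"abcd", "acde", "acdf", "acdg", "adfg", "bcde", "bcdf"}

# join step: merge two 4-itemsets that share their first three items
candidates = {a + b[-1]
              for a in frequent4 for b in frequent4
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
print(sorted(candidates))  # ['acdef', 'acdeg', 'acdfg', 'bcdef']

# prune step: drop any candidate with an infrequent 4-subset
survivors = [c for c in candidates
             if all("".join(s) in frequent4 for s in combinations(c, 4))]
print(survivors)  # [] -- all four candidates are pruned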
c) What is the final result that the APRIORI algorithm computes?
Let D be the set of items, σ be the support threshold for which APRIORI is run,
and T be the transaction database; APRIORI computes
{Y ⊆ D | support(Y,T) ≥ σ}
d) Assume an association rule "if smoke then cancer" has a confidence of 86% and a
high lift of 5.4. What does this tell you about the relationship between smoking and
cancer?
P(Cancer|Smoke) = P(Cancer and Smoke)/P(Smoke) = 0.86 ("your classmate Arko
Barman was correct that we have to divide by P(Smoke), as we make a
statement about people who smoke")
P(Cancer|Smoke)/P(Cancer) = lift = 5.4; smokers are 5.4 times more likely to get
cancer than the population as a whole (which implies P(Cancer) = 0.86/5.4 ≈ 0.16).
e) What are the main differences between the APRIORI algorithm and GSP (the
"apriori"-like algorithm which generalizes APRIORI to mine sequences)?
1. The order of items matters in sequences, but not in sets → more patterns
2. APRIORI is based on sets/subsets whereas GSP is based on sequences/subsequences
3. …
3. A little more on clustering
a) Which of the following cluster shapes is K-means capable of discovering? i) triangles
ii) clusters inside clusters iii) the letter 'T' iv) any polygon of 5 points v) the letter 'I'
i) yes; ii) no; iii) no; iv) not if the polygon is concave (K-means clusters are
convex); v) yes
b) What are the weaknesses of the DBSCAN clustering algorithm?
a. Does not work well for high dimensional datasets
b. Parameter selection is difficult
c. Not very fast; O(n*log(n)) at best; O(n**2) for most implementations
d. Problems dealing with clusters with varying densities.
e. …
4. Exploratory Data Analysis
a) Assume attribute A has a correlation of -0.95 with attribute B; what does this say about
the relationship of the two attributes?
Strong negative linear relationship: if A goes up, B goes down, and vice versa.
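A quick numerical illustration with numpy (the data values are made up so that the
correlation comes out strongly negative):

import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
B = np.array([9.8, 8.3, 6.0, 5.1, 2.9, 1.5])  # made-up values that fall as A rises
print(np.corrcoef(A, B)[0, 1])  # close to -1: strong negative linear relation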
b) Assume you have a dataset with 3 attributes and the entries of the covariance matrix
have positive numbers in the diagonal, but all other entries are 0. What does this say
about the relationship of the 3 attributes?
The three attributes are pairwise uncorrelated: there is no linear relationship
between any pair of attributes.
Interpret the following histogram for the body weight in a group of cancer patients!
[Histogram not reproduced.]
Two peaks around body weights of 63kg and 78kg [2]
Median around 70kg [0.5]
No gaps, or two small gaps at 112 and 118; not significantly skewed [2]
At most 4 points; other observations might deserve credit!
c) Assume a boxplot you create for attribute A does not show any outliers; what does
this mean? Assume that the value of the 25th percentile is 2 and the value of the 75th
percentile is 7; that is, the box goes from 2 to 7.
Only points that are more than 1.5*IQR above/below the upper/lower box boundary
are visualized as outliers. Here IQR = 7-2 = 5, so 1.5*IQR = 7.5; the absence of
outliers means there are no values lower than 2-7.5 = -5.5 and no values higher
than 7+7.5 = 14.5.
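The fence arithmetic as a tiny Python check (q1 and q3 are the box boundaries from
the question):

q1, q3 = 2.0, 7.0               # box boundaries from the question
iqr = q3 - q1                   # inter-quartile range: 5.0
lower_fence = q1 - 1.5 * iqr    # -5.5
upper_fence = q3 + 1.5 * iqr    # 14.5
# only values outside [lower_fence, upper_fence] would be drawn as outliers
print(lower_fence, upper_fence)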