CSIS 5420 Week 2 Homework – Answers
1. (Computational Question #2, page 102) Answer the following:
a. Write the production rules for the decision tree shown in Figure 3.4 (page 74).
b. Can you simplify any of your rules without compromising rule accuracy?
a. IF age > 43
THEN life insurance promotion = no
IF age <= 43 & sex = female
THEN life insurance promotion = yes
IF age <= 43 & sex = male & credit card insurance = no
THEN life insurance promotion = no
IF age <= 43 & sex = male & credit card insurance = yes
THEN life insurance promotion = yes
b. Yes. Attribute age can be removed from the third and fourth rules: once the first rule has
handled customers over 43, sex and credit card insurance alone determine the outcome.
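Read as an ordered rule list, the rules above can be sketched as a small function; the attribute names follow Figure 3.4, and the function itself is an illustration, not ESX output. Because the rules fire in order, the age test can be dropped from the last two rules, as noted in part b.

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Ordered production rules read off the decision tree in Figure 3.4.

    The rules are tried top to bottom, so by the time the credit card
    insurance tests are reached, only customers with age <= 43 remain,
    which is why age can be removed from the last two rules.
    """
    if age > 43:
        return "no"
    if sex == "female":
        return "yes"
    if credit_card_insurance == "yes":
        return "yes"
    return "no"
```

For example, a 35-year-old male with credit card insurance is classified as taking the life insurance promotion, matching the fourth rule.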
2. (Computational Question #4, page 102) Using the attribute age, draw an expanded
version of the decision tree in Figure 3.4 (page 74) that correctly classifies all training
data instances.
One possibility is to split the credit card insurance = no branch on age > 29 and age <= 29. The two
instances following age > 29 have life insurance promotion = no. The two instances following
age <= 29 are split once again on attribute age, this time at age <= 27 and age > 27.
3. (Computational Question #8, page 103) Use the information in Table 3.5 (page 82) to
list three two-item set rules. Use the data in Table 3.3 (page 80) to compute confidence
and support values for each of your rules.
There are 22 possibilities, two for each item set entry. Here are the computations for the fourth
item set.
IF magazine promotion = yes
THEN sex = male {confidence = 4/7, support = 4/10}
IF sex = male
THEN magazine promotion = yes {confidence = 4/6, support = 4/10}
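These confidence and support values can be reproduced with a small helper. The rows below are a toy stand-in consistent with the counts behind the answer (7 magazine takers, 6 males, 4 in both groups out of 10 instances), not the actual contents of Table 3.3.

```python
from fractions import Fraction

# Toy rows consistent with the counts used above, not the actual Table 3.3:
# 7 instances with magazine promotion = yes, 6 males, 4 in both groups.
rows = ([("yes", "male")] * 4 + [("yes", "female")] * 3
        + [("no", "male")] * 2 + [("no", "female")] * 1)

def rule_stats(rows, antecedent, consequent):
    """confidence = P(consequent | antecedent); support = P(both)."""
    covered = [r for r in rows if antecedent(r)]
    hits = [r for r in covered if consequent(r)]
    return Fraction(len(hits), len(covered)), Fraction(len(hits), len(rows))

# IF magazine promotion = yes THEN sex = male
conf, sup = rule_stats(rows, lambda r: r[0] == "yes", lambda r: r[1] == "male")
# conf = 4/7, sup = 4/10
```

Swapping the antecedent and consequent gives the second rule's confidence of 4/6 with the same 4/10 support, since support counts the same joint occurrences either way.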
4. (Computational Question #10, page 103) Perform the third iteration of the K-Means
algorithm for the example given in the section titled An Example Using K-Means (page
84). What are the new cluster centers?
The distances for the third iteration of the K-Means algorithm are as follows (all values are approximate):
Distance (C1 – 1) = 1.053
Distance (C1 – 2) = 2.027
Distance (C1 – 3) = 1.204
Distance (C1 – 4) = 1.204
Distance (C1 – 5) =
Distance (C1 – 6) = 5.071
Distance (C2 – 1) = 3.417
Distance (C2 – 2) = 2.386
Distance (C2 – 3) = 2.832
Distance (C2 – 4) = 1.421
Distance (C2 – 5) = 1.536
Distance (C2 – 6) = 2.606
After the third iteration, cluster 1 will contain instances 1, 2, 3, and 4. Cluster 2 will hold
instances 5 and 6. The new cluster center for cluster 1 is (1.5, 5). The new cluster center for
cluster 2 is (4.0, 4.25).
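The iteration itself follows the standard K-Means recipe: assign each instance to its nearest center, then recompute each center as the mean of its assigned instances. A generic sketch (the demo points are hypothetical, not the six instances from the book's example):

```python
from math import dist, fsum

def kmeans_step(points, centers):
    """One K-Means iteration: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    new_centers = [tuple(fsum(xs) / len(c) for xs in zip(*c))
                   for c in clusters]
    return clusters, new_centers

# Hypothetical 2-D points, not the instances from the book's example.
clusters, centers = kmeans_step(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)],
    [(0.0, 0.0), (10.0, 0.0)],
)
# centers -> [(0.0, 0.5), (10.0, 0.5)]
```

Repeating this step until the assignments stop changing is the whole algorithm; the book's third iteration is exactly one such assign-and-update pass.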
5. (Data Mining Question #1, page 141) This exercise demonstrates how erroneous data
is displayed in a spreadsheet file.
a. Copy the CreditCardPromotion.xls dataset into a new spreadsheet.
b. Modify the instance on line 17 to contain one or more illegal characters for age.
c. Add one or more blank lines to the spreadsheet.
d. Remove values from one or more spreadsheet cells.
e. Using life insurance promotion as the output attribute, initiate a data mining
session with ESX. When asked if you wish to continue mining the good instances,
answer No. Open the Word document located in the upper-left corner of your
spreadsheet to examine the error messages.
The error messages will vary.
6. (Data Mining Question #2, page 141) Suppose you suspect marked differences in
promotional purchasing trends between female and male Acme credit card customers.
You wish to confirm or refute your suspicion. Perform a supervised data mining
session using the CreditCardPromotion.xls dataset with sex as the output attribute.
Designate all other attributes as input attributes, and use all 15 instances for training.
Because there is no test data, the RES TST and RES MTX sheets will not be created.
Write a summary confirming or refuting your hypothesis. Base your analysis on:
a. Class resemblance scores.
b. Class predictability and predictiveness scores.
c. Rules created for each class. You may wish to use the re-rule feature.
The results given here assume all parameters are set at their default values.
a. The class resemblance score for sex = male is 0.429, which is lower than the domain
resemblance value of 0.460. This is a first indication that the input attributes do not
distinguish the two classes.
b.
Life insurance promotion = no is predictive and predictable of class membership for sex
= male. Likewise, Life insurance promotion = yes is predictive and predictable of class
membership for sex = female. Income range = 50-60k is predictive of class membership
for sex = female.
c.
Using the default settings, one rule is generated for each class. The rule for the female
class distinguishes the female class using attribute age. The rule for the male class
distinguishes males from females by their reluctance to take advantage of the life
insurance promotion.
To summarize, there is some evidence that life insurance promotional purchasing trends
differ between males and females.
7. (Data Mining Question #4, page 141) For this exercise you will use ESX to perform a
data mining session with the cardiology patient data described in Chapter 2 (page 37;
files: CardiologyNumericals.xls and CardiologyCategorical.xls). Load the
CardiologyCategorical.xls file into a new MS Excel spreadsheet. This is the mixed form
of the dataset containing both categorical and numeric data. Recall that the data
contains 303 instances representing patients who have a heart condition (sick) as well
as those who do not.
Save the spreadsheet to a new file and perform a supervised learning mining session
using class as the output attribute. Use 1.0 for the real-tolerance setting and select 203
instances as training data. The final 100 instances represent the test dataset. Generate
rules using the default settings for the rule generator. Answer the following based on
your results:
a. Provide the domain resemblance score as well as the resemblance score for each
class (sick and healthy).
b. What percent of the training data is female?
c. What is the most commonly occurring domain value for the attribute slope?
d. What is the average age of those individuals in the healthy class?
e. What is the most common healthy class value for the attribute thal?
f. Specify blood pressure values for the two most typical sick class instances.
g. What percent of the sick class is female?
h. What is the predictiveness score for the sick class attribute value angina = true? In
your own words, explain what this value means.
i. List one sick class attribute value that is highly sufficient for class membership.
j. What percent of the test set instances were correctly classified?
k. Give the 95% confidence limits for the test set. State what the confidence limit
values mean.
l. How many test set instances were classified as being sick when in reality they
were from the healthy class?
m. List a rule with multiple conditions for the sick class that covers at least 50% of
the instances and is accurate in at least 85% of all cases.
a. Domain resemblance = 0.520
Healthy class resemblance = 0.581
Sick class resemblance = 0.553
b. 31%
c. flat
d. 51.95
e. normal
f. 125 and 130
g. 17%
h. 0.75. Seventy-five percent of all individuals having the value true for attribute angina are
in the sick class.
i. thal = rev, or # colored vessels = 2, or # colored vessels = 3
j. 82%
k. The lower-bound correctness is 74.3%. The upper-bound value is 89.7%. We can be 95%
confident that the test set model correctness score is somewhere between 74.3% and
89.7%.
l. Five
m. There are four such rules:
angina = TRUE
and chest pain type = Asymptomatic
:rule accuracy 87.50%
:rule coverage 52.69%
chest pain type = Asymptomatic
and slope = Flat
:rule accuracy 83.61%
:rule coverage 54.84%
thal = Rev
and chest pain type = Asymptomatic
:rule accuracy 91.53%
:rule coverage 58.06%
thal = Rev
and slope = Flat
:rule accuracy 89.09%
:rule coverage 52.69%
n. Maximum heart rate
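The bounds quoted in part k appear to follow the common rule of thumb of taking the test-set accuracy plus or minus two standard errors, where the standard error is sqrt(accuracy × (1 − accuracy) / n). With 82% correct on 100 test instances, this reproduces the 74.3% and 89.7% figures:

```python
from math import sqrt

def confidence_limits(accuracy, n):
    """Approximate 95% limits for test-set accuracy: +/- 2 standard errors.

    This two-standard-error rule of thumb is assumed here because it
    reproduces the bounds quoted above; ESX may compute them differently.
    """
    se = sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - 2 * se, accuracy + 2 * se

lo, hi = confidence_limits(0.82, 100)
# round(lo, 3), round(hi, 3) -> (0.743, 0.897)
```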
8. (Data Mining Question #6, page 142) In this exercise you will use instance typicality to
determine class prototypes. You will then employ the prototype instances to build and
test a supervised learner model.
a. Open the CardiologyCategorical.xls spreadsheet file.
b. Save the file to a new spreadsheet.
c. Perform a supervised mining session on the class attribute. Use all instances for
training.
d. When learning is complete, open the RES TYP sheet and copy the most typical
sick class instance to a new spreadsheet. Save the spreadsheet and return to the
RES TYP sheet.
e. Now, copy the most typical healthy class instance to the spreadsheet created in
the previous step and currently holding the single most typical sick class instance.
Save the updated spreadsheet file.
f. Delete columns A, B, and Q of the new spreadsheet file that now contains the most
typical healthy class instance and the most typical sick class instance. Copy the
two instances contained in the spreadsheet.
g. Return to the original sheet 1 data sheet and insert two blank rows after row three
and paste the two instances copied in step f into sheet 1.
h. Your sheet 1 spreadsheet now contains 305 instances. The first instance in sheet 1
(row 4) is the most typical sick class instance. The second instance is the most
typical healthy class instance.
i. Perform a data mining session using the first two instances as training data. The
remaining 303 instances will make up the test set.
j. Analyze the test set results by examining the confusion matrix. What can you
conclude?
k. Repeat the above steps, but this time extract the least typical instance from each
class. How do your results compare with those of the first experiment?
The test set correctness using the two most typical instances for training is 81%. The accuracy
using the two least typical instances for training is 34%.
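Training on one most-typical instance per class effectively builds a nearest-prototype classifier: each test instance takes the label of the closer prototype, which helps explain the 81% versus 34% gap. A sketch with hypothetical two-feature vectors (illustrative values, not instances from the cardiology data):

```python
from math import dist

# Hypothetical two-feature prototypes (e.g. blood pressure, angina flag);
# these are made-up values, not rows from the cardiology dataset.
prototypes = {"sick": (140.0, 1.0), "healthy": (120.0, 0.0)}

def nearest_prototype(instance, prototypes):
    """Label an instance with the class of its closest prototype."""
    return min(prototypes, key=lambda label: dist(instance, prototypes[label]))

label = nearest_prototype((138.0, 1.0), prototypes)  # -> "sick"
```

A typical instance sits near the center of its class, so it makes a good prototype; a least-typical instance sits near the class boundary or among instances of the other class, which is why the second experiment performs so much worse.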
9. (Data Mining Question #8, page 143) Perform a supervised data mining session with
the Titanic.xls dataset. Use 1500 instances for training and the remaining instances as
test data.
a. Are any of the input attributes highly predictive of class membership for either
class?
b. What is the test set accuracy of the model?
c. Use the confidence limits to state the 95% test set accuracy range.
d. Why is the classification accuracy so much lower for this experiment than for the
quick mine experiment given in Section 4.8 (pages 135 - 139)?
Answers are based on default parameter settings:
a. The predictiveness score for sex = female is 0.81 for the survivors. The predictiveness
score for sex = male is 0.77 for the non-survivors. For the non-survivors, class = third
has a predictiveness score of 0.72 and class = crew has a predictiveness score of 0.76.
b. 77%
c. The lower-bound accuracy is 73.8%. The upper-bound accuracy is 80.2%.
d. The test set for the example in section 4.8 contains 190 non-survivors and 10 survivors.
That is, 95% of the test data holds non-survivor instances. The test data does not reflect
the ratio of survivors to non-survivors seen in the entire dataset. The test set for this
problem contains 77% non-survivors and 23% survivors. This non-survivor to survivor
ratio more closely matches the ratio seen in the entire domain of data instances.
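The effect described in part d is easy to quantify: the accuracy floor set by always predicting the majority class rises with test-set imbalance. On the quick-mine split of 190 non-survivors to 10 survivors, that floor is already 95%, while on a split matching the overall survivor ratio it is about 77%:

```python
def majority_baseline(n_majority, n_minority):
    """Accuracy of always predicting the majority class.

    A skewed test set inflates this floor, which is why the quick-mine
    experiment's accuracy looks so much higher than this one's.
    """
    return n_majority / (n_majority + n_minority)

quick_mine = majority_baseline(190, 10)      # skewed test set -> 0.95
representative = majority_baseline(77, 23)   # ~overall ratio -> 0.77
```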