2014 test 1 - NCSU Statistics

Assessment 1, DM, Sept. 2013, Dickey
Name _______________________
Answer with numbers whenever possible.
1. I know that the probability of item A and item B being purchased together is 0.30 and the
probability of B being purchased (with or without A) is 0.50. For each item in the following list,
compute the requested number if it is possible to do so from this information. If it is not
possible, just leave it blank.
a. Confidence of the rule A=>B ______
b. Confidence of the rule B=>A _____
c. Expected confidence of the rule A=>B _______
d. Expected confidence of the rule B=>A ________
e. Support of the rule B=>A ______
2. In a data set I have a feature DI = debt to income ratio and I have a target which is defaulting on
a loan (Y=1) or not (Y=0). My observations in a 2 by 2 table are
                     Low DI   High DI
Y=0 (no default)        800       200
Y=1 (default)           100       400
a. Compute the confidence _____ for the rule “High debt to income ratio => default”
b. Besides confidence, expected confidence, and support there is one other item that
is typically computed in association analysis.
Name it _____________________ and compute its value _____________ for the rule
“High debt to income ratio => default”
c. Compute the expected numbers used in computing the Chi-square test (for the
hypothesis of no association between Y=defaulting and DI=debt to income ratio).
                     Low DI   High DI
Y=0                    ____      ____
Y=1                    ____      ____
d. Suppose the p-value associated with the test statistic in part (c) is 0.00001. Compute
the associated logworth ________
e. Now suppose the table above gave the smallest p-value (still 0.00001) among all
possible divisions of DI (debt to income ratio) into high and low values where,
among our 1500 subjects there were 101 distinct DI values. We might split our data
based on the table’s high and low DI as the first split in a decision tree. In looking for
a better split, suppose we try splitting on gender (M or F) and get a p-value 0.00023.
Compute the Bonferroni adjusted p-values for gender _______ and DI _________.
Using Bonferroni, which of the two variables has the higher logworth? __________
************************************* answers **************************************
1. Only parts b, c, and e can be computed; parts a and d require Pr{A}, which is not given.
b. Pr{A|B} = Pr{A and B}/Pr{B} = 0.3/0.5 = 0.6, the confidence for B=>A.
c. The expected confidence for the rule A=>B is Pr{B} = 0.5.
e. The support is Pr{A and B} = 0.3 for either rule.
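As a check on answer 1, the following Python sketch computes the quantities that Pr{A and B} and Pr{B} determine on their own. The function name and its arguments are illustrative and are not part of the assessment.

```python
def rule_measures_from_joint(p_a_and_b, p_b):
    """Association measures computable from Pr{A and B} and Pr{B} alone."""
    return {
        "confidence B=>A": p_a_and_b / p_b,   # Pr{A|B}
        "expected confidence A=>B": p_b,      # Pr{B}
        "support": p_a_and_b,                 # same for A=>B and B=>A
    }

measures = rule_measures_from_joint(0.30, 0.50)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Confidence of A=>B and expected confidence of B=>A are left out on purpose, since both need Pr{A}.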
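2 a. 400/600 = 0.667
b. Lift: 400/1500 divided by (600/1500)(500/1500), so 3(400/600) = 2.0
c. Expected counts (row total times column total, divided by 1500):
                     Low DI   High DI
Y=0 (no default)        600       400
Y=1 (default)           300       200
d. log10(10^-5) = -5, so logworth = 5
e. With 101 distinct DI values there are 100 possible splits, so the adjusted p-value for DI is
100(.00001) = 0.00100. Gender has only 1 possible split, so its adjusted p-value is
1(.00023) = 0.00023 < 0.00100, so use gender. It has the larger logworth because it has the
smaller adjusted p-value.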
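The arithmetic in answer 2 can be reproduced with a short Python sketch. The variable names are illustrative, and the Bonferroni multiplier for DI assumes that 101 distinct values give 100 candidate split points, as in the answer key.

```python
import math

# Observed counts from question 2: rows are Y=0, Y=1; columns are Low DI, High DI.
observed = [[800, 200],
            [100, 400]]

n = sum(sum(row) for row in observed)               # 1500 subjects
row_totals = [sum(row) for row in observed]         # [1000, 500]
col_totals = [sum(col) for col in zip(*observed)]   # [900, 600]

# Rule "High DI => default": confidence, expected confidence, and lift.
confidence = observed[1][1] / col_totals[1]         # 400/600
expected_confidence = row_totals[1] / n             # 500/1500
lift = confidence / expected_confidence

# Expected counts under no association: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

def logworth(p):
    """Logworth is -log10 of the p-value."""
    return -math.log10(p)

# Bonferroni adjustment: multiply each p-value by the number of splits tried.
di_splits = 101 - 1                                 # assumed: one split between adjacent values
p_di = min(1.0, di_splits * 0.00001)
p_gender = min(1.0, 1 * 0.00023)

print(f"confidence = {confidence:.3f}, lift = {lift:.2f}")
print("expected counts:", expected)
print(f"logworth of p = 0.00001: {logworth(0.00001):.2f}")
print(f"adjusted logworth: DI = {logworth(p_di):.2f}, gender = {logworth(p_gender):.2f}")
```

Gender ends up with the larger adjusted logworth because its single candidate split leaves its p-value unchanged, while the DI p-value is multiplied by 100.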