Assessment 1, DM, Sept. 2013, Dickey          Name _______________________

Answer with numbers whenever possible.

1. I know that the probability of item A and item B being purchased together is 0.30 and the probability of B being purchased (with or without A) is 0.50. For the following list, if it is possible to compute the requested number from this information, do so. If not possible just leave it blank.

   a. Confidence of the rule A=>B ______
   b. Confidence of the rule B=>A ______
   c. Expected confidence of the rule A=>B ______
   d. Expected confidence of the rule B=>A ______
   e. Support of the rule B=>A ______

2. In a data set I have a feature DI = debt to income ratio and I have a target which is defaulting on a loan (Y=1) or not (Y=0). My observations in a 2 by 2 table are

                         Low DI    High DI
   Y=0 (no default)       800       200
   Y=1 (default)          100       400

   a. Compute the confidence ______ for the rule "High debt to income ratio => default"

   b. Besides confidence, expected confidence, and support there is one other item that is typically computed in association analysis. Name it ______________ and compute its value ______________ for the rule "High debt to income ratio => default"

   c. Compute the expected numbers used in computing the Chi-square test (for the hypothesis of no association between Y=defaulting and DI=debt to income ratio).

                         Low DI    High DI
   Y=0                    ____      ____
   Y=1                    ____      ____

   d. Suppose the p-value associated with the test statistic in part (c) is 0.00001. Compute the associated logworth ________

   e. Now suppose the table above gave the smallest p-value (still 0.00001) among all possible divisions of DI (debt to income ratio) into high and low values where, among our 1500 subjects, there were 101 distinct DI values. We might split our data based on the table's high and low DI as the first split in a decision tree. In looking for a better split, suppose we try splitting on gender (M or F) and get a p-value 0.00023. Compute the Bonferroni adjusted p-values for gender _______ and DI _________.
      Using Bonferroni, which of the two variables has the higher logworth? __________

************************************* answers **************************************

1. We can compute Pr{A|B} = Pr{A and B}/Pr{B} = 0.3/0.5 = 0.6, the confidence for B=>A.
   We can compute the expected confidence Pr{B} = 0.5 for the rule A=>B.
   We can compute the support Pr{A and B} = 0.3 for either rule.
   Parts (a) and (d) both require Pr{A}, which is not given, so they are left blank.

2. a. 400/600 = 0.667

   b. Lift: 400/1500 divided by (600/1500)(500/1500), so 3(400/600) = 2.0

   c. Expected counts (row total times column total, divided by 1500):

                         Low DI    High DI
      Y=0                 600       400
      Y=1                 300       200

   d. log10(10^-5) = -5, so logworth = 5

   e. DI: 100(0.00001) = 0.00100, since 101 distinct values give 100 possible split points. Gender: 1(0.00023) = 0.00023 < 0.00100, so use gender. It has the larger logworth because it has the smaller adjusted p-value.
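
The arithmetic in the answers above can be checked with a short script. This is only a sketch, not part of the original key: the variable and function names are made up for illustration, logworth is taken to be -log10(p), and the Bonferroni multiplier is taken to be the number of candidate split points (distinct values minus one), which is what the answer to 2e uses.

```python
import math

# Question 1: only Pr{A and B} and Pr{B} are given.
p_a_and_b = 0.30
p_b = 0.50
conf_b_to_a = p_a_and_b / p_b        # Pr{A|B}, confidence of B=>A
expected_conf_a_to_b = p_b           # expected confidence of A=>B
support = p_a_and_b                  # same for A=>B and B=>A
# Confidence of A=>B and expected confidence of B=>A need Pr{A}: not computable.

# Question 2: observed counts, rows are Y=0 and Y=1, columns are Low DI and High DI.
observed = [[800, 200],
            [100, 400]]
n = sum(sum(row) for row in observed)               # 1500
row_totals = [sum(row) for row in observed]         # [1000, 500]
col_totals = [sum(col) for col in zip(*observed)]   # [900, 600]

# 2a, 2b: rule "High DI => default"
confidence = observed[1][1] / col_totals[1]         # 400/600
expected_confidence = row_totals[1] / n             # 500/1500
lift = confidence / expected_confidence

# 2c: expected counts under no association
expected = [[r * c / n for c in col_totals] for r in row_totals]


def logworth(p_value):
    """Assumed definition: logworth = -log10(p-value)."""
    return -math.log10(p_value)


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# 2d, 2e: 101 distinct DI values give 100 candidate splits; gender gives 1.
di_adjusted = bonferroni(0.00001, 101 - 1)
gender_adjusted = bonferroni(0.00023, 1)
better_split = "gender" if logworth(gender_adjusted) > logworth(di_adjusted) else "DI"

print("confidence B=>A:", conf_b_to_a)
print("confidence, lift:", round(confidence, 3), round(lift, 3))
print("expected counts:", expected)
print("adjusted p (DI, gender):", di_adjusted, gender_adjusted)
print("higher logworth:", better_split)
```

The expected counts come out in the same row and column order as the table in question 2 (rows Y=0 and Y=1, columns Low DI and High DI).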