Lab 9 – Creating Association Rules (due M 4/30)

In this lab we will use the apriori algorithm to create association rules for the Adult data set available on the UCI Machine Learning Repository. This data set contains Census data and a class attribute indicating whether or not income exceeds $50,000.

PART 1

In this part we convert numerical attributes into nominal ones and create transaction data from a data matrix.

1. Download the Adult data set from the UCI repository and create a csv file.
http://archive.ics.uci.edu/ml/datasets/Adult
The data is in adult.data and the attribute names are in adult.names.

2. Open R and read in the data set. (In R 4.0 and later, stringsAsFactors=TRUE is needed so that text columns are read as factors.)
> adult <- read.csv("Adult.csv", header=TRUE, stringsAsFactors=TRUE)
> summary(adult)
How many attributes are there? How many records? Which attributes are integers?

3. Remove some attributes and convert integer attributes to factors.

a. Remove fnlwgt and education.num.
> adult[["fnlwgt"]] <- NULL
> adult[["education.num"]] <- NULL

b. Group age.
> adult[["age"]] <- ordered(cut(adult[["age"]], c(15,25,45,65,100)), labels = c("Young", "Middle", "Older", "Senior"))

c. Group hours.per.week.
> adult[["hours.per.week"]] <- ordered(cut(adult[["hours.per.week"]], c(0,25,40,60,168)), labels = c("Part-time", "Full-time", "Over-time", "VeryHigh"))

d. Group capital.gain and capital.loss. Zero values form their own group, and the positive values are split at their median.
> adult[["capital.gain"]] <- ordered(cut(adult[["capital.gain"]], c(-Inf, 0, median(adult[["capital.gain"]][adult[["capital.gain"]]>0]), Inf)), labels = c("None", "Low", "High"))
> adult[["capital.loss"]] <- ordered(cut(adult[["capital.loss"]], c(-Inf, 0, median(adult[["capital.loss"]][adult[["capital.loss"]]>0]), Inf)), labels = c("None", "Low", "High"))

We will need the association rules package arules. Documentation is available on e-reserve. Load the package in R.
> library(arules)

4. Create transaction data.
> adult2 <- as(adult, "transactions")
> adult2
> inspect(adult2[1:2])
How many transactions are there? How many items?
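The grouping and conversion steps in Part 1 can be tried out on a small scale before running them on the full data. The sketch below uses base R only and an invented eight-row data frame (the name toy and all of its values are made up for illustration; they are not taken from the Adult data). It applies the same cut() idiom as steps 3b and 3d, then builds "attribute=value" item labels by hand to show roughly what the conversion in step 4 produces. It is not how arules stores transactions internally.

```r
# Toy stand-in for a few Adult columns (all values invented for illustration).
toy <- data.frame(
  age          = c(17, 25, 26, 45, 46, 65, 66, 90),
  capital.gain = c(0, 0, 0, 0, 1000, 5000, 9000, 0),
  sex          = factor(c("Male", "Female", "Male", "Female",
                          "Male", "Female", "Male", "Female"))
)

# cut() builds right-closed intervals by default: (15,25], (25,45], (45,65], (65,100].
# An age of exactly 25 therefore falls in the first group.
toy[["age"]] <- ordered(cut(toy[["age"]], c(15, 25, 45, 65, 100)),
                        labels = c("Young", "Middle", "Older", "Senior"))

# Zero gains get their own group; positive gains are split at their median.
gain <- toy[["capital.gain"]]
split_point <- median(gain[gain > 0])
toy[["capital.gain"]] <- ordered(cut(gain, c(-Inf, 0, split_point, Inf)),
                                 labels = c("None", "Low", "High"))

# Each row is one transaction and each attribute=value pair is one item.
# item_matrix has one row per transaction and one column per attribute.
item_matrix <- mapply(function(name, column) {
  paste(name, as.character(column), sep = "=")
}, names(toy), toy)

print(toy)
print(item_matrix[1, ])                      # items of the first transaction
print(length(unique(as.vector(item_matrix)))) # number of distinct items
```

Note that the ordered(cut(...), labels = ...) form relies on every interval containing at least one value; if a group were empty, the number of labels would no longer match the number of levels.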
PART 2

For this part we will use the apriori algorithm to create association rules from the census data.

5. Create association rules. Use minsup=0.5 and minconf=0.85.
> rules <- apriori(adult2, parameter = list(supp = 0.5, conf = 0.85, target = "rules", minlen=2))
> summary(rules)
> inspect(rules)
How many rules met the minimum threshold criteria? What interestingness measures are provided by default? Which rule has the highest value of each?
> inspect(head(sort(rules, by = "confidence")))

6. Add additional interestingness measures. Add IS (cosine) and the phi-coefficient.
> quality(rules) <- cbind(quality(rules), IS = interestMeasure(rules, "cosine", adult2), phi = interestMeasure(rules, "phi", adult2))
Which rule has the highest value of each of the new measures?

7. Find the rules with lift above 1 and rhs = "native.country= United-States". What are the top 3 rules?
> rules.sub <- subset(rules, subset = rhs %pin% "native.country= United-States" & lift > 1)
> inspect(sort(rules.sub)[1:3])

PART 3

In this part we will try to predict the class using the strongest rules.

8. Locate all rules that have the class attribute on the right-hand side.
> rules.sub <- subset(rules, subset = rhs %pin% "Class")
> rules.sub
> inspect(sort(rules.sub, by = "support")[1:10])
> inspect(sort(rules.sub, by = "confidence")[1:10])
How many have Class = >50K and how many have Class = <=50K? Comment on the support of the most confident rules. How confident are you about the rules with the highest support? How useful are the rules for classifying income?

9. Are there any attributes you would exclude from this analysis? Consider looking at the distribution of certain attributes.
> summary(adult)
Are any attributes heavily weighted towards one value? Which ones? Did you notice these attributes in the rules? How do they impact support and confidence? Is the class attribute weighted heavily towards one value? How might you change this attribute for our analysis (if you had the original income information)?

10. Repeat the association analysis without these attributes.
Will this create any new rules? Why or why not? How do your results change? Comment on your findings. What rules seem potentially useful (and why)?
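
When comparing the measures from steps 5 and 6, it can help to compute them by hand for a single rule X -> Y. The sketch below uses base R and the standard textbook definitions of support, confidence, lift, IS (cosine) and the phi-coefficient. The helper rule_measures() is hypothetical and is not part of arules, and the ten toy transactions are invented so that Y occurs in 9 of 10 transactions.

```r
# rule_measures() is a hypothetical helper for illustration, not an arules function.
# x and y are logical vectors: does each transaction contain X, respectively Y?
rule_measures <- function(x, y) {
  stopifnot(is.logical(x), is.logical(y), length(x) == length(y))
  p_x  <- mean(x)       # fraction of transactions containing X
  p_y  <- mean(y)       # fraction of transactions containing Y
  p_xy <- mean(x & y)   # fraction containing both X and Y
  c(support    = p_xy,
    confidence = p_xy / p_x,
    lift       = p_xy / (p_x * p_y),
    IS         = p_xy / sqrt(p_x * p_y),
    phi        = (p_xy - p_x * p_y) /
                 sqrt(p_x * (1 - p_x) * p_y * (1 - p_y)))
}

# Ten invented transactions: X occurs in the first six, Y in all but the sixth.
x <- rep(c(TRUE, FALSE), times = c(6, 4))
y <- c(rep(TRUE, 5), FALSE, rep(TRUE, 4))

m <- rule_measures(x, y)
print(round(m, 3))
```

In this toy case the rule has support 0.5 and confidence of about 0.83, yet its lift is below 1 and its phi-coefficient is negative, because Y is already present in 90% of the transactions.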