Data Mining TCSS555 Autumn 2007 Midterm Review 1. How is a data warehouse different from a database? How are they similar? 2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem. 1 Data Mining TCSS555 Autumn 2007 3. Discuss issues to consider during data integration. 4. Briefly outline the major steps of decision tree classification. 2 Data Mining TCSS555 Autumn 2007 5. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Answer the following: a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0]. b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years. c) Use normalization by decimal scaling to transform the value 35 for age. d) Comment on which method you would prefer to use for the given data, giving reasons as to why. 3 Data Mining TCSS555 Autumn 2007 6. Consider the association rule below, which was mined from the student database at Big- University: (Rule 1) Suppose that the number of students at the university (that is the number of task-relevant data tuples) is 5000, that 70% of the students are majoring in science, that 64% of the students are registered in programs leading to undergraduate degrees, and that 56% of the undergraduates at the university major in science. a) Compute the confidence and support for the rule (Rule 1). b) Consider (Rule 2) below. (Rule 2) Suppose that 30% of science students are majoring in biology. Would you consider (Rule 2) to be novel with respect to (Rule 1)? Explain. 4 Data Mining TCSS555 Autumn 2007 5 Data Mining TCSS555 Autumn 2007 7. a) Find all the frequent itemsets using APriori and FP-Growth, respectively, from the following transaction database: b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and itemi denotes variables representing items (e.g. “A”, “B”, etc.): 6 Data Mining TCSS555 Autumn 2007 7