IT 241 Information Discovery and Architecture    Exam 3    Page 1
December 1, 2011    Name _____________________________

This exam is open book and on-line, but no contact with a live person is allowed. Be careful with your time!

1. In what ways are data mining and data visualization connected? How does one support the other? [10 pts]
   Both reveal/confirm relationships among attributes in data; both allow ways to drill into data.
   Data mining can determine attributes to visualize; the output of the data mining may be visualized.
   Visualization may let you see patterns and confirm the patterns found by data mining, for example with correlation matrices.

2. Describe briefly the process(es) of data warehousing used in preparation for data mining. Do not just narrate the data warehousing architecture diagram. [5 pts]
   Identify data to be archived (not needed for day-to-day operations).
   Clean the data and determine attributes not needed for further data mining.
   Some aggregation/counts may be generated.
   Denormalize into a cube or star schema.

3. Give an example of shallow knowledge as found in the national census data. Similarly give an example of hidden knowledge as might be found in the national census data. [5 pts]
   Shallow: any simple query, counts, sums. These just return facts.
   Deep/hidden: any data mining example, such as rules, clusters, or regression.

4. Data types in data mining. [15 pts]
   a. Give an example of nominal data that cannot be converted to ordinal data.
      Favorite color
   b. Give an example of ordinal data that may be limited in conversion to numeric data.
      Interest levels (Likert, agree/disagree)
   c. Give an example of a numeric data set that can be mapped to nominal data. Explain the mapping.
      Hat sizes: numeric sizes map to small, medium, large.

5. In the data preprocessing of the rainfall.txt file for cities in the US, describe 4 problems found in the data set and what resolutions you took for those problems, before exporting it into CSV format. [15 pts]
   a. Duplicate data
   b. Zeroes or missing data
   c. Outliers, regarding city names
   d. Columns may be irrelevant

6. Describe how you might determine which attributes you could eliminate from a data set with dozens of attributes. [5 pts]
   Matrix of scatter plots
   Correlation matrix
   Attribute selection algorithms

7. Decision trees. [15 pts]
   a. Given the decision tree rule for the above dataset
         IF Sex=Male && WatchPromo=No THEN CreditCardInsurance=No
      determine its accuracy = _3 out of 4, or 75_ % and its coverage = _4 out of 15, or 27_ %
   b. Draw a decision tree to correspond with these three production rules. (Not all leaves are defined.)
         IF Sex=Female THEN CreditCardInsurance=No
         IF Sex=Male && WatchPromo=No THEN CreditCardInsurance=No
         IF Sex=Male && IncomeRange=30-40K THEN CreditCardInsurance=Yes

                     Sex
              male /     \ female
                  /       \
           WatchPromo     CCI=no
           no /    \ yes
             /      \
         CCI=no    Income
                   / 30-40K
               CCI=yes

   c. In predicting CreditCardInsurance, why is the entropy = 0 bits for Salary="50-60K"?
      Both instances map to 'no'.
   d. In predicting CreditCardInsurance, the entropy for Salary="30-40K" is expressed as
         info([ _3_ , _2_ ]) = entropy( _3_/_5_ , _2_/_5_ )
      There are 3 different numbers in these 6 blanks.

8. Association Rules. [15 pts]
   a. Using the credit card data from the previous page, identify 5 single-item sets that would be generated with a coverage threshold of 33%. (Exclude the age attribute.)

         Single-item set          Number of items
         A. Income = 30-40K       5
         B. Mag Promo = Yes       8
         C. Mag Promo = No        7
         D. Watch Promo = Yes     8
         E. Watch Promo = No      7
            LifeIns = Yes         9

   b. What pairings of your 5 item sets A-E, if any, also meet the 33% coverage threshold?
      A & B (5)
   c. If you had the pairing (which you may not necessarily have) of MagazinePromo=Yes and LifeInsPromo=Yes, what two rules could be expressed? And then, calculate their coverage as a ratio.
      i.  IF _MagPromo=yes_ THEN _LifeInsPromo=yes_ ( _6_ / _8_ )
      ii. IF _LifeInsPromo=yes_ THEN _MagPromo=yes_ ( _6_ / _9_ )

9. K-means algorithm. [8 pts]
   a. If you were to apply the K-means clustering algorithm to the credit card data set, what might you first do with the salary and sex attributes?
      Convert them to numeric.
   b. Why would you, or would you not, include the creditCardIns attribute?
      Do not use it, because of low coverage.

10. True/false. [7 pts]
   _T_ A data cube for data mining is created by multiple joins of tables from the operational and/or archival database.
   _F_ A data cube typically contains a mix of historical data and transactions in progress.
   _T_ A data cube will likely contain repetitive information.
   _T_ A pivot table is Excel's form of a data cube.
   _T_ Drill down and roll up are complementary operations on a data cube.
   _T_ Linear regression requires all attributes to be numerical.
   _F_ Linear regression modeling maximizes the distance of each instance to the calculated linear regression line.
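
The numbers filled in for question 7 can be checked with a few lines of Python. This is only a sketch under stated assumptions: the credit card data table itself is not reproduced in this text, so the counts below (3 of 4, 4 of 15, the [3, 2] split, and the two 'no' instances) are taken from the answers above rather than recomputed from the data. The function names `rule_stats` and `info` are illustrative, not part of any course tool.

```python
import math

TOTAL_INSTANCES = 15  # data set size, taken from the "4 out of 15" in question 7a


def rule_stats(antecedent_count, correct_count, total=TOTAL_INSTANCES):
    """Return (accuracy, coverage) in the sense used in question 7a.

    accuracy: fraction of instances matching the IF part that also match THEN.
    coverage: fraction of all instances that match the IF part.
    """
    accuracy = correct_count / antecedent_count
    coverage = antecedent_count / total
    return accuracy, coverage


def info(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with zero instances contribute nothing to the sum.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


# Question 7a: 4 instances match the IF part, 3 of them have CreditCardInsurance=No.
accuracy, coverage = rule_stats(antecedent_count=4, correct_count=3)

# Question 7d: info([3, 2]) = entropy(3/5, 2/5)
mixed_entropy = info([3, 2])

# Question 7c: both instances map to 'no', so the split is [2, 0].
pure_entropy = info([2, 0])

print(f"accuracy={accuracy:.0%} coverage={coverage:.0%}")
print(f"info([3, 2])={mixed_entropy:.3f} bits, info([2, 0])={pure_entropy:.3f} bits")
```

Because entropy is symmetric in its arguments, the result for question 7d does not depend on which of the two counts belongs to 'yes' and which to 'no'.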
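
The threshold and ratios in question 8 can be sketched the same way. The item counts are copied from the table in 8a and the pair count of 6 from the answers in 8c; none of it is recomputed from the data set, which is not reproduced here. The ratio computed for each rule (instances matching both sides over instances matching the IF side) is what question 8c asks for, and is what many texts call the rule's confidence. The names `item_counts`, `rule_ratio`, and `min_count` are illustrative assumptions.

```python
import math
from fractions import Fraction

TOTAL_INSTANCES = 15      # data set size, from question 7a
COVERAGE_THRESHOLD = 0.33  # from question 8a

# Single-item counts as listed in the answer to question 8a.
item_counts = {
    "Income=30-40K": 5,
    "MagPromo=Yes": 8,
    "MagPromo=No": 7,
    "WatchPromo=Yes": 8,
    "WatchPromo=No": 7,
    "LifeIns=Yes": 9,
}

# Smallest whole number of instances that reaches the threshold.
min_count = math.ceil(COVERAGE_THRESHOLD * TOTAL_INSTANCES)

# Item sets whose counts meet the threshold.
frequent = [item for item, n in item_counts.items() if n >= min_count]


def rule_ratio(both_count, antecedent_count):
    """Instances matching both sides of the rule over instances matching the IF side."""
    return Fraction(both_count, antecedent_count)


# Question 8c: 6 instances have both MagPromo=Yes and LifeIns=Yes.
BOTH = 6
mag_to_life = rule_ratio(BOTH, item_counts["MagPromo=Yes"])  # 6/8
life_to_mag = rule_ratio(BOTH, item_counts["LifeIns=Yes"])   # 6/9

print(f"minimum count for 33% of {TOTAL_INSTANCES}: {min_count}")
print(f"frequent single-item sets: {frequent}")
print(f"IF MagPromo=Yes THEN LifeIns=Yes: {float(mag_to_life):.2f}")
print(f"IF LifeIns=Yes THEN MagPromo=Yes: {float(life_to_mag):.2f}")
```

The two rules share the same numerator but have different denominators, which is why one pairing yields two rules with different ratios.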