Total Score 67 out of 70
Nancy McDonald
CSIS 5420 Homework Week 3

Score 10 out of 10
1.a. Data cleaning involves accounting for noise: random errors in attribute values, duplicate records, invalid attribute values, and the like. Data cleaning is preferably a preprocessing step, i.e. it is done before the data is permanently stored in something such as a data warehouse. Good

Data transformation is promoting consistency throughout the data. One type of transformation is data type conversion: if one data source uses male = 'm' and another source uses male = 1, the data must be modified so that all sources use the same value for male. Another form of data transformation is ensuring that numeric values for an attribute fall within a specified range. Data transformation is done to data pulled from a warehouse and is the step that precedes the actual data mining step in the "scientific method" of data mining. Good

b. Internal data smoothing is data smoothing that is done during classification, by the classification technique itself as the model is built. Good
External data smoothing takes place before classification. Examples are rounding and computing mean values. Good

c. Decimal scaling and z-score normalization are both techniques for normalizing attribute values so that attributes with a wide range of values are less likely to outweigh attributes with small ranges.
Decimal scaling takes attribute values whose magnitude is greater than one and divides them by a power of 10 so that they fall between -1 and 1. For example, if the attribute values range between -100 and 100, you would divide each value by 100 to produce values between -1 and 1. This technique is good when you know the minimum and maximum values the attribute can take. Good
Z-score normalization converts an attribute value to a "standard" score by taking the original value, subtracting the attribute mean from it (thus giving you this instance's deviation from the mean), and then dividing this result by the attribute standard deviation. The book says that this is a good technique when the maximum and minimum values are not known. Good
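To make these normalization techniques concrete (including the min-max normalization used in question 4 below), here is a minimal Python sketch. The helper names and the sample values are my own illustration and are not part of the data mining tool used elsewhere in this assignment.

    import numpy as np

    def decimal_scale(values):
        # Divide by the smallest power of 10 that pulls every value into [-1, 1].
        factor = 10 ** np.ceil(np.log10(np.max(np.abs(values))))
        return values / factor

    def z_score(values):
        # Standard score: deviation from the mean, in units of the standard deviation.
        return (values - values.mean()) / values.std()

    def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
        # Min-max normalization, as used for age in question 4.
        return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

    print(decimal_scale(np.array([-100.0, 55.0, 100.0])))      # -> [-1.0, 0.55, 1.0]
    print(z_score(np.array([19.0, 27.0, 35.0, 43.0, 55.0])))   # made-up sample ages
    print(min_max(35, old_min=19, old_max=55))                  # -> 0.444...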
Score 10 out of 10
2. Using sonaru.xls to perform unsupervised clustering, I played around with the instance similarity (i.s.) and real-tolerance (r-t) parameters until I formed 15 classes. My trials were:
i.s. = 45, r-t = 1.0: 2 classes
i.s. = 75, r-t = 0.5: 5 classes
i.s. = 90, r-t = 0.3: 15 classes
a. Yes, the clustering shows a class structure similar to the actual classes found within the data, because the resemblance scores for all 15 classes [greatly] exceed the domain resemblance score. Good
b. Classes whose instances naturally cluster together: I believe they are all classes except 2 and 7. Good
c. Classes that tend to intermix their instances are 2 and 7, whose resemblance scores are extremely low. Good
In general, s-deciduous, n-deciduous, dark-barren, br-barren-1, br-barren-2 and urban form their own clusters. Shallow and deep water cluster together. The two agricultural classes cluster together. Marsh, turf-grass, wooded_swamp and shrub_swamp form a single cluster.

Score 10 out of 10
3. Perform unsupervised data mining on CardiologyNumerical.xls using the 4-step process.
Step 1: Identify the goal. The goal is to determine whether the input attributes defined in this Excel file are appropriate choices for building a supervised learner model. To do this, I will apply unsupervised clustering to see if the 2 classes, healthy and sick, naturally form 2 clusters.
Step 2: Prepare the data. Load CardiologyNumerical.xls into a test file. Change class from O (output) to D (display-only) to perform unsupervised data mining. Note: healthy = 1, sick = 0.
Step 3: Apply data mining. I want 2 clusters, hopefully with mostly healthy instances in one and mostly sick instances in the other.
Step 4: Interpret the results. I tried several unsupervised data mining sessions to create 2 clusters. When I increased the instance similarity and left real-tolerance alone, I generated more than two classes. Likewise, when I decreased the real-tolerance I sometimes generated more than two classes. Here are the test parameters for the sessions that generated just 2 classes:
1. i.s. = 45, r-t = 1.0: 2 classes
2. i.s. = 45, r-t = 0.75: 2 classes
3. i.s. = 45, r-t = 0.9: 2 classes
The best results were with session 1, where real-tolerance was 1.0. Resemblance for both clusters was better than the domain resemblance: cluster 1 (where the majority of instances were sick) had a resemblance of .564 and cluster 2 (where the majority of instances were healthy) had a resemblance of .607. These compare to a domain resemblance of .52, so this is positive evidence of well-defined clustering. Cluster 1 had 9 rules and cluster 2 had over 30 rules. The book indicates that a wealth of rules also points to well-defined clusters.
Do healthy and sick cluster together? No. Cluster 1 is mainly sick instances (class mean = .2) and cluster 2 is mainly healthy instances (class mean = .84). One cluster contains 112 sick instances and 28 healthy instances; the second cluster contains 137 healthy instances and 26 sick instances. (A rough cross-check of this two-cluster result with a different clustering algorithm is sketched at the end of this write-up.)

Score 10 out of 10
4. Min-max normalization for age in Table 2.3:
old min = 19, new min = 0
old max = 55, new max = 1
formula: newValue = ((origValue - oldMin) / (oldMax - oldMin)) * (newMax - newMin) + newMin
newValue = ((origValue - 19) / (55 - 19)) * (1 - 0) + 0
Therefore, finding the new value for age = 35:
newValue = (35 - 19) / 36 = .444 Good

Score 10 out of 10
5. Element 1 of genetic testing:
input attributes: income range, credit card insurance, sex, age
fitness score: 60% correct
Element 2 of genetic testing:
input attributes: credit card insurance, age
fitness score: 60% correct
Element 3 of genetic testing:
input attributes: sex, age
fitness score: 80% correct
Very good

Score 10 out of 10
6.a. Healthy males = 93. Good
b. Healthy females with >= 3 colored vessels = 0. Good
c. When #colored vessels = 2 and angina = true, all individuals are sick. (When angina = true and #colored vessels = 1, there are still some healthy individuals.) Good
d. Hypothesis: the majority of individuals with #colored vessels = 0 are healthy. True. There are 134 healthy individuals with #colored vessels = 0 as opposed to 45 sick individuals with #colored vessels = 0. Good
e. Hypothesis: the typical healthy individual will show no symptoms of angina and will have a value of normal for thal. True. 93 out of 134 have no angina and thal = normal. The next closest group is 18 of 134 with thal = rev and angina = false. Third best is 13 of 134 with thal = normal and angina = true. Good

Score 7 out of 10
7.a. Hypothesis: individuals who purchased all 3 promotional offerings also purchased credit card insurance. False. By moving the magazine promo to the rows and adding credit card insurance to the columns, the results show that only 1 individual who purchased the other 3 promos also purchased credit card insurance. However, 3 individuals who purchased the other 3 promos did not purchase credit card insurance. Good
a. The hypothesis is false, as three of the four individuals who purchased all three promotions did not purchase credit card insurance.
b. two
c. zero
d. zero
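Questions 6 and 7 were answered with Excel pivot tables. The same counts can be reproduced programmatically; here is a minimal pandas sketch. The file name and column names below are guesses for illustration only, not the actual headers in the course spreadsheets.

    import pandas as pd

    # Load the credit card promotion data (file/column names are assumptions).
    df = pd.read_excel("CreditCardPromotion.xls")

    # Individuals who purchased all three promotional offerings (question 7).
    all_three = df[(df["magazine_promo"] == "yes")
                   & (df["watch_promo"] == "yes")
                   & (df["life_ins_promo"] == "yes")]

    # How many of them did / did not purchase credit card insurance,
    # the equivalent of the Excel pivot table used above.
    print(all_three["credit_card_ins"].value_counts())

    # A pivot-table-style cross tabulation over the whole data set.
    print(pd.crosstab(index=[df["magazine_promo"], df["watch_promo"], df["life_ins_promo"]],
                      columns=df["credit_card_ins"]))

    # The same idea (e.g. pd.crosstab(cardio["sex"], cardio["class"]))
    # reproduces the healthy/sick counts used in question 6.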
P.S. These pivot tables are neat!
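Finally, returning to question 3: the clustering there was done with the course's data mining tool, which is not reproduced here. As a rough cross-check only, the sketch below applies a different algorithm (k-means from scikit-learn) to the same idea: form two clusters and compare them against the healthy/sick labels. The "class" column name is an assumption, and k-means will not give exactly the cluster counts reported above.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Load the cardiology data; in the assignment, healthy = 1 and sick = 0.
    df = pd.read_excel("CardiologyNumerical.xls")
    labels = df["class"]
    features = df.drop(columns=["class"])

    # Standardize so wide-range attributes do not dominate (see question 1c),
    # then ask for two clusters.
    X = StandardScaler().fit_transform(features)
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Compare cluster membership against the known healthy/sick labels,
    # analogous to the 112/28 and 137/26 split reported in question 3.
    print(pd.crosstab(clusters, labels, rownames=["cluster"], colnames=["class"]))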