Matthew Kolker
CSIS 5420
Week 3 Assignment

1. Differentiate between the following terms:

a. Data cleaning and data transformation.

Data transformation is used to promote data consistency when transporting data. It is needed when data comes from multiple sources that do not always share the same format or that use different codes. Data cleaning is a form of data preprocessing that addresses data noise and missing information. Whereas data transformation works to make data look alike, data cleaning works to make sure that the data is complete and reasonably accurate.

b. Internal and external data smoothing.

Internal data smoothing occurs during the classification process, whereas external data smoothing occurs prior to classification. Two common forms of external data smoothing are rounding and computing mean values.

c. Decimal scaling and Z-score normalization.

Decimal scaling and Z-score normalization both strive to bring data into a manageable range; however, decimal scaling does this by dividing values by a fixed power of 10, whereas Z-score normalization converts each value to a standard score (the number of standard deviations it lies from the attribute mean). Decimal scaling is more applicable when the range of values is known, while Z-score normalization is better when the minimum and maximum values are not known.

2. Perform an unsupervised clustering using the sonaru.xls data file. Experiment with the instance similarity and real-tolerance parameters in an attempt to form 15 well-defined clusters similar to the actual classes contained in the data. Be sure to designate the class attribute as display-only.

a. Does the clustering show a class structure similar to the actual classes found within the data?

I could not get this to occur. I tried 95/~0.6 and 40/~0.05 for my parameter values. Both settings produce clusters for urban but group many of the other classes together.

b. Which classes show instances that naturally cluster together?

Urban.

c. Which classes tend to intermix their instances with the instances of other classes?

The shallow_water and deep_water instances tended to be clustered together.

3. The CardiologyNumerical.xls data file contains the same instances as the CardiologyCategorical.xls file but the categorical attributes have been changed to numeric equivalents.

a. Load the CardiologyNumerical.xls file into Excel. Please refer to Table 2.1 (page 38) to see how the categorical attributes are mapped to corresponding numerical equivalents. For example, the table shows that values male and female for attribute sex are replaced with a 1 and a 0. Likewise the values angina, abnormal angina, noTang, and asymptomatic for attribute chest pain type are respectively replaced with 1, 2, 3, and 4. Note that the class attribute represents an instance of the healthy class with a 1 and an instance of the sick class with a 0.

Done.

b. Perform the first experiment in Section 5.10 (page 164) using this dataset. Follow the four-step model and write a short description of what was found at each step of the process. Be sure to designate the class attribute as display-only. You may have to manipulate the real-tolerance setting to get a best result. Do the healthy and sick instances cluster together?

Step 1: 306 instances with 15 input attributes each.
Step 2: Changed Class from “O” to “D”.
Step 3: Used 45 and 1 for the parameters to create 2 classes.
Step 4: The class resemblance scores were both higher than the domain resemblance score, indicating positive evidence of well-defined clustering. The means of the “class” attribute for the two clusters were approximately 0.8 and 0.2, indicating that the healthy instances and the sick instances each tended to cluster together.

4. Set up a general formula for a Min-Max normalization as it would be applied to the attribute age for the data in Table 2.3 (page 44). Transform the data so the new minimum value is 0 and the new maximum value is 1.
Apply the formula to determine a transformed value for age = 35.

General formula:

newValue = ((originalValue - oldMin) / (oldMax - oldMin)) × (newMax - newMin) + newMin

With oldMin = 19, oldMax = 55, newMin = 0, and newMax = 1:

newValue = (originalValue - 19) / (55 - 19) = (originalValue - 19) / 36

For age = 35:

newValue = (35 - 19) / 36 = 16 / 36 = 0.444

5. Use the CreditCardPromotion.xls data file together with the initial element population shown in Table 5.1 (page 159) to perform the first iteration of the genetic algorithm for attribute selection. Specify life insurance promotion as the output attribute. Use 10 instances for training and the remaining instances as a test set. Use classification correctness on the test data as your fitness function. List the fitness scores for each of the three elements of the initial population.

Element 1: 40%
Element 2: 40%
Element 3: 40%

6. Construct a pivot table with the CardiologyCategorical.xls database file. Make angina and thal row attributes and class a column attribute. Place class, angina, and thal in the data area. Specify slope, sex, and #colored vessels as page variables. Use the pivot table to answer the following questions:

a. How many healthy males are in the database?

93

b. How many healthy females have three or more colored vessels?

0

c. Determine values for #colored vessels and angina that are sufficient for defining a sick individual.

Angina = True, #Colored Vessels = 2

d. Verify or refute the hypothesis: The majority of individuals with #colored vessels = 0 are healthy.

True: 134/179 = 75%

e. Verify or refute the hypothesis: A typical healthy individual will show no symptoms of angina and will have a value of normal for attribute thal.

True: 115/165 = 70%

7. Recreate the pivot table shown in Figure 6.16 (page 206) to answer the following:

a. How many cardholders did not purchase a single promotional offering?

2

b. How many cardholders took advantage of the magazine and watch promotions but did not purchase the life insurance promotion?

0

c. How many male cardholders make between $50,000 and $60,000?

0

d. Verify or refute the hypothesis: Individuals who purchased all three promotional offerings also purchased credit card insurance.

False: only 1 of the 4 individuals who purchased all three promotional offerings also purchased credit card insurance.
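
As a rough check on the arithmetic in question 4 and the definitions in question 1c, the three normalizations can be sketched in Python. This is only an illustrative sketch, not part of the assignment: the function names are made up here, the three sample ages are hypothetical (they are not the Table 2.3 data; only the minimum of 19 and maximum of 55 come from question 4), and the Z-score is assumed to use the population standard deviation.

```python
from statistics import mean, pstdev


def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def decimal_scale(values):
    """Divide every value by the smallest power of 10 that brings all |v| below 1."""
    largest = max(abs(v) for v in values)
    k = 0
    while largest / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]


def z_scores(values):
    """Express each value as its distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("values must not all be identical")
    return [(v - mu) / sigma for v in values]


# Age range from question 4: minimum 19, maximum 55.
age_35 = min_max(35, old_min=19, old_max=55)
print(round(age_35, 3))  # 0.444

# Hypothetical ages, for illustration only (not the Table 2.3 data).
sample_ages = [19, 35, 55]
print(decimal_scale(sample_ages))  # [0.19, 0.35, 0.55]
print([round(z, 3) for z in z_scores(sample_ages)])
```

Note that min_max and decimal_scale both need the range of the data (or at least its largest absolute value) up front, while z_scores needs only the mean and standard deviation, which matches the distinction drawn in question 1c.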