Kolker-Week3 CSIS5420

Matthew Kolker
CSIS 5420
Week 3 Assignment
1. Differentiate between the following terms:
a. Data cleaning and data transformation.
Data transformation promotes consistency when data is moved between systems. It is
needed when data comes from multiple sources that do not share a format or that use
different codes for the same values. Data cleaning is a form of data preprocessing that
addresses noise and missing information. Whereas data transformation makes data from
different sources look alike, data cleaning makes sure that the data is complete and
relatively accurate.
b. Internal and external data smoothing.
Internal data smoothing occurs during the classification process whereas external data
smoothing occurs prior to classification. Two common forms of external data smoothing
are rounding and computing mean values.
c. Decimal scaling and Z-score normalization.
Decimal scaling and Z-score normalization both strive to get data into a manageable range;
however, decimal scaling does this by dividing values by a fixed power of 10 whereas
Z-score normalization converts each value to a standard score (its distance from the mean
in standard deviations). Decimal scaling is more applicable when there is a known range
of values while Z-score normalization is better when the minimum and maximum values
are not known.
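The two normalizations in 1c can be sketched as follows. This is only an illustration: the function names and sample values are hypothetical, and the use of the population standard deviation is an assumption (a sample standard deviation would work the same way).

```python
import statistics


def decimal_scale(values):
    """Divide every value by the smallest power of 10 that brings
    every magnitude below 1."""
    largest = max(abs(v) for v in values)
    k = 0
    while largest / (10 ** k) >= 1:
        k += 1
    return [v / (10 ** k) for v in values]


def z_score(values):
    """Convert each value to its distance from the mean, measured in
    standard deviations (population standard deviation assumed)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mean) / std for v in values]


# Hypothetical sample values, not taken from the assignment data.
print(decimal_scale([120, 450, 985]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

Note that decimal scaling needs the largest magnitude up front to pick the divisor, whereas the Z-score only needs the mean and standard deviation.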
2. Perform an unsupervised clustering using sonaru.xls data file. Experiment with the
instance similarity and real-tolerance parameters in an attempt to form 15 well-defined
clusters similar to the actual classes contained in the data. Be sure to designate the class
attribute as display-only.
a. Does the clustering show a class structure similar to the actual classes found within the
data?
I could not get this to occur. I tried 95/~.6 and 40/~.05 for my parameter values. Both
settings produce clusters for urban but group many of the other classes together.
b. Which classes show instances that naturally cluster together?
Urban.
c. Which classes tend to intermix their instances with the instances of other classes?
I noticed that the shallow_water and deep_water instances tended to be together.
3. The CardiologyNumerical.xls data file contains the same instances as the
CardiologyCategorical.xls file but the categorical attributes have been changed to
numeric equivalents.
a. Load the CardiologyNumerical.xls file into Excel. Please refer to Table 2.1 (page 38)
to see how the categorical attributes are mapped to corresponding numerical equivalents.
For example, the table shows that values male and female for attribute sex are replaced
with a 1 and a 0. Likewise the values angina, abnormal angina, noTang, and
asymptomatic for attribute chest pain type are respectively replaced with 1, 2, 3, and 4.
Note that the class attribute represents an instance of the healthy class with a 1 and an
instance of the sick class with a 0.
Done.
b. Perform the first experiment in Section 5.10 (page 164) using this dataset. Follow the
four-step model and write a short description of what was found at each step of the
process. Be sure to designate the class attribute as display-only. You may have to
manipulate the real-tolerance setting to get a best result. Do the healthy and sick instances
cluster together?
Step 1
306 instances with 15 input attributes each.
Step 2
Changed Class from “O” to “D”.
Step 3
Used 45 and 1 for parameters to create 2 classes.
Step 4
The class resemblance scores were both higher than the domain resemblance score,
indicating positive evidence of well-defined clustering.
The means for the “class” attribute for the clusters were approximately 0.8 and 0.2,
indicating that the healthy and sick classes did tend to cluster together.
4. Set up a general formula for a Min-Max normalization as it would be applied to the
attribute age for the data in Table 2.3 (page 44). Transform the data so the new minimum
value is 0 and the new maximum value is 1. Apply the formula to determine a
transformed value for age = 35.
newValue = ((originalValue – oldMin) / (oldMax – oldMin)) * (newMax – newMin) + newMin
= ((originalValue – 19) / (55 – 19)) * (newMax – newMin) + newMin
With newMin = 0 and newMax = 1:
newValue = (originalValue – 19) / (55 – 19) = (originalValue – 19) / 36
newValue = (35 – 19) / 36 = 16 / 36 = .444
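As a quick check of the calculation above, the Min-Max formula can be written as a small function. The function name and parameter names are hypothetical; the bounds 19 and 55 are the minimum and maximum age used in the answer.

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# age = 35 with the age range 19..55 used in the answer
print(round(min_max(35, 19, 55), 3))  # 0.444
```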
5. Use the CreditCardPromotion.xls data file together with the initial element population
shown in Table 5.1 (page 159) to perform the first iteration of the genetic algorithm for
attribute selection. Specify life insurance promotion as the output attribute. Use 10
instances for training and the remaining instances as a test set. Use classification
correctness on the test data as your fitness function. List the fitness scores for each of the
three elements of the initial population.
Element 1: 40%
Element 2: 40%
Element 3: 40%
6. Construct a pivot table with the CardiologyCategorical.xls database file. Make angina
and thal row attributes and class a column attribute. Place class, angina, and thal in the
data area. Specify slope, sex, and #colored vessels as page variables. Use the pivot table
to answer the following questions:
a. How many healthy males are in the database?
93
b. How many healthy females have three or more colored vessels?
0
c. Determine values for #colored vessels and angina that are sufficient for defining a sick
individual.
Angina = True
#Colored Vessels = 2
d. Verify or refute the hypothesis: The majority of individuals with #colored vessels = 0
are healthy.
True: 134/179 = 75%
e. Verify or refute the hypothesis: A typical healthy individual will show no symptoms of
angina and will have a value of normal for attribute thal.
True: 115/165 = 70%
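The percentages in 6d and 6e can be rechecked from the pivot-table counts reported above. This is a minimal sketch: the helper name is hypothetical, and treating "majority" and "typical" as "more than half" is an assumption about how the hypotheses should be read.

```python
def share_and_majority(matching, total):
    """Return the fraction matching/total and whether it exceeds one half."""
    share = matching / total
    return share, share > 0.5


# Counts as reported in answers 6d and 6e.
for label, matching, total in [("6d", 134, 179), ("6e", 115, 165)]:
    share, holds = share_and_majority(matching, total)
    verdict = "supported" if holds else "refuted"
    print(f"{label}: {matching}/{total} = {share:.1%} -> {verdict}")
```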
7. Recreate the pivot table shown in Figure 6.16 (page 206) to answer the following:
a. How many cardholders did not purchase a single promotional offering?
2
b. How many cardholders took advantage of the magazine and watch promotions but did
not purchase the life insurance promotion?
0
c. How many male cardholders make between $50,000 and $60,000?
0
d. Verify or refute the hypothesis: Individuals who purchased all three promotional
offerings also purchased credit card insurance.
False: only 1 of 4 that purchased all three promotional offerings also purchased credit
card insurance.