McDonald - Homework Week 3

Total Score 67 out of 70
Nancy McDonald
CSIS 5420
Homework Week 3
Score 10 out of 10
1.a.
Data cleaning involves accounting for noise (random errors in attribute values,
such as duplicate records or invalid attribute values). Data cleaning is preferably a preprocessing step, i.e. it is done before the data is permanently stored in something such as
a data warehouse. Good
Data transformation is promoting consistency throughout the data. One type of
transformation is Data Type Conversion: if one data source uses male = ‘m’ and another
source uses male =1, the data must be modified so that all attributes use the same value
for male. Another form of data transformation is ensuring that numeric values for an
attribute fall within a specified range. Data transformation is done to data pulled from a
warehouse and is the step that precedes the actual data mining step in the “scientific
method” of data mining. Good
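The two transformations described above (unifying value codes and checking a numeric range) can be sketched in Python. This is only an illustration: the sample records, the 0 = female code, the 0 to 120 age range, and the names SEX_CODES and transform are assumptions, not part of the assignment data.

```python
# Hypothetical records from two sources that encode sex differently.
records = [
    {"source": "A", "sex": "m", "age": 35},
    {"source": "B", "sex": 1, "age": 42},
    {"source": "A", "sex": "f", "age": 29},
    {"source": "B", "sex": 0, "age": 130},  # age outside the assumed range
]

# Assumed mapping: source B uses 1 for male and 0 for female.
SEX_CODES = {"m": "m", "f": "f", 1: "m", 0: "f"}


def transform(record, age_min=0, age_max=120):
    """Return a copy with a unified sex code and a flag for in-range age."""
    out = dict(record)
    out["sex"] = SEX_CODES[record["sex"]]
    out["age_in_range"] = age_min <= record["age"] <= age_max
    return out


cleaned = [transform(r) for r in records]
```

After this step every record uses the same code for male, and out-of-range ages are flagged rather than silently changed.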
b.
Internal data smoothing is data smoothing that is done during [attribute]
classification. Good
External data smoothing takes place before classification. Examples of this are
rounding and computing mean values. Good
c. Decimal scaling and z-score normalization are both techniques to normalize attribute
values so that attributes with a wide range of values are less likely to outweigh attributes
with small ranges.
Decimal scaling is taking attribute values whose magnitude is greater than one and
converting them to values between –1 and 1. For example, if the attribute values will
range between –100 and 100, you would divide the value by 100 to produce values
between –1 and 1. This technique is good when you know that attribute values will be
between a minimum and a maximum value. Good
Z-score normalization converts an attribute value to a “standard” score by taking
the original value, subtracting the attribute mean from it (thus giving you this instance’s
deviation from the mean), and then dividing this result by the attribute standard deviation.
The book says that this is a good technique when the maximum and minimum values are
not known. Good
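Both techniques from part c can be sketched in a few lines of Python. This is a minimal illustration under two assumptions: decimal scaling divides by the smallest power of 10 that brings every magnitude to at most 1 (which matches the –100 to 100 example), and the z-score uses the population standard deviation. The function names are hypothetical.

```python
import math


def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| to at most 1."""
    max_abs = max(abs(v) for v in values)
    power = 0
    while max_abs / (10 ** power) > 1:
        power += 1
    return [v / (10 ** power) for v in values]


def z_scores(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    # Note: std is zero when all values are equal; this sketch does not handle that case.
    return [(v - mean) / std for v in values]


scaled = decimal_scale([-100, 50, 100])
standardized = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

Decimal scaling needs the largest magnitude in the data, whereas the z-score needs only the mean and standard deviation.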
Score 10 out of 10
2. Using sonaru.xls to perform unsupervised clustering, I played around with instance
similarity and real-tolerance parameters until I formed 15 classes. My trials were:
i.s. = 45, r-t = 1.0, classes = 2
i.s. = 75, r-t = 0.5, classes = 5
i.s. = 90, r-t = 0.3, classes = 15
a. Yes, the clustering shows a class structure similar to the actual classes found within
the data because the resemblance scores for all 15 classes [greatly] exceed the
domain resemblance score. Good
b. Classes whose instances naturally cluster together – I believe they are all classes
except 2 & 7. Good
c. Classes that tend to intermix their instances are 2 and 7 whose resemblance scores
are extremely low. Good
In general, s-deciduous, n-deciduous, dark-barren, br-barren-1, br-barren-2 and urban form their own clusters. Shallow and deep water
cluster together. The two agricultural classes cluster together. Marsh,
turf-grass, wooded_swamp as well as shrub_swamp form a single
cluster.
Score 10 out of 10
3. Perform unsupervised data mining on CardiologyNumerical.xls using the 4-step
process.
Step 1: Identify Goal
The goal is to determine if the input attributes defined in the above Excel file are
appropriate choices for building a supervised learner model. To do this, I will apply
unsupervised clustering to see if 2 classes, healthy and sick, naturally form 2 clusters.
Step 2: Prepare the data.
Load CardiologyNumerical.xls into a test file. Change class from O (output) to D
(display-only) to perform unsupervised data mining. Note: healthy = 1, sick = 0.
Step 3: Apply data mining. I want 2 clusters - hopefully with mostly healthy in one and
mostly sick in the other.
Step 4: I tried several unsupervised data mining sessions to create 2 clusters. When I
increased the instance similarity and left real-tolerance alone, I generated more than two
classes. Also, I sometimes decreased the real-tolerance and generated more than two
classes. Here are the test parameters for sessions generating just 2 classes:
1. i.s. = 45, r-t = 1.0, classes = 2
2. i.s. = 45, r-t = 0.75, classes = 2
3. i.s. = 45, r-t = 0.9, classes = 2
The best results were with session 1, when real-tolerance was 1.0. Resemblance for both
clusters was better than the domain resemblance. Cluster = 1 (where the majority of
instances were sick) had a resemblance of .564 and Cluster = 2 (where the majority of
instances were healthy) had a resemblance of .607. These compare to a domain
resemblance of .52, so this is positive evidence of well-defined clustering. Cluster 1 had
9 rules and Cluster 2 had over 30 rules. The book states that a wealth of rules also
indicates well-defined clusters.
Do healthy and sick cluster together? No. Cluster 1 is mainly sick instances (class
mean = .2) and Cluster 2 is mainly healthy instances (class mean = .84).
One cluster will contain 112 sick instances and 28 healthy instances. The
second cluster will contain 137 healthy instances and 26 sick instances.
Score 10 out of 10
4. Min-Max Normalization for age in Table 2.3
old min = 19 new min = 0
old max = 55 new max = 1
formula:
newValue = ((origValue-oldMin)/(oldMax-oldMin))*(newMax-newMin)+newMin
newValue = ((origValue-19)/(55-19))*(1-0)+0
therefore, finding the new value for age = 35:
newValue = (35-19)/36 = .444 Good
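The age = 35 calculation can be checked with a short Python sketch of min-max normalization. The function name min_max and its parameter names are illustrative choices, not taken from the book.

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale value from [old_min, old_max] to [new_min, new_max]."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Age range from Table 2.3: old min = 19, old max = 55.
print(round(min_max(35, 19, 55), 3))  # 0.444
```

The old minimum maps to the new minimum and the old maximum to the new maximum, so 19 becomes 0 and 55 becomes 1.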
Score 10 out of 10
5. Element 1 of genetic testing:
input attributes: income range, credit card insurance, sex, age
fitness score: 60% correct
Element 2 of genetic testing:
input attributes: credit card insurance, age
fitness score: 60% correct
Element 3 of genetic testing:
input attributes: sex, age
fitness score: 80% correct
Very good
Score 10 out of 10
6.a. Healthy males = 93. Good
b. Healthy females with >=3 colored vessels = 0. Good
c. When #colored vessels = 2 and angina = true, all individuals are sick. (When angina
= true and #colored vessels = 1, there are still some healthy individuals.) Good
d. Hypothesis: majority of individuals with #colored vessels = 0 are healthy.
True. 134 healthy individuals with #colored vessels = 0 as opposed to 45 sick
individuals with #colored vessels = 0. Good
e. Hypothesis: typical healthy individual will show no symptoms of angina and will
have a value of normal for thal.
True. 93 out of 134 have no angina and thal = normal.
The next closest is 18 of 134 have thal = rev and angina = false.
Third best is 13 of 134 have thal = normal and angina = true. Good
Score 7 out of 10
7.a. Hypothesis: Individuals who purchased all 3 promotional offerings also purchased
credit card insurance.
False. By moving magazine promo to the rows and adding credit card insurance to
the columns, the results show that only 1 individual who purchased the other 3 promos
also purchased credit card insurance. However, 3 individuals who purchased the other 3
promos did not purchase credit card insurance. Good
a. two
b. zero
c. zero
d. The hypothesis is false as three of the four individuals who
purchased all three promotions did not purchase credit card
insurance.
P.S. These pivot tables are neat!