Statistics 503X Project Breast Cancer Data

Younghun Han
Norbert Karp
Sven Stanzel
1. Description
Breast cancer is a fairly common and severe disease among women, so it would be very
helpful to know whether the kind of tumor can be predicted. There are two kinds of
tumors, benign and malignant. Any tumor, whether benign or malignant, may cause death
through local effects if it is unfavorably situated. The common and more specific
definition of malignancy implies an inherent tendency of the tumor's cells to
metastasize (invade the body widely and become disseminated by subtle means) and
eventually to kill the patient unless all the malignant cells can be eradicated. Once a
tumor has been detected, it is impossible to say whether it is benign or malignant without surgery.
Current cancer treatment depends on drugs and hormones (chemotherapy), surgery,
radiation therapy, or a combination of these. The earlier cancer is diagnosed and the
sooner treatment can be implemented, the greater the chances of a successful cure.
According to the National Cancer Institute's "What You Need to Know About Breast
Cancer", the following are known risk factors for breast cancer:
age, family history, personal history (women who have already had breast cancer face an
increased risk of getting breast cancer again), menstruating at an early age (before 12)
and having the first child after the age of 30.
The data was provided by Richard D. De Veaux. He studied 1622 women who went into
surgery after a tumor had been detected in their breasts. The only risk factor mentioned
above that is included in the data is age. Additionally there are physical measurements for
each patient.
The data available are:
Histologie:
Result of the histological analysis after surgery:
the tumor is benign (0) or malignant (1)
Age:
Age of the patient
Typsein:
type of tissue:
0 = light,
1 = dense
Cote:
location of the tumor
0 = left,
1 = right
Taille:
size of the suspicious cluster in mm
Nombre:
Number of microcalcifications in the cluster
1 = <10,
2 = 10-30,
3 = 30+,
4 = 10-20,
5 = 20-30
Foyer:
Number of suspicious clusters, 1 or 2
Forme:
Shape of the microcalcifications:
1-5: the order corresponds to the degree of malignancy
from least to most.
Polymorphisme:
Are there many types of microcalcifications in one cluster?
0 = no,
1 = yes
Contour:
shape of the cluster
1 = circular,
2 = angular,
3 = other
Retro:
Is the cluster under the nipple?
0 = no,
1 = yes
Profondeur:
Are the microcalcifications deep under the skin?
0 = no,
1 = yes
We want to use the data to see which variables are related to the state of histologie and if
we can predict the kind of tumor.
2. Suggested Approaches
Data Restructuring
Reason:
Recode the variable Nombre as Nombre2 by deleting observations with the
value 2 for Nombre, because levels 4 (10-20) and 5 (20-30) are
subgroups of level 2 (10-30) in the data set and we do not have any
information about which of those subgroups an observation with level 2
belongs to.
The new groups (levels of Nombre2) are now:
1: <10
2: 10-20
3: 20-30
4: 30+
We deleted 226 of the 1622 original observations that had missing
values after checking that the distribution for the other variables for
the deleted observations was the same as for the remaining
observations.
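The recoding of Nombre into Nombre2 described above can be sketched in Python as follows. This is only an illustration under assumptions: the original work was done in SAS, the function name `recode_nombre` and the dictionary-based record layout are hypothetical, and the toy records are made up rather than taken from the study data.

```python
# Hypothetical helper: recode the original Nombre levels into Nombre2.
# Original levels: 1 = <10, 2 = 10-30, 3 = 30+, 4 = 10-20, 5 = 20-30.
# New levels:      1 = <10, 2 = 10-20, 3 = 20-30, 4 = 30+.
NOMBRE_TO_NOMBRE2 = {1: 1, 4: 2, 5: 3, 3: 4}  # level 2 (10-30) has no target


def recode_nombre(records):
    """Return copies of the records with a Nombre2 field added.

    Records whose Nombre is 2 (the ambiguous 10-30 group) or missing
    are dropped, mirroring the deletion step described in the text.
    """
    recoded = []
    for rec in records:
        level = rec.get("Nombre")
        if level not in NOMBRE_TO_NOMBRE2:
            continue  # ambiguous or missing value: drop the observation
        new_rec = dict(rec)
        new_rec["Nombre2"] = NOMBRE_TO_NOMBRE2[level]
        recoded.append(new_rec)
    return recoded


# Made-up toy records, not taken from the study data.
toy = [{"Nombre": 1}, {"Nombre": 2}, {"Nombre": 3},
       {"Nombre": 4}, {"Nombre": 5}, {"Nombre": None}]
print([r["Nombre2"] for r in recode_nombre(toy)])  # [1, 4, 2, 3]
```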
Create dummy variables for all categorical variables to aid in LDA
and Logistic Regression
For each of the categorical variables k-1 dummy variables must be
created, where k is the number of levels of that categorical variable
Each dummy variable has two possible values:
1, if the variable is at the level for which you create the dummy
variable
0, otherwise (the variable takes on any other level)
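As a minimal sketch of this k-1 dummy coding (in Python rather than the SAS used for the project), the hypothetical function `dummy_code` below treats the first listed level as the reference level that receives no dummy variable. Which level serves as the reference is an assumption made here for illustration.

```python
def dummy_code(name, value, levels):
    """Return the k-1 dummy variables for one categorical value.

    `levels` lists all k levels; the first one is used as the reference
    level (an assumption for this sketch) and gets no dummy variable.
    """
    if value not in levels:
        raise ValueError(f"unknown level {value!r} for {name}")
    return {f"{name}={lvl}": int(value == lvl) for lvl in levels[1:]}


# Contour has three levels, so two dummy variables are created.
print(dummy_code("Contour", 2, [1, 2, 3]))  # {'Contour=2': 1, 'Contour=3': 0}
print(dummy_code("Contour", 1, [1, 2, 3]))  # {'Contour=2': 0, 'Contour=3': 0}
```

An observation at the reference level has zeros in all of its dummy variables, which is why k-1 dummies are enough to represent k levels.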
Summary Statistics for the continuous variables
Reason:
extract location/scale information
Type of questions
addressed:
"What is the average age?",
"What is the average cluster size?"
Histograms for the variables Age and Taille
Reason:
explore univariate distributions
Type of questions
addressed:
"Are there unusual patterns in the distributions of Age and Taille?"
(Outliers, distribution shape, quirky structures)
One-way tables for the categorical variables
Reason:
explore univariate distributions
Type of questions
addressed:
"What do the distributions of the categorical variables look like?"
Two-way tables for Histologie versus the categorical variables
Reason:
Explore bivariate distribution and dependencies
between Histologie and the categorical variables
Type of questions
addressed:
"Which categorical variables are related to Histologie?"
Three-way tables for Histologie versus combinations of the categorical variables
Reason:
Explore multivariate distribution and dependencies between
Histologie and the categorical variables
Type of questions
addressed:
"Which categorical variables are related to Histologie?"
Dotplots for Age and Taille color-coded with respect to Histologie
Reason:
Examine dependencies between Histologie and Age and Taille
Type of questions
addressed:
"Which continuous variables are related to Histologie?"
Pairwise Scatterplot for Age and Taille color-coded with respect to Histologie
Reason:
Explore bivariate distribution and dependencies between Histologie
and the continuous variables
Type of questions
addressed:
"Is Histologie related to the combination of Age and Taille?"
Mosaic Plots
Reason:
Explore multivariate distribution and dependencies between
Histologie and the categorical variables
Type of questions
addressed:
"Which categorical variables are related to Histologie?"
Cluster Analysis
Reason:
Find clusters of "similar" patients
Type of questions
addressed:
"Are there clusters of similar patients?", "And if so, what is the
relation to Histologie?"
Principal Components Analysis for the explanatory variables
Reason:
Reduce the dimensionality of the data
Type of questions
addressed:
"Can the data be described by few principal components without loss
of information?" , "And if so what is the relationship between these
principal components and Histologie?"
Numerical Analysis
Methods:
CART
Neural Networks
Linear Discriminant Analysis
Type of questions
addressed:
"Can we find a classification rule for Histologie based on the other
measurements?"
Logistic Regression
Reason:
Determine the factors most important for Histologie
Type of questions
addressed:
"Which factors are useful in predicting Histologie?"
3. Actual Approaches
3.1 Summary statistics for the continuous variables
Total number of observations = 1396
Number of variables = 13
variable                    average   standard deviation   min   max
Age (years)                 51.2      9.27                 22    86
Size of the cluster (mm)    15.3      12.88                2     100
3.2 Histograms of variables Age and Taille
The distribution of variable Age (Plot 3.2.a.) is roughly symmetric. It is unimodal with a
peak around 50. For us it is quite surprising that there are some patients with age below
40.
The distribution of Taille (Plot 3.2.b.) is heavily skewed to the right. It is unimodal with a
peak at 0-10. Most of the tumors have a cluster size of below 20 mm; cluster sizes above
50 mm are pretty rare.
3.3 One-way tables for the categorical variables
Histologie    0      1
Count         816    580
58.4% of the tumors are benign.
Typsein    0      1
Count      703    693
The percentage of light and dense tissue is almost the same.
Cote     0      1
Count    675    721
The distributions of tumors on the left and right side are approximately equal.
Foyer    1       2
Count    1004    392
Most of the patients (71.9%) have only one suspicious cluster. There are no more than 2
suspicious clusters in each patient.
Forme    1     2      3      4      5
Count    29    328    355    500    184
Most of the observations have levels two, three or four ( 84.7% ).
Polymorphisme    0      1
Count            561    835
In 59.8% of the cases there are many types of microcalcifications in one cluster.
Contour    1      2      3
Count      383    598    415
Angular cluster shape (42.8%) is more common than circular cluster shape (27.4%). For
almost 30% of the observed women the shape of the suspicious cluster(s) is neither
circular nor angular.
Retro    0       1
Count    1167    229
In 83.6% of the cases the cluster was not under the nipple.
Profondeur    0      1
Count         583    813
Microcalcifications are deep under the skin in 58.2% of the observations.
Nombre2    1      2      3      4
Count      190    371    293    542
In 38.8% of the cases there are more than 30 microcalcifications per cluster.
Note: Only the counts for levels 1 and 4 are not affected by deleting observations
that belong to the original level 2.
3.4 Two-way tables for the categorical variables and Histologie
                Typsein
Histologie      0      1
0               438    378
1               265    315
If tissue is light then more tumors are benign, while for dense tissues the ratio of benign
and malignant tumors is close to 1.
                Cote
Histologie      0      1
0               400    416
1               275    305
The ratios of benign and malignant tumors are roughly the same for both levels of Cote.
                Foyer
Histologie      1      2
0               624    192
1               380    200
When the number of suspicious clusters is 1, 62.1% of the tumors are benign whereas
when there are 2 suspicious clusters, the ratio of benign and malignant tumors is almost
1.
                Forme
Histologie      1     2      3      4      5
0               28    264    253    243    28
1               1     64     102    257    156
If level of Forme is between 1 and 3, the proportion of benign tumors is 76.5%, while for
levels 4 and 5 the rate of malignant tumors is higher. For level 5, 84.8% of the tumors are
malignant.
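The percentages quoted in this section are conditional proportions computed within levels of the categorical variable. As a small sketch (Python, not part of the original analysis), the hypothetical helper `benign_share` below reproduces the two figures for Forme from the counts in the table above.

```python
# Counts from the two-way table of Forme versus Histologie.
benign = {1: 28, 2: 264, 3: 253, 4: 243, 5: 28}
malignant = {1: 1, 2: 64, 3: 102, 4: 257, 5: 156}


def benign_share(levels):
    """Proportion of benign tumors among observations at the given Forme levels."""
    b = sum(benign[lvl] for lvl in levels)
    m = sum(malignant[lvl] for lvl in levels)
    return b / (b + m)


print(round(100 * benign_share([1, 2, 3]), 1))   # 76.5 (benign, levels 1 to 3)
print(round(100 * (1 - benign_share([5])), 1))   # 84.8 (malignant, level 5)
```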
                Polymorphisme
Histologie      0      1
0               414    402
1               147    433
If there are not many types of microcalcifications, 73.8% of the tumors are benign. When
there are many types of microcalcifications, the ratio of benign and malignant tumors is
close to 1.
                Contour
Histologie      1      2      3
0               273    268    275
1               110    330    140
When the tumor is circular, the proportion of benign tumors is 71.3%; when it is angular,
the proportion of benign tumors is 44.8%. When the tumor is neither angular nor
circular, the proportion of benign tumors is 66.3%.
                Retro
Histologie      0      1
0               691    125
1               476    104
The distributions of malignant and benign tumors are similar, whether the cluster is under
the nipple or not.
                Profondeur
Histologie      0      1
0               367    449
1               216    364
If the microcalcifications are not deep under the skin, the proportion of benign tumors is
63.0% (367 of 583), but if they are deep under the skin, the proportion of benign tumors is
55.2% (449 of 813).
                Nombre2
Histologie      1      2      3      4
0               142    240    172    262
1               48     131    121    280
The proportion of benign tumors for levels 1 and 2 combined is 68.1%. The fewer
microcalcifications in a cluster, the higher the rate of benign tumors.
3.5 Three way-tables for the categorical variables and Histologie
Variables Typsein and Foyer
HISTOLOGIE=0
            Typsein
Foyer       0      1
1           326    298
2           112    80
HISTOLOGIE=1
            Typsein
Foyer       0      1
1           160    220
2           105    95
If the type of tissue is light and there is only one suspicious cluster the ratio of benign and
malignant tumors is approximately two to one whereas for all other combinations it is
almost one to one.
Variables Foyer and Contour
HISTOLOGIE=0
            Foyer
Contour     1      2
1           214    59
2           208    60
3           202    73
HISTOLOGIE=1
            Foyer
Contour     1      2
1           75     35
2           212    118
3           93     47
If the shape is angular, malignant tumors outnumber benign ones, most clearly when there are
two suspicious clusters (118 malignant versus 60 benign). When the shape is not angular,
malignant tumors are in the minority.
Variables Forme and Contour
HISTOLOGIE=0
            Forme
Contour     1     2      3     4      5
1           7     80     95    82     9
2           10    60     83    102    13
3           11    124    75    59     6
HISTOLOGIE=1
            Forme
Contour     1     2      3     4      5
1           0     16     35    49     10
2           0     27     34    140    129
3           1     21     33    68     17
For angular-shaped tumors and level 5 of Forme the ratio between benign and malignant
tumors is 1 to 10.
Variables Forme and Nombre2
HISTOLOGIE=0
            Forme
Nombre2     1     2     3      4     5
1           9     78    17     33    5
2           6     70    63     97    4
3           8     46    53     55    10
4           5     70    120    58    9
HISTOLOGIE=1
            Forme
Nombre2     1     2     3      4     5
1           1     17    5      21    4
2           0     20    15     78    18
3           0     11    23     61    26
4           0     16    59     97    108
For the combination of more than 30 microcalcifications per cluster (level 4 of Nombre2)
and level 5 of Forme the ratio of benign and malignant tumors is 1 to 12.
3.6.1. Pairwise Scatterplot for Age and Taille color- and glyph-coded
with respect to Histologie ( Plot 3.6.1. )
open circle = malignant
plus = benign
The plot does not help at all to separate benign from malignant tumors.
3.6.2. Tables of Age and Taille Categories versus Histologie
Age
Age      Benign         Malignant      Total
<35      27 (75%)       9 (25%)        36
35-39    56 (76.7%)     17 (23.3%)     73
40-44    155 (65.7%)    81 (34.3%)     236
45-49    177 (61%)      113 (39%)      290
50-59    285 (55.7%)    227 (44.3%)    512
60-69    92 (45.5%)     110 (54.5%)    202
>70      56 (51.4%)     53 (48.6%)     109
For women below the age of 45 the risk of a malignant tumor is lower than the average
risk of about 40%. For women above the age of 50 the risk of a malignant tumor is higher
than the average risk.
Taille
Taille   Benign         Malignant      Total
<5       96 (72.7%)     36 (27.3%)     132
>40      21 (38.9%)     33 (61.1%)     54
For values of Taille below 5 the risk of a malignant tumor is lower than the average risk.
For values of Taille above 40 the risk of a malignant tumor is higher than the average
risk. For values of Taille between 5 and 40 we did not find any interesting patterns.
3.7. Mosaic Plots
3.7.1. Variables Histologie and Forme ( Plot 3.7.1. )
The mosaic plot shows that the proportion of malignant tumors increases with the level of
variable Forme. For level one of Forme there are very few malignant tumors, for level
five of Forme almost all tumors are malignant.
3.7.2. Variables Histologie, Nombre2 and Forme ( Plot 3.7.2. )
The proportion of malignant tumors increases with the level of variable Nombre2. Given
a level of Nombre2 the distribution of Histologie depends on Forme; the lower the level
of Forme, the higher the proportion of benign tumors. Hence the distribution of
Histologie depends on both variables.
3.7.3. Variables Histologie, Nombre2 and Foyer ( Plot 3.7.3. )
As mentioned before the proportion of malignant tumors increases with the level of
variable Nombre2. Only given level 4 of Nombre2 the distribution of Histologie depends
on variable Foyer. In that case, for malignant tumors one and two suspicious clusters are
nearly equally likely, whereas benign tumors mostly have only one suspicious cluster.
3.7.4. Variables Histologie, Polymorphisme and Forme ( Plot 3.7.4. )
For level 5 of Forme and level 1 of Polymorphisme the proportion of malignant tumors is
much larger than for all the other combinations of these variables. The distribution of
Histologie depends on both variables.
3.8. Cluster Analysis
We used cluster analysis only for the explanatory variables. Even after using the distance
for the standardized data the grouping done by cluster analysis was always dominated by
the continuous variables Age and Taille. The groups the procedure gave us did not show
any interesting features related to Histologie. Additionally these two variables did not
show any relevance in predicting malignancy in the previous analysis. So we decided to
exclude the continuous variables from the cluster analysis. For the categorical variables
the results using the single, complete and average linkage methods with the distances for
raw, standardized and sphered data were very similar. The k-means method choosing four
clusters using the average linkage method for the distances of the standardized data
provided the most interesting groups with respect to Histologie.
Based on this method the 1396 observations are grouped as follows:
Color     Group   Histologie=0   Histologie=1
blue      1       301            109
yellow    2       132            292
green     3       225            55
red       4       158            124
In group 1 the proportion of benign tumors is 73.4%, in group 3 it is 80.4%. In group 2
the proportion of malignant tumors is 68.9%. Interestingly this clustering is mainly based
on the bivariate plot of Forme vs. Nombre2. If Forme is at levels 4 or 5 and Nombre2 is
at levels 1 or 2 then the observation is clearly grouped into cluster 4. For Nombre2 at
levels 3 or 4 and the same levels for Forme as before they are grouped into cluster 2. For
the combinations of levels 3 or 4 for Nombre2 and levels 1, 2 or 3 of Forme all
observations are grouped into cluster 1, while most of the observations with level 1 or 2
for Nombre2 and levels 1, 2 or 3 for Forme are grouped into cluster 3. There are a few
observations with the latter level combinations for Forme and Nombre2 that are grouped
into cluster 4 instead of cluster 3. For further details see Plot 3.8.
3.9 Principal Components Analysis
For Principal Components we encountered a similar problem as we did with Cluster
Analysis. Even for the standardized data the continuous variables dominate the Principal
Components. Therefore we excluded the two continuous variables from the Principal
Components Analysis.
For this data set the first few principal components do not account for most of the
variation in the data. Hence, it is not possible to describe the data by a few principal
components without loss of information.
Proportion of the variation in the data explained by the first eight Principal
Components
Principal Component      Proportion   Cumulative
Principal Component 1    0.1925       0.1925
Principal Component 2    0.1376       0.3301
Principal Component 3    0.1348       0.4649
Principal Component 4    0.1154       0.5803
Principal Component 5    0.1105       0.6908
Principal Component 6    0.0974       0.7883
Principal Component 7    0.0932       0.8815
Principal Component 8    0.0745       0.9560
Six Principal Components are needed to explain about 80% of the variation in the data.
Since the total number of categorical variables is only 9, six Principal Components
cannot be considered a "few". Hence we decided that Principal Component Analysis is
not very informative for this data set.
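The cumulative proportions are running sums of the individual proportions, so the table can be checked with a few lines of Python. This is a sketch based only on the rounded proportions reported above; the helper name `components_needed` is hypothetical, and the last digit of a running sum may differ slightly from the SAS output because of rounding.

```python
from itertools import accumulate

# Proportion of variation explained by the first eight principal components.
proportions = [0.1925, 0.1376, 0.1348, 0.1154, 0.1105, 0.0974, 0.0932, 0.0745]
cumulative = list(accumulate(proportions))


def components_needed(threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return None  # threshold not reached by the components listed


for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Principal Component {k}: proportion={p:.4f} cumulative={c:.4f}")
```

With these rounded values six components reach roughly 79% of the variation, which is the "about 80%" mentioned above; a strict 80% cutoff would be reached only with the seventh component.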
3.10. Numerical Analysis
3.10.1. CART ( Plot 3.10.1. )
The most important variables according to the Classification Tree are Forme, Contour
and Nombre2. If the level of Forme is 3 or less the observation is classified as benign in
the very first step without any further splits, which misclassifies 167 malignant tumors as
benign ones. We assume that misclassifying a malignant tumor as a benign one is the
more severe error in this case. Because of this and the misclassification error rate of
0.2772 we think that CART does not do a good job.
3.10.2. Neural Networks
The best result using Neural Networks was
                 Histologie
NN prediction    0      1
0                708    185
1                108    395
The total misclassification rate was 0.21.
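The total misclassification rate is the share of off-diagonal counts in the classification table. A short Python sketch of that arithmetic, using the counts above (the variable names are illustrative only), also gives the two class-specific error rates:

```python
# Rows: Neural Network prediction (0, 1); columns: true Histologie (0, 1).
confusion = [[708, 185],
             [108, 395]]

total = sum(sum(row) for row in confusion)
errors = confusion[0][1] + confusion[1][0]

overall_rate = errors / total
# Malignant tumors (Histologie = 1) predicted as benign.
malignant_error = confusion[0][1] / (confusion[0][1] + confusion[1][1])
# Benign tumors (Histologie = 0) predicted as malignant.
benign_error = confusion[1][0] / (confusion[0][0] + confusion[1][0])

print(f"overall: {overall_rate:.4f}")
print(f"malignant misclassified as benign: {malignant_error:.4f}")
print(f"benign misclassified as malignant: {benign_error:.4f}")
```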
To find the most important variables for predicting Histologie we used the predictions for
status of Histologie given by the Neural Networks procedure. We created a new variable
"prediction" taking the value 0 if a benign tumor was predicted and the value 1 otherwise.
Next we used XGobi first to color-code the observations according to the variable
"prediction" and second to look at the color-coded dotplots for each of the explanatory
variables to see which variables were crucial in constructing the Neural Networks
prediction rule for Histologie. We observed that the most important variables in building
the prediction rule were Forme, Contour, Nombre2 and Polymorphisme. Then we used
Neural Networks to build a new prediction rule using only these important explanatory
variables in order to see if these four variables alone do nearly as well as the full set of
explanatory variables.
The best misclassification rate we reached was 0.25, which is close enough to 0.21 to say
that these four variables are the most important in predicting the status of Histologie.
3.10.3. Linear Discriminant Analysis
First we transformed the highly right-skewed variable Taille using a log-transformation.
As PROC DISCRIM in SAS only accepts quantitative variables as explanatory variables,
we had to create dummy variables for the categorical variables.
PROC DISCRIM offers an option to specify priors for the values of Histologie. We used
these priors to reflect the "cost" of misclassification: we think it is more "costly" to
misclassify a malignant tumor as being benign. The priors have to be real values greater
than 0 and less than 1, and they have to sum to 1. We tried different sets of priors to find a
classification rule with a low misclassification rate for malignant tumors and a tolerable
misclassification rate for benign tumors.
Our first choice was to use equal priors:
                      Histologie
LDA Classification    0               1
0                     601 (73.65%)    192 (33.10%)
1                     215 (26.35%)    388 (66.9%)
Total                 816             580
Error Count Estimates:
Histologie    0         1         Total
Rate          0.2635    0.3310    0.2973
Priors        0.5       0.5
The total misclassification rate of 29.7% is very close to the one obtained using CART.
The misclassification rate for malignant tumors is a little higher than that for benign
tumors.
Note: For the total misclassification rate the misclassification rates for malignant and
benign tumors are weighted by the priors, not by the proportions of malignant and benign
tumors.
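To make the note concrete, the following Python sketch computes the total error rate for the equal-priors rule in two ways: weighted by the priors, as PROC DISCRIM reports it, and weighted by the observed class proportions. The function name `weighted_error` is hypothetical; the rates and counts are those given above.

```python
def weighted_error(rates, weights):
    """Weighted average of class-specific misclassification rates."""
    return sum(r * w for r, w in zip(rates, weights))


# Misclassification rates for benign (0) and malignant (1) tumors, equal priors.
rates = (0.2635, 0.3310)

# Weighted by the priors, as in the Error Count Estimates table.
prior_weighted = weighted_error(rates, (0.5, 0.5))

# Weighted by the observed proportions of benign and malignant tumors.
n_benign, n_malignant = 816, 580
n = n_benign + n_malignant
sample_weighted = weighted_error(rates, (n_benign / n, n_malignant / n))

print(f"prior-weighted total:  {prior_weighted:.4f}")
print(f"sample-weighted total: {sample_weighted:.4f}")
```

The sample-weighted figure corresponds to the raw share of misclassified observations, (215 + 192) / 1396, so the two totals coincide only when the priors equal the class proportions.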
Our second choice was to use priors of 0.6 for malignant tumors and
0.4 for benign ones:
                      Histologie
LDA Classification    0               1
0                     503 (61.64%)    127 (21.90%)
1                     313 (38.36%)    453 (78.10%)
Total                 816             580
Error Count Estimates:
Histologie    0         1         Total
Rate          0.3836    0.2190    0.2848
Priors        0.4       0.6
The misclassification rate for malignant tumors is still 21.9% and the one for benign
tumors is even 38.4%.
Our third choice was to use priors of 0.8 for malignant tumors and
0.2 for benign ones:
                      Histologie
LDA Classification    0               1
0                     201 (24.63%)    32 (5.52%)
1                     615 (75.37%)    548 (94.48%)
Total                 816             580
Error Count Estimates:
Histologie    0         1         Total
Rate          0.7537    0.0552    0.1949
Priors        0.2       0.8
Now the misclassification rate for malignant tumors is only 5.52%. However, this is
achieved by classifying almost all observations as malignant tumors; the misclassification
rate for benign tumors is 75.4%.
Next we wanted to see if it is easier to find a good classification rule for benign tumors.
Our fourth choice was to use priors of 0.4 for malignant tumors and 0.6 for benign
ones. These priors are close to the proportions of benign and malignant tumors :
                      Histologie
LDA Classification    0               1
0                     687 (84.19%)    267 (46.03%)
1                     129 (15.81%)    313 (53.97%)
Total                 816             580
Error Count Estimates:
Histologie    0         1         Total
Rate          0.1581    0.4603    0.2790
Priors        0.6       0.4
We observe a low misclassification rate for benign tumors (15.8%), but an intolerable
misclassification rate for malignant tumors (46.0%).
Since none of the classification rules seemed to be useful to us, we did not include the
formulas for them.
The Total Canonical Structure is:
Variable         CAN1
Age              -0.326050
Log(Taille)      -0.273302
Nombre2=1        0.270798
Nombre2=2        0.157217
Nombre2=4        -0.337550
Typsein          -0.162532
Cote             -0.032695
Foyer            -0.247997
Forme=2          0.511615
Forme=3          0.313531
Forme=4          -0.308365
Forme=5          -0.705787
Polymorphisme    -0.526934
Contour=1        0.330434
Contour=3        0.212886
Retro            -0.071779
Profondeur       -0.159566
This shows that the variable Forme is very important for the LDA classification rule. For
levels four and five of Forme, level one of Polymorphisme, level four of Nombre2, level two
of Contour and high values of Age and Log(Taille), the probability of classifying the
observation as a malignant tumor is highest. For levels two and three of Forme, level one
of Nombre2 and level three of Contour, the probability of classifying the observation as a
benign tumor is highest.
3.11. Logistic Regression
As in the LDA we had to use dummy variables for the categorical variables. We used the
Forward Selection Procedure in PROC LOGISTIC in SAS to identify the most important
variables in predicting malignancy. The significance level for entering the model we
chose was 0.1.
The most important continuous variables and levels of categorical variables are:
Age, level four of Nombre2, Foyer, levels one, three, four and five of Forme and levels
one and two of Contour.
The table below gives parameter estimates, Wald Chi-Square test statistics, p-values and
odds ratios for the variables mentioned above.
Variable     Parameter Estimate   Wald Chi-Square   Pr > Chi-Square   Odds Ratio
Intercept    -3.2997              68.8147           0.0001
Age          0.0310               20.6061           0.0001            1.031
Nombre2=4    0.5258               15.6533           0.0001            1.692
Foyer=2      0.4685               11.4887           0.0007            1.598
Forme=1      -1.8482              3.2159            0.0729            0.158
Forme=3      0.4044               4.5200            0.0335            1.498
Forme=4      1.3796               64.5870           0.0001            3.973
Forme=5      2.7184               109.7614          0.0001            15.156
Contour=1    -0.3374              4.0209            0.0449            0.714
Contour=2    0.3439               5.1701            0.0230            1.410
Interpretation of the odds ratios:
For the variable Age an increase of one year corresponds to an increase in the odds of a
malignant tumor of about 3%. The magnitude of this increase is not very large, but you have
to keep in mind that the variable Age has a range of about 60 years. For level four of
variable Nombre2 the odds of a malignant tumor are 69% higher than for the other levels of
this variable. For two suspicious clusters (Foyer=2) the odds of a malignant tumor are 60%
higher than for one suspicious cluster. For level one of Forme the odds of a malignant
tumor are only 16% of the odds for level two of Forme, which is the only level of Forme
without a dummy variable in the model and therefore the reference level. For level five of
Forme the odds of a malignant tumor are about 15 times the odds for level two (odds ratio
15.156, that is 1415.6% higher). For an angular cluster shape the odds of a malignant tumor
are 41% higher than for the reference shape "other" (Contour=3).
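Each odds ratio in the table is the exponential of the corresponding parameter estimate, and the percentage change in the odds is (odds ratio - 1) x 100. The Python sketch below redoes that arithmetic from the reported estimates; it is not the SAS output itself, and the helper name `percent_change_in_odds` is hypothetical.

```python
import math

# Parameter estimates from the logistic regression table (intercept omitted).
estimates = {
    "Age": 0.0310, "Nombre2=4": 0.5258, "Foyer=2": 0.4685,
    "Forme=1": -1.8482, "Forme=3": 0.4044, "Forme=4": 1.3796,
    "Forme=5": 2.7184, "Contour=1": -0.3374, "Contour=2": 0.3439,
}

# Odds ratio for a one-unit increase (or for a dummy variable switching to 1).
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


for name, ratio in odds_ratios.items():
    print(f"{name:10s} odds ratio = {ratio:7.3f}  "
          f"change in odds = {percent_change_in_odds(ratio):+8.1f}%")

# Because Age enters linearly on the log-odds scale, a 10-year difference
# multiplies the odds by exp(10 * 0.0310).
ten_year_ratio = math.exp(10 * estimates["Age"])
print(f"10-year age difference: odds ratio = {ten_year_ratio:.3f}")
```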
Model evaluation:
Taking into account that the data is messy the concordant value of 77.3% and the
Gamma value of 0.55 suggest that the model we used is appropriate.
4. Summary
• We think that a classification rule can only be used in practice if the misclassification
rates are “low”, maybe 10%. Misclassifying a malignant tumor as being benign would
result in the wrong treatment and cause severe consequences for the patient.
• Because the data set is so messy, the tour plots in XGobi did not give us any
separation between malignant and benign tumors.
• We were not able to find a “good” classification rule using the relatively complicated
Neural Networks and LDA procedures; the misclassification rates for Neural Networks
and LDA were 21% and 29.7%.
• The Classification Tree, which is easier to interpret and to use, gave a misclassification
rate of 27.7%, which is still too high. We only want to use it to see which variables are
most important in predicting malignancy:
If the level of Forme is less than four then the observation is classified as a benign
tumor in the first step. If the level of Forme is more than three the next splits are based
on Forme, Nombre2 and Contour. For more details see Plot 3.10.1.
• The variables that are most important in predicting malignancy are:
- Forme: The rate of malignant tumors increases with the level of Forme;
For levels one and two of Forme combined the rate of benign tumors is 81.8%,
For level five of Forme the rate of malignant tumors is 84.8%.
- Nombre2: For level one of Nombre2 the rate of benign tumors is 74.7%.
- Forme and Contour combined: For level five of Forme and level two of Contour the
ratio between benign and malignant tumors is 13 to 129, which is close to a ratio of 1
to 10.
- Forme and Nombre2: For level five of Forme and level four of Nombre2 the ratio
between benign and malignant tumors is 9 to 108, which is a ratio of 1 to 12.
• Age is thought of as one of the most important factors in predicting malignancy. In our
analysis it was only significant for LDA and Logistic Regression.
The youngest woman with a detected breast tumor in this study is 22, the oldest
woman is 86. The average age of the remaining 1396 women in the study (after deleting
observations with missing values) is 51.2.
We split the data set into two subgroups. One for women younger than 50, one for
women older than 49.
Performing LDA (with equal priors) for the first group we got a misclassification rate
of 26.23% which is better than the one for the complete data set.
Unfortunately we ran out of time and did not have the opportunity to further explore
those two groups.
5. References
- Homepage of the Encyclopaedia Britannica
- Homepage of the National Cancer Institute
- Statistics 557 Course Notes, Kenneth Koehler ISU
- Statistics 501 Course Notes, Kenneth Koehler ISU
- B.D. Ripley, Pattern Recognition and Neural Networks, 1996.