Statistics 503X Project
Breast Cancer Data
Younghun Han, Norbert Karp, Sven Stanzel

1. Description

For women, breast cancer is a fairly common and severe disease, so it would be very helpful to know whether the kind of tumor can be predicted. There are two kinds of tumors, benign and malignant. Any tumor, either benign or malignant in type, may produce death by local effects if it is appropriately situated. The common and more specific definition of malignancy implies an inherent tendency of the tumor's cells to metastasize (invade the body widely and become disseminated by subtle means) and eventually to kill the patient unless all the malignant cells can be eradicated. Once a tumor has been detected, it is impossible to say whether it is benign or malignant without surgery. Current cancer treatment depends on drugs and hormones (chemotherapy), surgery, radiation therapy, or a combination of these. The earlier cancer is diagnosed and the sooner treatment can be implemented, the greater the chances of a successful cure.

According to the National Cancer Institute's "What You Need to Know About Breast Cancer", the following are known risk factors for breast cancer: age, family history, personal history (women who have already had breast cancer face an increased risk of getting breast cancer again), menstruating at an early age (before 12) and having the first child after the age of 30.

The data was provided by Richard D. De Veaux. He studied 1622 women who went into surgery after a tumor had been detected in their breasts. The only risk factor mentioned above that is included in the data is age. Additionally there are physical measurements for each patient.
The data available are:

Histologie: result of the tumor analysis after surgery: 0 = benign, 1 = malignant
Age: age of the patient
Typsein: type of tissue: 0 = light, 1 = dense
Cote: location of the tumor: 0 = left, 1 = right
Taille: size of the suspicious cluster in mm
Nombre: number of microcalcifications in the cluster: 1 = <10, 2 = 10-30, 3 = 30+, 4 = 10-20, 5 = 20-30
Foyer: number of suspicious clusters, 1 or 2
Forme: shape of the microcalcifications, 1-5; the order corresponds to the degree of malignancy from least to most
Polymorphisme: Are there many types of microcalcifications in one cluster? 0 = no, 1 = yes
Contour: shape of the cluster: 1 = circular, 2 = angular, 3 = other
Retro: Is the cluster under the nipple? 0 = no, 1 = yes
Profondeur: Are the microcalcifications deep under the skin? 0 = no, 1 = yes

We want to use the data to see which variables are related to Histologie and whether we can predict the kind of tumor.

2. Suggested Approaches

Data Restructuring
  Reason: Recode the variable Nombre as a new variable Nombre2 by deleting the observations with the value 2 for Nombre. Levels 4 (10-20) and 5 (20-30) are subgroups of level 2 (10-30), and we have no information about which of those subgroups an observation with level 2 belongs to.
  The new groups are: 1: <10, 2: 10-20, 3: 20-30, 4: 30+
  We deleted the 226 of the 1622 original observations that had missing values, after checking that the distribution of the other variables for the deleted observations was the same as for the remaining observations.
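The recoding described above can be sketched as follows. This is a minimal illustration, not the procedure the authors ran: the record layout (a list of dictionaries with a "Nombre" key), the helper name recode_nombre and the sample patients are all assumptions made for the example.

```python
# Hypothetical mapping from the original Nombre codes to the recoded Nombre2 levels.
# Original: 1 = <10, 2 = 10-30, 3 = 30+, 4 = 10-20, 5 = 20-30
# Recoded:  1 = <10, 2 = 10-20, 3 = 20-30, 4 = 30+
NOMBRE_TO_NOMBRE2 = {1: 1, 4: 2, 5: 3, 3: 4}


def recode_nombre(records):
    """Return copies of the records with a Nombre2 field added.

    Records whose Nombre value has no entry in the mapping (the ambiguous
    level 2, or any unexpected code) are dropped.
    """
    recoded = []
    for record in records:
        new_level = NOMBRE_TO_NOMBRE2.get(record["Nombre"])
        if new_level is None:
            continue
        recoded.append({**record, "Nombre2": new_level})
    return recoded


# Made-up patients, for illustration only.
sample = [
    {"Age": 48, "Nombre": 1},
    {"Age": 55, "Nombre": 2},
    {"Age": 61, "Nombre": 3},
    {"Age": 39, "Nombre": 4},
    {"Age": 52, "Nombre": 5},
]
kept = recode_nombre(sample)
print([r["Nombre2"] for r in kept])
```

Dropping rather than guessing a subgroup for level 2 matches the reasoning given above: there is no information that would justify assigning those observations to 10-20 or 20-30.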
Create dummy variables for all categorical variables to aid in LDA and Logistic Regression
  For each categorical variable, k-1 dummy variables must be created, where k is the number of levels of that variable.
  Each dummy variable has two possible values: 1 if the variable is at the level for which the dummy variable was created, 0 otherwise (the variable takes on any other of its possible levels).

Summary Statistics for the continuous variables
  Reason: Extract location/scale information
  Type of questions addressed: "What is the average age?", "What is the average cluster size?"

Histograms for the variables Age and Taille
  Reason: Explore univariate distributions
  Type of questions addressed: "Are there unusual patterns in the distributions of Age and Taille?" (outliers, distribution shape, quirky structures)

One-way tables for the categorical variables
  Reason: Explore univariate distributions
  Type of questions addressed: "What do the distributions of the categorical variables look like?"

Two-way tables for Histologie versus the categorical variables
  Reason: Explore bivariate distributions and dependencies between Histologie and the categorical variables
  Type of questions addressed: "Which categorical variables are related to Histologie?"

Three-way tables for Histologie versus combinations of the categorical variables
  Reason: Explore multivariate distributions and dependencies between Histologie and the categorical variables
  Type of questions addressed: "Which categorical variables are related to Histologie?"

Dotplots for Age and Taille color-coded with respect to Histologie
  Reason: Examine dependencies between Histologie and Age and Taille
  Type of questions addressed: "Which continuous variables are related to Histologie?"
Pairwise Scatterplot for Age and Taille color-coded with respect to Histologie
  Reason: Explore bivariate distributions and dependencies between Histologie and the continuous variables
  Type of questions addressed: "Is Histologie related to the combination of Age and Taille?"

Mosaic Plots
  Reason: Explore multivariate distributions and dependencies between Histologie and the categorical variables
  Type of questions addressed: "Which categorical variables are related to Histologie?"

Cluster Analysis
  Reason: Find clusters of "similar" patients
  Type of questions addressed: "Are there clusters of similar patients?", "And if so, what is the relation to Histologie?"

Principal Components Analysis for the explanatory variables
  Reason: Reduce the dimensionality of the data
  Type of questions addressed: "Can the data be described by a few principal components without loss of information?", "And if so, what is the relationship between these principal components and Histologie?"

Numerical Analysis
  Methods: CART, Neural Networks, Linear Discriminant Analysis
  Type of questions addressed: "Can we find a classification rule for Histologie based on the other measurements?"

Logistic Regression
  Reason: Determine the most important factors for Histologie
  Type of questions addressed: "Which factors are useful in predicting Histologie?"

3. Actual Approaches

3.1 Summary statistics for the continuous variables

Total number of observations = 1396
Number of variables = 13

                      Age (years)    Size of the cluster (mm)
average               51.2           15.3
standard deviation    9.27           12.88
min                   22             2
max                   86             100

3.2 Histograms of variables Age and Taille

The distribution of variable Age (Plot 3.2.a.) is roughly symmetric. It is unimodal with a peak around 50. We found it quite surprising that there are some patients younger than 40.

The distribution of Taille (Plot 3.2.b.) is heavily skewed to the right. It is unimodal with a peak at 0-10.
Most of the tumors have a cluster size below 20 mm; cluster sizes above 50 mm are rare.

3.3 One-way tables for the categorical variables

Histologie    0      1
Count         816    580
58.4% of the tumors are benign.

Typsein       0      1
Count         703    693
The percentages of light and dense tissue are almost the same.

Cote          0      1
Count         675    721
Tumors are distributed approximately equally between the left and the right side.

Foyer         1       2
Count         1004    392
Most of the patients (71.9%) have only one suspicious cluster. No patient has more than 2 suspicious clusters.

Forme         1     2      3      4      5
Count         29    328    355    500    184
Most of the observations (84.7%) have levels two, three or four.

Polymorphisme 0      1
Count         561    835
In 59.8% of the cases there are many types of microcalcifications in one cluster.

Contour       1      2      3
Count         383    598    415
Angular cluster shape (42.8%) is more common than circular cluster shape (27.4%). For almost 30% of the observed women the shape of the suspicious cluster(s) is neither circular nor angular.

Retro         0       1
Count         1167    229
In 83.6% of the cases the cluster was not under the nipple.

Profondeur    0      1
Count         583    813
Microcalcifications are deep under the skin in 58.2% of the observations.

Nombre2       1      2      3      4
Count         190    371    293    542
In 38.8% of the cases there are more than 30 microcalcifications per cluster.
Note: Only the counts for levels 1 and 4 are not affected by deleting the observations that belong to the original level 2.

3.4 Two-way tables for the categorical variables and Histologie

                Typsein
Histologie      0      1
0               438    378
1               265    315
If the tissue is light, more tumors are benign, while for dense tissue the ratio of benign to malignant tumors is close to 1.

                Cote
Histologie      0      1
0               400    416
1               275    305
The ratios of benign to malignant tumors are roughly the same for both levels of Cote.

                Foyer
Histologie      1      2
0               624    192
1               380    200
When there is 1 suspicious cluster, 62.2% of the tumors are benign, whereas when there are 2 suspicious clusters the ratio of benign to malignant tumors is almost 1.
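The percentages quoted alongside the two-way tables are column proportions: the benign count divided by the column total. A minimal sketch of that arithmetic, using the Foyer counts from the table above; the function name and the dictionary layout are assumptions made for this example.

```python
def benign_share_by_level(table):
    """Compute the share of benign tumors within each level of a variable.

    `table` maps a level to a (benign count, malignant count) pair, i.e. one
    column of a two-way table of Histologie against that variable.
    """
    shares = {}
    for level, (benign, malignant) in table.items():
        shares[level] = benign / (benign + malignant)
    return shares


# Counts taken from the Foyer x Histologie table.
foyer = {1: (624, 380), 2: (192, 200)}
foyer_shares = benign_share_by_level(foyer)

for level, share in foyer_shares.items():
    print(f"Foyer={level}: {share:.1%} benign")
```

The same helper applies to any of the other two-way tables in this section by swapping in the corresponding counts.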
                Forme
Histologie      1     2      3      4      5
0               28    264    253    243    28
1               1     64     102    257    156
If the level of Forme is between 1 and 3, the proportion of benign tumors is 76.5%, while for levels 4 and 5 the rate of malignant tumors is higher. For level 5, 84.8% of the tumors are malignant.

                Polymorphisme
Histologie      0      1
0               414    402
1               147    433
If there are not many types of microcalcifications, 73.8% of the tumors are benign. When there are many types of microcalcifications, the ratio of benign to malignant tumors is close to 1.

                Contour
Histologie      1      2      3
0               273    268    275
1               110    330    140
When the cluster is circular, the proportion of benign tumors is 71.3%; when it is angular, the proportion of benign tumors is 44.8%. When the cluster is neither angular nor circular, the proportion of benign tumors is 66.3%.

                Retro
Histologie      0      1
0               691    125
1               476    104
The distributions of malignant and benign tumors are similar whether or not the cluster is under the nipple.

                Profondeur
Histologie      0      1
0               367    449
1               216    364
If the microcalcifications are not deep under the skin, the proportion of benign tumors is 63.0% (367 of 583), but if they are deep under the skin, the proportion of benign tumors is 55.2% (449 of 813).

                Nombre2
Histologie      1      2      3      4
0               142    240    172    262
1               48     131    121    180
The proportion of benign tumors for levels 1 and 2 combined is 68.1%. The fewer microcalcifications in a cluster, the higher the rate of benign tumors.

3.5 Three-way tables for the categorical variables and Histologie

Variables Typsein and Foyer

HISTOLOGIE=0    Typsein
Foyer           0      1
1               326    298
2               112    80

HISTOLOGIE=1    Typsein
Foyer           0      1
1               160    220
2               105    95

If the type of tissue is light and there is only one suspicious cluster, the ratio of benign to malignant tumors is approximately two to one, whereas for all other combinations it is almost one to one.

Variables Foyer and Contour

HISTOLOGIE=0    Foyer
Contour         1      2
1               214    59
2               208    60
3               202    73

HISTOLOGIE=1    Foyer
Contour         1      2
1               75     35
2               212    118
3               93     47

If the shape is angular then there are more malignant tumors when there are two suspicious clusters.
When the shape is not angular, there are fewer malignant tumors.

Variables Forme and Contour

HISTOLOGIE=0    Forme
Contour         1     2      3     4      5
1               7     80     95    82     9
2               10    60     83    102    13
3               11    124    75    59     6

HISTOLOGIE=1    Forme
Contour         1     2     3     4      5
1               0     16    35    49     10
2               0     27    34    140    129
3               1     21    33    68     17

For angular-shaped clusters and level 5 of Forme the ratio of benign to malignant tumors is 1 to 10.

Variables Forme and Nombre2

HISTOLOGIE=0    Forme
Nombre2         1    2     3      4     5
1               9    78    17     33    5
2               6    70    63     97    4
3               8    46    53     55    10
4               5    70    120    58    9

HISTOLOGIE=1    Forme
Nombre2         1    2     3     4     5
1               1    17    5     21    4
2               0    20    15    78    18
3               0    11    23    61    26
4               0    16    59    97    108

For the combination of more than 30 microcalcifications per cluster (level 4 of Nombre2) and level 5 of Forme the ratio of benign to malignant tumors is 1 to 12.

3.6.1. Pairwise Scatterplot for Age and Taille color- and glyph-coded with respect to Histologie (Plot 3.6.1.)

open circle = malignant, plus = benign

The plot does not help at all to separate benign from malignant tumors.

3.6.2. Tables of Age and Taille Categories versus Histologie

Age

Age      Benign         Malignant      Total
<35      27 (75%)       9 (25%)        36
35-39    56 (76.7%)     17 (23.3%)     73
40-44    155 (65.7%)    81 (34.3%)     236
45-49    177 (61%)      113 (39%)      290
50-59    285 (55.7%)    227 (44.3%)    512
60-69    92 (45.5%)     110 (54.5%)    202
>70      56 (51.4%)     53 (48.6%)     109

For women below the age of 45 the risk of a malignant tumor is lower than the average risk of about 40%. For women above the age of 50 the risk of a malignant tumor is higher than the average risk.

Taille

Taille   Benign        Malignant     Total
<5       96 (72.7%)    36 (27.3%)    132
>40      21 (38.9%)    33 (61.1%)    54

For values of Taille below 5 the risk of a malignant tumor is lower than the average risk. For values of Taille above 40 the risk of a malignant tumor is higher than the average risk. For values of Taille between 5 and 40 we did not find any interesting patterns.

3.7. Mosaic Plots

3.7.1. Variables Histologie and Forme (Plot 3.7.1.)

The mosaic plot shows that the proportion of malignant tumors increases with the level of variable Forme. For level one of Forme there are very few malignant tumors; for level five of Forme almost all tumors are malignant.

3.7.2. Variables Histologie, Nombre2 and Forme (Plot 3.7.2.)

The proportion of malignant tumors increases with the level of variable Nombre2. Given a level of Nombre2, the distribution of Histologie depends on Forme: the lower the level of Forme, the higher the proportion of benign tumors. Hence the distribution of Histologie depends on both variables.

3.7.3. Variables Histologie, Nombre2 and Foyer (Plot 3.7.3.)

As mentioned before, the proportion of malignant tumors increases with the level of variable Nombre2. Only at level 4 of Nombre2 does the distribution of Histologie depend on variable Foyer. In that case, for malignant tumors one and two suspicious clusters are nearly equally likely, whereas benign tumors mostly have only one suspicious cluster.

3.7.4. Variables Histologie, Polymorphisme and Forme (Plot 3.7.4.)

For level 5 of Forme and level 1 of Polymorphisme the proportion of malignant tumors is much larger than for all the other combinations of these variables. The distribution of Histologie depends on both variables.

3.8. Cluster Analysis

We used cluster analysis only for the explanatory variables. Even when we used distances based on the standardized data, the grouping produced by cluster analysis was always dominated by the continuous variables Age and Taille. The groups the procedure gave us did not show any interesting features related to Histologie. Additionally, these two variables did not show any relevance in predicting malignancy in the previous analysis. So we decided to exclude the continuous variables from the cluster analysis. For the categorical variables the results using the single, complete and average linkage methods with the distances for raw, standardized and sphered data were very similar.
The k-means method choosing four clusters using the average linkage method for the distances of the standardized data provided the most interesting groups with respect to Histologie. Based on this method the 1396 observations are grouped as follows:

                    Histologie
Color     Group     0      1
blue      1         301    109
yellow    2         132    292
green     3         225    55
red       4         158    124

In group 1 the proportion of benign tumors is 73.4%, in group 3 it is 80.4%. In group 2 the proportion of malignant tumors is 68.9%.

Interestingly, this clustering is mainly based on the bivariate plot of Forme vs. Nombre2. If Forme is at levels 4 or 5 and Nombre2 is at levels 1 or 2, the observation is clearly grouped into cluster 4. Observations with Nombre2 at levels 3 or 4 and the same levels of Forme are grouped into cluster 2. For the combinations of levels 3 or 4 of Nombre2 and levels 1, 2 or 3 of Forme all observations are grouped into cluster 1, while most of the observations with level 1 or 2 of Nombre2 and levels 1, 2 or 3 of Forme are grouped into cluster 3. A few observations with the latter level combinations of Forme and Nombre2 are grouped into cluster 4 instead of cluster 3. For further details see Plot 3.8.

3.9 Principal Components Analysis

For Principal Components we encountered a problem similar to the one we had with Cluster Analysis. Even for the standardized data the continuous variables dominate the Principal Components. Therefore we excluded the two continuous variables from the Principal Components Analysis.

For this data set the first few principal components do not account for most of the variation in the data. Hence, it is not possible to describe the data by a few principal components without loss of information.
Proportion of the variation in the data explained by the first eight Principal Components

Principal Component      Proportion    Cumulative
Principal Component 1    0.1925        0.1925
Principal Component 2    0.1376        0.3301
Principal Component 3    0.1348        0.4649
Principal Component 4    0.1154        0.5803
Principal Component 5    0.1105        0.6908
Principal Component 6    0.0974        0.7883
Principal Component 7    0.0932        0.8815
Principal Component 8    0.0745        0.9560

Six Principal Components are needed to explain about 80% of the variation in the data. Since the total number of categorical variables is only 9, six Principal Components cannot be considered "few". Hence we decided that Principal Components Analysis is not very informative for this data set.

3.10. Numerical Analysis

3.10.1. CART (Plot 3.10.1.)

The most important variables according to the Classification Tree are Forme, Contour and Nombre2. If the level of Forme is 3 or less, the observation is classified as benign in the very first step without any further splits, which misclassifies 167 malignant tumors as benign ones. We assume that misclassifying a malignant tumor as a benign one is the more severe error in this case. Because of this and the misclassification error rate of 0.2772, we think that CART does not do a good job.

3.10.2. Neural Networks

The best result using Neural Networks was

                 Histologie
NN prediction    0      1
0                708    185
1                108    395

The total misclassification rate was 0.21.

To find the most important variables for predicting Histologie we used the predictions of Histologie given by the Neural Networks procedure. We created a new variable "prediction" taking the value 0 if a benign tumor was predicted and the value 1 otherwise. Next we used XGobi, first to color-code the observations according to the variable "prediction" and second to look at the color-coded dotplots for each of the explanatory variables, to see which variables were crucial in constructing the Neural Networks prediction rule for Histologie.
We observed that the most important variables in building the prediction rule were Forme, Contour, Nombre2 and Polymorphisme. Then we used Neural Networks to build a new prediction rule using only these important explanatory variables, in order to see whether these four variables alone do nearly as well as the full set of explanatory variables. The best misclassification rate we reached was 0.25, which is close enough to 0.21 to say that these four variables are the most important in predicting Histologie.

3.10.3. Linear Discriminant Analysis

First we transformed the highly right-skewed variable Taille using a log transformation. As PROC DISCRIM in SAS only allows quantitative explanatory variables, we had to create dummy variables for the categorical variables.

PROC DISCRIM offers an option to use priors for the values of Histologie. We used these priors to express the "cost" of misclassification. We think it is more "costly" to misclassify a malignant tumor as benign. The priors have to be real values greater than 0 and less than 1, and they have to sum to 1. We tried different sets of priors to find a classification rule with a low misclassification rate for malignant tumors and a tolerable misclassification rate for benign tumors.

Our first choice was to use equal priors:

                      Histologie
LDA Classification    0               1
0                     601 (73.65%)    192 (33.10%)
1                     215 (26.35%)    388 (66.90%)
Total                 816             580

Error Count Estimates:
Histologie    0         1         Total
Rate          0.2635    0.3310    0.2973
Priors        0.5       0.5

The total misclassification rate of 29.7% is very close to the one obtained using CART. The misclassification rate for malignant tumors is a little higher than the one for benign tumors.

Note: For the total misclassification rate, the misclassification rates for malignant and benign tumors are weighted by the priors, not by the proportions of malignant and benign tumors.
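The note above can be checked with a few lines of arithmetic. The sketch below reproduces the prior-weighted total for the equal-priors table and, for comparison, the pooled rate that would result from weighting by the observed class sizes. The function names are assumptions made for this example; the counts come from the table above.

```python
def prior_weighted_error(rates, priors):
    """Total error rate with each class rate weighted by its prior."""
    return sum(priors[k] * rates[k] for k in rates)


def pooled_error(misclassified, totals):
    """Total error rate with classes weighted by their observed sizes."""
    return sum(misclassified.values()) / sum(totals.values())


# Equal-priors LDA table: 215 of 816 benign and 192 of 580 malignant
# tumors were misclassified.
misclassified = {0: 215, 1: 192}
totals = {0: 816, 1: 580}
rates = {k: misclassified[k] / totals[k] for k in totals}

weighted = prior_weighted_error(rates, {0: 0.5, 1: 0.5})
pooled = pooled_error(misclassified, totals)

print(f"prior-weighted total: {weighted:.4f}")
print(f"pooled total:         {pooled:.4f}")
```

The two totals differ because the sample contains more benign than malignant tumors, so equal priors give the malignant error rate more weight than its share of the data.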
Our second choice was to use priors of 0.6 for malignant tumors and 0.4 for benign ones:

                      Histologie
LDA Classification    0               1
0                     503 (61.64%)    127 (21.90%)
1                     313 (38.36%)    453 (78.10%)
Total                 816             580

Error Count Estimates:
Histologie    0         1         Total
Rate          0.3836    0.2190    0.2848
Priors        0.4       0.6

The misclassification rate for malignant tumors is still 21.9%, and the one for benign tumors is as high as 38.4%.

Our third choice was to use priors of 0.8 for malignant tumors and 0.2 for benign ones:

                      Histologie
LDA Classification    0               1
0                     201 (24.63%)    32 (5.52%)
1                     615 (75.37%)    548 (94.48%)
Total                 816             580

Error Count Estimates:
Histologie    0         1         Total
Rate          0.7537    0.0552    0.1949
Priors        0.2       0.8

Now the misclassification rate for malignant tumors is only 5.52%. However, this is achieved by classifying almost all observations as malignant tumors; the misclassification rate for benign tumors is 75.4%.

Next we wanted to see if it is easier to find a good classification rule for benign tumors. Our fourth choice was to use priors of 0.4 for malignant tumors and 0.6 for benign ones. These priors are close to the proportions of benign and malignant tumors:

                      Histologie
LDA Classification    0               1
0                     687 (84.19%)    267 (46.03%)
1                     129 (15.81%)    313 (53.97%)
Total                 816             580

Error Count Estimates:
Histologie    0         1         Total
Rate          0.1581    0.4603    0.2790
Priors        0.6       0.4

We observe a low misclassification rate for benign tumors (15.8%), but an intolerable misclassification rate for malignant tumors (46.0%).

Since none of the classification rules seemed to be useful to us, we did not include the formulas for them.
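Although the fitted rules are not reported, the way the priors shifted the four classification tables above follows from the general form of the two-class LDA rule. The sketch below uses standard textbook notation that does not appear in the original report and is assumed here: class means \mu_0 and \mu_1, a common covariance matrix \Sigma, and priors \pi_0 (benign) and \pi_1 (malignant).

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
            - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
            + \log \pi_k , \qquad k \in \{0, 1\}

\text{classify as malignant}
  \iff \delta_1(x) > \delta_0(x)
  \iff (\mu_1 - \mu_0)^{\top}\Sigma^{-1}x
       > \tfrac{1}{2}\left(\mu_1^{\top}\Sigma^{-1}\mu_1
                         - \mu_0^{\top}\Sigma^{-1}\mu_0\right)
       + \log\frac{\pi_0}{\pi_1}
```

The priors enter only through the term \log(\pi_0/\pi_1). Raising \pi_1 lowers the threshold on the right-hand side, so more observations are classified as malignant, which is the pattern seen when moving from equal priors to priors of 0.6 and 0.8 for malignant tumors.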
The Total Canonical Structure is:

Variable         CAN1
Age              -0.326050
Log(Taille)      -0.273302
Nombre2=1         0.270798
Nombre2=2         0.157217
Nombre2=4        -0.337550
Typsein          -0.162532
Cote             -0.032695
Foyer            -0.247997
Forme=2           0.511615
Forme=3           0.313531
Forme=4          -0.308365
Forme=5          -0.705787
Polymorphisme    -0.526934
Contour=1         0.330434
Contour=3         0.212886
Retro            -0.071779
Profondeur       -0.159566

This shows that the variable Forme is very important for the LDA classification rule. The probability of classifying an observation as a malignant tumor is highest for levels four and five of Forme, level one of Polymorphisme, level four of Nombre2, level two of Contour and high values of Age and Log(Taille). The probability of classifying an observation as a benign tumor is highest for levels two and three of Forme, level one of Nombre2 and level three of Contour.

3.11. Logistic Regression

As in the LDA, we had to use dummy variables for the categorical variables. We used the Forward Selection Procedure in PROC LOGISTIC in SAS to identify the most important variables in predicting malignancy. The significance level we chose for entering the model was 0.1. The most important continuous variables and levels of categorical variables are: Age, level four of Nombre2, Foyer, levels one, three, four and five of Forme, and levels one and two of Contour. The table below gives the parameter estimates, Wald chi-square test statistics, p-values and odds ratios for these variables.
Variable     Parameter Estimate    Wald Chi-Square    Pr > Chi-Square    Odds Ratio
Intercept    -3.2997               68.8147            0.0001
Age           0.0310               20.6061            0.0001             1.031
Nombre2=4     0.5258               15.6533            0.0001             1.692
Foyer=2       0.4685               11.4887            0.0007             1.598
Forme=1      -1.8482               3.2159             0.0729             0.158
Forme=3       0.4044               4.5200             0.0335             1.498
Forme=4       1.3796               64.5870            0.0001             3.973
Forme=5       2.7184               109.7614           0.0001             15.156
Contour=1    -0.3374               4.0209             0.0449             0.714
Contour=2     0.3439               5.1701             0.0230             1.410

Interpretation of the odds ratios (each odds ratio is the exponential of the parameter estimate and describes a change in the odds, not the probability, of a malignant tumor):

For the variable Age, an increase of one year corresponds to an increase in the odds of a malignant tumor of about 3%. The magnitude of this increase is not very large, but keep in mind that the variable Age has a range of about 60 years.
For level four of variable Nombre2 the odds of a malignant tumor are 69% higher than for the other levels of this variable.
For two suspicious clusters (Foyer=2) the odds of a malignant tumor are 60% higher than for one suspicious cluster.
For level one of Forme the odds of a malignant tumor are only 16% of the odds for the other four levels of Forme.
For level five of Forme the odds of a malignant tumor are about 15.2 times the odds for the other four levels of Forme.
For an angular cluster shape compared to all other shapes the odds of a malignant tumor are 41% higher.

Model evaluation: Taking into account that the data is messy, the concordance value of 77.3% and the Gamma value of 0.55 suggest that the model we used is appropriate.

4. Summary

We think that a classification rule can only be used in practice if the misclassification rates are "low", perhaps around 10%. Misclassifying a malignant tumor as benign would result in the wrong treatment and have severe consequences for the patient.

Because the data set is so messy, the touring plots in XGobi did not give us any separation between malignant and benign tumors.
We were not able to find a "good" classification rule using the relatively complicated Neural Networks and LDA procedures; the misclassification rates for Neural Networks and LDA were 21% and 29.7%. The Classification Tree, which is easier to interpret and to use, gave a misclassification rate of 27.7%, which is still too high. We only want to use it to see which variables are most important in predicting malignancy: if the level of Forme is less than four, the observation is classified as a benign tumor in the first step. If the level of Forme is more than three, the next splits are based on Forme, Nombre2 and Contour. For more details see Plot 3.10.1.

The variables that are most important in predicting malignancy are:
- Forme: The rate of malignant tumors increases with the level of Forme. For levels one and two of Forme combined the rate of benign tumors is 81.8%. For level five of Forme the rate of malignant tumors is 84.8%.
- Nombre2: For level one of Nombre2 the rate of benign tumors is 74.7%.
- Forme and Contour combined: For level five of Forme and level two of Contour the ratio of benign to malignant tumors is 13 to 129, which is close to a ratio of 1 to 10.
- Forme and Nombre2: For level five of Forme and level four of Nombre2 the ratio of benign to malignant tumors is 9 to 108, which is a ratio of 1 to 12.

Age is thought to be one of the most important factors in predicting malignancy. In our analysis it was only significant for LDA and Logistic Regression. The youngest woman in this study is 22, the oldest woman is 86. The average age of the 1396 women remaining in the study (after deleting observations with missing values) is 51.2. We split the data set into two subgroups, one for women younger than 50 and one for women older than 49. Performing LDA (with equal priors) for the first group, we got a misclassification rate of 26.23%, which is better than the one for the complete data set.
Unfortunately we ran out of time and did not have the opportunity to further explore those two groups.

5. References

- Homepage of the Encyclopaedia Britannica
- Homepage of the National Cancer Institute
- Statistics 557 Course Notes, Kenneth Koehler, ISU
- Statistics 501 Course Notes, Kenneth Koehler, ISU
- B.D. Ripley, Pattern Recognition and Neural Networks, 1996.