An Exploratory Analysis of a Large Health Cohort... Bayesian Networks By Delin Shen

An Exploratory Analysis of a Large Health Cohort Study Using
Bayesian Networks
By Delin Shen
S.B. Biomedical Engineering, S.B. Mechanical Engineering
Tsinghua University, 1994
S.M. Biomedical Engineering
Tsinghua University, 1997
SUBMITTED TO THE HARVARD-MIT DIVISION OF HEALTH SCIENCES AND
TECHNOLOGY IN PARTIAL FULFILLMENT OF THE REQURIEMENTS FOR THE
DEGREE OF
DOCTOR OF PHILOSOPHY IN MEDICAL ENGINEERING
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2006
@ 2006 Massachusetts Institute of Technology
All rights reserved
:'Ji.
A
-N
Signature of Author ...................................
..................................................
Harvard-MIT Division of Health Sciences and Technology
January 26, 2006
Certified by
......
...........................
62'
Peter Szolovits,Ph.D.
Professor of Computer Science and Engineering
Professor of Health Sciences and Technology
Thesis Supervisor
Acceptedby........................................ -
.
............................
--Martha L. Gray, Ph.D.
Edward Hood Taplin Professor of Medical ar Electrical Engineering
Co-Director, Harvard-MIT Division of Health Sziences and Technology
An Exploratory Analysis of a Large Health Cohort Study Using Bayesian Networks
By Delin Shen
Submitted to the Harvard-MIT Division of Health Sciences and Technology on January 26, 2006
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Health Sciences and Technology
ABSTRACT
Large health cohort studies are among the most effective ways in studying the causes, treatments
and outcomes of diseases by systematically collecting a wide range of data over long periods. The
wealth of data in such studies may yield important results in addition to the already numerous
findings, especially when subjected to newer analytical methods.
Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships
among variables, using the tools of probability and graph theory, and have been widely used in
analyzing dependencies and the interplay between variables. We used BN to perform an
exploratory analysis on a rich collection of data from one large health cohort study, the Nurses'
Health Study (NHS), with the focus on breast cancer.
We explored the data from the NHS using BN to look for breast cancer risk factors, including a
group of Single Nucleotide Polymorphisms (SNP). We found no association between the SNPs
and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid
as a potential riskfactor after matching on age and number of children. Our results showed for
clomid an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI
1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95%
CI 0.22-0.97).
We developed breast cancer risk models using BN. We trained models on 75% of the data, and
evaluated them on the remaining. Because of the clinical importance of predicting risks for
Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this
specific type of breast cancer to predict two-year, four-year, and six-year risks. The concordance
statistics of the prediction results on test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI:
0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for two, four, and six year models, respectively.
We also evaluated the calibration performance of the models, and applied a filter to the output to
improve the linear relationship between predicted and observed risks using Agglomerative
Information Bottleneck clustering without sacrificing much discrimination performance.
Thesis Supervisor: Peter Szolovits, Ph.D.
Title: Professor of Computer Science and Engineering
Professor of Health Sciences and Technology
-2-
To my grandmother,
my parents,
and my wife
-3-
Acknowledgments
There are many people who have helped me during my thesis research. First and foremost, I
would like to thank my thesis supervisor, Professor Peter Szolovits, of the MIT Computer Science
and Artificial Intelligence Lab. I am grateful for his continuous guidance in learning and doing
research, his careful reading of my thesis, and especially his moral support during my study at
MIT. From time to time, I had challenges in my research work, and talking with Peter always
helped me to overcome the difficulties, while his sense of humor made it easier for me to deal
with the pressure. I would also very much like to thank my mentor and friend, Professor Marco
Ramoni, of the Harvard Medical School and Children's Hospital, especially for his tremendous
help with my research and the encouragement I received from his own experiences. Marco spent
a large amount of time discussing with me the research, going through with me the presentations,
without which I could never have finished the thesis work. I am also grateful to Professor Graham
Colditz, of the Harvard School of Public Health, whose insights and expertise in breast cancer
and the Nurses' Health Study were instrumental in both my research and thesis defense. Finally, I
cannot continue without expressing my lasting gratitude to Professor Tommi Jaakkola, of the MIT
Computer Science and Artificial Intelligence Lab, and Professor Lucila Machado, of the Harvard
Medical School and Brigham Women's Hospital, who gave me invaluable feedbacks and
suggestions in how to apply machine learning techniques.
I am grateful to the members of the Channing Lab, who gave me so many help and support
during the research. Dr. Karen Corsano helped me to get familiar with the Nurses' Health Study,
and always answered my questions promptly. Lisa Li spent tremendous amount of time extracting
the data for our research, which is an extra to her already heavy loaded daily work. Dr. David
Hunter and Rong Chen provided and helped me understand the genomics data. Marion McPhee
helped me on the log-incidence model and provided related data. Their contributions are all
crucial to this thesis work. I would also like to thank all MEDG members, past and present, at the
MIT Computer Science and Artificial Intelligence Lab, who have each helped in their individual
ways. I would especially like to mention Fern DeOliveira, our awesome assistant, for her always
being there and prompt help.
I am indebted to my friends and family who have continued to believe in me, and who have kept
me sane, entertained, and happy. I would particularly like to thank Alfred Falco, Wei Guo, Jinlei
Liu, Jingsong Wu, Chao Wang, Yi Zhang, Hao Wang, Selina Lam, Jianyi Cui, Jing Wang, Minqu
Deng, Xingchen Wang, and Song Gao.
This thesis is based upon work supported in part by a HST-MEMP fellowship, the Defense
Advanced Research Projects Agency grant F30602-99-1-0509, and the National Institute of
Health grant U54LM008748-01.
-4-
Table of Content
Chapter 1 Introduction .............................................................................-
8 -
Chapter 2 Background........................................................................... 10
- 2.1 Nurses'Health
Study ...................................................................................................-
2.2 Breast Cancer Models ..................................................................................................2.3 Evaluation and Validation of Breast Cancer Models ...................................................2.4 Breast Cancer and Genomics .......................................................................................
2.5 Machine Learning Techniques .....................................................................................2.6 Bayesian Networks ......................................................................................................2.7 Bayesware Discoverer .................................................................................................-
10-
11 17 19
- 23 27 30 -
Chapter 3 Landscaping Clinical and Life-style and Genotypic Data- 32 3.1 Landscaping Clinical and Life-style Variables and SNPs............................................3.1.1 Data.......................................................................................................................3.1.2 Exploring the SNPs and Breast Cancer .................................................................3.1.3 Exploring Clinical and Life-style Factors and SNPs ............................................3.2 Estrogen Receptor Status and Clomid .........................................................................3.3 Summary......................................................................................................................-
32 33 36 38 41 47 -
Chapter 4 Breast Cancer Model............................................................- 49 4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
Log-]Incidence Model for Breast Cancer ......................................................................Evaluating Risk Score as an Index for Breast Cancer .................................................Learning a Classifier for Breast Cancer Incidence ......................................................Classifier for ER+/PR+ Breast Cancer ........................................................................Evaluation of Risk Prediction Model ...........................................................................
Filtering of Risk Prediction Probabilities Based on Clustering ...................................Classifiers Predicting Long Term Risks.......................................................................Summary......................................................................................................................-
Chapter 5 Discussion ..............................................................................
49 51 54 59 60
- 64 71 75 -
78
- -
5.1 On Exploratory Analysis ..............................................................................................- 78 5.1.1 Expert Knowledge .................................................................................................- 78 5.1.2 Exploration of the Dependencies in Learned Bayesian Networks ........................ - 78 5.1.3 Split of Training and Test Set................................................................................- 79 5.1.4 Discretization and Consolidation of Variable Values............................................- 80 5.2 On Learning Bayesian Networks from Data................................................................- 80 -5-
5.2.1 Reading a Bayesian Network Learned From Data ................................................- 81 5.2.2 Bayesian Network and Highly Interdependent Data.............................................- 82 5.3 On Evaluation of Risk Predicting Models ...................................................................- 83 5.3.1 Evaluation of Models ............................................................................................- 83 5.3.2 Comparison with the Model in Clinical Use .........................................................- 84 5.3.3 Clustering of Bayesian Network Prediction ..........................................................- 85 5.4 Summary of Contributions ...........................................................................................- 86 5.5 Future Work.................................................................................................................- 87 5.5.1 Data Pre-processing ..............................................................................................- 87 5.5.2 Evaluation of Risk Score As an Index for Breast Cancer ..................................... - 87 5.5.3 Dealing with Highly Interdependent Data ............................................................- 88 5.5.4 Prediction of 5-year Risks .....................................................................................- 88 5.5.5 Clomid and Breast Caner ......................................................................................- 88 -
Bibliography ............................................................................................-
-6-
91 -
List of Figures
Figure Bayesian Network structures of three variables ..........................................................- 28 Figure 2 Bayesian Network of 12 SNPs and breast cancer.......................................................- 37 Figure 3 Bayesian Network for interactions among clinical and life-style variables and SNPs- 40 Figure 4 Clomid use and breast cancer with ER status .............................................................- 42 Figure 5 Comparison of birth year between nurses who used and never used clomid..............- 43 Figure 6 Comparison of number of children on clomid use ......................................................- 46 Figure 7 Bayesian Network learned for evaluating risk score as an index for breast cancer ....- 53 Figure 8 Bayesian Network for predicting breast cancer ..........................................................- 58 Figure 9 Bayesian Network for predicting ER+/PR+ breast cancer ..........................................- 59 Figure 10 ROC curve of ER+/PR+ breast cancer prediction on test set....................................- 60 Figure 11 Calibration curve of ER+/PR+ breast cancer prediction on test set ..........................- 63 Figure 12 Clustering of prediction on training set.................................................................... - 68 Figure 13 ROC curve after filtering on test set .........................................................................- 70 Figure 14 Calibration curve after filtering on test set................................................................- 71 Figure 15 Bayesian Network for predicting 4-year risk of ER+/PR+ breast cancer ................. - 73 Figure 16 Bayesian Network for predicting 6-year risk of ER+/PR+ breast cancer ................. - 73 List of Tables
Table 1 Variable list for exploratory analysis ............................................................................- 34 Table 2 Variable orders for exploring clinical and life-style factors and SNPs .........................- 39 Table 3 Clomid and breast cancer with ER status, no matching ...............................................- 45 Table 4 Clomid and breast cancer with ER status, matched on age for clomid use .................. - 45 Table 5 Clomid and breast cancer with ER status, matched on age and parity for clomid use .- 47 Table 6 Variable counts in Markov Blankets of breast cancer in ten networks with permuted order
............................................................................................................................................
- 56 Table
Table
Table
Table
Table
7 Variable order................................................................................................................8 Prediction results of ER+/PR+ breast cancer ................................................................9 Prediction results on test set after filtering ...................................................................10 Prediction of 2-year, 4-year, and 6-year risk models for ER+/PR+ breast cancer ......11 Prediction after filtering of 2-year, 4-year, and 6-year ER+/PR+ risk models ...........-
-7-
57 62 69 74 74 -
Chapter 1 Introduction
Health care is one of the major concerns of modern society, and much effort has been invested in
studying the causes, treatments and outcomes of disease. Large health cohort studies are among
the most effective, because they follow a large and relatively stable population over a long period
of time and systematically collect comparable longitudinal data, sometimes over decades. There
are numerous large health cohort studies being carried on, such as the Framingham Heart Study
from 1948, The British Doctors Study from 1954, the Dunedin Multidisciplinary Health and
Development Study started in 1972, and the Nurses' Health Study since 1976. Such studies have
led to numerous important insights and many publications, and they often form the basis for
health care recommendations and policies. We suspect that the wealth of data in such studies may
yet yield many additional important results, especially when subjected to newer analytical
methods.
In order to work efficiently, we focused our analysis on breast cancer, one of the most frequently
diagnosed cancers in the US, and one of the motivating diseases for the Nurses' Health Study. We
explored the data from the Nurses' Health Study using Bayesian Networks to look for potential
risk factors for breast cancer, and also developed and evaluated risk predicting models of breast
cancer.
Bayesian Networks provide a relatively new method of representing uncertain relationships
among variables, using the tools of probability and graph theory. Such a network allows a concise
representation of probabilistic dependencies and is often used to model potential causal pathways.
We have used heuristic techniques that automatically induce a Bayesian Network that fits
observational data to suggest the likely dependencies in the data.
Bayesian Networks have been
widely used in analyzing such dependencies and the interplay between variables. In this work, we
-8-
have used Bayesian Networks to perform an exploratory analysis on a rich collection of data from
one large health cohort study, the Nurses' Health Study [1].
This thesis is organized into several major chapters. We first introduce the background of this
research in Chapter 2, including the Nurses' Health Study, a large health cohort study from which
we obtained the data, the tools we used to explore the data, a brief literature review of breast
cancer models and their evaluation, and a brief introduction to machine learning and Bayesian
Networks.
In Chapter 3, we describe the exploratory analysis performed on the first data sets we obtained,
including a proof of concept Bayesian Network, dependency analysis among clinical and
life-style variables and a small set of genotypic data, and a finding of association between clomid
use and breast cancer with specific estrogen receptor status.
Chapter 4 presents a group of risk predicting models for breast cancer based on the same data
used to derive a log-incidence model previously published by Colditz et al. We built the new
models using Bayesian Networks, an approach different from the original log-incidence model.
The models were evaluated on discrimination and calibration abilities. We also used
agglomerative information bottleneck clustering to filter the prediction results, and achieved
improved linear relationship between the predicted risks and observed risks.
Chapter 5 discusses issues we encountered in this work, summarizes lessons learned from the
research, and gives some possible future research direction from this work.
-9-
Chapter 2 Background
This chapter introduces the background of our research, including a brief introduction to breast
cancer and beast cancer models, Bayesian Networks and other machine learning techniques we
employed in our research, and the Nurses' Health Study data on which we based our study.
2.1 Nurses' Health Study
The Nurses' Health Study (NHS), established in 1976 by Dr. Frank Speizer and funded by the
National Institute of Health, is among the "largest prospective investigations into the risk factors
for major chronic diseases in women" [1]. Registered nurses were selected to be followed
prospectively because they were anticipated to be able to "respond with a high degree of accuracy
to brief, technically-worded questionnaires and would be motivated to participate in a long term
study" due to their nursing education [1]. A short questionnaire with health-related questions is
sent to the members every two years starting from 1976, and a long questionnaire with food
frequency questions is sent every four years starting from 1980. Questions about quality of life
were added to questionnaires since 1992. 33,000 blood samples were collected in 1989 and are
stored and used in case/control analyses. 2448 blood samples have been genotyped at 47 Single
Nucleotide Polymorphism (SNPs) sites [1][2].
With a follow-up rate exceeding 90% [3], the Nurses' Health Study has a very good longitudinal
record of phenotypes and clinical and life-style factors, including physical data, health status, life
styles, nutritional intake, family history, etc. Questionnaires were sent to the nurses every other
year, and the data available to this study are generally from 1976 to 2000, except for those who
passed away or left the study for other reasons.
The collected variables can be divided into different categories, and below is an incomplete but
indicative list (with the focus on breast cancer).
- 10-
*
general information: year of birth, month of birth, race, height, weight, geographic
location, education, parity (number of children), age at first birth, breast feeding history,
weight of heaviest child, father's occupation, marital status, husband's education, living
status, years worked in operating room, total number of years on night shifts, hours of
sleep, sleeping position, snoring, birth weight, breast fed in infancy, and handedness
*
clinically related information: smoking status, age at start of smoking, passive smoking,
body mass index, waist and hip measurement, age at menopause, type of menopause, post
menopausal hormone use, oral contraceptive use, breast cancer history of mother and
sisters, total activity score, menstrual cycle age, regularity of period, tan or sunburn, tan
or sunburn as a child, moles, lipstick use, breast implant, multi-vitamin use, social and
psychological characteristics
*
medical history: alcohol dependency problem, aspirin use, tagamet use, tubal ligation,
clomid use, tamoxifen use, talcum powder use, hip or arm fracture, TB test, elevated
cholesterol, heart disease, high blood pressure, diabetes, cancer report, lung cancer,
ovarian cancer, DES treatment, breast cancer diagnosis and type, estrogen receptor and
progesterone receptor test
*
diet information: different kinds of food intake
Many valuable medical findings have originated from this study, though contemporary machine
learning techniques (except logistic regression) were rarely employed. Therefore, we hope that by
exploring the data from NHS using modern machine learning tools we can find interesting new
medical evidence, e.g. develop a new breast cancer model, or learn helpful experiences from such
an exploratory analysis.
2.2 Breast Cancer Models
Breast cancer has been the most frequently diagnosed and the second most deadly cancer in
women in the US for many years. In 1998, breast cancer constituted about 30% of all cancers and
- 11 -
caused about 16% of cancer deaths in women. [1] Recently available statistics from the American
Cancer Society estimate that more than two hundred thousand newly diagnosed cases and more
than forty thousand deaths resulted from breast cancer in 2004. [6]
The effort to understand the risk factors of breast cancer, and one step further, to predict breast
cancer risks, can be traced back to the late 60's. [7-11] In early studies, it was first recognized that
age at menarche (the onset of menstruation), age at first birth, and age at menopause are three
major risk factors for breast cancer. Generally, early age at menarche and late age at menopause
are considered to be associated with higher risk of breast cancer, while early first full-term
pregnancy is associated with lower risk. Postmenopausal weight, family history, duration of
having a menstrual period, age, pregnancy history, and other risk factors were also investigated in
later studies. [ 13-21 ]
Starting from the early 80's, scientists started to build mathematical models that can predict a
probability, or risk, of getting breast cancer. Interestingly enough, most of these research efforts
fall into two groups. Some of them focused on the inheritance of breast cancer, building models
based on family history and/or genotype data. The others tried to build the model by combining
individual risk factors, mostly reproductive variables such as age at menarche, age at menopause,
parity, etc.
Ottman et al. published a simple model in 1983 that calculates a probability of breast cancer
diagnosis for mothers and sisters of breast cancer patients. [24] They used life-table analysis to
estimate the cumulative risks to various ages based upon two groups of patients from the Los
Angeles County Cancer Surveillance Program, then derived a probability within each decade
between ages 20 and 70 for mothers and sisters of the patients, according to the age of diagnosis
of the patient and whether the disease was bilateral or unilateral.
- 12-
Claus et al. developed a genetic model to estimate age-specific breast cancer risks for women
with at least one relative with breast cancer. [25-26] The data they used were from the Cancer and
Steroid Hormone (CASH) Study, a population-based, case-control study evaluating the impact of
oral contraceptives on the risk of breast cancer. The model derived risk estimates based on the
relative's age at diagnosis and the degree of relationship of the relative(s). Their results showed
that women with first-degree relatives who were diagnosed with breast cancer at early ages have
very high lifetime risks of breast cancer. They also suggested BRCA1 susceptibility for breast
cancer.
Inspired by the discovery of breast cancer susceptibility genes BRCA1 and BRCA2 between
1994 and 1995, risk models were also developed to predict the probability that an individual
might be a carrier of a mutant gene, either the known BRCA1 and BRCA2, or a hypothetical
unknown gene BRCAu.
Couch et al. examined families with at least two breast cancer cases for germline mutations in
BRCA1, and built a model with logistic regression, using average age at breast cancer diagnosis,
ovarian cancer history, and Ashkenazi Jewish ancestry as risk factors. [29] Shattuck-Eidens et al.
developed a similar model on a different group of families, but without the limitation to a family
history of breast cancer. [30] Using logistic regression, Frank et al. identified ovarian cancer,
bilateral breast cancer, and age of diagnosis for breast cancer before 40 as predictors for both
BRCA1 and BRCA2. [31] Parmigiani et al. developed a Bayesian model to evaluate the
probabilities that a woman is a carrier of a mutation of BRCA1 and BRCA2 using breast and
ovarian cancer history of first and second degree relatives as predictors. [33]
In the literature, there are many more research projects trying to identify carriers of a mutant gene
based on family history, than those trying to predict the risk of breast cancer using genotype data.
This is probably due to the high cost of genotyping and the unavailability of a reliable genotype
- 13 -
data set in past times. As indicated by Parmigiani, it cost $2400 to test for both BRCAI and
BRCA2 in 1997.[33] Recently, however, scientists started to build models to predict breast cancer
risk using BRCA1 and BRCA2. For example, Tyrer et al. published a model that incorporated
BRCA1, BRCA2, and a hypothetical low penetrance gene, as well as some personal risk factors.
[39] No doubt in the foreseeable future we will see more and more models using genotypic data.
In the other group using individual risk factors, Moolgavkar et al. proposed one of the earliest
risk prediction models in 1980, which predicts age-specific incidence of breast cancer in females,
based upon physiologic response of breast tissue to menarche, menopause, and pregnancy on the
cellular level. [18] They suggested a two-stage model that incorporates growth of breast tissues to
derive an age-specific incidence curve, which can explain with close quantitative agreement the
observed risk due to age at menarche, age at menopause, and parity in a combined data set
including data from Connecticut, Denmark, Osaka (Japan) Iceland, Finland, and Slovenia. This
"tissue aging theory" has been modified and extended in many later research projects.
Pike et al. proposed a quantitative description of "breast tissue age" based on age at menarche,
first full-term pregnancy, and age at menopause, which fits well a linear log-log relationship
between breast cancer incidence and age. [23] The Pike model assumes breast tissue aging started
from menarche at a constant f0, dropped after the first full-term pregnancy to another constant f,
and then decreased linearly from age 40, which they called the perimenopausal period, to the last
menstrual period, and finally kept constant at that level.
In 1989, Gail et al. proposed what is now called the Gail model, a breast cancer risk model
clinically applied today, based on the Breast Cancer Detection Demonstration Project (BCDDP).
[27] The relative risk of the original model was derived from a matched case-control subset from
BCDDP, using unconditional logistic regression on five risk factors: age, age at menarche, age at
first live birth, number of previous biopsies, and number of first-degree relatives with breast
- 14-
cancer. Women in BCDDP regularly received mammographic screening, so special caution must
be applied when applying this model to women who don't receive regular mammographic
screening to avoid risk overestimation or other inaccurate results. The model expresses the log of
the odds ratio for disease as:
logO(D: D)
= - 0.74948 + 0.0940 1m+ 0.52926b
+0.218631+ 0.95830n +0.0108 la
-0.28804b x a- 0.190811xn
where
m
=
age at menarche
b
=
number of previous breast biopsies
I
=
age at first live birth
n
=
number of first degree relatives who have breast cancer
a
=
1 if age>50, 0 otherwise
This equation suggests that breast cancer risks increase with older age at menarche, more number
of previous breast biopsies, older age at first live birth, more number of first degree relatives who
have breast cancer, and older age. It is worth noting that the model also compensated for the
covariance of two pairs of variables: number of previous breast biopsies and age, and the
covariance of age at first live birth and number of first degree relatives who have breast cancer,
using the last two terms with negative coefficients. That is, the increased risk of breast cancer due
to the two risk factors together in either of the above two pairs is less than the sum of the effects
of the two risk factors alone. In other words, there are interactions between the two risk factors in
either pair and their effects on breast cancer are dependent.
The equation is used to derive relative risk only, because it is trained on a case-control data set. In
order to predict absolute risks, a baseline risk must be established for a specific configuration of
the variables, for which the relative risk is 1. The absolute risks of women with different
configurations can then be calculated by multiplying this baseline risk with the relative risks
- 15 -
derived from the equation. This absolute risk is then projected to get long-term probabilities . The
Gail model has been widely used in breast cancer counseling and research subject screening.
Anderson et al. modified the original Gail model, or Gail model 1, to project the risk of
developing invasive breast cancer, and this model was referred to as model 2. [28] They used the
same model structure and risk factors, but derived the model parameters from the Surveillance,
Epidemiology, and End Results (SEER) Program of the National Cancer Institute (NCI) instead
of BCDDP, and included only invasive breast cancer cases.
Pathak and Whittemore fit a biologically motivated breast cancer incidence rate function to data
from published case-control studies conducted in different countries at high, moderate and low
incidence of breast cancer. [32] The data include 3,925 breast cancer cases and 11,327 controls
interviewed in selected hospitals in 1964-1968. The function parameters specify the dependence
of age-specific breast cancer incidence rates on age at menarche, age at menopause, occurrence
and timing of full-term pregnancies, and body mass. They reported three patterns: "1) Incidence
rates jump to a higher level after first childbirth, but then increase with age more slowly thereafter.
2) Rates increase with age more slowly after menopause than before. 3) Rates change
quadratically with body mass index among all women, although the main trend varies: Rates
decrease with body mass among premenopausal women in high-risk countries, but increase with
body mass in all other groups of women." [32]
In the late 90's, Colditz and Rosner developed a log-incidence model of cumulative breast cancer
risks to incorporate temporal relations between risk factors and incidence of breast cancer. [34, 35]
They evaluated reproductive history, benign breast disease, use of postmenopausal hormone,
weight, alcohol intake, and family history as risk factors, and derived the model based on the
Nurses' Health Study (NHS). Colditz et al. also modified this model to fit incidence data from
patients having breast cancer with specific estrogen receptor (ER) and progesterone receptor (PR)
- 16-
status, and the result suggested better discrimination ability on ER positive and PR positive
incidence than on ER negative and PR negative incidence. [36] Part of our research is based on
this model, and it will be introduced in more detail later.
2.3 Evaluation and Validation of Breast Cancer Models
After reviewing many breast cancer models from the literature, it appeared that almost all models
were developed in the following steps. The scientists designed a mathematical model based on
expert knowledge and known risk factors, selected a target population, then fit the model to the
data, in many cases using logistic regression, to derive model parameters, relative risks and
sometimes absolute risks. The natural question is: how well do these models perform in practice,
on general population data?
To answer this question, models need to be evaluated and validated. In machine learning, it is a
common practice to build the model on training set data, and then evaluate the model on a
separate test set. Methods such as cross-validation and leave-one-out are also popular. In the
breast cancer research described above, model validation did not receive as much attention as the
models themselves. The Gail model, however, as one of the most widely used models clinically,
was validated in a few studies.
Bondy et al. evaluated Gail model 1 in 1994 on a cohort of women who participated in the
American Cancer Society 1987 Texas Breast Cancer Screening Project and had a family history
of breast cancer. [57] They compared the observed ()
and expected (E) breast cancer incidence,
and found that Gail model 1 had a better performance among women who adhered to the
American Cancer Society mammographic screening guidelines (O/E = 1.12, 95% CI: 0.75-1.61)
than it did for those who did not adhere to the guidelines (O/E = 0.41, 95% Cl: 0.2 - 0.75). This
finding confirmed that the Gail model overestimates breast cancer risks for women not taking
annual mammnographicscreening. They also employed the Hosmer-Lemeshow Goodness-of-fit
- 17-
test, which did not find an overall lack of fit between the observed and expected number of breast
cancers, despite the overestimation result. (When a goodness-of-fit test gives no significant lack
of fit result, it does not prove nor guarantee a good fit. Actually this test can only prove lack of fit,
but it is always helpful to report a null finding of this test.)
About the same time, Spiegelman et al. evaluated Gail model 1 on another cohort of women from
the Nurses' Health Study, showing that the model overestimated risk among premenopausal
women, women with extensive family history of breast cancer, and women with age at first birth
younger than 20 years. [58] They also evaluated the model using the correlation coefficient
between observed and predicted risk, which was 0.67.
Costantino et. al evaluated both Gail model 1 and Gail model 2 on data from women enrolled in
the Breast Cancer Prevention Trial. [38] They compared the ratio of expected to observed number
of breast cancers, and the result showed better performance by model 2 than model 1, which
underestimated breast cancer risk in women more than 59 years of age.
Rockhill et al. evaluated Gail model 2 both on goodness of fit and its discriminatory accuracy
using the Nurses' Health Study data. [59] They evaluated goodness of fit by comparing the ratio
of expected to observed number of breast cancers, and evaluated discriminatory accuracy using
the concordance statistic (i.e. C-index, equivalent to Area Under ROC curve). They also
compared the highest and lowest deciles of relative risks derived from Gail model 2 to get a range
of discrimination of the model. They reported that the model fit well in the sense of predicting
stratified breast cancer incidence, but with modest discriminatory accuracy.
Recently, Gail et al. published a paper on criteria for evaluating models of absolute risk, showing
that the community is now paying more attention to model risk prediction evaluation. [121] In
their paper, Gail summarized general criteria for assessing absolute risk models, including
- 18-
calibration, discrimination, accuracy, and proportion of variation explained.
For calibration evaluation, they cited goodness-of-fit statistics and comparison of expected
number of cases with observed number of cases from [38] and [59], and also mentioned the Brier
statistic, or
ean squared error, which is a combined measurement of calibration and
discrimination. For discrimination evaluation, they reviewed area under the receiver operator
characteristics (ROC) curve, which has been widely used, and the Lorenz curve, which is more
frequently used in economics research. The Lorenz curve describes the relationship between the
proportion of disease cases and the proportion of population that has a risk up to a specific risk
value, and hence measures the concentration of risk populations. For rare diseases, the Lorenz
measurement is approximately the same as the ROC curve. They also mentioned the
measurement of proportion of variation explained using entropy and fractional reduction in
entropy.
2.4 Breast Cancer and Genomrics
The human genome has roughly 3 billion base pairs (bp), and now is estimated to have 30,000 to
40,000 genes [74]. It has been a well-known fact that most of the base pairs in the genome
sequence are identical across the population, while approximately one out of a thousand base
pairs will be different when comparing the genome sequences from two different persons. Such
differences, or polymorphisms, are used as markers in gene mapping and linkage analysis.
Genetic polymorphism is defined as "the occurrence of multiple alleles at a locus, where at least
two alleles appear with frequencies greater than 1 percent," or a heterozygote frequency of at
least 2 percent [81]. Commonly used polymorphisms include restriction fragment length
polymorphism (RFLP), variable number of tandem repeats (VNTR), microsatellites, and single
nucleotide polymorphisms (SNPs).
- 19-
A SNP is a sequence polymorphism differing in a single base pair, and is the most common type
of polymorphism. SNPs can occur in the gene coding regions, while other types of
polymorphisms mainly occur in non-coding regions. Such properties make SNPs the best marker
for gene mapping and linkage analysis, because they are common (about 1 per thousand base
pairs) and can be very close to the genes and mutations of interest.
SNPs have been reported from many research groups [82-85], while an accumulated public SNPs
database (roughly 1.2 million SNPs) can be found on line at several websites, including the
ENSEMBL (http://www.ensembl.org), NCBI (National Center for Biotechnology Information;
http://www.ncbi.nlm.nih.gov), and TSC (http://snp.cshl.org) websites. The estimated total number
of SNPs in the human genome may be over 10 million [86], perhaps as many as 30 million [87].
Breast cancer has long been known to be related to family history, and therefore, to be a
hereditary disease. Two relatively high penetrance genes, BRCA1 and BRCA2, have been
identified [88-90], but they do not account for all hereditary breast cancers. At least one
susceptible area of another major breast cancer gene has been proposed [91-92], while quite a few
polymorphisms have been investigated, including rare genetic syndromes associated with
increased breast cancer risks and low penetrant breast cancer susceptibility genes. These
suspected genes include proto-oncogenes (HRAS1), metabolic pathway genes (NAT1, NAT2,
GSTM1, GSTP1, GSTT1, CYPlA1,
and CYPlB1), estrogen pathway genes (CYP17 and
CYP19), estrogen receptor gene (ER), progesterone receptor gene (PR), androgen receptor gene
(AR), and many other genes (COMT, UGT1A1, HLA, TNF[alpha], HSP70, HFE, TFR, VDR,
and VPC) [93].
However, convincing results are hard to achieve. Some of the studies show positive association of
one polymorphism with breast cancer while others show mild or no association of the same
polymorphism. For example, Helzlsouer [94] and Charrier [95] suggested positive association
- 20 -
between GSTM1 and breast cancer risk, while Ambrosone [96] and Garcia-Closas [102] showed
evidence against this hypothesis. Another example is that Kristensen [97] reported increased
breast cancer risk with CYP19, and Haiman's reports [98-99] find no evidence of positive
association. The current findings need to be treated with caution, and further studies are necessary
to make a definitive conclusion.
The lack of affirmative results is possibly due to the fact that breast cancer is a complex trait.
Only a relatively small portion of breast cancer is hereditary, and a considerable part of these
familial cases are involved with BRCA1 and/or BRCA2, whose strong association may shadow
possible weak associations with other genes. In addition, there may be multiple genes interacting
with each other or interacting with clinical and life-style factors leading to the incidence. All these
issues make it difficult to reach a convincing conclusion.
Some of the studies examined the combinations of a few polymorphisms. Bailey et al. examined
CYPlA1, GSTM1, and GSTT1, reported no significant association with breast cancer risk of
these polymorphisms individually or combined [100]. Huang et al. examined CYP17, CYPlA1,
and COMT, and reported that COMT genotype has a significant association with breast cancer,
either individually or combined with CYP 17 and CYP 1Al, and CYP 17 and CYP lA 1 play a
minor role in the association [101]. Garcia-Closas et al. evaluated the association between
GSTM 1 and (GSTT1 gene polymorphisms and breast cancer risk, and provided evidence against a
substantially increased risk of breast cancer associated with GSTM1 and/or GSTT1 homozygous
gene deletions [102]. These combined investigations are very few in number compared to the
abundant number of studies of a single polymorphism.
Some of the studies investigated certain polymorphisms and a few clinical and life-style factors.
Ishibe et al. evaluated the associations between the CYPlA1 polymorphisms and breast cancer
risk, as well as the potential modification of these associations by cigarette smoking, and report a
-21 -
suggestive increase in breast cancer risk among women who had commenced smoking before the
age of 18 and had the CYPlA1-MspI variant genotype compared to nonsmokers who were
homozygous wild type for the polymorphism [103]. Millikan et al. examined the effects of
smoking and N-acetylation genetics on breast cancer risk, and reported little evidence for
modification of smoking effects according to genotype, except among postmenopausal women
[104]. Hunter et al. assessed the relation between NAT2 acetylation status and breast cancer risk,
and its interaction with smoking, and reported that cigarette smoking was not appreciably
associated with breast cancer among either slow or fast NAT2 acetylators [105].
Clinical and life-style factors other than smoking are also investigated in some studies. Gertig et
al. examined the associations between meat intake, cooking method, NAT2 polymorphism and
breast cancer risk, and observed no significant association between meat intake, NAT2, and breast
cancer risk, therefore suggesting that heterocyclic amines produced by high-temperature cooking
of meat and animal protein may not be a major cause of breast cancer [106]. Hines et al.
calculated relative risks and confidence intervals to assess breast cancer risk for ADH3 genotype
and alcohol consumption level, and suggested that the ADH3 polymorphism modestly influences
the response of some plasma hormones to alcohol consumption but is not independently
associated with breast cancer risk and does not modify the association between alcohol and breast
cancer risk [107]. Haiman et al. assessed the association between the A2 allele of CYP17 and
breast cancer risk, and observed that the inverse association of late age at menarche with breast
cancer may be modified by the CYP17 A2 allele through endogenous hormone levels [108].
Polymorphisms related to breast cancer also have been studied together with family history and
ethnic groups, respectively [ 109-1 1 ].
Limited studies examined more polymorphisms and some clinical and life-style factors. For
example, Haiman et al. studied 10 polymorphisms in 8 genes and two clinical and life-style
factors (menopausal status and postmenopausal hormone use), and suggested some interaction
- 22 -
between UGTlAl
genotype and menopausal status [112], which can possibly be modified by
postmenopausal hormone use. These previous successful examples imply that we need to put
genotypic data, phenotypic data, and clinical and life-style factors together to explain the complex
traits.
The Nurses' Health Study provides an abundant collection of clinical and life-style data as well as
personal risk factors for breast cancer, and recently a nested case-controlled group of nurses were
genotyped for a collection of SNPs in suspected gene areas. Combing these data together, we
have a chance to look for gene interaction and clinical and life-style contributors, and explore
their relationships from an overall point of view.
2.5 Machine Learning Techniques
Empirically, machine learning techniques are computer algorithms that attempt to find the best of
a class of possible models and to tune its parameters so as to best fit a data set of interest
according to specified criteria. When the learning process is completed, and if the model's
performance is reasonably good, the model and its parameters will reveal, at least to some extent,
the intrinsic structure of the data set.
Machine learning techniques can be divided into two major groups: supervised learning and
unsupervised learning. Supervised learning will try to relate the variables in the data set to one
specific variable, the class variable, and therefore disclose which variable or variable
combinations in the data set are best predictors for the class variable. For example, we can put
polymorphism and phenotype data together to make a data set, and use this data set to train a
model using one of the supervised learning algorithms. If we pick a variable of interest, say,
breast cancer as the class variable, the trained model will try to find the best predictors, maybe
one or more of the polymorphism or phenotype variables, or maybe a combination of certain
variables (which will require extensive work if using traditional statistical methods to obtain).
- 23 -
Unsupervised learning doesn't have a class variable, and it doesn't focus on the associations to
any specific variable. Rather, it explores the associations among all the variables and tries to
group them according to their own dependencies. If we use the above hypothetical data set to
train an unsupervised model, it will try to find the relationships and possible interactions among
all the polymorphisms and phenotypes, including breast cancer, but not exclusively.
Both supervised and unsupervised machine learning are non-hypothesis driven processes. When
training the model, the algorithm will automatically search the hypothesis space and try to find a
best fit. Hence, such techniques can examine a large number of hypotheses in one single run and
save researchers' labor.
The limitation of machine learning is computation power and the problem of overfitting. One
critical component of a machine learning algorithm is the model. If the model is not flexible
enough, the algorithm doesn't have the power to search a large enough hypothesis space, and thus
may miss important possible models. If the model is too flexible, the hypothesis space will be so
large that it is impractical to search the whole space. In such situations, we need to determine a
search strategy to make good use of the available computation power.
In many applications, the available data set has a very limited sample size compared to the
number of variables, thus there will be many hypotheses that can fit the data, causing the
overfitting problem. Overfitting can be checked by using a separate testing data set to confirm the
trained model. Another way to avoid overfitting is to reduce the number of variables, usually
referred to as feature selection.
Bayesian Network
Induction of Bayesian networks (BN) is one of the most popular machine
learning techniques. Bayesian networks can explore the relationships or dependencies among all
- 24 -
variables within a data set, not restricted to pair-wise models of interactions, and therefore can
describe and assess complex associations and dependencies of multiple variables. A Bayesian
network (BN) is a directed acyclic graph (DAG), in which the nodes represent statistical variables
and the links represent the conditional dependencies among the variables. By looking at the
network, one can easily tell the underlying relationships of the variables within the data set,
which are now represented by the links connecting the variables. A brief introduction to Bayesian
networks will follow in the next section.
Clustering
Clustering is an unsupervised machine learning technique, which has been widely
used in functional genomics. A clustering algorithm tries to group data into clusters based on
certain similarity or distance measurements, and to find natural or intrinsic partitions in the data.
It may involve finding a hierarchic structure of partitions (a cluster of clusters).
Support Vector Machine
Support vector machine (SVM) is a supervised learning algorithm. A
support vector machine tries to maximize the discrimination "margin" between the samples with
different class labels. Usually only relatively few samples are at the borders or intersections
between different classes, and these samples alone will determine the margin, so they are called
"support vectors." SVM uses a kernel function (which can be linear, polynomial, or radial) to find
the margin and support vectors.
Nonlinear kernels permit the representation of complex
discrimination boundaries.
Logistic Regression
Logistic regression (LR) is one of the most commonly used machine
learning algorithms in medical applications. The underlying model for logistic regression is two
Gaussian distributions with equal covariance, and LR tries to fit the data to this model using
maximum likelihood criteria. An additional advantage of logistic regression, which probably is
the reason that it is popular, is that a trained LR model gives weights for each variable in the form
of a likelihood ratio, so that the importance of the correlation of each variable with the class
- 25 -
variable is clearly represented.
Classification Tree A classification tree fits the data with a hierarchical tree structure. At each
branch point, the algorithm performs a test on one variable to decide on which branch to continue.
The leaves are labeled by the values of the class variable. In the learning processes, the algorithm
learns what test (on which variable) to perform, and which label to give each leaf, based on a
local optimization of information gain. One advantage of the classification tree method is that the
tree structure is a convenient representation of knowledge.
Naive Bayes
Naive Bayes is a relatively simple classifier based on probability and
independence assumptions. It assumes all variables except the class variable are conditionally
independent given the class variable. Thus one can calculate the joint distribution by multiplying
the marginal distributions of every variable using Bayes' rule. Empirically, even though the
independence assumptions don't stand, this algorithm works surprisingly well in many
applications.
Ensemble Classifiers
Above are a few examples of common machine learning algorithms.
Many other algorithms have been developed and tried in various applications, such as genetic
algorithm, kernel density methods, K-nearest neighbors, etc. In order to get more robust
performance, sometimes multiple algorithms or multiple classifiers with the same algorithm are
combined to form a ensemble classifier. Ensemble classifiers often have better performance but
require more computation power. Examples include stacking and bootstrapping.
Stacking is a combination of classifiers based on different algorithms. Each classifier is trained on
the same training set, and gives its own output. The output of these classifiers and the class
variable together constitute a meta data set. A further classifier is trained on this meta data set and
the output of this final classifier becomes the final output. [64]
- 26 -
Bootstrapping, or bagging, is an ensemble algorithm that can increase stability and reduce
variance. Given a training set D of size N. We creates M new training sets also of size N, each
uniformly re-sampled from D with replacement. M classifiers are trained on these M training sets,
and their output are combined by voting. [65]
Random forest is an example of an ensemble classifier that consists of many classification trees,
each generated using a small portion of the variables randomly picked from all variables. Every
tree is trained on a bootstrapping of the training set and not pruned. The classification result of the
random forest is the vote of all trees. [66] Such voting can be a plurality vote, or can be a
normalized sum of the probability outputs of the trees.
2.6 Bayesian Networks
Bayesian Networks (BN) are Directed Acyclic Graphs (DAG) describing dependency structures
among variables. [69] A Bayesian Network uses arrows, or links, to depict the dependency
relationship between variables, or nodes. An arrow pointing from variable A to variable B means
variable B is dependent on variable A, and vice versa. When there is a link pointing from A to B,
A is a parent of B, and B is a child of A. Using these graphical symbols, Bayesian Networks
visualize conditional independency structures in such a way that the probability of a variable can
be fully described by the probability of its parents, and its conditional probability table given the
parents.
- 27 -
CB(-
CBrs
(a)
(b)
(c)
A
(f)
(e)
(d)
Figure 1 Bayesian Network structures of three variables
Six very simple Bayesian Networks are shown in Figure 1 for illustration. In each of the six
figures, there are three nodes representing three random variables A, B, and C. In Figure 1a, the
three variables are mutually independent, as completely separated nodes, which corresponds to
P(ABC) = P(A)P(B)P(C). In Figure lb, two variables A and B are dependent, while variable C is
independent of A or B. In Figure
c, all three variables are dependent on each other,
corresponding to the general representation of a joint probability distribution P(ABC) =
P(CIAB)P(BIA)P(A).
Conditional independence is a very important concept in Bayesian Networks. In Figure d, C is
dependent on B, which in turn depends on A. A and C becomes conditional independent, or
d-separated, given B. The joint distribution can be written as P(ABC) = P(CIB)P(BIA)P(A). In
Figure le, B and C are d-separated by A and represent a similar conditional independence
structure.
Figure If shows a different situation where A and B are independent without knowing C, but
- 28 -
becomes dependent given C.
In a Bayesian Network, the set of variables including all parents of variable X, all its children,
and all parents of its children is call the Markov blanket for variable X. The Markov blanket will
d-separate X rom all other variables. In other words, X is conditionally independent of all other
variables given the variables in the Markov blanket.
Bayesian Networks have been widely used to represent human expert knowledge and for
probabilistic reasoning. In early applications, Bayesian Networks were constructed by asking
human experts the dependencies among the variables of interest as well as the conditional
probabilities. People have also tried to learn Bayesian Networks directly from data. Current
Bayesian techniques to learn the graphical structure of a Bayesian Network from data are based
on the evaluation of the posterior probability of network structures [72]. Searching the space of
possible network structures has been shown to be NP-hard, [70] but many approximation methods
have been developed. We used Bayesware Discoverer as a Bayesian Network learning tool, which
applies a greedy search approach. [74][75]
By learning a Bayesian Network from the data, we can obtain a landscape view of the variables
and study the interactions among them, as well as an opportunity to discover conditional
independency structures that would be overlooked otherwise.
An important difference between a Bayesian Network constructed from human expert knowledge
and a Bayesian Network learned from data is that the former often represents causal relationships,
while the latter represents conditional independency structures, not necessarily causal or
chronological. In statistical and artificial intelligence applications, however, it is a common
objective to find causal interpretations of the observed data. Therefore, when learning a Bayesian
Network from data with causal interpretations in mind, the variables shall be ordered in such a
- 29 -
way that respects the direction of time and causation. When such ordering is infeasible due to
lack of knowledge or other concerns, the learned Bayesian Network must be interpreted with care,
and no causal relationship shall be suggested without further investigation.
2.7 Bayesware Discoverer
"Bayesware Discoverer is an automated modeling tool able to transform a database into a
Bayesian network, by searching for the most probable model responsible for the observed data."
[73] Discoverer represents Bayesian Networks with a graphical interface as nodes and directed
edges, and integrates convenient analysis tools to manipulate the network and variables, as well
as to display useful information such as conditional distribution tables and Bayes Factors (log
likelihood ratios between possible parent sets of a variable). In most of the exploratory analysis of
this work, we employed Discoverer as the major tool.
Discoverer can learn both the structure and parameters (or the parameters of a given structure)
from the data, based on the K2 algorithm proposed by Cooper et al. in 1992. [74][75] As is
necessarily typical of computer programs that approximate NP-hard problems by heuristic
techniques, Discoverer runs efficiently enough to be useful, but cannot explore the vast space of
all possible hypotheses. Therefore, the network it chooses to fit a set of data may not, in fact, be
the most likely ("best") network.
Discoverer searches the space of possible conditional dependency structures using a greedy
approach based on a list of ordered variables. When searching for the optimal parent set of a
variable X, Discoverer uses a greedy approach, i.e. compare the log-likelihood score of a current
parent set (start from empty set) and that of the current set plus any additional candidate parent,
and update the current set with the additional parent if the score is higher, and then repeat this
process until no more parent can be added. The ratios between the log-likelihood score of this
final parent set and the other possible parent sets are called Bayes Factors (BFs), and a high BF
- 30 -
generally suggests that the final parent set is much more probable than others.
BF's are often
astronomically large, because even a slightly better-fitting model can make a large data set
enormously more likely.
Consequently, a BF that appears large, say 100, may not be very
significant.
Discoverer uses the ordering of the variables to guarantee the acyclic constraint in a way such that
only variables ordered before variable X are eligible as a candidate parent for variable X. We
sometime put the variables in temporal order, hoping that the found dependencies will be
consistent in temporal succession and thus could possibly reveal causal relationships, while
sometime we order the variables to emphasize certain dependencies based on expert knowledge
and known causal relationships. More details will be discussed in later chapters.
-31 -
Chapter 3 Landscaping Clinical and
Life-style and Genotypic Data
In this chapter, we review our attempt to explore the dependencies among a group of clinical and
life-style variables, personal risk factors, and a few Single Nucleotide Polymorphisms (SNPs)
using Bayesian Networks. The data are drawn from the Nurses' Health Study. We also describe
the dependency between breast cancer and clomid use, a possible new risk factor that we found in
the exploratory analysis.
3.1 Landscaping Clinical and Life-style Variables and SNPs
Breast cancer has long been known to be related to family history, and therefore, to be a
hereditary disease. The recently identified BRCA1 and BRCA2 mutations are responsible for
only part of hereditary breast cancer. There could be a "BRCA3," or some modifier genes and
clinical and life-style contributors. Most of the previous research on breast cancer has focused on
only one or a few genes, and rarely considered the influence of clinical and life-style factors. The
lack of definite results from past research illustrates that breast cancer is a complex trait and
therefore exploring the susceptible genes and possible contributing clinical and life-style factors
systematically can be a fruitful alternative to evaluating the genes individually and separately.
NHS provides an abundant collection of clinical and life-style data as well as personal risk factors
for breast cancer, and recently a nested case-control group of nurses were genotyped for a
collection of SNPs in suspected gene areas. Combing these data together, we had a chance to look
for gene interactions and clinical and life-style contributors, and explore their relationships from
an overall point of view.
- 32 -
3.1.1 Data
The genotypic data we had is provided by Ms. Rong Chen in the Channing Lab, including 12
SNPs picked from all SNPs genotyped in the nested case-control study by Dr. David Hunter in
the Channing Lab. These SNPs are:
*
atml,
atm2, atm3, atm4, and atm5: hyplotype-tagging
SNPs in ATM (ataxia
telangiectasia mutated protein) gene
*
ephxl and ephx2: non-synonymous SNPs in epoxide hydrolase
*
vdrbsmil 1 and vdrfokl: SNPs that make Restriction Fragment Length Polymorphisms in
the vitamin D receptor gene
*
xrcc3.466, xrcc3471, and xrcc3472: SNPs in XRCC3 (X-ray repair complementing
defective repair) gene
The data contain 1007 incident breast cancer cases and 1416 controls. The cases were selected
from breast cancer cases diagnosed by June 1, 1998, excluding any other prior cancer diagnosis
except for non-melanoma skin cancer. The controls were matched on year of birth, menopausal
status, PMH use at time of blood draw (1989), and time of day, time in the menstrual cycle and
fasting status at the time of blood draw.
The clinical and life-style data we had include 97 variables, manually selected among the
thousands of variables in NHS by Dr. Graham Colditz, Dr. Karen Corsano, and Ms Lisa Li from
the Channing Lab (different from the data in the example given at the beginning of this chapter).
The variables range from general information, such as race and age, to very specific information,
e.g. usage of specific medications. It also includes suspected or known risk factors for breast
cancer such as parity and menopausal status. A list of the variables and their meanings can be
found in Table 1. These data records are from various years (many from 1990), and most of them
don't include temporal information.
- 33 -
Table 1 Variable list for exploratory analysis
Name
Meaning
actmet
Met score of activity measurement
adep
Alcohol dependent problem
agedx
Age of diagnosis of breast cancer
agefb
Age of giving birth to first child
agesmk
Age started smoking
amnp
Age at menopause
asprin
Whether takes Aspirin
ball
Sunburn overall
bback
Sunburn on back
bbd
Benign breast disease
bface
Sunburn on face
bfeed
Whether was breast fed during infancy
blimb
Sunburn on limbs
bmi
Body Mass Index (BMI)
bp
SF-36 pain index
brfd
Breast fed children
brimp
Breast implant
bwt
Birth weight
case_ctrl
Breast cancer
chol
High cholesterol
clomid
Clomid (a fertility drug) use
crown
Crown-crisp scores, a measure of neurotic symptomatology
dadocu
Occupation of father
db
Diabetes
des
DES (a fertility drug) use
dmnp
Menopause status
dtdth
Date of death
dtdx
Date of birth
durmtv
Durartion of taking multi-vitamin
duroc
Duration of oral contraceptive use
durpmh
Duration of post menopausal hormone use
edu
Education level
era
Estrogen receptor status of breast cancer
fhxbr
Family history of breast cancer
frac
Fracture history
hand
Handedness
hbp
High blood pressure
- 34 -
Name
Meaning
hdye
Hair dye
height
Height
hip
Hip measurement
hrt
Heart disease
hsleep
Hours of sleep
husbedu
Education level of husband
id
ID number of the nurse within NHS
insi
Whether breast cancer is insitu
inva
Whether breast cancer is invasive
kidbwt
Birth weight of the heaviest child
kidsun
Sunburn as a kid
kidtan
Suntan as a kid
kpasmk
Passive smoking as a kid
lipst
Lip stick use
live
Living status (alone or with someone)
lungca
Lung cancer
marry
Marriage status
mh
SF-36 mental health index
mnp
Menopausal status
mnty
Menopausal type
moles
Amount of moles
nod
Whether lymph nodes present at breast cancer diagnosis
ocuse
Oral contraceptive use
oprm
Worked in operating room
ovca
Ovarian cancer
packyr
Total pack-year of smoking
parity
Number of children
pasmk
Passive smoking
Pctl0
DIAGRAMof body size at age 10
Pct2O
DIAGRAM of body size at age 20
Pct30
DIAGRAMof body size at age 30
Pct40
DIAGRAM of body size at age 40
Pct5
DIAGRAM of body size at age 5
pctfa
DIAGRAM of body size of father
pctma
DIAGRAM of body size of mother
pctnow
DIAGRAM of body size in 1988
pmh
Post menopausal hormone use
pra
Progesterone receptor status of breast cancer
race
Race
-35 -
Name
Meaning
re
SF-36 role-emotional index
regpd
Regularity of period
rp
SF-36 role-physical index
sf
SF-36 social functioning index
shift
Worked on night shift
sipos
Sleeping position
smkst
Smoking status
snore
Whether snore when sleeping
st1 5
State living at age 15
st3O
State living at age 30
state
State currently living
stborn
State born
tagamet
Tagamet (antiacid) use
talcum
Talcum powder use
tamox
Tamoxifen (a drug used in breast cancer treatment and prevention)
tb
Tuberculosis
tubal
Tubal ligation
tv
Time watching TV
vt
SF-36 vitality index
waist
Waist measurement
work
Working status
3.1.2 Exploring the SNPs and Breast Cancer
In the first step, we learned a Bayesian Network from the SNPs data plus a variable marking
breast cancer cases or controls. We used all available data, and imputed the missing values using
K-nearest neighbor method. This network is shown in Figure 2
- 36 -
Figure 2 Bayesian Network of 12 SNPs and breast cancer
The Bayesian Network shown in Figure 2 didn't find any dependencies between the SNPs and
breast cancer (the BF between breast cancer having no dependent variable and having one
dependent variable is 97, suggesting no dependencies with a high confidence), though it did show
dependencies among the SNPs from the same genes (atml-atm5, xrcc3 SNPs). We also tried
removing all missing values, but that result didn't show any dependencies between the SNPs and
breast cancer, either.
Considering the fact that some breast cancer may not be genetically related, and that breast cancer
genes generally have low penetrance, we removed from the data breast cancer cases without a
family history of breast cancer and controls with a family history. We hoped to reduce the noise
caused by the complexity of breast cancer as a hereditary disease with complex traits, and to
increase the chance to find the dependency between the SNPs and breast cancer, if there is any.
After such manipulation, we got a sub set with 152 cases and 1274 controls. The Bayesian
Network learned from this sub set has the same structure as in Figure 2, showing no dependencies
between the SNPs and breast cancer.
- 37 -
While SNPs (or genes) alone are independent of breast cancer, there might be some clinical and
life-style factors that can "turn on" these genes. To explore such possible interactions, we need to
study the SNPs and the clinical and life-style variables together.
3.1.3 Exploring Clinical and Life-style Factors and SNPs
In the second step, we put the SNPs data together with the clinical and life-style variables to
construct a combined data set. This data set contains 96 variables, 12 SNPs and 84 clinical and
life-style variables. 13 out of the 97 clinical and life-style variables were not used because of
different reasons. Some variables were not included because they were not applicable for this
analysis, such as id, and date of death (dtdth). Some variables were combined with other variables
to reduce the search space, including pack year (packyr) combined with smoking status, date of
diagnosis (dtdx) combined with age of diagnosis, duration of oral contraceptive use (duroc)
combined with oral contraceptive use, duration of post menopausal hormone use (durpmh)
combined with post menopausal hormone use, and menopause status (mnp) combined with
menopause type (mnty). Some variables are characteristics of breast cancer, and thus not included
in the Bayesian Network learning, but were used to select cases in a later study. These include
estrogen receptor (era), progesterone receptor (pra), in-situ (insi), invasive (inva), and nodular
(nod). One variable (kidbwt, birth weight of heaviest child) was removed because of too many
missing values.
We used all records and imputed the missing values using K-nearest neighbor method. The
variables are ordered in a way such that closely related variables are grouped together, while we
tried to make the overall order consistent with temporal order, as in the example we discussed at
the beginning of this chapter. The detailed order of variables is listed in Table 2.
- 38 -
Table 2 Variable orders for exploring clinical and life-style factors and SNPs
#
Variable
#
Variable
#
Variable
#
Variable
regpd
1
atml
25
marry
49
hdye
73
2
atm2
26
husbedu
50
moles
74
amnp
3
atm3
27
pasmk
51
kidsun
75
dmnp
4
atm4
28
oprm
52
kidtan
76
mnty
5
atm5
29
shift
53
bback
77
pmh
6
ephxl
30
pctma
54
bface
78
hbp
7
ephx2
31
pctfa
55
blimb
79
chol
8
vdrbsmrn11
32
pct5
56
ball
80
db
9
vdrfokl1
33
pct10
57
asprin
81
tb
10
xrcc3466
34
pct20
58
tagamet
82
hrt
11
xrcc3471
35
pct30
59
talcum
83
frac
12
xrcc3472
36
pct40
60
tamox
84
bbd
13
race
37
pctnow
61
clomid
85
lungca
14
dadocu
38
height
62
des
86
ovca
39
waist
63
tubal
87
actmet
15
stborn
16
st15
40
hip
64
ocuse
88
rp
17
st30
41
bmi
65
parity
89
bp
18
state
42
sipos
66
agefb
90
vt
19
fhxbr
43
snore
67
brfd
91
sf
44
lipst
68
adep
92
re
20
kpasrnk
21
bwt
45
brimp
69
work
93
mh
22
bfeed
46
agesmk
70
live
94
crown
23
hand
47
smkst
71
hsleep
95
agedx
24
edu
48
durmtv
72
tv
96
casectrl
The Bayesian Network learned from these 96 variables is shown in Figure 3. The majority of the
graph is occupied by the clinical and life-style variables, while the SNPs are at the upper right end
of the graph, marked by the circle.
- 39 -
Figure 3 Bayesian Network for interactions among clinical and life-style variables and SNPs
This network presents a landscape of the clinical and life-style variables and describes their
interactions. It also shows, again, the dependencies among the SNPs from the same genes.
However, this result still fails to show any dependencies between the SNPs and breast cancer, or
any other phenotype variables. In later studies, we also tried to limit the data to specific breast
cancer type, estrogen receptor positive and progesterone receptor positive, and those results show
no dependency between breast cancer or other clinical and life-style variables and the SNPs,
either.
Because we were also looking for any variable whose presence would make breast cancer and
any SNPs become dependent, we tried different variable orders to facilitate such searching as
well (e.g. putting breast cancer and SNPs on top of all other variables), but did not find any such
dependencies, either.
- 40 -
In our attempt to explore the relationship between breast cancer and a small collection of SNPs
from different genes, we didn't find any dependency that links a SNP and any clinical and
life-style variable, including breast cancer. There are a few possible reasons. It could be that the
selected SNPs are truly not associated with breast cancer, as suggested by the null finding with a
reasonable amount of effort to look for dependencies. It could also be because one or more SNPs
in our study are associated with a mutant gene that does contribute to breast cancer, but only
triggered by certain clinical and life-style factors that are not included in our study. Considering
the wide range of variables included in this study, this is not very likely, and even if it is true, the
association must be weak or the triggering clinical and life-style factor must be rare in daily life,
or we would have found a link between that SNP and breast cancer, at least. Another possibility is
that the association between SNPs and breast cancer is very weak, and covered by the random
noise.
Considering the size of the study, both in terms of the number of cases and variables, we consider
the first assumption above, that the selected SNPs are not associated with breast cancer, a
reasonably plausible result of this work.
3.2 Estrogen Receptor Status and Clomid
During the exploration process in landscaping the clinical and life-style variables, we noticed that
in many of the Bayesian Networks we learned, there is a dependency between history of using
clomid (a fertility drug) and estrogen receptor status of breast cancer, as shown in Figure 4. (Only
part of the network is shown for a clear look.) This network is built on the data set that we used
for landscaping, thus matched on age and menopausal status for breast cancer. After removing
records with missing values on more than 12 variables, the data set used to learn this Bayesian
Network contains 514 ER+ breast cancer nurses, 151 ER- breast cancer, and 834 non-breast
cancer.
-41 -
~
~
I
/
/
I
G
Figure 4 Clomid use and breast cancer with ER status
In the graph, the conditional
distribution
table of ER status of breast cancer (Era) is also shown.
From this table, we see that clomid use is associated with lower ER- breast cancer incidence
(0.316 vs. 0.559) and higher ER+ breast cancer risk (0.675
VS.
0.339).
Other variables shown in the graph are the Markov blanket of clomid, which d-separates clomid
from the rest of the graph. Among these variables, clomid use is positively associated with breast
implant
(Brimp) and negatively
associated
with number of children (Parity), high cholesterol
(Chol), post menopause (Dmnp), and retirement (Work).
The association with number of children can be easily understood:
nurses with more children are
less likely to take fertility drugs. Noting the fact that the data are age matched for breast cancer,
not clomid, most other associations
younger
generation
could probably be explained by the greater use of clomid in a
of nurses than among older ones, because when clomid became clinically
- 42 -
available in 1968, [71] older nurses in NHS were already past their reproductive
age. Figure 5
below illustrates this distribution over age for nurses who took or never took c1omid. In the graph,
we can see the birth years of nurses
who didn't
take clomid were approximately
evenly
distributed from 1921 to 1946, while most of the nurses who took c10mid were bom after the mid
30's.
10 c10mid = no _ clomid
= yes
I
0.14
0.12
I:
-----
0.1
o
'.;:I
(ll
"[ 0.08
o
....0o
I:
0.06
.2
tl
(ll
...
.... 0.04
I
0.02
!
f
II .. II •
o
~
V
~
~
~
I •
~
-
-
I. I (I
0/
~
,J
~
~
~
~
~
~
birth year
Figure 5 Comparison
of birth year between nurses who used and never used clomid
It is also worth noting that the c10mid data were collected in 1992 and 1994, when most of the
nurses in NHS passed their reproductive
clomid after that data collection
age, so it is unlikely that any nurses would start taking
point, and therefore,
there should be no bias introduced
by
censoring effects on c10mid data.
For the dependency
between clomid use and ER status of breast cancer, it cannot be explained by
- 43 -
age or generation difference, because the data were age-matched for breast cancer. To further
investigate this dependency, we used the full cohort data, excluding those missing clomid or ER
status.
Before looking at the statistics, we first checked the date of taking clomid and the date of
diagnosis for breast cancer, and found that no clomid was taken after the date of diagnosis for
breast cancer, while 10 nurses reported clomid use without specific dates, and 5 nurses who took
clomid at age 24, 25, 32, 34, 36 were diagnosed with breast cancer but without date of diagnosis.
These nurses were included in the study.
We compared clomid use versus ER status of breast cancer from the full cohort (excluding
missing values on clomid use and ER status), in Table 3. The table is organized in three groups,
showing pair-wise comparison of the effect of clomid use on ER- breast cancer (ER-), ER+ breast
cancer (ER+), and non breast cancer (non-brcn).
In all three groups, column 2 through 4 shows
the number of nurses in each category. Column 4 is the x2 value based on the null-hypothesis that
the compared pair (two of ER+, ER-, or non-brcn, indicated in the head row of each group) and
clomid are independent, and column 5 is the corresponding p-value, the probability that the null
hypothesis is true, or that clomid use is associated with the compared pair. Column 6 is the odds
ratio that the odds of the compared pair given clomid=y divided by the odds given clomid=n,
therefore reflecting the influence of clomid use on odds of the two variables.
The data in Table 3 shows that clomid use is associated with lower estrogen receptor negative
(ER-) breast cancer, indicated by odds ratio of 0.47, with no significant impact on estrogen
receptor positive (ER+) breast cancer.
- 44 -
Table 3 Clomid and breast cancer with ER status, no matching
# cases
clomid-y
ER8
clomid=n
953
Sum
961
ER-
ER+
60
Sum
68
3245
3305
4198
4266
non-brcn
8
1437
1445
clomid=n
953
80358
81311
Sum
961
81795
82756
ER+
clomid=n
Sum
non-brcn
p-value
4.586
0.032
0.45
0.22-0.95
4.731
0.030
0.47
0.23-0.94
0.063
0.80
1.03
0.80-1.34
Odds
ratio
95% CI
for Odds ratio
Sum
clomid=y
clomid=y
Chi-square
Sum
60
1437
1497
3245
3305
80358
81795
83603
85100
As discussed above, clomid use is dependent on age or generation of the nurses. Therefore, we
matched clomniduse on age. For every nurse with clomid usage, we randomly picked 13 nurses
who never used clomid.
Because the minimum ratio of nurses who never used clomid to nurses
who used clorid at different ages is 13, we could find at most 13 nurses with a history of non
clomid use to match every nurse who used clomid. These age-matched data are shown in Table 4.
Table 4 Clomid and breast cancer with ER status, matched on age for clomid use
# cases
ER-
ER+
Sum
clomid=y
8
60
68
clomid=n
200
520
720
Sum
208
580
788
ER-
non-brcn
8
1437
1445
clomid=n
200
18729
18929
Sum
208
20166
20374
non-brcn
p-value
8.200
0.0041
Odds
ratio
95% CI
for Odds ratio
0.35
0.16-0.74
Sum
clomid=y
ER+
Chi-square
l
3.361
0.067
0.52
0.26-1.06
Sum
clomid=y
60
1437
1497
clomid=n
520
18729
19249
Sum
580
20166
20746
_
8.725
0.0031
1.50
1.15-1.98
In the age-matched data, we see a slightly weaker negative association of clomid with ER- breast
- 45 -
cancer, and a new positive association of clomid with ER + breast cancer. This change in the odds
ratios was as expected because it is known that ER+ breast cancer happens more often in older
women than in younger women compared to ER- breast cancer, while clomid use happened more
often in younger nurses and more older nurses with no clomid use were removed in the matching
process.
(!J
clomid=yes
• clomid=no,
all data
0 clomid=no,
after age-matching
0.4
0.35
g
0.3
'';;
(I;l
"3 0.25
c..
o
c.. 0.2
c::
c:: 0.15
0
'';;
u
(I;l
J::
0.1
0.05
0
3
2
0
4
6
5
7
8
number of children
Figure 6 Comparison
However,
because the number
of number of children on clomid use
of children
is known to be a risk factor for breast cancer, or
specifically, nulliparous is associated with higher risk of breast cancer, we need to check parity of
these data as well. The data showed, as in Figure 6, that nurses who took clomid are more likely
to have fewer children
understanding
additionally
than those who did not take clomid,
of the purpose
of clomid
use. Therefore,
which
we matched
is consistent
with our
the data in Table 4
on parity, with a ratio of 3, the maximum number of available matches, and the result
is shown in Table 5.
- 46-
Table 5 Clomid and breast cancer with ER status, matched on age and parity for clomid use
# cases
clomid=y
ER8
ER+
60
Sum
68
clomid=n
52
118
170
Sum
60
178
238
ERclomid=y
clomid=n
Sum
non-brcn
1437
1445
52
60
4300
5737
4352
5797
clomid=y
non-brcn
p-value
9.128
0.0025
0.30
0.14-0.68
4.354
0.037
0.46
0.22-0.97
6.849
0.0088
1.52
1.11-2.09
Odds
ratio
95% CI
for Odds ratio
Sum
8
ER+
Chi-square
Sum
60
1437
1497
clomid=n
118
Sum
178
4300
5737
4418
5915
The statistics shown in Table 5, matched on both age and parity, suggest that clomid use is
associated with lower ER- breast cancer incidence versus higher ER+ breast cancer incidence
(odds ratio 0.30, 95% CI 0.14-0.68). This effect of clomid on breast cancer can also be interpreted
as increased ER+ risk (odds ratio 1.52, 95% CI 1.11 - 2.09), and decreased ER- risk (odds ratio
0.46, 95% CI 0.22 - 0.97).
3.3 Summary
In this chapter, we first described the results from our efforts in exploring data sets from the
Nurses' Health Study, including a small selection of SNPs. Even though the results showed
dependencies between environment variables and between SNPs from the same gene, we did not
find any dependencies between the SNPs and breast cancer, with or without the presence of other
clinical and life-style variables as possible "turn-on" factors. Our result suggests that there might
be no association between breast cancer and the selected SNPs.
During the exploratory analysis process, we discovered a dependency between clomid use and
breast cancer of specific estrogen receptor status in the nested case-control set. In the literature,
- 47 -
different results of studies on clomid (or clomiphene) and breast cancer have been reported.
Rossing et al. observed that use of clomid as treatment for infertility is associated with lower
breast cancer risk (adjusted relative risk: 0.5; 95% CI: 0.2-1.2). This cohort study contains 3837
women, among which 27 are breast cancer cases. [116]
Venn et al. reported a transient increase in the risk of having breast cancer in the first year after
in-vitro-fertilization (IVF) treatment (relative risk: 1.96; 95% Cl: 1.22-3.15). Clomid is one of the
fertility drugs used in the study from ten Australian IVF clinics (143 breast cancers, among which
87 were exposed to clomid). [117]
Brinton et al. recently reported that clomid use is associated with slight or non-significant
elevation of breast cancer risk (adjusted relative risk: 1.39; 95% CI: 0.9-2.1) for subjects followed
for more than 20 years, but when restricting the study to invasive breast cancer, the association
with clomid was significant (adjust relative risk: 1.60; 95% CI: 1.0-2.5). [118] This study includes
213 (175 for invasive) breast cancer cases, among which 29 (27 for invasive breast cancer) were
exposed to clomid and were followed for more than twenty years. [118]
In our study, we found dependencies between clomid use and breast cancer of specific estrogen
receptor status, and confirmed these dependencies on the full cohort, checked on the temporal
relationship between the time of taking clomid and the time of diagnosis of breast cancer. We also
matched nurses who used or never used clomid on age and number of children, and the final
result suggests that clomid use is associated with increased ER+ breast cancer risk, and at the
same time, decreased ER- breast cancer risk.
- 48 -
Chapter 4 Breast Cancer Model
4.1 Log-Incidence Model for Breast Cancer
Colditz and Rosner developed a log-incidence model of breast cancer to incorporate temporal
relations between risk factors and incidence of breast cancer. [34, 35] They also modified this
model to fit against breast cancer with specific estrogen receptor (ER) and progesterone receptor
(PR) status, and the result showed better discrimination ability on ER positive (ER+) and PR
positive (PR+) incidence than on ER negative (ER-) and PR negative (ER-) incidence. [36]
In their paper,, Colditz et al. made an assumption that breast cancer incidence rate is proportional
to the number of breast cell divisions accumulated through a life time. In addition, they also
assumed that the rate of increase of breast cell divisions at a certain age is exponential in a linear
combination of risk factors. Based on these two assumptions, they calculated the log incidence
rate with a linear function of risk factors, which varies with age and change of reproductive life
status. This formula is shown below.
Log I, = a + fo(t -to) + fl b +
2 (t]
-to) b,t-1 + yJ (t-tm)
+ cS dur PMHA + (52dur_PMHB+
+ 3 BMII + 83BMI 2 + /4 h 2 +
mA + Y2 (t-tm)
53durPMHc + 54PMHcurt + (4+5)PMHpast,t
f34 h 2
+ /3ALC,+ ,l 5ALC 2+ fl/3
5 ALC3
+ a BBD+ a2 BBDto+a3 BBD (t*-to)+ a 4BBD (t-tm)+0 FHX
Where It is the incidence rate at age t, and
t = age
to= age at menarche;
tm=
t
mB
age at menopause;
min(age, age at menarche);
mi = 1 if postmenopausal at age I, = otherwise;
- 49 -
s = parity;
t = age at ith birth, o = 1...,s;
b = birth index =
(t -ti) bit
bit= 1 if parity>i at age t, =0 otherwise;
ma= 1natual menopause, =0 otherwise;
mb = 1 if bilateral oophorectomy, =0 otherwise;
BBD = 1 if benign breast disease = yes, =0 otherwise;
FHX = 1if family history of breast cancer = yes, =0 otherwise;
dur_PMHA= number of years on oral estrogen;
durPMHB = number of years on oral estrogen and progestin;
durPMHc = number of years on other postmenopausal hormones;
PMHcur,t = 1 if current user of postmenopausal
hormones at age t, = 0 otherwise;
PMHpast, -1 if past user of postmenopausal hormones at age t, = 0 otherwise;
BMIj= body mass index at age j (kg/m2);
ALCj= alcohol consumption (grams) at age j
h = height (inches).
Colditz et al. fitted the model with the SAS function PROC NLIN on a selected cohort data set
from NHS, and calculated a risk score based on the model, which can be used as an index for
breast cancer incidence risks. The data were selected from the NHS cohort by excluding nurses
with pre-existing cancers, or missing (or conflict) pregnancy and parity information, or without a
precise age at menopause, or missing height and weight information.
In our work, we used a Bayesian approach to explore the same data set, using the same variables
in the Colditz-Rosner log-incidence model, to predict breast cancer risk based on incidence rate.
Here, breast cancer risk refers to the probability that a woman who doesn't have breast cancer
will be diagnosed with breast cancer within a defined period of time. We first built models to
- 50 -
predict the probability that a woman will be diagnosed of breast cancer in two years, then we used
similar approach to build models to predict the probability that a woman will be diagnosed of
breast cancer in longer terms: four years and six years. Such a breast cancer risk is consistent with
the definition of absolute risk of breast cancer that is generally used in this area, [124,121] and it
t
also matches with the probabilities of developing breast cancer predicted in the Gail model. [27]
Specifically, Gail et al. defined the absolute risk of a disease of interest within a defined period
from time a to time t as
It
~~~~~~~~~to
- f[h()+k()]dv
h(u)e
;a
du
where h(t) is the cause-specific hazard at time t for the disease of interest, and k(t) is the hazard of
mortality from other causes at the same time. [121] It is worth noting that such definition also
incorporated competing diseases represented by k(t).
4.2 Evaluating Risk Score as an Index for Breast Cancer
Risk scores are widely used as indicators for risks of various diseases, and many of them were
calculated using regression models, such as the breast cancer risk score developed by Colditz et
al., as described previously. It would be interesting to compare these scores with probabilistic
models, and see how they perform differently (or the same).Before starting on the prediction
models, we introduce an initial attempt using Bayesian Network to evaluate the risk score as an
index for breast cancer developed by Colditz et al., as described previously.
The objective of this analysis is to see whether breast cancer incidence is dependent on the risk
score, and whether the risk score can d-separate breast cancer from the risk factors. The
underlining assumption is quite intuitive: if the risk score is a good index for breast cancer, it
shall capture all information provided by the risk factors and act as an information flow proxy
from the risk factors to breast cancer. If there are additional information flow from the risk factors
-51 -
to breast cancer, not included by the score, it will suggest a possibility of improvement for the
risk score by adjusting model and incorporating such information.
Ms. Marion Mcphee in Harvard Medical School provided us a risk score file containing risk
scores calculated based on the log-incidence model using 1980 year data. She also provided us a
data set including the variables and records that were used to develop the log-incidence model,
from 1980 till 1998, which we used to build breast cancer classifiers as described later in this
chapter. For the study in this section, we used only 1980 data from this data file and combine with
the risk score for analysis, because the risk scores are calculated using 1980 data only, even
though the model is developed based on all year data. (When developing risk prediction models
in the rest of this chapter, we used all year data.)
In order to learn a Bayesian Network for our purpose, we put breast cancer (Case) on top, risk
score (Score_df53) in the middle, and then the risk factors. We hoped that with such a searching
order, Discoverer could find a dependency structure that can show whether there is information
flow from the risk factors to breast cancer outside the risk score. There were no missing value in
this data set, and the continuous variables were discretized into three bins with roughly 1/5
records in the top and bottom bins and 3/5 records in the middle. The result is shown in Figure 7.
- 52 -
Figure 7 ]3ayesian Network learned for evaluating risk score as an index for breast cancer
From the learned Bayesian Network, we see that breast cancer is dependent on risk score, which
in turns has dependencies with many of the risk factors. There is no direct links between the risk
factors and breast cancer, so this result does not show any extra information that the risk score
needs to modify to incorporate.
The above result, however, does not guarantee that the risk score captures all information of the
risk factors because of two major reasons. First, searching for optimal dependency structure of a
Bayesian Network is a NP-hard problem, and the searching method we used does not guarantee a
global optimal solution. There could be other dependency structures that are more likely than this
structure, and breast cancer and other risk factors might not be d-separated by the risk score in
some of those structures. The second reason is that this data set is only one tenth of the whole
data set on which the risk score model is developed, and therefore such an evaluation is not
- 53 -
complete. If we had had risk scores from all year data, we would have been able to do a more
complete evaluation, while still constraint by the first reason described.
4.3 Learning a Classifier for Breast Cancer Incidence
The original data set we had includes records of 3216 nurses who were diagnosed with breast
cancer between 1980 and 2000, and 71805 nurses who were never diagnosed with breast cancer
until 2000. NHS data are organized in two-year intervals, because the data were collected every
other year. For nurses without breast cancer, there is one record every other year from 1980 until
1998, except for nurses who left the study early due to death, diagnosis of cancer other than
breast cancer, or other reasons. These nurses only have records till when they left. For nurses with
breast cancer, there is one record every other year from 1980 till the last record before the report
of breast cancer.
We used a randomly sampled subset from the above data to train a classifier. We didn't use the
whole data set for the following reasons. First of all, the number of records is different for every
nurse. Healthier nurses stayed in the study till 2000, thus having more records. Nurses who died
or were diagnosed with cancers before 2000, including breast cancer, have fewer records. If we
use all available data, healthier nurses will be overemphasized compared with the nurses who
exhibited diseases (not limited to breast cancer, but also including other cancers or other diseases
that caused mortality). Secondly, using multiple records from one person to train a single
classifier might increase the apparent power more than the data can actually provide. Thirdly, for
an incidence model, only one record of a nurse with breast cancer will be used as a breast cancer
case, and which year to use will depend on the predictive period of the model (discussed in more
details later). If we were to use all available data, the proportion of breast cancer cases would be
very small (about 0.6%), which is relatively harder to deal with in training classifiers. Lastly, the
whole data set is very large (60MB), and sub-sampling it makes the application of the tools we
used more practical.
- 54 -
Based on the above reasons, we sub-sampled the data set by randomly picking one record for
nurses without breast cancer, while picking the record that immediately precedes the report of
diagnosis for nurses with breast cancer (this record is marked as breast cancer). It is worth noting
that the sampling ratio used here is different for breast cancer and non-breast cancer, because the
sampling of breast cancer is deterministic rather than random. Ideally we should also randomly
sample the breast cancer nurses, but in order to make use of all breast cancer cases, we fixed our
sampling on the breast cancer record. Therefore, breast cancer cases were amplified, and the risk
prediction results of a classifier trained from this data set need to be prorated.
Because Discoverer works better on categorical variables than continuous variables, we
discretized continuous variables into three bins, with the two extreme bins containing
approximately one fifth of the records each, and the middle bin containing the remaining three
fifths. (In the exploratory process, we also tried different discretization methods, including
equal-size binning, equal-range binning, discretization on class, discretization on entropy, etc.,
but did not find better results on performance.)
We randomly split this data set into two parts, 75% as training set, and the remaining 25% as test
set. Because breast cancer is a rare disease, and even in this "amplified" data set breast cancer
only constitutes about 2% of the records, a little noise in the random splitting may cause
noticeable difference in the proportions of breast cancer records in the split sets. Therefore, we
stratified the random splitting on breast cancer. Such stratification is necessary to eliminate
systematic overestimation or underestimation of risks due to different marginal distributions of
breast cancer in training and test sets, which we encountered in preliminary studies.
When learning the network, we organized the variable order in three steps. In step one, we put the
variables into an order such that closely related variables are together, while the order is also
- 55 -
consistent with our understanding of the causal relationships among the variables, except for
breast cancer. The breast cancer variable, "case," is on top of the variable list, so Discoverer could
explore the dependencies between breast cancer and all other variables. (If we put breast cancer at
the bottom of the list, Discoverer may not be able to find as many dependencies because of the
competition between other variables.)
In step two, we randomly permuted the variable orders to create ten different training sets, with
the same data but different variable orders, and learned a group of ten different networks from
each training set. For each variable, we counted the number of times it appeared in the Markov
Blanket of breast cancer in these ten networks, as in Table 6. From each group, we picked the
variable with the highest count (first variables in each group) and moved these variables to the top
of the list, to facilitate the searching of the dependencies between these variables and breast
cancer.
Table 6 Variable counts in Markov Blankets of breast cancer in ten networks with permuted order
Variable Name
Variable Meaning
Count
Group
age
Age
3
yearsMeno
Years with menstrual period
2
yearsNatural
1
bbdMeno
Years after natural menopause
Years with menstrual period if have benign
breast disease
bbdMenopause
Years after menopause if have benign breast
disease
bbd
Benign breast disease
3
3
yearsProgestin
curPMH
Years on progestin
Current postmenopausal hormone use
5
2
Postmenopausal
hormone
tmtbm
Birth index
1
Pregnancy
sumalc2
Alcohol consumption when on postmenopausal
hormone
1
Alcohol
consumption
4
Aging
Benign
disease
breast
It is worth noting that family history is not included in these variables. This is probably due to the
fact that family history is a very unbalanced variable (only about 10% nurses with family history),
- 56 -
thus being at a disadvantage when competing with other more balanced variables. It is also
possible that in all of the limited permutations, family history happened to be in a position that is
hard for Discoverer to find the dependency between family history and breast cancer. Nonetheless,
it is well known that family history is a predicting factor for breast cancer, supported by the
genetic mutations found to cause breast cancer. Therefore, we included family history in the next
step of the ordering process despite its absence in Table 6.
In step three, we combined the results from step one and step two, by putting the selected
variables in step two, i.e. age, family history, bbdMeno, yearsProgestin, tmtbm, sumalc2 on the
top of the list, and obtained the variable order in Table 7.
Table 7 Variable order
Variable Name
Variable Meaning
case
Breast cancer
age
Age
famhx
Family history
bbdMeno
Years with menstrual period if have benign breast disease
yearsProgestin
Years on progestin
tmtbm
sumalc2
Birth index
Alcohol consumption when on postmenopausal hormone (PMH)
bmi 1
bmi2
hi
h2
Body Mass Index (BMI) factor without PMH use
BMI factor with PMH use
Height factor with PMH use
Height factor without PMH use
sumalc 1
Alcohol consumption
sumalc3
Alcohol consumption without PMH use
yearsMeno
Years with menstrual period
yearslbirth
Years after first birth
yearsNatural
Years after natural menopause
yearsBilateral
yearsEstrogen
Years after bilateral oophorectomy
Years on oral estrogen
yearsPMHother
Years on other PMH than estrogen or progestin
curPMH
Current PMH use
pastPMH
bbd
Past PMH use
Benign breast disease
- 57 -
Variable Name
bbdMenarch
bbdMenopause
Variable Meaning
Age at menarche if have benign breast disease
Years after menopause if have benign breast disease
We learned a Bayesian Network from the training set with the above variable order, shown in
Figure 8.
Figure 8 Bayesian Network for predicting breast cancer
In the graph, variable "Case" is the marker for breast cancer. We can see that age (Age_df53),
family history of breast cancer (Famhx), years having menstrual period if had benign breast
disease (Bbdmeno), and number of years on oral estrogen and progestin (Yearsprogestin_df53)
are the four most important predicting factors and constitute the Markov blanket for breast cancer
(Case).
We evaluated the generalization error of this model on the hold-out 25% test set. The prediction
AUC is 0.65, with 95% Confidence Interval (CI) of 0.64 - 0.67.
- 58 -
4.4 Classifier for ER+/PR+ Breast Cancer
Predicting ER+/PR+ breast cancer is clinically more important, because ER+/PR+ tumors are
more sensitive to Selective Estrogen Receptor Modulators (SERMs), one of the major methods
for prevention of breast cancer. [125] In further study of breast cancer predicting models, we
limited the scope of our model to ER positive and PR positive breast cancer versus non-breast
cancer by removing breast cancer case records not marked as ER+/PR+ (including ER-, PR-, and
ER or PR missing) from the data set described above, which left 1447 breast cancer nurses and
71805 non breast cancer nurses.
We split (randomly, stratified) the data into a training set with 75% of the total data, and a test set
with the remaining 25% of the total data. We built the classifier on the training set, and then ran
the classifier through the test set to evaluate its performance, which will be discussed in a later
section. The resulting Bayesian Network is shown in Figure 9.
Figure 9 Bayesian Network for predicting ER+/PR+ breast cancer
We can see that the structure of this prediction model is similar to the one in Figure 8 for all
breast cancer, with the same predicting variables: age (Age_Df53), family history of breast cancer
(Famhx), years of having menstrual period if had benign breast disease (Bbdmeno), and years
- 59 -
taking oral estrogen and progestin (YearsprogestinDf53). In the next section, we evaluated this
model in more detail.
4.5 Evaluation of Risk Prediction Model
We evaluated the performance of the risk prediction model in two ways. First, we evaluated the
discriminating ability of the model using the Receiver Operating Characteristic (ROC) curve,
specifically, area under the ROC curve (AUC). Second, we evaluated the model's ability to give
an accurate probability for breast cancer prediction, by comparing expected and observed
incidence rates.
l
l
- -
|
| s
0.9
0.8
0.7
0.6
.2
a)0.5
cC
CD
0.4
AUC = 0.70 (95% CI: 0.67 - 0.74)
0.3-
0.2 -
"l r
I
I
I
0.1
0.2
0.3
I
0.4
I
I
0.6
0.5
1 - specificity
I
I
I
0.7
0.8
0.9
Figure 10 ROC curve of ER+/PR+ breast cancer prediction on test set
- 60 -
1
Figure 10 shows the ROC curve of the prediction results on the test set. The area under the ROC
curve is 0.70, with 95% CI of 0.67 - 0.74. The actual data of Figure 10 can be found in Table 8.
(In the ROC curve, sensitivity is defined as the number of breast cancer cases predicted as breast
cancer divided by the total number of breast cancer cases, and specificity is defined as the number
of non-breast cancer cases predicted as non-breast cancer divided by the total number of
non-breast cancer cases. The model gives a probability between 0 and 1 for each record, and by
varying the prediction threshold, we get a series of different sensitivity and specificity values, and
thus the ROC curve. In application of the prediction model, one needs to pick a threshold, which
is a trade-off between sensitivity and specificity.)
Both predicted and observed risks in Table 8 are prorated as discussed previously, to compensate
for the sampling. We over-sampled breast cancer records, because for non-breast cancer nurses,
we randomly selected one record; while for breast cancer nurses, we selected the record precedes
the diagnosis of breast cancer. The proration ratio for all breast cancer is determined by dividing
the expected number of breast cancer records if we had selected all records randomly by the
actual number of breast cancer records. That is equal to dividing the actual number of breast
cancer records (3216) by the total number of records of nurses with breast cancer (19186), which
is 0.168. Then we also compensated for breast cancer cases missing ER/PR status, assuming in
these cases the marginal distribution for ER+/PR+ breast cancer is the same as in those with
known ER/PR status. The final proration ratio for ER+/PR+ breast cancer is 0.226, and we
multiplied this ratio into both predicted risks and observed risks for a direct view of real-world
risk.
- 61 -
Table 8 Prediction results of ER+/PR+ breast cancer
predicted
risks
(prorated)
observed
risks
(prorated)
number
of
records
observe
d cases
predicte
d cases
threshold
1- specificity
sensitivity
0.001
0.002
3451
23
17.255
0.001
1.000
1.000
0.003
0.002
6379
54
76.548
0.003
0.809
0.936
0.004
0.005
1675
38
28.475
0.004
0.457
0.787
0.005
0.005
514
11
10.280
0.005
0.365
0.681
0.006
0.005
530
12
13.250
0.006
0.337
0.651
0.006
0.007
2095
69
58.660
0.006
0.309
0.618
0.007
0.007
45
1
1.350
0.007
0.196
0.427
0.007
0.005
786
18
23.580
0.007
0.193
0.424
0.008
0.019
111
9
3.774
0.008
0.151
0.374
0.008
0.009
0.006
0.006
169
620
4
15
5.915
25.420
0.008
0.009
0.145
0.136
0.349
0.338
0.009
0.008
217
7
9.114
0.009
0.102
0.296
0.010
0.010
360
15
16.560
0.010
0.090
0.277
0.010
0.012
372
20
17.112
0.010
0.071
0.235
0.011
0.010
320
14
15.040
0.011
0.051
0.180
0.011
0.012
0.011
0.014
9
87
0
5
0.432
0.011
0.012
0.034
0.141
4.437
0.034
0.141
0.014
0.011
155
0.014
0.033
23
7
3
9.300
1.472
0.014
0.014
0.029
0.021
0.127
0.108
0.015
0.022
140
13
9.100
0.015
0.020
0.100
0.064
0.015
0.014
23
1
1.541
0.015
0.013
0.017
0.016
49
3
3.773
0.017
0.012
0.061
0.019
0.019
28
2
2.408
0.019
0.009
0.053
0.024
0.028
126
15
13.104
0.024
0.008
0.047
0.026
0.024
23
2
2.622
0.026
0.001
0.006
0.029
0.019
5
0
0.640
0.029
0.000
0.000
We also evaluated the model's ability to give an accurate probability for breast cancer prediction
by comparing expected and observed incidence rates. The Bayesian Network predictor gives a
probability to each record, as a predicted risk for that nurse. For nurses with the same input
configuration, or variable values, the probability assigned by the Bayesian Network predictor will
be the same. Therefore, the prediction results naturally group into clusters, with the same
- 62 -
predicted risk or probability for all records in the same cluster. For each cluster, we calculated the
observed risk using a Bayesian approach based on the true class values of the records within that
cluster, with uniform prior distribution and an equivalent sample size of 1. The size of the clusters
(number of records within the cluster) varies over a wide range, as shown in Table 8, because the
distribution of input configurations is unbalanced.
A comparison between predicted risk and observed risk is shown in Figure 11. The actual data are
also in Table 8. The diagonal dotted line is an "ideal calibration curve," on which the results from
a perfectly calibrated classifier would fall.
0.14
--
,
I
I
I.
O.12
0
0
0.1
0
0
0 .08
0
0Q
0
0
0
0
0
0- 0.06
7'0
0
0.04
0
,0
0,'
°
0.02
.
-
0
7
e"
I
p
p
n
v0
0.02
0.04
0.06
0.08
Observed Risk
0.1
0.12
0.14
0.16
Figure 11 Calibration curve of ER+/PR+ breast cancer prediction on test set
There are two aspects we need to evaluate in this calibration curve. First, we want to know
- 63 -
whether there is a systematic overestimation or underestimation of risks. Second, we want to
know, quantitatively, how close the predicted risks are to the observed risks on average. The first
one can be evaluated by comparing the total expected number of breast cancer cases based on the
predictions with the total observed number of breast cancer cases. The second one can be
evaluated by a goodness-of-fit test.
For the above prediction results, the total expected number (E) of breast cancers is 370, which is
the sum of expected breast cancer cases of every cluster, and the observed total number (O) is 361,
giving an E/O ratio of 1.03. Rockhill et al. and Costantino et al. calculated CI of E/O by assuming
a Poisson distribution for 0. [38, 59] With this method, calculating the 95% CI of O and dividing
E by these upper and lower numbers, the CI for above E/O is 0.93 - 1.14.
In a goodness-of-fit test, we calculated the X2 statistic of the predicted number of cases and
observed number of cases by summing over the clusters, which gives
x2
of 34.28 with
corresponding p-value of 0.13, showing no statistically significant lack of fit.
Alternatively, the calibration curve can be evaluated using linear regression. The linear regression
coefficient evaluates the closeness of predicted risks and observed risks, or degree of linear
relationship between them. A linear regression on predicted risk versus observed risks gives a
linear regression coefficient r2 of 0.77. We aimed at improving this linear relationship between
predicted and observed risks in next section.
4.6 Filtering of Risk Prediction Probabilities Based on Clustering
A closer look at the calibration curve shows that the data points that deviate further from the
"ideal calibration curve" in Figure 11 generally have fewer records than the data points closer to
the center. This observation prompted an attempt to improve the linearity of the calibration curve
by clustering these data points.
- 64 -
Our Bayesian Network risk model gives the same probability prediction on records with the same
input configuration, which constitute generic prediction clusters. The number of records within
each generic cluster varies, reflecting the distribution of records over the input configurations.
From the previous results, we see that clusters with more records are generally closer to the "ideal
calibration curve" than clusters with fewer records. This observation can be intuitively
understood, because with fewer records, the estimates of probabilities have larger variance.
Based on the above observation, we designed a filtering process to improve the linearity of the
calibration curve. To be consistent with the model evaluation and validation process, this filter is
also trained on the same training set on which we trained the model, and then applied to the
prediction result of the same test set on which we tested the model. More specifically, we
designed a filtering process that can learn a merging scheme from the training set predictions, i.e.
which small clusters shall be merged together, and then apply this learned merging scheme to the
test set prediction result, which looks like that we are filtering the prediction results such that the
output only contains relatively larger or merged clusters.
If we merge the small clusters together, the resulting new calibration will deviate less from the
"ideal" linear shape, but its discrimination ability may degrade, because the previously different
or discriminating outputs now become the same. Therefore, we need to find a trade-off point that
can maximize the linearity of the calibration curve without losing much discrimination power.
Slonim et al. developed a clustering algorithm that "explicitly maximizes the mutual information
per cluster between the data and given categories," called the agglomerative information
bottleneck method. [122] We employed a clustering approach based on this agglomerative
information bottleneck method to find this trade-off point, as described below.
- 65 -
Suppose there are Mo different input configurations in the data set, represented by x0i, i = 1, 2 ...
Mo. Records with the same input configuration constitute a generic cluster, because merging these
records (if we start from one cluster per record) would be trivial. The model gives probability
prediction p(ylxoi) on records within the same cluster xoi, and p(ylxoi)is the output for this cluster
Xoi.
All generic clusters constitute the original cluster pool C. When two clusters xa and
Xb
are
merged into a new cluster xz, Xa and Xb will be replaced by x, in the cluster pool, and the cluster
pool becomes C 1, which has M = M-
1 clusters. The output of the new cluster xz is the
weighted average of the output of xa and Xb,i.e. p(y
xz) = p(Xa)p(y Xa) +p(xb)p(y Xb) Then
p(Xa) + p(Xb)
the merging procedure continues among the clusters within the cluster pool, while the size of the
pool (number of current clusters) decreases.
The clustering is a greedy approach, merging two clusters together at a time. In each step, the
clusters to be merged were selected from all possible combinations of any two clusters within the
current cluster pool by the means of evaluating merge cost, which is weighted Jensen-Shannon
divergence change.
Jensen-Shannon divergence (JS divergence) is a measurement of "distance" between conditional
distributions. The JS divergence between two clusters can be calculated as following:
2
JSp(xl),p(x
2 ) [p(y xl),p(y Ix2)]= H
2
p(xi)p(Y Ixi) i=1
p(xi )H[p(y Ixi)]
i=l
where p(xl) and p(x2) are the marginal distributions of two partitions, and p(ylxl) and p(ylx2) are
the conditional probabilities of y given xl and x2, respectively. H[p(x)] is Shannon's entropy,
givenbyH [p(y)]= -Z y p(y) log p(y).
- 66 -
Slonim et al. showed that in each merge step, the decrease in the mutual information between the
clusters and the category variable (class variable) due to the merge is the JS divergence between
the merged clusters weighted by the marginal distributions of these clusters, i.e.
merge cost =[p(xl) +p(x2)] JSp(xl)P(x2 ) [P(y xl),p(y x2)]
By minimizing this "merge cost," we can find the "best possible merge" in each greedy step.
We applied this clustering procedure to the prediction results of the training set, and recorded
during this procedure the area under the ROC curve and the linear regression coefficient of the
observed and predicted risk, shown in Figure 12. From these results, we found a good trade-off
point, when the number of clusters is 7, where both AUC and r2 are very close to optimal values.
The original AUC is 0.700, and the optimal r2 is 1.000 with 3 clusters. With 7 clusters, the AUC is
0.698,
and rr2 is
is 0.999.
0.999. Please
Please note
note these
these are
are prediction
prediction results
results on
on the
the training
training set,
set, not
not the
the test
test set.
set.
0.698,
and
- 67 -
/
74
0
IL
I
numberof clusters
Figure 12 Clustering of prediction on training set
Therefore, we obtained a clustering filter, which can improve the calibration curve performance
without losing much discrimination power in the training set, by mapping the model prediction
output of each generic cluster to the prediction output of one of the seven clusters.
We then used this clustering filter, with number of clusters equal to 7, to smooth the prediction
result of the test set. The prediction results after filtering are shown in Table 9
- 68 -
Table 9 Prediction results on test set after filtering
predicted
risks
(prorated)
o)bserved
risks
(prorated)
0.001
0.002
0.003
0.004
0.006
0.010
0.015
0.024
0.002
0.005
0.007
0.009
0.016
0.026
__
_
count
_
observed
cases
predicted
cases
threshold
3451
23
17.255
6379
2189
3456
2265
418
154
54
49
100
89
29
17
76.548
38.865
96.801
97.129
27.778
16.304
_
_
1_
_
1-specificity
sensitivity
0.001
1.000
1.000
0.003
0.004
0.006
0.010
0.015
0.024
0.809
0.457
0.337
0.151
0.029
0.008
0.936
0.787
0.651
0.374
0.127
0.047
1
0
0
These prediction results after filtering give the ROC curve shown in Figure 13.
- 69 -
1
I
l
[
-I
l
[
l
l
[
- -r .,,.... roil,
1l~
J
0.9
.0."'"
...
0.8
0.7
0.
0.6
:>I
0.5
0)
Cn
9
0.4
AUC
=
0.70 (95% CI: 0.67
-
0.73)
..
0.3
.
A,
0.2
A.
0.1
,.
W
I
I
r'
0.1
0.2
0.3
0.4
I
0.5
0.6
1 - specificity
I
I
I
0.7
0.8
0.9
1
Figure 13 ROC curve after filtering on test set
In Figure 13, the circles connected by the dotted lines are data points after filtering, and the
triangles are data points before filtering, i.e. the original clusters. From the graph, we see the
clustering happened mostly among the clusters that are close neighbors (with similar sensitivity
and specificity). The "crowded" areas at the low sensitivity end are now represented by one or
two points, while the points without many neighbors at the high sensitivity end are kept. The
AUC of this filtered result is 0.70, with 95% CI 0.67-0.73. Therefore, the clustering and filtering
didn't lose much, if any, discrimination power compared to the original model.
The calibration curve of the prediction after filtering is shown in Figure 14 (the dotted line as the
"ideal calibration curve" for comparison). From the graph, we can see now that the predicted risk
- 70 -
and observed risk have a higher degree of linear relationship than before clustering and filtering
(in Figure 11). The linear regression coefficient r 2 is 0.996, while before clustering it was 0.77.
We can see that in the test set the clustering filter also did a good job by improving linearity of the
calibration curve and keeping the same discrimination power.
0.025
0.02
CD
.,
a)
0.015
.'0
.0
U,
WU
a:
0.01
0
._
7
0
0.005
0
0
n
v0
'/ v
I
0.005
I
I
0.01
0.015
ObservedRisk afterclustering
I
0.02
0.025
Figure 14 Calibration curve after filtering on test set
4.7 Classifiers Predicting Long Term Risks
The model discussed in the previous section, based on two-year interval data collection, predicts
breast cancer incidence risk within two years. Clinically, longer term risk prediction may be
desired for earlier and more effective prevention and disease management.
-71 -
In order to build and test a model predicting breast cancer risks in the next N years, we need
records of those nurses who were known not to develop breast cancer in the next N years as
non-breast cancer records, and records of those nurses who were diagnosed of breast cancer
randomly within the next N years as breast cancer records.
The non-breast cancer records can be obtained by randomly sampling one record from those at
least N years before the last record, in which we know for sure a nurse didn't develop breast
cancer within the next N years. The breast cancer records can be obtained by randomly sampling
one record from within N years of the report of breast cancer diagnosis.
Putting these data together directly, however, may introduce a sampling bias that artificially
increase prediction accuracy, because there will be no non-breast cancer records within the last N
years of the study, but there could be breast cancer records during that period. Therefore, we need
to filter the sampled data by discarding the breast cancer records that fall within the last N years
of study. After such sampling and filtering, we get a 4-year risk data set with 1294 ER+/PR+
breast cancer nurses and 68308 non-breast cancer nurses, and a 6 year data set with 1176
ER+/PR+ breast cancer nurses and 64709 non-breast cancer nurses.
Again, by randomly splitting the data sets (stratified) into training sets with 75% data and test sets
with the remaining 25% data, we learned Bayesian Networks from the training sets and evaluated
the generalization errors on the test set. The Bayesian Networks for predicting 4-year and 6-year
risks for ER+/PR+ breast cancer are shown in Figure 15 and Figure 16. In these two graphs,
variable "Brcn" marks ER+/PR+ breast cancer versus non-breast cancer.
- 72 -
Figure 15 Bayesian Network for predicting 4-year risk of ER+/PR+ breast cancer
Figure 16 Bayesian Network for predicting 6-year risk of ER+/PR+ breast cancer
From the graphs, we can see the predicting variables are the same as in the 2-year risk model
(Figure 8), age (Age_Df53), family history of breast cancer (Famhx), years having menstrual
period if had benign breast disease (Bbdmeno), and years taking oral estrogen and progestin
(Yearsprogestin). The prediction results are listed in Table 10.
- 73 -
Table 10 Prediction of 2-year, 4-year, and 6-year risk models for ER+/PR+ breast cancer
Model
AUC
95% Cl of
AUC
E/O ratio
2-year risk
0.70
0.67 - 0.74
1.03
4-year risk
0.68
0.64 - 0.72
6-year risk
0.66
0.62 - 0.69
95% CI of
E/O ratio
p
r
1.02
0.93 - 1.14
0.91 - 1.14
0.13
0.75
0.77
0.79
1.06
0.94 - 1.19
0.55
0.28
We also applied filters constructed by agglomerative information bottleneck clustering on the
predictions of 4-year and 6-year risks, and received similar improvement on the calibration curve
without losing much discrimination power, shown in Table 11.
Table 11 Prediction after filtering of 2-year, 4-year, and 6-year ER+/PR+ risk models
Model
2-year risk
AUC
95% CI of
AUC
E/O ratio
95% CI of
E/O ratio
r2
p
4-year risk
0.70
0.68
0.67 - 0.73
0.65 - 0.72
1.03
1.02
0.93 - 1.14
0.91 - 1.14
0.10
0.42
0.996
0.97
6-year risk
0.66
0.62 - 0.69
1.06
0.94- 1.19
0.70
0.91
We developed models predicting 2-year, 4-year, and 6-year risks of ER+/PR+ breast cancer, and
validated the models on hold-out test sets. The predicting structures are very similar for these
three models, with age, family history of breast cancer, years on oral estrogen and progesterone
use, and years having menstrual period given benign breast disease as the predicting factors for
breast cancer.
We are not surprised to see such similarities between these models because of several reasons.
First of all, the underlining biological and pathological mechanisms are the same, no matter
whether we are predicting 2-year risks, 4-year risks, or 6-year risks. If a biological mechanism
determines breast cancer risks in two years, it is possible that the same biological mechanism may
determine breast cancer risks in four years as well, though we may expect to see more uncertainty
- 74 -
involved in longer term risks. Secondly, we sampled the breast cancer records in a way such that
for 4-year risk data, about half of the breast cancer records are exactly the same as 2-year risk
data, and for 6-year risk data, one third of the breast cancer records are exactly the same as 2-year
risk data, and another one sixth are exactly the same as 4-year risk data (which is different from
the 2-year data). In addition, because we discretized the variables, many of the different records
actually got the same values after discretization except for the portion on the boundaries. These
two effects together helped in deriving similar model structures. Thirdly, we learned these
Bayesian Networks using the same variable orders, which favor similar structures.
Performance wise, our models have discrimination results in 2-year risk prediction better than
those in 4-years, which in turn are better than those in 6-years. This is quite intuitive: longer term
results generally involve more uncertainty, and thus are harder to predict. On the other hand, the
differences between these models are not very large. Instead, there are large overlaps on the 95%
CI of the AUC(s between the results, and the AUC of 6-year risk prediction barely falls out of the
95% CI of the AUC of 2-year risk prediction. One reason for such small differences can be
accredited to the similar predicting structures, while unchanged values caused by discretization
might be another reason that can partially explain it.
4.8 Summary
In this chapter, we presented a series of models that we built to predict absolute risks of breast
cancer over a defined period of time based on a set of variables constructed and selected by the
log-incidence model proposed by Colditz et al.. [36] The data were randomly sampled for risk
prediction of different period of time.
We evaluated the models on hold-out test sets, and reported the results in the form of AUC,
calibration curve, E/O ratio, goodness-of-fit test, and linear regression of predicted and observed
- 75 -
risks. The AUC of the models is within the range from 0.66 to 0.70, and thus provides a relatively
good discrimination power for breast cancer, given the fact that the current model used clinically,
the Gail model, has an AUC of 0.58. [59] While a goodness-of-fit test showed no statistically
significant lack of fit, the calibration curve actually showed rather good linear fit of the observed
and predicted risks, and the overall E/O ratio is from 1.02 to 1.06.
In order to further improve the prediction performance on calibration, we constructed output
filters by clustering the predictions on training sets, and then filtered the predictions on test sets
using these filters. We reported the clustering results using the Agglomerative Information
Bottleneck (AIB) method, which improved the linear relationship between the predicted and
observed risks without sacrificing much discrimination power.
The discrimination performance of the models, measured by AUC, is far from a perfect model,
which should have an AUC close to 1. Breast cancer is a disease that involves uncertainties. It is
possible that whether a woman will develop breast cancer is uncertain until a very short time
before the onset of the disease. In other words, clinical and life-style changes, spontaneous
mutations in breast tissues, or other random factors might change the probability of developing
breast cancer. It is possible that breast cancer is not a deterministic disease in the long term (e.g.
two years in advance).
Because the biological and pathological mechanisms for breast cancer are unclear, or the exact
information is unavailable, we used observable variables and risk factors based on past
experience to build risk predicting models, which may be directly or indirectly linked to the true
cause of breast cancer. For example, family history is linked to genetic defects and mutations,
which should be one of the causes of breast cancer. Because we do not have the exact genetic
information of all nurses, we used family history as a proxy to approximate the influence of the
genetic factors. Such approximation brings in even more uncertainty in predicting breast cancer
- 76 -
compared to a hypothetical situation where we could directly use the genetic defects and
mutations for prediction, which already involves uncertainty in low penetrance diseases such as
breast cancer.
We are not saying that the prediction of breast cancer cannot be better. Some important risk
factors, such as genotype of BRCA1 and BRCA2, were not available at the time of this study.
inclusion of such information will certainly boost the prediction power of the models. Also better
models can be constructed when more advanced machine learning techniques are developed, or
when we have a deeper understanding of the mechanism of breast cancer.
The calibration performance of the models, measured by E/O ratio, calibration curve, or linear
relationship between observed and predicted risks, are quite good. Even with limited
discrimination power, an accurate estimate of breast cancer risk is still useful, whether in clinical
trial planning or disease counseling and management.
- 77 -
Chapter 5
Discussion
5.1 On Exploratory Analysis
During the procedure of exploratory analysis, we learned many lessons, which we wish we had
known beforehand to work more efficiently and productively.
5.1.1 Expert Knowledge
Human expert knowledge is necessary in exploring complex dependencies even with advanced
tools. One place we can use human expert knowledge is in variable selection. The search space of
Bayesian Networks increases as a factorial of the number of variables, thus limiting variables in
the study scope can help dramatically in limiting computation. In addition, incorporating expert
knowledge in variable selection may help improving the performance of classification models. In
our study, the classification models for ER+/PR+ breast cancer are learned from variables created
from the original raw data based on human expert knowledge, i.e. the log-incidence model
proposed by Colditz et al. with their understanding of the mechanism of breast cancer
development. The classification performance of these models is better than that of other models
we learned directly from the raw data. On the other hand, such selection of variables may impose
unwarranted constraints on the parts of the hypothesis space being explored, especially when we
try to find something of which nobody ever thought. There is always such a trade-off.
5.1.2 Exploration of the Dependencies in Learned Bayesian Networks
Learning a Bayesian Network from data sometimes prompts us with evidence of dependencies
that we never thought of or have little knowledge of. One example is the dependency between
clomid use and breast cancer with specific ER status discussed previously. This is one of the
objectives of exploratory analysis: to find out things we don't know.
- 78 -
Domain specific knowledge and common sense are heavily involved in such explorations. Also
the dependencies need to be explained with care. For instance, we found a dependency between
breast implants and breast cancer in the process of exploration, such that nurses who had breast
implants also had a higher incidence of breast cancer. After discussions with Professor Marco
Ramoni and Professor Graham Colditz, we suspected that the dependency is probably due to
breast implant after surgical removal of breast(s) because of breast cancer. To confirm this
hypothesis, we further investigated this dependency by comparing the date of breast implant and
the date of diagnosis of breast cancer. If the implant is after diagnosis of breast cancer, that record
is re-marked as not having a breast implant. By such manipulation, we were able to evaluate
breast implant as a possible risk factor for breast cancer. The result showed, on the contrary, that
breast implant has a statistically insignificant association with lower incidence of breast cancer.
Such further exploration, guided by common sense and domain expert knowledge, is used to
validate and confirm the dependencies found by learning a Bayesian Network from the data. To
some extent, it can also help to understand the causal relationship that could be underlying the
dependencies.
5.1.3 Split of Training and Test Set
The data sets we worked on are highly unbalanced, not only on breast cancer, but also on many
other variables such as family history or clomid use. When splitting the data into a training set
and a test set, we need to stratify the random splitting. For a more balanced data set, randomly
splitting without stratification may not be a problem, because the random variance on the
distribution is small comparing to a relatively large marginal distribution of the variable. For an
unbalanced data set, the random noise can make a big difference. In our early analysis, we used
non-stratified random splitting when training classification models. The prediction result has a
heavy systematic bias by underestimating the risks. It turned out that was due to a higher
incidence of breast cancer in the test set than the training set, which means we were evaluating a
- 79 -
model on a different population. In later analysis, we stratified all the random splitting and saw no
more large systematic bias.
5.1.4 Discretization and Consolidation of Variable Values
The major tool we used, Bayesware Discoverer, requires continuous variables to be transformed
into discrete variables. Many other Bayesian Network tools have the same requirement. Tools that
can deal with continuous variables generally require the distribution of the continuous variable to
be normal. Therefore, discretization is often a problem that we need to consider in Bayesian
analysis.
There are many different ways to discretize continuous variables. Some machine learning
algorithms can discretize during the learning process, such as C4.5, while many require it to be
done in pre-processing the data. A continuous variable can be discretized based only on the
information itself or it can be based on information from other variables. Commonly used
methods to discretize a variable include binning to achieve equal numbers of data values in each
bin, choosing bins that each span the same fraction of a variable's range, or bins based on
normalized distributions.
Other can be based on entropy or some statistics such as x2 values.
There is no established theory about the best discretization method. In this work, we tried equal
interval binning, equal size binning, and binning based on entropy and/or Jensen-Shannon
divergence. We also tried to define the size of the bins to emphasize the extreme ends of values
(e.g. top 1/5 and bottom 1/5 values as one bin each, while the middle 3/5 values as one bin).
There is no statistically significant difference between these methods when comparing the
prediction performance of the classification models we built.
5.2 On Learning Bayesian Networks from Data
- 80 -
5.2.1 Reading a Bayesian Network Learned From Data
Interpreting Bayesian Networks (BN) learned from data is different from using BN to represent
human expert knowledge and reasoning. Learned BNs represents conditional independency
structure. Knowledge based BNs generally represents causal relationships. Explaining the
dependencies in a BN learned from data as causal relationship can be misleading or even
dangerous.
The clomid analysis and breast implant analysis are two examples. In the clomid case, we
checked for temporal succession to make sure that clomid was taken before the diagnosis of
breast cancer, and matched clomid use on age and parity, two possible confounders. Therefore,
with the resulting statistical association we suggest clomid as a possible risk factor for breast
cancer with specific estrogen receptor status, or specifically, that it might increase the risk of ER+
breast cancer and decrease the risk of ER- breast cancer.
In the breast implant case, we first checked for temporal succession by comparing the date of
breast implant with the date of diagnosis of breast cancer. It turned out that a large number of
breast implants happened after breast cancer diagnosis. Then we marked these nurses as not
having breast implant, so that the temporal relationship is consistent with a causal hypothesis as
for a risk factor. The statistics after this manipulation showed statistically insignificant association
between breast cancer and breast implant. All these analyses together illustrate that the
dependency we found in the original Bayesian Network reveals a possible causal relationship, but
with breast cancer as the cause and breast implant as the result.
We also need always to keep in mind that the dependency structure we see is only one of a group
-
of equally or possibly even more probable structures. We cannot make any conclusion without a
reasonable amount of effort, and even then the conclusion is only suggestive, not definitive, as
we've seen in the d-separation analysis of the risk score model.
- 81
5.2.2 Bayesian Network and Highly Interdependent Data
With an infinite amount of data, we can build a joint probability distribution table as a predictor
with the best possible performance, i.e. no better prediction can be made without extra variable
information. In reality, however, it is always the case that our data are very limited. Even with
tens of thousands of records, we still soon run out of data as the number of data points necessary
to make a reasonably good estimation increases exponentially with the number of variables (and
their values) in the joint probability distribution table.
People have been using Bayesian Networks to deal with such problems by reducing the parameter
space with a well defined dependency structure, and in many cases have been successful.
However, if the variables in the data set are highly interdependent, we will still encounter
difficulties.
For example, for a data set with n binary variables, 2n parameters are necessary to fully describe
the joint probability distribution table. If the underlying dependency structure can be represented
with a Bayesian Network in which all nodes are isolated, i.e. all independent of each other, we
need only n parameters to fully describe the distribution. That's a reduction of parameters from
exponential to linear. If the underlining dependency structure can be represented with a Bayesian
Network in which every node has exactly one parent and one child, except for the top one and the
bottom one, we need 2n-1 parameters. More generally, a Bayesian Network with n nodes, each
n
2 pi number of parameters to fully describe, while pi is
has pi number of parents, will need
i=1
constrained by the acyclic restriction. When the Bayesian Network is a complete graph, this
number becomes 2n, which is the same as that of a joint probability distribution table.
If the underlying dependency structure is actually complex, the number of dependencies needed
- 82 -
in a BN to describe it grows, and the amount of data needed to learn these additional parameters
also grows. When the data are insufficient, we will end up with a simpler structure, reflecting
inadequate information to override the independence assumptions.
Therefore,
learning a good Bayesian
Network from data becomes
harder when the
interdependency among the variables increases. In extreme cases, it can be as hard as estimating a
joint probability distribution table.
In such situations, we will have to seek methods to reduce the interdependencies among the data.
Merging variables and consolidating variable values could be useful, yet our limited effort on this
problem didn't make much difference in the experiments reported here. This forms one
interesting problem for future discussion.
5.3 On Evaluation of Risk Predicting Models
5.3.1 Evaluation of Models
We need to clarify two factors in evaluating a risk prediction model. One is the prediction
capability of the model itself, or how well the model captures the structure of the data without
overfitting. This is generally performed by evaluating the generalization error of the model. The
other is the representativeness of the population on which the model is built. If the population
used to build the model is not representative, the generalization ability of the model will be
limited.
Evaluating the model using hold-out test samples, cross-validation, calibration curves, or other
similar methods gives a measure of the prediction capability of the model itself. Comparing the
marginal distributions of various variables with those in a "gold standard" population, e.g. the
whole population of the US, can give a rough idea on the representativeness of the population.
- 83 -
Evaluating the model on a different population, for instance, a different study, can be a
combination of both, but such efforts need to be interpreted with care because this other test
population may not be representative as well.
In this work, we evaluated the prediction capability of the model using a hold-out test set, and
therefore applying these models to any population other than NHS needs careful checking on the
target population distribution and probably even model adjustments.
We evaluated the models on discrimination using AUC, and on calibration using a calibration
curve, E/O ratio, goodness-of-fit test, and linear regression. The AUC of the models is within the
range from 0.66 to 0.70, and thus provides a relatively good discrimination power for breast
cancer. Considering calibration, all models passed a goodness-of-fit test with no statistically
significant lack-of-fit, and with a good linear relationship between predicted and observed risks,
especially after clustering.
Even though the discrimination power is limited (best AUC of 0.70, good for breast cancer, but
still limited), a well-estimated risk can still be useful in certain applications such as deciding
whether to administer a preventive intervention that has adverse and beneficial effects.
5.3.2 Comparison with the Model in Clinical Use
The Gail model is a clinically applied breast cancer risk model, and evaluation results of this
model are available. Spiegelman et al. reported a linear regression coefficient of 0.67 between
observed and predicted risks. [58] Costantino et al. reported overall E/O ratio 0.84 (95% CI
0.73-0.97) on Gail model 1 and overall E/O ratio 1.03 (95% CI 0.88-1.21) on Gail model 2. [38]
Rockhill et al. reported AUC of 0.58 (95% CI 0.56-0.60), and overall E/O ratio 0.94 (95% CI
0.89-0.99) on Gail model 2. [59]
- 84 -
Comparing models presented in this work with the Gail model, however, is infeasible at the
current stage of our work. This is because of two reasons. First, the Gail model predicts the risk
for all breast cancer, while the majority efforts of our work focused on ER+/PR+ breast cancer,
because it is clinically important and useful. Second, the Gail model predicts breast cancer risk in
a five year period (in clinical applications as well as the published evaluation results), while the
models of this work predict risks in two, four, or six year periods. Measures can be taken to make
these models comparable to the Gail model, which will be discussed later.
5.3.3 Clustering of Bayesian Network Prediction
The data set we used is highly unbalanced, not only in the class variable, breast cancer, but also in
predicting variables such as family history and hormone use. It is common that some of the
prediction buckets are much smaller than others. Predictions on these small buckets are generally
inaccurate. Clustering the small buckets together may help improve the robustness of the
prediction. The reason for improvement in calibration but not in AUC is due to the small size of
the inaccurate buckets, which are clustered in the process. For overall accuracy, such changes
may not have much impact. However, when applied in clinical decision making, for the
individuals falling into those small clusters, having a better estimation of risk (probability) can
make a significant impact on the final medical decision.
We reported the clustering results using the Agglomerative Information Bottleneck (AIB) method,
which improved the linear relationship between the predicted and observed risks without
sacrificing much discrimination power. During the exploratory analysis, we also tried clustering
based on the Euclidean distance between the inputs, with or without penalizing on the
Jensen-Shannon distance, and ad hoc clustering simply based on the output. We finally decided to
use AIB because it has a well-established theory and better empirical performance, i.e. faster
increase in linear relationship and slower decrease on AUC.
- 85 -
5.4 Summary of Contributions
In this thesis, we performed exploratory analysis on a rich collection of data from a large health
cohort study, the Nurses' Health Study.
In Chapter 3, we illustrated with examples how Bayesian Networks can be used to explore the
dependencies among a group of variables, and how to exercise care when explaining the found
dependencies and validate for causal relationship. We investigated the dependency structures
among a collection of clinical and life-style variables and a selected set of SNPs from different
genes, and obtained a null result that these SNPs may not be associated with breast cancer. We
also found during the exploration a new risk factor, clomid use, for breast cancer with specific
estrogen receptor status.
In Chapter 4, we first attempted to evaluate risk score as an index for breast cancer incidence, and
then we developed risk prediction models using Bayesian Networks, first on all breast cancer,
then on ER+/PR+ breast cancer, which is more prone to the current preventive intervention and
hence whose prediction is clinically more significant. We evaluated the performance of the risk
prediction models on discrimination ability using AUC, which is good compared to the Gail
model used clinically. We also evaluated the calibration of the models using calibration curves, X2
statistics, E/O ratio, and linear relationship of predicted and observed risks. We applied
Agglomerative Information Bottleneck clustering to construct a prediction filter for the risk
model, and improved the calibration performance without sacrificing much discrimination power
by applying the filter to prediction outputs.
Overall, our contributions include the discovery of a new risk factor for breast cancer with
specific estrogen receptor status, and the development of prediction models for breast cancer
incidence risks in different periods of time. We applied prediction output filters to the models we
-86 -
developed, and showed improved calibration performance, which is clinically important in
disease management and clinical trial planning. We also showed that with a reasonable amount of
effort, no dependency could be found between a group of SNPs and breast cancer, suggesting no
association exists between the related genes and breast cancer.
5.5 Future Work
Exploratory analysis on a large health cohort study such as the Nurses' Health Study can be
unending. The work presented in this thesis is very limited, and has a large potential for further
improvement.
5.5.1 Data Pre-processing
Data collected in the Nurses' Health Study don't have an overwhelming number of missing values,
and we didn't suffer much from missing values in this work. We used k-nearest neighbor
imputation to deal with the missing values. There are many superior methods to deal with missing
values, and recruiting those methods may render better results.
We already discussed discretization and consolidation of variable values. We tried different ways
of discretization, but there are more options available. For example, we can discretize continuous
variables in the process of learning a Bayesian Network, which we didn't try due to the limitation
of the tools we used. In the latest exploration, we tried discretization on Jensen-Shannon
divergence, which showed some signs of improvement. This could be another promising direction
for future work.
5.5.2 Evaluation of Risk Score As an Index for Breast Cancer
The work we presented on evaluating the risk score as an index for breast cancer is incomplete.
- 87 -
We believe that a similar analysis could be completed based on all the data that were used in
developing the log-incidence model, which were not all available to us in the present study.
With those additional data, we could search a wide range of possible Bayesian Network structures
to find the best predictive models and to assess our confidence in them.
5.5.3 Dealing with Highly Interdependent Data
Learning Bayesian Networks is difficult with highly interdependent data. There could be other
ways to help attack this problem. For example, we can try to use a model with latent variables,
hoping the introduction of a latent variable can reduce the interdependencies. Alternatively, we
can construct a group of Bayesian Networks from the data with different learning heuristics or
parameters, and compose a Bayesian Committee with this group of Bayesian Networks. Such a
meta-classifier may have more robust performance.
5.5.4 Prediction of 5-year Risks
Clinically, people are more used to estimating 5-year risks, partly due to the tradition. The models
presented in this work predict two, four, or six year risks, because the data were collected in two
year intervals. With time alignment and interpolation of the data, however, it is possible to
develop 5-year risk models using the same approach as in this work. In order to make this work
applicable in a clinical setting, probably developing a 5-year risk model is necessary.
The development of a 5-year risk model will also help to make it comparable with the Gail model,
if clinical guidance can be established to project ER+/PR+ risks to all breast cancer risks.
5.5.5 Clomid and Breast Caner
-
Clomid, or clomiphene citrate, has been widely used in treatment for fertility enhancement since
- 88
the early 1970's, because of its ease of administration and minimal side effects. It acts as a
selective estrogen receptor modulator, binding to estrogen receptors and has mixed agonist and
antagonist activity depending upon the target tissue. [113] Clomid works as an estrogen
antagonist at the hypothalamus, causing it to sense low levels of serum estrogen and increase the
production of gonadotropin releasing hormone, which stimulates the pituitary gland to produce
follicle stimulating hormone (FSH). FSH stimulates the development of the ovarian follicles,
which contain the eggs. In addition, clomid works on the pituitary gland to boost the production
of Luteinizing Hormone, which triggers the ovulation process and maturation of the eggs. [ 114]
Clomid works as an estrogen agonist on some tissues, while estrogen and estrogen-like hormones
are known to be risk factors for breast cancer. Is this why clomid use is associated with higher
risk of ER+ breast cancer? If so, are there similar effects of other estrogen-like hormones, i.e., do
they specifically raise the risk of ER+ breast cancer?
Women take clomid as a fertility drug, which happens during reproductive age, or at least before
menopause. Women take PMH for hormone replacement therapy, generally after menopause.
While tamoxifen is generally used for prevention and treatment of early stage breast cancer,
which is more likely to happen in the later part of a women's life. Could this different stage of life
when taking the hormones have an impact on the outcome?
Clomid causes many different hormonal changes. To name a few, it increases androstenedione,
dehydroepiandrosterone,
dehydroepiandrosterone
sulfate,
dihydrotestosterone,
estradiol,
estrogens (urine), follicle stimulating hormone, luteinizing hormone, progesterone, testosterone,
free testosterone, thyroxine binding globin, thyroid stimulating hormone, and decreases
cholesterol, triiodothyronine and free thyroxine index. [115] Could any of these changes, or any
-
other not listed here, be related to breast cancer risks?
- 89
All these questions cannot be answered without further investigation and form interesting topics
for future work.
- 90 -
Bibliography
1. Nurses' Health Study. Available at: http://www.channing.harvard.edu/nhs/hist.html.
2. R. Chen, personal communication.
3. Chan AT, Giovannucci EL, Meyerhardt JA, Schernhammer ES, Curhan GC, Fuchs CS.
Long--term Use of Aspirin and Nonsteroidal Anti-inflammatory Drugs and Risk of
Colorectal Cancer. JAMA. 2005;294:914-923.
4. The Nurses' Health Study (NHS).
Available at: http://www.channing.harvard.edu/nhs/index.html
5. Cotran RS, Kumar V, Collins T. Pathologic Basis of Disease,
6 th Edition.
W.B. Saunders
Company, 1999.
6. American Cancer Society.
Available at: http://www.cancer.org/docroot/STT/stt_0_2004. asp?sitearea=STT&level=l.
7. Bross DJ, Blumenson LE. Statistical testing of a deep mathematical model for human
breast cancer. J Chronic Dis. 1968 Dec; 21(7):493-506.
8. Hilf R. Will the Best Model of Breast Cancer Please Come Forward. Natl Cancer Inst
Monogr. 1971 Dec; 34:43-54.
9. MacMahon B, Cole P, Lin TM, Lowe CR, Mirra AP, Ravnihar B, Salber EJ, Valaoras VG,
Yuasa S. Age at first birth and breast cancer risk. Bull World Health Organ. 1970;
43(2):209-21.
10. Staszewski J. Age at menarche and breast cancer. J Natl Cancer Inst. 1971 Nov;
47(5)::935-40.
11. Trichopoulos D, MacMahon B, Cole P.
Menopause and breast cancer risk. J Natl
Cancer Inst. 1972 Mar; 48(3):605-13.
12. MacMahon B, Morrison AS, Ackerman LV, Lattes R, Taylor HB, Yuasa S. Histologic
characteristics of breast cancer in Boston and Tokyo. Int J Cancer. 1973 Mar 15;
11(2):338-44.
13. De Waard F, Cornelis JP, Aoki K, Yoshida M. Breast cancer incidence according to
weight and height in two cities of the Netherlands and in Aichi prefecture, Japan. Cancer.
1977 Sep; 40(3):1269-75.
14. Farewell VT. The combined effect of breast cancer risk factors. Cancer. 1977 Aug;
40(2):931-6.
15. Anderson DE, Badzioch MD. Risk of familial breast cancer. Cancer. 1985 Jul 15;
56(2):383-7.
16. Henderson BE, Pike MC, Casagrande JT. Breast cancer and the oestrogen window
hypothesis. Lancet. 1981 Aug 15;2(8242):363-4.
17. Thomas DB, Persing JP, Hutchinson WB. Exogenous estrogens and other risk factors for
breast cancer in women with benign breast diseases. J Natl Cancer Inst. 1982
Nov;69(5): 1017-25.
18. Thomas DB. Factors that promote the development of human breast cancer. Environ
- 91 -
Health Perspect. 1983 Apr; 50:209-18.
19. Lipsett MB. Hormones, medications, and cancer. Cancer. 1983 Jun 15; 51(12
Suppl):2426-9.
20. Trichopoulos D, Hsieh CC, MacMahon B, Lin TM, Lowe CR, Mirra AP, Ravnihar B,
Salber EJ, Valaoras VG, Yuasa S. Age at any birth and breast cancer risk. Int J Cancer.
1983 Jun 15;31(6):701-4.
21. Polednak AP, Janerich DT. Characteristics of first pregnancy in relation to early breast
cancer. A case-control study. J Reprod Med. 1983 May ;28(5):314-8.
22. Moolgavkar SH, Day NE, Stevens RG. Two-stage model for carcinogenesis:
Epidemiology of breast cancer in females. J Natl Cancer Inst. 1980 Sep;65(3):559-69.
23. Pike MC, Krailo MD, Henderson BE, Casagrande JT, Hoel DG. Hormonal risk factors,
breast tissue age and the age-incidence of breast cancer. Nature. 1983 Jun 30;
303(5920):767-70.
24. Ottman R, Pike MC, King MC, Henderson BE. Practical guide for estimating risk for
familial breast cancer. Lancet. 1983 Sep 3; 2(8349):556-8.
25. Claus EB, Risch N, Thompson WD. The calculation of breast cancer risk for women with
a first degree family history of ovarian cancer. Breast Cancer Res Treat 1993; 28:115-20.
26. Claus EB, Risch N, Thompson WD. Autosomal dominant inheritance of early-onset
breast cancer. Implications for risk prediction. Cancer 1994; 73:643- 51.
27. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, Mulvihill JJ.
Projecting individualized probabilities of developing breast cancer for white females who
are being examined annually. J Natl Cancer Inst. 1989 Dec 20;81(24):1879-86.
28. Anderson SJ, Ahnn S, Duff K. NSABP Breast Cancer Prevention Trial risk assessment
program, version 2. NSABP Biostatistical Center Technical Report, August 14, 1992.
29. Couch FJ, DeShano ML, Blackwood
MA, et al: BRCA1 mutations
in women attending
clinics that evaluate the risk of breast cancer. N Engl J Med 336:1409-1415, 1997.
30. Shattuck-Eidens
D, Oliphant A, McClure M, et al: BRCA1 sequence analysis in women
at high risk for susceptibility mutations: Risk factor analysis and implications for genetic
testing. J Am Med Assoc 278:1242-1250, 1997.
31. Frank TS, Manley SA, Olopade 01, et al: Sequence analysis of BRCA1 and BRCA2:
Correlation of mutations with family history and ovarian cancer risk. J Clin Oncol
16:2417-2425,
1998.
32. Pathak DR, Whittemore AS. Combined effects of body size, parity, and menstrual events
on breast cancer incidence in seven countries. Am J Epidemiol. 1992 Jan
15;135(2):153-68.
: Determining carrier probabilities for breast
33. Parmigiani G, Berry D, Aguilar
cancer-susceptibility genes BRCA1 and BRCA2. Am J Hum Genet 62:145-158, 1998.
34. Rosner B, Colditz GA. Nurses' health study: log-incidence mathematical model of breast
cancer incidence. J Natl Cancer Inst 1996; 88: 359 - 64.
35. Colditz GA, Rosner B. Cumulative risk of breast cancer to age 70 years according to risk
factor status: data from the Nurses' Health Study. Am J Epidemiol 2000; 152: 950 - 64.
- 92 -
36. Colditz GA, Rosner BA, Chen WY, Holmes MD, Hankinson SE. Risk factors for breast
cancer according to estrogen and progesterone receptor status. J Natl Cancer Inst 2004;
96: 2]18-28.
37. Pike MC, Krailo MD, Henderson BE, Casagrande JT, Hoel DG. Hormonal risk factors,
breast tissue age and the age-incidence of breast cancer. Nature 1983; 303:767-70.
38. Costantino J, Gail MH, Pee D, Anderson S, Redmond CK, Benichou J, Wieand HS.
Validation studies for models projecting the risk of invasive and total breast cancer
incidence. J Natl Cancer Inst 1999; 91:1541-8.
39. Tyrer J, Duffy SW, Cuzick J. A breast cancer prediction model incorporating familial and
personal risk factors. Stat Med 2004; 23: 1111 - 30.
40. Miki Y, Swensen J, Shattuck-Eidens D, et al. A strong candidate for the breast and
ovarian cancer susceptibility gene BRCA1. Science 1994;266:66-71.
41. Wooster R, Bignell G, Lancaster J, et al. Identification of the breast cancer susceptibility
gene BRCA2. Nature 1995;378:789-92.
42. Tavtigian SV, Simard J, Rommens J, et al. The complete BRCA2 gene and mutations in
chromosome 13q-linked kindreds. Nat Genet 1996;12: 333-7.
43. Berry DA, Iversen ES Jr, Gudbjartsson DF, Hiller EH, Garber JE, Peshkin BN, et al.
BRCAPRO validation, sensitivity of genetic testing of BRCA1/BRCA2, and prevalence
of other breast cancer susceptibility genes. J Clin Oncol 2002; 20: 2701 - 12.
44. Frank TS, Deffenbaugh AM, Reid JE, Hulick M, Ward BE, Lingenfelter B, et al. Clinical
characteristics of individuals with germline mutations in BRCA1 and BRCA2: analysis
of 10 000 individuals. J Clin Oncol 2002; 20: 1480 - 90.
45. Antoniou AC, Pharoah PD, McMullan G, Day NE, Stratton MR, Peto J, et al. A
comprehensive model for familial breast cancer incorporating BRCA1, BRCA2 and other
genes. Br J Cancer 2002; 86: 76 - 83.
46. de la Hoya M, Osorio A, Godino J, Sulleiro S, Tosar A, Perez-Segura P, et al. Association
between BRCA 1 and BRCA2 mutations and cancer phenotype in Spanish breast/ovarian
cancer families: implications for genetic testing. Int J Cancer 2002; 97: 466 - 71.
47. de la Hoya M, Diez 0, Perez-Segura P, Godino J, Fernandez JM, Sanz J, Alonso C,
Baiget M, Diaz-Rubio E, Caldes T. Pre-test prediction models of BRCAlor
BRCA2mutation in breast/ovarian families attending familial cancer clinics. J. Med.
Genet 2003; 40:503-10.
48. Vahteristo P, Eerola H, Tamminen A, Blomqvist C, Nevanlinna H. A probability model
for predicting BRCA 1 and BRCA2 mutations in breast and breast-ovarian cancer families.
Br J Cancer2001; 84: 704 - 8.
49. Hartge P, Struewing JP, Wacholder S, Brody LC, Tucker MA. The prevalence of common
BRCA1 and BRCA2 mutations among Ashkenazi Jews. Am J Hum Genet 1999; 64: 963
-70.
50. Apicella C, Andrews L, Hodgson SV, Fisher SA, Lewis CM, Solomon E, et al. Log odds
of carrying an ancestral mutation in BRCA1 or BRCA2 for a defi ned personal and
family history in an Ashkenazi Jewish woman (LAMBDA). Breast Cancer Res 2003; 5:
- 93 -
R206- 16.
51. Jonker MA, Jacobi CE, Hoogendoorn WE, Nagelkerke NJ, de Bock GH, van
Houwelingen JC. Modeling familial clustered breast cancer using published data. Cancer
Epidemiol Biomarkers Prev 2003; 12: 1479 - 85.
52. Gilpin CA, Carson N, Hunter AG. A preliminary validation of a family history assessment
form to select women at risk for breast or ovarian cancer for referral to a genetics center.
Clin Genet 2000; 58: 299 - 308.
53. Fisher TJ, Kirk J, Hopper JL, Godding R, Burgemeister FC. A simple tool for identifying
unaffected women at a moderately increased or potentially high risk of breast cancer
based on their family history. Breast 2003; 12:120 -7.
54. Wingo PA, Ory HW, Layde PM, Lee NC. The evaluation of the data collection process
for a multicenter, population-based, case-control design. Am J Epidemiol
1988;128:206-17.
55. Gail MH, Benichou J. Assessing the risk of breast cancer in individuals. In: DeVita VT Jr,
Hellman S, Rosenberg SA, editors. Cancer prevention. Philadelphia (PA): Lippincott;
1992. p. 1-15.
56. Gail MH, Benichou J. Validation studies on a model for breast cancer risk [editorial]
[published erratum appears in J Natl Cancer Inst 1994;86:803]. J Natl Cancer Inst
1994;86:573-5.
57. Bondy ML, Lustbader ED, Halabi S, Ross E, Vogel VG. Validation of a breast cancer risk
assessment model in women with a positive family history. J Natl Cancer Inst
1994;86:620-5.
58. Spiegelman D, Colditz GA, Hunter D, Hertzmark E. Validation of the Gail et al. model
for predicting individual breast cancer risk. J Natl Cancer Inst 1994;86:600-7.
59. Rockhill B, Spiegelman D, Byrne C, Hunter DJ, Colditz GA. Validation of the Gail et al.
model of breast cancer risk prediction and implications for chemoprevention. J Natl
Cancer Inst 2001;93:358-66.
60. Yasui Y, Potter JD. The shape of the age-incidence curves of female breast cancer by
hormone-receptor status. Cancer Causes Control 1999;10:431-7.
61. Tarone RE, Chu KC. The greater impact of menopause on ER- than ER+ breast cancer
incidence: a possible explanation (United States). Cancer Causes Control 2002; 13:7-14.
62. Hulka BS, Chambless LE, Wilkinson WE, Deubner DC, McCarty KS Sr, McCarty KS Jr.
Hormonal and personal effects of estrogen receptors in breast cancer. Am J Epidemiol
1984; 19:692-704.
63. Hildreth NG, Kelsey JL, Eisenfeld AJ, LiVolsi VA, Holford TR, Fischer DB. Differences
in breast cancer risk factors according to the estrogen receptor level of the tumor. J Natl
Cancer Inst 1983;70:1027-31.
64. Wolpert D. Stacked generalization. Neural Networks, 1992; 5: 241 - 59.
65. Breiman L. Bagging predictors. Machine Learning, 1996; 24(2):123 - 40.
66. Breiman L. Random forest. Machine Learning, 2001; 45:5 - 32.
67. Dempster AP, Laird D, Rubin D. Maximum likelihood from incomplete data via the EM
- 94 -
algorithm (with discussion). Journal of the Royal Statistical Society 1977; 39:1-38.
68. Geman S, Geman D. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE transactions on Pattern Analysis and Machine Intelligence
1984; 6:721-41.
69. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers, 1988.
70. Chickering DM. Learning Bayesian networks is NP-Complete. Learning from Data:
Artificial Intelligence and Statistics, pp. 121-130. Springer-Verlag, 1996.
71. Pacific Fertility Center.
Available at: http://www.infertilitydoctor.com/treat/clomiphene.htm
72. Jordan M. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.
73. Bayesware Limited.
Available at: http://www.bayesware.com/products/discoverer/discoverer.html
74. Cooper G, Herskovitz E. A bayesian method for the induction of probabilistic networks
from data. Machine Learning, 1992; 309-347
75. Ramoni M. Personal communication.
76. Dixon JK. Pattern recognition with partly missing data. IEEE Transactions on Systems,
Man, and Cybernetics, 1979; SMC-9, 10, 617-621.
77. Duda, RO., Hart, PE., and Stork, DG.. Pattern Classification 2nd, John Wiley & Sons,
2001.
78. Troyanskaya
O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Bostein D.
and Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics,
2001; 17(6): 520-5.
79. Tago C, Hanai T. Prognosis prediction by microarray gene expression using support
vector machine. Genome Informatics, 2003; 14: 314-5.
80. National Human Genome Research Institute. Introduction to the Human Genome Project,
[online] available http://www.genome.gov/10001772, 2003.
81. Thompson MW, Mclnnes RR, and Willard HF. Thompson & Thompson: Genetics in
Medicine, Fifth Edition, W. B. Saunders Company, Philadelphia, 1991.
82. Wang DG, et al. Large-scale identification, mapping, and genotyping of single-nucleotide
polymorphisms in the human genome. Science, vol. 280, no. 5366, pp. 1077-1082, 1998.
83. Altshuler D, et al. An SNP map of the human genome generated by reduced
representation shotgun sequencing. Nature, vol. 407, no. 6803, pp. 513-516, 2000.
84. Mullikin JC, et al. An SNP map of human chromosome 22. Nature, vol. 407, no. 6803, pp.
516-520 2000.
85. Marth GT, et al. A general approach to single-nucleotide polymorphism discovery. Nature
Genetics, vol. 23, no. 4, pp. 452-456, 1999.
86. Botste in D, and Risch N. Discovering genotypes underlying human phenotypes: past
successes for mendelian disease, future approaches for complex disease. Nature Genetics
Supplement, vol. 33, pp. 228-237, 2003.
95 -
87. Environmental Genome Project. Polymorphism and genes. [online] available
http://www.niehs.nih.gov/envgenom/polymorf.htm, 2003.
88. Hall JM, et al. Linkage of Early-Onset Familial Breast Cancer to Chromosome 17q21.
Science, vol. 250, no. 4988, pp. 1684-1689, 1990.
89. Miki Y, et al. A strong candidate for the breast and ovarian cancer susceptibility gene
BRCA 1. Science, vol. 266, no. 5182, pp. 66-71, 1994.
90. Wooster R, et al. Identification of the breast cancer susceptibility gene BRCA2. Nature,
vol. 278, no. 6559, pp. 789-792, 1995.
91. Serova OM, et al. Mutations in BRCA1 and BRCA2 in breast cancer families: are there
more breast cancer-susceptibility genes? American Journal of Human Genetics, vol. 60,
no. 3, pp. 486-495,
1997.
92. Seitz S. Strong indication for a breast cancer susceptibility gene on chromosome
8p 12-p22: linkage analysis in German breast cancer families. Oncogene, vol. 14, no. 6,
pp. 741-743, 1997.
93. de Jong MM, et al. Genes other than BRCA1 and BRCA2 involved in breast cancer
susceptibility. Journal of Medical Genetics, vol. 39, no. 4, pp. 25-242, 2002.
94. Helzlsouer KJ, et al. Association between glutathione S-transferase Ml, P 1, and T 1
genetic polymorphisms and development of breast cancer. Journal of the National Cancer
Institute, vol. 90, no. 7, pp. 512-518, 1998.
95. Charrier J, et al. Allelotype influence at glutathione S-transferase M1 locus on breast
cancer susceptibility. British Journal of Cancer, vol. 79, no. 2, pp. 346-353, 1999.
96. Ambrosone CB, et al. Cytochrome P4501A1 and glutathione S-transferase (M1) genetic
polymorphisms and postmenopausal breast cancer risk. Cancer Research, vol. 55, no. 16,
pp. 3483-3485, 1995.
97. Kristensen VN, et al. A rare CYP 19 (aromatase) variant may increase the risk of breast
cancer. Pharmacogenetics, vol. 8, no. 1, pp. 43-48, 1998.
98. Haiman CA, et al. No association between a single nucleotide polymorphism in CYP19
and breast cancer risk. Cancer Epidemiology Biomarkers & Prevention, Vol. 11, no. 2, pp.
215-216, 2002.
99. Haiman CA, et al. A tetranucleotide repeat polymorphism in CYP 19 and breast cancer
risk. International Journal of Cancer, vol. 87, no. 2, pp. 204-210, 2000.
100. Bailey LR, et al. Breast cancer and CYPIA1, GSTM1, and GSTT1 polymorphisms:
evidence of a lack of association in Caucasians and African Americans. Cancer Research,
vol. 58, no. 1, pp. 65-70, 1998.
101. Huang CS, et al. Breast cancer risk associated with genotype polymorphism of the
estrogen-metabolizing genes CYP 17, CYP lA 1, and COMT: a multigenic study on cancer
susceptibility. Cancer Research, vol. 59, no. 19, pp. 4870-4875, 1999.
102. Garcia-Closas M, Glutathione S-transferase mu and theta polymorphisms and breast
cancer susceptibility. Journal of the National Cancer Institute, vol. 91, no. 22, pp.
1960-1964, 1999.
96 -
103. Ishibe N, et al. Cigarette smoking, cytochrome P450 A1 polymorphisms, and breast
cancer risk in the Nurses' Health Study. Cancer Research, vol. 58, no. 4, pp. 667-671,
1998.
104. Millikan RC, et al. Cigarette smoking, N-acetyltransferases 1 and 2, and breast cancer
risk. Cancer Epidemiology Biomarkers & Prevention, vol. 7, no. 5, pp. 371-378, 1998.
105. Hunter DJ, et al. A prospective study of NAT2 acetylation genotype, cigarette smoking,
and risk of breast cancer. Carcinogenesis, vol. 18, no. 11,pp. 2127-2132, 1997.
106. Gertig DM, et al. N-acetyl transferase 2 genotypes, meat intake and breast cancer risk.
International Journal of Cancer, vol. 80, no. 1, pp. 13-17, 1999.
107. Hines LM, et al. A prospective study of the effect of alcohol consumption and ADH3
genotype on plasma steroid hormone levels and breast cancer risk. Cancer Epidemiology
Biomarkers & Prevention, vol. 9, no. 10, pp. 1099-1105, 2000.
108. HaimnanCA, et al. The relationship between a polymorphism in CYP 17 with plasma
hormone levels and breast cancer. Cancer Research, vol. 59, no. 5, pp. 1015-1020, 1999.
109. Guillemette C, et al. Association of genetic polymorphisms in UGT1A1 with breast
cancer and plasma hormone levels. Cancer Epidemiology Biomarkers & Prevention, vol.
10, no. 6, pp. 711-714, 2001.
110. Hairnan CA, et al. A polymorphism in CYP 17 and endometrial cancer risk. Cancer
Research, vol. 61, no. 10, pp. 3955-3960, 2001.
111. Haiman CA, et al. The androgen receptor CAG repeat polymorphism and risk of breast
cancer in the Nurses' Health Study. Cancer Research, vol. 62, no. 4, pp. 1045-1049, 2002.
112. Curtis, et al. Use of an artificial neural network to detect association between a disease
and multiple marker genotypes. Annals of Human Genetics, vol. 65, no. 1, pp. 95-107,
2001.
113. Seli E, Arici A. Ovulation induction with clomiphene citrate.
Available at: http://www.uptodate.com
114. Peris A. Clomiphene Therapy.
Available at: http://www.fertilitext.org/p2_doctor/clomid.html
115. Becker KL, Bilezikian JP, Bremner WJ, Hung W, Kahn CR, Loriaux DL, Nylen ES,
Rebar RW, Robertson GL, Snider RH. Principles and Practice of Endocrinology and
Metabolism, 3rd edition. Lippincott Williams & Wilkins, 2001.
116. Rossing MA, Daling JR, Weiss NS, Moore DE, Self SG.. Risk of breast cancer in a
cohort in infertile women. Gynecol Oncol. 1996 Jan;60(1):3-7.
117. Venn A, Watson L, Bruinsma F, Giles G, Healy D. Risk of cancer after use of fertility
drugs with in-vitro fertilisation. Lancet. 1999 Nov 6;354(9190): 1586-90.
118. Brinton LA, Scoccia B, Moghissi KS, Westhoff CL, Althuis MD, Mabie JE, Lamb EJ.
Breast cancer risk associated with ovulation-stimulating drugs. Hum Reprod. 2004
Sep; 19(9):2005-13. Epub 2004 Jun 24.
119. Reich
and Regauer S. Do drugs that stimulate ovulation increase the risk for
endometrial stromal sarcoma? Human Reproduction Vol.20, No.4 pp. 1112, 2005.
120. Brinton LA, Scoccia B, Moghissi KS, Westhoff CL, Althuis MD, Mabie JE, Lamb EJ.
- 97 -
Reply: Do drugs that stimulate ovulation increase the risk for endometrial stromal
sarcoma? Human Reproduction Vol.20, No.4 pp. 1112-1113, 2005.
121. Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics.
2005 Apr; 6(2):227-39.
122. Slonim N, Tishby N. Agglomerative Information Bottleneck. In NIPS'99.
123. Dougherty, J, Kohavi, R, & Sahami, M. Supervised and unsupervised discretization of
continuous features. Proceedings of the Twelfth International Conference on Machine
Learning,
124. Freedman
1995. Tahoe City, CA. pp. 194-202.
AN, Seminara
D, Gail MH, Hartge P, Colditz GA, Ballard-Barbash
R,
Pfeiffer RM. Cancer risk prediction models: a workshop on development, evaluation, and
application. Journal of the National Cancer Institute, 2005; 97(10): 715-23.
125. National Cancer Institute. Breast Cancer Prevention Studies.
Available at: http://www.cancer.gov/cancertopics/factsheet/Prevention/breast-cancer
- 98
-