Random Forest Clustering and Other Statistical Methods

This manuscript provides the supplementary methods for the paper:
Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M and Kurdistani SK.
Global histone modification patterns predict risk of prostate cancer recurrence.
Nature (2005).
Section 1
Prostate Tissue Microarray (TMA). Under IRB approval, a prostate tissue microarray
(TMA) was constructed using formalin-fixed, paraffin-embedded prostate tissue samples
provided through the Department of Pathology and Laboratory Medicine at the UCLA
Medical Center. Primary radical prostatectomy cases from 1984-1995 were randomly
selected from the pathology database. The original H&E stained diagnostic slides were
reviewed by a study pathologist (D.S.) utilizing the Gleason histological grading [1] and
the 1997 AJCC/UICC TNM classification systems [2]. Case material from 246
prostatectomies was arrayed into 3 blocks encompassing a total of 1,364 individual tissue
cores. All cases were of the histological type adenocarcinoma, conventional, not
otherwise specified (NOS) [3].
TMAs were constructed as previously described [4]. At least 3 replicate tumor
samples were taken from donor tissue blocks in a highly representative fashion. Matched
morphologically normal, hypertrophic (BPH) and in situ neoplastic lesions (PIN), were
also arrayed, when available. Twenty patients treated with neoadjuvant hormones were
excluded from the study. Of the remaining 226 cases, 183 (81%) were informative for all
5 histone markers and 171 of those also were supported by complete recurrence data. In
this group, the median age at the time of surgery was 65 (range 46 to 76). 104 (57%)
patients were low grade (Gleason score 2-6); 79 (43%) were high grade (Gleason score
7-10). Half of the tumors (50%) were not confined to the prostate (organ confined = T2a or
T2b with negative lymph nodes, no capsular extension and with negative surgical
margins). Sixty (33%) patients were margin positive and 33 (18%) had seminal vesicle
invasion (pT3b). Regarding capsular invasion, 38 (21%) had no invasion, 101 (55%) had
invasion, and 44 (24%) had capsular extension. Concurrent regional lymphadenectomy
accompanied 181 (99%) cases, 12 of which (7%) were positive for metastases. The
maximum pre-operative serum PSA was known for 160 patients (87%), with a median
value of 8.9 ng/ml (range 0.6-96.5).
Supplement Table 2 shows the clinicopathologic data for the subset of patients
with low grade (Gleason Score 2-6) tumors (n=104). The median age at the time of
surgery of this group was 64 (range 46 to 75). In the low grade tumors, 36% were not
confined to the prostate. Twenty five (24%) patients were margin positive and 5 (5%) had
seminal vesicle invasion (pT3b). Twenty eight (27%) had no capsular invasion by tumor,
56 (54%) had capsular invasion, and 20 (19%) had capsular extension. Concurrent
regional lymphadenectomy accompanied 102 (98%) cases, only 2 of which (2%) were
positive for metastases. The maximum pre-operative serum PSA was known for 93
patients (89%), with a median value of 7.8 ng/ml (range 0.6-56.0).
A retrospective analysis for outcome assessment was based on detailed
anonymized clinicopathologic information linked to the TMA tissue specimens.
Recurrence, defined as a postoperative serum PSA of 0.2 ng/ml or greater, was seen in 61
(34%) of all study patients, and 20 (19%) of patients with low grade tumors. The median
total follow-up, defined as the time to recurrence or to last contact in non-recurring
patients, was 50 months (range 1.0-163) for all patients, and 60.0 months (range 2-163)
for patients with low grade tumors.
The median follow-up time within the recurring and non-recurring patient groups was
22.0 (1.0-115.0) and 65.5 months (range 2.0-163.0), respectively, in all patients, and 30.5
(2.0-98.0) and 65.5 months (range 2.0-163.0), respectively, in patients with low grade
tumors.
Immunohistochemistry. A standard 2-step indirect immunohistochemical staining
method was used for all antibodies (DAKO, Carpinteria, CA). Tissue array sections (4
µm thick) were cut immediately prior to staining using a TMA sectioning aid
(Instrumedics, NJ). Following deparaffinization in xylenes, the sections were rehydrated
in graded alcohols. Endogenous peroxidase was quenched with 3% hydrogen peroxide in
methanol at room temperature. The sections were placed in a 95 °C solution of 0.01 M
sodium citrate buffer (pH 6.0) for antigen retrieval. 5% normal goat serum was next
applied for 30 min to block non-specific protein binding sites. Primary rabbit
anti-histone polyclonal antibodies were applied for 30 min at room temperature (H3K18Ac at
1:200; H4R3diMe at 1:25; H3K9Ac at 1:800; H4K12Ac at 1:100; and H3K4diMe at
1:800 dilution from stock). Detection was accomplished using the DAKO Envision
System, followed by chromogen detection with diaminobenzidine (DAB). Incubations
were performed in a humidity chamber. The sections were counterstained with Harris’
Hematoxylin, followed by dehydration and mounting. Negative controls were identical
array sections stained minus the primary antibody.
Scoring of immunohistochemistry. Semi-quantitative assessment of antibody staining
on the TMAs was performed by two study pathologists (H.Y. and D.B.S.) blinded to the
clinicopathologic variables. Prostatic glandular epithelium was the scored target tissue;
scoring of benign tissues did not include basal cells. Both tissue spot histology and
grading were confirmed on Hematoxylin and Eosin (H&E) stained TMA slides, as well
as on all of the counterstained study slides. The frequency of nuclear expression positive
target cells (range 0-100%) was scored for each TMA spot.
REFERENCES
1. Gleason, D.F., Classification of prostatic carcinomas. Cancer Chemother Rep,
1966. 50(3): p. 125-8.
2. Sobin, L.H. and I.D. Fleming, TNM Classification of Malignant Tumors, fifth
edition (1997). Union Internationale Contre le Cancer and the American Joint Committee
on Cancer. Cancer, 1997. 80(9): p. 1803-4.
3. Young, R.H., Srigley, J.R., Amin, M.B., Ulbright, T.M., Cubilla, A., Tumors of
the prostate gland, seminal vesicle, male urethra, and penis, in Atlas of Tumor
Pathology. 2000, Armed Forces Institute of Pathology: Washington DC.
4. Kononen, J., et al., Tissue microarrays for high-throughput molecular profiling of
tumor specimens. Nat Med, 1998. 4(7): p. 844-7.
Section 2
Random Forest Clustering and Other Statistical Methods
In part A, we will discuss and review Random Forest (RF) clustering. In part B, we will
compare RF clustering to a more standard clustering analysis involving the Euclidean
distance.
A. Review of Random Forest Clustering
What are RF predictors?
RF predictors are a state-of-the-art prediction method that has been shown to work well
with many different types of data (1). A random forest predictor is a collection of individual
classification tree predictors. The random forest construction allows one to construct a
similarity measure between two samples by counting the number of times a tree predictor
places them in the same terminal node. Random forests can also be used for unsupervised
learning problems (class outcomes unknown) by first generating synthetic data, which are
chosen to represent the null hypothesis of no dependence structure in the data. Here we
generate synthetic observations by randomly sampling from the product of empirical
marginal distributions of the observed data. Then a random forest predictor is constructed
to distinguish observed from synthetic data. By restricting the resulting intrinsic similarity
measure to the observed data, one can define a similarity measure between the unlabeled
observations. To protect against random fluctuations due to Monte Carlo sampling, we
generate 100 synthetic data sets and average the resulting similarity measures. We define the
RF dissimilarity measure as the square root of 1 minus the similarity measure (4). We use
the RF dissimilarity as input for classical multidimensional scaling (MDS), which is related to
principal component analysis. It takes the dissimilarities between samples and returns a set
of points in a low dimensional Euclidean space such that the Euclidean distances between
the points are approximately equal to the RF dissimilarities (2,3). To cluster the points we
grouped the samples along the arms of the resulting “U” shape. The results are extremely
similar to using the RF dissimilarity directly in partitioning around medoids (PAM) clustering
(4).
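The two steps that are specific to RF clustering can be sketched in a few lines. This is a minimal illustration in Python (the analyses in this supplement were carried out in R), not the authors' implementation. The function names `synthetic_from_marginals` and `rf_dissimilarity`, the toy staining values, and the hard-coded terminal-node assignments are all assumptions made for illustration; in practice the node assignments would come from a random forest fitted to distinguish observed from synthetic data.

```python
import math
import random


def synthetic_from_marginals(observed, rng):
    """Draw one synthetic data set from the product of empirical marginals.

    Each column is resampled independently (with replacement) from the observed
    values of that column, which removes any dependence between columns.
    """
    n_rows = len(observed)
    columns = list(zip(*observed))
    sampled = [[rng.choice(col) for _ in range(n_rows)] for col in columns]
    return [list(row) for row in zip(*sampled)]


def rf_dissimilarity(leaf_ids):
    """Turn terminal-node assignments into an RF dissimilarity matrix.

    leaf_ids[t][i] is the terminal node that tree t assigns to observed sample i.
    Similarity is the fraction of trees that place two samples in the same node;
    the dissimilarity is sqrt(1 - similarity).
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            d = math.sqrt(1.0 - same / n_trees)
            dissim[i][j] = dissim[j][i] = d
    return dissim


rng = random.Random(0)
observed = [[10, 80], [20, 70], [90, 15], [85, 5]]  # toy staining percentages
synthetic = synthetic_from_marginals(observed, rng)

# Hypothetical terminal-node assignments from a 4-tree forest (illustration only).
leaf_ids = [[1, 1, 2, 2], [3, 3, 3, 4], [5, 6, 7, 7], [8, 8, 9, 9]]
dissim = rf_dissimilarity(leaf_ids)
```

To follow the procedure described above, one would repeat the synthetic sampling and forest fit many times (100 in this study) and average the similarities before taking the square root.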
Why Random Forests Clustering?
One major input of clustering analysis is the dissimilarity measure (4). We propose to use a
random forest dissimilarity for TMA data since it has the following relevant theoretical
advantages (5).
First, the clustering results do not change when one or more covariates are monotonically
transformed since the dissimilarity only depends on the feature ranks. Thus, one does not
need to worry about symmetrizing skewed covariate distributions.
Second, the RF dissimilarity weighs the contributions of each covariate on the dissimilarity in
a natural way: the more related the covariate is to other covariates, e.g. the more correlated a
protein marker is with other markers, the more it will affect the definition of the RF
dissimilarity.
Third, the RF dissimilarity does not require the user to specify threshold values for
dichotomizing tumor expressions. Since the RF dissimilarity is based on individual tree
predictors, which dichotomize the expression values as part of their construction, the RF
dissimilarity automatically dichotomizes the expressions in a principled, data-driven way. It is
standard practice in supervised analyses to dichotomize tumor marker expressions for ease
of interpretation and reproducibility. But we caution against using external threshold values
for dichotomizing expressions in unsupervised analyses since dichotomization may reduce
the information content or even bias the results. In contrast, RF clustering automatically
dichotomizes staining scores in a principled, data-driven way.
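The first advantage (invariance to monotone transformations) can be checked on a toy example. The sketch below, in Python with made-up staining scores, enumerates every split of the samples that a single threshold can produce, which is what a tree node chooses among, and compares the result before and after a log transform. The helper name `threshold_partitions` is hypothetical.

```python
import math


def threshold_partitions(values):
    """Return every left-hand group of sample indices that a single
    threshold on `values` can produce (the candidate splits of a tree node)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    partitions = set()
    for k in range(1, len(order)):
        # Only cut between distinct values; tied samples cannot be separated.
        if values[order[k - 1]] < values[order[k]]:
            partitions.add(frozenset(order[:k]))
    return partitions


staining = [2.0, 35.0, 35.0, 60.0, 95.0]        # toy, skewed marker scores
log_staining = [math.log(v) for v in staining]  # strictly increasing transform

same_splits = threshold_partitions(staining) == threshold_partitions(log_staining)
```

Because the candidate splits depend only on the ordering of the values, any strictly increasing transformation leaves them, and hence the tree-based dissimilarity, unchanged.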
For a technical description of the RF dissimilarity consult Breiman (1), Shi and Horvath (5)
and a technical report and R tutorial that can be downloaded from
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm.
Analysis Steps and Other Statistical Methods
Our analyses of the data involved the following 3 general steps:
A) use RF clustering to group the patients based only on their tumor marker
   expression profiles;
B) assess the differences between the resultant clusters in terms of their
   survival distributions and other clinico-pathological variables, such as
   stage, grade etc.;
C) examine the difference in tumor marker expression between the clusters.
The statistical methods used in the analyses are described below.
We used several methods for describing the clusters in terms of clinical variables and tumor
marker expressions. To test whether variables differed across groups, we used the
Kruskal-Wallis test, which is a non-parametric multi-group comparison test. To visualize the survival
distributions, we used Kaplan-Meier plots. Log-rank tests were used to test the difference
between survival distributions. All p-values were two sided and p < 0.05 was considered
significant. All statistical analyses were carried out with the freely available software R
(http://www.r-project.org/) (6). R code that implements RF clustering can be found at the
following webpage:
http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm.
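For readers who want to see what the two tests compute, the following is a minimal Python sketch of the Kruskal-Wallis H statistic (with tie correction) and the two-group log-rank chi-square statistic. It is an illustration under stated assumptions, not the R code used for the analyses: the function names and toy data are invented here, and the p-value helper `chi2_sf_df1` is valid only for one degree of freedom, i.e. for comparisons between exactly two groups.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        total += sum(ranks[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(c ** 3 - c for c in counts.values())
    return h / (1.0 - ties / (n ** 3 - n))


def logrank_chi2(times1, events1, times2, events2):
    """Two-group log-rank chi-square statistic (event = 1, censored = 0)."""
    data = [(t, e, 0) for t, e in zip(times1, events1)]
    data += [(t, e, 1) for t, e in zip(times2, events2)]
    o_minus_e, var = 0.0, 0.0
    for t in sorted({u for u, e, _ in data if e}):
        n = sum(1 for u, _, _ in data if u >= t)                   # at risk
        n1 = sum(1 for u, _, g in data if u >= t and g == 0)       # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e)             # events at t
        d1 = sum(1 for u, e, g in data if u == t and e and g == 0)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var


def chi2_sf_df1(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Toy data (not from the study).
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6]])
p_kw = chi2_sf_df1(h_stat)
chi2 = logrank_chi2([1, 2], [1, 1], [3, 4], [1, 1])
p_logrank = chi2_sf_df1(chi2)
```

With more than two groups the Kruskal-Wallis statistic is referred to a chi-square distribution with (number of groups minus 1) degrees of freedom, which this sketch does not cover.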
B. Comparing the results of the random forest dissimilarity with those of the Euclidean distance
Here we report the results of comparing the random forest dissimilarity to the
standard dissimilarity (Euclidean distance) when the latter is used as input of partitioning
around medoids (PAM) clustering (4). Instead of PAM clustering, which is a variant of k-means
clustering, one could also use hierarchical clustering to group samples on the basis of RF
dissimilarity. We find that our main message does not change when other standard clustering
procedures are used that take a dissimilarity measure as input.
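Because PAM takes only a dissimilarity matrix as input, the same routine can be run on either the RF dissimilarity or the Euclidean distance. The sketch below is a simplified Python illustration in the spirit of PAM (a greedy build step followed by first-improvement swaps), not the implementation used in this study; the function names and the one-dimensional toy data are assumptions.

```python
def total_cost(dissim, medoids):
    """Sum over samples of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dissim)


def pam(dissim, k):
    """Cluster n samples into k groups from an n x n dissimilarity matrix."""
    n = len(dissim)
    # Build step: greedily add the medoid that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: total_cost(dissim, medoids + [i]))
        medoids.append(best)
    # Swap step: exchange a medoid with a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dissim, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dissim, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Assign every sample to its nearest medoid.
    labels = [min(range(k), key=lambda c: dissim[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Toy example: absolute differences between one-dimensional points.
points = [0, 1, 2, 10, 11, 12]
dissim = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dissim, 2)
```

Swapping in a different dissimilarity matrix changes nothing else in the procedure, which is why the comparison in this section isolates the effect of the dissimilarity measure.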
The table below cross tabulates the patients based on their cluster membership predicted by
the 2 methods.
Table. Comparing the random forest clustering results with those of the Euclidean distance
for the 104 low Gleason score prostate tumors.

                                          Euclidean Distance clustering
Rand. Forest Dissimilarity clustering     Cluster 1     Cluster 2
Cluster 1                                 55            1
Cluster 2                                 8             40
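The counts in the table can be summarized numerically. The short Python sketch below computes the number of patients on which the two clusterings agree and, as an additional summary that is not part of the original analysis, the Rand index (the fraction of patient pairs that both clusterings treat the same way). The orientation of the table, with RF clusters as rows, follows the cell descriptions given in the text.

```python
from math import comb

# Counts from the table: rows = RF cluster 1, 2; columns = Euclidean cluster 1, 2.
table = [[55, 1],
         [8, 40]]

n = sum(sum(row) for row in table)
agree = table[0][0] + table[1][1]
percent_agreement = 100.0 * agree / n

# Rand index: pairs placed together by both methods plus pairs separated by both,
# divided by the total number of pairs.
same_both = sum(comb(c, 2) for row in table for c in row)
same_rf = sum(comb(sum(row), 2) for row in table)
same_euclid = sum(comb(sum(col), 2) for col in zip(*table))
total_pairs = comb(n, 2)
rand_index = (total_pairs + 2 * same_both - same_rf - same_euclid) / total_pairs
```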
Overall there is good agreement between the clustering results: the clusterings disagree on
only 9 (8 + 1) of the 104 patient samples. There is indirect empirical evidence that RF clustering is superior:
The figure shows Kaplan Meier plots that visualize the recurrence free time distributions for
the different patient samples defined by the cells of the above table. Specifically, the figure
shows the Kaplan Meier plots for
i) patients in RF cluster 1 and Euclid. cluster 1,
ii) patients in RF cluster 2 and Euclid. cluster 2,
iii) patients in RF cluster 2 and Euclid. cluster 1.
Since there was only one patient in RF cluster 1 and Euclidean cluster 2, the corresponding
Kaplan Meier plot was ignored.
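The curves in such a plot are Kaplan-Meier estimates of the probability of remaining recurrence free. A minimal Python sketch of the estimator is given below for illustration; the function name and the toy follow-up data are assumptions and do not come from the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up times (e.g. months to recurrence or last contact)
    events: 1 if recurrence was observed at that time, 0 if censored
    """
    curve = []
    surv = 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve


# Toy data: six patients, three recurrences and three censored observations.
times = [6, 12, 12, 30, 45, 60]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
```

Applying the function separately to each group of patients gives one step curve per group, which is what the figure displays.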
Figure. Kaplan Meier (KM) plots corresponding to different clusterings of the low Gleason
score prostate samples (x-axis: Time to Recurrence; y-axis: Prob. Recurrence Free; curves:
RF cluster 1 & Euclid. cluster 1, RF cluster 2 & Euclid. cluster 2, RF cluster 2 & Euclid.
cluster 1). The RF method clusters patients corresponding to the green and red
KM curves into one cluster. In contrast, the Euclidean distance PAM analysis clusters
patients corresponding to the green and black KM curves into one cluster. Clearly, the RF
dissimilarity is superior to the Euclidean distance in this particular analysis.
In our empirical comparison, RF clustering leads to superior performance on these
prostate tumor samples. We have listed several theoretical reasons why we expect that RF
clustering is a useful method for typically skewed tumor marker data (5). We have found
additional empirical evidence that RF clustering performs well with tumor marker data (7).
But there is no substitute for using additional real data to show that this method performs
well in practice. Future studies on additional real TMA data sets should aim to provide
empirical evidence that the RF dissimilarity is indeed worthwhile for TMA data.
Unfortunately, the TMA community does not (yet) offer a collection of benchmark data sets.
REFERENCES:
1. Breiman L: Random forests. Machine Learning 2001, 45:5-32
2. Venables WN, Ripley BD: Modern applied statistics with S-PLUS. New York,
Springer-Verlag, 1999, pp xi, 501
3. Cox TF, Cox MAA: Multidimensional scaling. Boca Raton, Chapman & Hall/CRC,
2001
4. Kaufman L, Rousseeuw PJ: Finding groups in data: an introduction to cluster
analysis. New York, Wiley, 1990, pp xiv, 342
5. Shi T, Horvath S (2005) Unsupervised Learning with Random Forest Predictors.
Journal of Computational and Graphical Statistics, in press.
6. Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal
of Computational and Graphical Statistics, 5: 299-314, 1996.
7. Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S (2005) Tumor Classification
by Tissue Microarray Profiling: Random Forest Clustering Applied to Renal Cell
Carcinoma. Mod Pathol. 2005 Apr;18(4):547-57