Prediction Analysis with Microarrays (PAM)
June 2008 Course "Analysis of microarray gene expression data"
Judith Boer and Lodewyk Wessels
PAM version used: Add-in for Excel 2.13
Classification by "nearest shrunken centroid"
The PAM software uses a modification of the nearest centroid method, which
computes a standardized centroid for each class in the training set. This is the average
gene expression for each gene in each class divided by the within-class standard
deviation for that gene. Nearest centroid classification takes the gene expression
profile of a new sample and compares it to each of these class centroids. The class
whose centroid is closest to the sample, in squared distance, is the predicted class for
that new sample.
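The nearest centroid rule can be sketched in a few lines of Python. This is a minimal illustration, not the PAM implementation: the function names, the toy data, and the per-gene standard deviations are assumptions made for the example, and the standardization is applied to the difference between sample and centroid, which is equivalent to standardizing both.

```python
from statistics import mean

def class_centroids(samples, labels):
    """Return {class: per-gene mean} for samples given as lists of gene values."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == cls]
        centroids[cls] = [mean(gene) for gene in zip(*members)]
    return centroids

def nearest_centroid(sample, centroids, gene_sd):
    """Predict the class whose centroid has the smallest standardized squared distance."""
    def dist(centroid):
        return sum(((x - c) / s) ** 2 for x, c, s in zip(sample, centroid, gene_sd))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

# Toy data (made-up numbers): 4 samples, 2 genes, 2 classes
train = [[1.0, 5.0], [1.2, 4.8], [3.0, 1.0], [3.2, 1.2]]
labels = ["A", "A", "B", "B"]
gene_sd = [1.0, 1.0]  # assumed within-class standard deviation per gene

centroids = class_centroids(train, labels)
predicted = nearest_centroid([2.9, 1.5], centroids, gene_sd)
print(predicted)
```

The new sample lies much closer to the centroid of class B than to that of class A, so it is assigned to B.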
Nearest shrunken centroid classification makes one important modification to the
standard method. It "shrinks" each of the class centroids toward the overall centroid
for all classes by an amount we call the threshold. This shrinkage consists of moving
the centroid (expressed as its standardized difference from the overall centroid)
towards zero by the threshold, setting it equal to zero if it would reach or cross
zero. For example, if the threshold were 2.0, a centroid of 3.2 would be shrunk to 1.2, a
centroid of -3.4 would be shrunk to -1.4, and a centroid of 1.2 would be shrunk to
zero. The amount of shrinkage is determined by cross-validation. After shrinking the
centroids, the new sample is classified by the usual nearest centroid rule, but using the
shrunken class centroids. The shrinkage has two advantages: (1) it can make the
classifier more accurate by reducing the effect of noisy genes; (2) it performs automatic
gene selection, keeping only the genes that characterize the classes.
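The shrinkage step is what is usually called soft thresholding. The sketch below reproduces the worked example from the text; the function name `shrink` is an assumption chosen for illustration. In the published method (Tibshirani et al., 2002) the quantity being shrunk is the standardized difference between a class centroid and the overall centroid, which is why "towards zero" and "towards the overall centroid" describe the same operation.

```python
def shrink(value, threshold):
    """Move value toward zero by threshold; return zero if it would reach or cross zero."""
    magnitude = abs(value) - threshold
    if magnitude <= 0:
        return 0.0
    return magnitude if value > 0 else -magnitude

# The three cases from the text, with threshold 2.0
examples = [3.2, -3.4, 1.2]
shrunk = [round(shrink(v, 2.0), 10) for v in examples]  # round away float noise
print(shrunk)
```

A gene whose value is shrunk to zero in every class no longer contributes to the classification, which is how the threshold performs gene selection.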
Example datasets
PAM comes with a couple of example data files, including cDNA microarray
expression data of small round blue cell tumors (SRBCT) of childhood (Khan et al.,
2001). This data set consists of Training Data and Test Data with normalized,
log-transformed expression measurements on 2,308 genes (after filtering). The tumors are
classified as Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), or
rhabdomyosarcoma (RMS). A total of 63 training samples and 25 test samples are
provided, although five of the latter were not SRBCTs. The aim is to correctly classify
the test samples to one of the four tumor classes defined in the training set.
The second data set consists of RMA-normalized Affymetrix HG-U133A expression data of
33 muscle samples from 2 different groups: 19 Duchenne muscular dystrophy patients
(DMD) and 14 controls. In principle, all Affymetrix probesets (22K) could be used;
here we use a filtered set of 1663 to shorten the calculation time. This data set
contains only a training set.
Obtaining PAM (for installation on your own computer)
PAM can be freely downloaded from http://www-stat.stanford.edu/~tibs/PAM/.
Please note that the Excel front-end is an addition to PAM for R. Therefore, new users
must download both the PAM for R package and the PAM for Excel package.
Loading a dataset
Start Excel. You will see that the speed buttons have been extended with two items:
PAM and PAM Controller. These will be used to start PAM and to steer the
computations and the production of tables and graphs.
Open khan.xls and start with the Khan Data worksheet, which contains the training data.
Select all data in the worksheet (click on cell A1, hold down the Ctrl and Shift
keys, press the right arrow key twice, then press the down arrow key twice).
Now click on the button PAM. A form appears, on which you can specify
details of the data. Do the following:
Put 2 in the field labeled Class Labels in selection row. This says that in
row 2 of the selected data the labels of the classes are written.
Put 1 in the field labeled Sample Labels in selection row. This says that the
labels of the arrays (which might be codes of the patients or biological samples)
are found in row 1.
Leave the field labeled Batch Labels in selection row empty. There are no
batch indicators in this data set.
Put 3 in the field labeled Expression Data starts in selection row. This says
that the expression values are found in row 3 and all rows below it.
Click on the button OK. PAM will start initial computations. After some time
another form will appear.
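The row settings above describe a simple layout: sample labels in row 1, class labels in row 2, expression values from row 3 onward. The sketch below reads such a layout from tab-separated text. It is only an illustration of the layout, not part of PAM: the function name, the toy values, and the assumption of two leading gene identifier columns are all hypothetical.

```python
import csv
import io

# Hypothetical tab-separated export of a worksheet.
# Assumption: the first two columns hold gene identifiers, not expression data.
text = (
    "\t\tS1\tS2\tS3\n"
    "\t\tBL\tEWS\tEWS\n"
    "G1\tgene one\t0.5\t-1.2\t0.3\n"
    "G2\tgene two\t1.1\t0.0\t-0.7\n"
)

def parse_layout(text, sample_row=1, class_row=2, data_start_row=3, id_columns=2):
    """Split the sheet into sample labels, class labels, gene ids and expression values."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[sample_row - 1][id_columns:]
    classes = rows[class_row - 1][id_columns:]
    data_rows = rows[data_start_row - 1:]
    genes = [r[0] for r in data_rows]
    expression = [[float(v) for v in r[id_columns:]] for r in data_rows]
    return samples, classes, genes, expression

samples, classes, genes, expression = parse_layout(text)
print(samples, classes, genes)
```

The default arguments mirror the values entered on the form (1, 2 and 3); a sheet with batch labels would need one more row setting.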
Controlling computations and output
The form named PAM Menu presents a number of buttons and a field to fill in
(which becomes visible later on). Initially only a few buttons are active. You can click
on Exit Menu to make the form disappear. You can always call it back by clicking the
speed button PAM Controller.
First click Train. This is always the first step and produces a classifier without a
threshold. You can judge its performance by clicking Plot Training Error. The plot
will appear on the worksheet PAM Plots. It shows how the classification error for the
training data (Y-axis) changes with the threshold (X-axis). As the threshold, which is
the amount of shrinkage, is increased, genes are eliminated from the class prediction.
The upper axis shows the number of active genes used in the classification.
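The relation between threshold and number of active genes can be sketched as follows. A gene is counted as active when its standardized centroid remains nonzero after shrinkage in at least one class, that is, when its absolute value exceeds the threshold. The function name and the score values are made up for illustration.

```python
def active_gene_count(class_scores, threshold):
    """Count genes whose standardized centroid exceeds the threshold in at least one class."""
    return sum(1 for gene in class_scores if any(abs(d) > threshold for d in gene))

# Hypothetical standardized centroids: one row per gene, one value per class
scores = [[3.2, -0.5], [-3.4, 1.0], [1.2, -1.1], [0.3, 0.2]]

counts = {t: active_gene_count(scores, t) for t in (0.0, 1.0, 2.0, 4.0)}
print(counts)
```

As the threshold grows, the count can only stay the same or decrease, which matches the upper axis of the training error plot.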
Move plots to new worksheet
PAM puts all plots on one sheet. This is not optimal for studying them. Create new
worksheets via the Excel menu Insert > Worksheet. Give each worksheet a logical
name, such as "Plot Training Error". Click on a plot with the right mouse button. One
option is Location, where you can specify that a plot should become a separate graph
sheet. This does not work for the plot of the centroids, which is produced by R and
copied onto the Excel worksheet. You can, however, copy it to the clipboard, paste it
onto a new worksheet and resize it there. Generally the gene labels are hard to read on
this plot, because they are too small or overlap each other, but they are in the same
order as on the list of genes. It is also possible to copy a graph to the clipboard and
paste it into a Word document.
Cross-validation and threshold selection
Cross-validation imitates how the classifier would work on new data. The available
data are split into ten parts. Nine of them are used to train the classifier and the
remaining one is used for “prediction”. This is done ten times (each of the ten parts is
used in turn for prediction) and the results are combined to compute the overall error.
You start the computations by clicking Cross validate on the Cross Validation tab of
the PAM Menu. The result can be presented as a graph by clicking Plot CV curves.
Perform cross-validation and choose a threshold that minimizes the misclassification
error. Enter this value in the field labeled Threshold and press Enter.
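The ten-fold scheme and the threshold choice can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the folds are not balanced by class, the error counts are made up, and breaking ties in favor of the larger threshold (fewer genes) is a design choice of this sketch rather than something PAM prescribes.

```python
import random

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds roughly equal parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def train_test_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, held_out) index lists; each fold is held out exactly once."""
    folds = ten_fold_indices(n_samples, n_folds, seed)
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, held_out

def best_threshold(cv_errors):
    """Pick the threshold with the fewest errors; ties go to the larger threshold."""
    return max(cv_errors, key=lambda t: (-cv_errors[t], t))

splits = list(train_test_splits(63))  # 63 = number of Khan training samples

# Made-up cross-validation error counts per threshold, for illustration only
chosen = best_threshold({0.0: 3, 2.0: 1, 4.0: 1, 6.0: 9})
print(len(splits), chosen)
```

Every sample is predicted exactly once, by a classifier that never saw it during training, so the combined error estimates performance on new data.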
Results for a chosen threshold
The results can be obtained by clicking Plot CV Probabilities. The plot shows for
each array (X-axis) the computed probability (Y-axis) that it belongs to each of the
possible classes. Try two different thresholds to see the effect on the probabilities.
Now go back to the Training tab of the PAM Menu. The shrunken centroids for each
class, per gene, are plotted when you click on Plot Centroids. The horizontal units
are log ratios of gene expression. A list of genes that “survive” the threshold can be
obtained by clicking List Gene Scores. The list appears in a separate worksheet.
You can also obtain a confusion matrix by clicking on the button Display Confusion
Matrix. This matrix is a cross-table showing how often each combination of actual class
and predicted class occurs. When all predictions are correct, nonzero counts appear only
on the diagonal. Do NOT click on the Estimate FDR button; this gives an error!
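A confusion matrix is straightforward to compute by hand. The sketch below uses the four SRBCT class names with made-up actual and predicted labels; the function name is an assumption for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["BL", "EWS", "NB", "RMS"]
# Made-up labels: one EWS sample is misclassified as RMS
actual = ["BL", "EWS", "EWS", "NB", "RMS", "RMS"]
predicted = ["BL", "EWS", "RMS", "NB", "RMS", "RMS"]

matrix = confusion_matrix(actual, predicted, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))
errors = len(actual) - correct

for name, row in zip(classes, matrix):
    print(name, row)
print("errors:", errors)
```

The single off-diagonal count in the EWS row corresponds to the one misclassified sample.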
Test set prediction
The Khan dataset contains a second set of arrays on the worksheet named Khan Test
Data. Take a look at this data set. We will try to predict the class of these new
samples, using the centroids found in the training set after cross-validation.
Go to the Test Set Prediction tab on the PAM Menu. Click on the Predict Test Set
button. A new form will appear, with fields to enter the rows in the Test data sheet
where Class, Sample, and Expression data are found. The entries are the same as in
the training set. Follow the instructions on the form to select the data columns (not the
gene id columns) in the Test Data sheet. Start the prediction by clicking on OK.
Plot Test Error will give you an error graph similar to the one for the training set. You will see
that the chosen threshold is good for the test data as well. Plot Test Probabilities and
Output Prediction Info will give you a visualization and table, respectively, for the
predicted classes and probabilities.
Questions
How many samples are in the training and test sets, respectively? How many genes?
What threshold did you choose for centroid shrinkage?
How many genes contributed to the classification at this threshold?
What were the prediction results on the training set? And on the test set?
Analysis of DMD dataset
Run PAM on the DMD data in DMD_PAM_filtered.xls (1663 probesets). This data
set contains only a training set, so you cannot do the Test set prediction part. Are the
groups well-predicted? Why do you think that is the case? Are the predictive genes
mostly over- or underexpressed in the DMD group?
References
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold,
F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. (2001).
Classification and diagnostic prediction of cancers using gene expression profiling
and artificial neural networks, Nat Med 7, 673-9.
Pescatori, M., Broccolini, A., Minetti, C., Bertini, E., Bruno, C., D'amico, A.,
Bernardini, C., Mirabella, M., Silvestri, G., Giglio, V., Modoni, A., Pedemonte, M.,
Tasca, G., Galluzzi, G., Mercuri, E., Tonali, P. A., Ricci, E. (2007). Gene expression
profiling in the early phases of DMD: a constant molecular signature characterizes
DMD muscle from early postnatal life throughout disease progression, FASEB J.
21(4), 1210-26.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple
cancer types by shrunken centroids of gene expression, Proc Natl Acad Sci U S A 99,
6567-72.