Matrix comparison writeup

Matrix Comparison Bootstrap Program
First off, I am no R programmer, so experienced R programmers will be pretty unimpressed. That said, it does appear to work for
appropriate data sets, but I present it as is, with no guarantees.
This program compares the genetic covariance matrices derived from two data sets. Our original interest was to compare the
covariance matrix for a stock population with the covariance matrix for a population that had diverged due to two generations of
brother-sister mating. Thus, the null hypothesis is that the two covariance matrices are identical. (Yes, this is a frequentist approach!)
I have implemented three separate sets of tests:
1) Modified Mantel/Bartlett/Rank tests
Goodnight, C. J. and J. M. Schwartz. 1997. A bootstrap comparison of genetic covariance matrices. Biometrics 53:1026-1039.
This set of tests recognizes three ways in which two matrices can differ:
Rank: Basically the number of traits with a non-zero genetic variance. There is the slight complication that occasionally one trait is a
linear combination of two or more other traits; such a linear-combination trait is not counted in the rank. This is measured by D, the
difference in rank. It is not calculated by this program, but it is easily calculated from the compiled data.
Size: The hypervolume that is enclosed by the matrix as measured by the determinant. In the univariate case this is simply the
variance; for the multivariate case it is the total variance corrected for covariances among traits. This is measured using a signed
Bartlett’s test. The signed Bartlett’s test compares the size of two matrices A and B. If they have the same size it returns zero. If A is
larger than B it returns a positive number, and if B is larger than A it returns a negative number. Depending on the nature of your
hypothesis you can use the signed value (for a one-tailed test) or the absolute value (for a two-tailed test).
Shape: Two matrices have the same shape if the relative magnitudes of the variances and the covariances are the same for all traits
and pairs of traits. This is measured by the modified Mantel’s test. This test is modified (1) to make it appropriate for covariance
matrices, and (2) to correct the null hypothesis to be that the two matrices have identical shapes.
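The raw quantities behind the rank and size comparisons can be illustrated with a few lines of R. This is only a sketch of the underlying ideas, not the modified statistics of Goodnight and Schwartz (1997): the matrices G_a and G_b are made-up examples, the sign convention for D is an assumption, and the signed log ratio of determinants is a stand-in that merely shows the sign behaviour described above.

```r
# Hypothetical 3-trait G matrices; the third trait of G_b has zero genetic variance.
G_a <- matrix(c(2.0, 0.5, 0.2,
                0.5, 1.0, 0.1,
                0.2, 0.1, 0.5), nrow = 3, byrow = TRUE)
G_b <- matrix(c(1.0, 0.3, 0,
                0.3, 0.8, 0,
                0,   0,   0), nrow = 3, byrow = TRUE)

# Rank: number of linearly independent traits with non-zero genetic variance.
rank_a <- qr(G_a)$rank
rank_b <- qr(G_b)$rank
D <- rank_a - rank_b          # difference in rank (sign convention assumed)

# Size: the determinant, compared here on the traits present in both matrices.
shared <- which(diag(G_a) > 0 & diag(G_b) > 0)
size_a <- det(G_a[shared, shared, drop = FALSE])
size_b <- det(G_b[shared, shared, drop = FALSE])

# Stand-in for a signed size comparison: zero if equal, positive if A is larger,
# negative if B is larger. This is NOT the published signed Bartlett statistic.
signed_size <- log(size_a / size_b)

print(c(rank_a = rank_a, rank_b = rank_b, D = D, signed_size = signed_size))
```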
2) Random Skewers
Cheverud, J. M. 1996. Quantitative genetic analysis of cranial morphology in the cotton-top (Saguinus oedipus) and saddle-back (S.
fuscicollis) tamarins. J. Evol. Biol. 9:5–42.
Revell, L. J. 2007. The G matrix under fluctuating correlational mutation and selection. Evolution 61:1857–1872.
This uses the method of Cheverud (1996) modified by Revell (2007), but corrected for the null hypothesis as outlined in Calsbeek and
Goodnight (2009).
This test compares two G matrices by generating a large number of random unit vectors (the number is programmable). Each vector is
separately multiplied by the two G matrices and the correlation between the two resulting vectors is calculated. If the two matrices are
identical then the vector correlation will be 1. Matrices that are not identical will have correlations that are less than one, and may be
negative. The average correlation for the large number of random vectors is reported.
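The basic random skewers calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the program: the function name random_skewers is hypothetical, drawing the skewers from a normal distribution is an assumption, and the vector correlation is taken to be the cosine of the angle between the two response vectors. The null-hypothesis correction of Calsbeek and Goodnight (2009) is not included.

```r
# Mean vector correlation between the responses of two G matrices
# to the same set of random unit-length skewers.
random_skewers <- function(G1, G2, n_skewers = 1000) {
  n_traits <- nrow(G1)
  # One random skewer per column, scaled to unit length.
  skewers <- matrix(rnorm(n_traits * n_skewers), nrow = n_traits)
  skewers <- sweep(skewers, 2, sqrt(colSums(skewers^2)), "/")
  # Multiply every skewer by each G matrix.
  r1 <- G1 %*% skewers
  r2 <- G2 %*% skewers
  # Vector correlation (cosine of the angle) for each pair of response vectors.
  cors <- colSums(r1 * r2) / sqrt(colSums(r1^2) * colSums(r2^2))
  mean(cors)
}

set.seed(42)
G1 <- matrix(c(2.0, 0.5,
               0.5, 1.0), nrow = 2, byrow = TRUE)
G2 <- matrix(c(1.0, -0.8,
               -0.8, 1.0), nrow = 2, byrow = TRUE)

same_matrix <- random_skewers(G1, G1, n_skewers = 200)   # identical matrices
diff_matrix <- random_skewers(G1, G2, n_skewers = 200)   # different matrices
print(c(same = same_matrix, different = diff_matrix))
```

Identical matrices give a mean correlation of 1 (up to rounding), while the two different matrices give something smaller.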
We perform the random skewers test in different ways:
rawRSkewers are random skewers on the original G matrices, using all of the variables, whether or not they are present in both
matrices (if a trait has zero variance it will not be present in the matrix). In many cases this will be an appropriate statistic; however,
it has two shortcomings to be aware of: (1) if a trait is not present in one matrix this will drastically lower the vector correlations, and
(2) because there is no standardization of the variables the results of this test will change if the scale of measurement is changed.
Also, the correlations will be dominated by the numerically largest variables.
rawStdRSkewers are random skewers done on the original G matrices, but with the variables being divided by the genetic standard
deviation for each trait. This will remove the effects of scale of measurement.
subsetRSkewers are random skewers done on the original G matrices, but only using the variables that are present (have
a non-zero variance) in both matrices. This removes the first potential difficulty since no variables are considered unless
they are present in both matrices. However, this statistic is sensitive to changes in scale, and will be dominated by the
numerically largest variables.
subsetStdRSkewers are random skewers done on the original G matrices, but only using the variables that are present
(have a non-zero variance) in both matrices, and with the variables divided by the genetic standard deviation for each
trait. This is the most modified of the statistics, but is free of both the issue of comparing matrices with different
variables, and the problems of scale and domination by the numerically largest variables.
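The two modifications that distinguish these four statistics, subsetting to shared traits and standardizing by the genetic standard deviation, might look like this in R. The helper names common_traits and standardize_G are hypothetical, and scaling each matrix by its own genetic standard deviations is an assumption about how the standardization is done.

```r
# Indices of traits with non-zero genetic variance in both matrices.
common_traits <- function(G1, G2) {
  which(diag(G1) > 0 & diag(G2) > 0)
}

# Divide each variable by its genetic standard deviation
# (assumes every trait in G has non-zero variance).
standardize_G <- function(G) {
  s <- sqrt(diag(G))
  G / outer(s, s)
}

# Made-up example: trait 3 has no genetic variance in G2.
G1 <- matrix(c(4.0, 1.0, 0.5,
               1.0, 2.0, 0.3,
               0.5, 0.3, 1.0), nrow = 3, byrow = TRUE)
G2 <- matrix(c(3.0, 0.6, 0,
               0.6, 1.5, 0,
               0,   0,   0), nrow = 3, byrow = TRUE)

keep <- common_traits(G1, G2)
G1_subset_std <- standardize_G(G1[keep, keep, drop = FALSE])
G2_subset_std <- standardize_G(G2[keep, keep, drop = FALSE])

# Changing the measurement scale of trait 1 (e.g. grams to decigrams)
# changes G1 but not its standardized form.
rescale <- diag(c(10, 1, 1))
G1_rescaled <- rescale %*% G1 %*% rescale
print(all.equal(standardize_G(G1_rescaled), standardize_G(G1)))
```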
My general recommendations: I recommend using standardized skewers. Unless your traits are all similar, measured on the same
scale, and of approximately the same magnitude, the untransformed skewers will give results that are very difficult to interpret and
that will be dominated by one or two variables. The choice between the “raw” and “subset” skewers depends on the biological question
you are interested in. For questions about whether two populations evolve differently, allowing for complete loss of variance for some
traits, the “raw” skewers are probably more appropriate. If you want to know whether or not two matrices are the same where they are
comparable, the “subset” skewers might be more appropriate.
3) Selection Skewers
Calsbeek, B. and C. J. Goodnight. 2009. Empirical comparison of G matrix test statistics: finding biologically relevant change. Evolution
63:2627-2635.
This method compares two matrices by comparing the multivariate response to specific selection pressures. It is particularly
appropriate when there is an a priori selection pressure that is of interest.
Selection “skewers” are specified in the program. Each selection skewer is a set of weightings for the strength of selection on each
trait. These weightings are used to create an index, and the data set is ranked by this index. The proportion of the population
corresponding to the “SelSkewSelectIntensity” is used to calculate the S selection vector. This is multiplied by GP^-1 (the G matrix
times the inverse of the P matrix) for both matrices. These “R” vectors are standardized by the phenotypic standard deviation, and the
correlation between the two R vectors is calculated.
Note that this method takes into account not only differences in the G matrix, but also differences in the P matrix to calculate the
resulting correlation.
These results are in a separate variable labeled “SelectionSkewers”
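The steps of the selection skewer calculation can be sketched in R as below. This is an illustration under several assumptions, not the program's own code: the function names are hypothetical, the individuals with the highest index values are taken to be the ones selected, the "correlation" is computed as a vector correlation, and the phenotypes and G matrices are simulated stand-ins.

```r
# Standardized response R = G P^-1 S for one population and one selection skewer.
selection_response <- function(pheno, G, weights, prop_selected) {
  X <- as.matrix(pheno)
  index <- as.vector(X %*% weights)               # selection index per individual
  n_keep <- max(1, round(prop_selected * nrow(X)))
  selected <- order(index, decreasing = TRUE)[seq_len(n_keep)]
  # S: mean of the selected individuals minus the population mean.
  S <- colMeans(X[selected, , drop = FALSE]) - colMeans(X)
  P <- cov(X)                                     # phenotypic covariance matrix
  R <- as.vector(G %*% solve(P, S))               # G P^-1 S
  R / sqrt(diag(P))                               # standardize by phenotypic SD
}

vector_cor <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

set.seed(1)
pheno_a <- matrix(rnorm(300), ncol = 3)           # simulated phenotypes, population A
pheno_b <- matrix(rnorm(300), ncol = 3)           # simulated phenotypes, population B
G_a <- 0.5 * cov(pheno_a)                         # made-up G matrices
G_b <- diag(c(0.6, 0.1, 0.3))

skewer <- c(0, 0, 1)                              # select on the third trait only
R_a <- selection_response(pheno_a, G_a, skewer, prop_selected = 0.5)
R_b <- selection_response(pheno_b, G_b, skewer, prop_selected = 0.5)
skewer_cor <- vector_cor(R_a, R_b)
print(skewer_cor)
```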
In the statistics output I have, for completeness, included statistics on both the complete bootstrap data set and the subset in which
the two bootstrap covariance matrices have at least two variables in common. In general I would recommend using the complete data
rather than the subset. The subset is identified by the “NoO” attached to the statistic name.
Some General Comments on the Program
This program generates bootstrap data sets in which the null hypothesis, that there is no true difference between the two data sets, is
true. It does this by using a single specified data set to generate bootstrap data sets that have the same structure as the original data
sets, but with the actual numbers coming from the specified data set.
The program produces a set of bootstrap statistics for which the null hypothesis is true. To use this for statistical
testing you need to calculate the statistics on the actual data (I leave that to you, but most of the formulae are already in this program),
and compare the result from the actual data to the results from the bootstrap data sets. If the actual data set is more extreme than 95%
(or whatever level you choose) of the bootstrap data samples then the result should be considered significant.
The important point is that this program generates the bootstrap data sets for which the null hypothesis is true. It DOES NOT analyze
the original data set. I hope to have that program up and running one day, but for the moment I leave that to you.
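The comparison of an observed statistic with the bootstrap null distribution can be sketched as follows. The function name and the numbers are made up for illustration, and reading the output columns as simple proportions of bootstrap values at or below (or at or above) the observed value is an assumption.

```r
# Proportion of bootstrap statistics at or below, and at or above, the observed value.
boot_tail_props <- function(observed, boot_stats) {
  c(LessThanOrEqual    = mean(boot_stats <= observed),
    GreaterThanOrEqual = mean(boot_stats >= observed))
}

# Hypothetical null distribution of a skewer correlation from 10 bootstraps,
# and a hypothetical value calculated from the actual data.
boot_stats <- c(0.91, 0.85, 0.97, 0.88, 0.93, 0.79, 0.95, 0.90, 0.86, 0.92)
observed <- 0.80

props <- boot_tail_props(observed, boot_stats)
print(props)
```

Here only one of the ten bootstrap values is as low as the observed correlation, so the lower-tail proportion is 0.1. With a realistic number of bootstraps, a lower-tail proportion of 0.05 or less would indicate that the observed matrices are less similar than expected under the null hypothesis.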
Some notes to familiarize you with these statistics: If the “JointRank” is 1 then the subset skewers will necessarily have a
correlation of 1, and the modified Mantel will have a correlation of 0. You should take that into account in your
comparisons!
Running the program
The program can be run by cutting and pasting the entire file into R (as I said, I am no R programmer!). You will need to change
some of the lines to reflect the data sets you want to analyze, and where they are stored on your computer. The commands that need to
be changed are near the bottom of the file and are set off by comments:
########################################################
#
#   Load the correct data sets and function names.
#
#   This will change from run to run depending on needs
#
########################################################
#
#### load the right data sets
stockbal <-read.table("/Research/R Trib Bootstrap/balanced stock females.txt")
stockbal <- data.frame(stockbal, row.names = 1:length(stockbal[,1]))
stock <-read.table("/Research/R Trib Bootstrap/stock data female.txt")
stock <- data.frame(stock, row.names = 1:length(stock[,1]))
Pop3 <-read.table("/Research/R Trib Bootstrap/population 3 females.txt")
Pop3 <- data.frame(Pop3, row.names = 1:length(Pop3 [,1]))
#### various lists that will be used by the functions.
FactorNames <- c("Sire", "Dam")
StockFactorNames <- c("Sire", "Dam")
Pop3FactorNames <- c("Sire")
Traits <- c("Pupal_Mass", "Dev_Time", "Dry_Mass", "Rel_Fit")
### in the following the list must be of length (number of traits X Number of skewers)
### any other result will give an error.
Number_of_Skewers <- 5 #This is the number of selection skewers to be examined
Skewers <- c( 1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
0, 0, 0, 1,
0, 0, 1, 1) # These are the selection skewers
#Parameters to adjust
# adjust the following for different mating systems. For sires in a half sib design VA = 4*var(half sibs)
# for full sibs adjust VA = 2 * var(full sibs) Adjust as appropriate for your system
VA_multiplier <- 4
#adjust this as appropriate. Typically it will be 4 since var(fullsibs)-var(halfsibs) = 1/4 VD
VD_multiplier <- 4
# This is the strength of selection on the selection skewers. This should be the proportion selected
# It is a number between 0 and 1, with the smaller the number the stronger the selection.
SelSkewSelectIntensity <- 0.5
#These parameters change the number of bootstraps, and number of skewers in the random skewers
NumberOfBootStraps <- 10 # the number of boot strap samples generated
SkewerNumber <- 1000 # the number of random skewers generated
You will need to change the file names to the correct names, and also change the factor names, trait names, and selection skewers as
appropriate. Spelling and capitalization count here! Factor names and Traits must be the exact names that head the data columns.
For the selection skewers this example has five skewers with an entry for each of the four traits. Thus each column is a weighting, and
each row is a separate skewer.
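To check the layout of the skewers before a run, the flat list can be reshaped into one row per skewer. This is just a checking aid; SkewerMatrix is a hypothetical name and the program itself may reshape the list differently.

```r
Traits <- c("Pupal_Mass", "Dev_Time", "Dry_Mass", "Rel_Fit")
Number_of_Skewers <- 5
Skewers <- c(1, 0, 0, 0,
             0, 1, 0, 0,
             0, 0, 1, 0,
             0, 0, 0, 1,
             0, 0, 1, 1)

# The list must have exactly (number of traits) x (number of skewers) entries.
stopifnot(length(Skewers) == length(Traits) * Number_of_Skewers)

# One row per skewer, one column per trait.
SkewerMatrix <- matrix(Skewers, nrow = Number_of_Skewers, byrow = TRUE,
                       dimnames = list(paste0("Skewer", seq_len(Number_of_Skewers)), Traits))
print(SkewerMatrix)
```

The fifth skewer in this example selects equally on Dry_Mass and Rel_Fit and ignores the other two traits.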
Pay particular attention to the VA and VD multipliers. These will change depending on the breeding design. For a standard half sib
design the VA multiplier is 4 (VA = 4 * variance among half sibs), and the VD multiplier is 4 (VD = 4 *(Var full sibs-Var half sibs)),
but this can change for different breeding designs.
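As a worked example of how the multipliers are applied, with made-up variance components for a single trait (the variable names other than VA_multiplier and VD_multiplier are hypothetical):

```r
VA_multiplier <- 4
VD_multiplier <- 4

# Hypothetical variance components for one trait.
var_half_sibs <- 0.5
var_full_sibs <- 0.8

VA <- VA_multiplier * var_half_sibs                     # 4 * 0.5
VD <- VD_multiplier * (var_full_sibs - var_half_sibs)   # 4 * (0.8 - 0.5)
print(c(VA = VA, VD = VD))
```

With these numbers VA is 2 and VD is 1.2. A design with a different relationship between relatives would need different multipliers.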
You should also change the NumberOfBootStraps to a small number (I use 3) until you are satisfied that the program is working. An
actual bootstrap analysis should be done on at least 100, and more likely 1000 bootstraps.
After everything is adjusted and ready to go the program can be run by running the command:
RunMatrixAnalysis(stockbal, stock, Pop3, FactorNames, StockFactorNames, Pop3FactorNames, Traits)
You will probably have changed the parameters enclosed in parentheses to reflect your data.
The program will eventually print out the statistical analyses that will look something like this:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] "General Statistics"
[1] "below are the general statistics and the general statistics using the subset with joint matrices of at least size 2"
[1] "Subset stats have NoO attached to them"
[1] "Number of bootstraps equals"
[1] 10
[1] "Number of bootstraps with joint matrix size of at least 2 is"
[1] 4
[1] "the dataframe name for this is StatOut"
[1] " "
           StatName     Statistic LessThanOrEqual GreaterThanOrEqual MoreExtreme LessThanOrEqualNoO GreaterThanOrEqualNoO MoreExtremeNoO
1                 D  1.000000e+00             1.0                0.3         0.5               1.00                  1.00           1.00
2          Bartlett -1.102106e+02             0.3                0.7         0.3               0.50                  0.50           0.50
3            Mantel -6.633548e-01             0.0                1.0         0.2               0.00                  1.00           0.50
4       rawRSkewers  1.195078e-03             0.5                0.5         0.5               0.25                  0.75           0.75
5    rawStdRSkewers  8.868861e-02             0.7                0.3         0.5               0.75                  0.25           0.75
6    subsetRSkewers  8.796094e-01             0.0                1.0         1.0               0.00                  1.00           1.00
7 subsetStdRSkewers  6.387489e-01             0.4                0.6         0.6               1.00                  0.00           0.00
[1] "below are the selection skewers and the selection skewers using the subset with joint matrices of at least size 2"
[1] "Subset skewer stats have NoO attached to them"
[1] "Number of bootstraps equals"
[1] 10
[1] "Number of bootstraps with joint matrix size of at least 2 is"
[1] 4
[1] "the dataframe name for this output is SelSkewOut"
[1] " "
  LessThanOrEqual GreaterThanOrEqual MoreExtreme LessThanOrEqualNoO GreaterThanOrEqualNoO MoreExtremeNoO
1             0.9                0.1         0.1               1.00                  0.00           0.00
2             0.4                0.6         0.6               0.50                  0.50           0.50
3             0.8                0.2         0.2               0.75                  0.25           0.25
4             0.1                0.9         0.6               0.25                  0.75           0.75
5             0.8                0.2         0.2               0.75                  0.25           0.25
These results can be used directly, or you can access the original data as follows:
AnalysisResults: The analysis of the original data for D, Bartlett, Mantel, and the random skewers
AnalysisSelectionSkewers: The analysis of the original data for the selection skewers
BootResults: The analysis of the bootstrap data sets for D, Bartlett, Mantel, and the random skewers
SelectionSkewers: The analysis of the bootstrap data for selection skewers.
StatOut: The statistical analysis. For easier to read output try “format(StatOut, justify=c("left"), scientific=FALSE)”
SelSkewOut: The statistical analysis of the selection skewers.
One hint: If you want to do a lot of bootstrap samples, it might be a good idea to do a subset of the bootstraps (say 1000 at a time) and
save them as text files, merge them and then do the analysis on the merged data set. Otherwise you could tie up your computer for an
extended time, and be very vulnerable to computer glitches.
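The save-and-merge approach might look something like the sketch below. The file names are hypothetical, and the small random data frames only stand in for the bootstrap results of each batch (for example the contents of BootResults after a run); the column layout of the real results may differ.

```r
batch_dir <- tempdir()

# Stand-in for three separate runs: in practice each batch would be the
# bootstrap results from one run of the program, saved before starting the next.
for (i in 1:3) {
  batch <- data.frame(Bartlett = rnorm(4), Mantel = runif(4))
  write.table(batch,
              file.path(batch_dir, sprintf("boot_batch_%02d.txt", i)),
              row.names = FALSE)
}

# Read every saved batch back in and stack them into one data set.
files <- list.files(batch_dir, pattern = "^boot_batch_[0-9]+\\.txt$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, read.table, header = TRUE))
print(dim(merged))
```

The statistical analysis would then be done once on the merged data set.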