User Guide to Statistical Analyses
Most of the statistical analyses are conducted using SAS v.8.02 (SAS Institute, Inc. Cary NC
27513, http://www.sas.com/) programs. Analyses include:
(1) The Kolmogorov-Smirnov two-sample test, to determine whether the distributions of occurrence of clones (1, 2, 3, etc., times) within fingerprint groups differ between treatments. Use our Excel spreadsheet for this test.
(2) One-way ANOVAs on the total number of clones found per treatment, or on fingerprint richness, diversity, or evenness in each treatment.
(3) Cluster analysis, to group treatments into similar groups based on the pattern of clones per fingerprint cluster.
(4) Stepwise Discriminant analysis, to identify fingerprint clusters in which the distribution of clones per treatment varies.
(5) Stepwise Regression analysis, to identify fingerprint clusters in which the distribution of clones per treatment varies with some measurement that varies between treatments, for example, the ability of soils in different treatments to suppress a plant parasitic nematode.
(1) Kolmogorov-Smirnov two-sample test: We use the Kolmogorov-Smirnov two-sample test to see whether the distributions of occurrence of clones (1, 2, 3, etc., times in a group) differ between treatments. The first step in the analysis is to compile the frequency distributions for the clones in the treatments to be tested:
Open Microsoft Excel.
Open the Taxonomic Table data that you saved when you computed it using GCPAT.
Now you need to compute the frequency with which clones in each treatment occur in groups ranging from a size of 1 to a size of 10 or more. Your treatment data will be organized into columns, with one column per treatment. Use the Excel FREQUENCY function to obtain the clone frequencies for each treatment. In the example below, assume that columns C and D contain treatment data and that the rows containing this data run from x to y. When entering this formula in Excel you will have to type in actual row numbers instead of x, y, a and k.
        Column A    Column B   Column C (treatment 1 column)   Column D (treatment 2 column)
Row a   Frequency   0          =FREQUENCY(Cx:Cy,Ba:Bk)         =FREQUENCY(Dx:Dy,Ba:Bk)
Row b               1
Row c               2
Row d               3
Row e               4
Row f               5
Row g               6
Row h               7
Row i               8
Row j               9
Row k               10
When you have more than one column for one treatment (i.e., replicates within a treatment) and want to compare the distributions of each treatment as a whole, compile the frequency distributions for each replicate, then sum these to get the frequency distribution for the whole treatment. These summed frequency distributions can then be compared.
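The tallying and replicate-summing steps above can also be sketched outside Excel. The following Python illustration is not part of the GCPAT or Excel workflow: the function names and the sample clone counts are hypothetical, and group sizes of 10 or more are assumed to share one final bin (Excel's FREQUENCY instead reports values above the last bin in a separate overflow cell).

```python
def size_frequencies(clones_per_group, max_bin=10):
    """Count how many groups hold 0, 1, ..., max_bin-or-more clones."""
    freq = [0] * (max_bin + 1)
    for n in clones_per_group:
        freq[min(n, max_bin)] += 1  # sizes >= max_bin share the last bin
    return freq


def sum_replicates(replicate_freqs):
    """Add replicate frequency distributions bin by bin."""
    return [sum(bins) for bins in zip(*replicate_freqs)]


# Hypothetical clone counts per fingerprint group for two replicates.
rep_a = [0, 1, 1, 2, 12]
rep_b = [1, 0, 3, 1, 10]

freq_a = size_frequencies(rep_a)
freq_b = size_frequencies(rep_b)
treatment_freq = sum_replicates([freq_a, freq_b])
print(treatment_freq)
```

Index 0 of each list is the count of groups with no clones from that replicate, which is the entry to omit before running the K-S test.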
Once you have the frequency distributions you’d like to compare, open the KS Test spreadsheet in Excel.
Enter frequency data (omitting 0 frequency data) for one treatment into the appropriate places in columns B and C:
Enter frequency data here
Frequency of Occurrence   101a   101b
1                         676    539
2                         21     10
3                         4      2
4                         1      2
5                         0      0
6                         0      0
7                         0      0
8                         0      0
9                         1      0
10 or more                0      0
Total                     703    553
The maximum difference in the cumulative distributions (D) is then calculated and compared with the critical K-S value (both are calculated automatically and displayed in column H). We reject H0, the hypothesis that the two populations have the same distribution, if D is greater than the appropriate critical value.
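As a cross-check on the spreadsheet, the calculation of D can be sketched in Python using the 101a and 101b frequencies from the table above. The function name is hypothetical, and the critical value here assumes the common large-sample approximation with a coefficient of 1.36 for alpha = 0.05; the spreadsheet may compute its critical value differently.

```python
import math


def ks_two_sample(freq1, freq2, c_alpha=1.36):
    """Return (D, critical value) for two binned frequency distributions.

    c_alpha=1.36 is the usual large-sample coefficient for alpha = 0.05
    (an assumption; the KS Test spreadsheet may use another method).
    """
    n1, n2 = sum(freq1), sum(freq2)
    cum1 = cum2 = 0
    d = 0.0
    for f1, f2 in zip(freq1, freq2):
        cum1 += f1
        cum2 += f2
        # Largest gap between the two cumulative proportions so far.
        d = max(d, abs(cum1 / n1 - cum2 / n2))
    critical = c_alpha * math.sqrt((n1 + n2) / (n1 * n2))
    return d, critical


# Frequencies for group sizes 1..9 and "10 or more", from the table above.
freq_101a = [676, 21, 4, 1, 0, 0, 0, 0, 1, 0]
freq_101b = [539, 10, 2, 2, 0, 0, 0, 0, 0, 0]

d, critical = ks_two_sample(freq_101a, freq_101b)
print(f"D = {d:.4f}, critical value = {critical:.4f}, reject H0: {d > critical}")
```

For these data D is about 0.013 and the approximate critical value is about 0.077, so under this approximation H0 would not be rejected.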
(2) One-way ANOVAs: We use this test to see if there are differences between treatments. In order to compare treatments, we need multiple observations for each treatment. When you design your experiment, try to accommodate treatment replications, or you may not be able to compare parameter values for each treatment.
Use SAS to perform this analysis. Below is a sample SAS program that you can rewrite to accommodate your needs.
options ls=80 ps=55 nocenter nodate;
/* Set up temporary SAS data set called onewaybacteria */
data onewaybacteria;
/* Two variables to be input, Treatment and bacteria count */
/* The "$" indicates that Treatment is a text variable */
/* baccount is the total number of different bacterial clones found for each individual replicate */
input treatment $ baccount;
title1 'Oneway ANOVA Example';
title2 'Bacteria Count';
/* datalines statement to indicate data is about to begin */
datalines;
101 644
101 502
101 495
zhou 610
zhou 525
zhou 560
mbbb 539
mbbb 513
mbbb 577
mbv 549
mbv 587
mbv 474
;
run;
proc glm;
/* This analysis has balanced data, but proc glm was used in case there are unequal replicates on future data */
class treatment;
model baccount=treatment;
/* The MEANS statement with the TUKEY option generates the Tukey pairwise comparisons for the cells. This test protects against inflation of the type I error rate due to multiple t-tests and uses a constant error term for the analysis. */
means treatment / tukey cldiff;
run;
This one-way ANOVA example will test for differences in total number of bacteria clones identified between the four treatments. The Tukey means separation test is then applied to identify pairwise differences between the individual treatments. The output from this analysis is presented below.
Oneway ANOVA Example
Bacteria Count
The GLM Procedure
Dependent Variable: baccount
                                  Sum of
Source             DF        Squares    Mean Square   F Value   Pr > F
Model               3     1330.25000      443.41667      0.13   0.9370
Error               8    26472.66667     3309.08333
Corrected Total    11    27802.91667
R-Square   Coeff Var   Root MSE   baccount Mean
0.047846   10.49879    57.52463   547.9167
Source       DF      Type I SS   Mean Square   F Value   Pr > F
treatment     3    1330.250000    443.416667      0.13   0.9370
Source       DF    Type III SS   Mean Square   F Value   Pr > F
treatment     3    1330.250000    443.416667      0.13   0.9370
The GLM Procedure
Tukey's Studentized Range (HSD) Test for baccount
NOTE: This test controls the Type I experimentwise error rate.
Alpha 0.05
Error Degrees of Freedom 8
Error Mean Square 3309.083
Critical Value of Studentized Range 4.52880
Minimum Significant Difference 150.41
Comparisons significant at the 0.05 level are indicated by ***.
               Difference     Simultaneous
treatment         Between    95% Confidence
Comparison          Means        Limits
zhou - 101          18.00    -132.41   168.41
zhou - mbbb         22.00    -128.41   172.41
zhou - mbv          28.33    -122.08   178.74
101 - zhou         -18.00    -168.41   132.41
101 - mbbb           4.00    -146.41   154.41
101 - mbv           10.33    -140.08   160.74
mbbb - zhou        -22.00    -172.41   128.41
mbbb - 101          -4.00    -154.41   146.41
mbbb - mbv           6.33    -144.08   156.74
mbv - zhou         -28.33    -178.74   122.08
mbv - 101          -10.33    -160.74   140.08
mbv - mbbb          -6.33    -156.74   144.08
In this example, the ANOVA found no significant differences in the total number of bacteria clones among the four treatments (F = 0.13, p = 0.9370), and the pairwise comparisons found no significant differences either. Typically the pairwise comparisons would not be used if the overall F-test was not significant.
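To see where the sums of squares and the F value in the GLM output come from, the arithmetic can be sketched in Python with the same twelve observations. This is only an illustration of the one-way ANOVA calculation, not a replacement for proc glm; the function name is hypothetical, and the p-value and Tukey comparisons are not computed here.

```python
def one_way_anova(groups):
    """Return (SS model, SS error, F) for a dict of treatment -> observations."""
    values = [v for obs in groups.values() for v in obs]
    grand_mean = sum(values) / len(values)
    # Between-treatment variation: treatment means around the grand mean.
    ss_model = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in groups.values()
    )
    # Within-treatment variation: observations around their treatment mean.
    ss_error = sum(
        (v - sum(obs) / len(obs)) ** 2 for obs in groups.values() for v in obs
    )
    df_model = len(groups) - 1
    df_error = len(values) - len(groups)
    f_value = (ss_model / df_model) / (ss_error / df_error)
    return ss_model, ss_error, f_value


# Data from the datalines section of the sample SAS program.
baccount = {
    "101": [644, 502, 495],
    "zhou": [610, 525, 560],
    "mbbb": [539, 513, 577],
    "mbv": [549, 587, 474],
}

ss_model, ss_error, f_value = one_way_anova(baccount)
print(f"SS model = {ss_model:.2f}, SS error = {ss_error:.2f}, F = {f_value:.2f}")
```

The values agree with the Model and Error rows of the SAS output (1330.25, 26472.67, and F = 0.13).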
(3) Cluster analysis: This method was run to group the treatments into similar groups. The problem in identification of similarity between clone library treatments is that the majority of all identified clones appear in groups containing a single clone, or only a few clones. For this reason the cluster analysis uses only data from clones appearing in larger groups for clustering the treatments, but clear guidelines for what should be considered a large enough group have not been identified as yet. For now, we shall consider groups containing 5 or more clones to be large enough.
First step: Obtain data for fingerprint groups containing 5 or more clones.
Open Excel, and open the Taxonomic Table file containing data for all of your clones.
Save the Table file under a different name. Delete all columns but those describing the Group Number (Column A), the Number of Clones per Group (Column B), and the number of clones per treatment in each group (treatment data columns).
Highlight all the rows and columns containing data for the fingerprint groups.
Data -> Sort; Sort by: Number of Clones per Group.
You should get output that sorts the fingerprints into rows based on how many clones there are in each group. Delete the rows that contain data for fingerprints containing 4 or fewer clones. Save your data, and be sure to use a file name that is different from the Taxonomic Table file or you will lose most of your data and have to re-run the Taxonomic Table function in GCPAT to get it back.
Second step: format the data for use with the SAS cluster analysis program. For this program the data needs to be in the following format:
Group1  Group2  Group3  Group4  Group5  Group6  Unit
9       4       0       0       0       0       One-a
9       3       1       0       0       0       One-b
9       2       3       1       2       0       One-c
0       4       1       0       1       0       Two-a
0       1       3       2       0       0       Two-b
0       4       5       1       2       0       Two-c
2       2       4       1       0       0       Three-a
2       3       3       0       0       0       Three-b
2       3       0       0       0       0       Three-c
1       3       1       0       0       5       Four-a
1       2       0       0       0       5       Four-b
1       7       0       0       0       5       Four-c
To format your data, first delete the Number of Clones per Group data column (Column B), then highlight all rows and columns containing data.
Copy (CTRL + C)
Open a new blank workbook.
Edit -> Paste Special: click “Transpose values”
You will get a list of columns that looks like the above, only the “Unit” column is named “Group Number”. Change this to “Unit” and relocate the information to the last column in the file.
Save the file. Close the file before attempting to run the SAS program that uses it. Note that the SAS code below assumes that the data is saved in Microsoft Excel v. 4.0 format and will look for a data file that is saved in that format, complete with the appropriate extension.
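The filtering and transposing steps described above can be sketched in Python for readers who prefer to script them. This is a hypothetical helper, not part of GCPAT or the Excel procedure; it assumes the Taxonomic Table has already been read into plain lists, and the column names and sample values are invented for illustration.

```python
def groups_to_units(header, rows, min_clones=5):
    """Keep groups with at least min_clones clones, one output row per unit.

    header: ["Group Number", "Number of Clones per Group", unit1, unit2, ...]
    rows:   one list per fingerprint group, in the same column order.
    """
    kept = [r for r in rows if r[1] >= min_clones]
    group_names = [f"Group{r[0]}" for r in kept]
    units = header[2:]
    # First row holds the variable names, with "Unit" as the last column.
    table = [group_names + ["Unit"]]
    for i, unit in enumerate(units):
        table.append([r[2 + i] for r in kept] + [unit])
    return table


# Hypothetical three-group table with two treatment columns.
header = ["Group Number", "Number of Clones per Group", "One-a", "One-b"]
rows = [
    [1, 18, 9, 9],
    [2, 7, 4, 3],
    [3, 2, 1, 1],  # fewer than 5 clones, so this group is dropped
]

for line in groups_to_units(header, rows):
    print(*line)
```

Prefixing each group number with "Group" keeps the resulting column names usable as SAS variable names.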
An example of SAS code for the cluster analysis is:
PROC IMPORT OUT=WORK.liztest
    DATAFILE="C:\Example.xlw"
    DBMS=EXCEL4 REPLACE;
    GETNAMES=YES;
RUN;
title1 'Cluster Analysis Example';
proc print;
run;
proc cluster method=single std;
id unit;
proc tree;
run;
When this code is run, we get the resulting output:
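For readers who want to see the idea behind method=single with the std option, the following Python sketch standardizes each group column and then repeatedly merges the two closest clusters of units. It is a simplified illustration rather than a reproduction of proc cluster: the function names are hypothetical, standardization is assumed to use the sample standard deviation, and the result is a flat list of clusters rather than a tree.

```python
import math


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out_cols.append([(v - mean) / sd if sd else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def single_linkage(rows, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the smallest Euclidean distance between members.
    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Example data from this guide: Group1..Group6 counts for each unit.
units = ["One-a", "One-b", "One-c", "Two-a", "Two-b", "Two-c",
         "Three-a", "Three-b", "Three-c", "Four-a", "Four-b", "Four-c"]
counts = [
    [9, 4, 0, 0, 0, 0], [9, 3, 1, 0, 0, 0], [9, 2, 3, 1, 2, 0],
    [0, 4, 1, 0, 1, 0], [0, 1, 3, 2, 0, 0], [0, 4, 5, 1, 2, 0],
    [2, 2, 4, 1, 0, 0], [2, 3, 3, 0, 0, 0], [2, 3, 0, 0, 0, 0],
    [1, 3, 1, 0, 0, 5], [1, 2, 0, 0, 0, 5], [1, 7, 0, 0, 0, 5],
]

for cluster in single_linkage(standardize(counts), n_clusters=4):
    print([units[i] for i in cluster])
```

The choice of four clusters is arbitrary here; in the SAS workflow the tree produced by proc tree is inspected instead of fixing the number of clusters in advance.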
(4) Stepwise Discriminant analysis: Given the large number of fingerprint groups in OFRG studies, it would be infeasible to manually pick out groups, or clusters of groups, that demonstrate treatment differences. To help us locate differences between treatments, we use a stepwise discriminant analysis. We need to look at data from groups containing a sufficient number of clones for analysis. What the optimal number of clones would be for this analysis is unknown, but currently we analyze data from fingerprint groups containing 5 or more clones.
First step: Open the file you made for Cluster analysis, above, or if you have not performed this analysis, obtain the data from groups containing 5 or more clones and transpose it, as described for generating Cluster analysis data files.
The format of the file will be slightly different in that the treatment and replicate data are in separate columns and come at the beginning of the data:
Treat   Rep   Group1  Group2  Group3  Group4  Group5  Group6
One     a     9       4       0       0       0       0
One     b     9       3       1       0       0       0
One     c     9       2       3       1       2       0
Two     a     0       4       1       0       1       0
Two     b     0       1       3       2       0       0
Two     c     0       4       5       1       2       0
Three   a     2       2       4       1       0       0
Three   b     2       3       3       0       0       0
Three   c     2       3       0       0       0       0
Four    a     1       3       1       0       0       5
Four    b     1       2       0       0       0       5
Four    c     1       7       0       0       0       5
Save the file under a new name.
An example of a stepwise discriminant SAS program is below.
Note that you must input the names of each group (or, the Group Number for each fingerprint group) in both the input statement and the proc stepdisc var statement, and that each group name must be a valid SAS variable name that begins with a letter or underscore rather than a digit (for example, Group1 rather than 1).
To insert data for datalines, you can copy (CTRL + C) and paste (CTRL + V) from Excel spreadsheets.
options ls=80 ps=55 nocenter nodate;
/* Set up temporary SAS data set called stepdisc */
data stepdisc;
/* Variables to be input are Treatment, Replicate, and number of clones per treatment and replicate in each group */
/* The "$" indicates that it is a text variable */
input Treatment $ Replicate $ Group1 Group2 Group3 Group4 Group5 Group6;
title1 'Stepwise Discriminant Analysis';
title2 'Example Data';
/* datalines statement to indicate data is about to begin */
datalines;
One a 9 4 0 0 0 0
One b 9 3 1 0 0 0
One c 9 2 3 1 2 0
Two a 0 4 1 0 1 0
Two b 0 1 3 2 0 0
Two c 0 4 5 1 2 0
Three a 2 2 4 1 0 0
Three b 2 3 3 0 0 0
Three c 2 3 0 0 0 0
Four a 1 3 1 0 0 5
Four b 1 2 0 0 0 5
Four c 1 7 0 0 0 5
;
run;
proc stepdisc data=stepdisc;
class Treatment;
var Group1 Group2 Group3 Group4 Group5 Group6;
run;
An example of the output is below.
The STEPDISC Procedure
Stepwise Selection: Step 1
Statistics for Entry, DF = 3, 8
Variable R-Square F Value Pr > F Tolerance
Group1 1.0000 Infty <.0001 1.0000
Group2 0.1169 0.35 0.7885 1.0000
Group3 0.3577 1.48 0.2905 1.0000
Group4 0.3220 1.27 0.3493 1.0000
Group5 0.3253 1.29 0.3437 1.0000
Group6 1.0000 Infty <.0001 1.0000
The program may automatically select groups it thinks should be excluded (or which demonstrate differences in clone distribution between treatments), but we recommend looking instead at the data before any groups are excluded (or, “Step 1” data), since sometimes the number of steps the program can take is less than the number of groups that should be excluded.
In the above example, two groups have a significant p-value, Group1 and Group6. These groups probably represent microorganisms which have different distributions between treatments.
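The Step 1 statistics can be understood as a separate one-way comparison of each group across treatments: R-Square is the between-treatment share of that group's total variation, and F is the usual ratio of mean squares. The Python sketch below illustrates that interpretation with the example data; the function name is hypothetical, and it covers only the Step 1 entry statistics, not the stepwise selection itself.

```python
def entry_statistics(treatments, counts):
    """Return (R-square, F) for one variable in a one-way layout.

    F is None when there is no within-treatment variation, which SAS
    reports as "Infty".
    """
    by_treatment = {}
    for t, c in zip(treatments, counts):
        by_treatment.setdefault(t, []).append(c)
    grand_mean = sum(counts) / len(counts)
    ss_total = sum((c - grand_mean) ** 2 for c in counts)
    ss_between = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in by_treatment.values()
    )
    ss_within = ss_total - ss_between
    r_square = ss_between / ss_total
    if ss_within <= 1e-12:
        return r_square, None
    df_between = len(by_treatment) - 1
    df_within = len(counts) - len(by_treatment)
    return r_square, (ss_between / df_between) / (ss_within / df_within)


# Treatment labels and two of the group columns from the example data.
treatment = ["One"] * 3 + ["Two"] * 3 + ["Three"] * 3 + ["Four"] * 3
group1 = [9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1]
group2 = [4, 3, 2, 4, 1, 4, 2, 3, 3, 3, 2, 7]

print("Group1:", entry_statistics(treatment, group1))
print("Group2:", entry_statistics(treatment, group2))
```

Group1 has identical counts within every treatment, so its R-Square is 1 and F is unbounded, while Group2 gives an R-Square of about 0.117 and F of about 0.35, in line with the STEPDISC output above.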
(5) Stepwise Regression analysis: In this case you are looking for fingerprint groups in which the abundance of clones between treatments varies with some measurement of ecosystem function that varies between treatments. An example of an ecosystem function would be the ability of a soil to suppress a plant parasitic nematode. The data used in this analysis is identical to that used in the Stepwise Discriminant analysis, except this time there is an additional data column for the measurement.
Treat   Rep   Group1  Group2  Group3  Group4  Group5  Group6  Measurement
One     a     9       4       0       0       0       0       6
One     b     9       3       1       0       0       0       6
One     c     9       2       3       1       2       0       6
Two     a     0       4       1       0       1       0       0
Two     b     0       1       3       2       0       0       0
Two     c     0       4       5       1       2       0       0
Three   a     2       2       4       1       0       0       3
Three   b     2       3       3       0       0       0       3
Three   c     2       3       0       0       0       0       3
Four    a     1       3       1       0       0       5       2
Four    b     1       2       0       0       0       5       2
Four    c     1       7       0       0       0       5       2
An example of a stepwise regression SAS program follows:
options ls=80 ps=55 nocenter nodate;
data stepwisemeasure;
/* Measurement represents the property to be measured, such as ability to digest waste or suppress plant disease */
/* The "$" indicates that it is a text variable */
input Treatment $ Replicate $ Group1 Group2 Group3 Group4 Group5 Group6 measurement;
title1 'Stepwise Regression';
title2 'Example of Measurement Data';
/* datalines statement to indicate data is about to begin */
datalines;
One a 9 4 0 0 0 0 6
One b 9 3 1 0 0 0 6
One c 9 2 3 1 2 0 6
Two a 0 4 1 0 1 0 0
Two b 0 1 3 2 0 0 0
Two c 0 4 5 1 2 0 0
Three a 2 2 4 1 0 0 3
Three b 2 3 3 0 0 0 3
Three c 2 3 0 0 0 0 3
Four a 1 3 1 0 0 5 2
Four b 1 2 0 0 0 5 2
Four c 1 7 0 0 0 5 2
;
RUN;
PROC REG DATA=stepwisemeasure;
MODEL measurement=Group1 Group2 Group3 Group4 Group5 Group6 / SELECTION=STEPWISE;
TITLE1 'Stepwise Regression';
/* Add & subtract one at a time & compare f0 */
run;
When the above program is run, the output looks like this:
Stepwise Regression 14
The REG Procedure
Model: MODEL1
Dependent Variable: measurement
Stepwise Selection: Step 1
Variable Group1 Entered: R-Square = 0.8971 and C(p) = 3.3051
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 50.46000 50.46000 87.15 <.0001
Error 10 5.79000 0.57900
Corrected Total 11 56.25000
Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F
Intercept 1.01000 0.28808 7.11698 12.29 0.0057
Group1 0.58000 0.06213 50.46000 87.15 <.0001
Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
Stepwise Selection: Step 2
Variable Group5 Entered: R-Square = 0.9286 and C(p) = 1.8366
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 52.23639 26.11819 58.57 <.0001
Error 9 4.01361 0.44596
Corrected Total 11 56.25000
Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F
Intercept 1.19154 0.26869 8.77017 19.67 0.0016
Group1 0.59018 0.05476 51.79362 116.14 <.0001
Group5 -0.50899 0.25503 1.77639 3.98 0.0771
Bounds on condition number: 1.0088, 4.035
--------------------------------------------------------------------------------
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
Summary of Stepwise Selection
       Variable   Variable   Number    Partial    Model
Step   Entered    Removed    Vars In   R-Square   R-Square   C(p)     F Value   Pr > F
1      Group1                1         0.8971     0.8971     3.3051   87.15     <.0001
2      Group5                2         0.0316     0.9286     1.8366   3.98      0.0771
The program will automatically display the groups in which there is some correlation between the measurement and the number of clones in each treatment. Pick out significant groups based on the p-value. In this case, Group 1 seems to show a significant (p<0.05) correlation between the measurement and the number of clones in each treatment.
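The Step 1 model in the output above is an ordinary least-squares fit of the measurement on Group1 alone, which can be checked by hand. The Python sketch below is an illustration of that single step only, with a hypothetical function name; it does not implement the stepwise entry and removal tests that PROC REG performs.

```python
def simple_regression(x, y):
    """Return (intercept, slope, R-square) for a least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_square


# Group1 counts and measurement values from the example data.
group1 = [9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1]
measurement = [6, 6, 6, 0, 0, 0, 3, 3, 3, 2, 2, 2]

intercept, slope, r_square = simple_regression(group1, measurement)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, R-square = {r_square:.4f}")
```

The intercept (1.01), slope (0.58), and R-Square (0.8971) match the Step 1 parameter estimates reported by PROC REG.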