UserGuideLRTae - Division of Statistical Genomics

User's Guide to the LRTae program 1.0
=========================================
TABLE OF CONTENTS
1.0 OVERVIEW AND THEORY
1.1 Terminology
1.1.1 Maximum likelihood estimates of parameters using Expectation-Maximization algorithm
1.1.2 Asymptotic and permutation p-values
2.0 RUNNING THE LRTae PROGRAM
2.1 Input files
2.1.1 Description file
2.1.2 Fallible data file
2.1.3 Infallible data file
2.2 Usage
2.2.1 Command line options
2.3 Example files with this distribution
3.0 INTERPRETING RESULTS FROM LRTae OUTPUT
3.1 A note about selection of double-samples
4.0 PROBLEMS? COMMENTS? (contact information)
5.0 ACKNOWLEDGEMENTS
6.0 REFERENCES
1.0 OVERVIEW AND THEORY
This program is compiled to work in UNIX Solaris, LINUX, and Windows (PC)
operating systems. All commands are executed from the command line in UNIX and
LINUX or DOS prompt in Windows.
One of the more vexing problems in association analysis with cases and controls is
misclassification error (e.g., a true case is incorrectly labeled as a control or a
homozygote genotype 1 1 is labeled as a heterozygote 1 2). Consequences of such
misclassifications are a reduction in the power to detect association and biased estimates
of parameters such as genotype or haplotype frequencies in cases and controls. These
misclassifications are of primary concern for case/control genetic association studies
since there is no way to detect such errors in phenotype or genotype data short of
repeated sampling (or double-sampling – see below). We developed a method, the
Likelihood Ratio Test Allowing for Errors (LRTae) to treat this problem (1). The purpose
of this software is to compute the LRTae on genotype and phenotype data provided by
the user.
1.1 Terminology
A key assumption for use of this program is that double-sample phenotype and/or
genomic data (either genotypes or haplotypes – see below) are available for some subset
of the individuals. By double-sample we mean a sample (either phenotype, genotype, or
haplotype) that has been classified twice in a particular manner (2, 3). The first
classification (called the fallible classification) is determined by a standard procedure that
is subject to misclassification error. It is available on every individual. The second
classification (called the infallible classification) is determined by a “gold-standard”
procedure that has much lower misclassification error than the fallible classification
mechanism. It is typically available on only a subset of individuals. An example of a
double-sample for phenotype comes from Alzheimer’s Disease. The fallible classification
procedure is phenotype determination using a clinical dementia instrument and the
infallible procedure is autopsy-proven plaques and tangles. For genotypes, the fallible
classification is the current genotyping technology and the infallible classification
procedure may be forward and reverse sequencing. For haplotypes, the fallible
classification procedure may be statistically assigned haplotypes based on multi-locus
genotype data and the infallible classification may be molecular haplotype calls.
Individuals for whom both classifications are available are called double-samples. We
also use the term genomic data to describe either genotype or haplotype pair data.
This program performs a likelihood ratio test on genomic data for cases and
control individuals. It uses double-sample data to: (i) estimate the phenotype and
genomic data misclassification probabilities; (ii) compute asymptotically unbiased
estimates of genomic data frequencies in cases and controls; and (iii) increase power for
genetic association. While we do not present full details of the theory in this user guide,
we note that full documentation for our method is available online
(http://www.bepress.com/sagmb/vol3/iss1/art26/) or from the authors (lrtae@linkage.rockefeller.edu).
1.1.1 Maximum likelihood estimates of study parameters using Expectation-Maximization algorithm
Maximum likelihood estimates (MLEs) of parameters such as the sampling proportions
of cases and controls in the data set and the specific genomic data frequencies in cases
and controls under the null and alternative hypotheses are determined using the
Expectation-Maximization (EM) method (4). Parameter estimates are updated until the
absolute difference between the sums of the nth-step and (n+1)st-step estimates
(summed over all parameters) is no greater than $10^{-9}$. That is, the stopping
condition for parameter estimation is
$\left| \sum_i \left( v_i^{n} - v_i^{n+1} \right) \right| \le 10^{-9}$,
where $v_i^{n}$ is the nth-step estimate of the ith parameter.
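The stopping rule can be sketched in Python as follows. This is a minimal illustration and not the program's source code: the names em_converged and iterate_until_converged are hypothetical, and the update function shown is a toy stand-in for one EM step.

```python
def em_converged(prev, curr, tol=1e-9):
    """Return True when |sum_i (prev_i - curr_i)| <= tol."""
    return abs(sum(p - c for p, c in zip(prev, curr))) <= tol


def iterate_until_converged(update, start, tol=1e-9, max_iter=10_000):
    """Apply `update` repeatedly until the stopping rule is met.

    `update` stands in for one EM step (E-step followed by M-step).
    Returns the final estimates and the number of steps taken.
    """
    prev = list(start)
    for step in range(1, max_iter + 1):
        curr = update(prev)
        if em_converged(prev, curr, tol):
            return curr, step
        prev = curr
    raise RuntimeError("no convergence within max_iter steps")


# Toy update (NOT an EM step): moves every value halfway toward 0.25.
estimates, steps = iterate_until_converged(
    lambda v: [(x + 0.25) / 2 for x in v], [0.9, 0.1]
)
print(estimates, steps)
```

Because the rule compares signed differences summed over parameters, changes of opposite sign can cancel; a stricter variant would sum the absolute differences instead.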
1.1.2 Asymptotic and permutation p-values
The null hypothesis (H0) for our statistic is that each observed genotype’s (or
haplotype’s) frequency in cases and controls is equal. The alternative hypothesis (H1) is
that at least one genotype (or haplotype) has a different frequency between cases and
controls. Log-likelihoods for the data under each of the hypotheses are computed and our
LRTae test statistic is twice the difference of the log-likelihoods. Asymptotically, under
H0, the LRTae statistic is distributed as a central chi-square distribution with k-1 degrees
of freedom, where k is the number of distinct genotype (or haplotype) categories
observed (5). In situations where the number of observations of a particular genotype or
haplotype classification may be small, the asymptotic distribution may not be valid. In
such situations, we compute p-values for the LRTae statistic via permutation. However,
because we have double-sample data, permutation testing is not so straightforward. We
consider two situations:
(a) Double-sample information is available only for genotypes – In such situations,
we randomly permute phenotype status for all individuals, keeping the total
number of cases and controls fixed in each replicate. The permutation p-value is
then the proportion of replicates for which the LRTae statistic exceeds the
observed LRTae statistic. Exact confidence intervals for the permutation p-value
are computed using the method implemented in the BINOM software (6).
(b) Double-sample information is available only for phenotypes – In such situations,
we randomly permute genotype (or haplotype) status for all individuals, keeping
the total number of observations of a particular genotype (or haplotype) fixed in
each replicate. As above, the permutation p-value is the proportion of replicates
for which the LRTae statistic exceeds the observed LRTae statistic. Exact
confidence intervals for the permutation p-value are computed using the method
implemented in the BINOM software (6).
While we conjecture that the p-values determined via our permutation procedures
(described in items (a) and (b)) will achieve correct significance levels, we note that we
have performed no simulation studies to date to verify these procedures. Therefore, users
should interpret p-values using the permutation procedure with caution.
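The permutation scheme in item (a) can be sketched as follows. This is an illustration under stated assumptions, not the LRTae implementation: the function names are hypothetical, and the statistic used here is a simple stand-in rather than the LRTae statistic. Shuffling the list of phenotypes keeps the numbers of cases and controls fixed in every replicate; for item (b), the genotype list would be shuffled instead.

```python
import random


def permutation_p_value(phenotypes, genotypes, statistic, n_perms, seed=None):
    """Proportion of replicates whose statistic exceeds the observed statistic.

    Phenotype labels are shuffled across individuals, so the total numbers
    of cases and controls stay the same in every replicate.
    """
    rng = random.Random(seed)
    observed = statistic(phenotypes, genotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perms):
        rng.shuffle(shuffled)
        if statistic(shuffled, genotypes) > observed:
            exceed += 1
    return exceed / n_perms


def cases_with_genotype_11(phenotypes, genotypes):
    """Stand-in statistic (NOT the LRTae): number of cases (coded 0) with genotype "1 1"."""
    return sum(1 for p, g in zip(phenotypes, genotypes) if p == 0 and g == "1 1")


phen = [0] * 5 + [1] * 5
geno = ["1 1"] * 5 + ["2 2"] * 5
print(permutation_p_value(phen, geno, cases_with_genotype_11, 1000, seed=1))
```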
Finally, we note that we perform the same procedures for the LRTstd method (i.e., the
likelihood ratio test that does not use the double-sample information).
2.0 RUNNING THE LRTae PROGRAM
2.1 Input files
The LRTae program (version 1.0) requires three files as input. They are: (i) the
phenotype and genotype description file; (ii) the fallible data file; and (iii) the infallible
data file. Full details regarding format and examples for each type of file are presented
below.
2.1.1 Description file
This file contains the information identifying the phenotypes and genotypes in the fallible
and infallible data files (items 2.1.2 and 2.1.3). There are a minimum of two and a
maximum of three columns for this file. The first column indicates the nature of the
categorical variable in the second column for the corresponding row. Choices are: P =
Phenotype; A = Genomic data marker with genotypes or haplotype-pairs coded as two
separate alleles or haplotypes; C = Marker locus with genotypes coded by a single
number. As an example to distinguish between symbols A and C, consider a SNP locus
containing genotypes DD, Dd, and dd. Using the symbol A, individual genotypes are
written in the fallible and infallible data files using two columns. For example, the two
columns “1 1”, “1 2”, and “2 2” may represent the genotypes DD, Dd, and dd. Note that
this is the same coding that is used for genotypes in the LINKAGE programs (7). Using
the symbol “C”, genotypes are coded using a single number. Using our example, we may
make the assignments: “1” = DD, “2” = Dd, “3” = dd, although this assignment is
arbitrary. This coding is similar to the coding for genotypes used in the EH program (8).
Also see http://linkage.rockefeller.edu/ott/eh.htm .
The second column in this file is the name of the variable corresponding to the
symbol in the first column. This variable name is also used in the 1st line of both the
fallible and infallible data files to indicate the nature of the corresponding columns in
those files.
The third column, which is optional, indicates the symbol that is used for missing
data in the fallible and infallible data files. If no value is provided, then the program will
assume the default settings that “-1” is the code for missing phenotype data and “0” is the
missing code for genotype or haplotype pair data. Using the default settings, one uses “0
0” for an individual’s missing genotype if genotypes are coded using the A symbol and
“0” for an individual’s missing genotype if genotypes are coded using the C symbol. If
one wants to use the value of -99 for missing genotype data, then similarly one types
“-99 -99” for an individual’s missing genotype if genotypes are coded using the A
symbol and “-99” for an individual’s missing genotype if genotypes are coded using the
C symbol. We present an example of the contents of a description file here.
(Example description file)
P pheno1 -1
P pheno2 -99
A SNP1 95
C Mkr2
In this file, we see that there are four variables of interest. Two correspond to phenotypes
and two correspond to genomic data. The first phenotype variable is called “pheno1” and
the second is called “pheno2”. The first genomic variable is labeled “SNP1” and the
second is labeled “Mkr2”. For the variable pheno1, missing data is indicated by a “-1”
code, while for the variable pheno2, missing data is indicated by a “-99” code. For
variable SNP1, missing data is indicated by a “95” code. There is no specified code for
missing data with the variable Mkr2, so the default code “0” is used.
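To make the column rules concrete, here is a small Python sketch that reads a description file and fills in the default missing-data codes. It is our own illustration, not part of the LRTae distribution; the function name parse_description and the record layout are hypothetical.

```python
# Default missing-data codes described above: -1 for phenotypes, 0 for genomic data.
DEFAULT_MISSING = {"P": "-1", "A": "0", "C": "0"}


def parse_description(text):
    """Parse description-file text into a list of variable records."""
    variables = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns, got: {line!r}")
        kind, name = fields[0], fields[1]
        if kind not in DEFAULT_MISSING:
            raise ValueError(f"unknown variable type {kind!r}")
        # The optional third column overrides the default missing code.
        missing = fields[2] if len(fields) == 3 else DEFAULT_MISSING[kind]
        variables.append({"type": kind, "name": name, "missing": missing})
    return variables


example = """P pheno1 -1
P pheno2 -99
A SNP1 95
C Mkr2
"""
for var in parse_description(example):
    print(var)
```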
2.1.2 Fallible data file
This file contains phenotype and genotype classifications for all individuals as measured
with the fallible measuring instrument (see section 1.1 for definition of fallible). The
format for this file is as follows:
Ind_ID Order_of_phenotype_and_genomic_data
There are two key formatting issues regarding this file and the infallible data file (2.1.3).
They are:
1) The first line of the file always consists of the order of the phenotype and
genomic data. The order is determined by using the variable names provided in
the description file (item 2.1.1).
2) The first column of each row is always the individual ID (Ind_ID), which must be
an alphanumeric string of characters.
The format for this file is best explained through an example. Suppose that we have an
example description file as provided in item 2.1.1 above. Then a fallible data file might
look like:
(Example fallible data file)
      pheno1  SNP1   Mkr2  pheno2
A1    0       1 1    2     1
U2    1       1 2    1     -99
A3    0       2 2    4     0
A4    1       95 95  1     1
U12   1       1 2    2     0
…
In this example, we provide fallible phenotype and genomic classifications for five
individuals. As mentioned above, the first column for both the fallible and the infallible
data files MUST be the individual ID and it must be an alphanumeric string.
As indicated in the example description file, the first column after the Individual
code column is the individual’s classification for the phenotype pheno1. All phenotype
classifications must always be either 0 (case) or 1 (control). In this example file, we see
that the fallible classifications of the phenotype “pheno1” for the five individuals are
(respectively): case, control, case, control, control.
Because the 1st line of this file indicates that the data after the phenotype pheno1
is genotype data for the variable SNP1 and genotypes are coded using two columns (one
for each allele – symbol A in the example description file), we can determine that the
respective genotypes for individuals at this marker locus are: 1 1, 1 2, 2 2, missing data,
and 1 2.
The 1st line of the example fallible data file indicates that the next data are coded
genotypes for the variable “Mkr2” that are coded using a single column (symbol C in the
example description file). In the example fallible data file, we observe three different
coded genotypes. The respective coded genotypes for these individuals are: 2, 1, 4, 1, 2.
Finally, the last column in this example data file represents classification for a
second phenotype, pheno2. Individuals have the respective phenotypes: control, missing
data, case, control, case, using the fallible measuring instrument for phenotype pheno2.
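The layout rules for data files (header line of variable names, individual ID in the first column, two columns for type A variables and one column otherwise) can be sketched as a small Python reader. This is an illustration only; parse_data_file is a hypothetical helper, and the example text writes the two SNP1 alleles as separate whitespace-delimited columns.

```python
def parse_data_file(text, var_types):
    """Parse data-file text into {individual_id: {variable: value}}.

    `var_types` maps each variable name to its description-file symbol:
    "P" and "C" variables occupy one column, "A" variables occupy two.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    records = {}
    for line in lines[1:]:
        fields = line.split()
        ind_id, values = fields[0], fields[1:]
        row, pos = {}, 0
        for name in header:
            width = 2 if var_types[name] == "A" else 1
            row[name] = " ".join(values[pos:pos + width])
            pos += width
        if pos != len(values):
            raise ValueError(f"wrong number of columns for individual {ind_id}")
        records[ind_id] = row
    return records


types = {"pheno1": "P", "pheno2": "P", "SNP1": "A", "Mkr2": "C"}
fallible_text = """pheno1 SNP1 Mkr2 pheno2
A1 0 1 1 2 1
U2 1 1 2 1 -99
A3 0 2 2 4 0
A4 1 95 95 1 1
U12 1 1 2 2 0
"""
data = parse_data_file(fallible_text, types)
print(data["A4"])
```

The same reader applies to the infallible data file, since both files share the header-line and ID-column conventions.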
2.1.3 Infallible data file
This file contains phenotype and genotype classifications for individuals as measured
with the infallible measuring instrument (see section 1.1 for definition of infallible). Note
that the list of individuals in this file is a subset of the list of individuals in the fallible
data file. An individual is listed in this file only if the individual has an infallible
measurement for at least one of the phenotypes and/or genomic data variables in the
description file. The format for this file is as follows:
Ind_ID Order_of_phenotype_and_genomic_data
As above, we illustrate with an example file. Suppose that the data below is the infallible
data file corresponding to the example description and fallible data files above. Notice
that individual A4 is missing from what we report. That means that there is no
double-sample data for that individual. Also notice that there is no information for variable
pheno2; that means that any analysis with the phenotype pheno2 will not include
double-sample information on pheno2.
(Example infallible data file)
      pheno1  SNP1   Mkr2
A1    0       1 1    2
U2    1       2 2    0
A3    0       2 2    4
U12   0       1 2    2
…
In this example, we provide infallible measures for the phenotype variable pheno1, for
the SNP marker locus variable SNP1, and for the coded genotype variable Mkr2. As
mentioned above, the first column for both the fallible and the infallible data files MUST
be the individual ID and it must be an alphanumeric string.
As indicated in the example description file, the first column after the Individual
code column is the individual’s classification for the phenotype pheno1. All phenotype
classifications must always be either 0 (case) or 1 (control). In this example file, we see
that the infallible classifications for the four individuals are (respectively): case, control,
case, case. It is interesting to note that in these example files, individual U12 is classified
as a case using the infallible classifier and as a control using the fallible classifier
(Example fallible data file). This is an example of a phenotype misclassification. This
information is used when computing the LRTae statistic (1).
Because the 1st line of this example file indicates that the data after the phenotype
pheno1 is a genotype consisting of two columns (one for each allele – symbol A in the
example description file), we can determine that the respective genotypes for individuals
at this marker locus are: 1 1, 2 2, 2 2, and 1 2.
The 1st line of this example file indicates that the next data are coded genotypes
for the variable Mkr2 (coded using a single column – symbol C in the example
description file). In the example infallible data file, the respective coded genotypes for
the four individuals are: 2, missing data (coded 0), 4, and 2.
Finally, note that this example infallible data file contains no column for the
second phenotype, pheno2. No individual has an infallible classification for pheno2, so
any analysis of pheno2 uses the fallible classifications only.
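The comparison between the two example files can be sketched in Python. The helper discordant_calls is hypothetical and only lists double-samples whose fallible and infallible calls differ; the LRTae program itself uses such pairs to estimate misclassification probabilities (1). The dictionaries below transcribe the example files for the individuals that appear in the infallible file.

```python
def discordant_calls(fallible, infallible, missing_codes):
    """List (individual, variable, fallible value, infallible value) where calls differ.

    Pairs in which either call consists only of the missing-data code are skipped.
    """
    differences = []
    for ind_id, gold in infallible.items():
        for name, gold_value in gold.items():
            observed = fallible[ind_id][name]
            missing = missing_codes[name]
            if all(tok == missing for tok in gold_value.split()):
                continue
            if all(tok == missing for tok in observed.split()):
                continue
            if observed != gold_value:
                differences.append((ind_id, name, observed, gold_value))
    return differences


fallible = {
    "A1": {"pheno1": "0", "SNP1": "1 1", "Mkr2": "2"},
    "U2": {"pheno1": "1", "SNP1": "1 2", "Mkr2": "1"},
    "A3": {"pheno1": "0", "SNP1": "2 2", "Mkr2": "4"},
    "U12": {"pheno1": "1", "SNP1": "1 2", "Mkr2": "2"},
}
infallible = {
    "A1": {"pheno1": "0", "SNP1": "1 1", "Mkr2": "2"},
    "U2": {"pheno1": "1", "SNP1": "2 2", "Mkr2": "0"},
    "A3": {"pheno1": "0", "SNP1": "2 2", "Mkr2": "4"},
    "U12": {"pheno1": "0", "SNP1": "1 2", "Mkr2": "2"},
}
missing = {"pheno1": "-1", "SNP1": "95", "Mkr2": "0"}
for item in discordant_calls(fallible, infallible, missing):
    print(item)
```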
2.2 Program usage
Usage of the program is as follows:
Usage: lrtae [OPTIONS] <fallible file> <infallible file> <description file>
We explain each item:
[OPTIONS]: A list of options that may be used when running LRTae (see Section 2.2.1).
<fallible file>: The name of the fallible data file (item 2.1.2)
<infallible file>: The name of the infallible data file (item 2.1.3)
<description file>: The name of the description file (item 2.1.1)
2.2.1 Command line options
-p <# perms>          Permute phenotypes only
-g <# perms>          Permute genomic data only
-b <# perms>          Permute both phenotypes and genomic data
-j <genomic_list>     Run statistics for the genomic data specified
                      genomic_list should be a comma separated list
-k <phenotype_list>   Run statistics for the phenotypes specified
                      phenotype_list should be a comma separated list
-c <conf_interval>    Set the confidence interval (default is 0.95 = 95%)
-o                    Specify output file
-h                    Show help
-v                    Show version
Note: If no genomic data or phenotype is specified all will be used
We explain each of these options in the order of their appearance above.
-p <# perms>: This option assumes that double-sample information is only available for
genotypes. Permutations are performed by randomly reassigning phenotypes, keeping
each individual’s information on genotypes fixed (including double-sample information).
The total number of cases and controls is also fixed for each replicate. The number of
permutations is specified by the user (#permutations). For example, typing “-p 10000”
means that 10,000 permutations will be performed by randomly reassigning observed
case and control status. For more details on p-values, see section 1.1.2 above. It is
important to comment that if this option is chosen for a phenotype variable in which
double-sample data is available, permutation p-values may not be valid.
-g <# perms>: This option assumes that double-sample information is only available for
phenotypes. Permutations are performed by randomly reassigning genotypes or haplotype
pairs, keeping each individual’s information on phenotypes fixed (including
double-sample information). The total number of each type of genotype or haplotype pair is also
fixed for each replicate. The number of permutations is specified by the user
(#permutations). For example, typing “-g 10000” means that 10,000 permutations will be
performed by randomly reassigning genotypes or haplotype pairs. For more details on
p-values, see section 1.1.2 above. It is important to comment that if this option is chosen for
a genomic data variable in which double-samples are available, permutation p-values
may not be valid.
Note: For any run of the LRTae program, only one of the options “-p” or “-g” may be
chosen.
-b <# perms>: With this option, permutations are performed on both phenotypes and
genomic data. THIS OPTION IS NOT RECOMMENDED AT PRESENT – MORE
RESEARCH MUST BE DONE TO EVALUATE P-VALUES DETERMINED
USING THIS OPTION.
-j <genomic_list>: This option is chosen if the user only wants to perform LRTae
analysis on a subset of the genomic data provided in the data files (items 2.1.1-2.1.3). For
example, suppose that genotype and/or haplotype pair data are available for variables
Mkr1, Mkr2, Mkr3, Mkr4, Hap1, Hap2, and Hap3. If this option is not chosen, then the
program will compute the LRTae statistic for all genomic data. However, if the user types
the option:
“-j Mkr1,Mkr3,Hap1”
then the LRTae statistic will only be computed for Mkr1, Mkr3, and Hap1 variables.
Note: With this option, the list of genomic data variables MUST be separated by a
comma and there can be no spaces between variables. For example, the program will not
run if the user types “-j Mkr1, Mkr3, Hap1” with spaces after the commas.
-k <phenotype_list>: This option is chosen if the user only wants to perform LRTae
analysis on a subset of the phenotype variables provided in the data files (items
2.1.1-2.1.3). For example, suppose that phenotype data are available for variables p1, p2, p3,
and p4. If this option is not chosen, then the program will compute the LRTae statistic for
all four variables. However, if the user types the option:
“-k p2,p4”
then the LRTae statistic will only be computed for the p2 and p4 variables.
Note: As above (-j option), the list of phenotype variables MUST be separated by a
comma and there can be no spaces between variables. For example, the program will not
run if the user types “-k p2, p4” with spaces after the comma.
-c <confidence_interval>: This option enables the user to specify the confidence level,
expressed as a proportion, of the interval reported around the p-value obtained by
permutation. For example, if the user types “-c 0.99”, then a 99% confidence interval
will be calculated, and if the user types “-c 0.90”, then a 90% confidence interval will
be calculated. This confidence interval is determined using the method implemented in
the BINOM program (see http://linkage.rockefeller.edu/ott/linkutil.htm#BINOM for more
information). If this option is not selected and permutations are performed, the default
confidence interval provided is 95%.
-o: This option allows the user to specify the name of the output file, containing results of
all analyses.
-h: Choosing this option will provide the user with a list of command-line options
(Section 2.2.1).
-v: This option provides the user with the version of the program being run.
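For users who drive the program from a script, the option rules above can be captured in a small Python helper that assembles the argument list. The helper build_lrtae_command is hypothetical and is not shipped with LRTae; it uses only the flags documented in this section, assumes the executable is named lrtae, and passes the output file name after -o as in the example in Section 2.3.

```python
def build_lrtae_command(fallible, infallible, description, *, perm_mode=None,
                        n_perms=None, genomic=None, phenotypes=None,
                        conf=None, output=None, program="lrtae"):
    """Assemble an argument list following the usage line in Section 2.2."""
    args = [program]
    if output is not None:
        args += ["-o", output]
    if perm_mode is not None:
        # A single mode argument guarantees that -p and -g are never combined.
        if perm_mode not in ("p", "g", "b"):
            raise ValueError("perm_mode must be 'p', 'g', or 'b'")
        if n_perms is None or n_perms <= 0:
            raise ValueError("a positive number of permutations is required")
        args += [f"-{perm_mode}", str(n_perms)]
    if genomic:
        # Comma separated, with no spaces between variable names.
        args += ["-j", ",".join(name.strip() for name in genomic)]
    if phenotypes:
        args += ["-k", ",".join(name.strip() for name in phenotypes)]
    if conf is not None:
        if not 0.0 < conf < 1.0:
            raise ValueError("conf must lie strictly between 0 and 1")
        args += ["-c", str(conf)]
    return args + [fallible, infallible, description]


cmd = build_lrtae_command(
    "sim-snp-fall.txt", "sim-snp-infall.txt", "sim-snp-desc.txt",
    perm_mode="p", n_perms=10000, conf=0.99, output="sim-snp-lrtae.out",
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a system where the program is installed.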
2.3 Example files with this distribution
We provide two sets of example data with this program. Both are simulated data sets. The
first contains double-sample genotype data for a SNP marker. The fallible, infallible, and
description files are: sim-snp-fall.txt, sim-snp-infall.txt, and sim-snp-desc.txt,
respectively. The output file is labeled sim-snp-lrtae.out.
The second set of files contains information for two phenotypes (labeled simP1
and simP2) and for coded genotypes (labeled simMicro1; here Micro stands for
“Microsatellite” marker). The fallible, infallible, and description files are:
sim-micro-fall.txt, sim-micro-infall.txt, and sim-micro-desc.txt, respectively. For the second set of
files, data are simulated so that the first phenotype is associated with the coded genotypes
and the second phenotype is not. The output file for this data set is labeled
sim-micro-lrtae.out.
If we type:
> lrtae -o sim-snp-lrtae.out -p 10000 -c 0.99 sim-snp-fall.txt sim-snp-infall.txt sim-snp-desc.txt
at the command line, then the program will compute the LRTae statistic using the
observed phenotype data and the double-sample data for SNP genotypes. The program
also computes permutation p-values by randomly permuting case/control status for all
individuals. A total of 10,000 permutations are performed and a 99% confidence interval
centered around each permutation p-value is computed. The output is presented in a file
called “sim-snp-lrtae.out”. We present the contents of this file in the next section.
3.0 INTERPRETING RESULTS FROM LRTae OUTPUT
As with any statistical analysis program, critical to the success of the analysis is an
accurate interpretation of the results. We provide a sample output file, determined by
running the LRTae program on one sample data set above (see section 2.3 above).
LRTae program version 1.0
Written by Chad Haynes
Supervised by Derek Gordon

Input files:
  sim-snp-desc.txt      (Description file)
  sim-snp-fall.txt      (Fallible data file)
  sim-snp-infall.txt    (Infallible data file)

Command line options selected:
  10000 Permutations - Randomly reassign phenotypes
  Output file name - sim-snp-lrtae.out

Program run at Tue Jan 04 17:51:57 2005

**************************************
** Genomic Data Marker: SNP1        **
** Phenotype Marker:    pheno1      **
**************************************

Population Frequencies:
                    1/1      1/2      2/2
  LRTae   Case      0.00000  0.09843  0.90157
          Control   0.02619  0.22172  0.75209
  LRTstd  Case      0.02400  0.10800  0.86800
          Control   0.05200  0.21600  0.73200

Misclassification:
  Genomic Data
                    (Obs)
                    1/1      1/2      2/2      N
  (True)  1/1       1.00000  0.00000  0.00000  1
          1/2       0.08000  0.92000  0.00000  25
          2/2       0.01064  0.02128  0.96809  94

  Phenotype
                    (Obs)
                    Case     Control  N
  (True)  Case      1.00000  0.00000  0
          Control   0.00000  1.00000  0

LRT Statistics:
          Hyp  LogLike  LRT       Df  Asym-P   Perm-P   99.0% Permutation CI
  LRTae   H1   647.82   18.13652  2   0.00012  0.00000  (0.00000, 0.00046)
          H0   656.89
  LRTstd  H1   291.47   14.70875  2   0.00064  0.00100  (0.00037, 0.00214)
          H0   298.82

Please cite the following when reporting results from the LRTae program:
Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, and Haroutunian V (2004)
Increasing power for tests of genetic association in the presence of phenotype and/or
genotype error by use of double-sampling. Stat Appl Genet and Mol Biol 3:Article 26.
http://www.bepress.com/sagmb/vol3/iss1/art26/.
We first notice that the program output indicates what genomic data (SNP1) and
phenotype (pheno1) variables are being considered. Next, the program provides MLEs of
the individual SNP genotype frequencies in cases and controls using the LRTae method
and the LRTstd method (i.e., the method that does not use the double-sample
information.) Notice that the results are different. Using the double-sample information,
the MLEs for the 1/1 genotype in cases and controls with the LRTae method are 0.000
and 0.026, respectively (genotypes are written with a “/” symbol to separate the alleles).
Using the LRTstd method, the MLEs for the 1/1 genotype in cases and controls are 0.024
and 0.052, respectively. Similarly, for the 2/2 genotype, the LRTae method provides
MLEs of 0.902 and 0.752 for cases and controls (respectively) while the LRTstd method
provides MLEs of 0.868 and 0.732 (respectively).
What accounts for this difference? One can glean the answer by studying the
genotype misclassification table. In this table, we provide MLEs of the misclassification
probabilities, namely Pr(observed genotype = a | true genotype = b), where a and b are
any of the three genotypes 1/1, 1/2, or 2/2. The last column in this table, with the heading
“N” provides the sample size on which these calculations are determined. So for
example, we see that, based on the 94 double-sample genotype observations, the 2/2
genotype is misclassified approximately 2% of the time as the heterozygote 1/2 and
approximately 1% of the time as the homozygous 1/1 genotype. Similarly, based on 25
observations, the heterozygote 1/2 genotype is misclassified approximately 8% of the
time as the 1/1 genotype. Therefore, many of the “observed” 1/1 genotypes are really misclassified
heterozygotes and 2/2 genotypes. This accounts for the decreased 1/1 genotype frequency
estimates with the LRTae method as opposed to the LRTstd method.
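As a rough check on how to read the genotype misclassification table, the sketch below converts double-sample counts into the row-wise proportions Pr(observed genotype = a | true genotype = b). The counts are our assumption: they are inferred by multiplying each reported probability by its N, and they are not printed by the program. The program's EM-based MLEs need not equal such simple proportions in general, although they agree here to the displayed precision. The function name misclassification_table is hypothetical.

```python
def misclassification_table(counts):
    """Convert double-sample counts into row-wise proportions.

    counts[true][observed] is the number of double-sampled individuals whose
    infallible call is `true` and whose fallible call is `observed`.
    """
    table = {}
    for true_call, row in counts.items():
        n = sum(row.values())
        table[true_call] = {obs: (c / n if n else float("nan")) for obs, c in row.items()}
        table[true_call]["N"] = n
    return table


# Counts inferred from the example output (an assumption): proportion times N.
counts = {
    "1/1": {"1/1": 1, "1/2": 0, "2/2": 0},
    "1/2": {"1/1": 2, "1/2": 23, "2/2": 0},
    "2/2": {"1/1": 1, "1/2": 2, "2/2": 91},
}
for true_call, row in misclassification_table(counts).items():
    print(true_call, {k: round(v, 5) for k, v in row.items()})
```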
Notice also that, because we have no double-sample data for the phenotype
variable pheno1, the classification is assumed to be “perfect”. Also, the sample size N for
the phenotype misclassification is 0.
The log-likelihoods under the null (H0) and alternative (H1) hypotheses (also see
Section 1.1.2) of the LRTae and LRTstd methods are provided below the
misclassification tables. For each method, twice the difference of the log-likelihoods
provides the value of the test statistic. We compute respective values of 18.137 and
14.709 for the LRTae and LRTstd methods. Since this data set was simulated by
randomly introducing errors into genotypes, the results are consistent with the findings of
Mote and Anderson (9) and others (10, 11) that the LRTae statistic has greater
significance. The added power comes from the correctly classified genotypes, which are
not used by the LRTstd method (1). Note also that the log-likelihoods for the LRTae
method are larger, indicating that additional information is being used. Also, the output
file provides the degrees of freedom (Df) for each likelihood ratio test (LRT). In this
example, there are two degrees of freedom for each test, since there are three genotype
classification categories possible for the di-allelic locus being tested. The asymptotic
p-value is computed assuming that the null distribution is a central chi-square distribution
with the indicated degrees of freedom. It is the probability, under that null distribution, of
observing an LRT statistic at least as large as the one computed from the data.
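The arithmetic behind the test statistic and the asymptotic p-value can be reproduced with the Python standard library. This sketch is ours and makes two assumptions: the LogLike column prints positive values, so we take the absolute difference, and the chi-square tail is evaluated with the standard finite-series formulas for integer degrees of freedom. The function names are hypothetical.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability of a central chi-square with integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in odd powers of sqrt(x).
    total, term = 0.0, 0.0
    root = math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


def lrt_statistic(loglike_h0, loglike_h1):
    """Twice the absolute difference of the two printed log-likelihood values."""
    return 2 * abs(loglike_h0 - loglike_h1)


# Rounded log-likelihoods and LRT values taken from the example output.
print(round(lrt_statistic(656.89, 647.82), 2))
print(chi_square_sf(18.13652, 2))   # LRTae, Df = 2
print(chi_square_sf(14.70875, 2))   # LRTstd, Df = 2
```

With two degrees of freedom the tail probability reduces to exp(-x/2), which is why the p-values can also be checked by hand.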
Because the LRTae method estimates the 1/1 genotype frequencies to be small (0
for cases!), we perform a permutation analysis to determine the accuracy of the p-value
based on asymptotic theory. Because we have double-sample data only for genotypes, we
choose the “-p” option (see above – Sections 1.1.2 and 2.2.1) and perform 10,000
permutations. Results of the permutation analysis indicate that the p-values based on
asymptotic theory are consistent with the p-values based on permutation (both asymptotic
p-values are within the 99% confidence intervals for the p-values based on permutation).
However, as mentioned above (Section 1.1.2) these results should be viewed with
some caution, as more extensive simulations need to be conducted.
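The permutation confidence interval can be approximated with an exact binomial (Clopper-Pearson) interval. We assume here, without having verified the BINOM source, that this is the kind of interval the program reports; the function names are hypothetical. For the LRTstd row, a permutation p-value of 0.00100 from 10,000 replicates corresponds to 10 replicates exceeding the observed statistic.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _solve(func, target, increasing):
    """Bisection on [0, 1] for func(p) == target, with func monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(successes, n, conf=0.95):
    """Two-sided Clopper-Pearson interval for a binomial proportion."""
    alpha = 1.0 - conf
    lower = 0.0 if successes == 0 else _solve(
        lambda p: 1.0 - binom_cdf(successes - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if successes == n else _solve(
        lambda p: binom_cdf(successes, n, p), alpha / 2, increasing=False)
    return lower, upper


print(exact_binomial_ci(10, 10000, conf=0.99))
```

The result should be close to the interval (0.00037, 0.00214) shown for LRTstd. For zero exceedances this two-sided form gives an upper limit near 0.00053 rather than the 0.00046 reported for LRTae, so the program appears to treat that boundary case differently.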
3.1 A note about selection of double-samples
When using the LRTae statistic, we assume that the double-sample information is
obtained by randomly selecting a subset of individuals for classification with an infallible
measure. However, this assumption may be violated. For example, with a disease like
Alzheimer’s, there may be a bias in obtaining only the “gold-standard” diagnosis of
presence/absence of plaques and tangles for observed cases, since observed controls are
assumed to have died of other causes and therefore may not have autopsies performed.
Similarly, researchers may only scrutinize very rare observed genotypes, because these
can be particularly costly in terms of power loss for case/control studies (12).
4.0 PROBLEMS? COMMENTS?
If there are problems in the execution or compilation of this program or if you would like
to provide some feedback, please e-mail lrtae@linkage.rockefeller.edu
5.0 ACKNOWLEDGEMENTS
The authors of this software gratefully acknowledge grant K01-HG00055 from the
National Institutes of Health.
6.0 REFERENCES
Below are references for this README file. Please cite the first reference when reporting
results using the LRTae software.
1. Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., and
Haroutunian, V. 2004. Increasing power for tests of genetic association in the presence of
phenotype and/or genotype error by use of double-sampling. Stat Appl Genet and Mol
Biol 3:Article 26. http://www.bepress.com/sagmb/vol3/iss1/art26/.
2. Tenenbein, A. 1970. A double sampling scheme for estimating from binomial
data with misclassifications. J Am Stat Assoc 65:1350-1361.
3. Tenenbein, A. 1972. A double sampling scheme for estimating from misclassified
multinomial data with applications to sampling inspection. Technometrics 14:187-202.
4. Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from
incomplete data via the EM algorithm. J Roy Statist Soc B 39:1-38.
5. Cox, D.R., and Hinkley, D.V. 1979. Theoretical Statistics. Boca Raton: CRC
Press.
6. Ott, J. 1999. Analysis of Human Genetic Linkage. Baltimore: The Johns Hopkins
University Press.
7. Terwilliger, J.D., and Ott, J. 1994. Handbook of Human Genetic Linkage.
Baltimore: Johns Hopkins.
8. Xie, X., and Ott, J. 1993. Testing linkage disequilibrium between a disease gene
and marker loci. Am J Hum Genet 53:1107 (Abstract).
9. Mote, V.L., and Anderson, R.L. 1965. An investigation of the effect of
misclassification on the properties of chisquare-tests in the analysis of categorical data.
Biometrika 52:95-109.
10. Gordon, D., Finch, S.J., Nothnagel, M., and Ott, J. 2002. Power and sample size
calculations for case-control genetic association tests when errors are present: application
to single nucleotide polymorphisms. Hum Hered 54:22-33.
11. Rice, K.M., and Holmans, P. 2003. Allowing for genotyping error in analysis of
unmatched cases and controls. Ann Hum Genet 67:165-174.
12. Kang, S.J., Gordon, D., and Finch, S.J. 2004. What SNP genotyping errors are
most costly for genetic association studies? Genet Epidemiol 26:132-41.