User's Guide to the LRTae program 1.0
=========================================

TABLE OF CONTENTS

1.0 OVERVIEW AND THEORY
  1.1 Terminology
    1.1.1 Maximum likelihood estimates of parameters using Expectation-Maximization algorithm
    1.1.2 Asymptotic and permutation p-values
2.0 RUNNING THE LRTae PROGRAM
  2.1 Input files
    2.1.1 Description file
    2.1.2 Fallible data file
    2.1.3 Infallible data file
  2.2 Usage
    2.2.1 Command line options
  2.3 Example files with this distribution
3.0 INTERPRETING RESULTS FROM LRTae OUTPUT
  3.1 A note about selection of double-samples
4.0 PROBLEMS? COMMENTS? (contact information)
5.0 ACKNOWLEDGEMENTS
6.0 REFERENCES

1.0 OVERVIEW AND THEORY

This program is compiled to work in UNIX Solaris, LINUX, and Windows (PC) operating systems. All commands are executed from the command line in UNIX and LINUX or from the DOS prompt in Windows.

One of the more vexing problems in association analysis with cases and controls is misclassification error (e.g., a true case is incorrectly labeled as a control, or a homozygote genotype 1 1 is labeled as a heterozygote 1 2). The consequences of such misclassifications are a reduction in the power to detect association and biased estimates of parameters such as genotype or haplotype frequencies in cases and controls. These misclassifications are of primary concern for case/control genetic association studies, since there is no way to detect such errors in phenotype or genotype data short of repeated sampling (or double-sampling; see below). We developed a method, the Likelihood Ratio Test Allowing for Errors (LRTae), to treat this problem (1). The purpose of this software is to compute the LRTae on genotype and phenotype data provided by the user.

1.1 Terminology

A key assumption for use of this program is that double-sample phenotype and/or genomic data (either genotypes or haplotypes; see below) are available for some subset of the individuals.
By double-sample we mean a sample (either phenotype, genotype, or haplotype) that has been classified twice in a particular manner (2, 3). The first classification (called the fallible classification) is determined by a standard procedure that is subject to misclassification error. It is available on every individual. The second classification (called the infallible classification) is determined by a “gold-standard” procedure that has much lower misclassification error than the fallible classification mechanism. It is typically available on only a subset of individuals.

An example of a double-sample for phenotype comes from Alzheimer’s Disease. The fallible classification procedure is phenotype determination using a clinical dementia instrument, and the infallible procedure is autopsy-proven plaques and tangles. For genotypes, the fallible classification is the current genotyping technology and the infallible classification procedure may be forward and reverse sequencing. For haplotypes, the fallible classification procedure may be statistically assigned haplotypes based on multi-locus genotype data and the infallible classification may be molecular haplotype calls. Individuals for whom both classifications are available are called double-samples. We also use the term genomic data to describe either genotype or haplotype pair data.

This program performs a likelihood ratio test on genomic data for case and control individuals. It uses double-sample data to: (i) estimate the phenotype and genomic data misclassification probabilities; (ii) compute asymptotically unbiased estimates of genomic data frequencies in cases and controls; and (iii) increase power for genetic association. While we do not present full details of the theory in this user guide, we note that full documentation for our method is available online (http://www.bepress.com/sagmb/vol3/iss1/art26/) or from the authors (lrtae@linkage.rockefeller.edu).
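To make the role of double-samples concrete, the following minimal Python sketch tabulates Pr(observed class | true class) from paired classifications. It is only a simple tabulation, not the maximum likelihood procedure used by the program; the function name `misclassification_table` and the data are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def misclassification_table(pairs):
    """Tabulate Pr(observed = a | true = b) from (true, observed) pairs.

    Illustrative helper only; it is not part of the LRTae program.
    """
    counts = defaultdict(Counter)
    for true_class, observed_class in pairs:
        counts[true_class][observed_class] += 1
    table = {}
    for true_class, row in counts.items():
        n = sum(row.values())  # number of double-samples with this true class
        table[true_class] = {obs: c / n for obs, c in row.items()}
    return table


# Hypothetical double-samples: (infallible call, fallible call) for 25
# individuals whose infallible genotype is the heterozygote 1/2.
pairs = [("1/2", "1/2")] * 23 + [("1/2", "1/1")] * 2
table = misclassification_table(pairs)
print(table["1/2"]["1/1"])  # proportion of true 1/2 observed as 1/1
```

Each row of the resulting table sums to one, because every true class must be observed as something.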
1.1.1 Maximum likelihood estimates of study parameters using Expectation-Maximization algorithm

Maximum likelihood estimates (MLEs) of parameters such as the sampling proportions of cases and controls in the data set and the specific genomic data frequencies in cases and controls under the null and alternative hypotheses are determined using the Expectation-Maximization (EM) method (4). Parameter estimates are updated until the absolute differences between the nth-step and (n+1)st-step estimates, summed over all parameters, are no greater than $10^{-9}$. That is, the stopping condition for parameter estimation is that

$\sum_i | v_i^{n} - v_i^{n+1} | \le 10^{-9}$,

where $v_i^{n}$ is the nth-step estimate of the ith parameter.

1.1.2 Asymptotic and permutation p-values

The null hypothesis (H0) for our statistic is that each observed genotype’s (or haplotype’s) frequency in cases and controls is equal. The alternative hypothesis (H1) is that at least one genotype (or haplotype) has a different frequency between cases and controls. Log-likelihoods for the data under each of the hypotheses are computed, and our LRTae test statistic is twice the difference of the log-likelihoods. Asymptotically, under H0, the LRTae statistic is distributed as a central chi-square distribution with k-1 degrees of freedom, where k is the number of distinct genotype (or haplotype) categories observed (5).

In situations where the number of observations of a particular genotype or haplotype classification is small, the asymptotic distribution may not be valid. In such situations, we compute p-values for the LRTae statistic via permutation. However, because we have double-sample data, permutation testing is not straightforward. We consider two situations:

(a) Double-sample information is available only for genotypes. In such situations, we randomly permute phenotype status for all individuals, keeping the total number of cases and controls fixed in each replicate.
The permutation p-value is then the proportion of replicates for which the LRTae statistic exceeds the observed LRTae statistic. Exact confidence intervals for the permutation p-value are computed using the method implemented in the BINOM software (6).

(b) Double-sample information is available only for phenotypes. In such situations, we randomly permute genotype (or haplotype) status for all individuals, keeping the total number of observations of a particular genotype (or haplotype) fixed in each replicate. As above, the permutation p-value is the proportion of replicates for which the LRTae statistic exceeds the observed LRTae statistic. Exact confidence intervals for the permutation p-value are computed using the method implemented in the BINOM software (6).

While we conjecture that the p-values determined via our permutation procedures (described in items (a) and (b)) will achieve correct significance levels, we note that we have performed no simulation studies to date to verify these procedures. Therefore, users should interpret p-values obtained using the permutation procedure with caution. Finally, we note that we perform the same procedures for the LRTstd method (the likelihood ratio test that does not use the double-sample information; see Section 3.0).

2.0 RUNNING THE LRTae PROGRAM

2.1 Input files

The LRTae program (version 1.0) requires three files as input. They are: (i) the phenotype and genotype description file; (ii) the fallible data file; and (iii) the infallible data file. Full details regarding format and examples for each type of file are presented below.

2.1.1 Description file

This file contains the information identifying the phenotypes and genotypes in the fallible and infallible data files (items 2.1.2 and 2.1.3). There are a minimum of two and a maximum of three columns in this file. The first column indicates the nature of the categorical variable named in the second column of the corresponding row.
Choices are:

P = Phenotype;
A = Genomic data marker with genotypes or haplotype-pairs coded as two separate alleles or haplotypes;
C = Marker locus with genotypes coded by a single number.

As an example to distinguish between symbols A and C, consider a SNP locus containing genotypes DD, Dd, and dd. Using the symbol A, individual genotypes are written in the fallible and infallible data files using two columns. For example, the two columns “1 1”, “1 2”, and “2 2” may represent the genotypes DD, Dd, and dd. Note that this is the same coding that is used for genotypes in the LINKAGE programs (7). Using the symbol “C”, genotypes are coded using a single number. Using our example, we may make the assignments: “1” = DD, “2” = Dd, “3” = dd, although this assignment is arbitrary. This coding is similar to the coding for genotypes used in the EH program (8). Also see http://linkage.rockefeller.edu/ott/eh.htm .

The second column in this file is the name of the variable corresponding to the symbol in the first column. This variable name is also used in the 1st line of both the fallible and infallible data files to indicate the nature of the corresponding columns in those files.

The third column, which is optional, indicates the symbol that is used for missing data in the fallible and infallible data files. If no value is provided, then the program will assume the default settings that “-1” is the code for missing phenotype data and “0” is the missing code for genotype or haplotype pair data. Using the default settings, one uses “0 0” for an individual’s missing genotype if genotypes are coded using the A symbol and “0” for an individual’s missing genotype if genotypes are coded using the C symbol. If one wants to use the value of -99 for missing genotype data, then similarly one types “-99 -99” for an individual’s missing genotype if genotypes are coded using the A symbol and “-99” for an individual’s missing genotype if genotypes are coded using the C symbol.
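The column rules and default missing-data codes described above can be sketched in a few lines of Python. This is a minimal illustration under the stated rules, not the parser used by the LRTae program; the function name `parse_description` and the sample text are assumptions made for this sketch.

```python
# Default missing-data codes stated in this guide:
# "-1" for phenotypes (P), "0" for genomic data (A or C).
DEFAULT_MISSING = {"P": "-1", "A": "0", "C": "0"}


def parse_description(text):
    """Return a list of (type symbol, variable name, missing code) tuples.

    Hypothetical helper for illustration; not part of the LRTae program.
    """
    variables = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns: {line!r}")
        kind, name = fields[0], fields[1]
        if kind not in DEFAULT_MISSING:
            raise ValueError(f"unknown variable type: {kind!r}")
        # The third column is optional; fall back to the default code.
        missing = fields[2] if len(fields) == 3 else DEFAULT_MISSING[kind]
        variables.append((kind, name, missing))
    return variables


# Sample description text (illustrative variable names and codes).
sample = """\
P pheno1 -1
P pheno2 -99
A SNP1 95
C Mkr2
"""
parsed = parse_description(sample)
for entry in parsed:
    print(entry)
```

The last entry has no third column, so it receives the default genomic missing code “0”.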
We present an example of the contents of a description file here.

(Example description file)
P  pheno1  -1
P  pheno2  -99
A  SNP1    95
C  Mkr2

In this file, we see that there are four variables of interest. Two correspond to phenotypes and two correspond to genomic data. The first phenotype variable is called “pheno1” and the second is called “pheno2”. The first genomic variable is labeled “SNP1” and the second is labeled “Mkr2”. For the variable pheno1, missing data is indicated by a “-1” code, while for the variable pheno2, missing data is indicated by a “-99” code. For variable SNP1, missing data is indicated by a “95” code. There is no specified code for missing data with the variable Mkr2, so the default code “0” is used.

2.1.2 Fallible data file

This file contains phenotype and genotype classifications for all individuals as measured with the fallible measuring instrument (see section 1.1 for definition of fallible). The format for this file is as follows:

Ind_ID Order_of_phenotype_and_genomic_data

There are two key formatting issues regarding this file and the infallible data file (2.1.3). They are:

1) The first line of the file always consists of the order of the phenotype and genomic data. The order is determined by using the variable names provided in the description file (item 2.1.1).
2) The first column of each row is always the individual ID (Ind_ID), which must be an alphanumeric string of characters.

The format for this file is best explained through an example. Suppose that we have an example description file as provided in item 2.1.1 above. Then a fallible data file might look like:

(Example fallible data file)
      pheno1  SNP1   Mkr2  pheno2
A1    0       1 1    2     1
U2    1       1 2    1     -99
A3    0       2 2    4     0
A4    1       95 95  1     1
U12   1       1 2    2     0
…

In this example, we provide fallible phenotype and genomic classifications for five individuals. As mentioned above, the first column for both the fallible and the infallible data files MUST be the individual ID and it must be an alphanumeric string.
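The layout rules above (a header line of variable names, an ID in the first column, two columns for type A variables and one column otherwise) can be sketched in Python as follows. This is an illustration only, not the program's own reader. It assumes that the two alleles of a type A genotype are separated by whitespace and that a genotype is treated as missing if either column holds the missing code; the names `VARIABLES` and `parse_data_file` are hypothetical.

```python
# Variable definitions as they would come from the description file:
# name -> (type symbol, missing-data code). Values follow the example above.
VARIABLES = {
    "pheno1": ("P", "-1"),
    "pheno2": ("P", "-99"),
    "SNP1": ("A", "95"),
    "Mkr2": ("C", "0"),
}


def parse_data_file(text, variables):
    """Return {individual ID: {variable name: value, or None if missing}}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, data_rows = rows[0], rows[1:]
    records = {}
    for fields in data_rows:
        ind_id, values = fields[0], fields[1:]
        record, pos = {}, 0
        for name in header:
            kind, missing = variables[name]
            width = 2 if kind == "A" else 1  # type A uses two columns
            cell = values[pos:pos + width]
            pos += width
            # Assumption: any column equal to the missing code marks the
            # whole value as missing.
            record[name] = None if missing in cell else " ".join(cell)
        if pos != len(values):
            raise ValueError(f"unexpected number of columns for {ind_id}")
        records[ind_id] = record
    return records


fallible = """\
pheno1 SNP1 Mkr2 pheno2
A1 0 1 1 2 1
U2 1 1 2 1 -99
A3 0 2 2 4 0
A4 1 95 95 1 1
U12 1 1 2 2 0
"""
records = parse_data_file(fallible, VARIABLES)
print(records["A4"]["SNP1"])    # missing genotype, coded "95 95"
print(records["U2"]["pheno2"])  # missing phenotype, coded "-99"
print(records["U12"]["SNP1"])   # genotype "1 2"
```

The same function could be applied to an infallible data file, since both files share the layout.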
As indicated in the example description file, the first column after the individual ID column is the individual’s classification for the phenotype pheno1. All phenotype classifications must always be either 0 (case) or 1 (control). In this example file, we see that the fallible classifications of the phenotype “pheno1” for the five individuals are (respectively): case, control, case, control, control.

Because the 1st line of this file indicates that the data after the phenotype pheno1 is genotype data for the variable SNP1 and genotypes are coded using two columns (one for each allele; symbol A in the example description file), we can determine that the respective genotypes for individuals at this marker locus are: 1 1, 1 2, 2 2, missing data, and 1 2.

The 1st line of the example fallible data file indicates that the next data are coded genotypes for the variable “Mkr2” that are coded using a single column (symbol C in the example description file). In the example fallible data file, we observe three different coded genotypes. The respective coded genotypes for these individuals are: 2, 1, 4, 1, 2.

Finally, the last column in this example data file represents classification for a second phenotype, pheno2. Individuals have the respective phenotypes: control, missing data, case, control, case, using the fallible measuring instrument for phenotype pheno2.

2.1.3 Infallible data file

This file contains phenotype and genotype classifications for individuals as measured with the infallible measuring instrument (see section 1.1 for definition of infallible). Note that the list of individuals in this file is a subset of the list of individuals in the fallible data file. An individual is listed in this file only if the individual has an infallible measurement for at least one of the phenotypes and/or genomic data variables in the description file. The format for this file is as follows:

Ind_ID Order_of_phenotype_and_genomic_data

As above, we illustrate with an example file.
Suppose that the data below is the infallible data file corresponding to the example description and fallible data files above. Notice that individual A4 is missing from what we report. That means that there is no double-sample data for that individual. Also notice that there is no information for variable pheno2; that means that any analysis with the phenotype pheno2 will not include double-sample information on pheno2.

(Example infallible data file)
      pheno1  SNP1  Mkr2
A1    0       1 1   2
U2    1       2 2   0
A3    0       2 2   4
U12   0       1 2   2
…

In this example, we provide infallible measures for the phenotype variable pheno1, for the SNP marker locus variable SNP1, and for the coded genotype variable Mkr2. As mentioned above, the first column for both the fallible and the infallible data files MUST be the individual ID and it must be an alphanumeric string.

As indicated in the example description file, the first column after the individual ID column is the individual’s classification for the phenotype pheno1. All phenotype classifications must always be either 0 (case) or 1 (control). In this example file, we see that the infallible classifications for the four individuals are (respectively): case, control, case, case. It is interesting to note that in these example files, individual U12 is classified as a case using the infallible classifier and as a control using the fallible classifier (Example fallible data file). This is an example of a phenotype misclassification. This information is used when computing the LRTae statistic (1).

Because the 1st line of this example file indicates that the data after the phenotype pheno1 is a genotype consisting of two columns (one for each allele; symbol A in the example description file), we can determine that the respective genotypes for individuals at this marker locus are: 1 1, 2 2, 2 2, and 1 2.
The 1st line of this example file indicates that the next data are coded genotypes for the variable Mkr2 (coded using a single column; symbol C in the example description file). In the example infallible data file, the respective coded genotypes for the four individuals are: 2, missing data (coded 0), 4, and 2.

Finally, because the example infallible data file has no column for the second phenotype, pheno2, no infallible classifications are available for that variable.

2.2 Program usage

Usage of the program is as follows:

Usage: lrtae [OPTIONS] <fallible file> <infallible file> <description file>

We explain each item:

[OPTIONS]: A list of options that may be used when running LRTae (see Section 2.2.1).
<fallible file>: The name of the fallible data file (item 2.1.2)
<infallible file>: The name of the infallible data file (item 2.1.3)
<description file>: The name of the description file (item 2.1.1)

2.2.1 Command line options

-p <# perms>         Permute phenotypes only
-g <# perms>         Permute genomic data only
-b <# perms>         Permute both phenotypes and genomic data
-j <genomic_list>    Run statistics for the genomic data specified
                     genomic_list should be a comma separated list
-k <phenotype_list>  Run statistics for the phenotypes specified
                     phenotype_list should be a comma separated list
-c <conf_interval>   Set the confidence interval (default is 0.95 = 95%)
-o                   Specify output file
-h                   Show help
-v                   Show version
Note: If no genomic data or phenotype is specified all will be used

We explain each of these options in the order of their appearance above.

-p <# perms>: This option assumes that double-sample information is only available for genotypes. Permutations are performed by randomly reassigning phenotypes, keeping each individual’s information on genotypes fixed (including double-sample information). The total number of cases and controls is also fixed for each replicate.
The number of permutations is specified by the user (# perms). For example, typing “-p 10000” means that 10,000 permutations will be performed by randomly reassigning observed case and control status. For more details on p-values, see section 1.1.2 above. Note that if this option is chosen for a phenotype variable for which double-sample data are available, permutation p-values may not be valid.

-g <# perms>: This option assumes that double-sample information is only available for phenotypes. Permutations are performed by randomly reassigning genotypes or haplotype pairs, keeping each individual’s information on phenotypes fixed (including double-sample information). The total number of each type of genotype or haplotype pair is also fixed for each replicate. The number of permutations is specified by the user (# perms). For example, typing “-g 10000” means that 10,000 permutations will be performed by randomly reassigning genotypes or haplotype pairs. For more details on p-values, see section 1.1.2 above. Note that if this option is chosen for a genomic data variable for which double-samples are available, permutation p-values may not be valid.

Note: For any run of the LRTae program, only one of the options “-p” or “-g” may be chosen.

-b <# perms>: With this option, permutations are performed on both phenotypes and genomic data. THIS OPTION IS NOT RECOMMENDED AT PRESENT. MORE RESEARCH MUST BE DONE TO EVALUATE P-VALUES DETERMINED USING THIS OPTION.

-j <genomic_list>: This option is chosen if the user only wants to perform LRTae analysis on a subset of the genomic data provided in the data files (items 2.1.1-2.1.3). For example, suppose that genotype and/or haplotype pair data are available for variables Mkr1, Mkr2, Mkr3, Mkr4, Hap1, Hap2, and Hap3. If this option is not chosen, then the program will compute the LRTae statistic for all genomic data.
However, if the user types the option “-j Mkr1,Mkr3,Hap1”, then the LRTae statistic will only be computed for the Mkr1, Mkr3, and Hap1 variables. Note: With this option, the list of genomic data variables MUST be separated by commas and there can be no spaces between variables. For example, the program will not run if the user types “-j Mkr1, Mkr3, Hap1” with spaces after the commas.

-k <phenotype_list>: This option is chosen if the user only wants to perform LRTae analysis on a subset of the phenotype variables provided in the data files (items 2.1.1-2.1.3). For example, suppose that phenotype data are available for variables p1, p2, p3, and p4. If this option is not chosen, then the program will compute the LRTae statistic for all four variables. However, if the user types the option “-k p2,p4”, then the LRTae statistic will only be computed for the p2 and p4 variables. Note: As above (-j option), the list of phenotype variables MUST be separated by commas and there can be no spaces between variables. For example, the program will not run if the user types “-k p2, p4” with a space after the comma.

-c <conf_interval>: This option enables the user to specify the confidence level of the interval for the p-value obtained by permutation. For example, if the user types “-c 0.x”, then an x% confidence interval for the permutation p-value will be provided. So, if the user types “-c 0.99”, then a 99% confidence interval will be calculated, and if the user types “-c 0.90”, then a 90% confidence interval will be calculated. This confidence interval is determined using the method implemented in the BINOM program (see http://linkage.rockefeller.edu/ott/linkutil.htm#BINOM for more information). If this option is not selected and permutations are performed, the default confidence interval provided is 95%.

-o: This option allows the user to specify the name of the output file, which contains the results of all analyses (for example, “-o sim-snp-lrtae.out”, as in Section 2.3).
-h: Choosing this option will provide the user with a list of command-line options (Section 2.2.1).

-v: This option provides the user with the version of the program being run.

2.3 Example files with this distribution

We provide two sets of example data with this program. Both are simulated data sets. The first contains double-sample genotype data for a SNP marker. The fallible, infallible, and description files are: sim-snp-fall.txt, sim-snp-infall.txt, and sim-snp-desc.txt, respectively. The output file is labeled sim-snp-lrtae.out.

The second set of files contains information for two phenotypes (labeled simP1 and simP2) and for coded genotypes (labeled simMicro1; here Micro stands for “Microsatellite” marker). The fallible, infallible, and description files are: sim-micro-fall.txt, sim-micro-infall.txt, and sim-micro-desc.txt, respectively. For the second set of files, data are simulated so that the first phenotype is associated with the coded genotypes and the second phenotype is not. The output file for this data set is labeled sim-micro-lrtae.out.

If we type:

> lrtae -o sim-snp-lrtae.out -p 10000 -c 0.99 sim-snp-fall.txt sim-snp-infall.txt sim-snp-desc.txt

at the command line, then the program will compute the LRTae statistic using the observed phenotype data and the double-sample data for SNP genotypes. The program also computes permutation p-values by randomly permuting case/control status for all individuals. A total of 10,000 permutations are performed and a 99% confidence interval for each permutation p-value is computed. The output is written to a file called “sim-snp-lrtae.out”. We present the contents of this file in the next section.

3.0 INTERPRETING RESULTS FROM LRTae OUTPUT

As with any statistical analysis program, an accurate interpretation of the results is critical to the success of the analysis. We provide a sample output file, produced by running the LRTae program on the first sample data set (see section 2.3 above).
LRTae program version 1.0
Written by Chad Haynes
Supervised by Derek Gordon

Input files:
sim-snp-desc.txt    (Description file)
sim-snp-fall.txt    (Fallible data file)
sim-snp-infall.txt  (Infallible data file)

Command line options selected:
10000 Permutations - Randomly reassign phenotypes
Output file name - sim-snp-lrtae.out

Program run at Tue Jan 04 17:51:57 2005

**************************************
** Genomic Data Marker: SNP1 **
** Phenotype Marker: pheno1 **
**************************************

Population Frequencies:
                  1/1      1/2      2/2
LRTae   Case      0.00000  0.09843  0.90157
        Control   0.02619  0.22172  0.75209
LRTstd  Case      0.02400  0.10800  0.86800
        Control   0.05200  0.21600  0.73200

Misclassification:

Genomic Data
                  (Obs)
                  1/1      1/2      2/2      N
(True)  1/1       1.00000  0.00000  0.00000  1
        1/2       0.08000  0.92000  0.00000  25
        2/2       0.01064  0.02128  0.96809  94

Phenotype
                  (Obs)
                  Case     Control  N
(True)  Case      1.00000  0.00000  0
        Control   0.00000  1.00000  0

LRT Statistics:
        Hyp  LogLike  LRT       Df  Asym-P   Perm-P   99.0% Permutation CI
LRTae   H1   647.82   18.13652  2   0.00012  0.00000  (0.00000, 0.00046)
        H0   656.89
LRTstd  H1   291.47   14.70875  2   0.00064  0.00100  (0.00037, 0.00214)
        H0   298.82

Please cite the following when reporting results from the LRTae program:
Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, and Haroutunian V (2004) Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat Appl Genet and Mol Biol 3:Article 26. http://www.bepress.com/sagmb/vol3/iss1/art26/.

We first notice that the program output indicates what genomic data (SNP1) and phenotype (pheno1) variables are being considered. Next, the program provides MLEs of the individual SNP genotype frequencies in cases and controls using the LRTae method and the LRTstd method (i.e., the method that does not use the double-sample information). Notice that the results are different.
Using the double-sample information, the MLEs for the 1/1 genotype in cases and controls with the LRTae method are 0.000 and 0.026, respectively (genotypes are written with a “/” symbol to separate the alleles). Using the LRTstd method, the MLEs for the 1/1 genotype in cases and controls are 0.024 and 0.052, respectively. Similarly, for the 2/2 genotype, the LRTae method provides MLEs of 0.902 and 0.752 for cases and controls (respectively), while the LRTstd method provides MLEs of 0.868 and 0.732 (respectively). What accounts for this difference? One can glean the answer by studying the genotype misclassification table. In this table, we provide MLEs of the misclassification probabilities, namely Pr(observed genotype = a | true genotype = b), where a and b are any of the three genotypes 1/1, 1/2, or 2/2. The last column in this table, with the heading “N”, provides the sample size on which these calculations are based. So for example, we see that, based on the 94 genotype double-sample observations, the 2/2 genotype is misclassified approximately 2% of the time as the heterozygote 1/2 and approximately 1% of the time as the homozygous 1/1 genotype. Similarly, based on 25 observations, the heterozygote 1/2 genotype is misclassified approximately 8% of the time as the 1/1 genotype. Therefore, many of the “observed” 1/1 genotypes are really misclassified heterozygotes and 2/2 genotypes. This accounts for the decreased 1/1 genotype frequency estimates with the LRTae method as opposed to the LRTstd method.

Notice also that, because we have no double-sample data for the phenotype variable pheno1, the classification is assumed to be “perfect”. Also, the sample size N for the phenotype misclassification is 0.

The log-likelihoods under the null (H0) and alternative (H1) hypotheses (also see Section 1.1.2) of the LRTae and LRTstd methods are provided below the misclassification tables.
For each method, twice the difference of the log-likelihoods provides the value of the test statistic. We compute respective values of 18.137 and 14.709 for the LRTae and LRTstd methods. Since this data set was simulated by randomly introducing errors into genotypes, the results are consistent with the findings of Mote and Anderson (9) and others (10, 11) that the LRTae statistic has greater significance. The added power comes from the correctly classified genotypes, which are not used by the LRTstd method (1). Note also that the log-likelihoods for the LRTae method are larger, indicating that additional information is being used.

Also, the output file provides the degrees of freedom (Df) for each likelihood ratio test (LRT). In this example, there are two degrees of freedom for each test, since there are three genotype classification categories possible for the di-allelic locus being tested. The asymptotic p-value is computed assuming that the null distribution is a central chi-square distribution with the indicated degrees of freedom. The asymptotic p-value is the probability of observing an LRT statistic at least as large as the corresponding value, assuming the null distribution is correct.

Because the LRTae method estimates the 1/1 genotype frequencies to be small (0 for cases!), we perform a permutation analysis to determine the accuracy of the p-value based on asymptotic theory. Because we have double-sample data only for genotypes, we choose the “-p” option (see Sections 1.1.2 and 2.2.1 above) and perform 10,000 permutations. Results of the permutation analysis indicate that the p-values based on asymptotic theory are consistent with the p-values based on permutation (both asymptotic p-values are within the 99% confidence intervals for the p-values based on permutation). However, as mentioned above (Section 1.1.2), these results should be viewed with some caution, as more extensive simulations need to be conducted.
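The relationship between the reported LRT statistics and the asymptotic p-values can be checked with a short Python sketch. It relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of a central chi-square distribution has the closed form exp(-x/2); other degrees of freedom would require a general chi-square routine. The function name `chi2_sf_2df` is hypothetical, and because the sign convention of the printed log-likelihoods is not stated in the output, the sketch uses the absolute difference.

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a central chi-square with 2 df.

    Valid only for 2 degrees of freedom, where the survival
    function reduces to exp(-x / 2).
    """
    return math.exp(-x / 2.0)


# Twice the absolute difference of the printed LRTae log-likelihoods
# reproduces the reported statistic up to rounding of the printed values.
approx_lrtae = 2.0 * abs(656.89 - 647.82)
print(f"LRTae statistic from log-likelihoods: {approx_lrtae:.2f}")

# Asymptotic p-values from the LRT statistics reported in the sample output.
for method, stat in [("LRTae", 18.13652), ("LRTstd", 14.70875)]:
    print(f"{method}: LRT = {stat:.5f}, Asym-P = {chi2_sf_2df(stat):.5f}")
```

Rounded to five decimal places, the two p-values agree with the Asym-P column of the sample output (0.00012 and 0.00064).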
3.1 A note about selection of double-samples

When using the LRTae statistic, we assume that the double-sample information is obtained by randomly selecting a subset of individuals for classification with an infallible measure. However, this assumption may be violated. For example, with a disease like Alzheimer’s, there may be a bias toward obtaining the “gold-standard” diagnosis of presence/absence of plaques and tangles only for observed cases, since observed controls are assumed to have died of other causes and therefore may not have autopsies performed. Similarly, researchers may only scrutinize very rare observed genotypes, because errors in these can be particularly costly in terms of power loss for case/control studies (12).

4.0 PROBLEMS? COMMENTS?

If there are problems in the execution or compilation of this program or if you would like to provide some feedback, please e-mail lrtae@linkage.rockefeller.edu

5.0 ACKNOWLEDGEMENTS

The authors of this software gratefully acknowledge grant K01-HG00055 from the National Institutes of Health.

6.0 REFERENCES

Below are references for this README file. Please cite the first reference when reporting results using the LRTae software.

1. Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., and Haroutunian, V. 2004. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat Appl Genet and Mol Biol 3:Article 26. http://www.bepress.com/sagmb/vol3/iss1/art26/.
2. Tenenbein, A. 1970. A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 65:1350-1361.
3. Tenenbein, A. 1972. A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics 14:187-202.
4. Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc B 39:1-38.
5. Cox, D.R., and Hinkley, D.V. 1979. Theoretical Statistics. Boca Raton: CRC Press.
6. Ott, J. 1999. Analysis of Human Genetic Linkage. Baltimore: The Johns Hopkins University Press.
7. Terwilliger, J.D., and Ott, J. 1994. Handbook of Human Genetic Linkage. Baltimore: Johns Hopkins.
8. Xie, X., and Ott, J. 1993. Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 53:1107 (Abstract).
9. Mote, V.L., and Anderson, R.L. 1965. An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52:95-109.
10. Gordon, D., Finch, S.J., Nothnagel, M., and Ott, J. 2002. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 54:22-33.
11. Rice, K.M., and Holmans, P. 2003. Allowing for genotyping error in analysis of unmatched cases and controls. Ann Hum Genet 67:165-174.
12. Kang, S.J., Gordon, D., and Finch, S.J. 2004. What SNP genotyping errors are most costly for genetic association studies? Genet Epidemiol 26:132-41.