040310 Write up: Family Based Association

Biostatistics 237 /Biomathematics 207B/HG207B
March 9, 2004
Account name: m237
Password: winter2002, win2002
Laboratory #9:
FAMILY BASED ASSOCIATION TESTS: TDT AND GAMETE COMPETITION
(A) The data sets:
This exercise has two parts. In part I, we will run the TDT and the gamete competition
on angiotensin I-converting enzyme (ACE) as a qualitative trait. The ACE gene is located
on 17q23. When running the TDT or the gamete competition for qualitative traits, we will
consider anyone with an ACE level of less than 0.648 to be affected. The data set for
part I consists of extended and nuclear families from Oxford phenotyped for ACE and
genotyped for the insertion-deletion (ID) polymorphism and the highly informative
polymorphism in the neighboring growth hormone (GH) gene. As a prelude to part I we will
run the Combining_alleles option of Mendel 5.0 to reduce the number of GH alleles and
avoid sparse-data problems.
In part II we will use the gamete competition as a test of family-based association with a
quantitative trait. The data set for part II consists of extended and nuclear families from
Jamaica phenotyped for ACE. We will examine 3 SNPs located within the ACE locus.
The data consist of ACE levels on 405 people and SNP data on 489 people in 83
pedigrees. We will first run each SNP separately and then we will use the SNPs in
combination.
Copy the following data sets from the F:\class\bio237 folder to your directory:
Part I files:
pedoxf.in
locoxf.in
mapoxf.in
concomb.in
contdt.in
congam.in
Part II files:
Consnp.in
locsnp.in
mapsnp.in
varace.in
pedsnp.in
Part I: TDT and gamete competition for a qualitative trait
(B) Reducing the number of GH alleles.
To avoid a very large number of possible cells, many with no data, we will
combine alleles at the GH locus. This is essential for the gamete competition, which
otherwise runs extremely slowly. The TDT runs reasonably well without
collapsing the number of alleles; however, because of the discrete nature of the data, very
sparse data can lead to false inference with both methods. Combining very rare alleles
avoids this problem.
The control file in this case, concomb.in, is the following:
concomb.in:
!input files
LOCUS_FILE=LOCoxf.IN
MAP_FILE=MApoxf.IN
PEDIGREE_FILE=PEDoxf.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
pedigree_list_read=true
allele_separator=
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=Combining_alleles
OUTPUT_FILE=comb.out
Maximum_combined_alleles=7
This Mendel option creates a new locus file and corresponding pedigree file so it is
important to specify the new file names. Combining_alleles uses the allele frequencies in
the locus file to determine which alleles will be combined. The program combines alleles
until there are no more than the maximum number of alleles (user specified) and they are
at least as frequent as the minimum allele frequency (also user specified). The defaults
are a maximum of 10 alleles and a minimum allele frequency of 0.05. The minimum
number of alleles is 2, even if one of them has an allele frequency less than the specified
minimum allele frequency.
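The stopping rule described above can be illustrated with a minimal Python sketch. The handout does not say which alleles Mendel chooses to pool, so the greedy rule used here (always merge the two rarest groups) is an assumption made only for illustration, and the function name combine_alleles and the example frequencies are hypothetical.

```python
def combine_alleles(freqs, max_alleles=10, min_freq=0.05):
    """Greedy sketch: merge the two rarest groups until both limits are met.

    freqs maps allele label -> frequency. Returns (members, frequency) tuples.
    Illustrative only; this is NOT claimed to be Mendel's merging rule.
    """
    groups = [([label], f) for label, f in freqs.items()]
    while len(groups) > 2:  # never go below two alleles
        groups.sort(key=lambda g: g[1])
        too_many = len(groups) > max_alleles
        too_rare = groups[0][1] < min_freq
        if not (too_many or too_rare):
            break
        (m1, f1), (m2, f2) = groups[0], groups[1]
        groups = [(m1 + m2, f1 + f2)] + groups[2:]
    return sorted(groups, key=lambda g: g[1])


# Made-up frequencies, not the GH data.
example = {"a": 0.50, "b": 0.30, "c": 0.12, "d": 0.05, "e": 0.02, "f": 0.01}
result = combine_alleles(example, max_alleles=4, min_freq=0.05)
for members, freq in result:
    print(members, round(freq, 5))
```

With three alleles of frequency 0.97, 0.02 and 0.01, the same sketch stops at two groups even though the pooled allele is still below the minimum frequency, which mirrors the floor of 2 alleles described above.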
Run the combining allele option of Mendel 5.0 using Gregor by reading in this control
file, writing out the control.in file and selecting the option "Run Mendel". Examine the
new pedigree file and locus file and note the changes.
The new pedigree file is formatted (the top line gives the Fortran format) and the number
of alleles at the GH locus has been reduced.
(5(1X,A8),(T51,3(1X,A8),:))
1
1
1
1-2
20-20
1-2
8-8
2
1
2
2
1
2
The new locus file decodes the combined alleles.
GH       Autosome 7  0
6        0.12378     ORIGINAL ALLELE NUMBERS: 6 9 14
7        0.08870     ORIGINAL ALLELE NUMBERS: 3 4 5 7 12
8        0.10916     ORIGINAL ALLELE NUMBERS: 1 2 8 10 18
11       0.10039     ORIGINAL ALLELE NUMBERS: 11 16
13       0.07115     ORIGINAL ALLELE NUMBERS: 13 15
19       0.13158     ORIGINAL ALLELE NUMBERS: 17 19
20       0.37524     ORIGINAL ALLELE NUMBERS: 20
Note, for example, that alleles 1, 2, 10 and 18 have all been combined with allele 8.
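As a quick check on the new locus file, the short Python sketch below rebuilds the old-to-new allele recoding from the listing above and confirms that the combined frequencies sum to 1. The dictionary is copied from the listing; the helper recode_genotype and the "-" separator (as seen in the pedigree excerpt) are illustrative assumptions, not part of Mendel.

```python
# Combined allele -> (frequency, original alleles), copied from the locus file listing.
combined = {
    6: (0.12378, [6, 9, 14]),
    7: (0.08870, [3, 4, 5, 7, 12]),
    8: (0.10916, [1, 2, 8, 10, 18]),
    11: (0.10039, [11, 16]),
    13: (0.07115, [13, 15]),
    19: (0.13158, [17, 19]),
    20: (0.37524, [20]),
}

# Invert the table: original allele number -> combined allele number.
recode = {old: new for new, (_, olds) in combined.items() for old in olds}
total = sum(freq for freq, _ in combined.values())


def recode_genotype(genotype, sep="-"):
    """Hypothetical helper: recode a genotype string such as '2-18'."""
    return sep.join(str(recode[int(a)]) for a in genotype.split(sep))


print(recode_genotype("2-18"), round(total, 5))
```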
(C) Running the TDT
The control file now uses the new pedigree and locus files. The pedigree file is formatted,
so we no longer have the command pedigree_list_read=true. Instead we use the default,
pedigree_list_read=false.
contdt.in:
!input files
LOCUS_FILE=LOCnew.IN
MAP_FILE=MAPoxf.IN
PEDIGREE_FILE=PEDnew.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
allele_separator=
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=TDT
OUTPUT_FILE=TDT.out
Summary_File=TDTsum.out
samples=100000
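Control files such as the one above are plain keyword=value lines, with "!" starting a comment. A minimal Python sketch for reading one into a dictionary follows; it is handy for comparing two control files side by side. Treating keywords as case-insensitive is an assumption suggested by the mixed capitalization in this handout, and parse_control is a hypothetical helper, not a Mendel or Gregor function.

```python
def parse_control(text):
    """Read keyword=value lines into a dict; '!' starts a comment."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments and blanks
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()  # assumed case-insensitive
    return settings


sample = """\
! analysis options
ANALYSIS_OPTION=TDT
OUTPUT_FILE=TDT.out
Summary_File=TDTsum.out
samples=100000
"""
settings = parse_control(sample)
print(settings)
```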
In the pedigree file males are designated with 1 and females with 2. Affecteds (ACE less
than -0.648) are designated as 1 and unaffecteds as 2. Because the p-values are estimated
by Monte Carlo simulation, we need to specify the number of samples. The default is
10,000, but we have increased the number to 100,000.
Run the TDT option of Mendel 5.0 using Gregor by reading in this control file, writing
out the control.in file and selecting the option "Run Mendel". There will be two output
files, a summary file and a full output file. Examine them both. Note that the p-value is
given as 0.0000. The actual p-value is not 0.0000; it is reported as such because none of
the 100,000 samples gave a statistic as extreme as or more extreme than the
observed statistic. In this case you should report the p-value as "less than 1x10^-5"
(< 1/samples).
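To see why an empirical p-value of zero should be reported as a bound, here is a minimal Python sketch of a Monte Carlo p-value for the simplest case: a biallelic marker, where b and c count transmissions of each allele from heterozygous parents and the statistic is the usual (b - c)^2 / (b + c). This is a simplified illustration and not Mendel's implementation, which handles multiple alleles and pedigrees; the counts, function names and seed are made up.

```python
import random


def tdt_statistic(b, c):
    """McNemar-type TDT chi-square for a biallelic marker."""
    return (b - c) ** 2 / (b + c) if b + c else 0.0


def monte_carlo_pvalue(b, c, samples=10000, seed=0):
    """Fraction of null samples with a statistic at least as large as observed."""
    rng = random.Random(seed)
    n = b + c
    observed = tdt_statistic(b, c)
    hits = 0
    for _ in range(samples):
        # Under the null each heterozygous parent transmits either allele
        # with probability 1/2.
        k = sum(rng.random() < 0.5 for _ in range(n))
        if tdt_statistic(k, n - k) >= observed:
            hits += 1
    return hits / samples


def report(p, samples):
    """Report an empirical zero as an upper bound of 1/samples."""
    return f"p < {1 / samples:g}" if p == 0 else f"p = {p:g}"


p = monte_carlo_pvalue(60, 20, samples=2000)
print(report(p, 2000))
```

The resolution of the estimate is 1/samples, so increasing the number of samples is the only way to tighten the bound.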
(D) Running the gamete competition on a qualitative trait.
We will now analyze the data in pednew.in using the gamete competition. The gamete
competition uses data from all the affecteds in the pedigree rather than just the trios
with affected children. It also allows for missing data.
The control file, congam.in, has the following form:
!input files
LOCUS_FILE=LOCnew.IN
MAP_FILE=MApoxf.IN
PEDIGREE_FILE=PEDnew.IN
! reading input
AFFECTED=1
AFFECTED_LOCUS_OR_FACTOR=ACE
READ_PEDIGREE_RECORDS = F
allele_separator=
male =1
female=2
!new output
new_pedigree_file=pednew.in
new_locus_file=locnew.in
! analysis options
ANALYSIS_OPTION=gamete_competition
model=2
OUTPUT_FILE=gam.out
Summary_File=gamsum.out
The notable differences between this control file and the one for the TDT are:
(1) No number of samples is specified (asymptotic p-values only).
(2) There are model options. Models 1 and 2 are for qualitative traits. Models 3 and 4 are
for quantitative traits. Models 1 and 3 use the allele frequencies given in the locus
file. Models 2 and 4 jointly estimate the allele frequencies.
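The four model codes can be summarized as a two-way choice. The small Python sketch below simply encodes the rule stated in item (2); the function name is hypothetical and is not part of Mendel.

```python
def gamete_competition_model(quantitative, estimate_frequencies):
    """Return the MODEL value implied by the two choices described in item (2)."""
    if quantitative:
        return 4 if estimate_frequencies else 3
    return 2 if estimate_frequencies else 1


# Part I: qualitative trait, allele frequencies estimated jointly.
print(gamete_competition_model(quantitative=False, estimate_frequencies=True))
# Part II: quantitative trait, allele frequencies estimated jointly.
print(gamete_competition_model(quantitative=True, estimate_frequencies=True))
```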
Run the gamete competition option of Mendel 5.0 using Gregor by reading in this
control file, writing out the control.in file and selecting the option "Run Mendel". Again
there will be two output files, a summary file and a full output file. Examine them both
and compare the results with the results for the TDT.
PART II: Running the gamete competition on a quantitative trait.
(E) The input files.
The control file, Consnp.in, contains:
!input files
MAP_FILE = mapsnp.in
PEDIGREE_FILE = Pedsnp.in
variable_file=varace.in
LOCUS_FILE = locsnp.in
! output files
SUMMARY_FILE = Sumsnp.out
OUTPUT_FILE = Mendsnp.out
! instructions to read input
map_list_read=true
MALE = 1
FEMALE = 2
quantitative_trait=ACE
! analysis specific information
analysis_option=Gamete_competition
MODEL = 4
Transform = STANDARDIZE::ACE
Because we are running a quantitative trait and we want to jointly estimate the allele
frequencies, the model option is 4. We need to specify a variable_file and the name of
the quantitative trait. We asked that the trait be standardized (subtracting off the mean
and dividing by the standard deviation), although it isn't necessary in this case because
ACE values have already been standardized in the process of adjusting for age and sex
differences.
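For reference, standardizing a trait means subtracting the mean and dividing by the standard deviation, so that the result has mean 0 and standard deviation 1. A minimal Python sketch follows; the values are made up, and the use of the sample standard deviation (n - 1 divisor) is an assumption, since the handout does not say which divisor Mendel uses.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 divisor
    return [(v - m) / s for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up trait values
z = standardize(raw)
print([round(v, 3) for v in z])
```

Standardizing values that are already standardized leaves them essentially unchanged, which is why the transform is harmless here.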
There are some changes in the locus file and the pedigree file from the first part of the
lab. The SNPs have already been combined for you into a single locus. Two of the 8
haplotypes were estimated to be very rare, so they were combined with other haplotypes.
The markers are treated as non-codominant, so we must specify the relationship of the
phenotypes to the genotypes in the locus file.
t469     AUTOSOME 627
ATA      0.40190
ATG      0.00780
ACA      0.06740
ACG      0.18310
TEA      0.01340        TEA is TTA+TCA
TEG      0.32640        TEG is TTG+TCG
Note that because 122 denotes A/A T/C A/G, a double heterozygote, we need to specify
that there are two haplotype configurations that are consistent with the multilocus
genotype.
111      1
ATA/ATA
.
.
122      2
ATA/ACG
ATG/ACA
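The two configurations listed for phenotype 122 can be checked by enumerating every way of phasing the unphased genotype. The Python sketch below does this for any list of per-SNP allele pairs; the function name is hypothetical, and the sketch does not apply the pooling of rare haplotypes into TEA and TEG used in this locus file.

```python
from itertools import product


def haplotype_pairs(genotype):
    """genotype: list of (allele, allele) per SNP -> sorted unordered haplotype pairs."""
    pairs = set()
    for flips in product([False, True], repeat=len(genotype)):
        # At each SNP, decide which allele goes on the first haplotype.
        hap1 = "".join(b if flip else a for (a, b), flip in zip(genotype, flips))
        hap2 = "".join(a if flip else b for (a, b), flip in zip(genotype, flips))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


code_122 = haplotype_pairs([("A", "A"), ("T", "C"), ("A", "G")])
code_111 = haplotype_pairs([("A", "A"), ("T", "T"), ("A", "A")])
print(code_122)
print(code_111)
```

A genotype with h heterozygous SNPs (h at least 1) has 2^(h-1) distinct phasings, so a double heterozygote has two and a single heterozygote has one.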
We will also run the SNPs as single loci. These have also been coded with a single
number designation so we need to "decode" them in the locus file.
Snp4     AUTOSOME 2 3
A        0.80000
T        0.20000
1        1
A/A
2        1
A/T
3        1
T/T
Finally, I have used the Fortran format to read in the single loci snp4, snp6 and snp9 as
well as the multilocus SNP genotype for SNPs 4, 6 and 9 combined.
(3X,I5,A8)
(16X,3A8,7X,2A1,T69,5X,3A1,T69,2(2X,2A8))
10
1
1
2
1
2 112
112
122
122
-0.395
-1.788
This is an "old style" MENDEL pedigree file. The first Fortran format statement reads in
the number of individuals in the pedigree and the family id number. The second Fortran
format statement reads the information for each individual. There are data for 4
multilocus SNP combinations and we want to use only the last one. We could set this up
through the map file, but here I have just skipped over all the data I didn't want to include
using T69 (tab to the 69th column). I first read in the data for the individual SNPs, then I
return to the same column position and read in the data as a multilocus SNP.
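The trick of tabbing back to column 69 so that the same characters are read twice can be mimicked in Python. The sketch below interprets a hand-written list of X, T and A edit descriptors (it does not parse a Fortran format string) and applies the tail of the format above to a made-up record. The record layout, with the genotype code in columns 74-76, is an assumption for illustration and is not taken from pedsnp.in.

```python
def read_fields(line, descriptors):
    """Apply (code, n) edit descriptors to one record.

    'X' skips n columns, 'T' moves to column n (1-based), 'A' reads n characters.
    """
    line = line.ljust(132)  # assumption: short records are blank-padded
    pos, fields = 1, []
    for code, n in descriptors:
        if code == "X":
            pos += n
        elif code == "T":
            pos = n
        elif code == "A":
            fields.append(line[pos - 1:pos - 1 + n])
            pos += n
        else:
            raise ValueError(f"unsupported descriptor {code!r}")
    return fields


record = " " * 73 + "122"  # made-up record: genotype code in columns 74-76
# Tail of the format: T69,5X,3A1 and then T69,2X,A8 (first field of 2(2X,2A8)).
descriptors = [("T", 69), ("X", 5), ("A", 1), ("A", 1), ("A", 1),
               ("T", 69), ("X", 2), ("A", 8)]
fields = read_fields(record, descriptors)
print(fields)
```

The first three fields are the single-SNP codes, one character each; the fourth field covers the same columns again and yields the three-digit multilocus code.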
(F) Run Mendel 5.0 using the Gregor interface. Load in Consnp.in, write a new
control.in file and run.
The output:
There is a summary file that should look like this:
MARKER   P-VALUE   MAX OMEGA   FREQ      ALLELE   MIN OMEGA   FREQ      ALLELE
NAME                                     NAME                           NAME
Snp4     0.00000    1.07786    0.33534   T         0.00000    0.66466   A
Snp6     0.00000    0.00000    0.57608   C        -1.21367    0.42392   T
Snp9     0.00000    1.40464    0.51465   G         0.00000    0.48535   A
t469     0.00000    1.52939    0.32207   TEG       0.00000    0.40515   ATA
There is also a more complete output file with the actual test statistics, all parameter
estimates and their standard errors. The statistics are:
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.4917E+02 AT LOCUS Snp4.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.6006E+02 AT LOCUS Snp6.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.7655E+02 AT LOCUS Snp9.
THE LIKELIHOOD RATIO TEST STATISTIC IS 0.8129E+02 AT LOCUS t469.
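These statistics explain why the summary file prints the p-values as 0.00000. The Python sketch below converts the three single-SNP statistics to asymptotic p-values, assuming each is referred to a chi-square distribution with 1 degree of freedom (one free omega parameter for a two-allele marker). That degrees-of-freedom choice is an assumption on my part; check the full output file for the value Mendel reports. The haplotype locus t469 is omitted because it has more than two alleles.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return erfc(sqrt(x / 2.0))


# Likelihood ratio test statistics copied from the output above.
lrt = {"Snp4": 49.17, "Snp6": 60.06, "Snp9": 76.55}
for locus, stat in lrt.items():
    print(f"{locus}: LRT = {stat:.2f}, asymptotic p = {chi2_sf_1df(stat):.2e}")
```

Each p-value is far smaller than the five decimal places shown in the summary file, so it is displayed there as 0.00000.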
(G) NO Homework - Please start working on your project data. In next week's
laboratory I will reserve time at the end for you to get help running Mendel with
your project data if you are having problems.