040303 Lab 8 - UCLA Human Genetics

Biostatistics 237B /Biomathematics 207B / Human Genetics 207B
March 2, 2004
Account name: m237 Password: winter2002, password: win2002
Laboratory #8:
NON-PARAMETRIC LINKAGE ANALYSIS FOR A QUANTITATIVE TRAIT
This week, we will run two methods of gene mapping for quantitative traits: (1) Haseman-Elston regression using SAGE SIBPAL and (2) variance components using Mendel
option Polygenic_qtl.
(A) The data set:
The data set consists of 45 Jamaican extended and nuclear families phenotyped for
angiotensin I-converting enzyme (ACE) and genotyped for an insertion-deletion (ind)
polymorphism, previously found to be very close if not coincident in location with the
ACE gene, and a highly informative polymorphism in the neighboring growth hormone
(GH). The serum ACE levels have been normalized. These data represent a portion of
the data originally collected and analyzed by McKenzie et al. (1995) American Journal of
Human Genetics, 57:1426-35. The family data, locus, and map files are found in the
F:Class\Bio237\lab8 directory.
(B) Visualizing the families using GAP.
Use GAP and the file acegap.in to make pictures of the pedigrees. GAP expects *.dat
files for the ascii input but you can type in the name of the file. The file has format
pedigree id, person id, mom id, dad id, sex, ace, genotype ind and genotype gh. Create a
new data file and define variables ace (floating point 6.3), ind (3 characters) and gh (5
characters). Then read in the data and define the missing data. Display the pedigrees with
their ace phenotype, ind genotype and gh genotype. Note that most of the families are
nuclear families but that there are also some multigenerational families.
(C) Running SAGE:
(1) fsp:
Like last week, we will need to run fsp. fspace.par:
file structure
1
0
0
112
(A8,I8,3(4X,A4),7X,A1)
This file is just like last week's fsp.par file.
Line 1: project title (you decide).
Line 2: 1 = make lnk file, 0 = don't make segregation file, 0 = proband code is not
included, 1 record line per individual, 1 = code for male, 2 = code for female. NOTE:
These numbers must be specified in the exact column positions shown here.
Line 3: Fortran format for study id, family id, person id, mom id, dad id, sex.
A8 = read in the first 8 positions as characters (the study name).
I8= read the next 8 spaces as integer (family id),
3(4X,A4) = do the following 3 times: skip 4 spaces, then read the next 4 positions as
characters. This procedure reads in person id, mom id and dad id.
7X = skip 7 spaces
A1 = read in sex designation as a single character.
Notes: The format will depend on data file.
The A format must be used for study name, person id, mom id, dad id and sex. The I
(integer) format must be used for the family id. This means that the family id must be
right justified in your file.
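To make the column arithmetic concrete, here is a minimal Python sketch of how the format (A8,I8,3(4X,A4),7X,A1) maps onto character positions in one record. The field names, the helper parse_record and the sample record are illustrative assumptions, not part of SAGE; the only thing taken from the lab is the format string itself.

```python
# Column layout implied by (A8,I8,3(4X,A4),7X,A1).
# Slices are 0-based and end-exclusive; names are hypothetical labels.
FIELDS = [
    ("study",   0,  8, str),  # A8: study name, characters
    ("family",  8, 16, int),  # I8: family id, right-justified integer
    ("person", 20, 24, str),  # 4X then A4
    ("mom",    28, 32, str),  # 4X then A4
    ("dad",    36, 40, str),  # 4X then A4
    ("sex",    47, 48, str),  # 7X then A1
]

def parse_record(line):
    """Split one fixed-width record into named fields."""
    record = {}
    for name, start, end, kind in FIELDS:
        raw = line[start:end]
        record[name] = int(raw) if kind is int else raw.strip()
    return record

# A made-up 48-character record, built piece by piece so the columns are visible.
sample = (
    "ACE     "      # study name, padded to 8
    + "       1"    # family id, right justified in 8
    + "    " + "   3"   # skip 4, person id
    + "    " + "   2"   # skip 4, mom id
    + "    " + "   1"   # skip 4, dad id
    + "       " + "2"   # skip 7, sex code
)

print(parse_record(sample))
```

The same slicing idea applies to any other Fortran format in this lab: add up the widths of the preceding descriptors to find where each field starts.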
Run fspace.par with the data file hase.in to create a link file.
(2) Running sibpal using a quantitative trait:
The parameter file, hase.par now has the following format:
settings for haseman-elston
5
5
5
0
0
1
2
1
1
0
1
0
0
ACE
0
-9
(A8,I8,4x,A4,24X,F8.3,2(2X,A6))
The first two lines are just like last week. We give our study a title (row 1) and then we
need to decide for how many individuals diagnostic information will be given (row 2).
(line 3) 1 = quantitative trait, 2 markers, 1 trait, 1 = simple linear regression, 0 = no
weight, 1 = plot squared difference versus ibd proportion, 0 = no covariates, 0 = use full
and half sibs if present.
(line 4) trait name, 0 = no transformation, -9 = missing
(line 5) file format: (study id, family id, person id, trait, 2 markers)
A8 = read in first 8 spaces as characters for the study id
I8 = read in the next 8 spaces as an integer for the family id
4X = skip 4 spaces
A4 = read in next 4 spaces as characters for the person id
24X = skip 24 spaces (skip over mom id, dad id and sex)
F8.3 = read the trait in the next 8 spaces as a floating point (continuous) variable with 3
decimals.
2(2X,A6) = then read in the 2 markers, skipping 2 spaces before each 6-character genotype.
NOTE: again the exact positions of these numbers in this file are very important.
Run SAGE sibpal using hase.par, hase.in, hase.loc, and fsp.lnk. If everything goes as
planned you should end up with two additional files, a hase.sum file and a hase.out
file. Look at the out file to convince yourself everything ran properly. Then open the sum
file and look at the results. You should see a table with effective degrees of freedom, the
mean proportion of genes shared ibd for full sibs, the mean proportion of genes shared
ibd for half sibs, the t-value for the regression coefficient, the p-value, the intercept
estimate and the slope estimate. If the p-value for a marker is less than 0.05, the results for
that marker are plotted.
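The quantities in that table come from regressing the squared sib-pair trait difference on the estimated proportion of alleles shared ibd; a negative slope is evidence for linkage. The following is a minimal sketch of that regression in plain Python. It is not SIBPAL's implementation: the function name and the six sib pairs are made up for illustration, and the pairs are treated as independent, whereas SIBPAL adjusts the degrees of freedom for pairs drawn from the same sibship.

```python
import math

def haseman_elston(pairs):
    """Least-squares fit of squared trait difference on ibd proportion.

    pairs: list of (trait_sib1, trait_sib2, ibd_proportion) tuples.
    Returns (intercept, slope, t_value_for_slope).
    """
    y = [(a - b) ** 2 for a, b, _ in pairs]   # squared differences
    x = [pi for _, _, pi in pairs]            # ibd proportions
    n = len(pairs)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)  # naive standard error
    return intercept, slope, slope / se_slope

# Hypothetical sib pairs: pairs sharing more alleles ibd have more similar traits.
toy_pairs = [
    (1.2, -0.8, 0.0), (0.9, -0.6, 0.0),
    (0.5, -0.5, 0.5), (0.3, -0.4, 0.5),
    (0.2,  0.0, 1.0), (-0.1, 0.1, 1.0),
]
intercept, slope, t_value = haseman_elston(toy_pairs)
print(intercept, slope, t_value)
```

Because linkage predicts a negative slope, the test on the t-value is one-sided.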
(D) Running variance components using the specially modified version of Mendel:
(a) To run Mendel we need a control file, a locus file, a pedigree file, a map file, and a
variable file.
File names are given in the control file:
! input files
MAP_FILE = mapace.in
LOCUS_FILE = locace.in
PEDIGREE_FILE = pedace.in
VARIABLE_FILE = varace.in
The control file is very similar to the control files we used last week. The major
differences are specific to the analysis. We are now running the option polygenic_qtl that
runs variance components analysis. Our model assumes that the variance depends on
additive polygenes and the environment and we are testing whether it depends on a
quantitative trait locus.
! analysis specific commands
ANALYSIS_OPTION = polygenic_qtl
PREDICTOR = Grand :: ACE
QUANTITATIVE_TRAIT = ACE
COVARIANCE_CLASS = Additive
COVARIANCE_CLASS = Environmental
COVARIANCE_CLASS = Qtl
INTERIOR_POINTS = 1
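The three COVARIANCE_CLASS lines correspond to the usual variance components decomposition of the trait covariance between relatives i and j. As a sketch in standard notation (the symbols are assumptions introduced here, not taken from the Mendel output):

```latex
\operatorname{Cov}(y_i, y_j)
  = 2\Phi_{ij}\,\sigma_a^2
  + \hat{\pi}_{ij}\,\sigma_q^2
  + \delta_{ij}\,\sigma_e^2
```

Here \Phi_{ij} is the kinship coefficient (additive polygenic class), \hat{\pi}_{ij} is the estimated proportion of alleles shared ibd at the map position being tested (Qtl class), and \delta_{ij} is 1 when i = j and 0 otherwise (environmental class). The test for a quantitative trait locus compares the maximized likelihood with \sigma_q^2 free against the maximized likelihood with \sigma_q^2 = 0.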
The locus file looks very similar to the locus files we used last week.
Because we want to compare the Mendel results to the SAGE results and because we
don't have many families, we use population estimates of the allele frequencies.
ID        Autosome 2 0
1         0.47211
2         0.52789
GH        Autosome20 0
1         0.08542
...
20        0.37710
The pedigree file is comma delimited and has a similar format to your project pedigree
file. The format is pedigree id, person id, dad id, mom id, sex, twin designation, markers,
and trait data.
1,1,,,1,,1-2,,0.408,
1,2,,,2,,1-2,9-17,-0.52,
1,3,2,1,2,,1-2,5-9,-0.317,
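As an illustration of how these records break into fields, here is a minimal Python sketch that reads one line of the comma-delimited layout. It is not part of Mendel; the function name and dictionary keys are hypothetical. The two parent columns are kept as an unlabeled pair, and empty fields are returned as None.

```python
import csv

def parse_pedigree_line(line):
    """Split one pedigree record into named fields (illustrative only)."""
    fields = next(csv.reader([line]))
    # The trailing comma yields an extra empty field, so keep the first nine.
    ped, person, parent_a, parent_b, sex, twin, id_gt, gh_gt, ace = fields[:9]

    def genotype(text):
        # "5-9" becomes ("5", "9"); an empty field means untyped.
        return tuple(text.split("-")) if text else None

    return {
        "pedigree": ped,
        "person": person,
        "parents": (parent_a or None, parent_b or None),
        "sex": sex,
        "twin": twin or None,
        "ID": genotype(id_gt),
        "GH": genotype(gh_gt),
        "ACE": float(ace) if ace else None,
    }

founder = parse_pedigree_line("1,1,,,1,,1-2,,0.408,")
child = parse_pedigree_line("1,3,2,1,2,,1-2,5-9,-0.317,")
print(founder)
print(child)
```

Founders are recognizable by their empty parent fields, and a missing genotype (GH for person 1) appears as an empty field between two commas.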
The map file has a familiar format:
ID
0.001
GH
We also need a variable file. It has the following format:
ACE
-5.0
5.0
The name of the trait and the lower and upper bounds on the values.
(b) IMPORTANT: COPY THE VERSION OF MENDEL IN THE CLASS
DIRECTORY (Comp_VAR) TO YOUR TEMP DIRECTORY. Start up Mendel
using Gregor. Set the working directory and read in the control file. Write out the
control file and run Mendel.
(c) The results will be found in the summaryace.out file. We have the location scores at
three points.
(d) Add in sex as a fixed effect in the model by adding the following line to the control
file: Predictor = sex :: ace. In Gregor, go to modify control file, quantitative
analysis option, predictor. Add ,sex::ace and then write the control.in file.
(e) Compare the results with and without the sex effect.
(E) Running variance components using a standard version of Mendel:
Our special version of Mendel will only work with relatively small families. Version 5.0
of Mendel (the standard version) works for arbitrarily large families. If you try to run our
files as is with Mendel 5.0, you will get the following error message:
***ERROR*** THE COEFFICIENT FILE DOES NOT EXIST OR IS EMPTY
Because exact calculation of the conditional kinship coefficients can take too long when
the families get large, Mendel 5.0 doesn't calculate them but instead requires the user to
provide a file of conditional kinship coefficients created by the program Simwalk2.
Simwalk2 uses a Monte Carlo procedure called simulated annealing to sample the
possible descent states and determine estimates of the conditional kinship coefficients.
See the manual and the example input files for an example. Simwalk2 is available from
the genetics.ucla.edu/ software website. It requires the old style Mendel file formats.
These files can be made using the program Mega2. For your projects either use the
special version of Mendel we used today or you will need to use Mega2 and Simwalk2.
There are many statistical genetics software packages. The best place to get a
comprehensive list is at the Rockefeller statistical genetics website:
http://linkage.rockefeller.edu/soft/
Check out this resource.
(F) Homework (due 3/11/04):
(1) Use the homozygous genotype frequencies in the out file to calculate the
heterozygosity of the two markers. (Note these are based on the data in the hase.loc file).
HID=
HGH=
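As a reminder of the definition, heterozygosity is one minus the total frequency of homozygous genotypes; under Hardy-Weinberg proportions each homozygote frequency is the square of the allele frequency. A minimal sketch follows, using a made-up three-allele marker and not the lab's markers; the function names are hypothetical.

```python
def heterozygosity(homozygote_freqs):
    """H = 1 - sum of homozygous genotype frequencies."""
    return 1.0 - sum(homozygote_freqs)

def hardy_weinberg_heterozygosity(allele_freqs):
    """Same quantity, with homozygote frequencies taken as p_i squared."""
    return heterozygosity(p * p for p in allele_freqs)

# Hypothetical marker with allele frequencies 0.5, 0.3 and 0.2.
example_h = hardy_weinberg_heterozygosity([0.5, 0.3, 0.2])
print(example_h)
```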
(2) Using SAGE, the p-value for the ID marker was larger than the GH marker’s p-value
in the second analysis. This is counterintuitive because the ID marker is closer to the
ACE gene (it’s actually in the gene) than the GH marker. List any possible reasons why
this strange result might happen.
(3) Are the results from Haseman-Elston regression and variance components in
agreement? Why or why not?