Biostatistics 237B / Biomathematics 207B / Human Genetics 207B
March 2, 2004
Account name: m237 Password: winter2002, password: win2002

Laboratory #8: NON-PARAMETRIC LINKAGE ANALYSIS FOR A QUANTITATIVE TRAIT

This week, we will run two methods of gene mapping for quantitative traits: (1) Haseman-Elston regression using SAGE SIBPAL and (2) variance components using the Mendel option Polygenic_qtl.

(A) The data set:

The data set consists of 45 Jamaican extended and nuclear families phenotyped for angiotensin I-converting enzyme (ACE) and genotyped for two markers: an insertion-deletion (ind) polymorphism, previously found to be very close to, if not coincident with, the ACE gene, and a highly informative polymorphism in the neighboring growth hormone (GH) gene. The serum ACE levels have been normalized. These data represent a portion of the data originally collected and analyzed by McKenzie et al. (1995) American Journal of Human Genetics, 57:1426-35. The family data, locus, and map files are found in the F:Class\Bio237\lab8 directory.

(B) Visualizing the families using GAP:

Use GAP and the file acegap.in to make pictures of the pedigrees. GAP expects *.dat files for the ASCII input, but you can type in the name of the file. The file has the format pedigree id, person id, mom id, dad id, sex, ace, genotype ind, and genotype gh. Create a new data file and define the variables ace (floating point 6.3), ind (3 characters), and gh (5 characters). Then read in the data and define the missing data. Display the pedigrees with their ace phenotype, ind genotype, and gh genotype. Note that most of the families are nuclear families but that there are also some multigenerational families.

(C) Running SAGE:

(1) fsp: Like last week, we will need to run fsp. The parameter file is fspace.par:

    file structure
    1 0 0 112
    (A8,I8,3(4X,A4),7X,A1)

This file is just like last week's fsp.par file.

Line 1: project title (you decide).
Line 2: 1 = make lnk file, 0 = don't make segregation file, 0 = proband code is not included, 1 record line per individual, 1 = code for male, 2 = code for female. NOTE: These numbers must be specified in the exact column positions used in the class copy of the file.
Line 3: Fortran format for study id, family id, person id, mom id, dad id, sex.
    A8 = read in the first 8 positions as characters (the study name).
    I8 = read the next 8 positions as an integer (the family id).
    3(4X,A4) = do the following 3 times: skip 4 spaces, then read the next 4 positions as characters. This reads in person id, mom id, and dad id.
    7X = skip 7 spaces.
    A1 = read in the sex designation as a single character.

Notes: The format will depend on the data file. The A format must be used for study name, person id, mom id, dad id, and sex. The I (integer) format must be used for the family id. This means that the family id must be right justified in your file.

Run fsp using fspace.par with the data file hase.in to create a link file.

(2) Running sibpal using a quantitative trait:

The parameter file, hase.par, now has the following format:

    settings for haseman-elston
    5 5 5 0 0
    1 2 1 1 0 1 0 0
    ACE 0 -9
    (A8,I8,4x,A4,24X,F8.3,2(2X,A6))

The first two lines are just like last week. We give our study a title (line 1) and then decide for how many individuals diagnostic information will be given (line 2).

Line 3: 1 = quantitative trait, 2 markers, 1 trait, 1 = simple linear regression, 0 = no weight, 1 = plot squared difference versus ibd proportion, 0 = no covariates, 0 = use full and half sibs if present.
Line 4: trait name, 0 = no transformation, -9 = missing value code.
Line 5: file format (study id, family id, person id, trait, 2 markers).
    A8 = read in the first 8 spaces as characters for the study id.
    I8 = read in the next 8 spaces as an integer for the family id.
    4X = skip 4 spaces.
    A4 = read in the next 4 spaces as characters for the person id.
    24X = skip 24 spaces (skip over mom id, dad id, and sex).
    F8.3 = read the trait in the next 8 spaces as a floating point (continuous) variable with 3 decimals.
    2(2X,A6) = read in the 2 markers, skipping 2 spaces before each 6-character genotype.
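To make the column arithmetic concrete, the hase.par format (A8,I8,4X,A4,24X,F8.3,2(2X,A6)) can be read as a table of fixed column ranges. The following Python sketch is only an illustration of that idea and is not part of SAGE: the function name `parse_record`, the example record, and the `1/2` genotype spelling are all hypothetical, and the sketch assumes the trait is written with an explicit decimal point.

```python
# Column ranges implied by (A8,I8,4X,A4,24X,F8.3,2(2X,A6)); offsets are 0-based.
# This is an illustrative reader, not the code SAGE itself uses.
FIELDS = [
    ("study",   0,  8, str),    # A8
    ("family",  8, 16, int),    # I8 (must be right justified)
    ("person", 20, 24, str),    # 4X, A4
    ("trait",  48, 56, float),  # 24X, F8.3
    ("marker1", 58, 64, str),   # 2X, A6
    ("marker2", 66, 72, str),   # 2X, A6
]

MISSING = -9.0  # missing value code from line 4 of hase.par


def parse_record(line):
    """Split one fixed-width record into named fields."""
    record = {}
    for name, start, end, kind in FIELDS:
        raw = line[start:end]
        if kind is str:
            record[name] = raw.strip()
        else:
            record[name] = kind(raw)
    if record["trait"] == MISSING:
        record["trait"] = None  # treat -9 as missing
    return record


# Hypothetical record built to match the layout (the genotype spelling is made up).
example = (
    f"{'ACE':<8}{1:>8}{'':4}{'3':>4}{'':24}"
    f"{-0.317:8.3f}{'':2}{'1/2':>6}{'':2}{'5/9':>6}"
)

print(parse_record(example))
```

A record whose fields are shifted by even one column would put part of one value into the neighboring field, which is why the positions matter so much.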
NOTE: Again, the exact positions of these numbers in this file are very important.

Run SAGE sibpal using hase.par, hase.in, hase.loc, and fsp.lnk. If everything goes as planned, you should end up with two additional files, a hase.sum file and a hase.out file. Look at the out file to convince yourself everything ran properly. Then open the sum file and look at the results. You should see a table with the effective degrees of freedom, the mean proportion of genes shared ibd for full sibs, the mean proportion of genes shared ibd for half sibs, the t-value for the regression coefficient, the p-value, the intercept estimate, and the slope estimate. If the p-value for a marker is less than 0.05, the results for that marker are plotted.

(D) Running variance components using the specially modified version of Mendel:

(a) To run Mendel we need a control file, a locus file, a pedigree file, a map file, and a variable file. File names are given in the control file:

    ! input files
    MAP_FILE = mapace.in
    LOCUS_FILE = locace.in
    PEDIGREE_FILE = pedace.in
    VARIABLE_FILE = varace.in

The control file is very similar to the control files we used last week. The major differences are specific to the analysis. We are now running the option polygenic_qtl, which runs a variance components analysis. Our model assumes that the trait variance depends on additive polygenes and the environment, and we are testing whether it also depends on a quantitative trait locus.

    ! analysis specific commands
    ANALYSIS_OPTION = polygenic_qtl
    PREDICTOR = Grand :: ACE
    QUANTITATIVE_TRAIT = ACE
    COVARIANCE_CLASS = Additive
    COVARIANCE_CLASS = Environmental
    COVARIANCE_CLASS = Qtl
    INTERIOR_POINTS = 1

The locus file looks very similar to the locus files we used last week. Because we want to compare the Mendel results to the SAGE results and because we don't have many families, we use population estimates of the allele frequencies.

    ID      Autosome  2  0
    1       0.47211
    2       0.52789
    GH      Autosome 20  0
    1       0.08542
    ...
    20      0.37710

The pedigree file is comma delimited and has a format similar to your project pedigree file. The format is pedigree id, person id, dad id, mom id, sex, twin designation, markers, and trait data.

    1,1,,,1,,1-2,,0.408,
    1,2,,,2,,1-2,9-17,-0.52,
    1,3,2,1,2,,1-2,5-9,-0.317,

The map file has a familiar format:

    ID
    0.001
    GH

We also need a variable file. It gives the name of the trait and the lower and upper bounds on its values:

    ACE -5.0 5.0

(b) IMPORTANT: COPY THE VERSION OF MENDEL IN THE CLASS DIRECTORY (Comp_VAR) TO YOUR TEMP DIRECTORY. Start up Mendel using Gregor. Set the working directory and read in the control file. Write out the control file and run Mendel.

(c) The results will be found in the summaryace.out file. We have the location scores at three points (the two markers plus the one interior point).

(d) Add in sex as a fixed effect in the model by adding the following line to the control file: Predictor = sex :: ace. In Gregor, go to modify control file, quantitative analysis option, predictor. Add ,sex::ace and then write the control.in file.

(e) Compare the results with and without the sex effect.

(E) Running variance components using a standard version of Mendel:

Our special version of Mendel will only work with relatively small families. Version 5.0 of Mendel (the standard version) works for arbitrarily large families. If you try to run our files as is with Mendel 5.0, you will get the following error message:

    ***ERROR*** THE COEFFICIENT FILE DOES NOT EXIST OR IS EMPTY

Because exact calculation of the conditional kinship coefficients can take too long when the families get large, Mendel 5.0 doesn't calculate them but instead requires the user to provide a file of conditional kinship coefficients created by the program Simwalk2. Simwalk2 uses a Monte Carlo procedure called simulated annealing to sample the possible descent states and determine estimates of the conditional kinship coefficients.
See the manual and the example input files for an example. Simwalk2 is available from the genetics.ucla.edu/software website. It requires the old-style Mendel file formats. These files can be made using the program Mega2. For your projects, either use the special version of Mendel we used today or use Mega2 and Simwalk2.

There are many statistical genetics software packages. The best place to get a comprehensive list is the Rockefeller statistical genetics website: http://linkage.rockefeller.edu/soft/ Check out this resource.

(F) Homework (due 3/11/04):

(1) Use the homozygous genotype frequencies in the out file to calculate the heterozygosity of the two markers. (Note that these are based on the data in the hase.loc file.)

    H_ID =
    H_GH =

(2) Using SAGE, the p-value for the ID marker was larger than the GH marker's p-value in the second analysis. This is counterintuitive because the ID marker is closer to the ACE gene (it's actually in the gene) than the GH marker. List any possible reasons why this strange result might happen.

(3) Are the results from Haseman-Elston regression and variance components in agreement? Why or why not?
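
As background for question (3), the regression behind the SIBPAL output in section (C) can be sketched in a few lines of Python. Haseman-Elston regression regresses the squared trait difference of each sib pair on the pair's estimated proportion of genes shared ibd at the marker; a significantly negative slope suggests linkage. This is a minimal sketch under stated assumptions, not SIBPAL's implementation: the function name `haseman_elston` and the six sib pairs are made up for illustration, the pairs are treated as independent, and no p-value is computed (that would need the t distribution with the effective degrees of freedom).

```python
import math


def haseman_elston(pairs):
    """Simple linear regression of squared sib-pair trait differences
    on the estimated ibd proportion.

    pairs: list of (trait_sib1, trait_sib2, ibd_proportion).
    Returns (intercept, slope, t_value for the slope).
    """
    y = [(a - b) ** 2 for a, b, _ in pairs]  # squared trait differences
    x = [p for _, _, p in pairs]             # ibd proportions
    n = len(pairs)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance with n - 2 degrees of freedom (pairs assumed independent).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope / se_slope


# Hypothetical sib pairs: (trait of sib 1, trait of sib 2, ibd proportion).
toy_pairs = [
    (0.0, 2.0, 0.0),
    (0.0, 1.5, 0.0),
    (0.0, 1.2, 0.5),
    (0.0, 1.0, 0.5),
    (0.0, 0.5, 1.0),
    (0.0, 0.2, 1.0),
]

intercept, slope, t_value = haseman_elston(toy_pairs)
print(f"intercept={intercept:.3f} slope={slope:.3f} t={t_value:.3f}")
```

In this toy data, pairs that share more genes ibd have more similar trait values, so the fitted slope is negative. The variance components approach in section (D) uses the full pedigrees rather than sib pairs only, which is one thing to keep in mind when comparing the two sets of results.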