Biostatistics 237 / Biomathematics 207B / HG207B
February 17, 2004
Account name: m237
Password: winter2002, win2002

Lab 6: Multipoint Mapping and Genotype Error Checking

(A) Overview

This exercise has two parts. In part 1, we will map the disorder episodic ataxia relative to 4 markers on chromosome 12 using data from a single large pedigree. Episodic ataxia is a rare autosomal dominant disease that is almost completely penetrant. The data were originally presented by Litt et al. (1994, Amer. J. Hum. Gen., 55:702-709). The pedigree and the analysis are discussed in Professor Lange’s book (1997), chapters 7 and 9 (see pages 116-117 and 158-159). The pedigree is in the file labeled eaped.in. We will run the final 4 markers (D12S100, CACNL1A1, D12S372, pY2/1), first separately (two-point) and then as a multipoint analysis. We will use Mendel version 5 (written by Dr. Kenneth Lange), option 2, location_scores.

In part 2, we will check pedigree data for genotyping errors in microsatellite markers on chromosome 21. The underlying model is the likelihood model we discussed in class; it can use the information from multiple markers at once rather than one marker at a time. We will use Mendel version 5, option 5, mistyping.

(B) Getting ready to run Mendel:

We need the following files for this week:

(1) Part 1 uses files:
    ea4.in (the control file)
    ealoc.in (the locus names and allele frequencies)
    eapen.in (penetrance file)
    eamap.in (the order and recombination fractions between the markers)
    eaped.in (the pedigrees)

(2) Part 2 uses files:
    control_err.in (the control file)
    errloc.in (the locus names and allele frequencies)
    errmap.in (the order and recombination fractions between the markers)
    errped.in (the pedigrees)

To visualize the pedigree we need:
    errgap.dat (the pedigrees in gap format).
(C) PART 1: Location Scores, Running multipoint linkage analysis

(1) Understanding the input files:

The file ea4.in contains all the commands needed to determine the most likely location of the episodic ataxia susceptibility locus relative to 4 markers. (See the pedigree in Lange, Chapter 7, page 115.)

    MAP_FILE = eamap.in                  ! map file
    PEDIGREE_FILE = eaped.in             ! pedigree file
    PENETRANCE_FILE = eapen.in           ! penetrance file
    LOCUS_FILE = ealoc.in                ! marker information
    ANALYSIS_OPTION = LOCATION_SCORES    ! analysis option
    !INTERIOR_POINTS = 41                ! same # of points between markers
    TRAVEL = GRID                        ! grid or search (search finds max)
    grid_increment=0.001                 ! even 0.001 morgan increments
    NUMBER_OF_MARKERS_INCLUDED = 4       ! number of markers considered
    OUTPUT_FILE = ea4.out                ! full output
    SUMMARY_FILE = sumea4.out            ! short version of output

This file has features in common with the other control files. Analysis option location_scores estimates the location scores (when running only one marker, the location score is equivalent to a LOD score). Most of the other commands have to do with input and output files. The exceptions include travel = grid. This command ensures that the location score is calculated at a set number of points. The command grid_increment = 0.001 ensures that these points are 0.001 Morgan apart. Alternatively, we could have a constant number of points between each pair of markers (e.g. Interior_points = 41).

The command Number_of_Markers_included allows us to analyze fewer markers at a time than are available in the map or locus file. This command can be useful when there are many markers. We may want to limit the analysis to the markers near the current putative location, since markers far away contain little information and increase the computation time.

The marker and disease allele information is located in the file ealoc.in. Check out the contents of ealoc.in:
    S100 AUTOSOME 6 0
    1 0.05970
    2 0.31343
    3 0.35075
    4 0.08955
    5 0.07463
    6 0.11194
    CACNL1A1AUTOSOME 7 0
    1 0.01408
    2 0.00704
    7 0.01408
    8 0.06338
    9 0.65493
    10 0.21831
    12 0.02817
    .
    .
    .
    DISEASE AUTOSOME 2 2
    1 .9999
    2 .0001
    101 3
    1/1
    1/2
    2/2
    201 3
    1/1
    1/2
    2/2

This file is similar to the GAP mdf file. The first line starts with the marker name (S100). Note that the name in the locus file must match the name in the map file. S100 is short for D12S100. S100 is located on an autosomal chromosome and there are 6 possible alleles. It is codominant, so we don’t need to specify the genotypes. The next six lines are the allele names and the corresponding allele frequencies. The file ealoc.in contains the information for all 4 markers as well as the designation for the phenotype and the corresponding putative alleles. When considering a quantitative phenotype, the disease locus is coded as codominant.

The location_scores option uses a penetrance file. The penetrance file looks like:

    DISEASE PROB DISEASE 2
    101 0.99900 0.01000 0.01000
    201 0.00100 0.99000 0.99000

The first line of the file gives the trait name (Disease) for which the penetrance is defined. This name must match the name in the locus file. Prob indicates that the penetrance is a conditional probability rather than a density function. The second Disease indicates where the penetrance classes (101 and 201) can be found in the pedigree file, and the 2 indicates that there are two phenotypic categories. Normal is coded by 101 in the file and affected is coded by 201.

The second line lists the normal phenotype designation and its penetrances. The penetrance is the probability of the phenotype given the genotype: P(normal|1/1) = 0.999, P(normal|1/2) = P(normal|2/2) = 0.01. The final line lists the affected phenotype designation and its penetrances: P(affected|1/1) = 0.001, P(affected|1/2) = P(affected|2/2) = 0.99. See the Mendel manual, section 0.5.7, for how to code a quantitative trait.
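The penetrance table can be sanity-checked by hand or with a few lines of code. The following Python sketch is not part of Mendel; it simply copies the numbers from eapen.in and the disease allele frequencies from ealoc.in, checks that the penetrances for each genotype sum to 1 over the two phenotype classes, and applies Bayes' rule to get P(genotype | phenotype) for a single unrelated person. The Hardy-Weinberg genotype frequencies and all function names are assumptions made for illustration only.

```python
# Penetrances copied from eapen.in: P(phenotype | genotype),
# listed in genotype order 1/1, 1/2, 2/2.
GENOTYPES = ("1/1", "1/2", "2/2")
PENETRANCE = {
    "101": (0.999, 0.01, 0.01),  # normal
    "201": (0.001, 0.99, 0.99),  # affected
}
# Disease-locus allele frequencies copied from ealoc.in.
ALLELE_FREQ = {"1": 0.9999, "2": 0.0001}


def penetrance_column_sums():
    """For each genotype, sum P(phenotype | genotype) over both phenotypes."""
    return [
        sum(PENETRANCE[pheno][i] for pheno in PENETRANCE)
        for i in range(len(GENOTYPES))
    ]


def genotype_priors():
    """Population genotype frequencies, ASSUMING Hardy-Weinberg equilibrium."""
    p1, p2 = ALLELE_FREQ["1"], ALLELE_FREQ["2"]
    return (p1 * p1, 2 * p1 * p2, p2 * p2)


def genotype_posterior(phenotype):
    """P(genotype | phenotype) for one unrelated person, by Bayes' rule."""
    joint = [
        prior * pen
        for prior, pen in zip(genotype_priors(), PENETRANCE[phenotype])
    ]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    print("column sums:", penetrance_column_sums())
    for genotype, post in zip(GENOTYPES, genotype_posterior("201")):
        print(genotype, post)
```

This calculation ignores relatives entirely; in the lab, Mendel combines the same penetrances with the genotypes and phenotypes of the whole pedigree.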
(2) Run Mendel using Gregor:

In general, go to the start menu and find Mendel5. Select the magic lantern labeled Gregor. First select the option Choose Working Directory and select the subdirectory containing your files. Now select the option Read in Control File. Make any modifications to the control file you want by selecting the option Modify Control File. Now select the option Write Control File and then Run Mendel.

Specifically, we will read in the control file ea4.in. Mendel requires that the control file be named control.in, so select Write Control.in in Gregor. Then select Run Mendel.

(3) Understanding the output: multipoint linkage analysis

Examine the output file sumea4.out with Gregor, Notepad or an equivalent editor.

    POINT   MARKER   MAP DISTANCE   LOCATION SCORE
      1               -0.1106          2.8355
      2               -0.1106          2.8355
      3               -0.1100          2.8370
    ...
    299                0.1800          2.8186
    300                0.1810          2.8164
    301                0.1810          2.8164

    THE BEST LOCATION SCORE OF 3.8462 OCCURS AT POINT172.

The first column gives the grid point number. The second column denotes the location of the markers (as opposed to the intervening points). The third column gives the location in Morgans and the final column gives the corresponding location score. At the end of the file, the best location score over all the considered points is repeated. More details of the analysis can be found in the output file ea4.out.

(4) Single marker Lod scores.

Rerun the program to use the data from one marker at a time by changing the command Number_of_markers_included = 4 to Number_of_markers_included = 1. Remove the "!" from in front of interior_points = 41 and put a "!" in front of grid_increment = 0.001, or make the changes using Modify Control Parameters. Now the data for each marker will be used individually, just like in lab 5. Change the output file names so that the earlier output files are not overwritten. Run the analysis. The summary file now contains only the maximum lod over the grid points for each marker separately.
The lod scores and recombination fractions can be plotted using Excel. These values are available in the long output file.

(D) PART TWO, ERROR CHECKING:

(1) Understanding the input files.

The control file for the error checking is called error.in. If you open it you should see:

    OUTPUT_FILE = error1.out
    PEDIGREE_FILE = errped.in
    LOCUS_FILE = errLoc.in
    MAP_FILE = errMap.in
    ANALYSIS_OPTION = Mistyping
    MODEL = 1
    MALE = 1
    FEMALE = 2

(2) The analysis_option, mistyping, has 5 models. Model 1 detects Mendelian inconsistencies. Model 2 detects Mendelian-consistent errors and estimates the posterior probability of an error at each genotyped member of the pedigree, as well as the posterior probability of an error in at least one member of the family. Model 2 assumes that genotyping errors are uniformly distributed among the available genotypes. Models 3 and 4 estimate overall genotyping error rates. These two models take a long time to run, so beware if you try them. Model 5 is similar to model 2 except that it assumes that the genotyping errors are distributed according to their population frequencies. The default prior genotyping error rate is 0.025. This error rate means that, a priori, any given genotype has a 2.5% chance of being incorrect.

(3) Running model 1.

Using Gregor, read in the control file error.in and write over the control.in file. Then run Mendel using option 5, model 1. The analysis will run quite quickly.

(4) Understanding the output.

Open error1.out and scroll down to the bottom. Just before the end you should see:

    GENOTYPING ERROR OPTION

    PEDIGREE   PEDIGREE   LOCUS       ERROR NEAR
    NUMBER     NAME       NAME        PERSON NAMED
       1          1       D21S1256        2
       7         12       D21S1256       77
       1          1       D21S1914        2
       1          1       D21S263         2

Families 1 and 12 have Mendelian inconsistencies at marker D21S1256, and family 1 also has inconsistencies at D21S1914 and D21S263. Examine the pedigree file to see if you can identify the errors.
    4 1 2 3 4 1 1 1 2 2 1 2 1 1 1 1 2 1
    04/01 06/04 06/06 04/04 02/01 10/07 07/03 10/07 10/06 07/02

Although it is possible that person 2 in pedigree 1 is in error at marker D21S1256, it is also possible that person 1 or person 3 is in error. At marker D21S1914, person 1 seems the most likely mistyped member. At marker D21S263, it is not possible to determine whether person 2, person 4 or both are in error. The most cautious approach would be to delete all the data for these three markers in family 1 until the gels can be reexamined or the samples retyped.

Now examine family 12.

    10 1 2 3 4 5 6 76 77 78 87 12 1 2 1 2 3 5 76 76 4 6 77 77 1 2 1 2 1 2 1 2 2 2 1 1 1 1 1 1 1 1 2 2
    04/03 06/05 04/04 08/06 08/07 08/07 09/04 10/04

Again, it is not clear whether the genotyping error at D21S1256 is in person 76, 77 or 78.

(5) Change the model to 2. Remove “!” from in front of “summary_file” and “new_pedigree_file” in error.in, or go to Modify Control Parameters -> Output Options and fill in the fields “summary_file” and “new_pedigree_file” with “sumerror2.out” and “errped2.in” respectively. Change the output file name to error2.out and rerun the analysis. You now have a summary file “sumerror2.out”. You should have an additional potential error.

    PEDIGREE   PEDIGREE   PERSON    LOCUS                   ERROR
    NUMBER     NAME       NAME      NAME        PHENOTYPE   PROBABILITY
       1          1       ANYONE    D21S1256                1.00000
       1          1       1         D21S1256    01/04       0.94253
       1          1       ANYONE    D21S1914                1.00000
       1          1       1         D21S1914    01/02       1.00000
       1          1       ANYONE    D21S263                 1.00000
       1          1       4         D21S263     02/07       0.90936
       6         11       ANYONE    D21S263                 0.67063
       6         11       74        D21S263     01/06       0.62346
       7         12       ANYONE    D21S1256                1.00000
       7         12       78        D21S1256    04/04       0.79051

In addition, there is a new pedigree file. Genotypes with error probabilities over the threshold value (default 0.25) have been eliminated in this file.

Additional Analyses to try:

(6) Change the model to 5, change the output file name, the summary file name and the new pedigree file name, and rerun. How do your results differ from the model 2 results?
(7) Change the prior from the default (0.025) to 0.05. Compare the results to the previous run. How does changing the prior change the posterior probabilities?

(8) Why is person 74's genotype at D21S263 likely to be in error?

(9) It can be helpful to have the pedigrees in front of you to better understand the errors. Use GAP (remember GAP?) to visualize the pedigrees. To read in the data, you will define three new variables that are 5 characters long, one for each of the three marker genotypes. Label the pedigrees with the subject ID and the three marker genotypes and print them out.

(10) If you have time (at least 1/2 hour), run model 3 to get an overall estimate of the mistyping rate.

THERE IS NO HOMEWORK FOR THIS LAB.

REMINDER: EXAM 2/19/04. OPEN NOTE, OPEN BOOK, BRING CALCULATOR.
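For review, the kind of Mendelian inconsistency that model 1 flags in Part 2 can be illustrated on a single parent-child trio at one codominant marker. The Python sketch below is only an illustration of the idea, not Mendel's algorithm (which works on whole pedigrees); the function names and the genotypes in the example are hypothetical and are not taken from errped.in.

```python
def split_genotype(genotype):
    """Split a genotype string such as '04/01' into its two allele labels."""
    first, second = genotype.split("/")
    return first, second


def is_mendelian_consistent(father, mother, child):
    """Return True if the child could have received one allele from each parent.

    Assumes a codominant autosomal marker with all three genotypes observed.
    """
    father_alleles = set(split_genotype(father))
    mother_alleles = set(split_genotype(mother))
    c1, c2 = split_genotype(child)
    # Either allele of the child may be the paternal one.
    return (c1 in father_alleles and c2 in mother_alleles) or (
        c2 in father_alleles and c1 in mother_alleles
    )


if __name__ == "__main__":
    # Hypothetical trio: father 01/02, mother 03/04.
    print(is_mendelian_consistent("01/02", "03/04", "01/03"))  # consistent
    print(is_mendelian_consistent("01/02", "03/04", "01/01"))  # mother has no 01
```

A check like this says only that at least one of the three genotypes is wrong. It cannot say which one, which is why the lab moves on to the posterior error probabilities of model 2.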