040219 Lab 6 - UCLA Human Genetics

Biostatistics 237 /Biomathematics 207B/HG207B
February 17, 2004
Account name: m237
Password: winter2002, win2002
Lab 6:
Multipoint Mapping and Genotype Error Checking
(A) Overview
This exercise has two parts. In part 1, we will map the disorder episodic ataxia relative
to 4 markers on chromosome 12 using data from a single large pedigree. Episodic ataxia
is a rare autosomal dominant disease that is almost completely penetrant. The data were
originally presented by Litt et al. (1994, Amer. J. Hum. Gen., 55:702-709). The pedigree
and the analysis are discussed in Professor Lange’s book (1997), chapters 7 and 9 (see
pages 116-117 and 158-159). The pedigree is in the file labeled eaped.in. We will run
the final 4 markers (D12S100, CACNL1A1, D12S372, pY2/1), first separately (two-point)
and then as a multipoint analysis. We will use Mendel version 5 (written by Dr.
Kenneth Lange), option 2 - location_scores. In part 2, we will check pedigree data for
genotyping errors in microsatellite markers on chromosome 21. The underlying model is
the likelihood model we discussed in class; it can use the information from multiple
markers at once rather than one marker at a time. We will use Mendel version 5,
option 5 - mistyping.
(B) Getting ready to run Mendel:
We need the following files for this week:
(1) Part 1 uses files:
ea4.in (the control file)
ealoc.in (the locus names and allele frequencies)
eapen.in (penetrance file)
eamap.in (the order and recombination fractions between the markers)
eaped.in (the pedigrees)
(2) Part 2 uses files:
control_err.in (the control file)
errloc.in (the locus names and allele frequencies)
errmap.in (the order and recombination fractions between the markers)
errped.in (the pedigrees)
To visualize the pedigree we need:
errgap.dat (the pedigrees in gap format).
(C) PART 1: Location Scores, Running multipoint linkage analysis
(1) Understanding the input files:
The file ea4.in contains all the commands needed to determine the most likely location of
the episodic ataxia susceptibility locus relative to 4 markers. (See the pedigree in Lange,
Chapter 7, page 115).
MAP_FILE = eamap.in                 ! map file
PEDIGREE_FILE = eaped.in            ! pedigree file
PENETRANCE_FILE = eapen.in          ! penetrance file
LOCUS_FILE = ealoc.in               ! marker information
ANALYSIS_OPTION = LOCATION_SCORES   ! analysis option
!INTERIOR_POINTS = 41               ! same # of points between markers
TRAVEL = GRID                       ! grid or search (search finds max)
grid_increment=0.001                ! even 0.001 morgan increments
NUMBER_OF_MARKERS_INCLUDED = 4      ! number of markers considered
OUTPUT_FILE = ea4.out               ! full output
SUMMARY_FILE = sumea4.out           ! short version of output
This file has features in common with the other control files. The analysis option
location_scores computes location scores (when only one marker is used, the location
score is equivalent to a LOD score). Most of the other commands deal with input
and output files. The exceptions include travel = grid, which ensures that the
location score is calculated at a set number of points. The command grid_increment =
0.001 ensures that these points are 0.001 Morgan apart. Alternatively, we could use a
constant number of points between each pair of adjacent markers (e.g. Interior_points = 41).
The command Number_of_Markers_included allows us to analyze fewer markers at a time
than are available in the map or locus file. This command can be useful when there are
many markers. We may want to limit the analysis to the markers near the current
putative location, since markers far away contain little information and increase the
computation time.
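To make the role of the "!" character concrete, here is a minimal Python sketch that reads the ea4.in commands shown above into a dictionary. It is only an illustration of the keyword = value layout and of how a leading "!" switches a command off; the function name parse_control is hypothetical and this is not how Mendel or Gregor actually reads the file.

```python
def parse_control(text):
    """Return {KEYWORD: value} for the active commands in a control file.

    Assumption for this sketch: everything after a '!' is a comment, so a
    line that starts with '!' is ignored entirely.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop the comment part
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().upper()] = value.strip()
    return settings


ea4 = """
MAP_FILE = eamap.in                 ! map file
PEDIGREE_FILE = eaped.in            ! pedigree file
PENETRANCE_FILE = eapen.in          ! penetrance file
LOCUS_FILE = ealoc.in               ! marker information
ANALYSIS_OPTION = LOCATION_SCORES   ! analysis option
!INTERIOR_POINTS = 41               ! same # of points between markers
TRAVEL = GRID                       ! grid or search (search finds max)
grid_increment=0.001                ! even 0.001 morgan increments
NUMBER_OF_MARKERS_INCLUDED = 4      ! number of markers considered
OUTPUT_FILE = ea4.out               ! full output
SUMMARY_FILE = sumea4.out           ! short version of output
"""

settings = parse_control(ea4)
print(settings["TRAVEL"], settings["GRID_INCREMENT"])
print("INTERIOR_POINTS" in settings)  # commented out, so not active
```

With the file as given, the grid increment is active and the interior_points command is not, which matches the description above.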
The marker and disease allele information is located in the file ealoc.in:
Check out the contents of ealoc.in.
S100    AUTOSOME 6 0
1 0.05970
2 0.31343
3 0.35075
4 0.08955
5 0.07463
6 0.11194
CACNL1A1AUTOSOME 7 0
1 0.01408
2 0.00704
7 0.01408
8 0.06338
9 0.65493
10 0.21831
12 0.02817
.
.
.
DISEASE AUTOSOME 2 2
1 .9999
2 .0001
101 3
1/1
1/2
2/2
201 3
1/1
1/2
2/2
This file is similar to the GAP mdf file. The first line gives the marker name (S100). Note
that the name in the locus file must match the name in the map file. S100 is short for
D12S100. S100 is located on an autosomal chromosome and has 6 possible alleles.
It is codominant, so we don’t need to specify the genotypes. The next six lines give the
allele names and the corresponding allele frequencies. The file ealoc.in contains the
information for all 4 markers as well as the designation for the phenotype and the
corresponding putative alleles. When considering a quantitative phenotype, the disease
locus is coded as codominant.
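A quick sanity check on a locus file is that the allele frequencies at each marker add up to 1, apart from rounding. The Python sketch below does this for the two markers listed above. The helper name frequencies_ok and the tolerance of 0.0001 are assumptions made for this illustration, not part of Mendel.

```python
import math

# Allele frequencies copied from ealoc.in as shown above.
s100 = {1: 0.05970, 2: 0.31343, 3: 0.35075,
        4: 0.08955, 5: 0.07463, 6: 0.11194}
cacnl1a1 = {1: 0.01408, 2: 0.00704, 7: 0.01408, 8: 0.06338,
            9: 0.65493, 10: 0.21831, 12: 0.02817}


def frequencies_ok(freqs, tol=1e-4):
    """True if the allele frequencies sum to 1 within rounding error."""
    return math.isclose(sum(freqs.values()), 1.0, abs_tol=tol)


for name, freqs in [("S100", s100), ("CACNL1A1", cacnl1a1)]:
    print(name, len(freqs), "alleles, sum =", round(sum(freqs.values()), 5),
          "ok" if frequencies_ok(freqs) else "CHECK")
```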
The location_scores option uses a penetrance file. The penetrance file looks like:
DISEASE PROB DISEASE 2
101 0.99900 0.01000 0.01000
201 0.00100 0.99000 0.99000
The first line of the file gives the trait name (Disease) for which penetrance is defined.
This name must match the name in the locus file. Prob indicates that the penetrance is a
conditional probability rather than a density function. The second Disease indicates
where the penetrance classes (101 and 201) can be found in the pedigree file and the 2
indicates there are two phenotypic categories. Normal is coded by 101 in the file and
affected is coded by 201. The second line lists the normal phenotype designation and the
penetrances. The penetrance is the probability of the phenotype given the genotype.
P(normal|1/1)=0.999, P(normal|1/2) = P(normal|2/2) = 0.01. The final line lists the
affected phenotype designation and the penetrance, P(affected|1/1) = 0.001,
P(affected|1/2) = P(affected|2/2) = 0.99. See the Mendel manual section 0.5.7 to see
how to code a quantitative trait.
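Because every person is either normal (101) or affected (201), the two penetrances for each genotype must add up to 1. The following Python sketch checks this for the values in eapen.in. The function name columns_sum_to_one is hypothetical and used only for this illustration.

```python
import math

genotypes = ["1/1", "1/2", "2/2"]

# P(phenotype | genotype), copied from eapen.in as described above.
penetrance = {
    "101": [0.99900, 0.01000, 0.01000],  # normal
    "201": [0.00100, 0.99000, 0.99000],  # affected
}


def columns_sum_to_one(table, tol=1e-9):
    """Check that the phenotype probabilities sum to 1 for every genotype."""
    totals = [sum(column) for column in zip(*table.values())]
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals)


for i, g in enumerate(genotypes):
    print(g, "P(normal) =", penetrance["101"][i],
          "P(affected) =", penetrance["201"][i])
print("columns sum to one:", columns_sum_to_one(penetrance))
```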
(2) Run Mendel using Gregor: In general, go to the start menu and find Mendel5.
Select the magic lantern labeled Gregor. First select the option Choose Working
Directory and select the subdirectory containing your files. Now select the option Read
in Control File. Make any modifications to the control file you want by selecting the
option Modify Control File. Now select the option Write Control File and then Run
Mendel.
Specifically we will read in the control file EA4.in. Mendel requires that the control file
be named control.in so select Write Control.in in Gregor. Then select Run Mendel.
(3) Understanding the output: multipoint linkage analysis
Examine the summary file, sumea4.out, with Gregor, Notepad, or an equivalent editor.
POINT   MARKER   MAP DISTANCE   LOCATION SCORE
1       ---      -0.1106        2.8355
2       ---      -0.1106        2.8355
3       ---      -0.1100        2.8370
...
299     ---       0.1800        2.8186
300     ---       0.1810        2.8164
301     ---       0.1810        2.8164

THE BEST LOCATION SCORE OF 3.8462 OCCURS AT POINT 172.
The first column gives the grid point number. The second column denotes the location of
the markers (as opposed to the intervening points). The third column gives the location
in Morgans and the final column gives the corresponding location score. At the end of
the file, the best location score over all the considered points is repeated. More details of
the analysis can be found in the output file ea4.out.
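The Python sketch below shows one way to pick out the largest location score from rows of the summary file and to turn a score into an odds ratio. It assumes that the location score is in log base 10 units, as a LOD score is, and it uses only the six rows excerpted above, so its maximum is the best of the excerpt and not the overall best of 3.8462 at point 172. The function names are hypothetical.

```python
# (point, map distance in Morgans, location score) from the excerpt above.
rows = [
    (1, -0.1106, 2.8355),
    (2, -0.1106, 2.8355),
    (3, -0.1100, 2.8370),
    (299, 0.1800, 2.8186),
    (300, 0.1810, 2.8164),
    (301, 0.1810, 2.8164),
]


def best_point(rows):
    """Return the row with the largest location score."""
    return max(rows, key=lambda r: r[2])


def odds(score):
    """Convert a log10 score to an odds ratio (assumes base-10 units)."""
    return 10 ** score


point, distance, score = best_point(rows)
print("best of the excerpt: point", point, "at", distance, "Morgans")
print("odds for the reported best score 3.8462: about",
      round(odds(3.8462)), "to 1")
```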
(4) Single marker Lod scores. Rerun the program using the data from one marker at a
time by changing the command Number_of_markers_included = 4 to
Number_of_markers_included = 1. Remove the "!" in front of interior_points =
41 and put a "!" in front of grid_increment = 0.001, or make the changes using Modify
Control Parameters. Now the data for each marker will be used individually, just like
in lab 5. Change the output file names so that the earlier output files are not overwritten.
Run the analysis. The summary file now contains only the maximum lod over the grid
points for each marker separately. The lod scores and recombination fractions can be
plotted using Excel. These values are available in the long output file.
(D) PART TWO, ERROR CHECKING:
(1) Understanding the input files. The control file for the error checking is called error.in.
If you open it you should see:
OUTPUT_FILE = error1.out
PEDIGREE_FILE = errped.in
LOCUS_FILE = errLoc.in
MAP_FILE = errMap.in
ANALYSIS_OPTION = Mistyping
MODEL = 1
MALE = 1
FEMALE = 2
(2) The analysis_option, mistyping, has 5 models. Model 1 detects Mendelian
inconsistencies. Model 2 detects Mendelian-consistent errors: it estimates the
posterior probability of an error at each genotyped member of the pedigree as well as
the posterior probability of an error in at least one member of the family.
Model 2 assumes that genotyping errors are uniformly distributed among the
available genotypes. Models 3 and 4 estimate overall genotyping error rates. These
two models take a long time to run, so beware if you try them. Model 5 is similar to
model 2 except that it assumes that the genotyping errors are distributed according to
their population frequencies. The default prior genotyping error rate is 0.025. This
error rate means that any given genotype has a 2.5% chance of being incorrect
a priori.
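To see how a prior error rate turns into a posterior error probability, here is a deliberately simplified Python sketch of Bayes' rule for a single genotype. It is not Mendel's computation, which works with the full pedigree likelihood. The two likelihood values passed in are hypothetical numbers chosen only for illustration.

```python
def posterior_error(prior, lik_if_error, lik_if_correct):
    """Posterior probability that a genotype is wrong, by Bayes' rule.

    prior          : prior genotyping error rate (0.025 by default in Mendel)
    lik_if_error   : likelihood of the family data if this genotype is wrong
    lik_if_correct : likelihood of the family data if this genotype is right
    Both likelihoods are hypothetical inputs for this sketch.
    """
    num = prior * lik_if_error
    den = num + (1.0 - prior) * lik_if_correct
    return num / den


# A Mendelian inconsistency: the data are impossible if the genotype is right.
print(posterior_error(0.025, 0.2, 0.0))

# A Mendelian-consistent but unlikely genotype, under two different priors.
print(posterior_error(0.025, 0.2, 0.01))
print(posterior_error(0.05, 0.2, 0.01))
```

In this toy setting a larger prior gives a larger posterior, and data that are impossible under a correct genotype give a posterior of 1.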
(3) Running model 1. Using Gregor, read in the control file error.in and write over the
control.in file. Then run Mendel using option 5, model 1. The analysis will run quite
quickly.
(4) Understanding the output. Open error1.out and scroll down to the bottom. Just
before the end you should see:
GENOTYPING ERROR OPTION

PEDIGREE   PEDIGREE   LOCUS      ERROR NEAR
NUMBER     NAME       NAME       PERSON NAMED
1          1          D21S1256   2
7          12         D21S1256   77
1          1          D21S1914   2
1          1          D21S263    2
Families 1 and 12 have Mendelian inconsistencies at marker D21S1256, and family 1 also has
inconsistencies at D21S1914 and D21S263. Examine the pedigree file to see if you can
identify the errors. Family 1 looks like this:
 4  1
1           1   1   04/01   02/01
2           2   1   06/04   10/07   10/06
3   1   2   1   2   06/06   07/03
4   1   2   1   1   04/04   10/07   07/02
Although it is possible that person 2 in pedigree 1 is in error at marker D21S1256, it is
also possible that person 1 or person 3 is in error. At marker D21S1914 person 1 seems
the most likely mistyped member. At marker D21S263 it is not possible to determine
whether person 2, person 4 or both are in error. The most cautious approach would be to
delete all the data for these three markers in family 1 until the gels can be reexamined or
retyped.
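The reasoning behind a Mendelian inconsistency can be sketched in a few lines of Python: a child's genotype is consistent with its parents if one allele can come from the father and the other from the mother. The example assumes that the four D21S1256 genotypes listed for family 1 (04/01, 06/04, 06/06, 04/04) belong to persons 1 to 4 in that order, with persons 1 and 2 as the parents of 3 and 4. The function name trio_consistent is hypothetical, and this is a simple trio check, not the pedigree likelihood that Mendel uses.

```python
def alleles(genotype):
    """Split a genotype such as '04/01' into its two alleles."""
    first, second = genotype.split("/")
    return first, second


def trio_consistent(father, mother, child):
    """True if the child can inherit one allele from each parent."""
    f, m = alleles(father), alleles(mother)
    a, b = alleles(child)
    return (a in f and b in m) or (b in f and a in m)


# D21S1256 in family 1 (assignment to persons as described above).
father, mother = "04/01", "06/04"
for person, genotype in [("3", "06/06"), ("4", "04/04")]:
    ok = trio_consistent(father, mother, genotype)
    print("person", person, genotype,
          "consistent" if ok else "INCONSISTENT")
```

The check only says that the trio of father, mother and person 3 cannot all be right. It cannot say which of the three genotypes is the wrong one, which is why the posterior probabilities of model 2 are useful.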
Now examine family 12.
10
1
2
3
4
5
6
76
77
78
87
12
1
2
1
2
3
5
76
76
4
6
77
77
1
2
1
2
1
2
1
2
2
2
1
1
1
1
1
1
1
1
2
2
04/03
06/05
04/04
08/06
08/07
08/07
09/04
10/04
Again it is not clear whether the genotyping error is in person 76, 77 or 78 at D21S1256.
(5) Change the model to 2. Remove “!” from in front of “summary_file” and
“new_pedigree_file” in error.in or go to Modify Control Parameters -> Output Options
and fill in fields “summary_file” and “new_pedigree_file” with “sumerror2.out” and
“errped2.in” files respectively. Change the output file name to error2.out and rerun the
analysis. You now have a summary file “sumerror2.out”. You should have an additional
potential error.
PEDIGREE   PEDIGREE   PERSON   LOCUS                   ERROR
NUMBER     NAME       NAME     NAME       PHENOTYPE    PROBABILITY
1          1          ANYONE   D21S1256                1.00000
1          1          1        D21S1256   01/04        0.94253
1          1          ANYONE   D21S1914                1.00000
1          1          1        D21S1914   01/02        1.00000
1          1          ANYONE   D21S263                 1.00000
1          1          4        D21S263    02/07        0.90936
6          11         ANYONE   D21S263                 0.67063
6          11         74       D21S263    01/06        0.62346
7          12         ANYONE   D21S1256                1.00000
7          12         78       D21S1256   04/04        0.79051
In addition, there is a new pedigree file. Genotypes with errors over the threshold value
(default 0.25) have been eliminated in this file.
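The thresholding step can be pictured with a short Python sketch. It takes the person-level rows of the summary file above and keeps those whose error probability is over the threshold; these are the genotypes that would be removed from the new pedigree file. The function name flagged is hypothetical, and the comparison with a stricter threshold of 0.7 is only an illustration, not a recommended setting.

```python
# (pedigree name, person, locus, genotype, posterior error probability)
# taken from the person-level rows of sumerror2.out shown above.
summary = [
    ("1", "1", "D21S1256", "01/04", 0.94253),
    ("1", "1", "D21S1914", "01/02", 1.00000),
    ("1", "4", "D21S263", "02/07", 0.90936),
    ("11", "74", "D21S263", "01/06", 0.62346),
    ("12", "78", "D21S1256", "04/04", 0.79051),
]


def flagged(rows, threshold=0.25):
    """Rows whose error probability is over the threshold."""
    return [row for row in rows if row[4] > threshold]


print(len(flagged(summary)), "genotypes over the default threshold of 0.25")
for ped, person, locus, genotype, prob in flagged(summary, threshold=0.7):
    print("over 0.7: pedigree", ped, "person", person, locus, genotype, prob)
```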
Additional Analyses to try:
(6) Change the model to 5, change the output file name, the summary file name and the
new pedigree file name, and rerun. How do your results differ from the model 2
results?
(7) Change the prior from the default (0.025) to 0.05. Compare the results to the
previous run. How does changing the prior change the posterior
probabilities?
(8) Why is person 74's genotype at D21S263 likely to be in error?
(9) It can be helpful to have the pedigrees to help you better understand the errors. Use
GAP (remember GAP?) to visualize the pedigrees. To read in the data, you will
define three new variables that are 5 characters long, one each for the three
genotypes. Label the pedigrees with the subject ID and the three marker genotypes
and print them out.
(10) If you have time (at least 1/2 hour) then run model 3 to get an overall estimate of
the mistyping rate.
THERE IS NO HOMEWORK FOR THIS LAB. REMINDER, EXAM 2/19/04 OPEN NOTE, OPEN BOOK, BRING CALCULATOR.