Manual_pedchin_sas - Division of Statistical Genomics

Division of Statistical Genomics
Washington University School of Medicine
Campus Box 8506, 4444 Forest Park Blvd.
Saint Louis, MO 63108
Program: pedchin.sas
Use it free of charge with NO WARRANTIES WHATSOEVER
No rights of distribution are granted. Copyright © 2006
Version: 1
OS: Linux, Unix, Windows
Date: 03.29.2006
Prepared by:
Aldi Kraja DSc., PhD., Res. Ass. Prof. of Genetics
aldi@wustl.edu
Michael Province, PhD., Prof. of Genetics and Biostatistics
mprovince@wustl.edu
Purpose: Use the PedCheck program, written in C/C++ by O’Connell and Weeks (1), with
SAS datasets to correct inconsistencies in marker data automatically. The program can be
used for microsatellite and SNP data. Given a marker dataset, the program pinpoints any
inconsistencies and zeroes out the affected genotypes.
Two interfaces are available:
The first interface is pedchin.sas. It can run chromosomes 1-22 by selecting AUTOSOMAL, a
specific chromosome (for example CHROM 3), or a range of chromosomes (for example CHROM 1-5).
This interface works with a maximum number of markers that is limited to the
default of PedCheck.
The second interface is pedchin1.sas. It can be used with only one chromosome at a
time, but it assumes that the chromosome carries a very large number of
markers (for example 300,000). The program creates an index and
loops over it based on the bymark= value provided by the user.
Formats of the data needed:
The program expects five SAS datasets from the user, each in a simple but specific format:
1. GMARSH SAS dataset (marker set), with one row per subject and the columns
SUBJECT (the subject ID, under whatever name you choose) and M1, M2, …, Mn
(the markers, named as you choose). Type: subject (numeric), markers (character).
See a schematic below:
subject   ACT1A01   ATC4D09   …   m1000
1         0/0       194/200   …   3/3
2         0/0       200/200   …   4/4
…         …         …         …   …
22560     182/188   194/194   …   3/4
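As a minimal sketch, a marker dataset of this shape could be created with a DATA step like the one below. The dataset name gmarsh, the character length of 7, and the pairing of genotypes with subjects are assumptions made for illustration. Only the variable names and genotype strings come from the schematic.

```sas
/* Illustrative sketch only: a three-subject marker dataset in the    */
/* layout described above. SUBJECT is numeric; each marker is a       */
/* character genotype written as allele1/allele2. The value 0/0 is    */
/* taken from the schematic and is assumed here to mean an untyped    */
/* (or zeroed-out) genotype.                                          */
data gmarsh;
   length subject 8 ACT1A01 ATC4D09 m1000 $ 7;
   input subject ACT1A01 $ ATC4D09 $ m1000 $;
   datalines;
1 0/0 194/200 3/3
2 0/0 200/200 4/4
22560 182/188 194/194 3/4
;
run;

/* Confirm the variable types: subject numeric, markers character */
proc contents data=gmarsh;
run;
```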
2. TRIPLET SAS dataset (familial relation set), with one row per subject and the
columns PEDID (the family ID, for example famid, under whatever name you choose),
SUBJECT, DADSUBJ, MOMSUBJ, and gender (sex, or any other name you use).
Type: pedid (numeric), subject (numeric), dadsubj (numeric), momsubj (numeric),
gender (numeric).
See a schematic below:
pedid   subject   dadsubj   momsubj   sex
1       1         0         0         1
2       1         0         0         2
…       …         …         …         …
1001    22560     22548     22549     1
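A relation dataset of this form might be built and sanity-checked as sketched below. The single nuclear family shown is hypothetical: the two founder rows, and the coding 1 = male, 2 = female, are assumptions not stated in this manual. The parent check is an extra step one might run before submitting the data. It is not part of pedchin.sas.

```sas
/* Illustrative sketch only: one hypothetical nuclear family.        */
/* Founders carry 0 for both parents, as in the schematic.           */
/* Sex coding 1 = male, 2 = female is an assumption.                 */
data triplet;
   input pedid subject dadsubj momsubj sex;
   datalines;
1001 22548 0 0 1
1001 22549 0 0 2
1001 22560 22548 22549 1
;
run;

/* Optional check: list subjects whose non-zero parent IDs do not    */
/* appear as subjects themselves. The result should be empty.        */
proc sql;
   create table missing_parents as
   select t.pedid, t.subject, t.dadsubj, t.momsubj
   from triplet as t
   where (t.dadsubj ne 0 and t.dadsubj not in (select subject from triplet))
      or (t.momsubj ne 0 and t.momsubj not in (select subject from triplet));
quit;
```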
3. LOCDES SAS dataset (list of markers to be analyzed), with at least two
variables, MARKNAME and MARSHMAP. This SAS dataset serves only as a template
telling our program which markers to select from the GANON set for analysis. You may
name the MARKNAME and MARSHMAP variables as you wish, but their values must
correspond one to one with those in the MARSHMAP SAS dataset described
later.
Type: markname (character), marshmap (character).
See a schematic below:
markname   marshmap
ACT1A01    D18S843
ATC4D09    D6S1006
…          …
m1000      rs346
4. MARSHMAP SAS dataset (list of map markers that can be used in your
study and in other studies). This SAS dataset normally holds all possible markers
provided by some database, for example the Marshfield map or a map you created
from dbSNP. It is expected to contain the variables MARKNAME, MARSHMAP (another
variable name for the map), DIST, and CHROM (chromosome number). The values of
markname and marshmap may be the same, but they must be unique within each
variable; otherwise markers will be deleted.
Type: markname (character), marshmap (character), chrom (numeric), dist (numeric).
See a schematic below:
markname   marshmap   chrom   dist
ACT1A01    D18S843    18      28.10
ATC4D09    D6S1006    6       26.71
m1000      rs346      13      44.1
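Because duplicated values cause markers to be deleted, it may be worth checking uniqueness before a run. The sketch below builds a small map dataset from the schematic values (their pairing across columns is an assumption) and lists any duplicated markname or marshmap values. It is a suggested pre-check and not part of pedchin.sas.

```sas
/* Illustrative sketch only: a small map dataset and a uniqueness check */
data marshmap;
   length markname marshmap $ 12;
   input markname $ marshmap $ chrom dist;
   datalines;
ACT1A01 D18S843 18 28.10
ATC4D09 D6S1006 6 26.71
m1000 rs346 13 44.1
;
run;

proc sql;
   /* Each query should return no rows when the values are unique */
   title "Duplicated MARKNAME values";
   select markname, count(*) as n
   from marshmap
   group by markname
   having count(*) > 1;

   title "Duplicated MARSHMAP values";
   select marshmap, count(*) as n
   from marshmap
   group by marshmap
   having count(*) > 1;
quit;
title;
```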
5. GENEFREQ SAS dataset (list of markers in your study with all alleles and their
allele frequencies expressed in percent). This dataset normally represents every marker
in your study with the variables MARKNAME, ALLELE, and PERCENT. Markname is the
same variable as above, allele is the allele number, and percent is the frequency of
that allele as a percentage, so that the alleles of a marker total 100%.
Type: markname (character), allele (numeric), percent (numeric).
See below an example for 3 markers from real data:
markname   allele   percent
ACT1A01    170      0.067092
ACT1A01    173      0.033546
ACT1A01    179      6.692385
ACT1A01    182      20.26166
ACT1A01    185      30.91245
ACT1A01    188      30.92922
ACT1A01    191      10.61724
ACT1A01    194      0.436095
ACT1A01    197      0.016773
ACT1A01    203      0.033546
ATC4D09    182      0.025497
ATC4D09    194      34.80367
ATC4D09    197      8.618052
ATC4D09    200      55.73687
ATC4D09    203      0.81591
ATC6A06    110      0.067774
ATC6A06    113      3.270078
ATC6A06    116      6.81125
ATC6A06    119      47.96679
ATC6A06    122      28.9224
ATC6A06    125      9.58997
ATC6A06    128      3.337852
ATC6A06    131      0.033887
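Since the percentages of each marker are expected to total 100%, a quick check along the following lines may catch incomplete frequency tables. The sketch uses the ATC4D09 rows from the example. The tolerance of 0.01 and the names freq_problems, genefreq_recoded, newallele and freq are hypothetical. The second step mimics the recoding that the interface comments describe (alleles renumbered from 1, percent divided by 100). It is an illustration of that idea, not the program's actual code.

```sas
/* Illustrative sketch only: frequency data for one marker */
data genefreq;
   length markname $ 12;
   input markname $ allele percent;
   datalines;
ATC4D09 182 0.025497
ATC4D09 194 34.80367
ATC4D09 197 8.618052
ATC4D09 200 55.73687
ATC4D09 203 0.81591
;
run;

/* List markers whose percentages do not total 100 within a small  */
/* tolerance (0.01 is an arbitrary choice for this sketch).        */
proc sql;
   create table freq_problems as
   select markname,
          count(*)     as n_alleles,
          sum(percent) as total_percent
   from genefreq
   group by markname
   having abs(sum(percent) - 100) > 0.01;
quit;

/* Renumber alleles 1, 2, ... within each marker and convert the   */
/* percentage to a proportion.                                     */
proc sort data=genefreq;
   by markname allele;
run;

data genefreq_recoded;
   set genefreq;
   by markname;
   if first.markname then newallele = 0;
   newallele + 1;            /* sum statement: value is retained across rows */
   freq = percent / 100;
run;
```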
The following sections explain how to install the program, how to use
PEDCHIN.SAS, and which parameters you are asked to provide.
How to install the program pedchin.sas:
The interface “pedchin.sas” and a group of auxiliary programs are
packaged in a gzipped tar file named pedch.tar.gz.
After you download this package to your system (Linux or Unix server), run the
following at the command prompt:
gunzip pedch.tar.gz
tar -xvf pedch.tar
tar will create a directory named pedch in the working directory.
It is expected that you have already installed pedcheck_linux_intel, pedcheck_sol, or
pedcheck_win of O’Connell and Weeks, depending on your operating system,
and that it can be run from anywhere in your system, for example through a symbolic link
in a directory on your main path (such as /usr/bin/). For more see here:
http://www.biostat.wustl.edu/~aldi/geneticssoft/index.htm
All the dependencies in our SAS programs are called with %include as if they were
present in the working directory (no path defined). If you change the working
directory, move all the auxiliary files to a permanent directory, for
example /users/genetics/toolbox/pedcheck/, and change the references to the dependencies
to that path. Almost all the auxiliary files contain some %include or %inc. After that, the
only file the user has to modify to run the program is pedchin.sas,
which can be placed in any working directory.
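A reference to a permanent path might look like the sketch below. The macro variable pedchdir is hypothetical, and pedcheck.sas is used only as an example of an auxiliary file. Which files are actually included, and from where, should be checked in the distributed programs.

```sas
/* Illustrative sketch only: point %include at a permanent directory  */
/* instead of relying on the working directory.                       */
%let pedchdir = /users/genetics/toolbox/pedcheck;   /* hypothetical macro variable */

/* The period after &pedchdir ends the macro variable name, so this   */
/* resolves to /users/genetics/toolbox/pedcheck/pedcheck.sas          */
%include "&pedchdir./pedcheck.sas";
```

Keeping the directory in one macro variable means a later move of the auxiliary files requires a change in only one place per file.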
You also have to change the name of the PedCheck executable that is called.
Search for pedcheck_linux in the file pedcheck.sas to find
the program name, and substitute pedcheck_linux_intel,
pedcheck_sol, or pedcheck_win.
Use pedchin.sas:
Edit the interface provided so that it refers to your own data, then
submit the program as is usually done in the SAS system,
in batch or interactive mode (see the SAS manuals and your system administrator).
An important observation: PedCheck of O’Connell and Weeks is expected
to work internally with nuclear families. For more see here:
http://www.biostat.wustl.edu/~aldi/geneticssoft/index.htm
We provide an option, UsOrPedcheckNuk=1, which lets our program split
large pedigrees into nuclear families. We recommend applying this step to your
data. Especially with large pedigrees, the final result of zeroing out the marker
inconsistencies will not be the same without this option. If you
have already created nuclear families, use UsOrPedcheckNuk=2 to skip our
split of the data into nuclear families. If you use our nuclearization step, please
acknowledge that Bill Howells contributed to the nuclearization program. Be aware that
with UsOrPedcheckNuk=1 the pedid values reported are in fact nuclear family IDs created
as the program runs. To trace a problem back to the original data,
use the subject ID of the problematic subject and look up its original pedid from there.
How to run pedchin.sas:
A. Make sure that errlevel34=1 and errlevel1=0. The program starts with GMARSH
and, after correcting any inconsistencies, creates GMARSH1 (the label is provided by
you).
B. Rerun the program, substituting the most recent GMARSH1 for GMARSH;
after correcting any inconsistencies it creates GMARSH2.
C. Repeat this process until no more level 3 and 4 errors are present (check the
pedcheck#chrom.err files, where #chrom is the chromosome number, to see whether
any inconsistencies remain). Let’s assume that GMARSH3 was the
output of the last run. If some level 3-4 error is still not eliminated after a few
runs, check whether it is an unusual case like the
following example: PedCheck reports odds 1 for 14471 and
odds 6 for 65241 for marker M1000. Soft elimination of level 3-4
errors selects first those with the lowest odds against. Here 14471 has the
lowest odds but is already 0/0, so soft elimination alone will never zero out the
inconsistency of 65241. Only in this case do we need the option
strongassump=1 (which means that we are sure that soft elimination is not
working, and all the remaining reported level 3-4 inconsistencies are zeroed out
by brute force).
Pedigree for this example: 14469 (grandfather) -- 14470 (grandmother) --> 14471 (mother) [0/0] -- 65226 (father) --> 65241 (child) [4/4]
D. Start a new process by setting errlevel1=1 and errlevel34=0.
Rerun the program with the latest GMARSH3. GMARSH4 is created only if
level 1 inconsistencies are still present; otherwise GMARSH4
is not created.
E. If there are still level 1 inconsistencies in the data, repeat step D with
GMARSH4 as the input and GMARSH5 as the output.
F. If after all runs some errors remain that are very difficult to resolve, the
last resort is to zero out all the subjects of a specific nuclear family for a
specific marker. This is a very strong assumption, so apply it
only as a final step. Only when there is no other solution for the data
(ERROR LEVEL 3/4 failed, and ERROR LEVEL 1 does not clearly identify
the problematic subject) should one apply:
errlevel1=1 and strongassump1=1;
Be aware that this last step is very drastic: it zeroes out all members of a
nuclear family for the marker with level 1 errors.
If you apply this step, check apederr&i (located at &dir2), as in any
other step, against pedcheck&i.err (located in the working directory) and
the zeroed-out genotypes in the final version of the SAS marker data (located at &dir2).
If step F is applied before the other steps above are finished, the program
aborts to protect the data from unnecessary zeroing out.
G. An important note: We applied the pedchin.sas program and all its steps to a large
dataset with about 5000 subjects and about 600 microsatellite markers, and at the
end 3 nuclear families were still reported with problems. They had about 6-8
alleles in the children and no parents typed. Only in this case, a level 2 error
was reported by PedCheck and level 3 failed. We have not provided a
solution for this case; the user is asked to zero out the problematic marker for the
reported subjects.
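The difference between soft elimination and the strong assumption in step C can be sketched as follows. The dataset reported and its variables marker, subject and odds are hypothetical stand-ins for the level 3-4 errors that PedCheck reports. The program's own error datasets may be organized differently.

```sas
/* Illustrative sketch only: the two subjects reported for marker   */
/* M1000 in the example of step C, with their odds against.         */
data reported;
   length marker $ 8;
   input marker $ subject odds;
   datalines;
M1000 14471 1
M1000 65241 6
;
run;

proc sort data=reported;
   by marker odds;
run;

/* Soft elimination: per marker, select only the subject with the   */
/* lowest odds against. Here that is 14471, whose genotype is       */
/* already 0/0, so zeroing it out changes nothing on the next run.  */
data soft_pick;
   set reported;
   by marker;
   if first.marker;
run;

/* Strong assumption: every reported subject is selected, so 65241  */
/* is zeroed out as well.                                           */
data strong_pick;
   set reported;
run;
```

In this sketch soft_pick holds only subject 14471, which shows why repeated soft runs do not resolve the case, while strong_pick holds both subjects.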
If you think the program is not able to eliminate other inconsistencies, please send us an
email with an example dataset and your output of the inconsistencies not eliminated so
we can improve this program. Finally any feedback is appreciated! Thanks for using
our program.
Interface and parameters to be provided by the user:
*-----------------------------------------------------------------------------------------;
* Program: pedchin.sas                                                                    ;
* By: Aldi Kraja and Michael Province                                                     ;
* Purpose: Use of the program pedcheck of O-Connell written in c-c++                      ;
* as the first source for finding inconsistencies in the marker data                      ;
* Notes:                                                                                  ;
* Improved: Version 1: 03.14.06                                                           ;
*-----------------------------------------------------------------------------------------;
/******************************************************************
*******************************************************************
* A. INPUT:
*----------------------------------------------------------------*
* 1. Provide the name of the marker dataset that will have
* marker names; function: they will be used for
* check on inconsistencies. The markers for chrom
* 23 (X) and chrom 25 (Y) will be eliminated.
* -* default: GMARSH=,
*
* 2. Provide the name of new corrected marker dataset. Do not write
* the same name as GMARSH above, because you can overwrite the source
* SAS dataset.
* -* default: gmarsh2=,
*
* 3. Provide the name of the set that will have any possible inconsistencies
* in the marker data and which will be corrected.
* -* default: pederr=pederr,
*
* 4. Provide the name of the dataset that has frequencies
* of marker alleles; function: internally the alleles will be recoded
* with new numbers starting from 1 and not 70, 150 etc.
* and the percent will be divided by 100:
* -* default: MARSHFRQ=,
*
* 5. Provide the name of dataset that keeps markname and
* marshmap variables; function: this will be used
* for matching marshmap and markname.
* True variable is marshmap and created variable
* is markname.
* Internally we will use markname.
*
* -* default: MARSHLOC=,
*
* 6. Provide the name of the dataset where you want to check
* for the CHD or any other trait (only one pheno) that will
* be used to assess affection, coded 2-affected, 1-not affected, 0-unknown.
* For example in our case we will use chddata
* -* default: CHDDAT=,
*
* 7. Provide the name of the map dataset
* -* default: MARSHMAP=,
*
* 8. What is the pheno that will be studied for affection or not?
* CHD is assumed to be 2 for affected, 1 for not affected, and 0 for unknown
* -* default: CHD=,
*
* 9. Write the affected value. (You can write a 2, depending on the affection
* values used in the data for CHD. If affection in the CHD variable was
* coded as 2, then affected=2 is the right choice.)
* -* default: affected=2,
*
* 10. Provide the dataset where we will get these variables:
* PEDID (or FAMID), SUBJECT, DADSUBJ (or FID), MOMSUBJ (or MID) and SEX.
* In FHS case: g1triple
* -* default: TRIPLET=g1triple,
*
* 11. What is the unique identification for the subjects in the TRIPLET dataset?
* If subject write subject, if id write id.
* -* default: SUBJECT=subject,
*
* 12. Provide the label for fatherid in the TRIPLET dataset:
* -* default: FATHERID=dadsubj,
*
* 13. Provide the label for motherid in the TRIPLET dataset:
* -* default: MATHERID=momsubj,
*
* 14. PEDID in the TRIPLET dataset:
* -* default: PEDID=pedid,
*
* 15. SEX numeric 1 and 2 or M and F or MALE and FEMALE in the TRIPLET dataset
* -* default: SEX=sex,
*
* B. OUTPUT:
*---------------------------------------------------------------------------*
* 16. What is the directory you want to write the locFHS&i.dat and pedFHS&i.ped?
* It will also be used as the path for reading the errors from pedcheckNR.err
* -* default: dir1=,
*
* 17. dir2=,
* it will be used as default for the output of SAS datasets
* -* default: dir2=,
*
* 18. What do you want to call the pedfile that will be put in the DIR1?
* USE NO MORE THAN SIX LETTERS TO WRITE THE BODY OF THE FILENAME.
* The last two positions will be used
* for numbers from 1 to 22, one for each chromosome.
* So the output will be &PEDFILE1-22.ped
* -* default: PEDFILE=pedFHS,
*
* 19. What do you want to call the locfile that will be put in the DIR1?
* USE NO MORE THAN SIX LETTERS TO WRITE THE BODY OF THE FILENAME.
* The last two positions will be used
* for numbers from 1 to 22, one for each chromosome.
* So the output will be &LOCFILE1-22.dat
* -* default: LOCFILE=locFHS,
*
* 20. How many chromosomes you want to check. Default: AUTOSOMAL.
* You can use also CHROM and a number, for specific chromosome.
* -* default: markers=AUTOSOMAL,
*
* 21. Provide the markname variable
* -* default: markvar=markname,
*
* 22. Provide the marshname variable
* -* default: marshvar=marshmap,
*
* 23. Provide the format value for markname and marshname (one value)
* for example 12.
* -* default: fmtmarkname=,
*
* 24. Provide the choice (1 or 2) if you want Us or Pedcheck to do the
* nuclearization for these data. In both cases the inconsistencies of
* markers are corrected based on relationships within nuclear families.
* For sure the end result will be the regular family structure you
* provide with the data.
* -* default: UsOrPedcheckNuk=1,
*
* 25. Provide the name of the PedCheck executable, assuming that is
* present in the general path, otherwise provide the full path and name
* -* default: pedcheckname=pedcheck_linux,
*
* Quality checks
* ==============
* 26. errlevel34=1, when you start a run and finish level 34.
* -* default: errlevel34=1,
*
* 27. strongassump34=0 or 1; the default is 0. This means that the elimination of
* errors of type 3-4 will be gradual (soft), selecting those with the smallest
* odds against. Only when, after running the program let-s say 3 times,
* some error of type 3-4 is still not eliminated, it may be a case of this kind:
* example: 14469(grandfather)--14470(grandmother)-->14471(mother)[0/0]-65226(father)-->65241(child)[4/4]
* In this case pedcheck is reporting mother 14471 as problematic with odds against 1 and
* 65241 with odds against 6.
* If we apply only the soft elimination in this case we never eliminate the problem.
* Only in such a case is it permitted to use strongassump34=1, which will zero out
* the markers of all subjects reported with inconsistencies.
* -* default: strongassump=0,
*
* 28. errlevel1=1 if level 1 errors remain after one has worked through all possible
* inconsistencies of level 3/4.
* You have to set errlevel34=0. AS USUAL THE ORDER OF ERRLEVEL IS VERY IMPORTANT:
* START WITH ERRLEVEL34=1, after a few reruns continue with ERRLEVEL1.
* -* default: errlevel1=0,
*
* 29. strongassump1=1 if level 1 errors remain that are no longer resolved with the
* above switches. This step will zero out the subjects of a nuclear fam. for a specific marker.
* This is a very strong assumption, so try the above choices well before this last one.
* -* default: strongassump1=0,
*******************************************************************
******************************************************************/
1. O'Connell JR, Weeks DE (1998). PedCheck: A program for identifying genotype incompatibilities in linkage analysis.
Am J Hum Genet 63:259-266