Division of Statistical Genomics
Washington University School of Medicine
Campus Box 8506, 4444 Forest Park Blvd.
Saint Louis, MO 63108

Program: pedchin.sas
Use it free of charge with NO WARRANTIES WHATSOEVER.
No rights of distribution are granted.
Copyright © 2006
Version: 1
OS: Linux, Unix, Windows
Date: 03.29.2006

Prepared by:
Aldi Kraja DSc., PhD., Res. Ass. Prof. of Genetics, aldi@wustl.edu
Michael Province, PhD., Prof. of Genetics and Biostatistics, mprovince@wustl.edu

Purpose:
Use the PedCheck program, written in C/C++ by O'Connell and Weeks [1], with SAS-formatted datasets to correct inconsistencies in marker data automatically. The program can be used for microsatellite and SNP data. Given a marker dataset, the program pinpoints any inconsistencies present by zeroing them out.

Two interfaces are available:

pedchin.sas runs chromosomes 1-22 when AUTOSOMAL is selected, a specific chromosome (for example CHROM 3), or a range of chromosomes (for example CHROM 1-5). This interface works with a maximum number of markers that is limited to the default of PedCheck.

pedchin1.sas can be used with only one chromosome at a time, but it assumes that the chromosome is very large (for example 300,000 markers). The program creates an index that loops based on the bymark= value provided by the user.

Formats of the data needed:
We use a very simple but specific format. We expect five SAS datasets from the user:

1. GMARSH SAS dataset (marker set). Rows are subjects' observations; columns are the variables SUBJECT (ID), or whatever you want to call this id variable, and M1, M2, …, Mn (the marker names, your choice).
Type: subject (numeric), markers (character).
See a schematic below:

   subject   ACT1A01   ATC4D09   …   m1000
   1         0/0       194/200   …   3/3
   2         0/0       200/200   …   4/4
   …         …         …         …   …
   22560     182/188   194/194   …   3/4

2. TRIPLET SAS dataset (familial relation set). Rows are subjects' observations; columns are the variables PEDID (famid), or whatever you want to call this familial id variable, SUBJECT id, DADSUBJ id, MOMSUBJ id, and gender (sex), or any other names used by you.
Type: pedid (numeric), subject (numeric), dadsubj (numeric), momsubj (numeric), gender (numeric).
See a schematic below:

   pedid   subject   dadsubj   momsubj   sex
   1       1         0         0         1
   2       1         0         0         2
   …       …         …         …         …
   1001    22560     22548     22549     1

3. LOCDES SAS dataset (list of markers that will be analyzed), with at least two variables: MARKNAME and MARSHMAP. This SAS dataset serves only as a template telling our program which markers to select from the GANON set for analysis. MARKNAME can be whatever name you want, and the same applies to the MARSHMAP variable, but they must correspond one to one with the MARSHMAP SAS dataset described later (item 4).
Type: markname (character), marshmap (character).
See a schematic below:

   markname   marshmap
   ACT1A01    D18S843
   ATC4D09    D6S1006
   …          …
   m1000      rs346

4. MARSHMAP SAS dataset (list of mapping markers that can be used in your study and in other studies). This SAS dataset normally holds all possible markers provided by some database, for example the map from Marshfield, or a map created by you from dbSNP. We normally expect the following variables in this SAS dataset: MARKNAME, MARSHMAP (another variable name for the map), DIST, and CHROM number. The contents of markname and marshmap can be the same, but the values have to be unique within each variable, otherwise markers will be deleted.
Type: markname (character), marshmap (character), chrom (numeric), dist (numeric).
See a schematic below:

   markname   marshmap   chrom   dist
   ACT1A01    D18S843    18      28.10
   ATC4D09    D6S1006    6       26.71
   m1000      rs346      13      44.1

5. GENEFREQ SAS dataset (list of markers in your study with all alleles and their allele frequencies expressed in percent).
This dataset normally has all markers in your study, represented by the following variables: MARKNAME, ALLELE, PERCENT. MARKNAME is the same variable as above, ALLELE is a number that identifies the allele, and PERCENT is the percentage of that allele out of a total of 100% for the marker.
Type: markname (character), allele (numeric), percent (numeric).
See below an example for 3 markers from some real data:

   markname   allele   percent
   ACT1A01    170      0.067092
   ACT1A01    173      0.033546
   ACT1A01    179      6.692385
   ACT1A01    182      20.26166
   ACT1A01    185      30.91245
   ACT1A01    188      30.92922
   ACT1A01    191      10.61724
   ACT1A01    194      0.436095
   ACT1A01    197      0.016773
   ACT1A01    203      0.033546
   ATC4D09    182      0.025497
   ATC4D09    194      34.80367
   ATC4D09    197      8.618052
   ATC4D09    200      55.73687
   ATC4D09    203      0.81591
   ATC6A06    110      0.067774
   ATC6A06    113      3.270078
   ATC6A06    116      6.81125
   ATC6A06    119      47.96679
   ATC6A06    122      28.9224
   ATC6A06    125      9.58997
   ATC6A06    128      3.337852
   ATC6A06    131      0.033887

The following sections explain how to install the program, how to use PEDCHIN.SAS, and which parameters you are asked to provide.

How to install the program pedchin.sas:
The interface "pedchin.sas", together with a group of auxiliary programs, is packaged in a gzipped tar file named pedch.tar.gz. After you download this package to your system (Linux or Unix server), run the following at the command prompt:

   gunzip pedch.tar.gz
   tar -xvf pedch.tar

tar will create a directory named pedch in the working directory.

It is expected that you have already installed pedcheck_linux_intel, pedcheck_sol, or pedcheck_win of O'Connell and Weeks, depending on your operating system, and that it can be called from anywhere in your system because it is present as a symbolic link in your main path (for example under /usr/bin/). For more see here: http://www.biostat.wustl.edu/~aldi/geneticssoft/index.htm

All the dependencies in our SAS programs are called with %include as if they are present in the working directory (no path defined).
So if you change the working directory, move all the auxiliary files to a permanent directory (for example /users/genetics/toolbox/pedcheck/) and change the references to the dependencies to that permanent path. Almost all the auxiliary files contain some %include or %inc. After that, the only file the user has to modify to run the program is pedchin.sas, which can be placed in any working directory.

You also have to change the name of the PedCheck program that will be called. Search for pedcheck_linux in the file pedcheck.sas and you will immediately find where the program name has to be replaced with pedcheck_linux_intel, pedcheck_sol, or pedcheck_win.

Use pedchin.sas:
Edit the provided interface so that it refers to your own data. Then submit the program as is usually done in the SAS system, in batch or interactive mode (see the SAS manuals and your system administrator).

An important observation: PedCheck of O'Connell and Weeks is expected to deal internally with nuclear families. For more see here: http://www.biostat.wustl.edu/~aldi/geneticssoft/index.htm

We provide the option UsOrPedcheckNuk=1, which permits our program to split large pedigrees into nuclear families. We recommend applying this process to your data. Especially if you have large pedigrees, the final result of zeroing out the marker inconsistencies will not be the same if this option is not used. If instead you have already created nuclear families, use UsOrPedcheckNuk=2 to skip our split of the data into nuclear families. If you use our nuclearization step, please acknowledge that Bill Howells contributed to the nuclearization program.

Be aware that if UsOrPedcheckNuk=1 is used, the pedids reported are in fact nuclear family ids created as you run the program.
So if you want to trace a problem back to the original data, use the subject ID of the problematic subject and from there work out the original pedid.

How to run pedchin.sas:

A. Make sure that errlevel34=1 and errlevel1=0. The program starts with GMARSH and, after correcting any inconsistencies, creates GMARSH1 (the label is provided by you).

B. Rerun the program, substituting GMARSH with the most recent GMARSH1; after correcting any inconsistencies it creates GMARSH2.

C. Repeat this process until no more level 3 and 4 errors are present (check the pedcheck#chrom.err files, where #chrom is the chromosome number, to see whether any inconsistencies are still present). Let's assume that GMARSH3 was the output of the last run.

If some level 3-4 error is still not eliminated after a few runs, it may be an unusual case (please check whether that is so). See the following example:

   14469(grandfather)--14470(grandmother)-->14471(mother)[0/0]-65226(father)-->65241(child)[4/4]

PedCheck reports odds 1 for 14471 and odds 6 for 65241 for marker M1000. Soft elimination of level 3-4 errors selects first the subjects with the lowest odds against. In this case, if we continue with soft elimination only, we will never zero out the inconsistency of 65241. Only in this case do we need the option strongassump=1, which means that we are sure soft elimination is not working and that all the remaining level 3-4 inconsistencies reported will be zeroed out by brute force.

D. Start a new process by turning on errlevel1 (errlevel1=1) and turning off errlevel34 (errlevel34=0). Rerun the program with the latest GMARSH3. GMARSH4 is created if any level 1 inconsistencies are still present; otherwise GMARSH4 is not created.

E. If there are still level 1 inconsistencies in the data, repeat step D with GMARSH4 as the starting point and GMARSH5 as the end point.

F. If after all runs you still see some errors that are very difficult to resolve, the last resort is to zero out all the subjects of a specific nuclear family for a specific marker. This assumption is very strong, so apply it only as a final step. Only if there is no other solution for the data (ERROR LEVEL 3/4 failed) and ERROR LEVEL 1 does not clearly identify the problematic subject should you apply:

   errlevel1=1 and strongassump1=1;

Be aware that this last step is very drastic: for level 1 errors it zeroes out all members of a nuclear family for the problematic marker. If you apply this step, check apederr&i (located at &dir2), as in any other step, against pedcheck&i.err (located in the working directory) and against the zeroed-out values in the final version of the SAS marker data (located at &dir2). If you apply step F without having finished the steps above, the program will abort to protect the data from unnecessary zeroing out.

G. An important note: we applied pedchin.sas and all of the steps to a large dataset with about 5000 subjects and about 600 microsatellite markers, and at the end 3 nuclear families were still reported with problems. They had about 6-8 alleles in the children and no parents typed. Only in this case was a level 2 error reported by PedCheck, and level 3 failed. We have not provided a solution for this case; the user is asked to zero out the problematic marker for the reported subjects.

If you think the program is not able to eliminate other inconsistencies, please send us an email with an example dataset and your output of the inconsistencies not eliminated, so we can improve this program. Finally, any feedback is appreciated! Thanks for using our program.
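
The order of the switches in steps A to F can be laid out as a series of passes. The sketch below is only an illustration and is not part of the pedch package: %show_pass is a hypothetical helper that writes the settings of each pass to the SAS log and does not run PedCheck, the names GMARSH1 to GMARSH5 are the example labels used in the steps, and GMARSH6 is an assumed label for the output of step F. In a real run the same values would be entered in the pedchin.sas interface and the program submitted once per pass.

```sas
/* Hypothetical helper, not part of the pedch package: it only writes */
/* the switch settings of one pass to the SAS log.                    */
%macro show_pass(step=, gmarsh=, gmarsh2=, errlevel34=0, strongassump=0,
                 errlevel1=0, strongassump1=0);
  %put Step &step: &gmarsh -> &gmarsh2 errlevel34=&errlevel34 strongassump=&strongassump errlevel1=&errlevel1 strongassump1=&strongassump1;
%mend show_pass;

/* Steps A to C: soft elimination of level 3/4 errors, one rerun per pass. */
%show_pass(step=A, gmarsh=GMARSH,  gmarsh2=GMARSH1, errlevel34=1)
%show_pass(step=B, gmarsh=GMARSH1, gmarsh2=GMARSH2, errlevel34=1)
%show_pass(step=C, gmarsh=GMARSH2, gmarsh2=GMARSH3, errlevel34=1)

/* If a level 3/4 error survives several soft passes, one further pass */
/* with errlevel34=1 and strongassump=1 would be inserted here.        */

/* Steps D and E: level 1 errors, with errlevel34 turned off. */
%show_pass(step=D, gmarsh=GMARSH3, gmarsh2=GMARSH4, errlevel1=1)
%show_pass(step=E, gmarsh=GMARSH4, gmarsh2=GMARSH5, errlevel1=1)

/* Step F, last resort only: zero out whole nuclear families for a marker. */
/* GMARSH6 is an assumed output label.                                      */
%show_pass(step=F, gmarsh=GMARSH5, gmarsh2=GMARSH6, errlevel1=1, strongassump1=1)
```

Keeping a distinct output name for every pass, as the steps recommend, means an earlier dataset is never overwritten and any pass can be repeated from its input.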
Interface and parameters to be provided by the user:

*-----------------------------------------------------------------------------------------;
* Program: pedchin.sas ;
* By: Aldi Kraja and Michael Province ;
* Purpose: Use of the program pedcheck of O-Connell written in c-c++ ;
*          as the first source for finding inconsistencies in the marker data ;
* Notes: ;
* Improved: Version 1: 03.14.06 ;
*-----------------------------------------------------------------------------------------;
/******************************************************************
*******************************************************************
* A. INPUT:
*----------------------------------------------------------------*
* 1. Provide the name of the marker dataset that will have
*    marker names; function: they will be used for
*    check on inconsistencies. The markers for chrom
*    23 (X) and chrom 25 (Y) will be eliminated.
* -* default: GMARSH=,
*
* 2. Provide the name of new corrected marker dataset. Do not write
*    the same name as GMARSH above, because you can overwrite the source
*    SAS dataset.
* -* default: gmarsh2=,
*
* 3. Provide the name of the set that will have any possible inconsistencies
*    in the marker data and which will be corrected.
* -* default: pederr=pederr,
*
* 4. Provide the name of the dataset that has frequencies
*    of marker alleles; function: internally the alleles will be recoded
*    to new numbers starting from 1 and not 70, 150 etc.
*    and the percent will be divided by 100:
* -* default: MARSHFRQ=,
*
* 5. Provide the name of dataset that keeps markname and
*    marshmap variables; function: this will be used
*    for matching marshmap and markname.
*    True variable is marshmap and created variable
*    is markname.
*    Internally we will use markname.
* -* default: MARSHLOC=,
*
* 6. Provide the name of the dataset where you want to check
*    for the CHD or any other trait (only one pheno) that will
*    be used to assess affection in terms 2-affected, 1-not affected, 0-unknown.
*    For example in our case we will use chddata
* -* default: CHDDAT=,
*
* 7. Provide the name of the map dataset
* -* default: MARSHMAP=,
*
* 8. What is the pheno that will be studied for affection or not?
*    CHD is assumed to be 2 for affected and 1 not affected and 0 unknown
* -* default: CHD=,
*
* 9. Write the affected value: (You can write a 2 depending on affection
*    values used in the data for CHD. If affection in the CHD variable was
*    coded as 2, then affected=2 is the right choice.)
* -* default: affected=2,
*
* 10. Provide the dataset where we will get these variables:
*     PEDID (or FAMID), SUBJECT, DADSUBJ (or FID), MOMSUBJ (or MID) and SEX.
*     In FHS case: g1triple
* -* default: TRIPLET=g1triple,
*
* 11. What is the unique identification for the subjects in the TRIPLET dataset?
*     If subject write subject, if id write id.
* -* default: SUBJECT=subject,
*
* 12. Provide the label for the father id in the TRIPLET dataset:
* -* default: FATHERID=dadsubj,
*
* 13. Provide the label for the mother id in the TRIPLET dataset:
* -* default: MATHERID=momsubj,
*
* 14. PEDID in the TRIPLET dataset:
* -* default: PEDID=pedid,
*
* 15. SEX numeric 1 and 2 or M and F or MALE and FEMALE in the TRIPLET dataset
* -* default: SEX=sex,
*
* B. OUTPUT:
*----------------------------------------------------------------*
* 16. What is the directory you want to write the locFHS&i.dat and pedFHS&i.ped?
*     It will be used also as the path for reading the errors from pedcheckNR.err
* -* default: dir1=,
*
* 17. dir2=,
*     it will be used as default for the output of SAS datasets
* -* default: dir2=,
*
* 18. What do you want to call the pedfile that will be put in the DIR1?
*     USE NO MORE THAN SIX LETTERS TO WRITE THE BODY OF THE FILENAME.
*     I will use the last two positions
*     for putting numbers from 1 to 22, separate for each chromosome.
*     So the output will be with &PEDFILE1-22.ped
* -* default: PEDFILE=pedFHS,
*
* 19. What do you want to call the locfile that will be put in the DIR1?
*     USE NO MORE THAN SIX LETTERS TO WRITE THE BODY OF THE FILENAME.
*     I will use the last two positions
*     for putting numbers from 1 to 22, separate for each chromosome.
*     So the output will be with &LOCFILE1-22.dat
* -* default: LOCFILE=locFHS,
*
* 20. How many chromosomes you want to check. Default: AUTOSOMAL.
*     You can use also CHROM and a number, for specific chromosome.
* -* default: markers=AUTOSOMAL,
*
* 21. Provide the markname variable
* -* default: markvar=markname,
*
* 22. Provide the marshname variable
* -* default: marshvar=marshmap,
*
* 23. Provide the format value for markname and marshname (one value)
*     for example 12.
* -* default: fmtmarkname=,
*
* 24. Provide the choice (1 or 2) if you want Us or Pedcheck to do the
*     nuclearization for these data. In both cases the inconsistencies of
*     markers are corrected based on relationships within nuclear families.
*     For sure the end result will be the regular family structure you
*     provide with the data.
* -* default: UsOrPedcheckNuk=1,
*
* 25. Provide the name of the PedCheck executable, assuming that it is
*     present in the general path, otherwise provide the full path and name
* -* default: pedcheckname=pedcheck_linux,
*
* Quality checks
* ==============
* 26. errlevel34=1 from the first run until the level 3/4 errors are finished.
* -* default: errlevel34=1,
*
* 27. strongassump34=0 or 1, the default is 0. This means that the elimination of
*     errors of type 3-4 will be gradual (soft) by selecting those with odds against with
*     the smallest values. Only when, after running the program let-s say 3 times,
*     you see that some error of type 3-4 is not eliminated, it can be a case of this kind:
*     example: 14469(grandfather)--14470(grandmother)-->14471(mother)[0/0]-65226(father)-->65241(child)[4/4]
*     In this case pedcheck is reporting mother 14471 is problematic with odds against 1
*     and 65241 with odds against 6.
*     If we apply only the soft elimination in this case we never eliminate the problem.
*     Only in such a case it is permitted to use strongassump34=1, which will zero out
*     markers of all subjects reported with inconsistencies.
* -* default: strongassump=0,
*
* 28. errlevel1=1 if level 1 errors remain after one has worked all possible
*     inconsistencies of level 3/4.
*     You have to turn errlevel34=0. AS USUAL THE ORDER OF ERRLEVEL IS VERY IMPORTANT: START WITH
*     ERRLEVEL34=1, after a few reruns continue with ERRLEVEL1.
* -* default: errlevel1=0,
*
* 29. strongassump1=1 if level 1 remains with errors that are no longer resolved with
*     the above switches. This step will zero out subjects of a nuclear fam. for a specific marker.
*     This is a very strong assumption, so try well the above choices before this last one.
* -* default: strongassump1=0,
*******************************************************************
******************************************************************/

1. O'Connell JR, Weeks DE. PedCheck: a program for identifying genotype incompatibilities in linkage analysis. Am J Hum Genet 63:259-266.
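
Item 4 of the interface says that the allele frequency dataset is recoded internally: alleles are renumbered from 1 within each marker and percent is divided by 100. The sketch below shows one plausible reading of that recoding and is not the code used inside pedchin.sas. The dataset name genefreq_recoded, the variables newallele, freq and total, and the 0.01 tolerance are assumptions made for the illustration; the five input rows are the ATC4D09 rows from the GENEFREQ example.

```sas
/* Small GENEFREQ-style dataset: the ATC4D09 rows from the example table. */
data genefreq;
  length markname $12;
  input markname $ allele percent;
  datalines;
ATC4D09 182 0.025497
ATC4D09 194 34.80367
ATC4D09 197 8.618052
ATC4D09 200 55.73687
ATC4D09 203 0.81591
;
run;

/* BY-group processing below needs the data sorted by marker and allele. */
proc sort data=genefreq;
  by markname allele;
run;

/* Assumed recoding: number alleles 1, 2, ... within each marker and */
/* turn percent into a proportion. Also accumulate the percent total */
/* per marker and warn in the log if it is not close to 100.         */
data genefreq_recoded;
  set genefreq;
  by markname;
  if first.markname then do;
    newallele = 0;
    total = 0;
  end;
  newallele + 1;          /* sum statement: value is retained across rows */
  total + percent;
  freq = percent / 100;
  if last.markname and abs(total - 100) > 0.01 then
    put "WARNING: percent total for " markname "is " total;
run;

proc print data=genefreq_recoded noobs;
  var markname allele newallele percent freq;
run;
```

Running a check like this before the first pass is a cheap way to catch a marker whose percentages do not add up to 100, which would otherwise surface only as odd behavior later on.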