ReadMe

1. Quality Control (QC)

1.1 Obtain the data

The genotype data for Crohn’s Disease (CD) were obtained from the Wellcome Trust Case Control Consortium (WTCCC) website: http://www.wtccc.org.uk/ . Permission from the WTCCC is required to obtain and use the data.

1.2 Process the data

(1) Genotype data

Quality control (QC) was performed using the R package “GenABEL”. An example of the input dataset is shown below:

Name    Chr  Pos    id1  id2  id3
rs1001  2    12897  AC   AA   AA
rs2401  3    12357  AG   GG   AG
rs123   3    5327   TC   TT   CC

Here, every row corresponds to a SNP, and each column, starting with the 4th, corresponds to a subject. The genotype data must be converted into this format.

The WTCCC provides a separate dataset for each of the 22 chromosomes. Each dataset is in the long format: every single nucleotide polymorphism (SNP) is repeated in N rows, where N is the number of subjects. First, the dataset should be converted to the wide format, so that each row contains the data for one SNP and each column gives one subject's genotype for that SNP. Genotypes with a confidence score <0.9 were regarded as missing. We followed the WTCCC’s recommendations, based on sample call rates and evidence of recent non-European ancestry, to exclude cases and controls (the exclusion lists from the WTCCC website are exclusion-list-0502-2007-CD.txt, exclusion-list_05-02-2007.txt, and exclusion-list-snps26_04_2007.txt). The genotype data after the above processing are saved as “genoheaderChro*.txt”, where * is the chromosome number.

(2) Phenotype data

The phenotype data contain all phenotypic information in a data frame. Rows of this data frame correspond to study subjects, and columns correspond to phenotypic variables. Two default variables are always present: "id" and "sex", which are the subject's identification code and sex, respectively. Males are coded as one ("1") and females as zero ("0").
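The long-to-wide conversion of the genotype data described in 1.2 (1) can be sketched as follows. This is a minimal illustration in Python, not the authors' R code. The record layout (SNP, subject, genotype, confidence score), the function name long_to_wide, and the "NA" missing-value placeholder are assumptions made for the example. The Chr and Pos columns are omitted for brevity.

```python
# Hypothetical long-format records: (snp, subject, genotype, confidence).
# The real WTCCC file layout may differ; these fields are assumptions.
MISSING = "NA"  # placeholder for a missing call (an assumption)


def long_to_wide(records, min_confidence=0.9):
    """Pivot long-format genotype calls into one row per SNP."""
    snps = {}      # SNP name -> {subject: genotype}, in first-seen order
    subjects = {}  # used as an ordered set of subject ids
    for snp, subject, genotype, confidence in records:
        subjects.setdefault(subject, None)
        calls = snps.setdefault(snp, {})
        # Calls with a confidence score below the threshold become missing.
        calls[subject] = genotype if confidence >= min_confidence else MISSING
    header = ["Name"] + list(subjects)
    rows = [
        [snp] + [calls.get(subject, MISSING) for subject in subjects]
        for snp, calls in snps.items()
    ]
    return header, rows


records = [
    ("rs1001", "id1", "AC", 0.99),
    ("rs1001", "id2", "AA", 0.50),  # below 0.9, so treated as missing
    ("rs2401", "id1", "AG", 0.95),
    ("rs2401", "id2", "GG", 0.90),
]
header, rows = long_to_wide(records)
```

Each element of rows is one SNP, with one entry per subject in the order given by header, which matches the wide layout shown in the example table.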
For more information, please refer to the GenABEL package tutorial: http://www.genabel.org/tutorials/ABEL-tutorial.

1.3 Filters used in QC

- SNPs with call rates below 95% were removed.
- SNPs were also removed based on their minor allele frequencies: the default minor allele frequency cutoff in the GenABEL R package was used (2.5/N, where N is the number of subjects).
- We used a cutoff of 0.2 for the false discovery rate of the Hardy-Weinberg Equilibrium (HWE) test, based on combined cases and controls.

Please refer to the example code “QC and LD example.R” for the QC steps.

2. Linkage Disequilibrium

To reduce redundancy among the logic trees that the genotype indicators of SNPs within a gene can form, we removed SNPs within each gene sequentially until no pair of remaining SNPs within the gene had linkage disequilibrium of r² ≥ 0.8. SNP-gene mapping files were obtained from the OpenBioinformatics website (http://www.openbioinformatics.org/gengen/tutorial_calculate_gsea.html#_Toc210887414). Please refer to the example code “QC and LD example.R” for the LD procedure.

3. Logic regression

We performed the analysis using the subjects and SNPs that passed the QC and LD steps above. The R code “logic regression example.R” is provided on our website, http://www.ualberta.ca/~yyasui/homepage.html, with an example for the gene “AARS2” with 495 controls and 251 cases. The genotype data after QC and LD were converted into binary predictors: for each SNP, an indicator of the minor-allele homozygous genotype and an indicator of either the heterozygous or the minor-allele homozygous genotype. Please refer to our example dataset “binary.txt”. Logic regression models the outcome (e.g., the disease status in a case-control study) with intersections and/or unions of these binary predictors.
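The two binary predictors per SNP described above can be sketched as follows. This is an illustration in Python, not the authors' R code or the exact layout of “binary.txt”. The function names, the labels "recessive" and "dominant", and the handling of missing genotypes (passed through as None) are assumptions.

```python
from collections import Counter


def minor_allele(genotypes):
    """Return the less frequent allele among non-missing genotype calls.

    Assumes the SNP has exactly two alleles with different frequencies,
    which holds for SNPs that passed a minor allele frequency filter
    and are not perfectly balanced.
    """
    counts = Counter(allele for g in genotypes if g for allele in g)
    # most_common() sorts by descending count; the last entry is the minor allele.
    return counts.most_common()[-1][0]


def binary_predictors(genotypes):
    """Code one SNP as two 0/1 indicators per subject."""
    minor = minor_allele(genotypes)
    recessive, dominant = [], []
    for g in genotypes:
        if g is None:  # missing call: left missing (an assumption)
            recessive.append(None)
            dominant.append(None)
            continue
        n_minor = g.count(minor)
        recessive.append(int(n_minor == 2))  # minor-allele homozygous
        dominant.append(int(n_minor >= 1))   # heterozygous or minor-allele homozygous
    return recessive, dominant


genotypes = ["AA", "AC", "CC", "AA", "AA"]  # C is the minor allele here
recessive, dominant = binary_predictors(genotypes)
```

With this coding, every subject flagged by the first indicator is also flagged by the second, so the two predictors for a SNP are nested rather than independent.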
To correct for the inherent instability of the performance measure when searching a large space, we refit the logic regression 20 times, starting the algorithm from 20 different initial values. This process was applied to the original dataset as well as to 20 datasets obtained by permuting the case-control labels. Of the 20 results produced by the 20 starting values, we selected the best fit, as measured by deviance. The result for one gene is a vector of length 21 (see the vector “Dev_per” in the R code): the first element is the best fit for the original case-control labels, and the remaining 20 elements are the best fits for the 20 permuted case-control labels, respectively.

4. P-value

In Step 3 above, we calculated the 21 best-fit deviances (dev) for one gene: one for the original labels and 20 for the permuted labels. We do this for all genes in the study and save the results in a data frame with 21 rows and m columns, where m is the number of genes in the study. Please refer to our example “DevAll.txt” for the dev results. The P-value for each gene is calculated as the proportion of all permuted best-fit (BF) values, across all genes, that are larger than the gene’s observed BF. This P-value calculation properly takes multiple testing into account. Please see our example code for the P-value calculation, “pvaue calculation.R”.
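The pooled P-value calculation described in Step 4 can be sketched as follows. This is an illustration in Python, not the authors' R script. The function name and the toy values are hypothetical, and the table is assumed to have the observed values in the first row and the permuted values in the remaining rows. The comparison follows the sentence above literally ("larger than"); if the stored statistic is one for which smaller values indicate a better fit, the comparison would need to be reversed.

```python
def pooled_permutation_p_values(dev_all):
    """Compute one P-value per gene from a (1 + n_perm) x m table.

    dev_all[0] holds the observed best-fit value for each of the m genes;
    dev_all[1:] holds the best-fit values under permuted labels.
    """
    observed = dev_all[0]
    # Pool the permuted values of all genes into one reference distribution.
    permuted = [value for row in dev_all[1:] for value in row]
    return [
        sum(p > obs for p in permuted) / len(permuted)
        for obs in observed
    ]


# Toy table: 1 observed row and 2 permuted rows for 2 genes
# (the real table has 21 rows and m columns).
dev_all = [
    [5.0, 1.0],  # observed
    [2.0, 3.0],  # permutation 1
    [6.0, 4.0],  # permutation 2
]
p_values = pooled_permutation_p_values(dev_all)
```

Pooling the permuted values of all genes gives a reference distribution of 20 × m values, rather than only the 20 permuted values of a single gene.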
