Analyzing human variation with Galaxy
Belinda Giardine and Cathy Riemer
Feb 8, 2012

Outline

Part 1: Preparing input data
    Uploading files
    Using Galaxy libraries
    Basic filtering
Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways
    PolyPhen-2
    Gene-based analysis
Part 3: Running new predictions of coding SNPs likely to be detrimental
    SIFT
    Workflows
Part 4: Finding SNPs that fall in suspected functional regions
    Predicted regulatory regions, ENCODE functional data, phyloP conserved positions

Fake example dataset
    SNP calls from Complete Genomics GS12880
    5 known disease variants added for illustration
    Various genes and parts of the gene (coding, regulatory, splicing, …)
    Realistic background for search, but not a realistic SNP combination

Uploading a file
Converting file format
Shared data
Importing datasets from library
Filtering SNPs
Filter results

Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways

PolyPhen-2 pre-computed predictions
PolyPhen-2
Filtering PolyPhen-2 results
PolyPhen-2 results
Linking identifiers
Identifier fields
Join identifiers to result
Comparative Toxicogenomics Database (CTD)

Part 3: Running new predictions of coding SNPs likely to be detrimental

SIFT inputs
Shared data
Workflow
Your workflows
Running the workflow
Running SIFT
Filter SIFT results
SIFT results

Part 4: Finding SNPs that fall in suspected functional regions

Import predicted regulatory regions
Filter with intersect tool
PRPs results
Using ENCODE data
Again filter with intersect
DNase HSS results
Conservation
Histogram of phyloP scores
Filter on phyloP greater than or equal to 0.5
phyloP results

What we covered

Part 1: Preparing input data
Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways
Part 3: Running new predictions of coding SNPs likely to be detrimental
Part 4: Finding SNPs that fall in suspected functional regions

Editing the dataset name and build

Overview of part 1: Preparing input data

a. We start with a full set of SNP calls from a particular individual, in the masterVar format used by Complete Genomics.
b. Upload and import the file to our Galaxy history via FTP or URL.
c. Convert to pgSnp format.
d. Subtract some SNPs found in healthy individuals to narrow down the search.

Overview of part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways

a. Import a public library file containing pre-computed results from running PolyPhen-2 on the dbSNP database.
b. Join our input file with the PolyPhen-2 results row by row, based on interval overlap. This adds new columns to our input, including the UniProt protein accession and the predicted effect of the SNP.
c. Filter the results to select rows containing the word “damaging”.
d. Translate the UniProt accessions to HUGO gene symbols by joining with an identifier table imported from UCSC.
e. Run the CTD tool to extract curated pathway associations for these genes from the Comparative Toxicogenomics Database.

Overview of part 3: Running new predictions of coding SNPs likely to be detrimental

a. Augment our input dataset (from part 1) with a new column containing both the variant and reference alleles, by running the shared workflow “Prep pgSnp file to run SIFT”.
b. Run the SIFT tool, keeping the original allele column as a comment, and requesting the gene name, OMIM disease, etc. in the output.
c. Filter the results to select rows containing the word “DAMAGING”.

Overview of part 4: Finding SNPs that fall in suspected functional regions

a. Filter our input dataset (from part 1) to keep only rows whose intervals intersect (i.e., overlap) those in the library dataset of predicted regulatory regions.
b. Filter our input dataset (from part 1) to keep only rows whose intervals intersect those in an ENCODE regulatory dataset (DNase clusters) imported from UCSC.
c. Run the phyloP tool on our input dataset (from part 1) to add a column of interspecies conservation scores. Then use the Histogram tool to help choose a suitable score threshold, and filter the SNPs on the score column to keep only those at highly conserved positions.
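
For readers who want to see the logic behind the part 4 filters outside Galaxy, here is a minimal Python sketch. It is not how the Galaxy intersect or filter tools are implemented. The row layout (chrom, start, end, score), the half-open coordinates, the function names, and all sample values are assumptions made up for illustration. Only the 0.5 threshold comes from the slides.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def intersect_filter(snps, regions):
    """Keep SNP rows that overlap at least one region on the same chromosome.

    Assumed layout: each SNP row starts with (chrom, start, end);
    each region is exactly (chrom, start, end).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    return [
        row for row in snps
        if any(overlaps(row[1], row[2], s, e) for s, e in by_chrom.get(row[0], ()))
    ]


def filter_by_score(rows, score_index, threshold=0.5):
    """Keep rows whose score column is greater than or equal to the threshold."""
    return [row for row in rows if float(row[score_index]) >= threshold]


# Hypothetical SNP rows: (chrom, start, end, conservation score).
snps = [
    ("chr1", 100, 101, 1.2),
    ("chr1", 500, 501, 0.1),
    ("chr2", 100, 101, 2.3),
    ("chr2", 900, 901, 0.7),
]
# Hypothetical functional regions (e.g. predicted regulatory regions).
regions = [("chr1", 50, 150), ("chr1", 450, 550), ("chr2", 50, 150)]

# As in the overview, each filter is applied to the input dataset on its own.
in_regions = intersect_filter(snps, regions)
conserved = filter_by_score(snps, score_index=3)
```

Scanning every region per SNP is fine for a small example. For genome-scale data, a sorted or indexed interval structure would be the usual choice.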