Analyzing human variation with Galaxy
Belinda Giardine and Cathy Riemer
Feb 8, 2012
Outline
• Part 1: Preparing input data
  • Uploading files
  • Using Galaxy libraries
  • Basic filtering
• Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways
  • PolyPhen-2
  • Gene-based analysis
• Part 3: Running new predictions of coding SNPs likely to be detrimental
  • SIFT
  • Workflows
• Part 4: Finding SNPs that fall in suspected functional regions
  • Predicted regulatory regions, ENCODE functional data, phyloP conserved positions
Fake example dataset
• SNP calls from Complete Genomics GS12880
• 5 known disease variants added for illustration
  • Various genes and parts of the gene (coding, regulatory, splicing, …)
• Realistic background for search, but not a realistic SNP combination
Uploading a file
Converting file format
Shared data
Importing datasets from library
Filtering SNPs
Filter results
Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways
PolyPhen-2 pre-computed predictions
PolyPhen-2
Filtering PolyPhen-2 results
PolyPhen-2 results
Linking identifiers
Identifier fields
Join identifiers to result
Comparative Toxicogenomics Database (CTD)
Part 3: Running new predictions of coding SNPs likely to be detrimental
SIFT inputs
Shared data
Workflow
Your workflows
Running the workflow
Running SIFT
Filter SIFT results
SIFT results
Part 4: Finding SNPs that fall in suspected functional regions
Import predicted regulatory regions
Filter with intersect tool
PRPs results
Using ENCODE data
Again filter with intersect
DNase HSS results
Conservation
Histogram of phyloP scores
Filter on phyloP greater than or equal to 0.5
phyloP results
What we covered
• Part 1: Preparing input data
  • Uploading files
  • Using Galaxy libraries
  • Basic filtering
• Part 2: Selecting known coding SNPs predicted to be damaging, then finding their genes and associated pathways
  • PolyPhen-2
  • Gene-based analysis
• Part 3: Running new predictions of coding SNPs likely to be detrimental
  • SIFT
  • Workflows
• Part 4: Finding SNPs that fall in suspected functional regions
  • Predicted regulatory regions, ENCODE functional data, phyloP conserved positions
Editing the dataset name and build
Overview of part 1
Preparing input data
a. We start with a full set of SNP calls from a particular individual, in
the masterVar format used by Complete Genomics.
b. Upload and import the file to our Galaxy history via FTP or URL.
c. Convert to pgSnp format.
d. Subtract some SNPs found in healthy individuals to narrow down
the search.
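Step d can be pictured as a set difference on SNP positions. The Python sketch below is only an illustration, not Galaxy's own subtraction tool. It assumes each row holds four pgSnp-style columns (chrom, start, end, alleles) and drops a SNP only when its exact position appears in the background set. The function name and the example rows are hypothetical.

```python
def subtract_snps(candidates, background):
    """Return candidate rows whose (chrom, start, end) is absent from background."""
    # Collect the positions seen in healthy individuals (assumed exact-match only).
    seen = {(chrom, start, end) for chrom, start, end, *_ in background}
    return [row for row in candidates if (row[0], row[1], row[2]) not in seen]


# Made-up rows: (chrom, start, end, alleles)
candidates = [
    ("chr1", 100, 101, "A/G"),
    ("chr1", 250, 251, "C"),
    ("chr2", 500, 501, "T/C"),
]
background = [("chr1", 250, 251, "C")]

remaining = subtract_snps(candidates, background)
for row in remaining:
    print(*row, sep="\t")
```

A real subtraction may also need to compare alleles or allow partial overlaps. This sketch ignores both.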
Overview of part 2
Selecting known coding SNPs predicted to be damaging, then finding their genes
and associated pathways.
a. Import a public library file containing pre-computed results from running
PolyPhen-2 on the dbSNP database.
b. Join our input file with the PolyPhen-2 results row-by-row, based on interval
overlap. This adds new columns to our input, including the UniProt protein
accession and the predicted effect of the SNP.
c. Filter the results to select rows containing the word “damaging”.
d. Translate the UniProt accessions to HUGO gene symbols by joining with an
identifier table imported from UCSC.
e. Run the CTD tool to extract curated pathway associations for these genes from
the Comparative Toxicogenomics Database.
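Steps b through d amount to an interval-overlap join, a text filter, and an identifier lookup. The Python sketch below shows that logic on toy data. It is not how the Galaxy tools are implemented. The accessions ("ACC_0001"), gene symbols ("GENE_A"), and column layout are invented for illustration, and intervals are assumed to be half-open (start inclusive, end exclusive).

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap when each one starts before the other ends."""
    return a_start < b_end and b_start < a_end


def join_on_overlap(snps, predictions):
    """Append accession and predicted effect to every SNP that overlaps a prediction."""
    joined = []
    for chrom, start, end, alleles in snps:
        for p_chrom, p_start, p_end, accession, effect in predictions:
            if chrom == p_chrom and overlaps(start, end, p_start, p_end):
                joined.append((chrom, start, end, alleles, accession, effect))
    return joined


# Made-up inputs; real files have more columns.
snps = [
    ("chr1", 100, 101, "A/G"),
    ("chr2", 500, 501, "T/C"),
    ("chr3", 900, 901, "G"),
]
predictions = [
    ("chr1", 100, 101, "ACC_0001", "probably damaging"),
    ("chr2", 500, 501, "ACC_0002", "benign"),
]
# Hypothetical accession-to-symbol table standing in for the UCSC identifier table.
accession_to_symbol = {"ACC_0001": "GENE_A", "ACC_0002": "GENE_B"}

joined = join_on_overlap(snps, predictions)                  # step b
damaging = [row for row in joined if "damaging" in row[5]]   # step c
genes = sorted(                                              # step d
    {accession_to_symbol[row[4]] for row in damaging if row[4] in accession_to_symbol}
)
print(genes)
```

The nested loop is quadratic and only suitable for a toy example. The resulting gene list is what would be handed to the CTD tool in step e.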
Overview of part 3
Running new predictions of coding SNPs likely to be detrimental
a. Augment our input dataset (from part 1) to add a new column
containing both the variant and reference alleles, by running the
shared workflow “Prep pgSnp file to run SIFT”.
b. Run the SIFT tool, keeping the original allele column as a comment,
and requesting the gene name, OMIM disease, etc. in the output.
c. Filter the results to select rows containing the word “DAMAGING”.
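Steps a and c can be sketched in Python as follows. This is a guess at the kind of transformation involved, not a copy of the shared workflow. It assumes the reference base for each position is already known, that the new column is written as "reference/variant", and that the prediction is the last column of each result row. All names and rows are hypothetical.

```python
def add_allele_pairs(row, reference_base):
    """Return one augmented row per non-reference allele, adding a 'ref/variant' column."""
    chrom, start, end, observed = row
    augmented = []
    for allele in observed.split("/"):
        if allele != reference_base:
            # Keep the original allele column and append the new pair (assumed order).
            augmented.append((chrom, start, end, observed, f"{reference_base}/{allele}"))
    return augmented


def keep_damaging(results):
    """Keep rows whose last column contains the upper-case word DAMAGING."""
    # The match is case-sensitive, so lower-case 'damaging' would not be kept.
    return [r for r in results if "DAMAGING" in r[-1]]


# Step a on a made-up heterozygous call where the reference base is T.
prepared = add_allele_pairs(("chr1", 100, 101, "C/T"), reference_base="T")
print(prepared)

# Step c on made-up result rows (not real SIFT output).
results = [
    ("chr1", 100, 101, "C/T", "T/C", "DAMAGING"),
    ("chr2", 500, 501, "G", "A/G", "TOLERATED"),
]
print(keep_damaging(results))
```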
Overview of part 4
Finding SNPs that fall in suspected functional regions
a. Filter our input dataset (from part 1) to keep only rows whose intervals
intersect (i.e. overlap) those in the library dataset of predicted regulatory
regions.
b. Filter our input dataset (from part 1) to keep only rows whose intervals
intersect those in an ENCODE regulatory dataset (DNase clusters) imported
from UCSC.
c. Run the phyloP tool on our input dataset (from part 1) to add a column of
interspecies conservation scores. Then use the Histogram tool to help
choose a suitable score threshold, and filter the SNPs on the score column
to keep only those at highly conserved positions.
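The three steps above share two operations: keeping SNPs that overlap a set of regions, and thresholding on a score column. The Python sketch below illustrates both, with a coarse histogram to help pick the cutoff. It is not the Galaxy tools themselves. The rows and region coordinates are made up, intervals are assumed half-open, and the default threshold of 0.5 is the one used on the phyloP filter slide.

```python
import math
from collections import Counter


def intersect(snps, regions):
    """Keep SNP rows that overlap at least one (chrom, start, end) region."""
    kept = []
    for snp in snps:
        chrom, start, end = snp[0], snp[1], snp[2]
        if any(
            chrom == r_chrom and start < r_end and r_start < end
            for r_chrom, r_start, r_end in regions
        ):
            kept.append(snp)
    return kept


def score_histogram(scores, bin_width=0.5):
    """Count scores per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(math.floor(s / bin_width) for s in scores)


def filter_by_score(scored_snps, threshold=0.5):
    """Keep rows whose last column (the conservation score) is >= threshold."""
    return [row for row in scored_snps if row[-1] >= threshold]


# Made-up SNPs and regions.
snps = [("chr1", 100, 101, "A/G"), ("chr1", 300, 301, "C"), ("chr2", 500, 501, "T/C")]
regions = [("chr1", 50, 150), ("chr2", 400, 450)]
in_regions = intersect(snps, regions)

# Made-up scores appended as a final column, as a scoring tool might do.
scored = [
    ("chr1", 100, 101, "A/G", 2.1),
    ("chr1", 300, 301, "C", -1.2),
    ("chr2", 500, 501, "T/C", 0.5),
]
histogram = score_histogram(row[-1] for row in scored)
conserved = filter_by_score(scored)
print(in_regions, dict(histogram), conserved, sep="\n")
```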