NIEHS SNPs Workshop
January 10, 2008

All of the workshop presentations, web links and files for exercises are available at: http://egp.gs.washington.edu/workshop/download/

Interactive Tutorial 1: SNP Database Resources

The following tutorial is designed to help you explore the SNP data available for candidate genes and illustrate the various NIEHS datasets and tools available from the Environmental Genome Project (EGP) and the other resources described in this workshop. As a launching point, we will begin our search at the NIEHS SNPs resource. This can be accessed at http://egp.gs.washington.edu/

The NIEHS SNPs Program at the University of Washington is part of the EGP. The EGP is a multi-disciplinary effort focused on exploring the relationships between environmental exposures, inter-individual sequence variation and disease risk. The NIEHS SNPs Program is systematically identifying and genotyping single nucleotide polymorphisms (SNPs) in environmental response genes. This resource provides dense genetic maps of more than 600 key genes that can be applied in evaluating human disease risk with environmental exposures.

For these exercises, we will be accessing data for the gene nitric oxide synthase 2A (NOS2A).

Note: Answers to questions from this tutorial are included at the end of this document.

NIEHS SNPs
http://egp.gs.washington.edu

The variation data for the NOS2A gene can be accessed through the search box on the top right part of the NIEHS SNPs home page, or via the ‘A-Z Finished Genes Directory’ link in the left hand navigation bar. Using this last link, find NOS2A in the alphabetical listing of genes.

1. Under the ‘Mapping Data’ section, click on the ‘cSNPs’ link.
2. How many non-synonymous cSNPs were discovered in this gene? What is the position in our reference sequence of the first synonymous cSNP?
3. What is the cDNA position of this synonymous cSNP?
4. In which population was it discovered?
5. Go ‘BACK’.
In the ‘Genotyping Data’ category, click on the ‘Visual Genotype’ link. An image of all of the genotyping data for this gene is displayed. Using the SNP location of the synonymous SNP, determine which individual carries this polymorphism.
6. Explore other links in the ‘Mapping Data,’ ‘Genotyping Data’ and ‘Predictive Analyses’ sections. The ‘Linkage Data’ and ‘Haplotype Data’ will be covered in a subsequent talk and tutorial. In the predictive analyses, are any of the sites possibly damaging and predicted to be intolerant? Which? Are any of these polymorphisms common in the population (above a minor allele frequency [MAF] of 5%)? Which site(s) and at what frequency? Bonus: In which population(s) does the MAF exceed 5%?

In the cSNPs table, the frequency columns are:
- first AD-freq: African Descent (African-Americans)
- second AD-freq: African Descent (Yorubans of Africa and HapMap samples)
- ED-freq: European Descent (CEPH and HapMap samples)
- HD-freq: Hispanic Descent (Mexican-Americans)
- XD-freq: Asian Descent (split between Chinese and Japanese HapMap samples)

GeneSNPs
http://www.genome.utah.edu/genesnps/

The GeneSNPs resource integrates gene, sequence and polymorphism data from the NIEHS SNPs project, dbSNP and other resources into individually annotated gene models.

1. Select ‘Cell Cycle’ from the ‘Gene List’ drop down menu.
2. Select the ‘UCSC:hg16:4’ Gene Model link for the gene CCNA2.
3. What is the orientation of the gene? Hint: Look at the arrows across the gene model.
4. In the ‘All Submitter’ drop down menu, select EGP_SNPS.
5. Scroll down the list of SNPs to the first non-synonymous cSNP. In what codon and which position of the codon is this SNP?
6. On what population was this gene re-sequenced? Click on the GS link at the right. What is the genotype of the sample P012?
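
The 5% MAF threshold used in the NIEHS SNPs exercise above can be illustrated with a short calculation. The following is a minimal Python sketch and is not part of any of the workshop tools; the function names and the genotype counts are hypothetical and chosen only for illustration. It assumes a biallelic SNP genotyped in diploid samples.

```python
def minor_allele_frequency(hom_common, het, hom_rare):
    """Return the minor allele frequency (MAF) from diploid genotype counts.

    hom_common, het, hom_rare are the numbers of samples that are homozygous
    for the first allele, heterozygous, and homozygous for the second allele.
    """
    n_alleles = 2 * (hom_common + het + hom_rare)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    # Each heterozygote carries one copy of the second allele,
    # each rare homozygote carries two.
    second_allele_count = het + 2 * hom_rare
    freq = second_allele_count / n_alleles
    # The minor allele is whichever allele is less frequent.
    return min(freq, 1.0 - freq)


def is_common(maf, cutoff=0.05):
    """True if the MAF is above the cutoff (5% by default, as in the exercise)."""
    return maf > cutoff


if __name__ == "__main__":
    # Hypothetical panel: 20 common homozygotes, 2 heterozygotes, 0 rare homozygotes.
    maf = minor_allele_frequency(20, 2, 0)
    print(round(maf, 3), is_common(maf))
```

Note that the tutorial uses two slightly different conventions: "above 5%" in this exercise and "5% or greater" in the Genome Variation Server exercise. Change the comparison in `is_common` to `>=` if the inclusive cutoff is wanted.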

PolyPhen
http://genetics.bwh.harvard.edu/pph/

PolyPhen predicts the potential consequence of an amino acid substitution on the structure and function of a human protein using physical and comparative considerations.

1. Enter the amino acid sequence for BRCA1 from the FASTA file available at http://egp.gs.washington.edu/workshop/BRCA1.protein.fsa. Include the first line starting with the ‘>’ character.
2. Enter 356 for position, Q (glutamine) and R (arginine) for AA1 and AA2, then “Process query.”
3. The PolyPhen site returns its prediction, based on alignments of both polypeptide sequences to sequences in the SwissProt database, the potential for disrupting known structural motifs (coils, active sites, disulfide bridges, phosphorylation sites, etc.), and the steric changes to the three-dimensional structure. What impact is predicted for this amino acid change?
4. Explore the definitions of score parameters by clicking on the headers of the output tables. The headers are linked to documentation explaining the scoring and alignment algorithms.
5. Return to the NIEHS SNPs EGP web site and use the ‘A-Z Finished Genes Directory’ link to list the “B” genes. Go to BRCA1.
6. Click on the non-synonymous cSNP analysis link in the Predictive Analysis section and examine the PolyPhen and SIFT predictions.
7. Which substitutions are predicted by PolyPhen to be damaging? Which substitutions are predicted by SIFT to be intolerant?

PolyDoms
http://polydoms.cchmc.org/polydoms/

PolyDoms is an integrated database of human cSNPs and their annotations.

1. Select ‘Gene/Protein Symbol’ from the drop down box in the ‘Basic Search’ section of the page. Enter BRCA1 in the search field. Leave the Filter options selection at the default setting for ‘Non-synonymous SNPs.’ Select ‘Search.’
2. In the Search Results section, “11 proteins (1 unique)” should be listed. Click on the ‘NonSynonymous’ link of the first Gene Description (protein NP_009225).
3.
Find the GLN 356 ARG substitution in the gene diagram and click on the gray box corresponding to the 356 allele. In the table, find the entries for the GLN 356 ARG allele and list the non-EGP submitters who have also identified this variant.
4. Use the color key under the gene schematic to find the “mutation.” What substitutions are coded as mutations by OMIM? Which are mutations and predicted to impair function by PolyPhen or SIFT?

Genome Variation Server
http://gvs.gs.washington.edu/GVS/

The Genome Variation Server (GVS) is a simple tool for providing rapid access to the genotypes of 4.5 million human variations found in dbSNP. It also provides tools for analysis of this genotype data. This site can also be accessed from the NIEHS SNPs home page.

1. Click the Gene Name search button.
2. Enter ‘NOS2A’.
3. A table of ‘Populations’ and ‘Submitters’ will appear, along with the number of genotyped SNPs for each entry in the “Select Data Set” panel.
4. Select the check-box for ‘HapMap-CEU-Panel.’ De-select EGP-Yoruban Panel.
5. Select the ‘Display Genotypes’ button at the bottom of the page. A new page, ‘Select Display Type,’ should appear. Select the ‘open graphical display of genotypes.’
   For the visual genotype, the numbers at the top of the image represent the SNPs (numbered along a reference sequence used in re-sequencing the gene). The numbers on the left side of the image represent the sample ID. Each square represents an individual sample’s genotype: homozygous for the common allele (blue), heterozygous (red), and homozygous for the rare allele (yellow). The SNPs are color coded to describe their location within the gene.
   Look for other information on this page. How many SNPs and samples are being displayed? (Hint: look below the Gene Name.)
6. Notice the SNP sites across the top of the image are listed by RS_ID. What is the RS_ID for the first nonsynonymous SNP from the left?
7.
Close the VG2 image and Select Display Type windows and return to the Genome Variation Server Page. Now, change the allele frequency cutoff to 5. How many SNPs have a minor allele frequency of 5% or greater?
8. Now deselect HapMap-CEU-Panel and select EGP-CEPH-Panel, enter 0 in the allele frequency cutoff, display genotypes and select the ‘open graphical display of genotypes.’ How many SNPs and samples were examined in the EGP? Return to the Genome Variation Server Page and change the allele frequency cutoff from 0 to 5. Display genotypes and select the ‘open graphical display of genotypes.’ How many SNPs occur at 5% or greater?
9. Close the VG2 image and Select Display Type windows and return to the Genome Variation Server Page. Now let’s explore the merge samples and variation options. In addition to selecting EGP-CEPH-Panel, also select the HapMap-CEU. Return the allele frequency cutoff to 0.
   The default option is A: common samples with combined variations. Select display genotypes and ‘open graphical display of genotypes’. How many samples overlap and how many SNPs?
   Close the VG2 image and Select Display Type windows, return to the Genome Variation Server Page and now select B: combined samples with common variation. Select display genotypes and ‘open graphical display of genotypes’. How many samples and SNPs are displayed?
   Close the VG2 image and Select Display Type windows, return to the Genome Variation Server Page and now select C: combined samples with combined variation. Select display genotypes and ‘open graphical display of genotypes’. How many samples and SNPs are being displayed?
   Notice the grey squares marking missing information for many of the HapMap samples. Remember that more common variation exists in the human genome than the HapMap captures: the HapMap is only a sample of the common variation, whereas sequencing identifies the majority of it.
10. Close the VG2 image and Select Display Type windows and return to the Genome Variation Server Page.
To see the SNP summary information, choose the rightmost green button ‘Display SNP Summary.’ Explore the type of information that is provided. What is the RS_ID of the first synonymous SNP listed in the SNP Summary? What is the conservation score? Go to the bottom of the page and click on ‘Description of the Columns.’
11. In the ‘Add/Remove columns’ table (upper right part of the page), select ‘Chip availability’ and ‘Reset.’ Which SNP (by rs#) is the first one in this gene found on the HumanHap300, 550, 650 and the 1M BeadChip? What is the function reported for this SNP?

For a cross-database search: Entrez Gene, dbSNP and Entrez SNP

If your candidate gene is not listed on the NIEHS SNPs site, you can perform a cross-database search for SNPs and other information available from dbSNP, Entrez Gene and Entrez SNP. Navigate to the National Center for Biotechnology Information (NCBI) web site at http://www.ncbi.nlm.nih.gov/

1. Enter the gene symbol (NOS2A) into the empty box next to ‘Search All Databases’ and click on the GO button, or simply hit the return key on your keyboard.
2. Which NCBI database gives the highest number of results?
3. What information does this database provide? (Hint: click on the ‘?’ next to the database icon for a popup description.)
4. On the left column, note the results returned for the ‘SNP’ and ‘Gene’ databases. How many results were returned for each?
5. Why did the ‘Gene’ database return more than one result?

Entrez Gene

1. From the cross-database search, click on the ‘Gene’ database icon.
2. Click on the result that corresponds to the ‘Homo sapiens’ NOS2A gene.
3. NOS2A maps to which chromosome?
4. What are the loci 5’ and 3’ of NOS2A? (Hint: look at the ‘Genomic context’ section.)
5. On the right margin of the page next to the NOS2A gene name and description, note the word ‘Links.’ Scroll down this list and select “SNP: Geneview.”

dbSNP

1.
The initial dbSNP Geneview only shows SNPs that are located in the coding region of the gene (cSNPs). How many cSNPs are found in dbSNP for NOS2A? (Hint: Find the ‘SNP count’ column of the Gene Model table, or count the number of red and green markers in the gene diagram, or count the number of rs-numbers, the “reference SNP numbers,” in the table.) How many are validated?
2. Under the ‘Gene Model’ heading, use the button selectors to view all the SNPs in the “gene region” (select that button), and then select the “refresh” button. The page will update and show all SNPs in this gene. How many SNPs are found in dbSNP for NOS2A? Note: this number is updated in the Gene Model table. (You don’t need to count all the rs-numbers!)
3. Find the “rs# cluster id” link for the intronic SNP rs7208775 (use your browser's page search to look for “rs7208775”). What methods were used to validate this SNP?
4. Click on the link. How many submitters have recorded a discovery of this SNP?
5. Click on the SNP submitter number, 'ss38342908', next to the ‘EGP_SNPS|NOS2A0044457’ SNP submission.
6. On this page, scroll down and find the frequency data for this SNP in each of the four populations studied by this submitter (EGP_AD, EGP_ASIAN, EGP_YORUB, EGP_CEPH). What are the allele frequencies of the C and G alleles in each of these populations?
7. Using the ‘BACK’ button in your browser, return to the Entrez Gene page for NOS2A.

Entrez SNP

1. Starting from the Entrez Gene page again, use the ‘Links’ menu on the right side to view the linkout choices and select the ‘SNP’ option. This will automatically query the Entrez SNP database for all SNPs in dbSNP for the NOS2A gene.
2. How many SNPs are returned? How many human SNPs?
3. Below the search box and tabbed menu choices (i.e. ‘Limits’, ‘Preview/Index’, etc.), click the ‘Display’ drop down menu and select ‘FASTA’. The page should automatically update after you make the selection.
4.
In the ‘Send To’ drop down menu, select the ‘Text’ option. The page should update the results in plain text format. This selection can be directly copied to a file on your computer. Alternatively, this data can be sent directly to a file (that is, saved on your computer) with the ‘Send To’ ‘File’ option.
5. Use the ‘BACK’ button on your browser.
6. Select the ‘Limits’ tab below the main search box. In the main search box, type the gene name ‘NOS2A’.
7. Select the following search limits from the selections on this page:
   Organism: Homo sapiens
   Validation: 2hit-2allele
8. After making these selections, use the ‘Go’ button next to the main search box to get the result.
9. How many results are returned for validated 2hit-2allele SNPs in this gene?
10. Experiment with saving these in different formats using both the ‘Send To’ ‘Text’ option and the ‘Send To’ ‘File’ option.
11. Finally, to demonstrate the use of search term fields directly in the main search box, type the following:
    NOS2A[gene] AND “EGP_SNPS”[handle]
12. How many of all of the entries in dbSNP for this gene were submitted by the NIEHS EGP project (our handle is EGP_SNPS)?

HapMap Browser
http://hapmap.org/

The HapMap Browser hosts the genotype data produced as part of the International HapMap Project, the international effort to genotype and understand the common patterns of genetic diversity in the human genome. The HapMap Genome Browser is linked directly from the main page at hapmap.org by selecting ‘HapMap Genome Browser (B35- full data set)’ in the left margin of the page.

1. In the main search box (labeled "Landmark or Region"), enter the abbreviated gene name with a wild card exactly as follows: NOS2A. The browser page with tracks will be presented.
2. Zoom in to view 20 kb using the “Scroll/Zoom” drop down menu. At this zoom level, the page is reloaded and frequency data in the form of red and blue pie charts are visible for the SNPs.
3.
Note the display of frequency data for each population using the pie graphs for each SNP. Click on the HapMap SNP with the number rs4796052.
4. Note the allele frequencies for each population. Click on the pie chart for one of the rs4796052 populations. Click on the “retrieve genotypes” link for this SNP. Genotype data for this SNP and population will be displayed in a parallel list format.
5. Use the ‘BACK’ button to get to the main gene view with 20 kb zoom.
6. Below the gene structure image (exons) in the Tracks section and under the ‘Analysis’ subheading, select ‘tagSNP picker’. Next, select ‘Update Image’ using the button on the right side of the page.
7. TagSNPs should appear on a track.
8. How many tagSNPs are listed for the CEU population?
9. In the Tracks section under ‘Variation,’ select dbSNP SNPs and update. Were all the SNPs in dbSNP genotyped by the HapMap?

Answer Key

NIEHS SNPs Variation Data:
2. How many non-synonymous cSNPs were discovered in this gene? Four. What is the position of the first synonymous cSNP in our reference sequence? 3816
3. What is the cDNA position of this cSNP? 357
4. In which population was it discovered? European (ED)
5. Which individual carries this polymorphism? E119
6. In the predictive analyses, how many sites are possibly damaging and predicted to be intolerant? Two. Which? 20461 (aa221) and 42375 (aa1009). Are any of these polymorphisms common in the population (above a minor allele frequency [MAF] of 5%)? Yes. Which site(s) and what frequency? 42375 (aa1009). Bonus: In which population(s) does the MAF exceed 5%? African-descent (African-American and Yoruban) and Hispanic-descent.

GeneSNPs
3. 3’ to 5’, or right to left.
5. The first position of codon 163, causing an isoleucine to valine substitution (“163.1”).
6. The Polymorphism Discovery Resource (PDR); G/G

PolyPhen
6. What impact is predicted for this amino acid change? This change is predicted to be “probably damaging” (and this is a known BRCA1 mutation).
7.
PolyPhen: Q356R; SIFT: Q356R and S1038N. Predictions don’t always agree because the underlying algorithms differ: PolyPhen uses information beyond just the protein alignments.

PolyDoms
3. HGBASE, SNP500 Cancer, Sequenom, MGC_GENOME_DIFF
4. OMIM mutations: CYS61GLY, THR826LYS, ARG841TRP, SER1040ASN, MET1628THR, ALA1708GLU. Mutations also predicted to impair function: CYS61GLY, THR826LYS, ARG841TRP, MET1628THR, ALA1708GLU

Genome Variation Server:
5. 39 SNPs and 60 samples
6. rs2297518
7. 34 SNPs
8. At a cutoff of 0: 122 SNPs and 22 samples. At a cutoff of 5%: 95 SNPs and 22 samples.
9. A: 22 samples and 125 SNPs; B: 60 samples and 36 SNPs; C: 60 samples and 125 SNPs
10. rs1060826, 0.0850
11. rs8068149, intron

Cross-database Search via NCBI databases
1. GEO Profiles
2. SNP = 602 and Gene = 17
3. The search hits sequences from other species, replaced gene models, aliases, and descriptions of interacting proteins.

Entrez Gene
3. NOS2A is on chromosome 17.
4. 5’ LOC201229 and 3’ LOC729562 (Remember that NOS2A is coded on the complementary strand.)

dbSNP
1. How many cSNPs are found in dbSNP for NOS2A? 23. How many cSNPs are validated? 12.
2. 350
3. Multiple independent submissions, allele frequency data, and the HapMap project.
4. Two, EGP and BCM
6. Yoruban: G = 0.875, C = 0.125; CEPH: G = 0.955, C = 0.045; AD (African Descent): G = 0.857, C = 0.143; Asian: G = 0.818, C = 0.182

Entrez SNP
3. 602, 350
10. 44
12. How many of the entries in dbSNP for this gene were submitted by the NIEHS EGP project? 262

HapMap Browser
1. Seven tagSNPs.
2. No, less than a third.
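
As a supplement to the Entrez SNP exercise, the fielded query typed in step 11 (NOS2A[gene] AND "EGP_SNPS"[handle]) can also be assembled as a URL. The following is a minimal Python sketch, not something the workshop itself provides. It assumes the NCBI E-utilities `esearch.fcgi` endpoint and the `snp` database name, takes the `[gene]` and `[handle]` field tags from the tutorial text, and uses hypothetical helper names (`build_term`, `build_esearch_url`). It only builds the URL and makes no network request; whether the current NCBI service still accepts these field tags is not verified here.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(gene, handle=None):
    """Build a fielded Entrez search term such as the one in step 11.

    The [gene] and [handle] field tags are copied from the tutorial.
    """
    term = f"{gene}[gene]"
    if handle:
        term += f' AND "{handle}"[handle]'
    return term


def build_esearch_url(term, db="snp"):
    """Return an esearch URL for the given term (the URL is not fetched)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    term = build_term("NOS2A", "EGP_SNPS")
    print(term)
    print(build_esearch_url(term))
```

Building the term separately from the URL keeps the query readable and lets the same term be pasted into the Entrez search box by hand.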