SeattleSNPs Variation Discovery Workshop

NIEHS SNPs Workshop
January 10, 2008
All of the workshop presentations, web links and files for exercises are
available at: http://egp.gs.washington.edu/workshop/download/
Interactive Tutorial 1: SNP Database Resources
The following tutorial is designed to help you explore the SNP data available for
candidate genes and illustrate the various NIEHS datasets and tools available from the
Environmental Genome Project (EGP) and the other resources described in this
workshop.
As a launching point, we will begin our search at the NIEHS SNPs resource. This can be
accessed at http://egp.gs.washington.edu/
The NIEHS SNPs Program at the University of Washington is part of the EGP. The EGP
is a multi-disciplinary effort focused on exploring the relationships between
environmental exposures, inter-individual sequence variation and disease risk. The
NIEHS SNPs Program is systematically identifying and genotyping single nucleotide
polymorphisms (SNPs) in environmental response genes. This resource provides dense
genetic maps of more than 600 key genes that can be applied in evaluating human disease
risk with environmental exposures.
For these exercises, we will be accessing data for the gene nitric oxide synthase 2A
(NOS2A). Note: Answers to questions from this tutorial are included at the end of this
document.
NIEHS SNPs
http://egp.gs.washington.edu
The variation data for the NOS2A gene can be accessed through the search box on the top
right part of the NIEHS SNPs home page, or via the ‘A-Z Finished Genes Directory’ link
in the left hand navigation bar.
Using this last link, find NOS2A in the alphabetical listing of genes.
1. Under the ‘Mapping Data’ section, click on the ‘cSNPs’ link.
2. How many non-synonymous cSNPs were discovered in this gene? What is the
position in our reference sequence of the first synonymous cSNP?
3. What is the cDNA position of this synonymous cSNP?
4. In which population was it discovered?
5. Go ‘BACK’. In the ‘Genotyping Data’ category, click on the ‘Visual Genotype’ link.
An image of all of the genotyping data for this gene is displayed. Using the SNP
location of the synonymous SNP, determine which individual carries this
polymorphism.
6. Explore other links in the ‘Mapping Data,’ ‘Genotyping Data’ and ‘Predictive
Analyses’ sections. The ‘Linkage Data’ and ‘Haplotype Data’ will be covered in a
subsequent talk and tutorial. In the predictive analyses, are any of the sites possibly
damaging and predicted to be intolerant? Which? Are any of these polymorphisms
common in the population (above a minor allele frequency [MAF] of 5%)? Which
site(s), and at what frequency? Bonus: In which population(s) does the MAF exceed
5%? In the cSNPs table, the first AD-freq column is for African descent (African-Americans); the second AD-freq column is for African descent (Yorubans of Africa and
HapMap samples); ED-freq is for European descent (CEPH and HapMap
samples); HD-freq is for Hispanic descent (Mexican-Americans); and XD-freq is for
Asian descent (split between Chinese and Japanese HapMap samples).
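As a rough illustration of the 5% MAF threshold used in step 6, the minor allele frequency of a single SNP can be computed from per-sample genotypes. The sketch below is in Python. The function name, the two-letter genotype encoding, and the sample panel are assumptions made for illustration; they are not taken from the NIEHS SNPs site.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one biallelic SNP.

    genotypes: iterable of two-character strings such as "CC", "CT", "TT".
    Missing calls may be given as None or "" and are skipped.
    """
    counts = Counter()
    for genotype in genotypes:
        if not genotype:
            continue  # skip missing genotype calls
        counts.update(genotype)  # each character counts as one allele
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotype calls")
    if len(counts) == 1:
        return 0.0  # only one allele observed: monomorphic site
    return min(counts.values()) / total


# Hypothetical panel: 17 homozygous C/C samples and 3 C/T heterozygotes.
panel = ["CC"] * 17 + ["CT"] * 3
maf = minor_allele_frequency(panel)
print(f"MAF = {maf:.3f}, common (>= 5%) = {maf >= 0.05}")
```

Here 3 of the 40 chromosomes carry the T allele, so the site would count as common under a 5% cutoff.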
GeneSNPs
http://www.genome.utah.edu/genesnps/
The GeneSNPs resource integrates gene, sequence and polymorphism data from the
NIEHS SNPs project, dbSNP and other resources into individually annotated gene
models.
1. Select ‘Cell Cycle’ from the ‘Gene List’ drop down menu.
2. Select the ‘UCSC:hg16:4’ Gene Model link for the gene CCNA2.
3. What is the orientation of the gene? Hint – Look at arrows across the gene model.
4. In the ‘All Submitter’ drop down menu, select EGP_SNPS.
5. Scroll down the list of SNPs to the first non-synonymous cSNP. In what codon and
which position of the codon is this SNP?
6. On what population was this gene re-sequenced? Click on the GS link at the right.
What is the genotype of the sample P012?
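The codon number and the position within the codon asked for in step 5 follow from a coding-sequence position by integer arithmetic. The Python sketch below assumes positions are 1-based and counted from the first base of the start codon; the function name and the example positions are hypothetical and are not CCNA2 data.

```python
def codon_position(cds_pos):
    """Map a 1-based coding-sequence position to (codon number, position in codon)."""
    if cds_pos < 1:
        raise ValueError("cds_pos must be >= 1")
    codon = (cds_pos - 1) // 3 + 1         # codons are numbered from 1
    pos_in_codon = (cds_pos - 1) % 3 + 1   # 1, 2, or 3
    return codon, pos_in_codon


# Hypothetical positions, chosen only to show the arithmetic.
for pos in (1, 3, 10):
    codon, within = codon_position(pos)
    print(f"coding position {pos} -> codon {codon}, position {within} ({codon}.{within})")
```

The "codon.position" form printed at the end mirrors the shorthand used in the answer key.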
PolyPhen
http://genetics.bwh.harvard.edu/pph/
PolyPhen predicts the potential consequence of an amino acid substitution on the
structure and function of a human protein using physical and comparative considerations.
1. Enter the amino acid sequence for BRCA1 from the fasta file available at
http://egp.gs.washington.edu/workshop/BRCA1.protein.fsa. Include the first line
starting with the ‘>’ character.
2. Enter 356 for position, Q (glutamine) and R (arginine) for AA1 and AA2, then
“Process query.”
3. The PolyPhen site returns its prediction, based on alignments of both polypeptide
sequences to sequences in the SwissProt data base, the potential of disrupting known
structural motifs (coils, active sites, disulfide bridges, phosphorylation sites, etc.), and
the steric changes to the three dimensional structure. What impact is predicted for
this amino acid change?
4. Explore the definitions of score parameters by clicking on the headers of the output
tables. The headers are linked to documentation explaining the scoring and alignment
algorithms.
5. Return to the NIEHS SNPs EGP web site and use the ‘A-Z Finished Genes Directory’
link to list the “B” genes. Go to BRCA1.
6. Click on the non-synonymous cSNP analysis link in the Predictive Analysis section
and examine the PolyPhen and SIFT predictions.
7. Which substitutions are predicted by PolyPhen to be damaging? Which substitutions
are predicted by SIFT to be intolerant?
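Before submitting a substitution such as the one in steps 1 and 2, it can help to confirm that the residue at the chosen position really is AA1. The Python sketch below parses a single-record FASTA string (header line starting with ‘>’) and checks one position. The function names and the toy sequence are made up for illustration; the sequence is not BRCA1.

```python
def read_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()  # join wrapped sequence lines
    return header, sequence


def residue_matches(sequence, position, expected):
    """Check that the 1-based position holds the expected one-letter residue."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position outside sequence")
    return sequence[position - 1] == expected.upper()


# Toy record (made-up sequence) wrapped over two lines.
toy = ">toy_protein made-up example\nMDLSA\nQRVEE\n"
header, seq = read_fasta(toy)
print(header, len(seq), residue_matches(seq, 6, "Q"))
```

With the real BRCA1 file, the same check would be made at position 356 with expected residue Q.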
Polydoms
http://polydoms.cchmc.org/polydoms/
Polydoms is an integrated database of human cSNPs and their annotations.
1. Select ‘Gene/Protein Symbol’ from the Drop down box in the ‘Basic Search’ section
of the page. Enter BRCA1 in the search field. Leave the Filter options selection at
the default setting for ‘Non-synonymous SNPs.’ Select ‘Search.’
2. In the Search Results section, “11 proteins (1 unique)” should be listed. Click on the
‘NonSynonymous’ link of the first Gene Description (protein NP_009225).
3. Find the GLN 356 ARG substitution in the gene diagram and click on the gray box
corresponding to the 356 allele. In the table, find the entries for the GLN 356 ARG
allele and list the non-EGP submitters who have also identified this variant.
4. Use the color key under the gene schematic to find the “mutation.” What
substitutions are coded as mutations by OMIM? Which are mutations and predicted
to impair function by PolyPhen or SIFT?
Genome Variation Server
http://gvs.gs.washington.edu/GVS/
The Genome Variation Server (GVS) is a simple tool for providing rapid access to the
genotypes of 4.5 million human variations found in dbSNP. It also provides tools for
analysis of this genotype data. This site can also be accessed from the NIEHS SNPs home
page.
1. Click the Gene Name search button
2. Enter ‘NOS2A’
3. A table of ‘Populations’ and ‘Submitters’ will appear, along with the number of
genotyped SNPs for each entry in the “Select Data Set” panel.
4. Select the check-box for ‘HapMap-CEU-Panel.’ De-select EGP-Yoruban Panel.
5. Select the ‘Display Genotypes’ button at the bottom of the page. A new page, ‘Select
Display Type,’ should appear. Select the ‘open graphical display of genotypes.’ For
the visual genotype, the numbers at the top of the image represent the SNPs
(numbered along a reference sequence used in re-sequencing the gene). The numbers
on the left side of the image represent the sample ID. Each square represents an
individual sample’s genotype: homozygous for the common allele (blue),
heterozygous (red), and homozygous for the rare allele (yellow). The SNPs are color
coded to describe their location within the gene. Look for other information on this
page. How many SNPs and samples are being displayed? (Hint: look below the Gene
Name.)
6. Notice the SNP sites across the top of the image are listed by RS_ID. What is the
RS_ID for the first nonsynonymous SNP from the left?
7. Close the VG2 image and Select Display Type windows and return to the Genome
Variation Server Page. Now, change the allele frequency cutoff to 5. How many
SNPs have a minor allele frequency of 5% or greater?
8. Now deselect HapMap-CEU-Panel and select EGP-CEPH-Panel, enter 0 in the allele
frequency cut off, and display genotypes and select the ‘open graphical display of
genotypes.’ How many SNPs and samples were examined in the EGP? Return to the
Genome Variation Server Page and change the allele frequency cutoff from 0 to 5.
Display genotypes and select the ‘open graphical display of genotypes.’ How many
SNPs occur at 5% or greater?
9. Close the VG2 image and Select Display Type windows and return to the Genome
Variation Server Page. Now let’s explore the merge samples and variation options.
In addition to EGP-CEPH-Panel, also select HapMap-CEU-Panel, and return the
allele frequency cutoff to 0. For each of the three merge options below, select
display genotypes and ‘open graphical display of genotypes,’ then answer the
questions. Between options, close the VG2 image and Select Display Type windows
and return to the Genome Variation Server Page.
A - common samples with combined variations (the default). How many samples
overlap, and how many SNPs are displayed?
B - combined samples with common variations. How many samples and SNPs are
displayed?
C - combined samples with combined variations. How many samples and SNPs are
displayed? Notice the grey squares marking missing information for many of the
HapMap samples. Remember that more common variation exists in the human
genome than the HapMap captures: the HapMap is only a sample of the common
variation, whereas sequencing identifies the majority of it.
10. Close the VG2 image and Select Display Type windows and return to the Genome
Variation Server Page. To see the SNP summary information, choose the rightmost
green button ‘Display SNP Summary.’ Explore the type of information that is
provided. What is the RS_ID of the first synonymous SNP listed in the SNP
Summary? What is the conservation score? Go to the bottom of the page and click on
‘Description of the Columns.’
11. In the ‘Add/Remove columns’ table (upper right part of the page), select ‘Chip
availability’ and ‘Reset.’ Which SNP (by rs#) is the first one in this gene found on the
HumanHap300, 550, 650 and the 1M BeadChip? What is the function reported for
this SNP?
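The three merge options in step 9 can be read as set operations on the samples and SNPs of the two panels. The Python sketch below assumes that "common" means the intersection and "combined" means the union; this is an interpretation of the option names, not a description of how GVS is implemented. The function name, sample IDs, and SNP labels are hypothetical.

```python
def merge_panels(panel_a, panel_b, option):
    """Combine two genotype panels under merge option 'A', 'B', or 'C'.

    Each panel is a dict with a 'samples' set and a 'snps' set.
    Returns the (samples, snps) that would be displayed.
    """
    common_samples = panel_a["samples"] & panel_b["samples"]
    all_samples = panel_a["samples"] | panel_b["samples"]
    common_snps = panel_a["snps"] & panel_b["snps"]
    all_snps = panel_a["snps"] | panel_b["snps"]
    if option == "A":   # common samples with combined variations
        return common_samples, all_snps
    if option == "B":   # combined samples with common variations
        return all_samples, common_snps
    if option == "C":   # combined samples with combined variations
        return all_samples, all_snps
    raise ValueError("option must be 'A', 'B', or 'C'")


# Toy panels: a small resequenced panel nested inside a larger genotyped one.
resequenced = {"samples": {"S1", "S2"}, "snps": {"snpA", "snpB", "snpC"}}
genotyped = {"samples": {"S1", "S2", "S3", "S4"}, "snps": {"snpB", "snpD"}}

for option in "ABC":
    samples, snps = merge_panels(resequenced, genotyped, option)
    print(option, len(samples), "samples,", len(snps), "SNPs")
```

Under option C, samples present in only one panel have no genotype for SNPs typed only in the other panel, which is one way to understand the grey missing-data squares in the display.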
For a cross-database search: Entrez Gene, dbSNP and Entrez SNP
If your candidate gene is not listed on the NIEHS SNPs site, you can perform a cross
database search for SNPs and other information available from dbSNP, Entrez Gene and
Entrez SNP. Navigate to the National Center for Biotechnology Information (NCBI) web site at
http://www.ncbi.nlm.nih.gov/
1. Type the gene symbol (NOS2A) into the empty box next to ‘Search All
Databases’ and click on the GO button, or simply hit the return key on your
keyboard.
2. Which NCBI database gives the highest number of results?
3. What information does this database provide? (Hint: click on the ‘?’ next to the
database icon for a popup description.)
4. On the left column, note the results returned for the ‘SNP’ and ‘Gene’ databases.
How many results were returned for the ‘SNP’ and ‘Gene’ databases?
5. Why did the ‘Gene’ database return more than one result?
Entrez Gene
1. From the cross database search, click on the ‘Gene’ database icon.
2. Click on the result that corresponds to the ‘Homo sapiens’ NOS2A gene.
3. NOS2A maps to which chromosome?
4. What are the loci 5’ and 3’ of NOS2A? (Hint: look at the ‘Genomic context’ section.)
5. On the right margin of the page next to the NOS2A gene name and description, note
the word ‘Links.’ Scroll down this list and select “SNP: Geneview.”
dbSNP
1. The initial dbSNP Geneview only shows SNPs that are located in the coding region of
the gene (cSNPs). How many cSNPs are found in dbSNP for NOS2A? (Hint: Find
the ‘SNP count’ column of the Gene Model table, or count the number of red and
green markers in the gene diagram, or count the number of rs-numbers, the “reference
SNP numbers” in the table.) How many are validated?
2. Under the ‘Gene Model’ heading, use the button selectors to view all the SNPs in the
“gene region” (select that button), and then select the “refresh” button. After
selecting this, the page will update and show all SNPs in this gene. How many SNPs
are found in dbSNP for NOS2A? Note: this number is updated in the Gene Model
table. (You don’t need to count all the rs-numbers!)
3. Find the “rs# cluster id” link for the intronic SNP rs7208775 (use your browser's page
search to look for “rs7208775”). What methods were used to validate this SNP?
4. Click on the link. How many submitters have recorded a discovery of this SNP?
5. Click on the SNP submitter number, 'ss38342908', next to the ‘EGP_SNPS|NOS2A0044457’ SNP submission.
6. On this page, scroll down and find the frequency data for this SNP in each of the four
populations studied by this submitter (EGP_AD, EGP_ASIAN, EGP_YORUB,
EGP_CEPH). What are the allele frequencies of the C and G alleles in each of these
populations?
7. Using the ‘BACK’ button in your browser, return to the Entrez Gene page for
NOS2A.
Entrez SNP
1. Starting from the Entrez Gene page again, use the ‘Links’ menu on the right side to
view the linkout choices and select the ‘SNP’ option. This will automatically query
the Entrez SNP database for all SNPs in dbSNP for the NOS2A gene.
2. How many SNPs are returned? How many human SNPs?
3. Below the search box and tabbed menu choices (i.e. ‘Limits’, ‘Preview/Index’, etc),
click the ‘Display’ drop down menu and select ‘FASTA’. The page should
automatically update after you make the selection.
4. In the ‘Send To’ drop down menu, select the ‘Text’ option. The page should update
the results in plain text format. This selection can be directly copied to a file on your
computer. Alternatively, this data can be sent directly to a file (‘Send To’ a ‘File’),
that is, saved on your computer.
5. Use the ‘BACK’ button on your browser.
6. Select the ‘Limits’ tab below the main search box. In the main search box, type the
gene name ‘NOS2A’.
7. Select the following search limits from the selections on this page:
Organism: Homo sapiens
Validation: 2hit-2allele
8. After making these selections, use the ‘Go’ button next to the main search box to get
the result.
9. How many results are returned for validated 2hit-2allele SNPs in this gene?
10. Experiment with saving these in different formats using both the ‘Send To’ → ‘Text’
option and the ‘Send To’ → ‘File’ option.
11. Finally, to demonstrate the ability of using search term fields directly in the main
search box, type the following:
NOS2A[gene] AND “EGP_SNPS”[handle]
12. How many of all of the entries in dbSNP for this gene were submitted by the NIEHS
EGP project (our handle is EGP_SNPS)?
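The fielded search in step 11 can also be assembled programmatically. The Python sketch below builds the same query string and a URL for NCBI's E-utilities ESearch endpoint without making any network request. The `[gene]` and `[handle]` field tags are taken from step 11; whether the current SNP database still accepts them is an assumption, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_term(gene, handle=None):
    """Build an Entrez SNP search term using the field tags from step 11."""
    terms = [f"{gene}[gene]"]
    if handle:
        terms.append(f'"{handle}"[handle]')  # submitter handle, quoted
    return " AND ".join(terms)


def build_esearch_url(term, db="snp"):
    """Return an ESearch URL for the given term (no request is sent)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


term = build_snp_term("NOS2A", handle="EGP_SNPS")
print(term)
print(build_esearch_url(term))
```

The printed term uses plain ASCII double quotes, which is what a search box or URL expects; the curly quotes in a typeset handout should not be pasted into a query.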
HapMap Browser
http://hapmap.org/
The HapMap Browser hosts the genotype data produced as part of the International
HapMap Project, the international effort to genotype and understand the common patterns
of genetic diversity in the human genome.
The HapMap Genome Browser is linked directly from the main page at hapmap.org by
selecting ‘HapMap Genome Browser (B35- full data set)’ in the left margin of the page.
1. In the main search box (labeled "Landmark or Region"), enter the abbreviated gene
name with a wild card exactly as follows: NOS2A. The browser page with tracks will
be presented.
2. Zoom in to view 20kb using the “Scroll/Zoom” drop down menu. At this zoom level,
the page is reloaded and frequency data in the form of red and blue pie charts are
visible for the SNPs.
3. Note the display of frequency data for each population using the pie graphs for each
SNP. Click on the HapMap SNP with the number rs4796052.
4. Note the allele frequencies for each population. Click on the pie chart for one of the
rs4796052 populations. Click on the “retrieve genotypes” links for this SNP.
Genotype data for this SNP and population will be displayed in a parallel list format.
5. Use the ‘BACK’ button to get to the main gene view with 20kb zoom.
6. Below the gene structure image (exons) in the Tracks section and under the
‘Analysis’ subheading, select ‘tagSNP picker’. Next, select the ‘Update Image’ using
the button on the right side of the page.
7. TagSNPs should appear on a track.
8. How many tagSNPs are listed for the CEU population?
9. In the Tracks section under ‘Variation,’ select dbSNP SNPs and update. Were all the
SNPs in dbSNP genotyped by the HapMap?
Answer Key
NIEHS SNPs Variation Data:
2. How many non-synonymous cSNPs were discovered in this gene? Four. What is the
position of the first synonymous SNP location in our reference sequence? 3816
3. What is the cDNA position of this cSNP? 357
4. In which population was it discovered? European (ED)
5. Which individual carries this polymorphism? E119
6. In the predictive analyses, how many sites are possibly damaging and predicted to be
intolerant? Two
Which? 20461(aa221) and 42375 (aa1009)
Are any of these polymorphisms common in the population (above minor allele
frequency (MAF) of 5%)? Yes
Which site(s) and what frequency? 42375 (aa1009)
Bonus – In which population(s) does the MAF exceed 5%? African-descent (African-American and Yoruban) and Hispanic-descent
GeneSNPs
3. 3’ to 5’ or right to left.
5. The first position of codon 163 causing an isoleucine to valine substitution (“163.1”).
6. The Polymorphism Discovery Resource (PDR), G/G
PolyPhen
3. What impact is predicted for this amino acid change? This change is predicted to be
“probably damaging” (and this is a known BRCA1 mutation).
7. PolyPhen: Q356R. SIFT: Q356R and S1038N. Predictions don’t always agree because
the underlying algorithms differ; PolyPhen uses information beyond just the protein
alignments.
PolyDoms
3. HGBASE, SNP500 Cancer, Sequenom, MGC_GENOME_DIFF
4. CYS61GLY, THR826LYS, ARG841TRP, SER1040ASN, MET1628THR,
ALA1708GLU; CYS61GLY, THR826LYS, ARG841TRP, MET1628THR,
ALA1708GLU
Genome Variation Server:
5. 39 SNPs and 60 samples
6. rs2297518
7. 34 SNPs
8. 122 SNPs and 22 samples at a cutoff of 0; 95 SNPs and 22 samples at a cutoff of 5%.
9. 22 Samples and 125 SNPs; 60 Samples and 36 SNPs; 60 Samples and 125 SNPs
10. rs1060826, 0.0850
11. rs8068149, intron
Cross-database Search via NCBI databases
2. GEO Profiles
4. SNP = 602 and Gene = 17
5. The search hits sequences from other species, replaced gene models, aliases, and
descriptions of interacting proteins.
Entrez Gene
3. NOS2A is on chromosome 17
4. 5’ LOC201229 and 3’ LOC729562 (Remember that NOS2A is coded on the
complementary strand.)
dbSNP
1. How many cSNPs are found in dbSNP for NOS2A? 23. How many cSNPs are
validated? 12.
2. 350
3. Multiple independent submissions, allele frequency data, and the HapMap project.
4. Two, EGP and BCM
6. YORUBAN = G = 0.875, C = 0.125; CEPH = G = 0.955, C = 0.045; AD (African
Descent) = G = 0.857, C = 0.143; ASIAN = G = 0.818, C = 0.182
Entrez SNP
2. 602, 350
9. 44
12. How many of the entries in dbSNP for this gene were submitted by the NIEHS
EGP project? 262
HapMap Browser
8. Seven tagSNPs.
9. No, less than a third.