Exercises Biological databases

Exercise databases
Bioinformatics (updated 2013 September)
1 Discovering genome projects in NCBI
1.1 View the genome sequence initiatives (Go to BioProject)
How many prokaryotic genomes have been sequenced (October 2006: 381)? How many are
in progress (October 2006: 267)? Note also the nice taxonomic overview of all prokaryotic
species that have been sequenced.
How many plant species have been fully sequenced?
What is the 1000 Genomes Project?
What is HMP?
Can you find the mammoth sequencing project?
Search for the link to the genomic map of the chimp (Pan troglodytes).
Kathleen Marchal
Microbial genomes
(Browse genomes is at the bottom of the page if you wait long enough)
Plant genomes
Find a full genome (mammoth, chimp)
1.2 Get a look at the large initiatives:
What is HMP?
Why is this an example of a metagenomics project?
What is the 1000 Genomes Project?
Why can it be useful?
See also http://www.1000genomes.org/about
The 1000 Genomes Project (human)
“The purpose of the project is to support the discovery and understanding of genetic variants that
influence human disease. Specifically defined goals are (a) the discovery of single nucleotide
variants at frequencies of 1% or higher in diverse populations, (b) even more comprehensive
discovery (variants down to frequencies of 0.1 - 0.5%) in functional gene regions, and (c) discovery
of structural variants, such as copy number variants, other insertions and deletions, and inversions,
including sequence-level understanding of breakpoints. The volume of data generated by
1000genomes project is unprecedented. The data is accessible from two mirrored ftp sites at EBI
and NCBI.”
2 Using the Entrez search engine to discover distinct databases at NCBI
2.1 PubMed database
Search for articles on pax6.
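The same search can also be expressed as a URL for NCBI's E-utilities ESearch service, which is handy when a query has to be repeated. The sketch below only builds the URL and sends nothing over the network; the helper name esearch_url and the retmax value are illustrative choices, not part of the exercise.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one database and one query term.

    esearch_url is a hypothetical helper; it only formats the URL.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# The PubMed search from this exercise.
url = esearch_url("pubmed", "pax6")
print(url)
```

Pasting the printed URL into a browser should return the matching PubMed identifiers; changing db to "gene" or "protein" points the same query at another Entrez database.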
2.2 Gene
Check out the Gene database. This is the major curation project at NCBI. It aims to convert the
redundant sequence databases into one non-redundant, comprehensive sequence database in which
each locus in the genome is completely described by one or more representative mRNA sequences.
Entrez Gene is the American counterpart of Ensembl.
http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene
Search the Gene database for pax6 human (note the difference in results between searching with
AND and using the preview/index).
Find the accession numbers of the Pax6 transcripts, proteins, and genomic RefSeq sequence.
Find the Pax6 Gene ID.
Compare the results with what you found for Pax6 at Ensembl (see later).
For each locus in the genome, the Gene database contains all associated features (indicated by the
corresponding Gene IDs). A transcript is indicated by NM, a protein by NP, a genomic contig by
NT. All features (mRNA, genomic DNA, EST) associated with the same locus receive the same
Gene ID. The output is less graphical than Ensembl (see below). In Gene, non-redundant sequence
features are also grouped to generate a comprehensive view of the gene.
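The prefix convention can be captured in a small lookup, which helps when sorting a mixed list of accession numbers. This is a minimal sketch: NT_009237 and NC_000011.9 appear elsewhere in this exercise, the NC_ prefix (complete genomic molecule) is added for completeness, and NM_EXAMPLE and NP_EXAMPLE are placeholders rather than real accessions.

```python
# RefSeq accession prefixes mentioned in this exercise, plus NC_.
REFSEQ_PREFIXES = {
    "NM_": "mRNA transcript",
    "NP_": "protein",
    "NT_": "genomic contig",
    "NC_": "complete genomic molecule (e.g. a chromosome)",
}


def classify_refseq(accession):
    """Return the molecule type implied by the first three characters."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not one of the prefixes listed here")


# NM_EXAMPLE and NP_EXAMPLE are placeholders, not real accession numbers.
for acc in ("NT_009237", "NC_000011.9", "NM_EXAMPLE", "NP_EXAMPLE"):
    print(acc, "->", classify_refseq(acc))
```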
How many transcripts are known, and how many different isoforms do they correspond to?
Pax6:
Note that you find 2 splice variants (now there are more; sequence databases are continuously updated).
Find the sequence entries from which Gene was derived.
What is the meaning of 2 alternative assemblies?
Select the genome view.
(Can you see the two representative isoforms? What is the meaning of the purple squares?)
Find the GO categories of Pax6 (note that there are three ontology classification systems:
function, process, component).
Find the diseases in which Pax6 is involved.
How was this gene found to be related to these diseases?
(Via a GWAS study? How many variants have been detected in this gene?
Against which trait was the GWAS performed?)
2.3 Redundant sequence database: Nucleotide, Protein, Genome, EST…
This database contains all the redundant information that is used by Ensembl and Gene.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
Search for sequence entries that contain pax6 and human with a complex query, using the ‘limits’
and ‘advanced’ buttons.
1. First just search for Pax6. What do you retrieve?
2. Search for Pax6 and limit to genome.
Use an advanced search to find pax6 (gene name) in human (organism) and limit to genomic
sequences only. There are many entries, most of which contain only part of the sequence
(incomplete, e.g. only a certain exon, and many sequences that come from genomic surveys).
3. Indicate that you only want the RefSeq sequences. Five entries contain the complete genomic
sequence, derived from alternative assemblies. Use the accession number NT_009237.
Make a visual representation of this genomic sequence (contig).
(click on graphics)
4. Open the GenBank file of a gene entry and interpret the output (what is the difference
between an exon and an mRNA? Do you find the reference ID?)
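As a hint for step 4: in a GenBank flat file an mRNA feature is typically written as a join(...) of several coordinate ranges, while each exon feature covers a single range. The sketch below parses such location strings and compares the spliced length with the genomic footprint. The coordinates are invented for illustration and are not taken from the PAX6 record; the parser handles only simple start..end ranges.

```python
import re


def parse_location(location):
    """Return (start, end) pairs from a simple GenBank location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the listed segments."""
    return sum(end - start + 1 for start, end in segments)


# Hypothetical feature locations, for illustration only.
mrna_location = "join(101..250,401..520,901..1100)"
exon_locations = ["101..250", "401..520", "901..1100"]

mrna_segments = parse_location(mrna_location)
exon_segments = [parse_location(loc)[0] for loc in exon_locations]

# The mRNA is the concatenation of its exons; the introns in between are skipped.
footprint = mrna_segments[-1][1] - mrna_segments[0][0] + 1
print("exons match mRNA segments:", mrna_segments == exon_segments)
print("spliced mRNA length:", spliced_length(mrna_segments))
print("genomic footprint:", footprint)
```

The gap between the two lengths is the total intron length, which is one way to see the difference between a single exon and the full mRNA.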
View Pax6 graphically
Through the RefSeq gene number you can also go to the Gene entry.
Location : 11p13
Sequence : Chromosome: 11; NC_000011.9 (31806340..31839509, complement)
See PAX6 in Epigenomics, MapViewer
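The Sequence line above packs the accession, its version, the coordinates and the strand into one string. A minimal sketch of how it can be taken apart follows; the regular expression is an assumption about this particular layout and may need adjusting for other records.

```python
import re

# Pattern for strings such as "NC_000011.9 (31806340..31839509, complement)".
LOCUS_RE = re.compile(
    r"(?P<acc>[A-Z]{2}_\d+)\.(?P<ver>\d+)\s*"
    r"\((?P<start>\d+)\.\.(?P<end>\d+)(?P<comp>,\s*complement)?\)"
)


def parse_locus(text):
    """Extract accession, version, coordinates, strand and length."""
    match = LOCUS_RE.search(text)
    if match is None:
        raise ValueError("no locus description found in: " + text)
    start, end = int(match["start"]), int(match["end"])
    return {
        "accession": match["acc"],
        "version": int(match["ver"]),
        "start": start,
        "end": end,
        # "complement" means the gene lies on the reverse strand.
        "strand": "-" if match["comp"] else "+",
        "length": end - start + 1,
    }


locus = parse_locus("Chromosome: 11; NC_000011.9 (31806340..31839509, complement)")
print(locus)
```

For the line shown above, the coordinates span 33170 bases on the reverse strand of chromosome 11.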
Repeat the exercise, but now restrict the search to mRNA only. In this case you will find
accession numbers that start with NM_: these are RefSeq sequences representing the transcript.
Besides these you will also find some cDNA sequences that are derived from the IMAGE clone
library (publicly available libraries that contain all clones covering the human cDNAs; these are
used for microarray construction).
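The limits chosen through the web form correspond to a single query string with field tags and Boolean operators. The sketch below assembles the genomic and mRNA variants of the Pax6 search. The helpers fielded and all_of are hypothetical, and the property filters (biomol_genomic, biomol_mrna, srcdb_refseq) are written as they are commonly used in the Nucleotide database; check the Advanced search builder for the current names.

```python
def fielded(term, field):
    """Attach an Entrez field tag to a term, e.g. PAX6[Gene Name]."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Combine clauses so that every one of them must match."""
    return " AND ".join(clauses)


# Gene name and organism restriction shared by both searches.
base = all_of(fielded("PAX6", "Gene Name"), fielded("Homo sapiens", "Organism"))

# Genomic RefSeq entries only (assumed filter names, see the note above).
genomic_refseq = all_of(base, "biomol_genomic[PROP]", "srcdb_refseq[PROP]")

# mRNA entries only.
mrna_only = all_of(base, "biomol_mrna[PROP]")

print(genomic_refseq)
print(mrna_only)
```

Either string can be pasted into the Nucleotide search box, which makes it easy to compare the hit counts with those obtained through the limits page.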
2.4 EST database and Unigene
Go to Unigene.
Go to the overview page for human (Unigene statistics).
How many Unigene clusters contain only 1 sequence (i.e. unclustered sequences)? What will
happen if more EST sequences become available? How many clusters contain both an mRNA
sequence and an EST? How many contain only ESTs? Which will be the most reliable clusters?
(HTC = a high-throughput cDNA. Sequences in this division may still have 5' and 3' UTRs at their
ends, partial coding regions, and introns.)
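To make the cluster categories concrete, the sketch below sorts a handful of made-up clusters by their composition. The cluster names and members are invented for illustration and do not come from Unigene; only the categories mirror the questions above.

```python
from collections import Counter

# Hypothetical clusters: name -> sequence types of the members.
clusters = {
    "cluster_A": ["mRNA", "EST", "EST", "EST"],
    "cluster_B": ["EST"],
    "cluster_C": ["EST", "EST"],
    "cluster_D": ["mRNA"],
    "cluster_E": ["HTC", "EST"],
}


def describe(members):
    """Assign a cluster to one of the categories asked about above."""
    if len(members) == 1:
        return "singleton"
    kinds = set(members)
    if "mRNA" in kinds and "EST" in kinds:
        return "mRNA + EST"
    if kinds == {"EST"}:
        return "EST only"
    return "other mix"


summary = Counter(describe(members) for members in clusters.values())
for category, count in summary.items():
    print(category, count)
```

Clusters that combine a full-length mRNA with supporting ESTs rest on two independent kinds of evidence, which is worth keeping in mind when judging reliability.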
Search for the Homo sapiens pax6 Unigene cluster. Interpret the output: on which sequences
was the cluster built? Which other organisms contain a protein similar to human Pax6?
View the expression of the Pax6 gene based on the analysis of EST counts (expression, EST
counts). In which tissues do you expect the gene to be expressed? Is this the case?
From the Unigene page, go to the DDD (digital differential display).
Compare the difference in expression between two human tissues.
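DDD compares how often ESTs of a gene occur in two library pools; to our understanding it judges the difference with a Fisher exact test, which is taken here as an assumption. The sketch below implements a two-sided Fisher exact test with the standard library only. The EST counts are invented for illustration, and the function name is a hypothetical choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test on the 2x2 table [[a, b], [c, d]].

    a, b: ESTs of the gene / all other ESTs in pool 1
    c, d: ESTs of the gene / all other ESTs in pool 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x gene ESTs in pool 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: 12 Pax6 ESTs among 10000 in tissue 1,
# 2 Pax6 ESTs among 12000 in tissue 2.
p_value = fisher_exact_two_sided(12, 10000 - 12, 2, 12000 - 2)
print("p-value:", p_value)
```

A small p-value suggests that the gene is represented differently in the two pools, but EST counts are low and library construction varies, so the result should be read as a hint rather than a measurement.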
2.5 Performing more advanced searches using Entrez
1. This exercise practices using the Entrez search program at the National Center for Biotechnology
Information (NCBI) to search for the amino acid sequence of the human heat shock factor
HSF1. Normally a large number of matches are found in such searches. We will use the Entrez
Boolean search features, which restrict the reported matches to those that meet a series of required
conditions. This feature allows us to narrow the search to the sequence we want.
a) Go to the Entrez web site and choose Protein.
b) Enter the terms heat shock factor in the search window and click GO [heat shock protein
AND human]. This search finds any sequence entry in the protein sequence database
that includes this phrase.
c) Now limit the search by clicking on advanced search. Go to add terms, choose
organism in the first box, type human in the second, then click AND to limit the search to
just human proteins, and then click preview. The history will show the results of a search for
database entries with the term heat shock protein AND originating from humans as the
organism. How many hits are there now?
d) We can limit the hits to matches in RefSeq, which is GenBank’s annotated sequence
database, to give a best representative sequence entry for each protein. Click on
Limits, and in the Limited to section of the page, ignore the boxes on the left and choose
RefSeq in the right box. Then click GO and history. Now we have all human heat shock
factors in RefSeq. The gene of interest is HSF 1. Add this term to the query. How
many hits did you receive? [no limit on gene name]
e) The gene of interest is HSF. Click clear in the text entry box at the top of the page, type HSF
1 and click preview. You obtain more hits because you performed a keyword search. It is
better to search via the limits option.
f) There are other ways of arriving at this final sequence. As another example, pull out all
human protein sequences in RefSeq and all HSF 1 sequences in all organisms and then
select the human one using another Boolean search feature of Entrez. First clear history,
clear the upper text box, and reselect advanced search. Enter human and organism in the text
box, click Limits, and limit to RefSeq. Click GO and then History. Now we have a
complete list of human sequences in RefSeq.
g) Now click Advanced search, choose gene name in the left box and enter HSF1. Combine this
search with the previous one using Booleans in the history. The result should be a small
number of HSF 1 proteins.
h) Finally note the RefSeq accession number starting with NP and click to
display the FASTA format. NP identifies the entry as a curated protein sequence. The
sequence may be copied and pasted into a simple text editor and saved as a local
computer file.
i) While on the page with the target sequence, click on Links and choose the Gene option. Now
the Gene entry becomes visible. Note that the RefSeq numbers in the Gene database start
with NM for annotated mRNA and NT for annotated genome/chromosome.
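Combining two searches from the history behaves like a set operation on their result lists: AND keeps the entries found by both searches, OR keeps the entries found by either, and NOT removes the second list from the first. The sketch below illustrates this with Python sets. The identifiers are invented and stand in for the results of steps f) and g); no real Entrez results are shown.

```python
# Hypothetical identifiers returned by two searches in the history.
human_refseq = {101, 102, 103, 104, 105}    # search #1: human, RefSeq only
hsf1_all_organisms = {104, 205, 306}        # search #2: gene name HSF1

# "#1 AND #2": entries present in both result lists.
human_hsf1 = human_refseq & hsf1_all_organisms

# "#1 OR #2": entries present in at least one result list.
either = human_refseq | hsf1_all_organisms

# "#1 NOT #2": human RefSeq entries that are not HSF1.
human_without_hsf1 = human_refseq - hsf1_all_organisms

print("AND:", sorted(human_hsf1))
print("OR: ", sorted(either))
print("NOT:", sorted(human_without_hsf1))
```

This also shows why the order of the two searches does not matter for AND and OR, but does matter for NOT.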