BSc Bioinformatics

Spring Term 2004
Exercise: Searching Bioinformatics Databases
These exercises are expected to take you about six hours – that is,
they should fill the time allocated to the two sessions on 21 and 28 January.
The work is fairly close to a “real life” situation at the beginning of a bioinformatics
research project. We are expecting you to find out as much as you can about a genetic
disease, using bioinformatics databases and the scientific literature.
We are concerned with the eye disease aniridia¹, and the genetic defects that cause it.
The exercise is in two parts. In the first part, you will explore some bioinformatics
databases and answer some questions about aniridia and the gene that is mutated in
patients with this condition. In the second, you will write an essay of no more than 800
words about this gene and the protein that it encodes. You should refer to the literature as
well as to online bioinformatics resources when preparing this essay: you may, of course,
also use textbooks. All sources must be referenced. This is an assessed essay: please
hand it in to me, with your answers to the questions on this sheet, at the lecture on
Wednesday 11 February.
The URLs of the databases referred to in the exercises are listed on the last page of this
handout. (You are, of course, not limited to these particular resources!)
You are not expected to do all this work in the set Wednesday evening sessions.
However, the computers in the Crystallography basement room have been reserved for
your use during those sessions, and Mark Halling-Brown will be there then to answer
your questions about the databases and their use. If you have questions about the
exercises, or if you have serious problems identifying the gene, email me –
c.sansom@mail.cryst.bbk.ac.uk
¹ This practical exercise is loosely based on some of the material in the “Bioinformatics Essentials” course
run by the Rosalind Franklin Centre for Genomic Research, http://www.rfcgr.mrc.ac.uk.
Database Exercises
Note (1): all the URLs that you need for this exercise are printed on the back page of this
handout.
Note (2): it will be perfectly possible for you to get through this exercise in much less
than three hours, if you don’t look farther than the answers to the set questions. You
should take your time to explore the workings of the databases and to read the
appropriate database entries thoroughly: this will help you a lot when you come to write
your essay.
If you are interested in a genetic disease, one good place to start is the OMIM database –
Online Mendelian Inheritance in Man [1]. This is a database of genetic defects and
human disease phenotypes.
Go to the OMIM home page [1], and search for aniridia.
What is the name of the gene that is mutated in patients with type II aniridia? _________
Which of the human chromosomes is it located on? __ On which cytogenetic band? ____
What are some of the symptoms of this disease?
________________________________________________________________________
Now follow some of the links to papers in the PubMed database [2]. PubMed is one of
the freely accessible online versions of the Medline database of the medical and related
literature. (Authors’ names link to the reference list in the OMIM entry; you will need to
find and click on a Medline numerical ID to access the full reference.)
Give the reference of one paper that describes a mouse gene that is homologous to our
gene.
________________________________________________________________________
You may find some of the papers linked from this entry useful for your essay: but, for
now, leave PubMed and explore some more clinical and mutation databases.
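PubMed can also be searched from a script rather than through the web page. The sketch below is not part of the exercise: it only builds the address of a search using NCBI's E-utilities "esearch" service, and does not contact the server. The function name build_pubmed_search_url and the search term are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Return an esearch URL asking PubMed for the IDs of papers matching `term`.

    Hypothetical helper for illustration; it builds the URL but sends nothing.
    """
    params = {
        "db": "pubmed",         # which Entrez database to search
        "term": term,           # the query, as you would type it in the search box
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON reply instead of XML
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Example query (an assumption for illustration only).
url = build_pubmed_search_url("aniridia mouse", max_results=5)
print(url)
```

Pasting the printed address into a browser should return a list of PubMed IDs, which you can then look up in the usual way.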
The Human Gene Mutation Database (HGMD) is a good place to start to look at the
individual mutations that cause a particular disease. Go to the HGMD home page [3],
select “HGMD Search” and type aniridia into the keyword field. If you click on the single
entry retrieved you will see a list of mutations that cause this disease, sorted by mutation
type and then by phenotype. Look at this second list.
How many nucleotide substitution type mutations have been found in this gene in total?
____
How many of these cause aniridia (rather than a different eye disease)? ____
If you look hard enough you will find a link to a database that only contains information
about this gene and its mutations. What is its URL?
________________________________________________________________________
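The counts asked for above are tallies of mutation records grouped by mutation type and by phenotype. A minimal sketch of that kind of tally follows. The records are invented for illustration and are not HGMD data, so the numbers printed are not the answers to the questions.

```python
from collections import Counter

# Hypothetical records as (mutation type, phenotype) pairs. NOT real HGMD data.
records = [
    ("nucleotide substitution", "aniridia"),
    ("nucleotide substitution", "aniridia"),
    ("nucleotide substitution", "other eye disease"),
    ("small deletion", "aniridia"),
    ("small insertion", "aniridia"),
]

# Tally by mutation type alone, then by the (type, phenotype) pair.
by_type = Counter(mutation_type for mutation_type, _ in records)
by_type_and_phenotype = Counter(records)

substitutions = by_type["nucleotide substitution"]
substitutions_aniridia = by_type_and_phenotype[("nucleotide substitution", "aniridia")]

print("substitutions in total:", substitutions)
print("substitutions causing aniridia:", substitutions_aniridia)
```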
Compare the information available in HGMD with that in another resource concerned
with mutations: the SNP database, dbSNP. SNPs are Single Nucleotide Polymorphisms –
single base changes between individual organisms. We know of over 3 million SNP
positions in the human genome: that’s one in about 1,000 bases. Many of these occur in
non-coding DNA: others still are “silent” mutations where the base change does not
affect the amino acid coded at that position.
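To make the last two points concrete, here is a short sketch. It first repeats the density arithmetic, assuming a genome of roughly 3,000 million bases (an approximate figure, implied by the numbers above). It then classifies a single base change in a codon as silent, missense or nonsense. The codon table holds only the few standard-code entries needed for the example, and the function name classify_substitution is an assumption made for illustration.

```python
# Density arithmetic: about one SNP position per 1,000 bases.
snp_positions = 3_000_000
genome_bases = 3_000_000_000  # assumed approximate size of the human genome
bases_per_snp = genome_bases // snp_positions

# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "GGA": "Gly", "GGC": "Gly",
    "AGA": "Arg", "CGA": "Arg",
    "TGA": "Stop",
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change at `position` (0, 1 or 2) in `codon`."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before = CODON_TABLE[codon]
    after = CODON_TABLE[mutated]
    if before == after:
        kind = "silent"      # same amino acid: no effect on the protein sequence
    elif after == "Stop":
        kind = "nonsense"    # premature stop codon
    else:
        kind = "missense"    # a different amino acid
    return mutated, kind


print(bases_per_snp)
print(classify_substitution("GGA", 2, "C"))  # Gly -> Gly
print(classify_substitution("GGA", 0, "A"))  # Gly -> Arg
print(classify_substitution("CGA", 0, "T"))  # Arg -> Stop
```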
Go to the dbSNP home page [4]; enter the name of your gene into the search box, and
click “Go”. Select the top entry retrieved.
What is the base change recorded in this SNP? ______
Now go back to HGMD, and follow a link towards the bottom of the page to the entry
for this gene in GeneCards [5]. GeneCards is an integrated database or “encyclopaedia”
of different types of information about disease-causing genes in the human genome.
What are some of the other types of information about this gene available from its
GeneCards entry?
________________________________________________________________________
You may want to bookmark this page, as its links will be helpful when you come to write
your essay.
The Ensembl genome browser [6] has been set up to provide access to automatic
annotation of complete and near-complete eukaryotic genomes. It is a complex resource;
you should read through the Ensembl tour (linked on the left-hand side of the page, under
“Help and documentation”) before attempting to answer even these simple questions
using it.
How many species are represented in the current version of Ensembl? ____
Follow the links to the Human database and then to the chromosome containing our gene.
Find the band where the gene is located (by eye). What can you tell from the picture
about the gene density and the SNP density of this region?
________________________________________________________________________
Now click on “Browse OMIM databases on this chromosome” (towards the bottom of the
page) and scroll through alphabetically until you reach aniridia. (You will not have to go
very far!) Click on the entry.
What is the Ensembl Gene ID of this gene? ______________
Give the Latin and common names of one species that is predicted to contain an
orthologue of this gene, other than the rat and mouse. You may find the Taxonomy
Database [7] useful for this exercise.
Latin name: ______________________________
Common name: ___________________________
Scroll all the way down the page to “Transcript/Translation Summary”. How many exons
does the transcript of this gene contain? _____
Now scroll back up to “Predicted Transcript” and you will see three different predictions.
Note that the thick bars represent predicted exons, and the thin lines introns.
Only one of these has the right number of exons – is it the top, middle or bottom
transcript? __________ What is its length in kb? ____________
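If it helps to see how the exon count and the lengths are related, the following sketch works them out from a list of exon coordinates. The coordinates are invented for illustration and do not describe our gene. The function name transcript_summary is likewise an assumption.

```python
# Hypothetical exon coordinates as (start, end), 1-based and inclusive.
# These are NOT the coordinates of the gene in this exercise.
exons = [(1000, 1199), (3000, 3149), (7500, 7999)]


def transcript_summary(exons):
    """Return exon count, spliced length, intron lengths and genomic span."""
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    genomic_span = ordered[-1][1] - ordered[0][0] + 1
    return {
        "exon_count": len(ordered),
        "spliced_length_bases": sum(exon_lengths),  # exons only (thick bars)
        "intron_lengths_bases": intron_lengths,     # thin lines
        "genomic_span_kb": genomic_span / 1000,     # exons plus introns, in kb
    }


summary = transcript_summary(exons)
print(summary)
```

Note that the length of the spliced transcript (exons only) is not the same as the length of the region it spans on the chromosome (exons plus introns), so check which of the two a browser is reporting.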
You should already have spotted references to the SwissProt database on your travels
through the databases. This is a very well annotated database of protein sequences, but it
is less complete than many other sequence databases. If you can find an entry in SwissProt for the protein
that you are interested in, you are in luck: fortunately, this is the case here!
Go to the SwissProt main page [8] and type the name of your gene into the box next to
“Search SwissProt/TrEMBL”. When the results are returned, select the SwissProt entry
corresponding to the human gene (it should be fairly obvious which one it is), and click
on it.
What is the name of the entry in the SwissProt database corresponding to the protein that
is encoded by our gene? ____________
What is its primary accession number? ___________
What, from the SwissProt annotation, is the function of this protein?
________________________________________________________________________
________________________________________________________________________
This protein contains a feature known as a “homeobox”. If you scroll down the SwissProt
entry you will find links to this feature in several databases of protein families, including
InterPro [9]. Find the link to the homeobox feature (or domain) in InterPro and follow it.
Which other macromolecule does the homeobox domain in our protein bind to? _______
What features of protein structure are involved in this interaction? __________________
Finally, go back to the SwissProt entry and look for links to “GO” [10]. This is the Gene
Ontology database – a structured vocabulary of biochemical terminology. It is designed
to standardise terminology and make automatic text searching easier.
Follow the link to GO’s definition of “Vision”, and reproduce it here.
________________________________________________________________________
________________________________________________________________________
Do you find this definition clear? If not, why do you think this is?
________________________________________________________________________
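One way to picture a structured vocabulary is as a set of terms in which each term points to one or more broader "parent" terms. The sketch below uses a toy hierarchy invented for illustration. The term names and their arrangement are assumptions, not the contents of GO. It shows why this helps automatic searching: a search on a broad term can also retrieve entries annotated with any narrower term.

```python
# Toy vocabulary: each term maps to its broader parent terms.
# Invented for illustration; NOT taken from the Gene Ontology itself.
TOY_ONTOLOGY = {
    "vision": ["sensory perception"],
    "hearing": ["sensory perception"],
    "sensory perception": ["physiological process"],
    "physiological process": [],
}

# Hypothetical annotations: database entry -> term assigned to it.
ANNOTATIONS = {
    "entry_A": "vision",
    "entry_B": "hearing",
    "entry_C": "physiological process",
}


def ancestors(term, ontology):
    """Return every broader term reachable from `term`, nearest first."""
    found = []
    queue = list(ontology[term])
    while queue:
        parent = queue.pop(0)
        if parent not in found:
            found.append(parent)
            queue.extend(ontology[parent])
    return found


def entries_for(term, annotations, ontology):
    """Return entries annotated with `term` or with any narrower term."""
    return sorted(
        entry
        for entry, assigned in annotations.items()
        if assigned == term or term in ancestors(assigned, ontology)
    )


print(ancestors("vision", TOY_ONTOLOGY))
print(entries_for("sensory perception", ANNOTATIONS, TOY_ONTOLOGY))
```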
That’s it! Well done for getting this far. Now you may start working on your essay.
Database URLs
[1] OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
[2] PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
[3] HGMD: http://www.hgmd.org
[4] dbSNP: http://www.ncbi.nlm.nih.gov/SNP/index.html
[5] GeneCards: http://bioinfo.weizmann.ac.il/cards/index.html
[6] Ensembl: http://www.ensembl.org
[7] Taxonomy Database: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
[8] SwissProt: http://ca.expasy.org/sprot/
[9] InterPro: http://www.ebi.ac.uk/interpro/
[10] Gene Ontology: http://www.ebi.ac.uk/GOA/