BSc Bioinformatics Spring Term 2004

Exercise: Searching Bioinformatics Databases

These exercises are expected to take you not much more than about six hours – that is, they should fill the time allocated to the two sessions on 21 and 28 February. The work is fairly close to a “real life” situation at the beginning of a bioinformatics research project. We are expecting you to find out as much as you can about a genetic disease, using bioinformatics databases and the scientific literature. We are concerned with the eye disease aniridia, and the genetic defects that cause it.

The exercise is in two parts. In the first part, you will explore some bioinformatics databases and answer some questions about aniridia and the gene that is mutated in patients with this condition. In the second, you will write an essay of no more than 800 words about this gene and the protein that it encodes. You should refer to the literature as well as to online bioinformatics resources when preparing this essay: you may, of course, also use textbooks. All sources must be referenced. This is an assessed essay: please hand it in to me, with your answers to the questions on this sheet, at the lecture on Wednesday 11 February.

The URLs of the databases referred to in the exercises are listed on the last page of this handout. (You are, of course, not limited to these particular resources!)

You are not expected to do all this work in the set Wednesday evening sessions. However, the computers in the Crystallography basement room have been reserved for your use during those sessions, and Mark Halling-Brown will be there to answer your questions about the databases and their use.
If you have questions about the exercises, or if you have serious problems identifying the gene, email me – c.sansom@mail.cryst.bbk.ac.uk

This practical exercise is loosely based on some of the material in the “Bioinformatics Essentials” course run by the Rosalind Franklin Centre for Genomic Research, http://www.rfcgr.mrc.ac.uk.

1 Database Exercises

Note (1): all the URLs that you need for this exercise are printed on the back page of this handout.

Note (2): it will be perfectly possible for you to get through this exercise in much less than three hours, if you don’t look farther than the answers to the set questions. You should take your time to explore the workings of the databases and to read the appropriate database entries thoroughly: this will help you a lot when you come to write your essay.

If you are interested in a genetic disease, one good place to start is the OMIM database – Online Mendelian Inheritance in Man [1]. This is a database of genetic defects and human disease phenotypes. Go to the OMIM home page [1], and search for aniridia.

What is the name of the gene that is mutated in patients with type II aniridia? _________

Which of the human chromosomes is it located on? __

On which cytogenetic band? ____

What are some of the symptoms of this disease?
________________________________________________________________________

Now follow some of the links to papers in the PubMed database [2]. PubMed is one of the freely accessible online versions of the Medline database of the medical and related literature. (Authors’ names link to the reference list in the OMIM entry; you will need to find and click on a Medline numerical ID to access the full reference.)

Give the reference of one paper that describes a mouse gene that is homologous to our gene.
________________________________________________________________________

You may find some of the papers linked from this entry useful for your essay: but, for now, leave PubMed and explore some more clinical and mutation databases.

The Human Gene Mutation Database (HGMD) is a good place to start to look at the individual mutations that cause a particular disease. Go to the HGMD home page [3], select “HGMD Search” and type aniridia into the keyword field. If you click on the single entry retrieved you will see a list of mutations that cause this disease, sorted by mutation type and then by phenotype. Look at this second list.

How many nucleotide substitution type mutations have been found in this gene in total? ____

How many of these cause aniridia (rather than a different eye disease)? ____

If you look hard enough you will find a link to a database that only contains information about this gene and its mutations. What is its URL?
________________________________________________________________________

Compare the information available in HGMD with that in another resource concerned with mutations: the SNP database, dbSNP. SNPs are Single Nucleotide Polymorphisms – single base changes between individual organisms. We know of over 3 million SNP positions in the human genome: that is about one in every 1,000 bases. Many of these occur in non-coding DNA; others are “silent” mutations, where the base change does not affect the amino acid coded at that position. Go to the dbSNP home page [4]; enter the name of your gene into the search box, and click “Go”. Select the top entry retrieved.

What is the base change recorded in this SNP? ______

Now go back to HGMD, and follow a link towards the bottom of the page to the entry for this gene in GeneCards [5]. GeneCards is an integrated database or “encyclopaedia” of different types of information about disease-causing genes in the human genome.
What are some of the other types of information about this gene available from its GeneCards entry?
________________________________________________________________________

You may want to bookmark this page, as its links will be helpful when you come to write your essay.

The Ensembl genome browser [6] has been set up to provide access to automatic annotation of complete and near-complete eukaryotic genomes. It is a complex resource; you should read through the Ensembl tour (linked on the left-hand side of the page, under “Help and documentation”) before attempting to answer even these simple questions using it.

How many species are represented in the current version of Ensembl? ____

Follow the links to the Human database and then to the chromosome containing our gene. Find the band where the gene is located (by eye). What can you tell from the picture about the gene density and the SNP density of this region?
________________________________________________________________________

Now click on “Browse OMIM databases on this chromosome” (towards the bottom of the page) and scroll through alphabetically until you reach aniridia. (You will not have to go very far!) Click on the entry.

What is the Ensembl Gene ID of this gene? ______________

Give the Latin and common names of one species that is predicted to contain an orthologue of this gene, other than the rat and mouse. You may find the Taxonomy Database [7] useful for this exercise.

Latin name: ______________________________
Common name: ___________________________

Scroll all the way down the page to “Transcript/Translation Summary”.

How many exons does the transcript of this gene contain? _____

Now scroll back up to “Predicted Transcript” and you will see three different predictions. Note that the thick bars represent predicted exons, and the thin lines introns. Only one of these has the right number of exons – is it the top, middle or bottom transcript? __________

What is its length in kb?
____________

You should already have spotted references to the SwissProt database on your travels through the databases. This is a very well annotated database of protein sequences, but it is not as complete as many databases. If you can find an entry in SwissProt for the protein that you are interested in, you are in luck: fortunately, this is the case here!

Go to the SwissProt main page [8] and type the name of your gene into the box next to “Search SwissProt/TrEMBL”. When the results are returned, select the SwissProt entry corresponding to the human gene (it should be fairly obvious which one it is), and click on it.

What is the name of the entry in the SwissProt database corresponding to the protein that is encoded by our gene? ____________

What is its primary accession number? ___________

What, from the SwissProt annotation, is the function of this protein?
________________________________________________________________________
________________________________________________________________________

This protein contains a feature known as a “homeobox”. If you scroll down the SwissProt entry you will find links to this feature in several databases of protein families, including InterPro [9]. Find the link to the homeobox feature (or domain) in InterPro and follow it.

Which other macromolecule does the homeobox domain in our protein bind to? _______

What features of protein structure are involved in this interaction? __________________

Finally, go back to the SwissProt entry and look for links to “GO” [10]. This is the Gene Ontology database – a structured vocabulary of biochemical terminology. It is designed to standardise terminology and make automatic text searching easier. Follow the link to GO’s definition of “Vision”, and reproduce it here.
________________________________________________________________________
________________________________________________________________________

Do you find this definition clear? If not, why do you think this is?
________________________________________________________________________

That’s it! Well done for getting this far. Now you may start working on your essay.

Database URLs

[1] OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
[2] PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
[3] HGMD: http://www.hgmd.org
[4] dbSNP: http://www.ncbi.nlm.nih.gov/SNP/index.html
[5] GeneCards: http://bioinfo.weizmann.ac.il/cards/index.html
[6] Ensembl: http://www.ensembl.org
[7] Taxonomy Database: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
[8] SwissProt: http://ca.expasy.org/sprot/
[9] InterPro: http://www.ebi.ac.uk/interpro/
[10] Gene Ontology: http://www.ebi.ac.uk/GOA/
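
Optional check on the SNP density quoted in the dbSNP exercise: the figure of roughly one known SNP per 1,000 bases follows from dividing the genome size by the number of known SNP positions. The short Python sketch below does this arithmetic. The genome size of about 3 billion bases is an assumption (it is not stated in this handout), and the function name bases_per_snp is purely illustrative.

```python
# Rough check of the SNP density quoted in the dbSNP exercise.
# ASSUMPTION: the human genome is taken as roughly 3 billion bases;
# this figure is not given in the handout.
GENOME_SIZE_BASES = 3_000_000_000
KNOWN_SNP_POSITIONS = 3_000_000  # "over 3 million", as quoted in the handout


def bases_per_snp(genome_size: int, snp_count: int) -> float:
    """Return the average number of bases per known SNP position."""
    if snp_count <= 0:
        raise ValueError("snp_count must be positive")
    return genome_size / snp_count


if __name__ == "__main__":
    spacing = bases_per_snp(GENOME_SIZE_BASES, KNOWN_SNP_POSITIONS)
    print(f"About one known SNP every {spacing:,.0f} bases")
```

This is only an average over the whole genome: as the Ensembl chromosome view in the exercises shows, SNP density varies from region to region.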