Exercise databases Bioinformatics (updated 2013 September)
Kathleen Marchal

Exercises: Biological databases

1 Discovering genome projects in NCBI

1.1 View the genome sequence initiatives (go to BioProject)

- How many prokaryotic genomes have been sequenced? (October 2006: 381)
- How many are in progress? (October 2006: 267) Note also the nice taxonomic overview of all prokaryotic species that have been sequenced.
- How many plant species have been fully sequenced?
- What is the 1000 Genomes Project?
- What is HMP?
- Can you find the mammoth sequencing project?
- Search for the genomic map of the chimp (Pan troglodytes).

(Screenshot: Microbial genomes. Browse genomes is at the bottom of the page if you wait long enough.)
(Screenshot: Plant genomes.)
(Screenshot: Find a full genome (mammoth, chimp).)

1.2 Get a look at the large initiatives

- What is HMP? Why is this an example of a metagenomics project?
- What is the 1000 Genomes Project? Why can it be useful? See also http://www.1000genomes.org/about

The 1000 Genomes Project (human)

“The purpose of the project is to support the discovery and understanding of genetic variants that influence human disease. Specifically defined goals are (a) the discovery of single nucleotide variants at frequencies of 1% or higher in diverse populations, (b) even more comprehensive discovery (variants down to frequencies of 0.1 - 0.5%) in functional gene regions, and (c) discovery of structural variants, such as copy number variants, other insertions and deletions, and inversions, including sequence-level understanding of breakpoints.
The volume of data generated by 1000genomes project is unprecedented. The data is accessible from two mirrored ftp sites at EBI and NCBI.”

2 Using the Entrez search engine to discover distinct databases at NCBI

2.1 PubMed database

Search for articles on pax6.

2.2 Gene

Check out the Gene database. This is the major curation project at NCBI. It aims to convert the redundant sequence databases into one non-redundant, comprehensive sequence database in which each locus in the genome is completely described by one or more representative mRNA sequences. Entrez Gene is the American counterpart of Ensembl.

http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene

- Search in the Gene database for pax6 human (note the difference in result when searching with AND or using the preview/index).
- Find the accession numbers of the pax6 transcripts, proteins and genomic RefSeq sequence.
- Find the Pax6 Gene ID.
- Compare the results with what you found for Pax6 at Ensembl (see later).

The Gene database contains, for each locus in the genome, all associated features (indicated by the corresponding Gene IDs). A transcript is indicated by NM, a protein by NP, a genomic contig by NT. All features (mRNA, genomic DNA, EST) associated with the same locus obtain the same Gene ID. The output is less graphical than Ensembl (see below). In Gene, non-redundant sequence features are also grouped to generate a comprehensive view of the gene.

- How many transcripts are known, and to how many different isoforms do they correspond? For Pax6, note that you find 2 splice variants (now there are more; sequence databases get continuously updated).
- Find the sequence entries from which Gene was derived.
- What is the meaning of 2 alternative assemblies?
- Select the genome view. Can you see the two representative isoforms? What is the meaning of the purple squares?
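
The prefix convention described above (NM for a transcript, NP for a protein, NT for a genomic contig) can be sketched as a small lookup. This is only an illustration and not an NCBI tool: the function name `classify_accession` and the table `REFSEQ_PREFIXES` are hypothetical, and RefSeq defines more prefixes than the three listed in this handout.

```python
# Illustrative lookup of the RefSeq accession prefixes named in this handout.
# Hypothetical helper, not part of any NCBI library; RefSeq has further
# prefixes that are deliberately not listed here.
REFSEQ_PREFIXES = {
    "NM": "mRNA transcript",
    "NP": "protein",
    "NT": "genomic contig",
}


def classify_accession(accession: str) -> str:
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    # RefSeq accessions look like PREFIX_digits, e.g. NT_009237.
    prefix, separator, _ = accession.strip().upper().partition("_")
    if not separator:
        return "not a RefSeq-style accession"
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


if __name__ == "__main__":
    # NT_009237 is the contig accession used in section 2.3.
    print(classify_accession("NT_009237"))
```

When you collect the Pax6 accession numbers from the Gene page, sorting them by prefix in this way makes it easy to see which entries are transcripts, which are proteins and which are genomic sequences.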
Find the GO categories of Pax6 (note that there are three ontology classification systems: function, process, component).

Find the diseases in which Pax6 is involved. How is this gene found to be related to these diseases? (Via a GWAS study: how many variants have been detected in this gene? GWAS was performed against which trait?)

2.3 Redundant sequence database: Nucleotide, Protein, Genome, EST…

This database contains all the redundant information that is used by Ensembl and Gene.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide

Search for sequence entries that contain pax6 and human with a complex query, using the ‘limits’ and ‘advanced’ buttons.

1. First just search for Pax 6. What do you retrieve?
2. Search Pax6 and limit to genome. Use an advanced search to find pax6 (gene name) in human (organism) and limit to genomic sequences only. There are many entries, most of which contain only part of the sequence (incomplete, e.g. only a certain exon, and many sequences that come from genomic surveys).
3. Indicate that you only want the RefSeq sequences. 5 entries contain the complete genomic sequence, derived from alternative assemblies. Use the accession number NT_009237. Make a visual representation of this genomic sequence (contig) by clicking on Graphics.
4. Open the GenBank file of a gene entry and interpret the output (what is the difference between an exon and an mRNA, do you find the reference ID?).
View pax6 graphically.

Through the RefSeq gene number you can also go to the gene entry.

Location: 11p13
Sequence: Chromosome: 11; NC_000011.9 (31806340..31839509, complement)
See PAX6 in Epigenomics, MapViewer

- Repeat the exercise, but now restrict the search to mRNA only. In this case you will find accession numbers that start with NM_: these are RefSeq sequences representing the transcript. Besides these you will also find some cDNA sequences that are derived from the IMAGE clone library (publicly available libraries that contain all clones covering the human cDNAs; these are used for microarray construction).

2.4 EST database and UniGene

Go to UniGene. Go to the overview page for human (UniGene statistics).

- How many UniGene clusters contain only 1 sequence (i.e. unclustered sequences)? What will happen if more EST sequences become available?
- How many clusters contain both an mRNA sequence and an EST? How many contain only an EST? What will be the most reliable clusters? (HTC = high-throughput cDNA. Sequences in this division may still have 5' and 3' UTRs at their ends, partial coding regions, and introns.)
- Search for the Homo sapiens pax6 UniGene cluster. Interpret the output: on which sequences was the cluster built? Which other organisms contain a protein similar to human pax6?
- View the expression of the Pax6 gene based on the analysis of EST counts (expression, EST counts). In which tissues do you expect the gene to be expressed? Is this the case?
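
To make the idea of expression from EST counts concrete, here is a minimal sketch that normalises a gene's EST count by the size of the tissue library, assuming the usual transcripts-per-million (TPM) convention for EST profiles. The function name and all counts below are hypothetical illustrations, not real UniGene data for Pax6.

```python
def transcripts_per_million(gene_est_count: int, pool_est_count: int) -> float:
    """Normalise a gene's EST count by the total number of ESTs in the pool."""
    if pool_est_count <= 0:
        raise ValueError("pool_est_count must be positive")
    if gene_est_count < 0:
        raise ValueError("gene_est_count cannot be negative")
    # Multiply first so that simple ratios stay exact in floating point.
    return gene_est_count * 1_000_000 / pool_est_count


# HYPOTHETICAL counts, invented only to show the calculation:
# tissue -> (ESTs assigned to the gene, total ESTs sequenced from that tissue)
hypothetical_pools = {
    "tissue A": (12, 200_000),
    "tissue B": (0, 150_000),
}

if __name__ == "__main__":
    for tissue, (gene_ests, pool_ests) in hypothetical_pools.items():
        tpm = transcripts_per_million(gene_ests, pool_ests)
        print(f"{tissue}: {tpm:.1f} TPM")
```

Normalising by library size matters because tissues differ widely in how many ESTs were sequenced; a raw count of 12 means little without knowing the size of the pool it came from.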
From the UniGene page, go to the DDD (digital differential display) and compare the difference in expression between two human tissues.

2.5 Performing more advanced searches using Entrez

1. This problem practices using the Entrez search program at the National Center for Biotechnology Information (NCBI) to search for the amino acid sequence of the human heat shock factor HSF1. Normally a large number of matches are found in such searches. We will use the Entrez Boolean search features, which restrict the reported matches to a series of required conditions. This allows us to narrow the search to the sequence we want.

a) Go to the Entrez web site and choose Protein.
b) Enter the terms heat shock factor in the search window and click GO [heat shock protein AND human]. This search finds any sequence entry in the protein sequence database that includes this phrase.
c) Now limit the search: click on advanced search, go to add terms, choose organism in the first box, type human in the second, then click AND to limit the search to just human proteins, and then click preview. The history will show the results of a search for database entries with the term heat shock protein AND originating from human as the organism. How many hits are there now?
d) We can limit the hits to matches in RefSeq, NCBI’s curated sequence database, to obtain a best representative sequence entry for each protein. Click on Limits and, in the Limited to section of the page, ignore the boxes on the left and choose RefSeq in the right box. Then click GO and history. Now we have all human heat shock factors in RefSeq. The gene of interest is HSF 1. Add this term to the query. How many hits did you receive? [no limit on gene name]
e) The gene of interest is HSF. Click clear in the text entry box at the top of the page, type HSF 1 and click preview.
You obtain more hits because you performed a keyword search. It is better to search via the limits option.
f) There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF 1 sequences in all organisms, and then select the human one using another Boolean search feature of Entrez. First clear the history, clear the upper text box, and reselect advanced search. Enter human and organism in the text box, click Limits, and limit to RefSeq. Click GO and then History. Now we have a complete list of human sequences in RefSeq.
g) Now click Advanced search, choose gene name in the left box and enter HSF1. Combine this search with the previous one using Booleans in the history. The result should be a small number of HSF 1 proteins.
h) Finally, note the RefSeq accession number starting with NP and click to display the FASTA format. NP identifies the protein as a curated protein sequence. The sequence may be copied and pasted into a simple text editor and saved as a local file.
i) While on the page with the target sequence, click on Links and choose the Gene option. Now the gene entry becomes visible. Note that the RefSeq numbers in the Gene database start with NM for annotated mRNA and NT for annotated genome/chromosome.
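
The Boolean searches in steps c) to h) can also be written as a single Entrez query string. The sketch below only builds the query and the corresponding URLs; it does not contact NCBI. It assumes the NCBI E-utilities web service (esearch and efetch), which this handout does not use, so treat it as one possible scripted equivalent of the web forms. The helper names (`field`, `all_of`, `esearch_url`, `efetch_fasta_url`) are hypothetical, and the accession passed to `efetch_fasta_url` is a placeholder for the NP number you find in step h).

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (an assumption of this sketch;
# the handout itself works through the Entrez web pages).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def field(term: str, tag: str) -> str:
    """Restrict a term to one Entrez search field, e.g. human[Organism]."""
    return f"{term}[{tag}]"


def all_of(*clauses: str) -> str:
    """Combine clauses with Boolean AND, parenthesising each clause."""
    return " AND ".join(f"({clause})" for clause in clauses)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an esearch request for the given query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db: str, accession: str) -> str:
    """Build (but do not send) an efetch request for a FASTA record."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Step f): all human sequences in RefSeq.
human_refseq = all_of(field("human", "Organism"),
                      field("srcdb_refseq", "Properties"))
# Step g): HSF1 as a gene name, combined with the previous search.
hsf1_human_refseq = all_of(human_refseq, field("HSF1", "Gene Name"))

if __name__ == "__main__":
    print(hsf1_human_refseq)
    print(esearch_url("protein", hsf1_human_refseq))
    # "NP_PLACEHOLDER" stands in for the accession noted in step h).
    print(efetch_fasta_url("protein", "NP_PLACEHOLDER"))
```

Restricting HSF1 to the gene name field, rather than typing it as a free keyword, is the scripted counterpart of the remark in step e) that a keyword search returns more hits than a search via the limits option.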