Introduction to Bioinformatics Databases

The Problem
• Current size of GenBank (June 2011): 129,178,292,958 bp (1.3 x 10^11, or 129 giga-base pairs) in 140,482,268 entries. This doesn’t include unprocessed genomic sequences, which would double the size.
  – This would be hard to deal with if it were all written on pieces of paper stored in file cabinets.
  – In 1982, GenBank contained 690,338 bp in 606 entries, which fit on two 5 ¼ inch floppy disks (360 kB capacity), which GenBank mailed to you.
• At its simplest, a database is a way to store information and retrieve it efficiently.
  – However, nearly all databases add value to the information by processing it in different ways.
• We want to introduce some ideas about how databases work, and what kinds of databases are available.

Database Background
• The general concepts of storing and retrieving data go back to the very beginnings of writing.
• How to record data in a uniform fashion, and how to file it where you can find it again later. Concepts like:
  – forms,
  – alphabetical order,
  – serial numbers,
  – filing cabinets with separate drawers and folders within drawers
  – use of machines to accurately tabulate information (typewriter, adding machine, etc.)
• Computers allowed even larger amounts of information to be stored
  – Computers are like being blind: they deal with information one small chunk at a time, and they can’t see what’s coming next.
[Figure: Sumerian land purchase records from about 2400 BCE]

Codd’s Normal Forms
• In 1970, IBM researcher E. F. Codd published the seminal paper “A relational model of data for large shared data banks”, which specified the basic design principles still used today for designing databases.
  – “relational” databases, that is. There are other ways of doing databases, not as widely used.
• Codd’s fundamental insight was to put data into multiple tables connected together by unique keys. This is opposed to the typical spreadsheet idea of having all the data together in one big table.
• Making a database adhere to Codd’s principles is called “normalization”, and the principles themselves are called “normal forms”.
  – At present there are 6-8 normal forms: defining them is part of database theoretical research (“relational algebra”).
• Databases that conform to the normal forms are:
  – Easy and fast to search, and give the correct unique results
  – Easy to update: each piece of data is stored in a single location
  – Easy to extend to new types of data

Basic Structure
• Databases are composed of tables of data.
  – Tables hold logically related sets of data. A table is essentially the same thing as a spreadsheet: a set of rows and columns.
• Each table has several records:
  – A record stores all the information for a given individual
  – Records are the rows of a data table
• Each record has several fields:
  – A field is an individual piece of data, a single attribute of the record.
  – Fields are the columns of a data table
• Each record (row) has a unique identifier, the primary key.
  – The primary key serves to identify the data stored in this record across all the tables in the database.
• Databases are manipulated with a language called SQL (Structured Query Language). It’s a “baby English” type of language: it uses real words, but is rigid about their order and placement.
• Various database software: Oracle, MS Access, MySQL, etc.

Tables in the GL database
• Example: a pet grooming business called Grooming Lodge.
  – thanks to L. Jennifer Bosch, who created this example
• I am using database software called MySQL. It is running on one of our departmental servers, using a “command line” interface.
• The computer shows me a prompt: mysql> and I type in a command: show tables; I then hit Enter, and the data prints onto my monitor.
• This database uses 3 tables:
  1. Charges: a list of each bill sent to a customer
  2. Clients: contact information for each customer
  3. Pets: a list of each individual pet seen at Grooming Lodge.
mysql> show tables;
+--------------+
| Tables_in_GL |
+--------------+
| charges      |
| clients      |
| pets         |
+--------------+

GL Data Table
• Here is part of the “clients” table. Each record (row) is a client, and each client has several attributes or fields (columns).
• Note that each client has a unique identifier, the client_id. This is a very important aspect of a good database table: each record represents one unique individual, with no duplicates and none left out.
  – Names themselves are often not unique, so an identifier number is used.
  – The unique ID for each record in the table is the primary key.

mysql> select client_id,name_last,name_first, phone,last_visit,balance from clients;
+-----------+------------+------------+--------------+------------+---------+
| client_id | name_last  | name_first | phone        | last_visit | balance |
+-----------+------------+------------+--------------+------------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | 2011-03-18 |     134 |
|         2 | Harford    | Cornelius  | 234-354-2987 | 2011-01-10 |       0 |
|         3 | Perkins    | Laura      | 815-823-9000 | 2010-05-01 |       0 |
|         4 | Gramme     | Barbara    | 898-555-2008 | 2011-02-05 |      23 |
|         5 | Granillo   | Richard    | 323-543-3328 | 2011-03-01 |      23 |
|         6 | Hambourger | Colleen    | 959-456-2345 | 2010-10-15 |       0 |
|         7 | Harrell    | Kenneth    | 324-888-5555 | 2010-03-08 |       0 |
+-----------+------------+------------+--------------+------------+---------+

Another Table
• On the “pets” table, each pet gets its own record and ID: pet_id is the primary key on this table.
• Pets are associated with their owner through the client_id, which is the same as the client_id primary key in the “clients” table.
• Note that all the data for each pet and for each client is only entered once. All connections between the tables go through the primary keys.
• The “charges” table has these fields: client_id, pet_id, job_code, and fee.
mysql> select pet_id,client_id,name,species,age,sex,breed from pets;
+--------+-----------+---------+---------+------+------+-----------------+
| pet_id | client_id | name    | species | age  | sex  | breed           |
+--------+-----------+---------+---------+------+------+-----------------+
|      1 |         1 | Merlin  | feline  |    3 | Mn   | DSH             |
|      2 |         1 | Azzy    | feline  |    4 | Fs   | DSH             |
|      3 |         1 | Brach   | feline  |    4 | Mn   | DSH             |
|      4 |         2 | Wheezy  | canine  |    7 | Mn   | Yorkie mix      |
|      5 |         3 | Foghorn | avian   |   12 | F    | Amazon - Mexica |
|      6 |         4 | Pepper  | feline  |   13 | Fs   | DLH             |
|      7 |         5 | Arlyn   | canine  |    5 | Fs   | Afghan hound    |
|      8 |         5 | Taylor  | canine  |    7 | Ms   | Afghan hound    |
|      9 |         6 | Creeper | reptile |    2 | M    | iguana          |
|     10 |         6 | Jasper  | canine  |   10 | Mn   | Dobie/Greyhound |
|     11 |         7 | Maxwell | canine  |    2 | Mn   | Dauschund       |
+--------+-----------+---------+---------+------+------+-----------------+

First Normal Form
• As an example of how database tables are made, here are Codd’s original concepts, the first normal form.
• He also came up with second and third normal forms, in later publications.
• 1NF is a way of putting data tables into a regular order, so they are easy to process.
• The first normal form (1NF):
  – The order of rows and columns is irrelevant (so you can’t have a row that refers to the “previous” row, etc.).
    • This allows random access to the data, rather than needing to read it from top to bottom.
  – No duplicate rows (this way, the data are stored only a single time)
  – No repeating columns
  – Each cell (row-column intersection) must contain a single item of data.

Grooming Lodge Data and 1NF
• What would the data look like as a spreadsheet?
  – One logical approach would be to list each client on a separate line along with their contact information. Since some people have more than one pet, this would lead to multiple data items in a single cell, or multiple identical columns. Hard to update accurately or search quickly.
  – Another approach would be to list each pet on a separate line along with its owner. This would lead to multiple copies of owner data, which will be difficult to update.
  – Another approach would be to list each owner a single time, with multiple pets referring to that line. This requires that the data be read in a specific order, top to bottom.
• By using 2 tables, one for clients and one for pets, all the data is in 1NF and is easy to search and update.

+-----------+------------+------------+--------------+------------------+
| client_id | name_last  | name_first | phone        | pets             |
+-----------+------------+------------+--------------+------------------+
| 1         | Bosch      | Linda      | 123-234-3456 | Merlin (feline), |
|           |            |            |              | Azzy (feline),   |
|           |            |            |              | Brach (feline)   |
| 2         | Harford    | Cornelius  | 234-354-2987 | Wheezy (canine)  |
| 3         | Perkins    | Laura      | 815-823-9000 | Foghorn (avian)  |
| 5         | Granillo   | Richard    | 323-543-3328 | Arlyn (canine),  |
|           |            |            |              | Taylor (canine)  |
+-----------+------------+------------+--------------+------------------+
Multiple data items in a single cell. Very difficult to search efficiently.

+-----------+------------+------------+--------------+---------+---------+
| client_id | name_last  | name_first | phone        | pet     | species |
+-----------+------------+------------+--------------+---------+---------+
| 1         | Bosch      | Linda      | 123-234-3456 | Merlin  | feline  |
| 1         | Bosch      | Linda      | 123-234-3456 | Azzy    | feline  |
| 1         | Bosch      | Linda      | 123-234-3456 | Brach   | feline  |
| 2         | Harford    | Cornelius  | 234-354-2987 | Wheezy  | canine  |
| 3         | Perkins    | Laura      | 815-823-9000 | Foghorn | avian   |
| 5         | Granillo   | Richard    | 323-543-3328 | Arlyn   | canine  |
| 5         | Granillo   | Richard    | 323-543-3328 | Taylor  | canine  |
+-----------+------------+------------+--------------+---------+---------+
Each pet gets its own line, resulting in multiple rows for a single client.
+----+------------+------------+--------------+-----------------+-----------------+----------------+
| id | name_last  | name_first | phone        | pet             | pet             | pet            |
+----+------------+------------+--------------+-----------------+-----------------+----------------+
| 1  | Bosch      | Linda      | 123-234-3456 | Merlin (feline) | Azzy (feline)   | Brach (feline) |
| 2  | Harford    | Cornelius  | 234-354-2987 | Wheezy (canine) |                 |                |
| 3  | Perkins    | Laura      | 815-823-9000 | Foghorn (avian) |                 |                |
| 5  | Granillo   | Richard    | 323-543-3328 | Arlyn (canine)  | Taylor (canine) |                |
+----+------------+------------+--------------+-----------------+-----------------+----------------+
Multiple columns for pets. Frequent need to expand the table, lots of columns with no data.

+-----------+------------+------------+--------------+---------+---------+
| client_id | name_last  | name_first | phone        | pet     | species |
+-----------+------------+------------+--------------+---------+---------+
| 1         | Bosch      | Linda      | 123-234-3456 | Merlin  | feline  |
| see 1     |            |            |              | Azzy    | feline  |
| see 1     |            |            |              | Brach   | feline  |
| 2         | Harford    | Cornelius  | 234-354-2987 | Wheezy  | canine  |
| 3         | Perkins    | Laura      | 815-823-9000 | Foghorn | avian   |
| 5         | Granillo   | Richard    | 323-543-3328 | Arlyn   | canine  |
| see 5     |            |            |              | Taylor  | canine  |
+-----------+------------+------------+--------------+---------+---------+
Rows refer to each other, requiring that the data be read from top to bottom.

Searching
• Retrieving information is a matter of specifying what you want:
  – which fields,
  – which tables,
  – which records you want, based on the values of given fields (i.e. the conditions for selection).

mysql> SELECT name, pet_id FROM pets WHERE species='feline';
+--------+--------+
| name   | pet_id |
+--------+--------+
| Merlin |      1 |
| Azzy   |      2 |
| Brach  |      3 |
| Pepper |      6 |
+--------+--------+
“which pets are cats?”

mysql> SELECT name_last, name_first, phone,balance FROM clients WHERE balance>0;
+-----------+------------+--------------+---------+
| name_last | name_first | phone        | balance |
+-----------+------------+--------------+---------+
| Bosch     | Linda      | 123-234-3456 |     134 |
| Gramme    | Barbara    | 898-555-2008 |      23 |
| Granillo  | Richard    | 323-543-3328 |      23 |
+-----------+------------+--------------+---------+
“who owes money?”

mysql> SELECT name_last, name_first, pets.name FROM clients, pets WHERE clients.client_id=pets.client_id AND species='feline';
+-----------+------------+--------+
| name_last | name_first | name   |
+-----------+------------+--------+
| Bosch     | Linda      | Merlin |
| Bosch     | Linda      | Azzy   |
| Bosch     | Linda      | Brach  |
| Gramme    | Barbara    | Pepper |
+-----------+------------+--------+
“which clients have cats?” (this question uses data from two different tables, connected by the client_id).
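The two-table query above can be tried without a MySQL server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL and only a handful of rows copied from the clients and pets tables; the function names build_demo_db and clients_with are hypothetical, chosen for illustration. It writes the join with an explicit JOIN ... ON clause, which should give the same rows as the comma-separated FROM clients, pets form used above.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a cut-down clients and pets table.

    Assumption: SQLite stands in for MySQL here, and only a few columns
    and rows from the slides are loaded.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clients (
            client_id  INTEGER PRIMARY KEY,
            name_last  TEXT,
            name_first TEXT
        );
        CREATE TABLE pets (
            pet_id    INTEGER PRIMARY KEY,
            client_id INTEGER REFERENCES clients(client_id),
            name      TEXT,
            species   TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO clients VALUES (?, ?, ?)",
        [(1, "Bosch", "Linda"), (2, "Harford", "Cornelius"), (4, "Gramme", "Barbara")],
    )
    conn.executemany(
        "INSERT INTO pets VALUES (?, ?, ?, ?)",
        [
            (1, 1, "Merlin", "feline"),
            (2, 1, "Azzy", "feline"),
            (3, 1, "Brach", "feline"),
            (4, 2, "Wheezy", "canine"),
            (6, 4, "Pepper", "feline"),
        ],
    )
    return conn


def clients_with(conn, species):
    """Return (name_last, name_first, pet name) for every pet of one species."""
    query = """
        SELECT clients.name_last, clients.name_first, pets.name
        FROM clients JOIN pets ON clients.client_id = pets.client_id
        WHERE pets.species = ?
        ORDER BY pets.pet_id
    """
    # The ? placeholder lets the database driver handle quoting of the value.
    return conn.execute(query, (species,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in clients_with(demo, "feline"):
        print(row)
```

Each client's name is stored once in clients, yet it appears on three result rows for Linda Bosch: the join rebuilds the "one big spreadsheet" view on demand instead of storing it.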
mysql> SELECT client_id,name_last,name_first,phone,last_visit,balance FROM clients;
+-----------+------------+------------+--------------+------------+---------+
| client_id | name_last  | name_first | phone        | last_visit | balance |
+-----------+------------+------------+--------------+------------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | 2002-03-18 |     134 |
|         2 | Harford    | Cornelius  | 234-354-2987 | 2002-01-10 |       0 |
|         3 | Perkins    | Laura      | 815-823-9000 | 2000-05-01 |       0 |
|         4 | Gramme     | Barbara    | 898-555-2008 | 2002-02-05 |      23 |
|         5 | Granillo   | Richard    | 323-543-3328 | 2002-03-01 |      23 |
|         6 | Hambourger | Colleen    | 959-456-2345 | 2001-10-15 |       0 |
|         7 | Harrell    | Kenneth    | 324-888-5555 | 2001-03-08 |       0 |
+-----------+------------+------------+--------------+------------+---------+

mysql> SELECT pet_id,client_id,name,species,age,sex,breed FROM pets;
+--------+-----------+---------+---------+------+------+-----------------+
| pet_id | client_id | name    | species | age  | sex  | breed           |
+--------+-----------+---------+---------+------+------+-----------------+
|      1 |         1 | Merlin  | feline  |    3 | Mn   | DSH             |
|      2 |         1 | Azzy    | feline  |    4 | Fs   | DSH             |
|      3 |         1 | Brach   | feline  |    4 | Mn   | DSH             |
|      4 |         2 | Wheezy  | canine  |    7 | Mn   | Yorkie mix      |
|      5 |         3 | Foghorn | avian   |   12 | F    | Amazon - Mexica |
|      6 |         4 | Pepper  | feline  |   13 | Fs   | DLH             |
|      7 |         5 | Arlyn   | canine  |    5 | Fs   | Afghan hound    |
|      8 |         5 | Taylor  | canine  |    7 | Ms   | Afghan hound    |
|      9 |         6 | Creeper | reptile |    2 | M    | iguana          |
|     10 |         6 | Jasper  | canine  |   10 | Mn   | Dobie/Greyhound |
|     11 |         7 | Maxwell | canine  |    2 | Mn   | Dauschund       |
+--------+-----------+---------+---------+------+------+-----------------+

Indexing
• A database index is a data structure that improves the speed of data retrieval, but at a cost of slower changes to the data and increased storage space.
• Any column (field) of any table can be indexed.
• The index is just a list of which records (rows) have given values for that field.
• You make an index that matches a frequent query. For instance, “which pets are cats?” or “which pets belong to which client?”
• Note that you don’t need an index: you can search for anything in the database. But the index makes the search go faster.

Indexed on “species” column (species: pet_id values)
  feline: 1, 2, 3, 6
  canine: 4, 7, 8, 10, 11
  avian: 5
  reptile: 9

Indexed on “client_id” column (client_id: pet_id values)
  1: 1, 2, 3
  2: 4
  3: 5
  4: 6
  5: 7, 8
  6: 9, 10
  7: 11

Search Trees
• From the computer’s point of view, the simplest way to find a given record in a file is to start at the beginning of the file and read every record until it finds the right one.
  – This is obviously not very efficient for big files.
• It is much faster to keep the file sorted, and then search it in a binary fashion, as with finding a word in a dictionary.
• The basic concept is the binary tree. You find a given record by dividing the problem into halves repeatedly.
  – To find “tree” in the dictionary, you open it halfway, then determine your word is in the second half, then open that part at the halfway point, etc.
  – This is efficient: on average you look at about log2(N) records with a binary tree, and N/2 records with a linear search, where N is the number of records.
• Computers use a generalization of this concept: the B-tree, which uses multiple levels of index, and more than 2 child nodes for each parent node.

Computational Complexity
• Different programs use different amounts of time, computer memory, number of calculations, etc., which greatly affect your ability to get results.
  – Time, space, and number of computations are all interrelated.
• “Big O notation” is a way of quantifying this. The basic idea: how do the time, space, and number of computations scale with the number of objects being examined (n)?
  – O(1) = constant time. E.g. determining whether a number is even or odd.
  – O(log n) = logarithmic time. E.g. binary search
  – O(n) = linear time. E.g. finding something in an unsorted list
  – O(n log n). E.g. fast sorting algorithms
  – O(n^2) = quadratic time. E.g. simple sorting such as “bubble sort”
  – O(2^n) = exponential time. Very bad scaling! Travelling salesman problem, other NP-complete problems.
• Travelling salesman problem:
  – how to plan a route between many cities minimizing the total distance travelled
  – trying all possibilities: the number of routes grows at least exponentially with the number of cities
  – going to the closest remaining city each time doesn’t work.

End of Database Theory
• Lots more to database theory and practice: proper table design, search queries, indexing. Writing SQL statements.
• Mostly, users don’t need to worry about this. The user interface takes your input and writes the appropriate SQL, retrieves the information you want, and formats it for you. Indexes have been created to speed up common query types.
• Of course this means you don’t have access to the raw data and you can’t make custom queries. And it usually means you can’t get all the information in the database downloaded onto your own computer to play with as you see fit.
• However, many online databases allow downloading of large amounts of information by a process called FTP, which we will discuss in a bit.

Plain Text vs. Formatted Text
• A basic bioinformatics problem is that your data analysis will probably use several different applications. Also, you may want to do some direct manipulation or analysis of the data. If every application inputs and outputs different arrangements of the data, you will have a tough time getting them to work together.
• The common language for trading information between applications without having to deal with compatibility is the flat file written in plain text.
  – As opposed to formatted text or a binary file or a file written in a proprietary format.
• Plain text is the lowest common denominator: just the letters, numbers, and punctuation marks that are on the keyboard.
• Formatted text contains codes that affect the appearance of the text, but which don’t appear on the screen.
  – HTML is a good example of formatted text
  – Word processors (like MS Word) insert formatting codes that you can’t see. Often the codes are proprietary.
  – All word processors these days are WYSIWYG: “what you see is what you get”. What appears on the screen is what will print out, but there is more in the file than just the characters.

HTML Formatting
• The format codes, called tags, are enclosed in angle brackets: like <b>.
• The text to be formatted is between the opening tag and the closing tag (which is the same as the opening tag except that it has a slash: like </b>).
• Thus, all text between <b> and </b> is written in boldface.
• <h2> and <h4> are header tags: they start new sections of text.
• <p> is a paragraph tag.

<h2>Some Text</h2>
<h4>Formatted by HTML</h4>
<p>This is a paragraph with <b>Bold</b> or <i>italicized</i> words.</p>

A browser displays this as two headers and a paragraph, with “Bold” in boldface and “italicized” in italics:
Some Text
Formatted by HTML
This is a paragraph with Bold or italicized words.

In plain text this would appear as:
Some text
Formatted by HTML
This is a paragraph with Bold or italicized words.

Text Encoding
• Computers store and process information as binary bits: 0’s and 1’s. However, most information is actually processed in bytes, which are groups of 8 bits.
  – A byte has 2^8 = 256 possible states.
• Text files store each character as a single byte.
• ASCII (American Standard Code for Information Interchange) is a way of encoding the English alphabet, plus numbers and punctuation, plus some control characters (like TAB and EOT: “end of transmission”) and graphics elements (like ¬ ).
  – ASCII is really a 7-bit code, with 128 possibilities. The eighth bit was used for error-checking: the code was originally developed for data transmission.
• A file written in ASCII (or a variant) is a text file.
  – Text files are readable with Notepad
• Files of compiled computer code, or in some proprietary format, are called binary files.
  – Look like gibberish in Notepad.

Printout of an executable binary file.
^L ZY[å]Ê ÷Ãðÿu0øÑËÑËÑË ÑË© ´BÍ!r#ÇÖ¸) Å^W3Àë V´òô 2ä@Hð»)

ASCII Oddities
• ASCII separates capital and small letters, which is why sorts sometimes put all capital-letter file names before the small-letter names.
  – Windows file names are not case sensitive, but Unix file names are. Thus MyFile.txt and myfile.txt are the same name in Windows, but different in Unix.
• The most troublesome control character is “newline”: at the end of each line of text.
  – On a typewriter (or teletype) this requires both a “line feed” and a “carriage return”, which seems a bit wasteful.
  – Unix just wants a line feed.
  – Classic Macintosh systems (before Mac OS X) just want a carriage return; Mac OS X follows the Unix convention.
  – Windows wants both.
  – Sending files as “text” rather than “binary” or “default” sometimes cleans up this problem.
• ASCII codes are often written in octal (base 8) or hexadecimal (base 16). Hexadecimal uses ABCDEF to represent the single “digits” 10-15.

ASCII Extensions and Unicode
• ASCII was developed for the US, and many countries use entirely different alphabets and special symbols.
  – ASCII contains some “national symbols” that vary between countries.
    • US national symbols: $ # @ % & = ?
  – 8 bit (256 character) ASCII: ISO-8859-1 (Latin 1) uses the regular 128 ASCII characters as the first 128, then another set for the second 128. Used widely for internet applications
    • Things like Ù ę þ ¿
• Unicode covers most of the world’s writing systems. It currently has about 107,000 characters in 90 sets, plus rules for converting them, adding to them, etc.
  – You can even encode ancient Egyptian hieroglyphics!
• Unicode can be encoded as UTF-8 (1 to 4 bytes per character), UTF-16 (2 or 4 bytes per character), or UTF-32 (always 4 bytes per character).
  – The first 256 Unicode characters are the same as the 8 bit (ISO-8859-1) ASCII

Flat Files
• A flat file is written in plain text, in a standard defined format
• General text files written in plain text can be written by programs like Notepad (for Windows) or TextEdit (Mac) or vi (Unix). They are often given a .txt extension on their file names.
  – .txt helps Windows figure out what kind of file they are. Unix and Macs don’t use the extension for anything, but it helps the user remember what the file is.
  – The word processing software just renders the characters in whatever font it uses as a default.
• For database tables, each record is on a separate line, and the fields are separated by delimiters, usually tabs or commas.
  – Excel and other spreadsheet programs can easily open a flat file in this format, no matter which operating system is being used.
  – Spreadsheet programs can also save the files in tab-delimited (or comma-delimited) format
• DNA and protein sequences are written in FASTA format.

FASTA Format
• A FASTA file can contain one sequence, or several sequences (sometimes called multi-FASTA).
• Each sequence starts with a single comment line that starts with a >, followed by one or more lines of sequence.
• Comments usually start with a short unique identifier for the sequence, a space, and then (optionally) some other descriptive information.
• The sequence should contain only sequence characters: no punctuation, spaces, or numbers allowed.
• Either upper or lower case is acceptable.
• The sequence can be written all on one line (which can be quite long), or on multiple lines.
• Protein sequences use the single letter codes.
• Note that FASTA format is not rigidly defined. It is always worth checking sequence files to see exactly what format they use, and checking the documentation of any new program to see what format it accepts.
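The FASTA rules listed above are simple enough to turn into a short reader. What follows is a sketch under stated assumptions, not a reference parser: the function names parse_fasta and make_record and the toy sequences are hypothetical, blank lines are skipped, and the identifier is taken to be everything on the comment line up to the first space. Because the format is loosely defined, real files may need extra handling.

```python
def make_record(header, chunks):
    """Split a comment line into identifier and description, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each record in a FASTA text.

    `lines` is any iterable of strings, for example an open file.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                yield make_record(header, chunks)  # previous record is complete
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # one more line of the current sequence
    if header is not None:
        yield make_record(header, chunks)  # last record in the file


if __name__ == "__main__":
    toy = ">seq1 first toy sequence\nACGT\nacgt\n>seq2\nMKV\n"
    for identifier, description, sequence in parse_fasta(toy.splitlines()):
        print(identifier, len(sequence), description)
```

Writing the parser as a generator means a multi-FASTA file is read one record at a time, so only the current sequence has to be held in memory.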
FASTA Examples

>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNK
DYECIQPHTALTGWEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLT
SSLVTDTSVNLTWASTDNVGVTGYEVYRNGTLVANTSTTTAVVTGLTAGT
TYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDN
GSTNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQ
SQGKKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGS
SLSLNAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAAPETAYVQGG
YLNYGGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTAD
FHVAMAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAP
AELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACC
TCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTT
GTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTG
ATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGG
C

FTP
• File Transfer Protocol. A method for transferring files between computers.
• The standard way to download bulk data from someone else’s server.
  – More secure methods are generally used in private transactions: SCP and SFTP (both based on SSH), for example.
• This is mostly for anonymous FTP, where you don’t need a password and anyone can download data.
• Sometimes you get a compressed file, ending in .zip, .gz, .tar, .tgz or something similar. To convert this to a plain text flat file, you need to decompress (extract) it with a program like WinZip, Pkzip, or some such.
• Other times you get a plain text flat file, which you can open with any word processor or spreadsheet, or with whatever fancier software you like.
• NCBI FTP site: http://www.ncbi.nlm.nih.gov/
  – Go to Data & Software, then FTP:GenBank under Downloads, then click on the genbank hypertext link. This brings up an FTP directory; near the bottom are some smaller files. Click on them to download.
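Compressed downloads do not always have to be extracted by hand before use. As a minimal sketch, assuming a file compressed with gzip (the .gz case only; .zip and .tar archives need different tools), Python's standard gzip module can read the plain text directly. The file name demo.fasta.gz and both function names are hypothetical, and the demo file is written locally so that the example does not depend on any server.

```python
import gzip
import os
import tempfile


def write_demo_file(directory):
    """Write a tiny gzip-compressed FASTA file and return its path (demo data only)."""
    path = os.path.join(directory, "demo.fasta.gz")
    with gzip.open(path, "wt", encoding="ascii") as handle:
        handle.write(">seq1 toy\nACGT\n>seq2 toy\nGGCC\n")
    return path


def count_fasta_records(path):
    """Count the '>' comment lines in a gzip-compressed FASTA file.

    Mode "rt" makes gzip.open decompress on the fly and return text lines,
    so no uncompressed copy is written to disk.
    """
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(count_fasta_records(write_demo_file(tmp)))
```

Reading line by line keeps memory use small even when the uncompressed flat file would be very large.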
KEGG
• KEGG = Kyoto Encyclopedia of Genes and Genomes
  – http://www.genome.jp/kegg/
  – Also look at MetaCyc, another metabolic pathways database: http://metacyc.org/
• Primarily used for metabolic reaction pathways, which are manually curated from published materials
  – http://www.genome.jp/kegg/pathway.html
  – Also collections of genes, a hierarchical classification scheme
  – ATLAS: Overall metabolic picture for individual organisms: http://www.genome.jp/kegg/atlas/
• Currently 344 reference pathways in the database
• Bad problem: widespread use of 3-letter organism identifiers, not listed in alphabetical order. E.g. “bsu” = Bacillus subtilis.
  – http://www.genome.jp/kegg/catalog/org_list.html is their organism list, giving the codes.

KEGG Pathway database
• http://www.genome.jp/kegg/pathway.html
• Breakdown into major categories:
  – metabolism (the most important one),
  – genetic information processing (including protein folding and sorting),
  – environmental information processing (including membrane transport and intracellular signaling),
  – cellular processes,
  – plus some others
• Broken down into subcategories, e.g. carbohydrate metabolism, and then into individual pathways, e.g. glycolysis/gluconeogenesis
  – (http://www.genome.jp/kegg/pathway/map/map00010.html)

KEGG Map of Glycolysis
• Reference pathway: a summary of all relevant reactions and enzymes from all organisms
• Reaction intermediates are named
• Enzymes given by E.C. number (linked: discussed on next slide)
• Arrows for the pathways: note that some go both ways!
• Boxes showing neighboring pathways (linked to pathway maps)
• At the top:
  – “Pathway entry” link gives a description of the pathway
  – Dropdown pathway list for many organisms.
• Other reference pathways just vary what the links go to.
EC 5.3.1.1
• Connects glycerone-P to glyceraldehyde-3P in both directions
  – glycerone-P is also known as dihydroxyacetone phosphate
• Goes to an Orthology page, giving the enzyme name as triosephosphate isomerase.
  – I don’t find the orthology page too useful, so click on EC number again to get to Enzyme page
• Enzyme page lists synonyms for the reaction catalyzed by this enzyme, the pathways it appears in, and specific entries for the reaction and the substrates and products.
  – Also individual genes: search for “bsu” on the page. These links give you amino acid and nucleotide sequences for the genes.
  – The compound pages (substrates and products) give chemical structures and they list all reactions and pathways the compound is found in.

KEGG Pathway for an Individual Organism
• Homo sapiens (hsa).
  – The colored boxes mean that enzyme is found in this species; the links go to the individual gene pages
  – Uncolored boxes: not present in this organism, no link
• Sulfolobus acidocaldarius (an archaeon, near the bottom of the menu)
  – Pathway isn’t complete: very little from glucose to glyceraldehyde 3-phosphate.
    • Perhaps this organism doesn’t utilize glucose in the same way that most organisms do
    • Perhaps the missing enzymes are present, but too widely diverged to be recognizable by sequence homology to the better-known enzymes of this type

UniProt
• http://www.uniprot.org/
• UniProt is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics, and the Protein Information Resource (PIR).
  – The SIB used to put out “Swiss-Prot”, which was a curated database of protein sequences.
  – EBI used to put out TrEMBL, an uncurated database of nucleotide sequences translated into proteins.
  – PIR also had a protein database, PSD, along with a set of curated protein families.
  – They pooled their resources, reducing 3 websites to one.
• The main product is UniProtKB (UniProt Knowledge Base, i.e. a database).
  – Also a set of sequence clusters called UniRef
• Main tools: text search and BLAST search
  – There’s also a ClustalW multiple alignment tool

Text Search: Triose phosphate isomerase
• Search was done with AND connecting each word.
  – 2311 results
• Accession number: link to the gene itself
  – Entry name: I don’t know what the point is
• Status: yellow star = curated; gray star = uncurated. “Curated” means that a person has gone over the data and decided that it represents a real, properly annotated gene, while “uncurated” means that it is just a translation of a nucleotide sequence that may be a complete gene, a gene fragment, or even a region that never gets expressed at all.
  – The curated ones come out on top by default
  – You can show only curated or uncurated if you like
• Protein and gene names: lots of minor name variations, lots of abbreviations that vary from species to species, and it’s hard to know whether what you want is going to be the protein name or the gene name.

Sorting Search Results
• The results can be sorted by any column, by clicking on the column header.
  – Organism: look at Bacillus
    • Most are 251-253 AA long
    • Lots of short ones: but note that they are listed as “Fragments” in the protein name
    • Note several other proteins get in there: enolase, phosphoglycerate mutase
  – Could sort by protein name: see all the “putative” triose phosphate isomerases, plus “Tpi protein” and “Triosephosphatisomerase” (at very end)
  – Note the link for “Restrict term” to protein name: reduces results to 849 entries
• You can narrow the search with the “Fields” link on the top: try “Protein Existence”
  – AND takes precedence over OR, but you can use parentheses to alter this

Customizing the Display
• Click on this: there are several other pieces of information you can add or remove, and you can change the column order as well (with the Up and Down buttons)
  – You can also display more rows.
    I usually set this to 100, which is the maximum
  – Press “Save” to make it happen
  – Try “Protein existence” and “Features”
  – “Matched text” shows what in the entry matched your query, which clears up a lot of the mystery entries

Browsing by Taxonomy
• Try narrowing it down towards humans:
  – Eukaryota, Bilateria, Euteleostomi, Tetrapoda, Eutheria, Euarchontoglires, Catarrhini, Hominidae (done by a combination of following big numbers, vague knowledge of taxonomy, and trial-and-error)
• The “Search in” box has a Taxonomy database that goes through these terms, but I would open it in a separate tab. (Euteleostomi)
• Or to Bacillus:
  – Bacteria, Firmicutes, Bacillales, Bacillaceae
• Browsing by keyword can be quite useful as well
  – Other things: gene ontology, enzyme class (EC number), pathway. These may have specialized uses.

UniRef
• Clusters of similar sequences
  – UniRef100 = identical sequences
  – UniRef90 = sequences 90% or more identical
  – UniRef50 = sequences 50% or more identical
• Each cluster is represented by a single sequence, with the number of sequences in the cluster listed
• The idea is, by displaying UniRef clusters, you eliminate redundancy
  – E.g. for “triose phosphate isomerase” there are 849 entries. UniRef100: 679 entries; UniRef90: 440 entries; UniRef50: 115 entries

Individual Accession Entries
• Let’s look at P60174, TPI from humans
• Recommended and alternate names (will this be consistent across databases? I doubt it)
• Info on protein processing, subunit structure, isoforms (alternative splicings), natural variants, features of the protein sequence
• Has the amino acid sequence: note the FASTA link, to get a nice FASTA-formatted version
• Literature references
• Other databases (even Wikipedia!)

The SEED Viewer
• http://pubseed.theseed.org/seedviewer.cgi
• A whole genome analysis site dedicated to prokaryotes
  – For eukaryotes, most species have their own individual web site.
• You don’t need a login ID to use the site; you need one only if you want to use a private genome.
  – Registering is quite easy. This site is not meant to be exclusive.
• The data here is organized by two guiding principles:
  1. Subsystems: metabolic pathways and structures (e.g. glycolysis, ribosome)
  2. Localization: genes involved in the same subsystem tend to stay near each other on the chromosome across species lines (partly due to operons)

Organism Overview
• To start, select a species: Bacillus subtilis (for which there is only 1 strain)
• You get to the Organism Overview page.
• Basic info: strain name, genome size, number of genes, etc.
• Pie chart of subsystems. You can drill down and get to specific subsystems this way, or use the “Features in Subsystems” table, but I rarely do. The Genome Browser works better for me.
• This is the central page for this organism. Links to various things:
  – Genome Browser (aka Feature Table): for individual genes
  – Subsystems (under Navigate or Organism)
  – Comparative Tools: Function-based, Sequence-based, KEGG, BLAST

Genome Browser
• Has a nice graphic of the chromosome, showing all 6 reading frames, with the gene of interest in red.
  – Clicking on a gene gives summary info about it, and you can get to its page: “Details page”.
• Table is very useful, because you can sort and limit it
• Let’s look for “triose phosphate isomerase” as a function. Nothing. Try just “isomerase”. Get 29, and halfway down the list is “triosephosphate isomerase”. Subsystem is “Glycolysis and Gluconeogenesis”
• ID = fig|224308.1.peg.3398.
  – The “fig|” just means it is in the SEED/FIG/RAST database
  – 224308.1 is the organism ID
  – “peg” stands for “protein encoding gene”.
There are also RNA genes (Type column)
  – 3398 is the actual gene number
  – Click on it to go to that gene’s page

Annotation Details Page
• Basic links from here:
  – Feature evidence
  – Sequence (can get DNA or amino acid)
  – Subsystems this gene is in
  – Compare Regions: view of the chromosome region in related species
  – Run tool: mostly for predicting transmembrane regions and cellular location
    • Try TMHMM or CELLO
• Compare Regions.
  – Your gene is #1, in red, transcribed left to right
  – Genes below the line overlap their neighbors
  – Other genes in color match genes seen in the window in other species
  – Gray genes don’t match anything else on the display
  – You can change the number of species seen, which changes the gene colors
  – Operons are adjacent genes all transcribed in the same direction, usually with related functions.
  – Mouse over to get gene summary information
  – Clicking on another gene refocuses the display on it, and brings you to its page.

Feature Evidence
• Mostly the graphic display of closely related genes.
  – Red is most similar, green is least.
  – Length of bars indicates relative gene length, with the white bar inside showing the matching region
  – You can select various genes and then download all sequences as a single FASTA-formatted file. Note the “Include query” check box.
  – You can also do a multiple alignment on the genes with the “Align Selected” button.
• There is also a table giving all this information

Subsystems
• You can click on the individual subsystem for this gene, or you can go through the Subsystems page.
  – Take a look at Subsystems page briefly
• Diagram: a map similar to a KEGG map, but labeled differently
  – Some mouse hovering works here
  – Note alternatives found in different organisms
• Additional Notes often give useful information and references about this subsystem
• Functional Roles: list of all gene functions needed for this subsystem and its variants.
  – Note that sometimes one gene does more than one role, and other times more than one gene fulfills the same role (i.e. duplicate genes or enzymes with multiple subunits).
  – Links are to KEGG pages
• Subsystem spreadsheet shows which roles are present in different organisms (sortable and limitable)
  – Coloring by cluster on the chromosome
  – Links to the genes
  – Variant codes (explained in Additional Notes)

Comparative Tools
• Go back to the main Organism page or to an individual gene page
• KEGG map. Starts out very high level, but clicking on the map drills down to the regular KEGG map.
  – Comparison between species here isn’t too enlightening
• Function-based comparison: comparing subsystems between your species and another one
  – See functions present in one or the other species, or both
  – Try Anabaena variabilis and limit it to the Glycolysis subsystem
    • See many functions in both the A and B genomes
    • A few things in one or the other only
    • “Find” button does a search for that gene in the species where it is missing

More Comparative Tools
• Sequence-based comparison.
  – Need to select both a reference organism and one or more comparison organisms. Try B. subtilis and B. cereus ZK. It takes a while!
  – Get a circular map of the genome showing how related genes are between the species, and a table showing this.
    • The map is most interesting for gaps: areas where the two genomes have nothing in common.
  – Also a BLAST dotplot showing how different areas are related: you see regions with many syntenic genes, regions that are inverted, regions with no synteny.
• BLAST search tool. Paste in FASTA sequences and get BLAST against your genome, with links to the gene pages

Protein Data Bank
• http://www.rcsb.org/pdb/home/home.do
• Experimentally-determined three-dimensional protein structures
  – Good expandable help menus
• Try a search for triose phosphate isomerase
  – Nice general information under “Web Page hits”
  – Structures (45 of them). Click on one.
    • You can examine the 3-D structure
    • Sequence Details lays out secondary structure along the length of the sequence
    • Biology and Chemistry has useful details about the protein

NCBI
• http://www.ncbi.nlm.nih.gov/
• National Center for Biotechnology Information, which is part of the National Library of Medicine, which is in turn part of the National Institutes of Health, which is in turn part of the US Department of Health and Human Services.
• The main bioinformatics database in the US
  – PubMed, citations and abstracts for biomedical articles
  – GenBank, primary repository for DNA sequences
  – dbEST: Expressed Sequence Tag database
  – Genome: whole genome sequences, annotations, links to projects
  – Structure: 3-dimensional protein domains
  – GEO: gene expression data of several types
  – OMIM: Online Mendelian Inheritance in Man.
• also several important tools:
  – BLAST sequence alignment tool
  – CDD: Conserved domains database (and search tool)
• And more, especially tools and databases for human genetics
• “Entrez” (pronounced “on-tray”) is the overall collection of databases at NCBI, which can be searched by a common mechanism. The “All databases” link.

PubMed
• this is the primary search engine many of us use to find important articles
• contains abstracts of articles in biology and medicine back to 1948 in many journals, often with links to free full text articles, especially through PubMed Central
• try triose phosphate isomerase Arabidopsis (or Bacillus)
• get a list of articles, with a separate tab for review articles
  – click on one: get the abstract. 90% of the time, this is all you need
  – plus (often) a link to the full text article
    • sometimes these require a fee, like $30. Rarely worth it: try NIU interlibrary loan instead
    • often the articles are free, especially if you are coming to the site through NIU (also check out the link for “off Campus Authentication” at the NIU library site http://www.ulib.niu.edu/ ).
  – Also, a set of Related articles
  – also links to individual authors and journals
• You want to find a reference for BLAST, and you know the first author is Altschul
  – get 329 hits, sorted with most recent first.
  – Perusing the first page, Altschul’s initials are probably SF
  – Search for SF Altschul gives 46 articles.
  – Sort by first author: not far down is the 1990 article “Basic local alignment search tool”. A quick glance at the abstract confirms it. Unfortunately it was published in a journal that doesn’t even allow access to 20 year old articles!

Database Search
• Search all Databases brings up a list of all the Entrez databases and how many hits in each.
  – More usually, you select the database first: Protein is a good choice, Nucleotide for RNA-only genes, Genome if you want to see the gene in context with neighboring genes.
• Limiting the search: use the Preview/Index tab.
  – Add search terms at the bottom: select the field (Organism, Protein name, text word, etc.), type in the word you want, hit AND, OR, or NOT. Then hit Preview
  – Results come up in Most Recent Queries. Click on the Results number to get to the index of entries
  – Erase unwanted terms in the text box at the top

GenBank
• Storage of all nucleotide sequences
  – sharing all info with DDBJ (DNA DataBank of Japan) and EMBL (European Molecular Biology Laboratory)
  – December 2008: 99,116,431,942 bases, from 98,868,465 reported sequences
  – new release that you can download every 2 months: current release is 169.0
• The main storage area is “Nucleotide”, with EST, GSS, and a few others also involved
  – Search for triose phosphate isomerase brings up many whole genomes: try adding “Bacillus” to the search, and sorting by taxID
  – Look at B. megaterium ID M87647. A region of the chromosome with several relevant genes.
  – List of features and their coordinates, translations of all genes, nucleotide sequence at the bottom
    • “CDS” = coding sequence, the part of the gene that can be translated into protein
  – Clicking on CDS, gene, etc. gives all the info at the top, plus just the specific sequence you clicked on.
  – Clicking on protein gives you a different ID, and more info about that protein. This is part of the Protein database.

GenBank Records
• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_001087007.1
• A column of keywords and a column of values.
• Basic info: gene name, accession number, taxonomy, references to literature.
• List of features found in the sequence
  – The sequence itself is at the bottom, labeled ORIGIN
  – The numbers refer to this sequence, sometimes with gaps and/or reverse-complement (feature on the opposite strand)
  – “Gene” refers to the entire transcribed sequence
  – “CDS” is just the protein-coding portion of the sequence
  – “misc_feature” can be a wide variety of things

dbEST
  – for sequences (usually partial and single-pass) of messenger RNA (i.e., cDNA derived from mRNA)
  – Name search works, but BLAST is probably better, since not all ESTs have been named
    • Note that you can search non-human, non-mouse ESTs separately
  – Get data on exons and on where/when it is expressed

Genome
• whole genome sequences, annotations, links to projects
  – Whole genome sequence
  – List of genes (links to sequences). Protein-coding and RNA
  – GenBank records for individual genes, proteins, or the whole genome
  – Genome (map) viewer
  – BLAST search for individual species
  – FTP download of complete genome, genes, GenBank and other descriptions
• Pick a group (Bacteria). Big table. Try Bacillus anthracis str. Ames
  – Note some genomes have plasmids, and eukaryotes have mitochondria, etc.
  – You can sort by taxonomy (with a helpful tree diagram)
  – Name is link to the Taxonomy Browser, general info about the species, with links to sequences of various types
  – Accession number is link to info about the sequencing project results
  – Numbers for proteins and RNAs link to a table of those genes
  – Number for genes links to a list of GenBank records for each

More Genome
• Click on Accession number. Goes to Overview page for this genome
  – Link to GenBank record for the complete genome.
    • By default, the sequences aren’t shown: uncheck the little boxes at the top to get them
  – link to BLAST search for this organism
  – link to FTP site for downloads: the .fna file is the full length genomic sequence as a FASTA file.
  – links to BLAST Homologs: a collection of potentially interesting forms of analysis. I rather like Gene Plot, which compares all genes in 2 genomes. Try doing it against Bacillus halodurans.
  – Genome Project link has information about the specific sequencing project.
  – A map viewer to see the context of a gene. The Genes link goes to a list of all genes, with the map viewer focused on it.

Structure
• 3-D protein structures
  – Need RasMol or Cn3D to view them properly
• Also shows gene domain structures

GEO
• http://www.ncbi.nlm.nih.gov/geo/
• Gene Expression Omnibus
• Various kinds of gene expression data: microarrays, RT-PCR, SAGE, mass spectrometry, in situ hybridizations, etc.
• Text search (query) and browser
• Description of experiment and the platform used to run it on
• Links to datasets you can download and analyze yourself
• Series vs. DataSet. An experimenter submits data as a series: the results of several experimental treatments and controls. The NCBI staff then curates this into a DataSet, which is then analyzed by the GEO tools
• Various tools: expression profiles (comparing expression levels under different conditions), clustering, etc.
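
A note on the FASTA files mentioned under More Genome (the genomic sequence on the FTP site): each record is a header line starting with ">" followed by one or more lines of sequence. The Python sketch below reads that layout. The function name read_fasta and the two example records are made up for illustration; for real work a library such as Biopython is the more usual choice.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file, held in memory instead of on disk
example = StringIO(">seq1 made-up example\nATGGCT\nAAATAA\n>seq2\nGGGCCC\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

With a downloaded file you would pass open("genome.fna") (a hypothetical file name) instead of the StringIO object.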
OMIM
• Online Mendelian Inheritance in Man
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
• Comprehensive entries for all human genetic disorders.
• Originally started by Dr. Victor A. McKusick in the 1960’s, as a book
• covers about 12,000 genes today
  – genes and their diseases often have separate entries
• essentially a comprehensive literature review divided into several sections
  – Clinical features, inheritance, cytogenetics, mapping, molecular genetics, population heterogeneity, animal models, gene structure, genotype/phenotype relations, etc.
• Also a list of known human variant alleles for each gene
• try searching for triose phosphate isomerase: there is an entry for the gene and a deficiency syndrome
• look at “cystic fibrosis”: both the disease and the gene (CFTR)

Taxonomy
• http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Root
• A simple display of organisms and the taxonomy leading to them, for all organisms that have at least one GenBank entry
• Note that they have a disclaimer: the taxonomies listed here are not authoritative. They try to keep up with the taxonomic literature, but that is an active field with a lot of changes occurring.
• You can search for a name, e.g. “Oryza”, which is the genus of rice.
  – You can also misspell it and try the “phonetic search” option: try “Oriza”
• You can also just click on the tree and drill down
  – clicking on a taxon will sometimes expand the tree from that point and other times give you information about the taxon
• Taxon information:
  – the full lineage,
  – synonyms and other naming information
  – which genetic code it uses (there are several variants: look at the Genetic codes link on the left of the Taxonomy main page),
  – links to various parts of GenBank
  – unfortunately, very little descriptive info: try Google and Wikipedia

CDD
• Conserved Domains Database
• A way of searching for protein function that works on multiple sequence alignments

BLAST
• The main sequence alignment tool in use today
• The two main ways to access GenBank and many other sequence databases are text search (searching annotation) and BLAST (searching the sequences).
• We will cover this in much more detail soon.
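
The genetic code variants mentioned under Taxon information can be illustrated with a short Python sketch. It builds the standard code (NCBI translation table 1) and the vertebrate mitochondrial code (table 2) from their 64-letter amino acid strings, then translates the same made-up DNA fragment under both. The names STANDARD, VERT_MITO and translate are illustrative assumptions, not part of any NCBI tool.

```python
from itertools import product

BASES = "TCAG"
# Codons in the conventional order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# One amino acid letter per codon; "*" marks a stop codon
STANDARD = dict(zip(
    CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))
VERT_MITO = dict(zip(
    CODONS,
    "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"))


def translate(dna, table):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


fragment = "ATGTGATAA"  # made-up sequence: ATG, TGA, TAA
print(translate(fragment, STANDARD))   # M  (TGA is a stop codon)
print(translate(fragment, VERT_MITO))  # MW (TGA codes for Trp)
```

The two tables differ at only four codons: TGA (stop vs. Trp), ATA (Ile vs. Met), and AGA/AGG (Arg vs. stop). That is enough to change where a translated protein ends, which is why the Taxonomy entry records the code for each organism.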