Introduction to Bioinformatics Databases

The Problem
• Current size of GenBank (June 2011): 129,178,292,958 bp (1.3 x
10¹¹, or about 129 giga-base pairs) in 140,482,268 entries. This doesn’t
include unprocessed genomic sequences, which would double the
size.
– This would be hard to deal with if it were all written on pieces of paper
stored in file cabinets.
– In 1982, GenBank contained 690,338 bp in 606 entries, which fit on two
5 ¼ inch floppy disks (360 kB capacity each) that GenBank mailed to you.
• At its simplest, a database is a way to store information and retrieve
it efficiently.
– However, nearly all databases add value to the information by processing it in
different ways.
• We want to introduce some ideas about how databases work, and
what kinds of databases are available.
Database Background
• The general concepts of storing and retrieving data go back to
the very beginnings of writing.
• How to record data in a uniform fashion, and how to file it where
you can find it again later. Concepts like:
– forms,
– alphabetical order,
– serial numbers,
– filing cabinets with separate drawers and folders within drawers
– use of machines to accurately tabulate information (typewriter,
adding machine, etc.)
• Computers allowed even larger amounts of
information to be stored
– Computers are like being blind: they deal with
information one small chunk at a time, and they can’t
see what’s coming next.
[Figure: Sumerian land purchase records from about 2400 BCE]
Codd’s Normal Forms
• In 1970, IBM researcher E. F. Codd published the seminal paper “A
relational model of data for large shared data banks”, which specified the
basic design principles still used today for designing databases.
– “Relational” databases, that is. There are other ways of doing databases, not as
widely used.
• Codd’s fundamental insight was to put data into multiple tables connected
together by unique keys. This is opposed to the typical spreadsheet idea of
having all the data together in one big table.
• Making a database adhere to Codd’s principles is called “normalization”,
and the principles themselves are called “normal forms”.
– At present there are 6-8 normal forms, depending on how you count: defining
them is part of theoretical database research (“relational algebra”).
• Databases that conform to the normal forms are:
– Easy and fast to search, and give the correct unique results
– Easy to update: each piece of data is stored in a single location
– Easy to extend to new types of data
Basic Structure
• Databases are composed of tables of data.
– Tables hold logically related sets of data. A table is essentially the same thing as
a spreadsheet: a set of rows and columns.
• Each table has several records:
– A record stores all the information for a given individual.
– Records are the rows of a data table.
• Each record has several fields:
– A field is an individual piece of data, a single attribute of the record.
– Fields are the columns of a data table.
• Each record (row) has a unique identifier, the primary key.
– The primary key serves to identify the data stored in this record across all the
tables in the database.
• Databases are manipulated with a language called SQL (Structured Query
Language). It’s a “baby English” type of language: it uses real words, but is
rigid in terms of their order and placement.
• Various database software: Oracle, MS Access, MySQL, etc.
Tables in the GL database
• Example: a pet grooming business called Grooming Lodge.
– thanks to L. Jennifer Bosch, who created this example
• I am using database software called MySQL. It is running on one of our
departmental servers, using a “command line” interface. The computer
shows me a prompt: mysql> and I type in a command: show tables; I
then hit Enter, and the data prints onto my monitor.
• This database uses 3 tables:
1. Charges: a list of each bill sent to a customer
2. Clients: contact information for each customer
3. Pets: a list of each individual pet seen at Grooming Lodge.
mysql> show tables;
+--------------+
| Tables_in_GL |
+--------------+
| charges      |
| clients      |
| pets         |
+--------------+
GL Data Table
• Here is part of the “clients” table. Each record (row) is a client, and each client has
several attributes or fields (columns).
• Note that each client has a unique identifier, the client_id. This is a very important
aspect of a good database table: each record represents one unique individual, with
no duplicates and none left out.
– Names themselves are often not unique, so an identifier number is used.
– The unique ID for each record in the table is the primary key.
mysql> select client_id,name_last,name_first,phone,last_visit,balance from clients;
+-----------+------------+------------+--------------+------------+---------+
| client_id | name_last  | name_first | phone        | last_visit | balance |
+-----------+------------+------------+--------------+------------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | 2011-03-18 |     134 |
|         2 | Harford    | Cornelius  | 234-354-2987 | 2011-01-10 |       0 |
|         3 | Perkins    | Laura      | 815-823-9000 | 2010-05-01 |       0 |
|         4 | Gramme     | Barbara    | 898-555-2008 | 2011-02-05 |      23 |
|         5 | Granillo   | Richard    | 323-543-3328 | 2011-03-01 |      23 |
|         6 | Hambourger | Colleen    | 959-456-2345 | 2010-10-15 |       0 |
|         7 | Harrell    | Kenneth    | 324-888-5555 | 2010-03-08 |       0 |
+-----------+------------+------------+--------------+------------+---------+
Another Table
• On the “pets” table, each pet gets its own record and ID: pet_id is the primary key
on this table.
• Pets are associated with their owner through the client_id, which is the same as
the client_id primary key in the “clients” table.
• Note that all the data for each pet and for each client is only entered once. All
connections between the tables go through the primary keys.
• The “charges” table has these fields: client_id, pet_id, job_code, and fee.
mysql> select pet_id,client_id,name,species,age,sex,breed from pets;
+--------+-----------+---------+---------+-----+-----+-----------------+
| pet_id | client_id | name    | species | age | sex | breed           |
+--------+-----------+---------+---------+-----+-----+-----------------+
|      1 |         1 | Merlin  | feline  |   3 | Mn  | DSH             |
|      2 |         1 | Azzy    | feline  |   4 | Fs  | DSH             |
|      3 |         1 | Brach   | feline  |   4 | Mn  | DSH             |
|      4 |         2 | Wheezy  | canine  |   7 | Mn  | Yorkie mix      |
|      5 |         3 | Foghorn | avian   |  12 | F   | Amazon - Mexica |
|      6 |         4 | Pepper  | feline  |  13 | Fs  | DLH             |
|      7 |         5 | Arlyn   | canine  |   5 | Fs  | Afghan hound    |
|      8 |         5 | Taylor  | canine  |   7 | Ms  | Afghan hound    |
|      9 |         6 | Creeper | reptile |   2 | M   | iguana          |
|     10 |         6 | Jasper  | canine  |  10 | Mn  | Dobie/Greyhound |
|     11 |         7 | Maxwell | canine  |   2 | Mn  | Dauschund       |
+--------+-----------+---------+---------+-----+-----+-----------------+
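As a rough illustration of how primary keys tie the two tables together, here is a minimal sketch in Python. It assumes SQLite (via the standard sqlite3 module) as a stand-in for the MySQL server used in the slides, and it declares only a few of the columns; the column types are guesses, not the actual GL schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (an assumption).
conn = sqlite3.connect(":memory:")

# Each table declares its own primary key; pets.client_id points at clients.
conn.executescript("""
CREATE TABLE clients (
    client_id  INTEGER PRIMARY KEY,
    name_last  TEXT,
    name_first TEXT
);
CREATE TABLE pets (
    pet_id    INTEGER PRIMARY KEY,
    client_id INTEGER REFERENCES clients(client_id),
    name      TEXT,
    species   TEXT
);
""")

conn.execute("INSERT INTO clients VALUES (1, 'Bosch', 'Linda')")
conn.execute("INSERT INTO pets VALUES (1, 1, 'Merlin', 'feline')")

# A second record with an already-used primary key is rejected.
try:
    conn.execute("INSERT INTO clients VALUES (1, 'Harford', 'Cornelius')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

client_count = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
print(duplicate_rejected, client_count)
```

The database itself enforces "no duplicates": the second insert fails, so the client data still exists in exactly one place.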
First Normal Form
• As an example of how database tables are made, here are Codd’s original
concepts, the first normal form. He also came up with second and third
normal forms, in later publications.
• 1NF is a way of putting data tables into a regular order, so they are easy to
process.
• The first normal form (1NF):
– The order of rows and columns is irrelevant (so you can’t have a row that refers
to the “previous” row, etc.).
• This allows random access to the data, rather than needing to read it from top to
bottom.
– No duplicate rows (this way, the data are stored only a single time)
– No repeating columns
– Each cell (row-column intersection) must contain a single item of data.
Grooming Lodge Data and 1NF
• What would the data look like as a spreadsheet?
– One logical approach would be to list each client on a separate line along with
their contact information. Since some people have more than one pet, this would
lead to multiple data items in a single cell, or multiple identical columns. Hard to
update accurately or search quickly.
– Another approach would be to list each pet on a separate line along with its
owner. This would lead to multiple copies of owner data, which will be difficult to
update.
– Another approach would be to list each owner a single time, with multiple pets
referring to that line. This requires that the data be read in a specific order, top
to bottom.
• By using 2 tables, one for clients and one for pets, all the data is in 1NF and
is easy to search and update.
+-----------+------------+------------+--------------+------------------+
| client_id | name_last  | name_first | phone        | pets             |
+-----------+------------+------------+--------------+------------------+
|         1 | Bosch      | Linda      | 123-234-3456 | Merlin (feline), |
|           |            |            |              | Azzy (feline),   |
|           |            |            |              | Brach (feline)   |
|         2 | Harford    | Cornelius  | 234-354-2987 | Wheezy (canine)  |
|         3 | Perkins    | Laura      | 815-823-9000 | Foghorn (avian)  |
|         5 | Granillo   | Richard    | 323-543-3328 | Arlyn (canine),  |
|           |            |            |              | Taylor (canine)  |
+-----------+------------+------------+--------------+------------------+
Multiple data items in a single cell. Very difficult to search efficiently.
+-----------+------------+------------+--------------+---------+---------+
| client_id | name_last  | name_first | phone        | pet     | species |
+-----------+------------+------------+--------------+---------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | Merlin  | feline  |
|         1 | Bosch      | Linda      | 123-234-3456 | Azzy    | feline  |
|         1 | Bosch      | Linda      | 123-234-3456 | Brach   | feline  |
|         2 | Harford    | Cornelius  | 234-354-2987 | Wheezy  | canine  |
|         3 | Perkins    | Laura      | 815-823-9000 | Foghorn | avian   |
|         5 | Granillo   | Richard    | 323-543-3328 | Arlyn   | canine  |
|         5 | Granillo   | Richard    | 323-543-3328 | Taylor  | canine  |
+-----------+------------+------------+--------------+---------+---------+
Each pet gets its own line, resulting in multiple rows for a single client.
+----+-----------+------------+--------------+-----------------+-----------------+-----------------+
| id | name_last | name_first | phone        | pet             | pet             | pet             |
+----+-----------+------------+--------------+-----------------+-----------------+-----------------+
|  1 | Bosch     | Linda      | 123-234-3456 | Merlin (feline) | Azzy (feline)   | Brach (feline)  |
|  2 | Harford   | Cornelius  | 234-354-2987 | Wheezy (canine) |                 |                 |
|  3 | Perkins   | Laura      | 815-823-9000 | Foghorn (avian) |                 |                 |
|  5 | Granillo  | Richard    | 323-543-3328 | Arlyn (canine)  | Taylor (canine) |                 |
+----+-----------+------------+--------------+-----------------+-----------------+-----------------+
Multiple columns for pets. Frequent need to expand the table, lots of columns with no data.
+-----------+------------+------------+--------------+---------+---------+
| client_id | name_last  | name_first | phone        | pet     | species |
+-----------+------------+------------+--------------+---------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | Merlin  | feline  |
|     see 1 |            |            |              | Azzy    | feline  |
|     see 1 |            |            |              | Brach   | feline  |
|         2 | Harford    | Cornelius  | 234-354-2987 | Wheezy  | canine  |
|         3 | Perkins    | Laura      | 815-823-9000 | Foghorn | avian   |
|         5 | Granillo   | Richard    | 323-543-3328 | Arlyn   | canine  |
|     see 5 |            |            |              | Taylor  | canine  |
+-----------+------------+------------+--------------+---------+---------+
Rows refer to each other, requiring that the data be read from top to bottom.
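To make the move to 1NF concrete, here is a small Python sketch that splits the "multiple items in a single cell" layout into two tables. The pet_id values are simply assigned in order of appearance, which is an assumption for illustration and not how the GL database necessarily numbered them.

```python
# Non-1NF rows: the last cell holds several data items at once.
flat_rows = [
    (1, "Bosch", "Linda", "Merlin (feline), Azzy (feline), Brach (feline)"),
    (2, "Harford", "Cornelius", "Wheezy (canine)"),
]

clients = []  # one row per client
pets = []     # one row per pet, linked to its owner by client_id

for client_id, name_last, name_first, pet_cell in flat_rows:
    clients.append((client_id, name_last, name_first))
    for item in pet_cell.split(", "):
        # "Merlin (feline)" -> name "Merlin", species "feline"
        name, species = item.rstrip(")").split(" (")
        pets.append((len(pets) + 1, client_id, name, species))

for row in pets:
    print(row)
```

After the split every cell holds a single item, no owner data is repeated, and the row order no longer matters.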
Searching
• Retrieving information is a matter of specifying what you want:
– which fields,
– which tables,
– which records you want, based on the values of given fields
(i.e. the conditions for selection).
mysql> SELECT name, pet_id FROM pets WHERE species='feline';
+--------+--------+
| name   | pet_id |
+--------+--------+
| Merlin |      1 |
| Azzy   |      2 |
| Brach  |      3 |
| Pepper |      6 |
+--------+--------+
“which pets are cats?”
mysql> SELECT name_last, name_first, phone, balance FROM clients WHERE balance>0;
+-----------+------------+--------------+---------+
| name_last | name_first | phone        | balance |
+-----------+------------+--------------+---------+
| Bosch     | Linda      | 123-234-3456 |     134 |
| Gramme    | Barbara    | 898-555-2008 |      23 |
| Granillo  | Richard    | 323-543-3328 |      23 |
+-----------+------------+--------------+---------+
“who owes money?”
mysql> SELECT name_last, name_first, pets.name
FROM clients, pets
WHERE clients.client_id=pets.client_id
AND species='feline';
+-----------+------------+--------+
| name_last | name_first | name   |
+-----------+------------+--------+
| Bosch     | Linda      | Merlin |
| Bosch     | Linda      | Azzy   |
| Bosch     | Linda      | Brach  |
| Gramme    | Barbara    | Pepper |
+-----------+------------+--------+
“which clients have cats?” (this question uses
data from two different tables, connected by the
client_id).
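The two-table query can be tried without a MySQL server. The sketch below assumes SQLite through Python's sqlite3 module, loads only a handful of the rows shown in the slides, and writes the join with the explicit JOIN ... ON syntax, which is equivalent to listing both tables in FROM with a matching condition in WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE clients (client_id INTEGER PRIMARY KEY, "
             "name_last TEXT, name_first TEXT)")
conn.execute("CREATE TABLE pets (pet_id INTEGER PRIMARY KEY, "
             "client_id INTEGER, name TEXT, species TEXT)")

# A subset of the Grooming Lodge rows.
conn.executemany("INSERT INTO clients VALUES (?, ?, ?)", [
    (1, "Bosch", "Linda"), (2, "Harford", "Cornelius"), (4, "Gramme", "Barbara"),
])
conn.executemany("INSERT INTO pets VALUES (?, ?, ?, ?)", [
    (1, 1, "Merlin", "feline"), (2, 1, "Azzy", "feline"),
    (3, 1, "Brach", "feline"), (4, 2, "Wheezy", "canine"),
    (6, 4, "Pepper", "feline"),
])

# "Which clients have cats?": rows from both tables matched on client_id.
query = """
SELECT name_last, name_first, pets.name
FROM clients JOIN pets ON clients.client_id = pets.client_id
WHERE species = ?
ORDER BY pets.pet_id
"""
cat_owners = conn.execute(query, ("feline",)).fetchall()
for row in cat_owners:
    print(row)
```

Passing 'feline' as a parameter (the ? placeholder) rather than pasting it into the SQL string is the usual habit when the value comes from user input.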
mysql> SELECT client_id,name_last,name_first,phone,last_visit,balance FROM clients;
+-----------+------------+------------+--------------+------------+---------+
| client_id | name_last  | name_first | phone        | last_visit | balance |
+-----------+------------+------------+--------------+------------+---------+
|         1 | Bosch      | Linda      | 123-234-3456 | 2002-03-18 |     134 |
|         2 | Harford    | Cornelius  | 234-354-2987 | 2002-01-10 |       0 |
|         3 | Perkins    | Laura      | 815-823-9000 | 2000-05-01 |       0 |
|         4 | Gramme     | Barbara    | 898-555-2008 | 2002-02-05 |      23 |
|         5 | Granillo   | Richard    | 323-543-3328 | 2002-03-01 |      23 |
|         6 | Hambourger | Colleen    | 959-456-2345 | 2001-10-15 |       0 |
|         7 | Harrell    | Kenneth    | 324-888-5555 | 2001-03-08 |       0 |
+-----------+------------+------------+--------------+------------+---------+
mysql> SELECT pet_id,client_id,name,species,age,sex,breed FROM pets;
+--------+-----------+---------+---------+-----+-----+-----------------+
| pet_id | client_id | name    | species | age | sex | breed           |
+--------+-----------+---------+---------+-----+-----+-----------------+
|      1 |         1 | Merlin  | feline  |   3 | Mn  | DSH             |
|      2 |         1 | Azzy    | feline  |   4 | Fs  | DSH             |
|      3 |         1 | Brach   | feline  |   4 | Mn  | DSH             |
|      4 |         2 | Wheezy  | canine  |   7 | Mn  | Yorkie mix      |
|      5 |         3 | Foghorn | avian   |  12 | F   | Amazon - Mexica |
|      6 |         4 | Pepper  | feline  |  13 | Fs  | DLH             |
|      7 |         5 | Arlyn   | canine  |   5 | Fs  | Afghan hound    |
|      8 |         5 | Taylor  | canine  |   7 | Ms  | Afghan hound    |
|      9 |         6 | Creeper | reptile |   2 | M   | iguana          |
|     10 |         6 | Jasper  | canine  |  10 | Mn  | Dobie/Greyhound |
|     11 |         7 | Maxwell | canine  |   2 | Mn  | Dauschund       |
+--------+-----------+---------+---------+-----+-----+-----------------+
Indexing
• A database index is a data structure that improves the speed of data
retrieval, but at a cost of slower changes to the data and increased
storage space.
• Any column (field) of any table can be indexed. The index is just a list of
which records (rows) have given values for that field.
• You make an index that matches a frequent query. For instance, “which
pets are cats?” or “which pets belong to which client?”
• Note that you don’t need an index: you can search for anything in the
database. But the index makes the search go faster.
Indexed on “species” column
feline: 1, 2, 3, 6
canine: 4, 7, 8, 10, 11
avian: 5
reptile: 9
Indexed on “client_id” column
1: 1, 2, 3
2: 4
3: 5
4: 6
5: 7, 8
6: 9, 10
7: 11
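The "species" index above can be mimicked with a plain dictionary. This is only a sketch of the idea (a lookup table from field value to record IDs); real database engines typically store indexes as B-trees or hash tables on disk.

```python
from collections import defaultdict

# (pet_id, species) pairs taken from the pets table.
pets = [
    (1, "feline"), (2, "feline"), (3, "feline"), (4, "canine"),
    (5, "avian"), (6, "feline"), (7, "canine"), (8, "canine"),
    (9, "reptile"), (10, "canine"), (11, "canine"),
]

# Build the index once: species -> list of pet_ids having that value.
species_index = defaultdict(list)
for pet_id, species in pets:
    species_index[species].append(pet_id)

# "Which pets are cats?" is now one dictionary lookup, not a full scan.
print(species_index["feline"])
```

The cost mentioned in the slide shows up here too: every time a pet is added or changes species, the dictionary has to be updated along with the table.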
Search Trees
• From the computer’s point of view, the simplest way to
find a given record in a file is to start at the beginning of
the file and read every record until it finds the right one.
– This is obviously not very efficient for big files.
• It is much faster to keep the file sorted, and then search
it in a binary fashion, as with finding a word in a
dictionary.
• The basic concept is the binary tree. You find a given
record by dividing the problem into halves repeatedly.
– To find “tree” in the dictionary, you open it halfway, then
determine your word is in the second half, then open that
part at the halfway point, etc.
– This is efficient: you look at about log₂ N records with a
binary search, versus N/2 records on average with a linear
search, where N is the number of records.
• Computers use a generalization of this concept: the B-tree,
which uses multiple levels of index, and more than
2 child nodes for each parent node.
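The difference between the two search strategies is easy to measure. The sketch below counts how many records each approach examines in a sorted list of one million integers; the list and the target value are arbitrary choices for illustration.

```python
def linear_search(records, target):
    """Read records from the start; return (position, records examined)."""
    steps = 0
    for i, value in enumerate(records):
        steps += 1
        if value == target:
            return i, steps
    return -1, steps


def binary_search(records, target):
    """Halve the sorted search range each time; return (position, records examined)."""
    lo, hi, steps = 0, len(records) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if records[mid] == target:
            return mid, steps
        if records[mid] < target:
            lo = mid + 1   # target is in the upper half
        else:
            hi = mid - 1   # target is in the lower half
    return -1, steps


records = list(range(1_000_000))  # already sorted
target = 765_432

linear_pos, linear_steps = linear_search(records, target)
binary_pos, binary_steps = binary_search(records, target)
print(linear_steps, binary_steps)
```

For a million records the binary search never needs more than 20 comparisons (log₂ of 1,000,000 is just under 20), while the linear search here reads several hundred thousand records.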
Computational Complexity
• Different programs use different amounts of time,
computer memory, number of calculations, etc.,
which greatly affect your ability to get results.
– Time, space, and number of computations are all
interrelated.
• “Big O notation” is a way of quantifying this. The
basic idea: how do the time, space, and number of
computations scale with the number of objects being
examined (n)?
– O(1) = constant time. E.g. determining whether a
number is even or odd.
– O(log n) = logarithmic time. E.g. binary search.
– O(n) = linear time. E.g. finding something in an
unsorted list.
– O(n log n). E.g. fast sorting algorithms.
– O(n²) = quadratic time. E.g. simple sorting algorithms
such as “bubble sort”.
– O(2ⁿ) = exponential time. Very bad scaling!
E.g. exact solutions to the travelling salesman problem
and other NP-complete problems.
Travelling salesman problem:
--how to plan a route between
many cities minimizing the
total distance travelled
--trying all possibilities: the number
of possible routes increases at least
exponentially with the number of cities
--going to the closest remaining
city each time doesn’t reliably give
the shortest route.
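A tiny worked example shows both points about the travelling salesman problem. The sketch below uses three made-up cities on a line and an open route (the traveller does not return to the start); both the coordinates and the open-route simplification are assumptions chosen to keep the arithmetic easy to check by hand.

```python
from itertools import permutations
from math import dist, factorial

start = (0, 0)
cities = [(1, 0), (-2, 0), (5, 0)]  # hypothetical coordinates


def route_length(order):
    """Total distance of visiting the cities in the given order from start."""
    total, here = 0.0, start
    for city in order:
        total += dist(here, city)
        here = city
    return total


def nearest_neighbour():
    """Greedy heuristic: always go to the closest city not yet visited."""
    remaining, here, order = list(cities), start, []
    while remaining:
        nxt = min(remaining, key=lambda c: dist(here, c))
        remaining.remove(nxt)
        order.append(nxt)
        here = nxt
    return order


greedy = route_length(nearest_neighbour())
# Brute force: try every possible ordering of the cities.
best = min(route_length(p) for p in permutations(cities))
orderings_for_20_cities = factorial(20)

print(greedy, best, orderings_for_20_cities)
```

The greedy route (0 → 1 → -2 → 5) has length 11, while the best route (0 → -2 → 1 → 5) has length 9. Brute force finds the best route here, but with 20 cities it would have to check 20! orderings, which is more than 2 x 10¹⁸.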
End of Database Theory
• Lots more to database theory and practice: proper table design,
search queries, indexing. Writing SQL statements.
• Mostly, users don’t need to worry about this. The user interface
takes your input and writes the appropriate SQL, retrieves the
information you want, and formats it for you. Indexes have been
created to speed up common query types.
• Of course this means you don’t have access to the raw data and you
can’t make custom queries. And it usually means you can’t get all
the information in the database downloaded onto your own
computer to play with as you see fit.
• However, many online databases allow downloading of large
amounts of information by a process called FTP, which we will
discuss in a bit.
Plain Text vs. Formatted Text
• A basic bioinformatics problem is that your data analysis will probably use
several different applications. Also, you may want to do some direct
manipulation or analysis of the data. If every application inputs and outputs
different arrangements of the data, you will have a tough time getting them
to work together.
• The common language for trading information between applications without
having to deal with compatibility is the flat file written in plain text.
– As opposed to formatted text or a binary file or a file written in a proprietary
format.
• Plain text is the lowest common denominator: just the letters, numbers, and
punctuation marks that are on the keyboard.
• Formatted text contains codes that affect the appearance of the text, but
which don’t appear on the screen.
– HTML is a good example of formatted text
– Word processors (like MS Word) insert formatting codes that you can’t see.
Often the codes are proprietary.
– All word processors these days are WYSIWYG: “what you see is what you get”.
What appears on the screen is what will print out, but there is more in the file
than just the characters.
HTML Formatting
• The format codes, called tags, are enclosed in angle brackets: like <b>.
• The text to be formatted is between the opening tag and the closing tag
(which is the same as the opening tag except that it has a slash, like </b>).
• Thus, all text between <b> and </b> is written in boldface.
• <h2> and <h4> are heading tags: they start new sections of text.
• <p> is a paragraph tag.
The HTML source:
<h2>Some Text</h2>
<h4>Formatted by HTML</h4>
<p>This is a paragraph with <b>Bold</b>
or <i>italicized</i> words.</p>
As rendered by a browser (with the headings large, “Bold” in boldface, and
“italicized” in italics):
Some Text
Formatted by HTML
This is a paragraph with Bold or italicized
words.
In plain text this would appear as:
Some text
Formatted by HTML
This is a paragraph with Bold
or italicized words.
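Going from formatted text back to plain text means throwing the tags away and keeping only the characters between them. A minimal sketch using Python's standard html.parser module (the TextOnly class name is made up for this example):

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


html = (
    "<h2>Some Text</h2>\n"
    "<h4>Formatted by HTML</h4>\n"
    "<p>This is a paragraph with <b>Bold</b>\n"
    "or <i>italicized</i> words.</p>"
)

parser = TextOnly()
parser.feed(html)
parser.close()
plain = "".join(parser.chunks)
print(plain)
```

The result contains the same words as the HTML source, but all information about headings, boldface, and italics is gone.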
Text Encoding
• Computers store and process information as binary
bits: 0’s and 1’s. However, most information is actually
processed in bytes, which are groups of 8 bits.
– A byte has 2⁸ = 256 possible states.
• ASCII text files store each character as a single byte.
• ASCII (American Standard Code for Information
Interchange) is a way of encoding the English alphabet,
plus numbers and punctuation, plus some control
characters (like TAB and EOT: “end of transmission”)
and graphics elements (like ¬ ).
– ASCII is really a 7-bit code, with 128 possibilities. The
eighth bit was used for error-checking: the code was
originally developed for data transmission.
• A file written in ASCII (or a variant) is a text file.
– Text files are readable with Notepad
• Files of compiled computer code, or in some
proprietary format, are called binary files.
– Look like gibberish in Notepad.
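To see the byte values behind a piece of text, a short Python sketch (the sample string is arbitrary):

```python
text = "GATTACA\n"

# Encode the characters as ASCII: one byte per character.
raw = text.encode("ascii")
codes = list(raw)                      # the numeric code of each byte
bits_of_A = format(ord("A"), "08b")    # the 8 bits that store "A"

# ASCII is a 7-bit code, so every byte value is below 128.
all_seven_bit = all(code < 128 for code in raw)

print(codes)
print(bits_of_A, all_seven_bit)
```

The letter A is code 65 (binary 01000001) and the newline at the end is code 10. A file containing byte values of 128 and above is either using some extension of ASCII or is a binary file.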
Printout of an executable binary file:
^L
ZY[å]Ê
÷Ãðÿu0øÑËÑËÑË
ÑË©
´BÍ!r#ÇÖ¸)
Å^W3Àë
V´òô
2ä@Hð»)
ASCII Oddities
• ASCII separates capital and small letters, which is why sorts
sometimes put all capital letter file names before
the small letter names.
– Windows file names are not case sensitive, but Unix
file names are. Thus MyFile.txt and myfile.txt are the
same name in Windows, but different in Unix.
• The most troublesome control character is
“newline”: at the end of each line of text.
– On a typewriter (or teletype) this requires both a “line
feed” and a “carriage return”, which seems a bit
wasteful.
– Unix (including Mac OS X) just wants a line feed.
– The classic Macintosh (before OS X) just wants a carriage return.
– Windows wants both.
– Sending files as “text” rather than “binary” or “default”
sometimes cleans up this problem.
• ASCII codes are often written in octal (base 8) or hexadecimal
(base 16). Hexadecimal uses ABCDEF to
represent the single “digits” 10-15.
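These oddities are easy to reproduce. The sketch below uses made-up file names and a made-up two-line file to show ASCII sort order, newline conversion, and octal/hexadecimal notation.

```python
# Sorting by raw ASCII code puts capitals (65-90) before small letters (97-122).
names = ["myfile.txt", "Zebra.txt", "apple.txt", "MyFile.txt"]
ascii_sorted = sorted(names)
folded_sorted = sorted(names, key=str.lower)   # case-insensitive ordering

# Windows ends lines with carriage return + line feed; Unix with line feed only.
windows_bytes = b"line one\r\nline two\r\n"
unix_bytes = windows_bytes.replace(b"\r\n", b"\n")

# The same character code written in decimal, octal, and hexadecimal.
code_A = ord("A")
octal_A, hex_A = oct(code_A), hex(code_A)

print(ascii_sorted)
print(folded_sorted)
print(len(windows_bytes), len(unix_bytes), code_A, octal_A, hex_A)
```

Converting the line endings removes one byte per line, which is why a text file can change size when it is moved between Windows and Unix.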
ASCII Extensions and Unicode
• ASCII was developed for the US, and many countries use entirely
different alphabets and special symbols.
– ASCII contains some “national symbols” that vary between countries.
• US national symbols: $ # @ % & = ?
– 8 bit (256 character) ASCII: ISO-8859-1 (Latin 1) uses the
regular 128 ASCII characters as the first 128, then another set for
the second 128. Used widely for internet applications
• Things like Ù ę þ ¿
• Unicode covers most of the world’s writing systems. It currently has
about 107,000 characters in 90 scripts, plus rules for converting them,
adding to them, etc.
– You can even encode ancient Egyptian hieroglyphics!
• Unicode can be stored in several encodings: UTF-8 (1 to 4 bytes per
character, with plain ASCII characters taking 1 byte), UTF-16 (2 or 4 bytes
per character), or UTF-32 (always 4 bytes per character).
– The first 256 Unicode characters are the same as the 8 bit (ISO-8859-1) ASCII
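The number of bytes a character needs depends on the encoding. A quick Python check, using three sample characters chosen for illustration: a plain ASCII letter, an accented Latin-1 letter, and an Egyptian hieroglyph (code point U+13000).

```python
samples = {
    "A": "A",                       # plain ASCII letter
    "e-acute": "\u00e9",            # é, in the Latin-1 range
    "hieroglyph": "\U00013000",     # an Egyptian hieroglyph
}

# Bytes needed per character in each encoding (little-endian, no byte-order mark).
sizes = {
    label: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    for label, ch in samples.items()
}

# The first 256 Unicode code points coincide with ISO-8859-1 (Latin 1).
latin1_byte = "\u00e9".encode("latin-1")[0]
same_as_code_point = latin1_byte == ord("\u00e9")

print(sizes)
print(same_as_code_point)
```

UTF-8 is the most compact choice for text that is mostly ASCII, which covers nearly all sequence files and database flat files.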
Flat Files
• A flat file is written in plain text, in a standard defined format.
• General text files written in plain text can be written by programs like
Notepad (for Windows) or TextEdit (Mac) or vi (Unix). They are often given
a .txt extension on their file names.
– .txt helps Windows figure out what kind of file they are. Unix (and traditionally
the Mac) doesn’t rely on the extension, but it helps the user remember what the file is.
– The text editor just renders the characters in whatever font it
uses as a default.
• For database tables, each record is on a separate line, and the fields are
separated by delimiters, usually tabs or commas.
– Excel and other spreadsheet programs can easily open a flat file in this format,
no matter which operating system is being used.
– Spreadsheet programs can also save the files in tab-delimited (or comma-delimited) format.
• DNA and protein sequences are written in FASTA format.
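A tab-delimited flat file is simple to read programmatically. The sketch below uses Python's standard csv module on a small in-memory example built from a few columns of the clients table; a real file would be opened with open() instead of io.StringIO.

```python
import csv
import io

# One record per line, fields separated by tabs, header on the first line.
flat_file = (
    "client_id\tname_last\tbalance\n"
    "1\tBosch\t134\n"
    "2\tHarford\t0\n"
    "4\tGramme\t23\n"
)

reader = csv.DictReader(io.StringIO(flat_file), delimiter="\t")

# Every field arrives as text, so numbers must be converted explicitly.
owing = [row["name_last"] for row in reader if int(row["balance"]) > 0]
print(owing)
```

Switching to a comma-delimited file only requires changing the delimiter argument.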
FASTA Format
• A FASTA file can contain one sequence, or several sequences (sometimes
called multi-FASTA).
• Each sequence starts with a single comment line that starts with a >,
followed by one or more lines of sequence.
• Comments usually start with a short unique identifier for the sequence, a
space, and then (optionally) some other descriptive information.
• The sequence should contain only sequence characters: no punctuation,
spaces, or numbers allowed.
• Either upper or lower case is acceptable.
• The sequence can be written all on one line (which can be quite long), or on
multiple lines.
• Protein sequences use the single letter codes.
• Note that FASTA format is not rigidly defined. It is always worth
checking sequence files to see exactly what format is being used, and
checking the documentation of any new program to see what formats it accepts.
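The rules above translate into a short parser. This is a minimal sketch, not a replacement for an established library, and since FASTA is loosely defined it makes some assumptions: blank lines are skipped, sequence lines are simply joined together, and any text before the first > line is ignored. The two records in the example are toy sequences.

```python
def parse_fasta(text):
    """Return a list of (comment line without '>', sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # one more line of sequence
    if header is not None:                # finish the last record
        records.append((header, "".join(chunks)))
    return records


example = """>seqA toy nucleotide record
AACCTGCGGAAGG
ATCATTACC
>seqB toy protein record
LNQAVRFRPV
"""

records = parse_fasta(example)
# The identifier is the first word of the comment line.
identifiers = [header.split()[0] for header, _ in records]
print(identifiers, [len(seq) for _, seq in records])
```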
FASTA Examples
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNK
DYECIQPHTALTGWEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLT
SSLVTDTSVNLTWASTDNVGVTGYEVYRNGTLVANTSTTTAVVTGLTAGT
TYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDN
GSTNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQ
SQGKKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGS
SLSLNAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAAPETAYVQGG
YLNYGGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTAD
FHVAMAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAP
AELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACC
TCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTT
GTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTG
ATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGG
C
FTP
• File Transfer Protocol.
• A method for transferring files between computers. The standard way to
download bulk data from someone else’s server.
– More secure methods are generally used in private transactions: SCP and
SFTP (both of which run over SSH), for example.
• For public databases this is mostly anonymous FTP, where you don’t need a
password and anyone can download data.
• Sometimes you get a compressed file, ending in .zip, .gz, .tar, .tgz or
something similar. To convert this to a plain text flat file, you need to
decompress (extract) it with a program like WinZip, Pkzip, or some such.
• Other times you get a plain text flat file, which you can open with any word
processor or spreadsheet, or with whatever fancier software you like.
• NCBI FTP site: http://www.ncbi.nlm.nih.gov/
– Go to Data & Software, then FTP:GenBank under Downloads, then click on the
genbank hypertext link. This brings up an FTP directory; near the
bottom are some smaller files. Click on them to download.
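Files ending in .gz can also be decompressed from a script rather than with a desktop program. The sketch below uses Python's standard gzip module and compresses a toy FASTA record in memory so that it runs without a network connection; for a downloaded file, gzip.open("some_file.gz", "rt") reads the decompressed text directly (the file name here is a placeholder).

```python
import gzip

fasta_text = ">seq1 toy record\nAACCTGCGG\n"

# Compress in memory, standing in for a .gz file fetched from an FTP site.
compressed = gzip.compress(fasta_text.encode("ascii"))

# gzip data always starts with the two "magic" bytes 1f 8b.
looks_like_gzip = compressed[:2] == b"\x1f\x8b"

# Decompress back to the original plain text flat file.
restored = gzip.decompress(compressed).decode("ascii")
print(looks_like_gzip, restored == fasta_text)
```

The compressed bytes are a binary file; only after decompression do you have a plain text flat file again.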
KEGG
• KEGG = Kyoto Encyclopedia of Genes and Genomes
– http://www.genome.jp/kegg/
– Also look at MetaCyc, another metabolic pathways database
http://metacyc.org/
• Primarily used for metabolic reaction pathways, which
are manually curated from published materials
– http://www.genome.jp/kegg/pathway.html
– Also collections of genes, a hierarchical classification scheme
– ATLAS: Overall metabolic picture for individual organisms:
http://www.genome.jp/kegg/atlas/
• Currently 344 reference pathways in the database
• Bad problem: widespread use of 3-letter organism
identifiers, not listed in alphabetical order. E.g. “bsu” =
Bacillus subtilis.
– http://www.genome.jp/kegg/catalog/org_list.html is their
organism list, giving the codes.
KEGG Pathway database
• http://www.genome.jp/kegg/pathway.html
• Breakdown into major categories:
– metabolism (the most important one),
– genetic information processing (including protein folding and
sorting),
– environmental information processing (including membrane
transport and intracellular signaling),
– cellular processes,
– plus some others
• Broken down into subcategories, e.g. carbohydrate
metabolism, and then into individual pathways, e.g.
glycolysis/gluconeogenesis
– (http://www.genome.jp/kegg/pathway/map/map00010.html )
KEGG Map of Glycolysis
• Reference pathway: a summary
of all relevant reactions and
enzymes from all organisms
• Reaction intermediates are
named
• Enzymes given by E.C. number
(linked: discussed on next slide)
• Arrows for the pathways: note
that some go both ways!
• Boxes showing neighboring
pathways (linked to pathway
maps)
• At the top:
– “Pathway entry” link gives a
description of the pathway
– Dropdown pathway list for many
organisms.
• Other reference pathways just
vary what the links go to.
EC 5.3.1.1
• Connects glycerone-P to glyceraldehyde-3P in both
directions
– glycerone-P is also known as dihydroxyacetone phosphate
• Goes to an Orthology page, giving the enzyme name as
triosephosphate isomerase.
– I don’t find the orthology page too useful, so click on EC number
again to get to Enzyme page
• Enzyme page lists synonyms for the reaction catalyzed
by this enzyme, the pathways it appears in, and specific
entries for the reaction and the substrates and products.
– Also individual genes: search for “bsu” on the page. These links
give you amino acid and nucleotide sequences for the genes.
– The compound pages (substrates and products) give chemical
structures and they list all reactions and pathways the compound
is found in.
KEGG Pathway for an Individual Organism
• Homo sapiens (hsa).
– The colored boxes mean that enzyme is found in this
species; the links go to the individual gene pages
– Uncolored boxes: not present in this organism, no link
• Sulfolobus acidocaldarius (an archaeon, near the
bottom of the menu)
– Pathway isn’t complete: very little from glucose to
glyceraldehyde 3-phosphate.
• Perhaps this organism doesn’t utilize glucose in the same
way that most organisms do
• Perhaps the missing enzymes are present, but too widely
diverged to be recognizable by sequence homology to the
better-known enzymes of this type
UniProt
• http://www.uniprot.org/
• UniProt is a collaboration between the European Bioinformatics
Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein
Information Resource (PIR).
– The SIB used to put out “Swiss-Prot”, which was a curated database
of protein sequences.
– EBI used to put out TrEMBL, an uncurated database of nucleotide
sequences translated into proteins.
– PIR also had a protein database, PSD, along with a set of curated
protein families.
– They pooled their resources, reducing 3 websites to one.
• The main product is UniProtKB (UniProt Knowledgebase, i.e. a
database).
– Also a set of sequence clusters called UniRef
• Main tools: text search and BLAST search
– There’s also a ClustalW multiple alignment tool
Text Search: Triose phosphate isomerase
• Search was done with AND connecting each word.
– 2311 results
• Accession number: link to the gene itself
– Entry name: I don’t know what the point is
• Status: yellow star = curated; gray star = uncurated. “Curated”
means that a person has gone over the data and decided that it
represents a real, properly annotated gene, while “uncurated”
means that it is just a translation of a nucleotide sequence that
may be a complete gene, a gene fragment, or even a region that
never gets expressed at all.
– The curated ones come out on top by default
– You can show only curated or uncurated if you like
• Protein and gene names: lots of minor name variations, lots of
abbreviations that vary from species to species, and it’s hard to
know whether what you want is going to be the protein name or
the gene name.
Sorting Search Results
• The results can be sorted by any column, by clicking on
the column header.
– Organism: look at Bacillus
• Most are 251-253 AA long
• Lots of short ones: but note that they are listed as “Fragments” in
the protein name
• Note several other proteins get in there: enolase, phosphoglycerate
mutase
– Could sort by protein name: see all the “putative” triose phosphate
isomerases, plus “Tpi protein” and “Triosephosphatisomerase” (at very
end)
– Note the link for “Restrict term” to protein name: reduces results to 849
entries
• You can narrow the search with the “Fields” link on the
top: try “Protein Existence”
– AND takes precedence over OR, but you can use parentheses
to alter this
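The precedence rule is the same one Python uses for its own and/or operators, so it can be illustrated with a few toy entries. This is only an analogy for how such a query is read: the entries and the has() helper are invented for the example and have nothing to do with UniProt's actual search engine.

```python
# Three made-up database entries, each reduced to a set of words.
entries = [
    {"id": "e1", "words": {"triose", "phosphate", "isomerase"}},
    {"id": "e2", "words": {"enolase"}},
    {"id": "e3", "words": {"triose", "kinase"}},
]


def has(entry, term):
    """True if the entry contains the search term."""
    return term in entry["words"]


# enolase OR triose AND isomerase
# is read as: enolase OR (triose AND isomerase)
default_hits = [
    e["id"] for e in entries
    if has(e, "enolase") or has(e, "triose") and has(e, "isomerase")
]

# Parentheses change the grouping: (enolase OR triose) AND isomerase
grouped_hits = [
    e["id"] for e in entries
    if (has(e, "enolase") or has(e, "triose")) and has(e, "isomerase")
]

print(default_hits, grouped_hits)
```

Without parentheses the enolase entry is returned; with them it is excluded because it does not also contain "isomerase".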
Customizing the Display
• Click on this: there are several other pieces of
information you can add or remove, and you can
change the column order as well (with the Up
and Down buttons)
– You can also display more rows. I usually set this to
100, which is the maximum
– Press “Save” to make it happen
– Try “Protein existence” and “Features”
– “Matched text” shows what in the entry matched your
query—clears up a lot of the mystery entries
Browsing by Taxonomy
• Try narrowing it down towards humans:
– Eukaryota, Bilateria, Euteleostomi, Tetrapoda,
Eutheria, Euarchontoglires, Catarrhini, Hominidae
(done by a combination of following big numbers,
vague knowledge of taxonomy, and trial and error)
• The “Search in” box has a Taxonomy database that goes
through these terms, but I would open it in a separate tab.
(Euteleostomi)
• Or to Bacillus:
– Bacteria, Firmicutes, Bacillales, Bacillaceae
• Browsing by keyword can be quite useful as well
– Other things: gene ontology, enzyme class (EC number),
pathway; these may have more specialized uses
UniRef
• clusters of similar sequences
– UniRef100 = identical sequences
– UniRef90 = sequences 90% or more identical
– UniRef50 = sequences 50% or more identical
• Represented by a single sequence, with the
number of sequences in the cluster listed
• The idea is, by displaying UniRef clusters, you
eliminate redundancy
– E.g. for “triose phosphate isomerase” there are 849
entries. UniRef100: 679 entries; UniRef90: 440
entries; UniRef50: 115 entries
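To make the idea of identity thresholds concrete, here is a toy greedy clustering sketch. It is an assumption-laden illustration only: it compares sequences position by position without alignment, the sequences are invented, and UniRef's actual clustering procedure is more sophisticated and is not described in these notes.

```python
def identity(a, b):
    """Fraction of matching positions, relative to the longer sequence (no gaps)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = []  # list of (representative, members)
    for s in sorted(seqs, key=len, reverse=True):  # longest sequences first
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))  # no match: s starts a new cluster
    return clusters

# Invented sequences, 10 residues each.
seqs = ["MAPSRKFFVG", "MAPSRKFFVG", "MAPSRKFFVA", "MTTTTKFFVG", "QQQQQQQQQQ"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(seqs, threshold)))
```

Lowering the threshold merges more sequences into each cluster, which is the same trend as the shrinking entry counts for UniRef100, UniRef90 and UniRef50.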
Individual Accession Entries
• Let’s look at P60174, TPI from humans
• Recommended and alternate names (will this be
consistent across databases? I doubt it)
• Info on protein processing, subunit structure,
isoforms (alternative splicings), natural variants,
features of the protein sequence
• Has the amino acid sequence: note the FASTA
link, to get a nice FASTA-formatted version
• Literature references
• Other databases (even Wikipedia!)
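Once you have a FASTA file, it is easy to read programmatically. A minimal sketch of a FASTA reader follows; the two records are made-up examples, not the real P60174 sequence, and a library such as Biopython would be the usual choice for real work.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>example_1 made-up protein fragment
MAPSRKFF
VGGNWKMN
>example_2 another made-up record
MTKLLA
"""
for header, seq in parse_fasta(example):
    print(header.split()[0], len(seq))
```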
SEED Viewer
The SEED Viewer
• http://pubseed.theseed.org/seedviewer.cgi
• A whole genome analysis site dedicated to prokaryotes
– For eukaryotes, most species have their own individual web
site.
• You don’t need a login ID to use the site, unless you
want to use a private genome.
– Registering is quite easy. This site is not meant to be
exclusive.
• The data here is organized by two guiding principles:
1. Subsystems: metabolic pathways and structures (e.g. glycolysis,
ribosome)
2. Localization: genes involved in the same subsystem tend to stay
near each other on the chromosome across species lines (partly
due to operons)
Organism Overview
• To start, select a species: Bacillus subtilis (for which
there is only 1 strain)
• You get to the Organism Overview page.
• Basic info: strain name, genome size, number of genes,
etc.
• Pie chart of subsystems. You can drill down and get to
specific subsystems this way, or use the “Features in
Subsystems” table, but I rarely do. The Genome Browser
works better for me.
• This is the central page for this organism. Links to
various things:
– Genome Browser (aka Feature Table): for individual genes
– Subsystems (under Navigate or Organism)
– Comparative Tools: Function-based, Sequence-based, KEGG,
BLAST
Genome Browser
• Has a nice graphic of the chromosome, showing all 6 reading
frames, with the gene of interest in red.
– Clicking on a gene gives summary info about it, and you can get to its
page: “Details page”.
• Table is very useful, because you can sort and limit it
• Let’s look for “triose phosphate isomerase” as a function. Nothing.
Try just “isomerase”. Get 29, and halfway down the list is
“triosephosphate isomerase”. Subsystem is “Glycolysis and
Gluconeogenesis”
• ID = fig|224308.1.peg.3398.
– The “fig|” just means it is in the SEED/FIG/RAST database
– 224308.1 is the organism ID
– “peg” stands for “protein encoding gene”. There are also RNA genes
(Type column)
– 3398 is the actual gene number
– Click on it to go to that gene’s page
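The parts of the ID can be pulled apart with a short sketch like the one below. The pattern is inferred from the single example above (an assumption, not a published specification of the ID format), and the function name is hypothetical.

```python
import re

# Pattern inferred from the example ID: fig|<organism ID>.<type>.<number>
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z]+)\.(\d+)$")

def parse_fig_id(fig_id):
    """Split a SEED-style feature ID into organism ID, feature type and number."""
    m = FIG_ID.match(fig_id)
    if m is None:
        raise ValueError(f"not a recognized fig ID: {fig_id!r}")
    genome, feature_type, number = m.groups()
    return {"genome": genome, "type": feature_type, "number": int(number)}

print(parse_fig_id("fig|224308.1.peg.3398"))
# {'genome': '224308.1', 'type': 'peg', 'number': 3398}
```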
Annotation Details Page
• Basic links from here:
– Feature evidence
– Sequence (can get DNA or amino acid)
– Subsystems this gene is in
– Compare Regions: view of the chromosome region in related species
– Run tool: mostly for predicting transmembrane regions and cellular location
• Try TmHMM or CELLO
• Compare Regions.
– Your gene is #1, in red, transcribed left to right
– Genes below the line overlap their neighbors
– Other genes in color match genes seen in the window in other species
– Gray genes don’t match anything else on the display
– You can change the number of species seen, which changes the gene colors
– Operons are adjacent genes all transcribed in the same direction, usually with
related functions.
– Mouse over to get gene summary information
– Clicking on another gene refocuses the display on it, and brings you to its page.
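The operon description above (adjacent genes, same direction) can be turned into a rough heuristic. This is only a sketch: the gene coordinates are invented, the `max_gap` cutoff is an assumption added here and not something the SEED Viewer is known to use, and real operon prediction needs more evidence than strand and spacing.

```python
def candidate_operons(genes, max_gap=100):
    """Group runs of neighboring genes on the same strand.

    genes: list of (name, start, end, strand) tuples, sorted by start.
    max_gap: largest allowed distance between neighbors (assumed value).
    """
    runs, current = [], []
    for gene in genes:
        if current:
            prev = current[-1]
            same_strand = gene[3] == prev[3]
            close = gene[1] - prev[2] <= max_gap
            if not (same_strand and close):
                runs.append(current)  # the run ends at a strand change or big gap
                current = []
        current.append(gene)
    if current:
        runs.append(current)
    # A single gene on its own is not reported as a candidate operon.
    return [[g[0] for g in run] for run in runs if len(run) > 1]

# Invented example genes.
genes = [
    ("geneA", 100, 900, "+"),
    ("geneB", 950, 1800, "+"),
    ("geneC", 1850, 2600, "+"),
    ("geneD", 3500, 4200, "-"),
    ("geneE", 4250, 5000, "-"),
    ("geneF", 6000, 6500, "+"),
]
print(candidate_operons(genes))
```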
Feature Evidence
• Mostly the graphic display of closely related
genes.
– Red is most similar, green is least.
– Length of bars indicates relative gene length, with the
white bar inside showing the matching region
– You can select various genes and then download all
sequences as a single FASTA-formatted file. Note the
“Include query” check box.
– You can also do a multiple alignment on the genes
with the “Align Selected” button.
• There is also a table giving all this information
Subsystems
• You can click on the individual subsystem for this gene, or you can go
through the Subsystems page.
– Take a look at Subsystems page briefly
• Diagram: a map similar to a KEGG map, but labeled differently
– Some mouse hovering works here
– Note alternatives found in different organisms
• Additional Notes often give useful information and references about this
subsystem
• Functional Roles: list of all gene functions needed for this subsystem and its
variants.
– Note that sometimes one gene does more than one role, and other times more
than one gene fulfills the same role (i.e. duplicate genes or enzymes with
multiple subunits).
– Links are to KEGG pages
• Subsystem spreadsheet shows which roles are present in different
organisms (sortable and limitable)
– Coloring by cluster on the chromosome
– Links to the genes
– Variant codes (explained in Additional Notes)
Comparative Tools
• Go back to the main Organism page or to an individual
gene page
• KEGG map. Starts out very high level, but clicking on
the map drills down to regular KEGG map.
– Comparison between species here isn’t too enlightening
• Function-based comparison: comparing subsystems
between your species and another one
– See functions present in one or the other species, or both
– Try Anabaena variabilis and limit it to Glycolysis subsystem
• See many functions in both A and B genome
• A few things in one or the other only
• “Find” button does a search for that gene in the species where it is
missing
More Comparative Tools
• Sequence-based comparison.
– Need to select both a reference organism and one or more
comparison organisms. Try B. subtilis and B. cereus ZK. It takes
a while!
– Get a circular map of the genome showing how related genes
are between the species, and a table showing this.
• The map is most interesting for gaps: areas where the two genomes
have nothing in common.
– Also a BLAST dotplot showing how different areas are related:
you see regions with many syntenic genes, regions that are
inverted, regions with no synteny.
• BLAST search tool. Paste in FASTA sequences and get
BLAST against your genome, with links to the gene
pages
Protein Data Bank
• http://www.rcsb.org/pdb/home/home.do
• Experimentally-determined three-dimensional
protein structures
– Good expandable help menus
• Try a search for triose phosphate isomerase
– Nice general information under “Web Page hits”
– Structures (45 of them). Click on one.
• You can examine the 3-D structure
• Sequence Details lays out secondary structure along the
length of the sequence
• Biology and Chemistry has useful details about the protein
NCBI
• http://www.ncbi.nlm.nih.gov/
• National Center for Biotechnology Information, which is part of the National Library
of Medicine, which is in turn part of the National Institutes of Health, which is in
turn part of the US Department of Health and Human Services.
• The main bioinformatics database in the US
– PubMed, citations and abstracts for biomedical articles
– GenBank, primary repository for DNA sequences
– dbEST: Expressed Sequence Tag database
– Genome: whole genome sequences, annotations, links to projects
– Structure: 3-dimensional protein domains
– GEO: gene expression data of several types
– OMIM: Online Mendelian Inheritance in Man.
• also several important tools:
– BLAST sequence alignment tool
– CDD: Conserved domains database (and search tool)
• And more, especially tools and databases for human genetics
• “Entrez” (pronounced “on-tray”) is the overall collection of databases at NCBI,
which can be searched by a common mechanism. The “All databases” link.
PubMed
• this is the primary search engine many of us use to find important articles
• contains abstracts of articles in biology and medicine back to 1948 in many journals
• often with links to free full text articles, especially through PubMed Central
• try triose phosphate isomerase Arabidopsis (or Bacillus): get a list of articles, with a
separate tab for review articles
– click on one: get the abstract: 90% of the time, this is all you need
– plus (often) a link to the full text article
• sometimes these require a fee, like $30. Rarely worth it: try NIU interlibrary loan instead
• often the articles are free, especially if you are coming to the site through NIU (also check out the link
for “off Campus Authentication” at the NIU library site http://www.ulib.niu.edu/ ).
– Also, a set of Related articles
– also links to individual authors and journals
• You want to find a reference for BLAST, and you know the first author is Altschul
– get 329 hits, sorted with most recent first.
– Perusing the first page, Altschul’s initials are probably SF
– Search for SF Altschul gives 46 articles.
– Sort by first author: not far down is the 1990 article “Basic local alignment search tool”. A
quick glance at the abstract confirms it. Unfortunately it was published in a journal that
doesn’t even allow access to 20 year old articles!
Database Search
• Search all Databases brings up a list of all the Entrez databases and how
many hits in each.
– More usually, you select the database first: Protein is a good choice, Nucleotide
for RNA-only genes, Genome if you want to see the gene in context with
neighboring genes.
• Limiting the search: use the Preview/Index tab.
– Add search terms at the bottom: select the field (Organism, Protein name, text
word, etc.), type in the word you want, hit AND, OR, or NOT. Then hit Preview
– Results come up in Most Recent Queries. Click on the Results number to get to
the index of entries
– Erase unwanted terms in the text box at the top
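The field-plus-operator searches built on the Preview/Index tab correspond to a plain text query, which can also be sent to NCBI's E-utilities web service. The sketch below only builds the query string and the request URL; it does not contact NCBI. E-utilities is not covered in these notes, and the field tags used here ([Protein Name], [Organism]) are assumptions that should be checked against NCBI's field list for the database you search.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_term(clauses):
    """Join (operator, text, field) clauses into one Entrez query string.

    The operator of the first clause is ignored; field may be None.
    """
    parts = []
    for i, (op, text, field) in enumerate(clauses):
        piece = f"{text}[{field}]" if field else text
        parts.append(piece if i == 0 else f"{op} {piece}")
    return " ".join(parts)

def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

term = build_term([
    (None, "triose phosphate isomerase", "Protein Name"),
    ("AND", "Bacillus", "Organism"),
])
print(term)
print(esearch_url("protein", term))
```

The first line printed is the same kind of query you would see in the text box at the top of the Preview/Index page.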
GenBank
• Storage of all nucleotide sequences
– sharing all info with DDBJ (DNA DataBank of Japan) and EMBL (European
Molecular Biology Laboratory)
– December 2008: 99,116,431,942 bases, from 98,868,465 reported sequences
– new release that you can download every 2 months: current release is 169.0
• The main storage area is “Nucleotide”, with EST, GSS, and a few others
also involved
– Search for triose phosphate isomerase brings up many whole genomes: try
adding “Bacillus” to the search, and sorting by taxID
– Look at B. megaterium ID M87647. A region of the chromosome with several
relevant genes.
– List of features and their coordinates, translations of all genes, nucleotide
sequence at the bottom
• “CDS” = coding sequence, the part of the gene that can be translated into protein
– Clicking on CDS, gene, etc. gives all the info at the top, plus just the specific
sequence you clicked on.
– Clicking on protein gives you a different ID, and more info about that protein.
This is part of the Protein database.
GenBank Records
• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_001087007.1
• A column of keywords and a column of values.
• Basic info: gene name, accession number, taxonomy,
references to literature.
• List of features found in the sequence
– The sequence itself is at the bottom, labeled ORIGIN
– The numbers refer to this sequence, sometimes with gaps and/or
reverse-complement (feature on the opposite strand)
– “Gene” refers to the entire transcribed sequence
– “CDS” is just the protein-coding portion of the sequence
– “misc_feature” can be a wide variety of things
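The feature coordinates can be used to cut a feature out of the ORIGIN sequence. The sketch below handles only the simplest location forms (a plain range, join(...) for gaps, and complement(...) for the opposite strand); partial-end markers and other location syntax are deliberately left out. The sequence is a made-up 15-base example, not a real record.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def extract(location, sequence):
    """Extract a feature given a simple GenBank-style location string.

    Supports 'a..b', 'join(...)' and 'complement(...)' only.
    """
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract(inner, sequence))
    if location.startswith("join(") and location.endswith(")"):
        parts = location[len("join("):-1].split(",")
        return "".join(extract(p, sequence) for p in parts)
    m = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if m is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(m.group(1)), int(m.group(2))
    return sequence[start - 1:end]  # coordinates are 1-based and inclusive

seq = "ATGCCCGGGTTTAAA"  # made-up sequence
print(extract("1..3", seq))                           # ATG
print(extract("join(1..3,10..12)", seq))              # ATGTTT
print(extract("complement(join(1..3,10..12))", seq))  # AAACAT
```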
dbEST
– for sequences (usually partial and single-pass) of messenger RNA (i.e., cDNA
derived from mRNA)
– Name search works, but BLAST is probably better, since not all ESTs have been
named
• Note that you can search non-human, non-mouse ESTs separately
– Get data on exons and on where/when it is expressed
Genome
• whole genome sequences, annotations, links to projects
– Whole genome sequence
– List of genes (links to sequences). Protein-coding and RNA
– GenBank records for individual genes, proteins, or the whole genome
– Genome (map) viewer
– BLAST search for individual species
– FTP download of complete genome, genes, GenBank and other descriptions
• Pick a group (Bacteria). Big table. Try Bacillus anthracis str. Ames
– Note some genomes have plasmids, and eukaryotes have mitochondria, etc.
– You can sort by taxonomy (with a helpful tree diagram)
– Name is link to the Taxonomy Browser, general info about the species, with links
to sequences of various types
– Accession number is link to info about the sequencing project results
– Numbers for proteins and RNAs link to a table of those genes
– Number for genes links to a list of GenBank records for each
More Genome
• Click on Accession number. Goes to Overview page for
this genome
– Link to GenBank record for the complete genome.
• By default, the sequences aren’t shown: uncheck the little boxes at
the top to get them
– link to BLAST search for this organism
– link to FTP site for downloads: the .fna file is the full length
genomic sequence as a FASTA file.
– links to BLAST Homologs: a collection of potentially interesting
forms of analysis. I rather like Gene Plot, which compares all
genes in 2 genomes. Try doing it against Bacillus halodurans.
– Genome Project link has information about the specific
sequencing project.
– A map viewer to see the context of a gene. The Genes link goes
to a list of all genes, with the map viewer focused on it.
Structure
• 3-D protein structures
– Need RasMol or Cn3D to view them properly
• Also shows gene domain structures
GEO
• http://www.ncbi.nlm.nih.gov/geo/
• Gene Expression Omnibus
• Various kinds of gene expression data: microarrays, RT-PCR, SAGE, mass
spectrometry, in situ hybridizations, etc.
• Text search (query) and browser
• Description of experiment and the platform used to run it on
• Links to datasets you can download and analyze yourself
• Series vs. DataSet. An experimenter submits data as a series: the results
of several experimental treatments and controls. The NCBI staff then
curates this into a DataSet, which is then analyzed by the GEO tools
• Various tools: expression profiles (comparing expression levels under
different conditions), clustering, etc.
OMIM
• Online Mendelian Inheritance in Man
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
• Comprehensive entries for all human genetic disorders.
• Originally started by Dr. Victor A. McKusick in the 1960’s, as a book
• covers about 12,000 genes today
– genes and their diseases often have separate entries
• essentially a comprehensive literature review divided into several sections
– Clinical features, inheritance, cytogenetics, mapping, molecular genetics,
population heterogeneity, animal models, gene structure, genotype/phenotype
relations, etc.
• Also a list of known human variant alleles for each gene
• try searching for triose phosphate isomerase: there is an entry for the gene
and a deficiency syndrome
• look at “cystic fibrosis”: both the disease and the gene (CFTR)
Taxonomy
• http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Root
• A simple display of organisms and the taxonomy leading to them, for all organisms
that have at least one GenBank entry
• Note that they have a disclaimer: the taxonomies listed here are not authoritative.
They try to keep up with the taxonomic literature, but that is an active field with a lot
of changes occurring.
• You can search for a name, e.g. “Oryza”, which is the genus of rice.
– You can also misspell it and try the “phonetic search” option: try “Oriza”
• You can also just click on the tree and drill down
– clicking on a taxon will sometimes expand the tree from that point and other times give you
information about the taxon
• Taxon information:
– the full lineage,
– synonyms and other naming information
– which genetic code it uses (there are several variants; look at the Genetic codes link on the
left of the Taxonomy main page),
– links to various parts of GenBank
– unfortunately, very little descriptive info: try Google and Wikipedia
CDD
• Conserved Domains Database
• A way of searching for protein function that
works on multiple sequence alignments
BLAST
• The main sequence alignment tool in use
today
• The two main ways to access GenBank
and many other sequence databases are
text search (searching annotation) and
BLAST (searching the sequences).
• We will cover this in much more detail
soon.