BIOINFORMATICS IN BIOCHEMISTRY

Bioinformatics: a field at the interface of molecular biology, computer science, and mathematics.

Bioinformatics focuses on the analysis of molecular sequences (DNA, RNA, and proteins).

The National Institutes of Health (NIH) definition of bioinformatics: “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data.”

How is bioinformatics important to biochemistry?

The tools of bioinformatics include algorithms and computer programs for analysis of molecular sequences that reveal the structure and function of macromolecules.

Bioinformatics analysis gives valuable information that can guide experimental work.

AMINO ACID SEQUENCE ALIGNMENT

A way to compare 2 or more sequences.

The sequences are lined up (“aligned”), one above the other, so that each residue of one sequence can be compared to the corresponding residue of the other sequence.

Sometimes one sequence must be “cut,” and a gap introduced, in order to make this sequence align in the optimal way with the other sequence.

An example of a pairwise amino acid sequence alignment (2 sequences):

sequence_1    1 MLFMCHQRVMKKEAEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA 50
                             .|||||..:           ||:::||||.||||||.
sequence_2    1              MEEKLKKTK-----------IIFVVGGPGSGKGTQCE 26

All the residues that are identical in the two sequences are indicated with the “|” symbol between them; residues that are chemically similar are indicated with the “:” or “.” symbol, such as W and F (both have aromatic side chains).

Note that a gap (----- region) was introduced into sequence_2 in order to make it align optimally with sequence_1.

BLAST: Basic Local Alignment Search Tool

A bioinformatics tool that allows users to compare a protein or DNA sequence to databases of other protein or DNA sequences from many organisms.
A web-based version is available free of charge at the National Center for Biotechnology Information (NCBI) website: http://www.ncbi.nlm.nih.gov/BLAST/

The output from a “BLAST search” is a series of sequence alignments.

EXAMPLE OF A BLAST SEARCH

Suppose you have the sequence of a human protein and want to know if there is a homologous protein in the fruit fly Drosophila melanogaster.

The amino acid sequence of the human protein will be the “query” for the BLAST search.

The BLAST algorithm compares the query sequence to all of the Drosophila protein sequences in the database (the proteins predicted from the Drosophila genome).

The BLAST output will show a list of the Drosophila proteins that have statistically significant sequence similarity to the human query protein. These Drosophila proteins can be referred to as “BLAST hits.”

Below this list of BLAST hits, there will be a series of sequence alignments, one between the human query protein and each Drosophila protein in the list of BLAST hits.

The first alignment will be between the query and the Drosophila protein that is most similar in sequence; the second alignment will be between the query and the Drosophila protein that is the second best match in terms of sequence similarity, and so on.

The next slide shows just one of these alignments from a BLAST search. The last 2 slides explain some of the features of the alignment.
Query = a human protein
Subject (sbjct) = the Drosophila protein that is most similar to this human protein

Sample from BLAST output (see explanation on next 2 slides):

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Length = 229

Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Query: 51  SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG 110
           S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI
Sbjct: 75  SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA 134

Query: 111 QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG 168
              L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +
Sbjct: 135 APADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT 194

Query: 169 IVRKVNAEGSVDSVFSQVCTHLDAL 193
           +   +NAE  VD +F +V   +D +
Sbjct: 195 LT--INAERDVDDIFLEVVQAIDCV 217

First you will see sequence identification information for the subject (Drosophila) protein in the alignment. This protein is called “Adenylate kinase-1”:

>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]

Next you will see the total length of the subject protein, 229 amino acid residues:

Length = 229

Looking at the sequence alignment itself, you will see that it wraps around, taking up 3 ½ “rows.” One “row” is shown at the bottom of this slide.

Residues 2 to 193 of the query protein are aligned with residues 15 to 217 of the Drosophila protein (see the numbers on the right and left sides of the previous slide).

The “middle” line of each row (the line between the query and subject lines) is called the “consensus sequence.”

Whenever there is a residue that is identical for the query protein and the subject protein, its one-letter code is shown in this middle line.
Whenever there is a residue that is chemically similar (a conservative substitution) for the query and the subject, it is marked with a “+” symbol in the middle line.

If one of the sequences must be “cut” in order to align it with the other, each missing position is indicated with a “-” symbol. This is referred to as a “gap” in the alignment.

Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Just above the sequence alignment itself you will see statistical information for the alignment (essentially telling you “how similar” the two sequences are):

Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)

The score is given in bits (179), with the raw alignment score in parentheses (453).

This tells you that of the 205 aligned positions (the columns of the alignment, including gap positions), 96 are identical between the query protein and the subject protein.

Of the 205 aligned positions, 131 are either identical OR similar (similar residues have the “+” symbol).

Gaps account for 15 of the aligned positions (those with the “-” symbol).

The expect value, or E-value (1 x 10^-45 in this case; a very small number!), is the number of alignments with a score at least this high that would be expected to occur by chance in a search of a database of the size that was searched. For very small E-values this is approximately the probability of finding such a match by chance.

The bottom line: the smaller the E-value, the less likely it is that the similarity is due to chance, and the stronger the evidence that the two sequences are related.
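The counting behind the Identities, Positives, and Gaps values can be sketched in a few lines of Python. This is a minimal illustration and not BLAST's own code. The function names alignment_stats and expect_from_bit_score are made up for this example, and the spacing of the middle line is reconstructed so that each symbol sits in its own column. The E-value function uses the textbook relation E = m x n x 2^(-S'), where S' is the bit score and m and n are the effective lengths of the query and the database; BLAST applies its own length corrections, which are not modeled here.

```python
# Row 1 of the alignment from the BLAST output (query residues 2-50).
QUERY = "EEKLKKTK" + "-" * 11 + "IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG"
SBJCT = "EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG"
# Middle line of the same row. Assumption: spaces are restored so that
# every symbol lines up with its column in QUERY and SBJCT.
MIDLINE = "EEKLK  +" + " " * 11 + "II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG"


def alignment_stats(query, midline, sbjct):
    """Count identities, positives, and gap positions column by column."""
    if not (len(query) == len(midline) == len(sbjct)):
        raise ValueError("all three rows must have the same number of columns")
    identities = positives = gaps = 0
    for q, m, s in zip(query, midline, sbjct):
        if q == "-" or s == "-":
            gaps += 1           # a position where one sequence was "cut"
        elif q == s:
            identities += 1
            positives += 1      # identical residues also count as positives
        elif m == "+":
            positives += 1      # conservative substitution
    return {
        "length": len(query),
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
    }


def expect_from_bit_score(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'), with S' the bit score (lengths are hypothetical inputs)."""
    return query_len * db_len * 2.0 ** (-bit_score)


stats = alignment_stats(QUERY, MIDLINE, SBJCT)
print(stats)
```

For the single row shown, this gives 60 columns, 36 identities, 45 positives, and 11 gap positions. Adding up all of the rows in the same way gives the 96/205, 131/205, and 15/205 reported by BLAST. Working backwards with the second function, a score of 179 bits and an E-value of 1e-45 correspond to a search space (m x n) of roughly 10^8 to 10^9.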