SV-computational

COMPUTATIONAL BIOLOGY
OUTLINE
Proteins
 DNA
 RNA
 Genetics and evolution
 The Sequence Matching Problem
 RNA Sequence Matching
 Complexity of the Algorithms

DEFINITION

Computational biology encompasses all
computational methods and theories applicable
to molecular biology, together with computer-based
techniques for solving biological problems.
PROTEINS
Building blocks of living organisms
 Large molecules composed of sequences of
amino acids
 There are 20 amino acids, which are divided into
classes:
hydrophobic (h-phob)
hydrophilic (h-phil)
polar (pos, neg)

Amino acid     Sym  Class    Amino acid     Sym  Class
Alanine        A    h-phob   Leucine        L    h-phob
Arginine       R    pos      Lysine         K    pos
Asparagine     N    h-phil   Methionine     M    h-phob
Aspartic acid  D    neg      Phenylalanine  F    h-phob
Cysteine       C    h-phil   Proline        P    h-phob
Glutamine      Q    h-phil   Serine         S    h-phil
Glutamic acid  E    neg      Threonine      T    h-phil
Glycine        G    h-phob   Tryptophan     W    h-phob
Histidine      H    pos      Tyrosine       Y    h-phil
Isoleucine     I    h-phob   Valine         V    h-phob
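
The table can be held as a simple lookup from one-letter symbol to class. The following is a minimal sketch; the names AMINO_ACID_CLASS and classify are illustrative only, and the class labels are copied from the table above.

```python
# Class of each amino acid, keyed by its one-letter symbol (labels as in the table).
AMINO_ACID_CLASS = {
    "A": "h-phob", "L": "h-phob",
    "R": "pos",    "K": "pos",
    "N": "h-phil", "M": "h-phob",
    "D": "neg",    "F": "h-phob",
    "C": "h-phil", "P": "h-phob",
    "Q": "h-phil", "S": "h-phil",
    "E": "neg",    "T": "h-phil",
    "G": "h-phob", "W": "h-phob",
    "H": "pos",    "Y": "h-phil",
    "I": "h-phob", "V": "h-phob",
}


def classify(protein):
    """Return the class of each residue in a one-letter protein string."""
    return [AMINO_ACID_CLASS[symbol] for symbol in protein.upper()]


print(classify("MKD"))  # ['h-phob', 'pos', 'neg']
```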
DNA
Blueprint of living organisms
 DNA is composed of two strands held together by weak
hydrogen bonds
 Each strand is a sequence of nucleotides
 DNA has four bases, which are classified into two
chemical types

Base      Symbol  Type
Adenine   A       Purine
Thymine   T       Pyrimidine
Cytosine  C       Pyrimidine
Guanine   G       Purine
DNA DOUBLE HELIX
RNA
RNA is chemically very similar to DNA
 There are two important differences:
 The four bases present in RNA are
adenine (A)
guanine (G)
cytosine (C)
uracil (U), which takes the place of thymine
 RNA nucleotides contain a different sugar
molecule (ribose instead of deoxyribose)

GENETICS AND EVOLUTION

Mutation

Natural selection

Genetic drift
SEQUENCE MATCHING PROBLEM
Matching DNA, RNA, or protein sequences
between a diseased organism and a healthy
organism
 Proteins are long and DNA strands are even
longer
 We match them by breaking them into shorter
subsequences
 Breaking and matching are done using the notion of
alignment.

SEQUENCE MATCHING EXAMPLE

Consider two amino acid sequences:
ACCTGAGAG
ACGTGGCAG
Sequence alignment:
ACCTGAG-AG
ACGTG-GCAG
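
One standard way to compute such an alignment is dynamic programming (the Needleman-Wunsch global alignment algorithm). The slides do not say which method or scoring scheme is intended, so the sketch below is only an illustration: the function name global_align and the scores (match +1, mismatch -1, gap -1) are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two symbols
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_align("ACCTGAGAG", "ACGTGGCAG")
print(score)
print(top)
print(bottom)
```

Which alignment comes out depends on the scoring scheme; different gap and mismatch penalties can place the gaps differently or remove them altogether.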
FINITE STATE MACHINES IN BLAST
BLAST is used to find which of the sequences in a
database are related to a newly given sequence
 The BLAST system is a three-step process:
1. Examine the query string and select a set of
substrings of length w (between 4 and 20) that
are good for producing matches
2. Build a deterministic finite state machine (DFSM)
from that set of substrings and use it to find the
sequences in the database with the highest local
matches
3. Examine the matches found in step 2 and try
to extend them into longer matching sequences
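
The three steps can be sketched in a heavily simplified form. This is not the real BLAST implementation: it takes every substring of length w instead of selecting good ones, uses a dictionary lookup in place of a DFSM, and only extends exact matches without any scoring matrix. The names seeds, extend and search, and the sample database, are hypothetical.

```python
def seeds(query, w):
    """Step 1 (simplified): map every substring of length w to its start positions."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def extend(query, target, qi, ti, w):
    """Step 3 (simplified): grow an exact seed hit left and right while symbols match."""
    left = 0
    while (qi - left > 0 and ti - left > 0
           and query[qi - left - 1] == target[ti - left - 1]):
        left += 1
    right = w
    while (qi + right < len(query) and ti + right < len(target)
           and query[qi + right] == target[ti + right]):
        right += 1
    return qi - left, ti - left, left + right  # query start, target start, length


def search(query, database, w=4):
    """Step 2 (simplified): scan each database sequence for seed hits, then extend."""
    index = seeds(query, w)
    hits = set()
    for name, target in database.items():
        for ti in range(len(target) - w + 1):
            for qi in index.get(target[ti:ti + w], []):
                q_start, t_start, length = extend(query, target, qi, ti, w)
                hits.add((name, q_start, t_start, length))
    # Longest matches first.
    return sorted(hits, key=lambda hit: -hit[3])


database = {"s1": "TTACGTGGCAGTT", "s2": "GGGGGGGG"}
print(search("ACGTGGCAG", database))  # [('s1', 0, 2, 9)]
```

A DFSM built from the seed set would find all seed occurrences in a single left-to-right pass over each database sequence; the dictionary lookup here trades that efficiency for brevity.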

REGULAR EXPRESSIONS SPECIFY PROTEIN MOTIFS

By aligning a collection of related proteins, we can
define a motif
Example:
E S G HDT
Y Y NKNR
M DTTTTT S W Q S
R G SDTTT
P D M T
A G P TT
W R N T
Once a motif is defined, we can search for
occurrences of it in other protein sequences by
using regular expressions
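
As a small illustration of searching for a motif with a regular expression, the sketch below uses the N-glycosylation site pattern, written N-{P}-[ST]-{P} in PROSITE notation (asparagine, any residue except proline, serine or threonine, any residue except proline). This is not the motif from the example above, and the test sequence is made up.

```python
import re

# N-{P}-[ST]-{P} in PROSITE notation, translated to Python regex syntax.
MOTIF = re.compile(r"N[^P][ST][^P]")


def find_motif(protein):
    """Return (start index, matched text) for each non-overlapping occurrence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]


print(find_motif("MKNGSAPNPTL"))  # [(2, 'NGSA')]
```

The second N in the sequence is followed by P, so the pattern rejects it.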
HMM FOR SEQUENCE MATCHING
Hidden Markov models (HMMs) are used when sequences
become fairly diverse
 We can capture the variations among the
members of a family and the probabilities
associated with them
 Using HMMs, we can find the best
alignment between two sequences and determine
which family a given new sequence belongs to

An HMM profile is given by
M = (K, O, π, A, B)
K is a set of n states, one for each position in the
sequence
O is the output alphabet
π contains the initial state probabilities
A contains the transition probabilities
B contains the output probabilities
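
Given M = (K, O, π, A, B), the probability that the model emits a particular sequence can be computed with the forward algorithm, and comparing that probability across family models is one way to decide which family a new sequence fits best. The sketch below assumes a non-empty observation sequence; the two-state model and all of its probabilities are made-up toy values, not taken from any real protein family.

```python
def forward(obs, states, pi, A, B):
    """Probability that the HMM with initial pi, transitions A and outputs B emits obs."""
    # alpha[s] = probability of emitting the symbols seen so far and ending in state s
    alpha = {s: pi[s] * B[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[r] * A[r].get(s, 0.0) for r in states) * B[s].get(symbol, 0.0)
            for s in states
        }
    return sum(alpha.values())


# Toy model (hypothetical values): state k1 mostly emits A, state k2 mostly emits C.
K = ["k1", "k2"]
pi = {"k1": 1.0, "k2": 0.0}
A = {"k1": {"k1": 0.5, "k2": 0.5}, "k2": {"k2": 1.0}}
B = {"k1": {"A": 0.9, "G": 0.1}, "k2": {"C": 0.8, "T": 0.2}}

print(forward("AC", K, pi, A, B))
print(forward("CA", K, pi, A, B))
```

Under this toy model "AC" is far more likely than "CA", because the model must start in k1 and k1 never emits C.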
EXAMPLE OF HMM DESCRIBING PROTEIN
SEQUENCE FAMILY
RNA SEQUENCE MATCHING AND SECONDARY
STRUCTURE PREDICTION USING THE TOOLS OF
CONTEXT-FREE LANGUAGES
In RNA, a change to a single nucleotide in a stem
region could completely alter the molecule's shape
and its function
 So a change in the stem must be matched by a
corresponding change in the paired nucleotide
 Context-free languages are used to describe these
nested dependencies and the secondary structure
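
The nested pairing in a stem is the same kind of dependency as balanced parentheses, which is why a context-free grammar fits. A minimal sketch for a single stem-loop (hairpin), assuming the grammar S -> A S U | U S A | C S G | G S C | loop, only Watson-Crick pairs, and a loop of at least three unpaired bases. The function name stem_length and the example sequences are hypothetical.

```python
# Base pairs allowed by the recursive rules S -> x S y of the assumed grammar.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def stem_length(rna, min_loop=3):
    """Count how many nested base pairs can be derived from the outside in."""
    i, j = 0, len(rna) - 1
    pairs = 0
    # Pair positions i and j only if at least min_loop bases remain between them.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        pairs += 1
        i += 1
        j -= 1
    return pairs


print(stem_length("GCGAAAACGC"))  # 3: intact stem
print(stem_length("GAGAAAACGC"))  # 1: a single change breaks the stem
print(stem_length("GAGAAAACUC"))  # 3: a matching change in the partner restores it
```

The second and third calls show the point made above: one changed nucleotide in the stem disrupts the pairing unless its partner changes to match.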

EXAMPLE
COMPLEXITY OF ALGORITHMS USED IN
COMPUTATIONAL BIOLOGY
Approaches to many of the problems described
here are computationally expensive, such as breaking
large protein and DNA molecules into substrings
and matching them
 Several of these problems are NP-hard
 To analyze them, we convert the optimization problem
to a decision problem

SHORTEST-SUPERSTRING = {<S, K> : S is a set of strings
and there exists some superstring T such that every
element of S is a substring of T and T has length less than
or equal to K} is NP-complete
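
The two sides of NP-completeness can be seen in a short sketch: checking a proposed superstring T is fast, while the obvious way of finding a shortest one tries every ordering of the strings. The function names below are illustrative, and the exhaustive search is only practical for a handful of strings.

```python
from itertools import permutations


def is_certificate(S, K, T):
    """Polynomial-time check that T witnesses <S, K> in SHORTEST-SUPERSTRING."""
    return len(T) <= K and all(s in T for s in S)


def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_superstring(S):
    """Exhaustive search over all orderings; exponential time, tiny inputs only."""
    # Strings contained in another string need no position of their own.
    core = [s for s in set(S) if not any(s != t and s in t for t in S)]
    best = None
    for order in permutations(core):
        candidate = ""
        for s in order:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best or ""


S = ["ACG", "CGT", "GTA"]
T = shortest_superstring(S)
print(T, is_certificate(S, 5, T), is_certificate(S, 4, T))
```

For this input the search returns a superstring of length 5, so <S, 5> is in the language and the certificate check fails for K = 4.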
REFERENCES

Elaine Rich, Automata, Computability and Complexity: Theory
and Applications (book).
http://en.wikipedia.org/wiki/Computational_biology
Thank you