secondary structure

Bioinformatics
Daniel Svozil
Laboratoř Chemie a informatiky
daniel.svozil@gmail.com
Not only small molecules and
QM, MM techniques rule the
world.
Central dogma of molecular biology
• Term is due to Francis Crick
• The conversion DNA →
protein is not direct, RNA is
involved
• DNA is the information store,
RNA is messenger (mRNA),
transporter (tRNA),
biomolecular nanomachine
(rRNA)
source: wikipedia.org
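The DNA → RNA → protein flow described above can be illustrated with a minimal sketch. It assumes the input is the coding (sense) strand, so transcription reduces to replacing T with U, and it uses a deliberately partial codon table that covers only the codons in the example. The function names and the example sequence are illustrative, not taken from the slides.

```python
# Partial codon table (assumption: only the codons needed for the example).
# "*" marks a stop codon.
CODON_TABLE = {"AUG": "M", "GAU": "D", "GAA": "E", "UGG": "W", "UAA": "*"}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGATGAATGGTAA")
print(mrna)             # AUGGAUGAAUGGUAA
print(translate(mrna))  # MDEW
```

A real translation would use the full 64-codon table; the partial table raises a KeyError for any codon it does not contain.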
Nucleic acids
• four letters (DNA, RNA)
• sequence - AACTAACG (5’ → 3’)
• DNA – double helix
• RNA – “single stranded” helix, folding (double helical
regions, C2’ -OH → secondary and tertiary motifs)
[Figure: nucleotide and nucleoside structure; A-DNA, B-DNA and Z-DNA helices]
RNA secondary motifs
Nowakowski and Tinoco, Seminars in Virology 8, 153, 1997.
RNA
source: http://complex.upf.es/~josep/RNA.jpg, http://www.biosci.ki.se/groups/ljo/images/phe_trna_large.jpg, http://rna.ucsc.edu/rnacenter/images/70s_atrna.jpg
Proteins
• 20 letters
• primary structure - sequence AMNTSSTVG (N-end → C-end)
Alberts, Molecular Biology
of the Cell, 5th Ed.
• secondary structure (random coil, α-helix,
β-sheet, loops)
• several secondary structure elements
form motifs
• e.g. greek key, β-α-β, HTH
• tertiary structure
(the arrangements of motifs into domain/s)
• quaternary structure (multimeric complexes)
Proteins
source:http://calstate.fullerton.edu/news/arts/2003/photos/protein-art.jpg
Proteins
source: Petsko, Ringe – Protein structure and function
http://www.cellsignal.com/reference/pathway/NF_kappaB.html
Systems biology
• focuses on the systematic study of complex interactions in
biological systems using a new perspective - holism
instead of reductionism
• holism – the properties of a system cannot be determined or
explained by its component parts alone
• one of the goals of systems biology is to discover new
emergent properties
• new field, boom since 2000, very little covered in CZ
Systems biology
source: wikipedia.org
Systems biology
• based on mathematical modelling of systems, control
theory, cybernetics
• engineering view on complex biological systems
• e.g. answers questions about the robustness of a given
system when one of its parts fails
• or about the response of a system to a change in
environmental conditions
quantum chemistry
molecular dynamics
bioinformatics
systems biology
Bioinformatics
• application of information technology to the field of
molecular biology, genomics and related biological
disciplines
• tremendous amount of data
• the creation and advancement of databases, algorithms,
computational and statistical techniques, and theory to
solve problems arising from the management and
analysis of biological data
According to the classification used by Russian scientists, we
distinguish two fields of paranormal phenomena: bioinformatics and
bioenergetics. Bioinformatics (i.e. extrasensory perception, ESP)
covers the acquisition and exchange of information by extrasensory
means (not through the normal sensory organs). Essentially, we
distinguish the following forms of bioinformation: hypnosis (control
of consciousness), telepathy, remote perception, precognition,
retrocognition, out-of-body experience, "seeing" with the hands
or other parts of the body, inspiration and revelation.
source: http://www.esoterika.cz/clanek/2992-mimosmyslova_spionaz_dalkove_pozorovani_i_.htm
Bioinformatics
• sequence analysis (sequence bioinformatics)
• structural analysis (structural bioinformatics)
• functional analysis (systems biology)
• genetic code
• gene
• genome, genomics
• large data sets
• high throughput
• human genome
• DNA localized mainly in nucleus, each nucleus carries the
whole genetic information
• 3.2 billion bp
• 25 000 – 30 000 genes
• ca. 1.5% codes for proteins, the rest is junk DNA
• what is proteome?
• proteomics
• Is it more difficult to study genome or proteome?
Sequence bioinformatics
• reconstruction of sequences from fragments
• searching for genes and other interesting regions in
the genome
• junk DNA – 95% of the human genome consists of non-coding
sequences, with either no function or a function not yet understood
• querying huge genomes for a given sequence
• comparison of genes within a species – similarities
between protein functions
• comparison of genes between species – organism's
evolutionary relationships (phylogenetic analysis)
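Querying a genome for a given sequence, one of the tasks listed above, can be sketched with a simple k-mer index. This is an illustrative approach for exact matches only, not the method of any particular tool; the function names, the choice of k = 3 and the short "genome" string are assumptions.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_exact(genome, index, k, query):
    """Return all 0-based start positions of exact occurrences of query."""
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    # Look up the first k-mer of the query, then verify each candidate hit.
    candidates = index.get(query[:k], [])
    return [pos for pos in candidates if genome.startswith(query, pos)]


genome = "ACGTCTGATACGCCGTATAGTCTATCT"
index = build_kmer_index(genome, 3)
print(find_exact(genome, index, 3, "GTCT"))  # [2, 19]
```

The index is built once and reused for many queries, which is the point of indexing a large genome rather than scanning it for every search.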
Sequence alignment
• Procedure of comparing sequences
• Point mutations – easy
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
gapless alignment
• More difficult example
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
• However, gaps can be inserted to get something like this
insertion × deletion
indel
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
gapped alignment
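The two alignments above can be inspected column by column. The sketch below counts matches, mismatches and gap columns for a pair of already aligned sequences; the function name is illustrative, and "-" is assumed to be the gap character, as on the slide.

```python
def compare_aligned(top, bottom, gap="-"):
    """Count (matches, mismatches, gap columns) of two aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# Gapless alignment: only point mutations.
print(compare_aligned("ACGTCTGATACGCCGTATAGTCTATCT",
                      "ACGTCTGATTCGCCCTATCGTCTATCT"))  # (24, 3, 0)

# Gapped alignment: indels are represented by "-".
print(compare_aligned("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))  # (18, 2, 7)
```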
Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment
Flavors of sequence alignment
global alignment × local alignment
global: align entire sequence
local: stretches of sequence with the highest density of
matches are aligned, generating islands of matches or
subalignments in the aligned sequences
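The difference between global and local alignment can be made concrete with a small dynamic-programming sketch. The slides do not name an algorithm; the code below follows the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen for illustration.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the optimal global (default) or local alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a subalignment may restart anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global score sits in the last cell; local score is the best cell overall.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "ACGT"))                            # 4
print(alignment_score("TTTTACGTTTT", "GGGACGGGG", local=True))    # 3
```

In the second call only the shared island "ACG" contributes to the local score, whereas a global alignment of the same pair would also have to pay for the unrelated flanking regions.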
Scoring systems I
• DNA and protein sequences can be aligned so that the
number of identically matching pairs is maximized.
A T T G - - - T
A - - G A C A T
• Counting the number of matches gives us a score (3 in
this case). Higher score means better alignment.
• This procedure can be formalized using a substitution
matrix.
Identity matrix

    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1
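Scoring with a substitution matrix can be sketched as a lookup per alignment column. The code below builds the identity matrix for the four nucleotides and scores the example alignment from the slide. It assumes that gap columns contribute 0, which reproduces the simple match count; the names are illustrative.

```python
ALPHABET = "ATCG"

# Identity matrix: 1 for identical letters, 0 otherwise.
IDENTITY = {(x, y): int(x == y) for x in ALPHABET for y in ALPHABET}


def score_alignment(top, bottom, matrix, gap="-", gap_score=0):
    """Sum the matrix scores over all columns of an aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total += gap_score  # assumption: gaps are not penalised here
        else:
            total += matrix[(a, b)]
    return total


print(score_alignment("ATTG---T", "A--GACAT", IDENTITY))  # 3
```

Swapping IDENTITY for a different matrix changes the scoring scheme without touching the scoring function.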
Scoring systems II
• For nucleotide sequences, an identity matrix is usually good enough.
• For protein sequences, an identity matrix is not sufficient to
describe biological and evolutionary processes.
• This is because amino acids are not all exchanged with the same
probability, as one might assume theoretically.
• For example, substitution of aspartic acid (D) by glutamic acid (E)
is frequently observed, whereas a change from aspartic acid to
tryptophan (W) is very rare.
• Why is that?
1. Triplet-based genetic code: GAT (D) → GAA (E) needs one nucleotide
change, GAT (D) → TGG (W) needs three.
2. D and E have similar properties, but D and W differ
considerably. D is hydrophilic, W is hydrophobic; a D → W mutation
can greatly alter the 3D structure and consequently the function.
Zvelebil, Baum, Understanding bioinformatics.
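The codon argument above can be checked with a short sketch that counts how many nucleotides differ between two codons, and the minimum over all codons of two amino acids. The codon lists are the standard DNA codons for D, E and W; the function names are illustrative.

```python
# Standard DNA codons for the three amino acids discussed above.
CODONS = {
    "D": ["GAT", "GAC"],  # aspartic acid
    "E": ["GAA", "GAG"],  # glutamic acid
    "W": ["TGG"],         # tryptophan
}


def nucleotide_changes(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def min_changes(amino_a, amino_b):
    """Fewest nucleotide changes needed to turn one amino acid into another."""
    return min(nucleotide_changes(a, b)
               for a in CODONS[amino_a] for b in CODONS[amino_b])


print(nucleotide_changes("GAT", "GAA"))  # 1
print(nucleotide_changes("GAT", "TGG"))  # 3
print(min_changes("D", "E"))             # 1
print(min_changes("D", "W"))             # 3
```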
Substitution matrices
Positive score – frequency of substitutions is greater than would
have occurred by random chance.
Zero score – frequency is equal to that expected by chance.
Negative score – frequency is less than would have occurred by
random chance.
[Figure legend, amino acid groups: small, polar; small, nonpolar;
polar or acidic; basic; large, hydrophobic; aromatic]
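The sign convention above matches a log-odds score: the logarithm of the observed pair frequency divided by the frequency expected by chance. The sketch below assumes ordered pairs, so the chance expectation is simply the product of the two background frequencies, and uses a base-2 logarithm. The frequencies in the example are made-up values, not data from any real matrix.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b):
    """log2 of observed pair frequency over the frequency expected by chance."""
    expected = freq_a * freq_b
    return math.log2(observed_pair_freq / expected)


# Hypothetical background frequencies of 0.25 each, so expected = 0.0625.
print(log_odds_score(0.125, 0.25, 0.25))    # 1.0  (more frequent than chance)
print(log_odds_score(0.0625, 0.25, 0.25))   # 0.0  (as frequent as chance)
print(log_odds_score(0.03125, 0.25, 0.25))  # -1.0 (less frequent than chance)
```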
Sequence database search
BLAST: the Google of the sequence world
Phylogenetic analysis
Structural bioinformatics
• the function of a chemical moiety is determined by its structure
• while DNA structure is “given” (double-helix), RNA and
proteins can adopt very different conformations
(i.e. specific arrangements of atoms in 3D space)
• structural bioinformatics covers
• analysis of nucleic acid (NA) and protein structure
• prediction of structure from the sequence
Protein structure prediction
• secondary structure prediction
• the conformational state of each residue is predicted as H (helix), E
(extended, β-sheet), C (coil)
• accuracy: 80%
• tertiary structure prediction
• why?
• many sequences are known, not that many 3D structures have been
solved
• some proteins (e.g. transmembrane) are difficult to characterize
experimentally
• many proteins have known function, but unknown structure (which is
however needed to understand their behavior in detail)
• ab initio, threading, homology modelling
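The accuracy quoted for secondary structure prediction can be read as a per-residue measure. The sketch below assumes that reading (often called Q3): the fraction of residues whose predicted state H, E or C equals the observed one. The function name and the two state strings are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    if not set(predicted + observed) <= set("HEC"):
        raise ValueError("states must be H, E or C")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical 10-residue example with one wrongly predicted residue.
predicted = "CHHHHCEEEC"
observed = "CHHHCCEEEC"
print(q3_accuracy(predicted, observed))  # 0.9
```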
CASP
• Critical Assessment of Structure Prediction
• http://predictioncenter.org/
• since 1994, every 2 years, CASP10 in
preparation
• predict solved, but not publicly released
structures
• competition of individual groups in 3D prediction:
• human groups – answer in 14 days
• software (automated prediction) – answer in 48 hours