BioTechnological Center
TU Dresden Biotec
• Molecular biology primer
• The role of computer science
• Phylogeny
• Sequence Searching
• Protein structure
• Clinical implications
• Read chapter 1
By Michael Schroeder, Biotec, 2
• 1953: Watson and Crick discover the structure of DNA
• 2000: Draft of the human genome is announced
  • “The most wondrous map ever produced by humankind”
  • “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”
• Microarrays
  • Measure activity of thousands of genes at the same time
  • Example: Cancer
    • Compare activity with and without drug treatment
  • Result: Hundreds of candidate drug targets
• RNAi (Nobel Prize 2006, Fire and Mello)
  • Knock down genes and observe effect
  • Example: Infectious diseases
    • Which proteins orchestrate entry into the cell?
  • Result: Hundreds of candidate proteins
• Atomic force microscopes (Binnig, Nobel Prize laureate)
  • Pull protein out of membrane and measure force
  • Example: Eye diseases resulting from misfolding
  • Result: Hundreds of candidate residues
• Challenge: Longer time to market, fewer drugs, exploding costs
• Approach: Use of compound libraries and high-throughput screening
• High-throughput technologies have completely changed the work of biomedical researchers
• Challenge: Interpret (often large) results of screens
• Approach: Before running secondary assays, use bioinformatics and IT to assemble all available information
[Figure: growth of biological data. >1,000,000 sequences; >30,000 3D structures; >16,000,000 articles (number of PubMed abstracts by year, 1960 to 2010); >700 databases/tools (Molecular Biology Database List at Nucleic Acids Research, by year, 2000 to 2005).]
• How to analyse data, how to integrate data?
• Computer science to the rescue…
• Human genome is a string of length 3,200,000,000
• Shotgun sequencing: Break multiple copies of string into shorter substrings
• Example:
  • shotgunsequencing shotgunsequencing shotgunsequencing
  • cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
• Computing problem: Assemble strings
Fragments placed by their overlaps:
shotgunsequencing
sh
sho
shot
  otgu
   tg
    gun
     un
      ns
       seq
       sequ
        equ
          uenc
           encing
           en
             cing
              ing
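The assembly problem can be sketched as greedy overlap merging: repeatedly join the two fragments with the longest suffix-prefix overlap. This is a minimal illustration only; the function names `overlap` and `greedy_assemble` and the four fragments are hypothetical, and real assemblers use far more elaborate methods.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Merge fragments greedily by largest overlap (illustrative sketch)."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]


# Hypothetical fragments of the slide's example string.
print(greedy_assemble(["sequenc", "shotgun", "encing", "gunsequ"]))
```

When a repeat is longer than the fragments, several overlaps look equally good, and a greedy merge of this kind can join the wrong pieces.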
QUESTION: How can you handle long repetitive sequences?
Heeeeelllllllllllooooooo
QUESTION: Why was a draft announced? When was the final version ready?
Yersinia pestis
Arabidopsis thaliana
Buchnera sp. APS
Aquifex aeolicus
Archaeoglobus fulgidus
Borrelia burgdorferi
Mycobacterium tuberculosis
Caenorhabditis elegans
Campylobacter jejuni
Chlamydia pneumoniae
Vibrio cholerae
Drosophila melanogaster
Escherichia coli
Thermoplasma acidophilum
Helicobacter pylori
Mycobacterium leprae
mouse
Neisseria meningitidis Z2491
Plasmodium falciparum
Pseudomonas aeruginosa
Ureaplasma urealyticum
rat
Rickettsia prowazekii
Saccharomyces cerevisiae
Salmonella enterica
Bacillus subtilis
Thermotoga maritima
Xylella fastidiosa
Breakthrough of the year 2000
Next quest:
Sequencing a genome for $1000
• Understand integrative aspects of the biology of organisms
• Interrelate sequence, three-dimensional structure, interactions, and function of proteins, nucleic acids and protein-nucleic acid complexes
• Travel in time
  • backward (deduce events in evolutionary history) and
  • forward (deliberate modification of biological systems)
• Applications in medicine, agriculture, and other scientific fields
• New virus (e.g. SARS) and goal to develop treatment
• Scientists isolate genetic material of virus
• Screen genome for relationships with previously studied viruses [10]
• From the virus’ genome sequence they compute the proteins it produces [1]
• Compute proteins’ three-dimensional structure and thereby obtain clues about their functions
  • Screen for similar protein sequences with known structure [15]
  • If any are found
    • Then interpret difference (homology modelling) [25]
    • Else predict structure from sequence [55]
• Identify or design small molecule blocking relevant active sites of the protein [50]
• Design antibodies to neutralize the virus [50]
• Index of problem difficulty (numbers in brackets):
  • <30: solution exists already,
  • >30: we cannot solve this (yet)
• Life
  • A biological organism is a naturally occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
• Time
  • Species evolve through
    • natural mutation,
    • recombination of genes in sexual reproduction, or
    • direct gene transfer
  • Read the past in contemporary genomes
• Space
  • Species occupy local ecosystems
  • Species are composed of organisms
  • Organisms are composed of cells
  • Cells are composed of molecules
• 20 naturally occurring amino acids in proteins
• Non-polar
  • G glycine, A alanine, P proline, V valine
  • I isoleucine, L leucine, F phenylalanine, M methionine
• Polar
  • S serine, C cysteine, T threonine, N asparagine
  • Q glutamine, H histidine, Y tyrosine, W tryptophan
• Charged
  • D aspartic acid, E glutamic acid, K lysine, R arginine
• Other classification
  • H, F, Y, W are aromatic and play a role in membrane proteins
• Distinguish
  • atg = adenine-thymine-guanine and
  • ATG = Alanine-Threonine-Glycine
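The grouping and the lower-case/upper-case convention above can be written down as a small lookup. This is a sketch under the slide's own classification; the names `AMINO_ACID_CLASSES`, `alphabet_of` and `class_composition` are made up for illustration.

```python
from collections import Counter

# One-letter codes grouped exactly as on the slide.
AMINO_ACID_CLASSES = {
    "non-polar": "GAPVILFM",
    "polar": "SCTNQHYW",
    "charged": "DEKR",
}
CLASS_OF = {aa: cls
            for cls, letters in AMINO_ACID_CLASSES.items()
            for aa in letters}


def alphabet_of(seq):
    """Apply the slide's convention: lower case = nucleotides, upper case = amino acids."""
    if seq and seq.islower() and set(seq) <= set("acgt"):
        return "nucleotide"
    if seq and seq.isupper() and set(seq) <= set(CLASS_OF):
        return "protein"
    return "unknown"


def class_composition(protein):
    """Count residues per class for an upper-case one-letter protein sequence."""
    return Counter(CLASS_OF[aa] for aa in protein)


print(alphabet_of("atg"), alphabet_of("ATG"))
print(class_composition("KESPAMKF"))
```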
First position                Second position                 Third position
(5' end)        T          C          A          G           (3' end)
T               TTT Phe    TCT Ser    TAT Tyr    TGT Cys     T
                TTC Phe    TCC Ser    TAC Tyr    TGC Cys     C
                TTA Leu    TCA Ser    TAA Stop   TGA Stop    A
                TTG Leu    TCG Ser    TAG Stop   TGG Trp     G
C               CTT Leu    CCT Pro    CAT His    CGT Arg     T
                CTC Leu    CCC Pro    CAC His    CGC Arg     C
                CTA Leu    CCA Pro    CAA Gln    CGA Arg     A
                CTG Leu    CCG Pro    CAG Gln    CGG Arg     G
A               ATT Ile    ACT Thr    AAT Asn    AGT Ser     T
                ATC Ile    ACC Thr    AAC Asn    AGC Ser     C
                ATA Ile    ACA Thr    AAA Lys    AGA Arg     A
                ATG Met*   ACG Thr    AAG Lys    AGG Arg     G
G               GTT Val    GCT Ala    GAT Asp    GGT Gly     T
                GTC Val    GCC Ala    GAC Asp    GGC Gly     C
                GTA Val    GCA Ala    GAA Glu    GGA Gly     A
                GTG Val    GCG Ala    GAG Glu    GGG Gly     G
* ATG also serves as the start codon
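The codon table translates directly into a lookup. The sketch below builds the standard table from the T, C, A, G ordering used above and translates a DNA string codon by codon until a stop codon; `translate` and the example sequences are illustrative names and inputs, not part of the lecture material.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in table order: first, second, then third position each run over T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA (reading frame starting at position 0) up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("atggcctaa"))  # atg = Met, gcc = Ala, taa = Stop
```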
• DNA:
  • Nucleotides are very similar and hence the structure of DNA is very uniform
• Proteins:
  • Great variety in three-dimensional conformation to support diverse structures and functions
  • If heated, a protein “unfolds” to a biologically inactive structure; under normal conditions the protein folds
• Translation from DNA sequence to amino acid sequence
  • is very simple to describe,
  • but requires immensely complicated machinery (ribosome, tRNA)
• The folding of the protein sequence into its three-dimensional structure
  • is very difficult to describe,
  • but occurs spontaneously
• Databases in molecular biology cover
  • Nucleic acid and protein sequences,
  • Macromolecular structures and functions
• Archival databanks of biological information
  • DNA and protein sequences including annotations
  • Nucleic acid and protein structures including annotations
  • Protein expression patterns
• Derived databases
  • Sequence motifs (“signatures” of protein families)
  • Mutations and variants in DNA and protein sequences
  • Classification or relationships (e.g. hierarchy of structures)
• Bibliographic databases (PubMed with 17M abstracts)
• Collections
  • of links to web sites
  • of databases
• Bioinformatics is the marriage of biology and information technology
• Bioinformatics is an integrated multidisciplinary field
• Covers computational tools and methods for managing, analysing and manipulating sets of biological data
• Disciplines include:
  • biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design
• Bioinformatics has three components
  • Creation of databases
  • Development of algorithms to analyse data
  • Use of these tools for analysing biological data
1. Given a sequence (fragment), find sequences in the database that are similar to it
2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
4. Given a protein structure, find sequences in the database that correspond to similar structures
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures.
But how?
• Easy: Find similar sequences with known structure!
• But: There might be similar structures whose sequence is not similar!
4. Given a protein structure, find sequences in the database that correspond to similar structures.
But how?
• Easy: Find similar structures and hence sequences
• But: There are so many more sequences with unknown structure that the above method will have only very limited success
• 1 and 2 are solved, 3 and 4 are active fields of research
• E.g. for which proteins of known structure that are involved in diseases of disrupted purine biosynthesis in humans are there related proteins in yeast?
• Solution: Virtual databases that provide transparent access to a number of underlying data sources and query and analysis tools
• Problems:
  • Given that there are primary and secondary databases,
    • how to control updates,
    • how to propagate change,
    • how to maintain consistency?
• Contents (experimental results, annotations, supplementary information) all have their own sources of error
• Older data were limited by older techniques
• Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations
  • Source of data
  • Investigators responsible
  • Relevant publication
  • Feature tables (e.g. coding regions)
• Problems:
  • (often) lack of controlled and coherent vocabulary
  • Annotations must be computer-parseable
  • Automated annotation needed
    • SwissProt = ca. 540,000 annotated sequences
    • TrEMBL = ca. 40 million unannotated sequences
  • Maintenance of annotations (what if an error is detected?)
• Relevant areas:
  • Artificial Intelligence
  • Machine Learning
    • Neural networks, rule-based learning
  • Data mining
    • Association rules
  • Software Engineering
    • Design, implementation, testing of software
  • Programming
    • Object-oriented: C++, Java
    • Imperative: C, Modula, Pascal, Cobol, Fortran
    • Logic: Prolog
    • Functional: ML
    • Scripting: Perl, Python
  • Statistics
  • Database theory
    • Design and maintenance of databases
    • How to index sequences, time series, 3D structures
  • Information Visualisation
    • Graph drawing, diagrams, cartoons, 3D graphics
  • Algorithm design
    • Complexity of algorithms
    • Efficient data structures
• We will use Python
  • Scripting language
  • Supports string processing well
  • Widely used in bioinformatics
• Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
• Generally only genus and species are used for identification
  • Homo sapiens
  • Drosophila melanogaster
  • Bos taurus
• Linnaeus’ classification is based on observed similarity
• Widely reflects biological ancestry
            Human        Fruit fly
Kingdom:    Animalia     Animalia
Phylum:     Chordata     Arthropoda
Class:      Mammalia     Insecta
Order:      Primates     Diptera
Family:     Hominidae    Drosophilidae
Genus:      Homo         Drosophila
Species:    sapiens      melanogaster
• Characteristics derived from a common ancestor are called homologous
  • E.g. eagle’s wing and human’s arm
• Other apparently similar characteristics may have arisen independently by convergent evolution
  • E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
• Homologous characters may diverge functionally
  • E.g. bones in the human middle ear and jaws of primitive fish
• Sequence analysis gives unambiguous evidence for the relationship of species
• For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
• For microorganisms there are problems
  • Classical methods: how to describe features
  • Sequence analysis: lateral gene transfer
• Ribosomal RNA is present in all organisms
• Based on small-subunit (16S) ribosomal RNA, life is divided into
  • Bacteria
    • No nucleus (prokaryote)
    • E.g. Mycobacterium tuberculosis and E. coli
  • Archaea
    • No nucleus (prokaryote)
    • Few organisms, living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
  • Eukarya
    • Has a nucleus enclosed in a membrane
    • Nucleus contains chromosomes
    • Internal compartments called organelles for specialised biological processes
    • Area outside nucleus and organelles called cytoplasm
    • E.g. yeast and human beings
Use ExPASy (www.expasy.ch) to search for pancreatic ribonuclease for horse (Equus caballus), minke whale (Balaenoptera acutorostrata), red kangaroo (Macropus rufus)
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
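A record in this FASTA format (a header line starting with `>`, then the sequence wrapped over several lines) is easy to read with a few lines of Python. The sketch below is a minimal parser, not a library API; `parse_fasta` is a name chosen for illustration and the embedded record is the horse sequence shown above.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record.
    return {h: "".join(parts) for h, parts in records.items()}


HORSE = """>sp|P00674|RNP_HORSE Ribonuclease pancreatic
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
"""

for header, seq in parse_fasta(HORSE).items():
    accession = header.split("|")[1]
    print(accession, len(seq))
```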
Use sequence alignment to determine evolutionary relationship
1. Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
2. Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
3. Motif search: find matches of a short sequence in a long sequence
Option: perfect, 1 mismatch, mismatches+gaps+insertions+deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
4. Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.
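A global match as in item 1 can be computed by dynamic programming. The sketch below follows the classic Needleman-Wunsch scheme; the scoring values (match +1, mismatch -1, gap -1) and the name `global_align` are assumptions made for illustration, not the parameters of any tool named in these slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


s, x, y = global_align("we.ripe.and.ripe", "we.rot.and.rot")
print(s)
print(x)
print(y)
```

A local match differs mainly in that scores are not allowed to drop below zero and the traceback starts from the best-scoring cell rather than the corner.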
Use sequence alignment to determine evolutionary relationship…
Example: horse, whale and kangaroo
Expected: horse and whale are placental mammals, kangaroo is marsupial
Multiple alignment with CLUSTAL W (http://www.genome.jp/tools/clustalw)
Multiple sequence alignment computer program
Main parameters: gap opening/extension penalty
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
CLUSTAL W (1.82) multiple sequence alignment
Rows in each block, top to bottom: sp|P00674|RNP_HORSE, sp|P00673|RNP_BALAC, sp|P00686|RNP_MACRU, followed by the conservation line
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
-ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
*:** **:*****: :......*** ** *.**.* ***:***:**. *.*:* *
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
:*: ****::***:*.* : **:** *..****** *:**: :::******* ******
DASVEVST 128
DNSV---- 124
DAYV---- 122
* *
Horse and Minke whale: 95
Minke whale and Red kangaroo: 82
Horse and Red kangaroo: 75
Conclusion: Horse and whale share the most identical residues
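Counts of this kind come from comparing the aligned rows column by column. The sketch below counts identical residues for each pair; `identical_residues` is an illustrative name, and only the first 16 columns of the CLUSTAL W alignment are used to keep the example short, so the printed numbers are not the full-length counts quoted above.

```python
from itertools import combinations


def identical_residues(row_a, row_b):
    """Count alignment columns where both rows carry the same residue (gaps never count)."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# First 16 alignment columns only (illustrative excerpt).
ALIGNED = {
    "horse":    "KESPAMKFERQHMDSG",
    "whale":    "RESPAMKFQRQHMDSG",
    "kangaroo": "-ETPAEKFQRQHMDTE",
}

for a, b in combinations(ALIGNED, 2):
    print(a, b, identical_residues(ALIGNED[a], ALIGNED[b]))
```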
Mitochondrial cytochrome b from
Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost
African elephant (Loxodonta africana)
Indian elephant (Elephas maximus)
Q: To which one is the mammoth more closely related?
African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA
MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
*** ** ***:**:**********************************************
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
************************************************************
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
**************************************:*********************
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
:********:***********************************************:**
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
******************************************************:*****
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
**:*************************: *** **********:***************
IILAFLPIAGVIENYLIK 378
IILAFLPIAGMIENYLIK 378
IILAFLPIAGMIENYLIK 378
**********:*******
Mammoth and African elephant have 10 mismatches,
Mammoth and Indian elephant 14.
Significant?
Q1: Can we tell from these sequences alone that they are closely related?
Q2: Differences are small. Do they come from selection, random noise, or drift?
Strategies are needed for judging similarities
Important difference:
Similarity is the measurement of resemblance of sequences
Homology: common ancestor
Similarity is gradual, homology is either true or false
Similarity = now, homology = past events
Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)
Homology is inferred from sequence similarity
The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
BUT:
1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences will occur in many species (in fact, this is not the case here)
2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)
Mammoth and elephant sequences are homologues; are the ribonuclease sequences also homologues? The difference there is much bigger
Classical methods confirm that for pancreatic ribonuclease (horse, whale, kangaroo) inferring homology from similarity is justified
But whether mammoths are closer to African or Indian elephants is too close to call (non-significant)
Problems with inferring phylogeny from gene and protein sequence comparison
Wide range of variation (possibly below statistical significance)
Different rates of evolution for different branches of the evolutionary tree
Even if there is a relationship: which sequence came first?
Phylogeneticist’s dream of features:
‘all-or-none’ character
Irreversible appearance
Solution:
SINES and LINES (Short and Long Interspersed Nuclear Elements)
Repetitive, non-coding sequences in eukaryotic genomes
>30% in human genome, >50% in some plants
SINES = 70-500 base pairs long, up to 10^6 copies
LINES up to 7000 base pairs, up to 10^5 copies
They enter the genome by reverse transcription of RNA
The picture shows a Southern blot of DNA from different family members, probed using a mini-satellite.
You can work out which of F1 and F2 is the father of child C, by observing which bands they have in common.
(Reproduced from "Essential Medical Genetics" by M. Connor and M. Ferguson-Smith, with permission from Blackwell Science.)
Either present or absent
Inserted at random in the non-coding portion of the genome, i.e. a SINE has no important function, so convergent evolution can be excluded
Presence of a SINE in two species and absence in a third implies that first two species are more closely related
SINE insertion appears to be irreversible
Temporal order
Presence of a SINE in two species and absence in a third implies that ancestor of first two species is younger than ancestor of all three
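The presence/absence argument can be sketched with sets. For each pair of species, count the insertions present in both and absent from every other species; the pair with the most such insertions is taken as the most closely related. The species sets and locus names below are entirely hypothetical, as are the function names.

```python
from itertools import combinations

# Hypothetical presence data: species -> set of SINE insertion loci observed.
PRESENCE = {
    "whale": {"locus1", "locus2", "locus3", "locus4", "locus5"},
    "hippo": {"locus1", "locus2", "locus3", "locus4", "locus5"},
    "cow":   {"locus5"},
}


def shared_only_by(presence, a, b):
    """Loci present in both a and b and absent from all other species."""
    others = set()
    for species, loci in presence.items():
        if species not in (a, b):
            others |= loci
    return (presence[a] & presence[b]) - others


def closest_pair(presence):
    """Pair of species sharing the most insertions that no other species has."""
    pairs = combinations(sorted(presence), 2)
    return max(pairs, key=lambda p: len(shared_only_by(presence, *p)))


print(closest_pair(PRESENCE))
print(sorted(shared_only_by(PRESENCE, "whale", "hippo")))
```

In this toy data, locus5 is shared by all three species and so would have been inserted in the older common ancestor; it carries no information about which two species are closer.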
Q: What is the closest land-based relative of the whales?
Classical palaeontology links Cetacea (whales, dolphins, porpoises) with Artiodactyla (including e.g. cattle)
Belief that Cetaceans diverged before Artiodactyla split into the suborders
Suiformes (e.g. pigs),
Tylopoda (e.g. camels, llamas),
Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
Sequence comparison results
Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen, and others
Closest relatives of whales are hippopotamuses (share 4 SINES)
These two are closest to Ruminantia
Any search method for sequences should be
Sensitive: pick up distant relationships
Selective: reported relationships are true
Example: database with (among others) 1000 globin sequences
Globin family (oxygen transport) of proteins occurs in many species
Proteins have same function and structure
But there are pairs of members of the family sharing less than 10% identical residues
[Figure: sequence database containing 1000 globin sequences; a search returns 900 results]
True positives: 700 out of 900 are really globins
False positives: 200 out of 900 are not globins
False negatives: 300 out of 1000 are not found
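The numbers in this example turn into two ratios. Sensitivity is the fraction of all globins that the search found; selectivity, in the sense used here (reported relationships are true), is the fraction of reported hits that really are globins. The function name below is illustrative.

```python
def search_quality(true_positives, false_positives, false_negatives):
    """Return (sensitivity, selectivity) for a database search."""
    sensitivity = true_positives / (true_positives + false_negatives)
    selectivity = true_positives / (true_positives + false_positives)
    return sensitivity, selectivity


# Numbers from the globin example: 700 true hits, 200 false hits, 300 globins missed.
sens, sel = search_quality(700, 200, 300)
print(f"sensitivity = {sens:.2f}, selectivity = {sel:.2f}")
```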
How can we find distant relationships without increasing the false positives?
PSI-BLAST: Position-Specific Iterated Basic Local Alignment Search Tool
Identifies conserved patterns within the sequences
Improves sensitivity and specificity
Score via intermediaries may be better than score from direct comparison
[Figure: A-B: 50%, B-C: 50%, A-C: only 10%]
Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species (human, Drosophila, etc.)
Transcription factor for eye development
Mutations in:
Human: no or deformed iris
Drosophila: no eyes; when expressed in wing or leg, ectopic eyes
PSI-BLAST at NCBI site (www.ncbi.nlm.nih.gov)
• Description of sequence
• Max score: linked to data that show where sequences match
• Total score: includes scores from non-contiguous portions of the subject sequence that match the query
• Query coverage
• Identity: highest percentage of identical residues among the aligned segments
• E-value
• Accession number: linked to GenBank record
BLASTP 2.2.28+
RID: 6D2U321501N
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
33,121,465 sequences; 11,555,699,950 total letters
Query= gi|6174889|sp|P26367.2|PAX6_HUMAN RecName: Full=Paired box protein
Pax-6; AltName: Full=Aniridia type II protein; AltName:
Full=Oculorhombin
Length=422
                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

ref|NP_000271.1| paired box protein Pax-6 isoform a [Homo sap...   870    0.0
ref|XP_004264012.1| PREDICTED: paired box protein Pax-6 isofo...   869    0.0
ref|XP_003910122.1| PREDICTED: paired box protein Pax-6 isofo...   869    0.0
ref|XP_004683008.1| PREDICTED: paired box protein Pax-6 isofo...   869    0.0
ref|XP_005064880.1| PREDICTED: paired box protein Pax-6 isofo...   868    0.0
ref|NP_001035735.1| paired box protein Pax-6 [Bos taurus] >re...   868    0.0
gb|AAA59962.1| oculorhombin [Homo sapiens]                         868    0.0
ref|NP_037133.1| paired box protein Pax-6 [Rattus norvegicus]...   868    0.0
gb|EAW68233.1| paired box gene 6 (aniridia, keratitis), isofo...   869    0.0
...
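Hit lines of this shape can be split into identifier, description, bit score and E-value with plain string operations. This is a sketch that assumes the layout shown above (score and E-value as the last two whitespace-separated fields); `parse_hit` is an illustrative helper, not part of BLAST or any library.

```python
def parse_hit(line):
    """Split one hit line into database tag, accession, description, bit score and E-value."""
    left, bits, evalue = line.rsplit(None, 2)      # last two fields are the numbers
    ident, _, description = left.partition(" ")   # e.g. 'gb|AAA59962.1|'
    db, accession = ident.split("|")[:2]
    return {
        "db": db,
        "accession": accession,
        "description": description.strip(),
        "bits": float(bits),
        "evalue": float(evalue),
    }


hit = parse_hit("gb|AAA59962.1| oculorhombin [Homo sapiens]    868    0.0")
print(hit["accession"], hit["bits"], hit["evalue"])
```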
Proteins play a variety of roles:
Structural (viral coat proteins, horny outer layer of human and animal skin, cytoskeleton)
Catalysis of chemical reactions (enzymes)
Transport and Storage (e.g. haemoglobin)
Regulation (e.g. hormones)
Receptor and signal transduction
Genetic transcription
Recognition (cell adhesion molecules)
Antibodies and other proteins of the immune system
Proteins are large molecules
Only a small part, the active site, is functional
Evolve by structural changes produced by mutations in the amino acid sequence
Ca. 21,000 human protein structures are now known
Overall 90,000 protein structures in PDB
Can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)
Backbone and side chain
Residues i-1, i, i+1 (left to right)
 S(i-1)  S(i)  S(i+1)      Side chain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C…     Main chain (constant)
      ||     ||     ||
      O      O      O
Polypeptide chain folds into a curve in space
Common structural features
Alpha-helix
Beta-sheet
Turns and Loops
Primary structure: Amino acid sequence
Secondary structure: Helices, sheets, loops, hydrogen-bonding pattern of main chain
Tertiary structure: Assembly and interactions of helices, sheets, etc.
Quaternary structure: Assembly of monomers
Evolution can merge proteins
E.g.: 5 enzymes in E. coli = 1 protein in the fungus Aspergillus nidulans; they catalyze successive steps in the biosynthesis of aromatic amino acids
E.g.: Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis
Converts DHAP to GAP in glycolysis
Triosephosphate isomerase from Bacillus stearothermophilus
Highly efficient enzyme appearing in most species
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
= Patterns of interaction between helices and sheets
Supersecondary structures:
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
Domains:
Compact unit, single chain, independent stability
Modular proteins:
Multi-domain
Copies of related domains or “mix-and-match”
All Alpha: mostly alpha helices
All Beta: mostly beta sheets
Alpha+Beta: Helices and sheets in different parts of the molecule, no beta-alpha-beta units
Alpha/Beta: Helices and sheets assembled from beta-alpha-beta units
  Alpha/Beta linear
  Alpha/Beta barrel
Little or no secondary structure
[Figure: structure classification hierarchy, from the top down]
CLASS: All alpha (284), All Beta (174), Alpha+Beta (376), Alpha/Beta (147)
FOLD: Trypsin-like serine proteases (1), Immunoglobulin-like (23)
SUPERFAMILY (= evolutionarily related, similar structure, not necessarily similar sequence): Transglutaminase (1), Immunoglobulin (6)
FAMILY (= set of domains with similar sequence): C1 set domains (antibody constant), V set domains (antibody variable)
Engrailed homeodomain (1enh)
Transcription factor important in development
Used to study protein folding
Utrophin calmodulin homology domain (1bhd)
Actin binding
Closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
Electron transport across mitochondrial membrane
DNA-binding domain of HIN recombinase (1hcr)
Engrailed homeodomain (1enh)
Fibronectin III domain (1fna)
Found on cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
Cleaves RNA and is lethal if intracellular and not inhibited by barstar
TATA-box-binding protein (1cdw)
OB-domain from Lys-tRNA synthetase (1bbw)
Scytalone dehydratase (3std)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
Breakdown of alcohol into simpler compounds
Adenylate kinase (3adk)
Energy production
Chemotaxis receptor methyltransferase (1af7)
Pancreatic spasmolytic polypeptide (2psp)
Thiamine phosphate synthase (2tps)
If sequence of amino acids contains enough information to specify three-dimensional structure of proteins, it should be possible to devise algorithm for prediction
Secondary structure prediction: Which segments of the sequence are helices, which are strands?
Fold recognition: Given a library of known structures with their sequences and a sequence with unknown structure, can we find the structure that is most similar?
Homology modelling: Given two homologous sequences, one with and one without known structure. If between 30 and 50% of the residues are identical, the known structure can serve as a model
Chicken lysozyme KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES
Chicken lysozyme TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD
Chicken lysozyme DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL-
Baboon alpha-lactalbumin I--KGIDYWIAHKALC-TEKL-EQWL--CE-K
Fast and reliable diagnosis of disease and risk:
Easy diagnosis (with symptoms)
In advance of appearance (e.g. Huntington)
In utero diagnosis (e.g. cystic fibrosis: thick secretions in lung)
Genetic counselling
Customized treatment (predict response to therapy/side effects)
E.g. childhood leukaemia is treated with toxic drug 6-mercaptopurine.
Small fraction of patients used to die as they lack enzyme thiopurine methyltransferase.
Identify drug targets
Nowadays targets are: ½ receptors, ¼ enzymes, ¼ hormones
7% have unknown targets
Gene therapy
Replace defective genes or supply gene products (insulin for diabetes and Blood Factor VIII for haemophilia)
However : Most diseases do not have a single genetic cause!
By now you should
Have read chapter 1
Know the main data sources (sequence and structure)
Know the role that bioinformatics plays
Understand the difference between homology and similarity
Understand what sequence comparison and alignment are
Understand how they can be useful for phylogenetic studies
Understand primary, secondary, tertiary structure
Be able to assess the assumptions made and the quality of data