Introduction, Basic Biology

Computational Biology
Introduction, Basic Biology
Nives Skunca
Slides prepared by Dr. Christophe Dessimoz
19/21 September 2012
This week
Course introduction
Basic Biology
[Figure: the modeling cycle. Labels: Reality/Nature; observation; perturbation; Catalogue (Georg Dionysius Ehret's illustration of Linnaeus's sexual system of plant classification, 1736); Model f(x) (formulate/select, estimate); validate on real data; validate by simulation; prediction; take it apart (“in vitro”); recreate life (“synthetic biology”).]
Learning Outcomes
• Understand basic concepts of molecular biology
• Understand and apply fundamental models, algorithms, data structures, and computational techniques to answer biological questions
• Wide range of topics, but special focus on biological sequences and their evolutionary context.
Topics
Molecular Genetics
Gene Evolution
Genome Evolution
Mass Spectrometry
Codon Bias
x
Modeling
Dynamic programming
Markov models
Least squares
Maximum Likelihood
Optimization
Heuristics
Simulation
Organization
• Lecture
  • Wed 13-14 (CAB G52), Fri 13-15 (ML F34)
  • Prof. Gonnet will hold the lectures
• Exercises:
  • Thu 14-16 (CAB H56), starting this week
  • If you do not have a nethz account, ask Stefan Zoller as soon as possible.
• Teaching Assistants
  • Stefan Zoller
  • Nives Skunca
Date            Topic                                                          Lecturer
Sept. 19/21     Course Introduction; Basic Molecular Biology                   NS
Sept. 26/28     Markov models/String Alignment I                               GHG
Oct. 3/5        String Alignment II (indels, estimating distances)             GHG
Oct. 10/12      Substitution Matrices                                          GHG
Oct. 17/19      Approximate Alignment Methods; Statistics of Pairwise Alignments   GHG
Oct. 24/26      Phylogeny I                                                    GHG
Oct. 31/Nov. 2  Phylogeny II                                                   GHG
Nov. 7/9        Phylogeny III                                                  GHG
Nov. 14/16      Multiple Sequence Alignments                                   AS
Nov. 21/23      Synthetic Evolution; Evaluation of Estimators                  DD/GHG
Nov. 28/30      Current research; Mass profiling                               Guests/GHG
Dec. 5/7        Orthology/Lateral Gene Transfer                                NS
Dec. 12/14      Codon bias                                                     SZ
Dec. 19/21      Genome Rearrangements                                          GHG
Course Grade & Credits
• Participation in the exercises is strongly encouraged, but not mandatory
• Written Exam
  • During winter session
  • 3 hours
  • Only support materials are 2 A4 pages (4 sides), personally handwritten.
Course Homepage
http://www.cbrg.ethz.ch/education/CompBiol
• Course details
• Schedule
• Slides
• Exercises
Darwin
• Interpreted language based on Maple
• Environment for bioinformatics: sequence management, mathematics, alignments, trees, drawing, etc.
• Available for download for Mac and Linux (http://www.cbrg.ethz.ch/darwin)
Biorecipes
www.biorecipes.com
• A collection of real problems with coded solutions in the Darwin language
• Darwin input in green
• Darwin output in red
Other materials
• Slides can be downloaded from the course homepage.
• Additional notes and references will be made available as well.
Basic Biology
Slides of this part are largely based on material from Dr. Gina Cannarozzi
Basic Principles
• Universality of life on earth: water, carbon-based biochemistry; genetic material; genetic code (largely) universal. → common origin!
• Life is compartmentalized: cells are fundamental units of structure, function, organization
• Self-replicating
• Capable of Darwinian evolution
[Figure: Cryptomonadales, scale bar 10 µm. Encyclopedia of Life (eol.org)]
So what is life?
“Living organisms undergo metabolism, maintain homeostasis, possess a capacity to grow, respond to stimuli, reproduce and, through natural selection, adapt to their environment in successive generations.”
• What about endospores? viruses? mules? priests? prions? computer viruses?
• In biology, there are exceptions to almost every rule.
Inside a Cell
Prokaryote
http://www.osovo.com/diagram/prokaryoticcelldiagram.htm
~2 µm
Eukaryote
http://www.biologycorner.com/resources/cell.gif
10-30 µm
Relevant components
• Ribosomes translate mRNA into proteins.
• Mitochondria (eukaryotes) have their own DNA and are a result of early inclusion of α-proteobacteria into a eukaryotic cell.
• Chloroplasts (plants, protists) have their own DNA as a result of early inclusion of cyanobacteria into a eukaryotic cell.
• Plasmids (bacteria) are short pieces of circular DNA, present in multiple copies; they are nonessential and get transferred between bacteria.
Genome
[Figure labels: chromosome, chromatin, histone]
• Genome: all the genetic material of an organism.
• The genome consists of genes and non-coding regions.
• Genes consist of regulatory regions, introns, exons, and untranslated regions.
http://www.scfbio-iitd.res.in/tutorial/geneorganization.html
Escherichia coli
• 1 circular chromosome
• 1 plasmid (multiple copies)
• ~4.6 million base pairs
• ~3.9 million coding bases (85%)
• 4132 protein-coding genes
• 172 RNA (tRNA, rRNA, etc.)
• 578 pseudogenes
Homo sapiens
• 23 chromosome pairs
• ~3 billion base pairs
• ~50 million coding bases (1.5%)
• ~21,000 protein-coding genes
• ~294,000 exons
• ~60,000 different transcripts
• ~6,000 pseudogenes
• ~4,800 RNA genes
• ~2,900 RNA pseudogenes
DNA
Deoxyribonucleic acid
• Double helix
• Backbones: phosphate and deoxyribose, directed (5’ → 3’), antiparallel
• Connection: 4 bases Adenine, Thymine, Cytosine, Guanine.
• A-T and C-G are paired by hydrogen bonds (relatively weak)
[Figure (Wikipedia): one helical turn spans 34 Å (3.4 nm); adjacent base pairs are 3.3 Å (0.33 nm) apart]
DNA Bases
PuRines
PYrimidines
C ···· G: 3 H-bonds
A ···· T: 2 H-bonds
Wikipedia
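The pairing rules above (A with T, C with G, antiparallel strands) can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example and are not part of the course material or of Darwin.

```python
# Watson-Crick pairing: A-T (2 hydrogen bonds), C-G (3 hydrogen bonds).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand written 5'->3' (the strands are antiparallel)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds that hold a strand to its complement."""
    return sum(H_BONDS[base] for base in strand.upper())


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
    print(hydrogen_bonds("ATGC"))      # 10
```

Because C-G pairs contribute three hydrogen bonds and A-T pairs only two, a GC-rich duplex of the same length has a higher count.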
Hydrogen Bond
• X-H ···· Y, where X and Y are electronegative atoms (typically N, O, F)
• Responsible for the high boiling point of water (each H2O can have up to 4 H-bonds)
“Central dogma of
molecular biology”
Wikipedia
DNA Replication
Wikipedia
Polymerase can only add bases in the 5’ → 3’ direction
(the template DNA strand is read 3’ → 5’)
Movie time!
Replication visualized:
http://www.wehi.edu.au/education/wehitv/molecular_visualisations_of_dna/
End of day 1
RNA
• Single stranded (can form structure)
• Uracil instead of Thymine
• mRNA: messenger RNA, for translation
• rRNA: ribosomal RNA, a component of the ribosome
• tRNA: transfer RNA, each specific for one amino acid; binds selectively to its codon at the ribosome.
• microRNA: short RNAs (~22 nt) that regulate gene expression
http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb15_2.html
Transcription
• Transcription factors bind to promoter sites at the 5’ regulatory region.
• RNA polymerase binds to the complex.
• Working together, they open the DNA double helix.
• Genes can be on either strand, but the direction of the growing mRNA sequence is always 5’ → 3’.
Roger Kornberg
Nobel Prize Chemistry 2006
The chain shown in grey is RNA polymerase, with the portion that clamps on the DNA shaded in yellow. The DNA helix being unwound and transcribed by RNA polymerase is shown in green and blue, and the growing RNA strand is shown in red.
http://med.stanford.edu/featured_topics/nobel/kornberg/release.html
Post-transcriptional
modifications (Eukaryotes)
• 5’ Cap
• Poly-A tail
• Splicing (removal of introns)
Research questions: Where are the introns? Where are the coding sequences? Where are the stop and start of transcription? Where are the binding sites for the transcription factors that control when transcription takes place?
Alternative Splicing
• Humans: >50% of genes have splice variants.
• Dscam gene in D. melanogaster: 95 alternative exons can express 38,016 different mRNAs through alternative splicing.
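The Dscam numbers can be checked with a short calculation. The sketch below assumes the commonly reported layout of four clusters of mutually exclusive exons with 12, 48, 33 and 2 alternatives; these cluster sizes are not on the slide and are stated here as an assumption.

```python
from math import prod

# Assumption (not on the slide): Dscam has four clusters of mutually
# exclusive exons, and exactly one alternative per cluster ends up in the mRNA.
ALTERNATIVES_PER_CLUSTER = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# Alternative exons add up across clusters ...
total_alternative_exons = sum(ALTERNATIVES_PER_CLUSTER.values())
# ... while independent choices multiply.
possible_mrnas = prod(ALTERNATIVES_PER_CLUSTER.values())

if __name__ == "__main__":
    print(total_alternative_exons)  # 95
    print(possible_mrnas)           # 38016
```

Under this assumption, 12 + 48 + 33 + 2 = 95 alternative exons give 12 × 48 × 33 × 2 = 38,016 combinations, which matches the figures quoted above.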
Translation
Wikimedia
Commons
The Genetic Code
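As a rough illustration of how the genetic code maps codons to amino acids, the sketch below builds the standard codon table and translates a short sequence. It assumes the standard (largely universal) code, a reading frame that starts at the first base, and translation that stops at the first stop codon; the function names are chosen for this example only.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order (UUU, UUC, ...).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """mRNA has the sequence of the coding DNA strand, with U instead of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTGGTAA")))  # MAW
```

Here AUG, GCC and UGG code for M, A and W, and UAA is a stop codon.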
Proteins
• Participate in most (all?) cellular processes
• Encoded in DNA
• Made of 20 amino acids (+ occasionally a cofactor, such as a metal ion, heme, ATP, etc.)
Alberts et al., “Essential cell biology: an introduction to the molecular biology of the cell”, Garland 1996
Functions of Proteins
...
Amino Acids
• Only sidechains differ (red)
• Sidechains have diverse chemical properties (charge, size, pH, hydrophobicity, ...)
Wikimedia
Commons
Peptide Bond
G. Cannarozzi
Proteins have a 3D structure
Wikimedia Commons
Biological sequences
How are they identified?
Where are they stored?
Next Generation Sequencing
Proteomics
[Figure: protein identification workflow, illustrated with an example protein sequence]
• Unidentified protein extracted from gel
• Split into fragments of 5-10 amino acids (e.g. . . . AEDLISLFMDDM . . .)
• Determine mass using MS (Mass Spectrometry)
• Determine amino acid sequence and compare with sequence database
• Protein identified
Jiang Long, Science Creative Quarterly Image Bank
Growth of sequence databases
[Plot: number of sequences (× 10^7, axis from 0 to 2.0) per year (2000 to 2012) for Protein Data Bank, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL]
Getting Sequences
Ensembl
...
e.g. GenBank File
Evolution
Darwinian Evolution
• Start from an initial population
• Repeat:
  • reproduce and “mutate” randomly
  • natural selection: fittest individuals survive and have descendants → selects “good” mutations
  • sometimes: a “branching” occurs (e.g. speciation, duplication)
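The loop above (reproduce, mutate, select) can be written as a toy simulation. This is a minimal sketch, not a biological model: genomes are DNA strings, fitness is assumed to be the number of positions matching an arbitrary target string, each individual has two offspring, and branching is left out. All names and parameters are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(genome: str, rate: float, rng: random.Random) -> str:
    """Redraw each base with probability `rate` (the draw may repeat the old base)."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base for base in genome
    )


def fitness(genome: str, target: str) -> int:
    """Toy fitness (assumption): number of positions identical to the target."""
    return sum(a == b for a, b in zip(genome, target))


def evolve(target, pop_size=50, generations=100, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Start from an initial (random) population.
    population = [
        "".join(rng.choice(ALPHABET) for _ in target) for _ in range(pop_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        # Reproduce and mutate: every individual leaves two mutated offspring.
        offspring = [mutate(g, rate, rng) for g in population for _ in range(2)]
        # Natural selection: only the fittest half survives.
        offspring.sort(key=lambda g: fitness(g, target), reverse=True)
        population = offspring[:pop_size]
        best_per_generation.append(fitness(population[0], target))
    return population, best_per_generation


if __name__ == "__main__":
    final_population, history = evolve("ACGTACGTACGTACGTACGT")
    print(history[0], history[-1])
```

Printing the first and last entries of `history` shows how the best fitness changes over the run; the exact values depend on the seed and parameters.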
Not only the “good” characters survive
• Genetic drift (random sampling)
  • Population bottleneck
  • Founder effect
• Genetic hitchhiking (neutral or mildly deleterious alleles linked to positively selected gene)
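Genetic drift as random sampling can be illustrated with a Wright-Fisher style simulation. The model choice is an assumption made for this example (constant population size, non-overlapping generations, one neutral allele, no mutation or selection), and the function name is hypothetical.

```python
import random


def wright_fisher(pop_size: int, initial_freq: float, generations: int, seed: int = 0):
    """Track the frequency of a neutral allele under pure random sampling."""
    rng = random.Random(seed)
    count = round(initial_freq * pop_size)
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        # Each individual of the next generation copies a random parent,
        # so it carries the allele with probability `freq`.
        count = sum(rng.random() < freq for _ in range(pop_size))
        trajectory.append(count / pop_size)
    return trajectory


if __name__ == "__main__":
    # Small populations (as after a bottleneck or founder event) drift faster.
    for size in (10, 1000):
        print(size, wright_fisher(size, 0.5, 50)[-1])
```

No selection is involved: the frequency changes only because each generation is a finite random sample of the previous one, and once it reaches 0 or 1 it stays there.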
Species Evolution
Diane Dodd’s fruit fly experiment
• Speciation: the evolutionary process by which new species arise
• Can occur from geographic isolation or barriers, new niche entered, animal husbandry
http://evolution.berkeley.edu/evolibrary/article/_0_0/evo_45
Genome Rearrangements
e.g. Human vs. Dog
Krzywinski et al. Circos: an information aesthetic for comparative genomics. Genome Research (2009) vol. 19 (9) pp. 1639-45
Example: recombination among E. coli strains
Mau et al. Genome Biology 2006 7:R44
Whole genome duplications
Gene Evolution
Point mutations
Kunkel, 2004, The Journal of Biological Chemistry
Point mutations
Purines
Pyrimidines
Insertion/deletion
Lateral Gene Transfer
Wikipedia
http://www.scq.ubc.ca/attack-of-the-superbugs-antibiotic-resistance/
Recombination
Gene Evolution
• Mutation (base substitution)
• Insertion/Deletion
• Transposition (horizontal transfer)
• Recombination
• Gene loss or gene duplication
• Splicing pattern mutations
Evolutionary Distances
How can we quantify the amount of evolution between two subjects?
• Time since divergence
• Number of common traits
• Edit distance (minimum # of elementary operations to transform one object into the other)
• ...
• Desirable properties
  • distance estimable without knowing history
  • metric properties (e.g. triangle inequality)
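One of the distances listed above, the edit distance, can be computed by dynamic programming. The sketch below assumes unit cost for each insertion, deletion and substitution (the Levenshtein distance); other cost schemes are possible and the slides do not fix one.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # match or substitute
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))        # 1
    print(edit_distance("kitten", "sitting"))  # 3
```

With unit costs this distance needs no knowledge of the evolutionary history and satisfies the metric properties, including the triangle inequality.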
Markovian Evolution
Markov model: every site evolves independently; the probability of a mutation depends only on the present state (no memory); the mutation probabilities are given by a transition matrix.
M1 =
          A      C      G      T
    A   0.900  0.033  0.033  0.033
    C   0.033  0.900  0.033  0.033
    G   0.033  0.033  0.900  0.033
    T   0.033  0.033  0.033  0.900
After “one unit” of evolution, the probability that an A mutates into a C is given by the corresponding entry in the matrix:
p(A→C | d=1) = M1[A→C] = 0.033
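Because the model has no memory, d units of evolution correspond to multiplying the one-unit matrix by itself d times. The sketch below illustrates this in plain Python. It assumes the off-diagonal entries are exactly 0.1/3 (shown rounded to 0.033 on the slide) so that each row sums to 1; the helper names are made up for this example.

```python
BASES = "ACGT"
STAY = 0.9
CHANGE = (1.0 - STAY) / 3.0  # assumption: the slide's 0.033 is 0.1/3 rounded

# One-unit transition matrix: rows are the present base, columns the next base.
M1 = [[STAY if i == j else CHANGE for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, d):
    """m multiplied by itself d times (d = 0 gives the identity matrix)."""
    result = [[1.0 if i == j else 0.0 for j in range(len(m))] for i in range(len(m))]
    for _ in range(d):
        result = mat_mul(result, m)
    return result


def transition_probability(src: str, dst: str, d: int) -> float:
    """p(src -> dst | d): probability of seeing dst after d units, starting from src."""
    return mat_pow(M1, d)[BASES.index(src)][BASES.index(dst)]


if __name__ == "__main__":
    for d in (1, 2, 10, 100):
        print(d, round(transition_probability("A", "C", d), 4))
```

For d = 2 the entry is 0.9·q + q·0.9 + 2·q² with q = 0.1/3, about 0.0622. For large d every entry approaches 0.25, so very old divergences carry almost no information about the original base.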
http://gi.cebitec.uni-bielefeld.de/people/boecker/bilder/tree_of_life_new.gif
Augustin Augier, Arbre Botanique (1801)
Lamarck, Philosophie Zoologique , 1809
Darwin, Notebook B, 1837
Edward Hitchcock, Elementary Geology, 1840
Haeckel, The Evolution of Man, 1879
rRNA was used by Woese (1987) to group early life forms into three kingdoms
[Figure: circular tree of life. Leaves are labeled with five-letter species codes (for example HUMAN, PANTR, BOVIN, CAEEL, DICDI). Legend:]
Eukaryota
Archaea
Bacteria
Planctomycetes
Fusobacteria
candidate division TG1
Dictyoglomi
Verrucomicrobia
Aquificae
Acidobacteria
Deinococcus-Thermus
Thermotogae
Chloroflexi
Chlamydiae
Chlorobi
Bacteroidetes
Spirochaetes
Tenericutes
Cyanobacteria
Clostridia
Bacilli
Lactobacillales
Actinobacteria
Proteobacteria
[Figure: close-up of the tree of life with species-code leaf labels, annotated “You are here.”]