Slides - Department of Computer Science

CS 5263 Bioinformatics
Lectures 1 & 2: Introduction to
Bioinformatics and Molecular Biology
Outline
• Administrivia
• What is bioinformatics
• Why bioinformatics
• Course overview
• Short introduction to molecular biology
Survey form
• Your name
• Email
• Academic preparation
• Interests
• Purpose: help me better design lectures and assignments
Course Info
• Instructor: Jianhua Ruan
Office: S.B. 4.01.48
Phone: 458-6819
Email: jruan@cs.utsa.edu
Office hours: MW 2-3pm
• Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/
Course description
• A survey of algorithms and methods in
bioinformatics, approached from a
computational viewpoint.
• Prerequisites:
– Programming experience
– Some knowledge of algorithms and data structures
– Basic understanding of statistics and probability
– Appetite to learn some biology
Textbooks
• An Introduction to Bioinformatics
Algorithms
by Jones and Pevzner
• Biological Sequence Analysis:
Probabilistic Models of Proteins and
Nucleic Acids
by Durbin, Eddy, Krogh and Mitchison
• Additional resources
– Papers
– Handouts
– See course website
Grading
• Attendance: 10%
– At most 2 classes missed without affecting grade
• Homework: 50%
– About 5 assignments
– Combination of theoretical and programming
exercises
– No exams
– No late submission accepted
– Read the collaboration policy!
• Final project and presentation: 40%
Why bioinformatics
• Advances in experimental technology
have generated a huge amount of data
– The human genome is “finished”
– Even if it were, that’s only the beginning…
• The bottleneck is how to integrate and
analyze the data
– Noisy
– Diverse
Growth of GenBank vs Moore’s law
Genome annotations
Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006
What is bioinformatics
• National Institutes of Health (NIH):
– Research, development, or application of
computational tools and approaches for
expanding the use of biological, medical,
behavioral or health data, including those to
acquire, store, organize, archive, analyze, or
visualize such data.
What is bioinformatics
• National Center for Biotechnology
Information (NCBI):
– the field of science in which biology, computer
science, and information technology merge to
form a single discipline. The ultimate goal of
the field is to enable the discovery of new
biological insights as well as to create a
global perspective from which unifying
principles in biology can be discerned.
What is bioinformatics
• Wikipedia
– Bioinformatics refers to the creation and
advancement of algorithms, computational
and statistical techniques, and theory to solve
formal and practical problems posed by or
inspired from the management and analysis
of biological data.
[Figure: bioinformatics at the intersection of biology, molecular biology, chemistry, medicine, mathematics, statistics, physics, computer science, and informatics]
Course objectives
• Learn the basics of sequence analysis and other
computational biology algorithms
• Become familiar with the research topics in
bioinformatics
• Be able to
– Read / critique bioinformatics research articles
– Identify subareas that best suit your background
– Communicate and exchange ideas with
(computational) biologists
What will you learn?
• Basic concepts in molecular biology and
genetics
• Algorithms to address selected problems in
bioinformatics
– Dynamic programming, string algorithms, graph
algorithms
– Statistical learning algorithms: HMM, EM, Gibbs
sampling
– Data mining: clustering / classification
• Applications to real data
What will you not learn?
• Designing / performing biological
experiments (duh!)
• Programming (in perl, etc).
• Building bioinformatics software tools (GUI,
database, Web, …)
• Using existing tools / databases (well, not
exactly true)
Covered topics
• Biology (1 week)
• Sequence analysis
– Sequence alignment
• Pairwise, multiple, global, local, optimal, heuristic
– String matching
– Motif finding
• Gene prediction
• RNA structure prediction
• Phylogenetic tree
• Functional Genomics
– Microarray data analysis
– Biological networks
Time allotted to the remaining topic groups: 8 weeks and 5 weeks
Computer Scientists vs
Biologists
(courtesy Serafim Batzoglou, Stanford)
Biologists vs computer scientists
• (almost) Everything is true or false in
computer science
• (almost) Nothing is ever true or false in
Biology
Biologists vs computer scientists
• Biologists seek to understand the
complicated, messy natural world
• Computer scientists strive to build their
own clean and organized virtual world
Biologists vs computer scientists
• Computer scientists are obsessed with
being the first to invent or prove something
• Biologists are obsessed with being the first
to discover something
Some examples of central
role of CS in bioinformatics
1. Genome sequencing
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3x10^9 nucleotides
~500 nucleotides
A big puzzle
~60 million pieces
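The numbers on this slide can be related by a short calculation. The sketch below assumes that the ~60 million pieces are reads of ~500 nucleotides each and uses the usual definition of sequencing coverage (total bases read divided by genome size). The slide does not state a coverage figure, so the function names and this interpretation are illustrative assumptions.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average number of times each genome position is read."""
    return num_reads * read_length / genome_size


def reads_needed(genome_size, read_length, target_coverage):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Slide values: 3x10^9 nucleotides, ~500-nucleotide reads, ~60 million pieces.
GENOME_SIZE = 3_000_000_000
READ_LENGTH = 500
NUM_READS = 60_000_000

print(coverage(NUM_READS, READ_LENGTH, GENOME_SIZE))
print(reads_needed(GENOME_SIZE, READ_LENGTH, 10))
```

Under these assumptions each position of the genome is read about ten times on average, which is why the puzzle has far more pieces than a single pass over the genome would need.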
Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
2. Gene Finding
Where are the genes?
In humans:
~22,000 genes
~1.5% of human DNA
2. Gene Finding
[Figure: gene structure from 5’ to 3’: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3]
Start codon: ATG
Splice sites: exon-intron boundaries
Stop codon: TAG/TGA/TAA
Hidden Markov Models
(Well studied for many years
in speech recognition)
3. Protein Folding
• The amino-acid sequence of a protein determines the 3D
fold
• The 3D fold of a protein determines its function
• Can we predict 3D fold of a protein given its amino-acid
sequence?
– Holy grail of compbio: a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning
4. Sequence Comparison—Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
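An alignment like the one above is usually found by dynamic programming. The following is a minimal sketch of global alignment scoring (the Needleman-Wunsch recurrence), assuming unit scores of +1 for a match, -1 for a mismatch, and -1 per gap position. The slides do not specify a scoring scheme, so these parameters and the function name are assumptions for illustration only.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap cost)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


seq1 = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq2 = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
print(global_alignment_score(seq1, seq2))
```

The table is filled row by row, so the running time is proportional to the product of the two sequence lengths. Keeping only two rows gives the score but not the alignment itself; recovering the alignment requires storing the full table and tracing back.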
Sequence Alignment
Introduced ~1970
BLAST: 1990, most cited paper in history
Still very active area of research
Efficient string matching algorithms
Fast database index techniques
[Figure: a query sequence is searched against a database (DB) with BLAST]
Lipman & Pearson, 1985
…, comparison of a 200-amino-acid
sequence to the 500,000 residues in the
National Biomedical Research Foundation
library would take less than 2 minutes on
a minicomputer, and less than 10 minutes
on a microcomputer (IBM PC).
Database size today: 10^12
(a 2-million-fold increase).
BLAST search: 1.5 minutes
5. Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatments & outcomes
• Predict type before treatment?
Bone marrow samples: ALL vs AML
Measure amount of each gene
Some goals of biology for the next 50 years
• List all molecular parts that build an organism
– Genes, proteins, other functional parts
• Understand the function of each part
• Understand how parts interact physically and functionally
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for personalized medicine
• Bioinformatics is an essential component for all the
goals above
A short introduction to molecular biology
Life
• Two categories:
– Prokaryotes (e.g. bacteria)
• Unicellular
• No nucleus
– Eukaryotes (e.g. fungi, plant, animal)
• Unicellular or multicellular
• Has nucleus
Prokaryote vs Eukaryote
• Eukaryotes have many membrane-bound
compartments inside the cell
– Different biological processes occur at different
cellular locations
Organism, Organ, Cell
Chemical contents of cell
• Water
• Macromolecules (polymers) - “strings” made by linking
monomers from a specified set (alphabet)
–Protein
–DNA
–RNA
–…
• Small molecules
–Sugar
–Ions (Na+, K+, Ca2+, Cl-, …)
–Hormone
–…
DNA
• DNA: forms the genetic material of all
living organisms
– Can be replicated and passed to descendants
– Contains information to produce proteins
• To computer scientists, DNA is a string
made from alphabet {A, C, G, T}
– e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is a nucleotide
• Length varies from hundreds to billions
RNA
• Historically thought to be information
carrier only
– DNA => RNA => Protein
– New roles have been found for them
• To computer scientists, RNA is a string
made from alphabet {A, C, G, U}
– e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is a nucleotide
• Length varies from tens to thousands
Protein
• Protein: the actual “worker” for almost all processes in
the cell
– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport
• To computer scientists, protein is a string made from 20
kinds of characters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Length varies from tens to thousands
DNA/RNA zoom-in
• Commonly referred to as Nucleic Acid
• DNA: Deoxyribonucleic acid
• RNA: Ribonucleic acid
• Found mainly in the nucleus of a cell (hence “nucleic”)
• Contain phosphoric acid as a component (hence
“acid”)
• They are made up of a string of nucleotides
Nucleotides
• A nucleotide has 3 components
– Sugar ring (ribose in RNA, deoxyribose in
DNA)
– Phosphoric acid
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Thymine (T) or Uracil (U)
Monomers of RNA: ribo-nucleotide
• A ribonucleotide has 3 components
– Sugar - Ribose
– Phosphate group
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Uracil (U)
Monomers of DNA: deoxy-ribo-nucleotide
• A deoxyribonucleotide has 3 components
– Sugar – Deoxy-ribose
– Phosphate group
– Nitrogen base
• Adenine (A)
• Guanine (G)
• Cytosine (C)
• Thymine (T)
Polymerization: Nucleotides => nucleic acids
[Figure: three nucleotides, each a phosphate, a sugar, and a nitrogen base, linked into a chain]
[Figure: a DNA strand drawn from its 5’ (5 prime) end, which carries a free phosphate, to its 3’ (3 prime) end, with bases A, G, C, G, A, C, T, G]
5’-AGCGACTG-3’, or simply AGCGACTG
Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.
[Figure: carbon numbering (1 to 5) of the sugar ring, with the base at carbon 1 and the phosphate at carbon 5]
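[Figure: an RNA strand drawn from its 5’ (5 prime) end, which carries a free phosphate, to its 3’ (3 prime) end, with bases A, G, U, G, A, C, U, G]
5’-AGUGACUG-3’, or simply AGUGACUG
Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.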
[Figure: two paired DNA strands running in opposite directions]
Base-pair: A=T, G=C
Forward (+) strand: 5’-AGCGACTG-3’
Backward (-) strand: 3’-TCGCTGAC-5’
One strand is said to be reverse-complementary to the other
DNA usually exists in pairs.
DNA double helix
G-C pair is stronger than A-T pair
Reverse-complementary
sequences
• 5’-ACGTTACAGTA-3’
• The reverse complement is:
3’-TGCAATGTCAT-5’
=>
5’-TACTGTAACGT-3’
• Or simply written as
TACTGTAACGT
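The reverse complement is a simple string operation: complement every base, then reverse the result so that it again reads 5’ to 3’. A minimal sketch follows; the function name is illustrative, and the input is assumed to contain only the upper-case letters A, C, G, T.

```python
# Map each base to its pairing partner (A=T, G=C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna_5to3):
    """Return the reverse complement of a DNA string, written 5' to 3'."""
    complemented = dna_5to3.translate(COMPLEMENT)  # other strand, read 3' to 5'
    return complemented[::-1]                      # flip it to read 5' to 3'


print(reverse_complement("ACGTTACAGTA"))  # the example sequence from the slide
```

Applying the function twice returns the original sequence, which matches the point that either strand can be treated as the reference.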
Orientation of the double helix
• Double helix is anti-parallel
–5’ end of each strand pairs with 3’ end of the other
–5’ to 3’ motion in one strand is 3’ to 5’ in the other
• Double helix has no orientation
–Biology has no “forward” and “reverse” strand
–Relative to any single strand, there is a “reverse
complement” or “reverse strand”
–Information can be encoded by either strand or both
strands
5’TTTTACAGGACCATG 3’
3’AAAATGTCCTGGTAC 5’
RNA
• RNAs are normally single-stranded
• Form complex structure by self-base-pairing
• A=U, C=G
• Can also form RNA-DNA and
RNA-RNA double strands.
– A=T/U, C=G
Protein zoom-in
• Protein is the actual “worker” for almost all processes in
the cell
• A string built from 20 letters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
• Each letter is called an amino acid
Generic chemical form of amino acid (R: side chain, H2N: amino group, COOH: carboxyl group)
     R
     |
H2N--C--COOH
     |
     H
Amino acid
• 20 amino acids, only differ at side chains
– Each can be expressed by three letters
– Or a single letter: A-Z, except B, J, O, U, X, Z
– Alanine = Ala = A
– Histidine = His = H
Amino acids => peptide
[Figure: two amino acids, H2N--CHR--COOH + H2N--CHR--COOH, join into H2N--CHR--CO--NH--CHR--COOH; the CO--NH link is the peptide bond]
Protein
[Figure: a chain of residues (R) running from the N-terminal (H2N) to the C-terminal (COOH)]
• Has orientations
• Usually recorded from N-terminal to C-terminal
• Peptide vs protein: basically the same thing
• Conventions
– Peptide is shorter (< 50aa), while protein is longer
– Peptide refers to the sequence, while protein has 2D/3D structure
Protein structure
• Linear sequence of amino acids folds to
form a complex 3-D structure.
• The structure of a protein is intimately
connected to its function.
Genome and chromosome
• Genome: the complete DNA sequences in
the cell of an organism
– May contain one (in most prokaryotes) or
more (in eukaryotes) chromosomes
• Chromosome: a single large DNA
molecule in the cell
– May be circular or linear
– Contain genes as well as “junk DNAs”
– Highly packed!
Formation of chromosome
50,000 times shorter than extended DNA
The total length of DNA present in one adult human is the
equivalent of nearly 70 round trips from the earth to the sun
Gene
• Gene: unit of heredity in living organisms
– A segment of DNA with information to make a
protein
Some statistics
Organism          Chromosomes  Bases        Genes
Human             46           3 billion    20k-25k
Dog               78           2.4 billion  ~20k
Corn              20           2.5 billion  50-60k
Yeast             16           20 million   ~7k
E. coli           1            4 million    ~4k
Marbled lungfish  ?            130 billion  ?
Human genome
• 46 chromosomes: 22 pairs + X + Y
• 1 from mother, 1 from father
• Female: X + X
• Male: X + Y
Human genome
• Every cell contains the same genomic
information
– Except sperms and eggs, which only contain
half of the genome
• Otherwise your children would have 46 + 46
chromosomes
Cell division: mitosis
• A cell duplicates its
genome and
divides into two
identical cells
• These cells build up
different parts of
your body
Cell division: meiosis
• A reproductive cell
divides into four cells,
each containing only half
of the genome
– Diploid => haploid
• Two haploid cells (sperm
+ egg) form a zygote
– Which will then develop
into a multi-cellular
organism by mitosis
Central dogma of molecular biology
DNA replication is critical in both
mitosis and meiosis
DNA Replication
• The process of copying a double-stranded
DNA molecule
– Semi-conservative
5’-ACATGATAA-3’
3’-TGTACTATT-5’
=>
5’-ACATGATAA-3’    5’-ACATGATAA-3’
3’-TGTACTATT-5’    3’-TGTACTATT-5’
• Mutation: changes in DNA base-pairs
• Proofreading and error-correcting mechanisms
exist to ensure extremely high fidelity
Central dogma of molecular biology
Transcription
• The process by which a DNA sequence is
copied to produce a complementary RNA
– Called messenger RNA (mRNA) if the RNA
carries instructions on how to make a protein
– Called non-coding RNA if the RNA does not
carry instructions on how to make a protein
– Only consider mRNA for now
• Similar to replication, but
– Only one strand is copied
Transcription
DNA-RNA pair: A=U, C=G, T=A, G=C
Coding strand:   5’-ACGTAGACGTATAGAGCCTAG-3’ (where genetic information is stored)
Template strand: 3’-TGCATCTGCATATCTCGGATC-5’ (for making mRNA)
mRNA:            5’-ACGUAGACGUAUAGAGCCUAG-3’
Coding strand and mRNA have the same sequence, except
that T’s in DNA are replaced by U’s in mRNA.
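Both views of transcription can be written as one-line string operations. This sketch uses the slide's example strands; the function names are illustrative, and the template strand is assumed to be given in the 3’ to 5’ direction, as drawn on the slide.

```python
# DNA-RNA pairing used when copying the template strand: A=U, C=G, T=A, G=C.
DNA_TO_RNA_PAIR = {"A": "U", "C": "G", "G": "C", "T": "A"}


def mrna_from_coding(coding_5to3):
    """mRNA equals the coding strand with every T replaced by U."""
    return coding_5to3.replace("T", "U")


def mrna_from_template(template_3to5):
    """Pair each template base, read 3' to 5', with its RNA partner."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3to5)


coding = "ACGTAGACGTATAGAGCCTAG"    # 5' to 3'
template = "TGCATCTGCATATCTCGGATC"  # 3' to 5'
print(mrna_from_coding(coding))
print(mrna_from_template(template))
```

The two functions give the same mRNA for a properly paired coding and template strand, which is a quick consistency check on the strands themselves.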
Translation
• The process of making proteins from mRNA
• A gene uniquely encodes a protein
• There are four bases in DNA (A, C, G, T), and four in
RNA (A, C, G, U), but 20 amino acids in protein
• How many nucleotides are required to encode an amino
acid in order to ensure correct translation?
– 4^1 = 4 (too few)
– 4^2 = 16 (too few)
– 4^3 = 64 (enough to cover 20 amino acids)
• The actual genetic code used by the cell is a triplet.
– Each triplet is called a codon
The Genetic Code
[Table: the genetic code, mapping each codon (first, second, and third letter) to an amino acid]
Translation
• The sequence of codons is translated to a
sequence of amino acids
• Gene: -GCT TGT TTA CGA ATT-
• mRNA: -GCU UGU UUA CGA AUU-
• Peptide: -Ala-Cys-Leu-Arg-Ile-
• Start codon: AUG
– Also codes for Met
• Stop codons: UGA, UAA, UAG
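Translation can be sketched as a table lookup over consecutive codons. The code below builds the standard genetic code from a compact 64-letter string (codons ordered with bases U, C, A, G, and `*` marking a stop) and returns one-letter amino acid codes. It assumes the input already starts in the correct reading frame; the names are illustrative.

```python
BASES = "UCAG"
# Amino acids for UUU, UUC, UUA, UUG, UCU, ... GGG in the standard genetic code.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate an in-frame mRNA string, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GCUUGUUUACGAAUU"))  # the slide's mRNA example
```

For the slide's example the result is the one-letter form of Ala-Cys-Leu-Arg-Ile (A, C, L, R, I). Any trailing bases that do not form a full codon are ignored.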
Translation
• Transfer RNA (tRNA) – a different type of RNA.
– Freely float in the cell.
– Every amino acid has its own type of tRNA that binds
to it alone.
• Anti-codon – codon binding crucial.
[Figure: tRNA-Pro and tRNA-Leu pair their anti-codons with codons on the mRNA while the nascent peptide grows]
Transcriptional regulation
[Figure: a transcription factor and RNA polymerase on the promoter, upstream of the transcription starting site of a gene]
• Will talk more in later lectures
• RNA polymerase binds to a certain location on the promoter to initiate transcription
• Transcription factors bind to specific sequences on the promoter to regulate transcription
– Recruit RNA polymerase: induce
– Block RNA polymerase: repress
– Multiple transcription factors may coordinate
Splicing
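[Figure: a gene downstream of its promoter; transcription begins at the transcription starting site and produces the pre-mRNA]
• Pre-mRNA needs to be “edited” to form mature mRNA
• Will talk more in later lectures.
[Figure: the pre-mRNA consists of a 5’ UTR, exons separated by introns, and a 3’ UTR; splicing removes the introns to give the mature mRNA, in which the open reading frame (ORF) runs from the start codon to the stop codon]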
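The open reading frame can be located with a simple scan of the mature mRNA. This is a simplified sketch that assumes translation starts at the first AUG and ends at the first in-frame stop codon; real gene finding is more involved, and the function name and example sequence are made up for illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orf(mrna):
    """Return the ORF from the first AUG through the first in-frame stop codon.

    Returns None if there is no start codon or no in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time, in frame with the start.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return None


# Hypothetical mature mRNA: 5' UTR "GG", an ORF, then 3' UTR "GG".
print(find_orf("GGAUGGCUUGUUAAGG"))
```

Everything before the returned ORF corresponds to the 5’ UTR and everything after it to the 3’ UTR.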
Summary
• DNA: a string made from {A, C, G, T}
– Forms the basis of genes
– Has 5’ and 3’
– Normally forms double-strand by reverse complement
• RNA: a string made from {A, C, G, U}
– mRNA: messenger RNA
– tRNA: transfer RNA
– Other types of RNA: rRNA, miRNA, etc.
– Has 5’ and 3’
– Normally single-stranded. But can form secondary structure
• Protein: made from 20 kinds of amino acids
– Actual worker in the cell
– Has N-terminal and C-terminal
– Sequence uniquely determined by its gene via the use of codons
– Sequence determines structure, structure determines function
• Central dogma: DNA transcribes to RNA, RNA translates to Protein
– Both steps are regulated
Experimental techniques to manipulate DNA
DNA synthesis
• Creating DNA synthetically in a laboratory
• Chemical synthesis
– Chemical reactions
– Arbitrary sequences
– Maximum length 160-200
• Cloning: make copies based on a DNA template
– Biological reactions
– Requires template
– Many copies of a long DNA in a short time
in vivo DNA Cloning
• Connect a piece of DNA to bacterial DNA,
which can then be replicated together with
the host DNA
bacterial DNA
in vitro DNA Cloning
• Polymerase chain reaction (PCR)
[Figure: one PCR cycle: the double strand is denatured, a primer (< 30 bases) binds to each single strand, and DNA polymerase extends the primers using dNTPs]
Some terms
• Denature: a DNA double-strand is separated into
two strands
– By raising temperature
• Renature: the process by which two denatured DNA
strands re-form a double-strand
– By cooling down slowly
• Hybridization: two heterogeneous DNAs form a
double-stranded DNA
– may have mismatches
– The rationale behind many molecular biological
techniques including DNA microarray
DNA sequencing technology
• Read out the letters from
a DNA sequence
1974, Frederick Sanger
GTGAGGCGCTGC
DNA sequencing: Basic idea
• PCR
primer extension
5’-TTACAGGTCCATACTA 
3’-AATGTCCAGGTATGATACATAGG-5’
• We need to supply A, C, G, T for the synthesis to
continue
• Besides A, C, G, T, we add some A*, C*, G*, and T*
– Very similar to A, C, G, T in all aspects, except that
extension stops once one of them is used
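The idea can be simulated on the slide's primer and template. Because a starred base stops extension wherever it is incorporated, the fragments ending in A* mark every position of A in the new strand, and likewise for C*, G*, and T*. Sorting all fragments by length then reads out the sequence. The sketch below is a simplified illustration (one fragment per position, no experimental noise), with hypothetical function names.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extend_primer(primer_5to3, template_3to5):
    """Return the bases added to the primer when it is fully extended."""
    paired = template_3to5[:len(primer_5to3)]
    if any(COMPLEMENT[p] != t for p, t in zip(primer_5to3, paired)):
        raise ValueError("primer does not pair with the start of the template")
    return "".join(COMPLEMENT[b] for b in template_3to5[len(primer_5to3):])


def terminated_lengths(extension, base):
    """Numbers of added bases in the fragments that end in the starred `base`."""
    return [i + 1 for i, b in enumerate(extension) if b == base]


def read_sequence(lengths_by_base):
    """Sort fragments by length; each fragment's last base is the next letter."""
    ends = sorted((n, base) for base, lens in lengths_by_base.items() for n in lens)
    return "".join(base for _, base in ends)


primer = "TTACAGGTCCATACTA"            # 5' to 3'
template = "AATGTCCAGGTATGATACATAGG"   # 3' to 5'
extension = extend_primer(primer, template)
fragments = {base: terminated_lengths(extension, base) for base in "ACGT"}
print(extension, fragments, read_sequence(fragments))
```

In the laboratory the fragment lengths are measured rather than computed, but the final step is the same: ordering the fragments by length recovers the newly synthesized strand.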
DNA sequencing, cont
Advances in DNA sequencing
• 1969: three years to sequence 115nt DNA
• 1979: three years to sequence ~1650nt
• 1989: one week to sequence ~1650nt
• 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt
• 2000: Human Genome - working draft
sequence, 3 billion bases
• 2003: (near) completion of human genome
The bioinformatics landmark
• Completion of human genome sequencing is a success
made possible by
– Advancement in sequencing technology
– Speed of computation
– Algorithm development in bioinformatics
• HGP (Human Genome Project) strategy
– Hierarchical sequencing
– Estimated 15 years (1990 – 2005), completed in 13 years
– $3 billion
• Celera strategy
– Whole-genome shotgun sequencing
– Three years (1998-2001)
– $300 million
Now
• Over 300 genomes have been sequenced
• ~10^11 - 10^12 nt
2007
• Genomes of three individual humans were
sequenced
– James Watson
– Craig Venter
– TBN Chinese
• Cost for sequencing Watson’s genome
– $3 million, 2 months
– Compared to $3 billion, 13 years for HGP
• Sequencing speed has been tremendously
improved
• High efficiency and relatively low cost
make it possible to sequence the genome
of any individual from any species
What’s next?
Continue to sequence more species?
More individuals?
What to do with those sequences?
Coming next: biological sequence analysis