The Human Genome

Applied Bioinformatics
Dr. Jens Allmer
Week 1 (Introduction)
Your Instructor
• Education
– BSc: University of Münster 1996
– MSc: University of Münster 2002
– PhD: University of Münster 2006
• Worked at
– Izmir Institute of Technology (since 2008)
– Izmir University of Economics, Turkey (Feb 2007 – Aug 2008)
– University of Muenster, Germany (Jan 2006 – Feb 2007)
– University of Pennsylvania, USA (Jan 2004 – Dec 2005)
– University of Jena, Germany (Nov 2002 – Dec 2003)
Areas of Interest
• Bioinformatics
– Sequences
– Alignments
• Mass Spectrometry
– De novo sequencing
– Pattern matching
• Annotation
– Integration
– Automatic assessments
• General Automation and Productivity
Course Rules
• Attendance
– Is essential and will be monitored strictly
– if(absence > 12h) Then NA;
• Make-up
– No make-up for homework
– Midterm and Final need medical report for make-up
Course Rules
• Lecture starts on time
– if late, enter QUIETLY
– if more than 5 min late, DO NOT ENTER; wait for the break
• Breaks are 10 min max
– if late after a break, enter QUIETLY
– if more than 5 min late, DO NOT ENTER; wait for the next break
• Early leave
– Announce before course and leave if granted
Course Rules
• Homework
– Published on the website and/or as slides
– Deadline 6pm on the day before the next class
(you may submit early of course)
– No extension
– No make-up
– No extra homework
• Must be electronically submitted to:
jensallmer.iyte@analysis.urkund.com
– Must be named HW00_first_last.eee or will not be accepted
– Formats include: doc, ppt, odx, txt, html, ...
– Not allowed are formats that may not be edited by me like
pdf, and similar formats that are not widespread
– Must be significantly different from your classmates
– Otherwise everyone involved will obtain zero for that assignment
Grading
• All information available on class website
• Grading individualized
– Homework 20%
– Quizzes 20%
– Mind Maps 10%
– Midterm 20%
– Final 30%
Grading
• I am responsible for evaluating you
– I am not responsible for passing everyone or giving great grades
• Make it easy for me
1. Show up and participate
2. Do homework and pre-course preparations
3. Midterm and Final will be easy for you if you adhere to 1. and 2.
Course Structure
– Start 9:00
– 10 min quiz
– 35 min lecture
– 5 min mind mapping
– 10 min break
– 50 min practice
– 10 min break
– 40-50 min lecture
– 10 min break
– 30 min practice
Textbooks
Primary audience
Junior bio majors
Course home page:
http://www.biolnk.com/habf
ISBN:
978-605-133-297-0
http://www.idefix.com/kitap/biyoenformatik-1-dizi-kiyaslamalarijens-allmer/tanim.asp?sid=GUFFOI44R7FJ9CIR6STU
Textbooks
• Primary audience
– Junior bio majors
• Course home page:
– http://www.bio.davidson.edu/genomics
• Taught by A. Malcolm Campbell (Biology)
Textbooks
Everything you currently
need to know about Applied
Bioinformatics in regard to
practical problems you will
encounter during everyday
research.
Bioinformatics
[Figure: bioinformatics drawing on chemistry, biology, molecular biology, mathematics, statistics, computer science, informatics, medicine, and physics]
Bioinformatics is Multidisciplinary
[Figure: BIOINFORMATICS at the center of genomics, drug design, computer science, molecular life sciences, phylogenetics, structural biology, math, and statistics]
The Pyramid of Life (2000)
• Metabolomics: 1,400 chemicals
• Proteomics: 3,000 enzymes
• Genomics: 30,000 genes
The Pyramid of Life
• Protein interactions?
• 100,000 proteins
• 30,000 genes
• 1,400 chemicals
Bioinformatics
(or Computational Biology)
• Not just the study of DNA or protein sequence data
• Inclusive definition – concerns the storage, display,
reduction, management, analysis, extraction, simulation,
modeling, fitting or prediction of biological, medical or
pharmaceutical data
Basis of molecular life sciences
• Hierarchy of relationships (some exceptions): Genome → Gene → Protein → Function
– Gene 1 → Protein 1 → Function 1
– Gene 2 → Protein 2 → Function 2
– Gene 3 → Protein 3 → Function 3
– Gene X → Protein X → Function X
How can one use bioinformatics to link diseases to genes?
• Disease → Map → Gene → Function
Positional cloning of genes
1. Find genetic markers associated with disease
2. Sequence DNA next to the markers
3. Compare DNA from afflicted individuals to DNA of normal individuals (database)
4. Find abnormalities
5. Predict gene function from sequence information
Bioinformatics in the old days
• Close to Molecular Biology:
– (Statistical) analysis of protein and nucleotide structure
– Protein folding problem
– Protein-protein and protein-nucleotide interaction
• Many essential methods were created early on
– Protein sequence analysis (pairwise and multiple alignment)
– Protein structure prediction (secondary, tertiary structure)
Bioinformatics in the old days (Cont.)
• Evolution was studied and methods created
– Phylogenetic reconstruction (clustering, e.g., Neighbor Joining (NJ) method)
– Nowadays also part of data mining
But then the big bang….
The Human Genome - 26 June 2000
Dr. Craig Venter, Celera Genomics (shotgun method)
Francis Collins (USA) / Sir John Sulston (UK), Human Genome Project
History of the Human Genome Project
• 1953: DNA structure (Watson, Crick)
• 1972: 1st recombinant DNA (Berg)
• 1977: Maxam, Gilbert, Sanger sequence DNA
• 1980: Botstein, Davis, Skolnick, White propose to map human genome with RFLPs
• 1982: Wada proposes to build automated sequencing robots
• 1984: MRC publishes first large genome, Epstein-Barr virus (170 kb)
• 1985: Sinsheimer hosts meeting to discuss HGP at UC Santa Cruz; Kary Mullis develops PCR
• 1986: DOE begins genome studies with $5.3 million
• 1987: Gilbert announces plans to start company to sequence and copyright DNA; Burke, Olson, Carle develop YACs; Donis-Keller publish first map (403 markers)
History of the Human Genome Project
(continued)
• 1987 (cont): Hood produces first automated sequencer; Dupont develops fluorescent dideoxynucleotides
• 1988: NIH supports the HGP; Watson heads the project and allocates part of the budget to study social and ethical issues
• 1989: Hood, Olson, Botstein, Cantor propose using STS's to map the human genome
• 1990: Proposal to sequence 20 Mb in model organism by 2005; Lipman, Myers publish the BLAST algorithm
• 1991: Venter announces strategy to sequence ESTs. He plans to patent partial cDNAs; Uberbacher develops GRAIL, a gene finding program
• 1992: Simon develops BACs; US and French teams publish first physical maps of chromosomes; first genetic maps of mouse and human genome published
• 1993: Collins is named director of NCHGR; revise plan to complete seq of human genome by 2005
• 1995: Venter publishes first sequence of free-living organism: H. influenzae (1.8 Mb); Brown publishes on DNA arrays
• 1996: Yeast genome is sequenced (S. cerevisiae)
History of the Human Genome Project
(continued)
• 1997: Blattner, Plunkett complete E. coli sequence; a capillary sequencing machine is introduced
• 1998: SNP project is initiated; rice genome project is started; Venter creates new company called Celera and proposes to sequence HG within 3 years; C. elegans genome completed
• 1999: NIH proposes to sequence mouse genome in 3 years; first sequence of chromosome 22 is announced
• 2000: Celera and others publish Drosophila sequence (180 Mb); human chromosome 21 is completely sequenced; proposal to sequence puffer fish; Arabidopsis sequence is completed
• 2001: Celera publishes human sequence in Science; the HGP consortium publishes the human sequence in Nature
• 2003: Completed genomes: 112 microbial, 18 eukaryotes, 1275 viruses
Human DNA
• There are at least 3bn (3 × 10^9) nucleotides in the nucleus of almost all of the trillions (3.2 × 10^12) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA): a total of ~10^22 nucleotides!
• Many DNA regions code for proteins and are called genes (1 gene codes for 1 protein as a base rule, but the reality is a lot more complicated)
– Name examples
• Human DNA may contain ~27,000 expressed genes
– Problems?
• Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases
– Ambiguities?
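The order of magnitude quoted for the total number of nucleotides can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the two round figures from the slide; the constant and function names are made up for illustration.

```python
# Round figures taken from the slide (both are rough estimates).
NUCLEOTIDES_PER_NUCLEUS = 3 * 10**9   # about 3 billion nucleotides per nucleus
CELLS_IN_BODY = 32 * 10**11           # about 3.2 x 10^12 cells

def total_nucleotides(per_nucleus=NUCLEOTIDES_PER_NUCLEUS, cells=CELLS_IN_BODY):
    """Multiply nucleotides per nucleus by the number of cells."""
    return per_nucleus * cells

total = total_nucleotides()
print(f"{total:.1e}")  # scientific notation, close to 10^22
```

The estimate ignores cells without a nucleus, so it is an upper-end approximation rather than an exact count.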
Y-Chromosome
• 50% of the sequence consists of
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
• Not very meaningful
Human DNA (Cont.)
• All people are different, but the DNA of different people varies by only 0.2% or less
• So, only up to 2 letters in 1000 are expected to be different
• Evidence from current genomics studies (Single Nucleotide Polymorphisms, or SNPs) implies that on average only 1 letter out of 1400 is different between individuals
• Over the whole genome, this means that 2 to 3 million letters would differ between individuals
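The numbers on this slide follow from simple division. The sketch below assumes a genome size of 3 × 10^9 letters (the figure used on the previous slide); the function name is hypothetical.

```python
GENOME_SIZE = 3 * 10**9  # assumed genome size in letters

def expected_differences(genome_size, one_in):
    """Expected number of differing letters if 1 letter in `one_in` differs."""
    return genome_size / one_in

# Upper bound: 0.2% means 2 letters in 1000, i.e. 1 letter in 500.
upper_bound = expected_differences(GENOME_SIZE, 500)
# SNP-based estimate: 1 letter in 1400.
snp_estimate = expected_differences(GENOME_SIZE, 1400)

print(f"upper bound:  {upper_bound:,.0f} letters")
print(f"SNP estimate: {snp_estimate:,.0f} letters")
```

The SNP-based estimate lands in the range of 2 to 3 million letters, well below the 0.2% upper bound.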
Modern bioinformatics is closely
associated with genomics
• The aim is to solve the genomics information
problem
• Ultimately, this should lead to a biological understanding of how all parts fit together (DNA, RNA, proteins, metabolites) and how they interact (gene regulation, gene expression, protein interaction, metabolic pathways, protein signaling, etc.)
Functional Genomics
From gene to function
Genome
Expressome
Proteome
Interactome?
TERTIARY STRUCTURE (fold)
Metabolome
How much of the genome is defined?
Unknown Function
What is bioinformatics?
[Figure: Bioinformatics at the overlap of Math, Physics, English, Bio, Comp sci, Chem, and Stats]
• Machine learning
• Database systems
• Data mining
• Image processing
• Modeling
• Graph theory
• Statistical analysis
• Sequence
• Structure
• Interactions
• Regulation
• Genomes
• Evolution
• E.g. Process the spots on a microarray, determine
which genes are differentially expressed, link spots to
sequence via a database, analyze the sequence using
predictive tools, link the genes to related genes to form
a network
What is a bioinformatician?
• Somebody who knows everything
What is a bioinformatician?
• A facilitator
– Typically has background in biology or CS, but is comfortable
with concepts from other disciplines
– Bring together ideas (or researchers) from different domains to
solve a biological problem
• Conceptualize the problem
– Use language appropriate to the domain
• Identify potential solutions
– Understanding of different fields helps to identify possible
approaches at a broad level
• Guide the development process
– Create in-house or find potential collaborators to work on
approaches in-depth
• Integrate results into overall solution
– Software/method, results of biological analysis
How is Bioinformatics Used?
Bioinformatics is used to help “focus”
the scientist on the bench top experiments
Bioinformatics isn’t going to replace
lab work anytime soon
Experimental proof is still the
“Gold Standard”.
Bioinformatics
• Is application of computational tools in Biology
Bioinformatics?
• Not really!
• In this course we will however only go into algorithmic
details rarely (like today ;)
Mind Mapping
• Have you ever studied a subject or brainstormed an
idea, only to find yourself with pages of information, but
no clear view of how pieces fit together?
• Mind mapping
– Learn more effectively
– Improves memorization
– Enhances creativity
– Speeds up analyses
– Gives structure to complex ideas
– Records information for future use
Source: http://www.mindtools.com/pages/article/newISS_01.htm
An Example Mind Map for MicroRNAs
How to Mind Map
1. Identify the central topic and write it in the center
2. Write the major parts of the topic on lines radiating in all directions
3. Repeat step 2 with ever finer levels of detail until satisfied
Source: http://www.mindtools.com/pages/article/newISS_01.htm
Note Taking with Mind Maps
• Capture ideas organized into topics
– What if the central topic which I chose is not the central topic?
– Make a new mind map which captures the topic correctly
• Use cases
– Note taking in class
– Recapitulation after lecture
– Analysis of a new topic
– Structuring of any intended writing
• When
– During acquisition of new knowledge (faster than writing)
– For review 5 min, 1 h, 6 h, 1 day, 7 days, 1 month after note taking
Mind Mapping Tips
1. Use single words or very short phrases
2. Write clearly and legibly
3. Use color!
4. Separate ideas (color, lines, shading)
5. Draw symbols and images
6. Draw links among elements
A More Elaborate Mind Map
Source: http://www.mindtools.com/pages/article/newISS_01.htm
At the Heart of Bioinformatics
Genomic
>scaffold_1152
GGTGCGGCCGTCCTCCAGCTGCTTGCCGGCGAAGATCAGGCGCTGCTGGT
CCGGGGGGATGCCTGCATCCGGTGAGGAAACGCTCGTGTCAGACAAAGTG
GGTGGGCGCAGGAAGCAGCAATCAACACAGCCCAGTGCAGCTGCAAAGCG
CCCGCCTTACCACTGACCCGCCTGGCCACCCACCCCTACCCCCCGTAAGG
AAAGAGCCCCGACTCACCCTCCTTGTCCTGAATCTTGGCCTTCACGTTCT
CAATGGTGTCCGAAGACTCCACCTCGAGCGTGATGGTCTTGCCCGTCAGG
GTCTTGACGAAGATCTGCATGCCACCGCGCAGGCGCAGCACCAGGTGCAG
…
Translated
>RF1_scaffold_1152
GAAVLQLLAGEDQALLVRGDACIR$GNARVRQSGWAQEAAINTAQCSC
KAPALPLTRLATHPYPP$GKSPDSPSLS$ILARDVAHDFAKSSPR$YA
PLIPQNLRC$SIEMKQPASLLSPIGEGACASHLQCLEKCLLP$GAIVY
MIS$GSGRR$TSWVGIGGCNDGTEKRSEVDSRRGGKGNIHD
>RF2_scaffold_1152
VRPSSSCLPAKIRRCWSGGMPASGEETLVS AATAAKPQTWSPTAWEF
KVGGRRKQQSTQPSAAAKRPPYH$PAWPPTPTPRKERAPTHPPCPESW
SRSQWCPKTPPRA$WSCPSGS$RRSACHRAGAAPGAGSTPSGCCSQPG
CGRPPAACRRRSGAAGPGGCLCVGGGGEGACASHLQCLEGE
…
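The step from the genomic sequence to the translated reading frames can be sketched in a few lines of Python. This assumes the standard genetic code and follows the slide in writing stop codons as `$`; the names `CODON_TABLE` and `translate` are made up for this illustration, and only the first line of the scaffold is used.

```python
# Standard genetic code, bases ordered T, C, A, G; "$" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY$$CC$WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate DNA into protein letters, starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        # Codons with ambiguous letters (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)

# First line of scaffold_1152 from the slide.
scaffold_start = "GGTGCGGCCGTCCTCCAGCTGCTTGCCGGCGAAGATCAGGCGCTGCTGGT"
print(translate(scaffold_start, frame=0))  # beginning of RF1
print(translate(scaffold_start, frame=1))  # beginning of RF2
```

Shifting the start by one or two letters changes every codon, which is why each reading frame gives a completely different protein sequence.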
Try it for yourself
Sequence
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
Pattern
TGATGT
Your Task
• You may only compare 1 character at a time
• You may create helpful structures
• You should find the location of the Pattern in the Sequence with a minimal number of comparisons
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons (running total as the pattern slides one position to the right at a time): 1, 2, 3, 4, 6, 7-16, 17-22
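The brute force approach can be written down as a short Python sketch that also counts single-character comparisons. The function name is hypothetical, and the exact tally depends on how comparisons are counted, so it may differ slightly from the numbers on the slides.

```python
def brute_force_search(text, pattern):
    """Slide the pattern one position at a time, comparing left to right.

    Returns (position, comparisons); position is 0-based, or -1 if not found.
    """
    comparisons = 0
    for pos in range(len(text) - len(pattern) + 1):
        matched = 0
        while matched < len(pattern):
            comparisons += 1  # one character comparison
            if text[pos + matched] != pattern[matched]:
                break  # mismatch: give up on this alignment
            matched += 1
        if matched == len(pattern):
            return pos, comparisons
    return -1, comparisons

sequence = "ACGGTAGTATGTGATGTATGATCGCGAAAGAGG"
pattern = "TGATGT"
position, count = brute_force_search(sequence, pattern)
print(position, count)
```

Every alignment is tried in turn, so nothing learned from a failed alignment is reused in the next one.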
Boyer-Moore Algorithm
• Preprocessing
– Good suffix matrix (m+1)
– Bad character matrix (m+1)
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons (running total as the pattern jumps to the right): 1, 2, 3-7, 8, 9-15
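To make the idea of skipping alignments concrete, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore. It uses only a bad-character shift table and leaves out the good suffix rule, so it is not the full algorithm named on the slide. The function name is hypothetical, and the comparison tally may differ slightly from the slide depending on counting conventions.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, shift by a bad-character table.

    Returns (position, comparisons); position is 0-based, or -1 if not found.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # Preprocessing: distance from the last occurrence of each character
    # (excluding the final pattern position) to the end of the pattern.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    comparisons = 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1  # one character comparison
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Shift according to the text character under the pattern's last position;
        # characters not in the pattern allow a jump of the full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons

sequence = "ACGGTAGTATGTGATGTATGATCGCGAAAGAGG"
pattern = "TGATGT"
position, count = horspool_search(sequence, pattern)
print(position, count)
```

Because the pattern is compared from its right end and shifted by more than one position after most mismatches, many alignments are never examined at all.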
Questions
Define
Algorithm
Website
• http://mbg305.allmer.de
• Slides
• Homework
• Additional materials and challenges
• Grades
Website
• To see your grades you need to login
• Some material may need login as well
• Currently
– UserID = StudentID
– Password = StudentID
• Change now
– UserID = working email address
– Password = whatever you will remember
Login to mbg305.allmer.de
• We will now assist you to log in and to add your email
address and change your password.
Assignments
– Research about Mind Maps
• E.g.: http://en.wikipedia.org/wiki/Mind-map
• IYTE library
– Make sure to read the lecture notes for next week (Available
online on Monday)
• Prepare at least two proper questions that will be collected at the
beginning of the course next week
– Read Chapters 1 and 2 from our textbook