Genome browsers

• 3 main parts
• 1) Cell, DNA, genes, transcription, translation
• 2) the genome, genes, differences
between people
• 3) genome browsers
(bridge to Martin's part)
Life, Cells, Proteins
• The study of life → the study of cells
• Cells are born, do their job, duplicate, die
• All these processes controlled by proteins
Molecular Biology Background
• Cells – general structure/organization
• Molecules – that make up cells
• Cellular processes – what makes the cell alive
Cells
• The cell is the fundamental working
unit of every living organism.
• Humans: trillions of cells (multicellular, metazoa);
other organisms like yeast: one cell
(unicellular).
• Different types of cell:
– Skin, brain, red/white blood
– Different biological function
• Cells produced by cells
– Cell division (mitosis)
– 2 daughter cells
• Eukaryotic cells
– Have a nucleus
Two Cell Organizations
• Prokaryotes – lack nucleus, simpler internal structure,
generally much smaller
• Eukaryotes – with nucleus (containing DNA) and various
organelles
Selected organelles…
• Nucleus – contains chromosomes/DNA
• Mitochondria – generate energy for the cell, contain
mitochondrial DNA
• Ribosomes – where translation from mRNA to proteins
takes place (protein synthesis machinery)
• Lysosomes – where protein degradation takes place
Cells can become specialized…
Three domains of life
• Prokarya
Bacteria
Archaea
• Eukarya
Eukaryotes
Universal phylogenetic tree.
Fig. 1 from:
N.R. Pace, Science 276
(1997) 734-740.
Nucleus and Chromosomes
• Each eukaryotic cell has a nucleus
• Rod-shaped particles inside
– Are chromosomes
– Which we think of in pairs
• Different number for species
– Human (46), tobacco (48)
– Goldfish (94), chimp (48)
– Usually paired up
• X & Y Chromosomes
– Humans: Male (XY), Female (XX)
– Birds: Male (ZZ), Female (ZW)
Chromosomes and DNA
DNA
• DNA is a molecule: deoxyribonucleic acid
• Double helical structure (discovered by
Watson, Crick & Franklin)
• Chromosomes are densely coiled and
packed DNA
Chromosome
DNA
SOURCE: http://www.microbe.org/espanol/news/human_genome.asp
DNA Strands
• Chromosomes are the same in every cell of an organism
– Supercoiled DNA (Deoxyribonucleic acid)
• Take a human, take one cell
– Determine the sequence of all chromosomal
DNA
– You’ve just read the human genome (for 1
person)
– Human genome project
• 13 years, 3.2 billion chemicals (bases) in human
DNA
• A deoxyribonucleic acid or DNA molecule is a
double-stranded polymer composed of four
basic molecular units called nucleotides.
• Each nucleotide comprises a phosphate group,
a deoxyribose sugar, and one of four nitrogen
bases: adenine (A), guanine (G), cytosine (C),
and thymine (T).
• The two chains are held together by hydrogen
bonds between nitrogen bases.
• Base-pairing occurs according to the following
rule: G pairs with C, and A pairs with T.
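The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `complement` is chosen here for the example.

```python
# Base-pairing rule from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the partner base for every base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())

top = "ACGTTGACTC"
bottom = complement(top)
print(top)     # ACGTTGACTC
print(bottom)  # TGCAACTGAG
# Pairing is symmetric: the partner of the partner is the original strand.
print(complement(bottom) == top)  # True
```

Because each base has exactly one partner, knowing one strand is enough to reconstruct the other.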
Genes
• The human genome is distributed along 23 pairs of
chromosomes.
– 22 autosomal pairs;
– the sex chromosome pair, XX for females and XY for
males.
• In each pair, one chromosome is paternally inherited, the
other maternally inherited.
• Chromosomes are made of compressed and entwined
DNA.
• A (protein-coding) gene is a segment of chromosomal
DNA that directs the synthesis of a protein.
Central dogma
• The expression of the genetic information stored in the
DNA molecule occurs in two stages:
(i) transcription, during which DNA is transcribed into
mRNA;
(ii) translation, during which mRNA is translated to
produce a protein.
DNA → mRNA → protein
• Other important aspects of regulation: methylation,
alternative splicing, etc.
• The correspondence between DNA's four-letter alphabet
and a protein's twenty-letter alphabet is specified by the
genetic code, which relates nucleotide triplets to amino
acids.
Genetic and physical maps
DNA under electron microscope
3D model of a section of the DNA molecule
Genetic code
Replication of DNA
Transcription
• Process of making a single stranded mRNA
using double stranded DNA as template
• Only genes are transcribed, not all DNA
• Gene has a transcription “start site” and a
transcription “stop site”
• I also want to include that genes can lie in either
direction on the DNA,
• About coding strand, template strand etc.
etc. LOOK UP!!
• This also comes back in the exercises.
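A small sketch of how strand orientation affects transcription, assuming the usual convention that the mRNA has the same sequence as the coding strand (with U instead of T) and that a gene on the opposite strand is read from the reverse complement. The function names and the toy sequences are illustrative only.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(PAIRS[base] for base in reversed(dna.upper()))

def transcribe(top_strand: str, gene_on_top_strand: bool = True) -> str:
    """Return the mRNA of a gene, given the top (forward) strand of the DNA.

    For a gene on the top strand, the top strand is the coding strand.
    For a gene on the bottom strand, the coding strand is the reverse
    complement of the top strand.
    """
    if gene_on_top_strand:
        coding = top_strand.upper()
    else:
        coding = reverse_complement(top_strand)
    return coding.replace("T", "U")

print(transcribe("ATGGCCTAA"))         # AUGGCCUAA (gene on the top strand)
print(transcribe("TTAGGCCAT", False))  # AUGGCCUAA (same gene on the bottom strand)
```

The second call shows why a gene on the bottom strand looks "backwards" when only the top strand is displayed.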
Gene structure
• Exons and Introns
– Introns are “spliced” out, and are not part of
mRNA
• Promoter (upstream) of gene
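Splicing can be pictured as cutting the exon segments out of the primary transcript and joining them. The sketch below uses a made-up transcript and hypothetical exon coordinates (0-based, end-exclusive); it ignores splice-site recognition entirely.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments in order; the introns between them are dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy transcript: exons in upper case, one intron in lower case.
pre_mrna = "AUGGCC" + "guaaguuucag" + "GAAUAA"
exons = [(0, 6), (17, 23)]

print(splice(pre_mrna, exons))  # AUGGCCGAAUAA
```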
Gene expression
• Process of making a protein from a gene
as template
• Transcription, then translation
• Can be regulated
Gene Regulation
• Chromosomal activation/deactivation
• Transcriptional regulation
• Splicing regulation
• mRNA degradation
• mRNA transport regulation
• Control of translation initiation
• Post-translational modification
Transcriptional regulation
[Figure labels: transcription factor, binding sequence ACAGTGA, gene, protein]
Introduction to Bioinformatics
LECTURE 2: Section 2.3 Gene annotation: gene finding
READING FRAMES
The DNA is translated per codon = nucleotide-triplet.
The sequence: …ACGTACGTACGTACGTACGT…
Can thus be read as:
…-ACG-TAC-GTA-CGT-ACG-TAC-GT…
or:
…A-CGT-ACG-TAC-GTA-CGT-ACG-T…
or:
…AC-GTA-CGT-ACG-TAC-GTA-CGT-…
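The three forward reading frames above can be reproduced with a short Python sketch. The helper name `codons` is chosen for this example; it only returns complete triplets.

```python
def codons(seq: str, frame: int) -> list[str]:
    """Split seq into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ACGTACGTACGTACGTACGT"
for frame in range(3):
    print(frame, "-".join(codons(seq, frame)))
# 0 ACG-TAC-GTA-CGT-ACG-TAC
# 1 CGT-ACG-TAC-GTA-CGT-ACG
# 2 GTA-CGT-ACG-TAC-GTA-CGT
```

The opposite strand has three more frames, so a stretch of double-stranded DNA can be read in six ways in total.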
OPEN READING FRAMES: ORF
An open reading frame or ORF is a portion of
an organism's genome which contains a
sequence of bases that could potentially
encode a protein
In a gene, ORFs are located between the
start codon (initiation codon) and
the stop codon (termination codon).
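A minimal ORF scan, as a sketch under simplifying assumptions: only the given strand is searched, ATG is the only start codon, the standard stop codons TAA, TAG and TGA are used, and an ORF is reported from the first ATG up to and including the first in-frame stop. Real gene finders do considerably more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs, 0-based and end-exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

dna = "CCATGGCCTGATAAATGAAATAG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
# 2 11 ATGGCCTGA
# 14 23 ATGAAATAG
```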
Genetic code: exons/introns
intron - exon
Translation
• Process of making an amino acid sequence from
(single stranded) mRNA
• Each triplet of bases translates into one amino
acid
• Each such triplet is called “codon”
• The translation is basically a table lookup
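The "table lookup" can be made literal in Python. The sketch below builds the standard genetic code from a 64-letter string in the usual U, C, A, G order (third base varying fastest), with `*` marking stop codons; the function name `translate` is chosen for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon or the end is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

print(CODON_TABLE["AUG"])         # M
print(translate("AUGGCCGAAUAA"))  # MAE
```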
Genetic code: TRANSLATION
RNA → protein
SOURCE:
http://www.bioscience.org/atlases/genecode/genecode.htm
Differences in DNA
• DNA differentiates:
– Species/race/gender
– Individuals
• We share DNA with
– Primates, mammals
– Fish, plants, bacteria
• Genotype
– DNA of an individual
• Genetic constitution
• Phenotype
– Characteristics of the
resulting organism
• Nature and nurture
Evolution of Genes: Inheritance
• Evolution of species
– Caused by reproduction and survival of the
fittest
• But actually, it is the genotype which
evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover
• Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes
Evolution of Genes: Mutation
• Genes alter (slightly) during reproduction
– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution
• Deletion: ACGTTGACTC → ACGTGACTC
• Insertion: ACGTTGACTC → AGCGTTGACTC
• Substitution: ACGTTGACTC → ACGATGACTT
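The three mutation types can be written as simple string operations. This sketch reproduces the examples above; positions are 0-based and the function names are chosen for illustration.

```python
def delete(seq: str, pos: int) -> str:
    """Remove the base at position pos."""
    return seq[:pos] + seq[pos + 1:]

def insert(seq: str, pos: int, base: str) -> str:
    """Insert base in front of position pos."""
    return seq[:pos] + base + seq[pos:]

def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at position pos."""
    return seq[:pos] + base + seq[pos + 1:]

original = "ACGTTGACTC"
print(delete(original, 3))                            # ACGTGACTC
print(insert(original, 1, "G"))                       # AGCGTTGACTC
print(substitute(substitute(original, 3, "A"), 9, "T"))  # ACGATGACTT
```

Note that the substitution example on the slide changes two positions (T to A at position 3, C to T at position 9), so it takes two calls here.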
Evolution of Genes:
Crossover (Recombination)
• DNA sections are swapped
– From male and female genetic input to
offspring DNA
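As a rough picture of a single crossover event, the sketch below swaps the tails of two equal-length sequences at one point. The sequences and the crossover point are made up; real recombination happens between homologous chromosomes and can involve several crossover points.

```python
def crossover(maternal: str, paternal: str, point: int) -> tuple[str, str]:
    """Swap everything after position point between the two sequences."""
    child_a = maternal[:point] + paternal[point:]
    child_b = paternal[:point] + maternal[point:]
    return child_a, child_b

print(crossover("AAAAAAAA", "CCCCCCCC", 3))  # ('AAACCCCC', 'CCCAAAAA')
```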
The Genome
[Figure labels: DNase I sensitive site, histone modification, gene, conserved sequence, SNP]
Genome
• The entire sequence of DNA in a cell
• All cells have the same genome
– All cells came from repeated duplications starting
from initial cell (zygote)
• Human genome is 99.9% identical among
individuals
• Human genome is 3 billion base-pairs (bp) long
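A quick calculation shows what 99.9% identity means in absolute terms, using only the two numbers on this slide.

```python
GENOME_SIZE_BP = 3_000_000_000   # human genome size from the slide
IDENTITY = 0.999                 # fraction identical between two individuals

def differing_positions(genome_size: int, identity: float) -> int:
    """Approximate number of positions that differ between two genomes."""
    return round(genome_size * (1 - identity))

print(differing_positions(GENOME_SIZE_BP, IDENTITY))  # 3000000
```

So two people still differ at roughly 3 million positions, even though their genomes are 99.9% the same.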
Genome features
• Genes
• Regulatory sequences
• The above two make up 5% of human genome
• What’s the rest doing?
– We don’t know for sure
• “Annotating” the genome
– Task of bioinformatics
Some genome sizes
Organism                              Genome size (base pairs)
Virus, Phage Φ-X174                   5387 (first sequenced genome)
Virus, Phage λ                        5×10^4
Bacterium, Escherichia coli           4×10^6
Plant, Fritillaria assyriaca          13×10^10 (largest known genome)
Fungus, Saccharomyces cerevisiae      2×10^7
Nematode, Caenorhabditis elegans      8×10^7
Insect, Drosophila melanogaster       2×10^8
Mammal, Homo sapiens                  3×10^9
Note: The DNA from a single human cell has a length of ~1.8m.
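The order of magnitude of this length can be checked with a short calculation. It assumes a rise of about 0.34 nm per base pair (the textbook value for B-form DNA, not given on the slide) and two copies of the 3×10^9 bp genome per cell.

```python
BP_RISE_NM = 0.34          # assumption: length per base pair in nanometres
HAPLOID_GENOME_BP = 3e9    # human genome size from the table above

def dna_length_m(base_pairs: float) -> float:
    """Stretched-out length of the DNA in metres."""
    return base_pairs * BP_RISE_NM * 1e-9

# A normal human cell is diploid: it carries two copies of the genome.
print(round(dna_length_m(2 * HAPLOID_GENOME_BP), 2))  # 2.04
```

This gives about 2 m, in line with the ~1.8 m quoted in the note.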
A Bit of History
Sequenced genomes
Year   Organism                  Genome size
1995   Haemophilus influenzae    1.8 Mb
1996   Yeast                     12 Mb
1998   C. elegans                100 Mb
1999   Fruit fly                 125 Mb
2000   Arabidopsis               115 Mb
2001   Human (draft)
2002   Mouse                     2.6 Gb
2004   Human (“finished”)        3 Gb
A Bit of History
http://www.genomesonline.org/
Annotation
Wikipedia:
Genome annotation is the process of attaching biological
information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called
Gene Finding, and
2. attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer
analysis, as opposed to manual annotation which involves
human expertise. Ideally, these approaches co-exist and
complement each other in the same annotation pipeline.
Genome browsing
Why present the whole genome?
• Browse genes in their genomic context
• See features in and around a specific
gene
• Explore larger chromosome regions
• Search & retrieve information on a gene- and genome-scale
• Investigate genome organization
• Compare genomes
What can we learn about
genomes?
• Within one genome: regulatory
elements, gene order, chromatin
structure…
• Through comparative studies:
Evolution, conserved regions,
rearrangements…
Gene quality and prediction.
Basic Genome Annotation
• Genomic location
• Gene model structures
– Exons
– Introns
– UTRs
• Transcript(s)
– Pseudogenes
– Non-coding RNA
• Protein(s)
• Links to other sources of information
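The basic annotation of one transcript can be sketched as a small data structure. Everything here is illustrative: the class name, the field names and the coordinates are made up, and coordinates are taken as 1-based and inclusive, as is common in genome browsers. Introns and UTRs are derived from the exons and the coding region rather than stored.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    chrom: str
    strand: str                     # "+" or "-"
    exons: list[tuple[int, int]]    # (start, end), 1-based inclusive, sorted by start
    cds_start: int                  # first coding base (genomic coordinate)
    cds_end: int                    # last coding base (genomic coordinate)

    def introns(self) -> list[tuple[int, int]]:
        """Introns are the gaps between consecutive exons."""
        return [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])
        ]

    def utr_length(self) -> int:
        """Number of exonic bases that lie outside the coding region."""
        total = 0
        for start, end in self.exons:
            exon_length = end - start + 1
            coding = max(0, min(end, self.cds_end) - max(start, self.cds_start) + 1)
            total += exon_length - coding
        return total

t = Transcript("chr1", "+", [(100, 200), (300, 400), (500, 650)], 150, 600)
print(t.introns())     # [(201, 299), (401, 499)]
print(t.utr_length())  # 100
```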
Advanced Genome Annotation
• Cytogenetic bands
• Polymorphic markers
• Genetic variation
• Repetitive sequences
• Expressed Sequence Tags (ESTs)
• cDNAs or mRNAs from related species
• Regions of sequence homology
Eukaryotic Genomes:
Not only collections of genes
• Protein coding genes
• RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
• Structural DNA (centromeres, telomeres)
• Regulation-related sequences (promoters, enhancers, silencers,
insulators)
• Parasite sequences (transposons)
• Pseudogenes (non-functional gene-like sequences)
• Simple sequence repeats
Challenges of genome browsers
• Increasing sequence information
198,879,188,987 nt
(Aug 2007)
Eukaryotic Genomes:
High fraction non-coding DNA
Source: Mattick, NRG, 2004
• Blue: Prokaryotes
• Black: Unicellular eukaryotes
• Other colors: Multicellular eukaryotes (red = vertebrates)
The Human Genome Project
The idea for the project arose in 1988; it was estimated that it
would take about 20 years for the project to be completed
By 2003 the 3,000,000,000 base pairs had been sequenced
Only 2% of the genome carries information about proteins. We
do not yet know what the remaining 98% is for => is this useless
DNA???
We have about 20,000 genes in our genome. That is very
few when you consider that a flatworm with its 350 brain cells has
hardly fewer genes. The question then is: how many proteins
can we really encode with those 20,000 genes?
Half of the genes code for proteins with an as yet
unknown function,
www.bioinformatica-in-de-klas.nl
Human Genome
• 3 billion basepairs (3Gb)
• 22 chromosome pairs + X and Y chromosomes
• Chromosome length varies from ~50Mb to
~250Mb
• About 22000 protein-coding genes
– compare with ~14000 for fruit fly and ~19000 for
nematode C. elegans
Human genome
Source: Molecular Biology of the Cell (4th edition) (Alberts et al., 2002)
• Only 1.2% codes for proteins, 3.5-5% is under selection
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA
Variation Along Genome sequence
• Nucleotide usage varies
along chromosomes
– Protein coding regions tend to
have high GC levels
• Genes are not equally
distributed across the
chromosomes
– Housekeeping genes generally in
gene-dense areas
– Gene-poor areas tend to have
many tissue specific genes
Source: Ensembl
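Variation in nucleotide usage is usually summarised as GC content per window along the sequence. The sketch below shows the idea on a made-up sequence; the window and step sizes are arbitrary, and real analyses use windows of thousands of bases.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC content of each full window, as (window start, GC fraction)."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]

seq = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
for start, gc in gc_windows(seq, window=10, step=10):
    print(start, gc)
# 0 0.0
# 10 1.0
# 20 0.4
```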
Chromosome organisation
• DNA packed in chromatin
• Active genes in less dense chromatin (beads-on-a-string)
• Non-active genes often in densely packed chromatin (30-nm fiber)
• Gene regulation by changing chromatin density, methylation/acetylation of the
histones
• Limited availability of chromatin information in genome browsers (post-translational
histone modifications are currently under investigation with ChIP-on-chip experiments)
Source: Lodish (4th edition)
Genomic Sequence Conservation
• Not only protein coding parts are conserved in evolution
• Conserved non-coding genomic sequences can be
involved in gene regulation (enhancers, silencers,
insulators)
• With the UCSC browser one can examine genomic
conservation
Copy Number Variation
• People do not only vary at the nucleotide level
(SNPs); short pieces of the genome can be present in
a varying number of copies (Copy Number
Polymorphisms (CNPs) or Copy Number
Variants (CNVs))
• When there are genes in the CNV areas, this
can lead to variations in the number of gene
copies between individuals
• With the UCSC browser CNVs can be examined
• Work out an example
• Possibly also mention that this is used
in forensics
Single Nucleotide Polymorphisms (SNPs)
• Sequence variations within a species
• Similar to mutations, but are simultaneously
present in the population, and generally have
little effect
• Are being used as genetic markers (a genetic
disease is e.g. associated with a SNP)
• The Ensembl browser offers a nice SNP view
• How many SNPs are there?
• Differences between people?
• Work out a few examples
• Difference from a mutation
• dbSNP?
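Candidate single-nucleotide differences between two already aligned sequences of equal length can be listed position by position. This is only a sketch with made-up sequences: whether a difference counts as a SNP also depends on how common it is in the population, which a pairwise comparison cannot show.

```python
def differences(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (position, base in a, base in b) wherever the sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

person_1 = "ACGTTGACTC"
person_2 = "ACGATGACTT"
print(differences(person_1, person_2))  # [(3, 'T', 'A'), (9, 'C', 'T')]
```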
SNPs and mutations
www.bioinformatica-in-de-klas.nl
Alternative Transcripts
Source: Wikipedia (http://www.wikipedia.org/)
• Work out an example, perhaps the
example from the exercise they will do
afterwards?
Evolution
• A model/theory to explain the diversity of life
forms
• Some aspects known, some not
– An active field of research in itself
• Bioinformatics deals with genomes, which are
end-products of evolution. Hence bioinformatics
cannot ignore the study of evolution
Homology: genome analysis
www.bioinformatica-in-de-klas.nl
Taster day MLW
What is bioinformatics?
Homology and evolution
Humans and apes differ in only 1% of their DNA sequence
And humans and dogs in only 7.5%
“… endless forms most beautiful and most wonderful …”
- Charles Darwin
Evolution
• All organisms share the genetic code
• Similar genes across species
• Probably had a common ancestor
• Genomes are a wonderful resource to
trace back the history of life
• Got to be careful though -- the inferences
may require clever techniques
Genome browsers
UCSC
NCBI
Ensembl
http://genome.ucsc.edu/
http://www.ensembl.org/
Genome browsers can be used to examine many
kinds of data
– Genomic sequence conservation
– Duplications and deletions of pieces of chromosome
(Copy Number Variations, CNVs)
– Single Nucleotide Polymorphisms (SNPs)
– Alternative splicing
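Genome browsers address a region as chromosome:start-end. As a sketch, the function below builds a link that opens the UCSC browser at such a region, using its `hgTracks` page with the `db` and `position` parameters. The assembly name `hg38` and the coordinates are example values only, not taken from the slides.

```python
from urllib.parse import urlencode

def ucsc_url(chrom: str, start: int, end: int, assembly: str = "hg38") -> str:
    """Build a UCSC Genome Browser link for the region chrom:start-end."""
    query = urlencode({"db": assembly, "position": f"{chrom}:{start}-{end}"})
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"

# Hypothetical region, used only to show the shape of the link.
print(ucsc_url("chr7", 1000, 2000))
```

Opening the printed link in a web browser should show the chosen region with the default tracks.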
The Ensembl gene set
• All Ensembl genes start from a known
protein or mRNA
[Figure labels: sequence assembly, mRNAs, protein, Ensembl gene set]
• An initial alignment of protein and mRNA to the genome
begins the ‘Genebuild’.
Ensembl Genes – biological basis
All Ensembl gene predictions are based on
proteins and mRNAs in:
• UniProt/Swiss-Prot (manually curated)
• UniProt/TrEMBL
• NCBI RefSeq (manually curated)
[Figure labels: protein/mRNA, sequence assembly, Ensembl genes]