The Biology, Technology and
Statistical Modeling of High-throughput Genomics Data
Naomi Altman
Dept. of Statistics
Penn State U.
May 25, 2010
New Tools for Cell Biology
Biology has gone from "data poor" to "data rich" practically overnight.
4 technologies are driving this:
biological tools: microarrays, sequencing
informatics tools: computational tools, internet data sharing
There are lots of opportunities for statistical input due to rapidly evolving technology.
Focus today:
Methods for measuring DNA and RNA
multiple objectives:
• characterize organism
• understand a particular process (e.g. tumor growth)
• understand development
• understand disease
• infer evolutionary history
• characterize a sample of mixed organisms
Outline
Biology - What are we measuring? Why are we measuring it?
Technology - How are we measuring? What are the sources of bias and variance? What are we sharing?
Statistics - A few problems of great interest.
Biology
Some Cell Biology
DNA 100
A Statistician's Simplification
Every cell has the same genetic material, stored in the double helix of DNA.
The Genome is the set of all DNA in the organism (or in the nucleus).
The rungs are "base pairs". Each pair consists of 2 bound nucleotides, which are designated C, G, A, T.
C binds only to G.
A binds only to T.
In a diploid organism, most cells have 2 copies of each chromosome and so of each gene.
Cells differ because different genes are active.
http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/
The fundamental problems:
• What is the sequence of the DNA?
• Which genes are active, where, when and how?
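Because C binds only to G and A binds only to T, either strand of the double helix determines the other. The rule can be sketched in a few lines of Python; the function name `reverse_complement` and the example strand are illustrative assumptions, not something taken from the talk.

```python
# Base-pairing rule: C pairs with G, A pairs with T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # The two strands run in opposite directions, so we reverse
    # the input and then swap each base for its partner.
    return "".join(PAIR[base] for base in reversed(strand))

print(reverse_complement("AGTCTAGGCT"))  # prints AGCCTAGACT
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation of the pairing rule.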
Biology
Some Questions of Interest
Sequence Analysis
Genetic sequence: What is the genetic code for this species or strain?
The primary data are dye intensities for labels for each nucleotide (nt) at each position.
After processing, the data are stored as:
AGTCTAGGCT
There is also a quality score.
http://stat.fsu.edu/~lilei/lilei/research/sanger-c.gif
Biology
Some Questions of Interest
Sequence Analysis
Genotyping
- Where do genes differ among individuals in the same species?
- What do these differences tell us about the phenotype?
- What do these differences tell us about how the species evolved?
- How do these differences evolve between species?
Biology
Some Questions of Interest
Sequence Analysis
Copy number variation - have additional copies of the gene been inserted into the chromosomes (and where)?
Biology
Some Questions of Interest
Sequence Analysis
Are there particular DNA sequences that have function such as:
• genes
• exons/introns
• transcription factor binding sites
• other "regulatory" regions
– RNA binding sites
– methylation sites
Biology
Some Questions of Interest
Sequence Analysis
Metagenomics:
• Can we identify species in a mixed sample by sequencing the DNA?
• Can we recognize DNA for a target (unsequenced) species from a contaminated sample?
Biology
Gene Expression
Expression = Transcription
The DNA unzips.
mRNA is created using the DNA as a template.
The mRNA is processed, creating a transcript.
Protein Creation = Translation
http://www.phschool.com/science/biology_place/biocoach/images/transcription/euovrvw.gif
Biology
Gene Expression
Transcription factors
[Diagram: a gene with its promoter and its upstream and downstream regulatory regions; the 5' and 3' ends and the direction of transcription are marked]
• transcription factors bind to the promoter and bind RNA polymerase
• transcription continues in the 5'-3' direction until a termination signal is reached
http://www.phschool.com/science/biology_place/biocoach/images/transcription/euovrvw.gif
Biology
Transcription
In protein coding genes, introns are excised from the pre-mRNA.
Different combinations of exons form different splice variants.
The "poly-A tail" is added and marks this as mRNA.
[Diagram: exons and introns along the gene (5' to 3') and the resulting mRNA splice variants (isoforms); each transcript ends in a poly-A tail (AAAAA)]
Some genes encode other functional RNA types.
These are important entities, but will not be discussed here.
Biology
DNA 100
A Statistician's Simplification
• The function of each cell is determined by which proteins it produces.
• We might be directly measuring DNA - genotyping, copy number, protein binding sites, methylation sites
• We might be measuring mRNA - gene expression, splice variant expression
There are also tools for direct measurement of proteins, but we will not discuss these here.
Biology
Questions about Transcription & Expression
• Which genes are transcribing (expressing)?
• Which proteins are initiating or obstructing transcription of which genes?
• Which protein-coding transcripts (or splice variants, isoforms) are being produced?
• Where are the protein binding sites?
• How much transcription is occurring?
• Which genes are being turned off by local mechanisms (RNA binding and epigenetics: methylation, DNA coiling)?
Biology
Questions about Transcription & Expression
• Which specific cells are expressing?
http://scienceblogs.com/pharyngula/upload/2006/09/septuple_hox_lg.php
Biology
Questions about Transcription & Expression
• Which genes co-express?
• What does gene expression tell us about tissue development?
• Do homologous genes express in the same treatments?
• Do homologous genes in different species express in the same treatments? (e.g. developmental genes)
• Gene co-expression networks
http://www.biomedcentral.com/1471-2105/8/217/figure/F4?highres=y
Biology
Some Questions of Interest
Transcriptome Analysis
• What mechanisms are causing genes to turn on and
off?
• What mechanisms cause genes to express different
splice variants?
• Which proteins are regulated by transcription and
which by other cell mechanisms?
Some Important Technology for Characterizing RNA and DNA
Technology
Sample Preparation
Reverse Transcription PCR
RT-PCR is used to convert RNA (chemically unstable) to complementary DNA (stable).
[Diagram: in the cell, DNA is transcribed to mRNA; in the test tube, a primer is used to copy the mRNA into cDNA]
Nobel prize in chemistry (1993)
KARY B. MULLIS for his invention of the polymerase chain reaction (PCR) method.
So we can use the same methods to measure DNA and RNA.
Technology
Sample Preparation
Quantitative PCR
A similar PCR reaction can also be used to quantify the amount of RNA in a sample.
This is called RT-PCR or q-PCR.
It is considered the gold standard for gene expression (although it also has error).
The quantification is based on curve fitting.
Technology
Sample Preparation
Chromatin Immunoprecipitation
To capture the locations at which molecules bind to a chromosome:
• Allow molecules to bind.
• Cross-link chemically to form a more stable but reversible bond.
• Attach a tag to the protein that can be captured chemically.
• Fragment DNA.
• Capture fragments bound to tag.
• Release DNA fragments.
Technology
Sample Preparation
Chromatin Immunoprecipitation
Instead of directly measuring quantities bound to the DNA, we can use ChIP to find the DNA binding sites.
This method can also be used for other chemical modifications to the DNA such as methylation.
Technology
Measuring DNA
Because of PCR and ChIP technologies, methods for measuring DNA can be used to:
• measure DNA
• measure RNA
• find locations on chromosome where chemical events occur
Technology
Microarrays
Measurement
A microarray is a substrate on which are attached 1000's of single strands of (c)DNA complementary to the items you wish to detect.
There can be from a few thousand to a million probes consisting of these single strands.
A labeled sample of DNA or cDNA is allowed to hybridize (attach) to the probes.
Dye intensity for each probe is summarized by a scanning microscope.
Intensity is expected to be proportional to the amount of material in the labeled sample.
Technology
Microarrays
Measurement
Microarrays come in many formats:
• Affymetrix GeneChip® (Ewa Paszek, Affymetrix Chip-Basic Concepts, http://cnx.org/content/m12387/1.4/)
• 2-channel glass or plastic slide (http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg)
• bead array (http://www.illumina.com/Images/technology/beadarray_multi_sample_array_formats_lg.gif)
Technology
Microarrays
Measurement
The most fundamental data is a digitized photo of the array giving the label intensity.
These days most of us use intermediate probe summaries produced by the scanner.
Col  Row  Name             X     Y      Dia.  F635 Median  F635 Mean  F635 SD
1    1    Pro25G           1120  13960  120   8281         7993       1182
2    1    Pro25G           1310  13960  130   8570         8260       1373
3    1    NegativeControl  1490  13940  130   29           30         6
4    1    AT1G07480.1      1680  13960  120   372          373        51
5    1    AT2G41780.1      1870  13960  130   516          509        79
6    1    AT5G67530.1      2050  13960  120   1682         1598       325
7    1    AT3G30751.1      2250  13950  140   35           37         9
Technology
Microarrays
Microarrays come in many formats:
• bead arrays
• GeneChip® (Affymetrix®)
• glass or plastic slides (1 or 2 channel)
Each format has strengths and weaknesses.
Different preprocessing methods are required to obtain reasonably accurate quantification.
Technology
Microarrays
Measurement
The probes are designed to detect various "items" by selecting parts of the gene or transcript to match.
mRNA
• gene expression
• exon expression
• tiling
DNA
• SNPs (biological variation)
• protein binding sites
• methylation sites
• genes (for copy number)
Technology
Microarrays
Measurement
In general, microarrays are species specific (actually, genotype specific).
However, the same array can sometimes be used for closely related species (e.g. human/chimp).
Technology
Microarrays
Measurement
Microarrays require known sequences to be used as probes.
Genetic variation affects hybridization.
Technology
Massively Parallel Sequencing
Measurement
The key to the genomics revolution has been the development of fast, accurate DNA sequencing.
"Next generation" sequencing technologies can "read" the genomic sequence of up to millions of short fragments of DNA.
The fragments can be genomic DNA or cDNA.
A priori sequence information is "not required".
Technology
Massively Parallel Sequencing
Measurement
• New sequencing technologies can sequence 1 - 20 million short fragments of DNA per sample.
• Some common brand names (nt = A, G, C or T):
– SOLiD 17 - 35 nt
– Illumina (Solexa) 17 - 100 nt
– 454 200 - 500 nt
Between methods - short is cheaper (per nt) than long
Within method - short is cheaper (per mRNA) than long
Technology
Massively Parallel Sequencing
Measurement
DNA
• "de novo" sequencing
• metagenomics
• resequencing (biological variation)
• SNPs (biological variation)
• protein binding sites
• methylation sites
RNA
• gene expression
• exon expression
• non-coding RNA expression
• isoform discovery
• isoform expression
• microarray probe construction
Technology
Massively Parallel Sequencing
Measurement
Some data from Marioni et al, 2008
GGAAAGAAGACCCTGTTGGGATTGACTATAGGCTGG
GGAATTTAAATTTTAAGAGGACACAACAATTTAGCC
GGGCATAATAAGGAGATGAGATGATATCATTTAAGA
These are 36mers. They need to be matched to a reference to determine what gene (if any) they represent.
The file size is 11 Gb - a bit inconvenient on my Windows computer!
Technology
Massively Parallel Sequencing
Measurement
The most common method is "shot-gun" sequencing.
Many identical strands of DNA are fragmented at random sites.
Technology
Massively Parallel Sequencing
Measurement
The most common method is "shot-gun" sequencing.
ACTAACCTGACT
ATCGAATCGATT
CATTGCATATTG
The sequence of the fragments is determined by a sequencer starting from either the 3' or 5' end of the fragment.
Paired end sequencing uses longer fragments and sequences from both ends, with an unsequenced linker in the center. It is twice as expensive, but more informative for many purposes.
ACTTG--------ATCGA
ACGTT------------ACGAT
CTTAG---AATCA
Technology
Gene Assembly
Computation
For "de novo" sequencing, the fragments must be linked back into the genomic sequence.
For assembly projects:
• longer is better than shorter
• paired end is better than single
The fragments are matched by sequence
single end sequencing
AAGCCTATTAGGCGTA
          GGCGTACCTGATTAG
assembled
AAGCCTATTAGGCGTACCTGATTAG
or paired end sequencing
AAGCCTATT-----------------------------AGTTCCAAT
AGGTCAAGC-----------------ACCGTAAT
assembled
AGGTCAAGCCTATT-------ACCGTAAT-----AGTTCCAAT
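The matching step above can be sketched as a search for the longest suffix of one read that equals a prefix of the next. This is only a toy illustration: the function name `merge_reads` and the `min_overlap` parameter are assumptions made here, and real assemblers must also cope with sequencing errors, repeats and millions of reads.

```python
from typing import Optional

def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on the longest suffix of `left` that equals a prefix of `right`.

    Returns None when no exact overlap of at least `min_overlap` bases exists.
    """
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None

# The two single-end reads shown on the slide
print(merge_reads("AAGCCTATTAGGCGTA", "GGCGTACCTGATTAG"))
# prints AAGCCTATTAGGCGTACCTGATTAG
```

The same function joins the left-hand ends of the paired-end example (`AGGTCAAGC` and `AAGCCTATT`) into `AGGTCAAGCCTATT`; the unsequenced linker is what makes the paired-end case more informative than this exact-overlap step alone.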
Technology
Gene Assembly
Computation
Typical assembly projects:
• de novo sequencing
• isoform detection
• resequencing (may use both the observed fragments and a "reference genome")
Technology
Gene Assembly
Computation
Problems
• errors introduced during either processing or sequencing
• sequencing is based on a signal from C, G, A or T - the nt with the strongest signal is used
• PHRED score is a measure of reliability in the sequence for each position
• genetic variation (imperfect match to reference)
• gene families (perfect matches among regions of different genes)
• not enough sequence (incomplete assembly)
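The PHRED score mentioned above is conventionally defined as Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The sketch below converts between the two. The function names are illustrative, and the ASCII offset of 33 is an assumption (it is the Sanger FASTQ convention; other encodings use a different offset).

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a PHRED score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """PHRED score implied by an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def decode_quality_string(qual: str, offset: int = 33) -> list:
    """Decode an ASCII-encoded quality string into integer PHRED scores.

    offset=33 is assumed here (Sanger-style FASTQ); check your file format.
    """
    return [ord(ch) - offset for ch in qual]

scores = decode_quality_string("I5!")
print(scores)                                    # [40, 20, 0]
print([phred_to_error_prob(q) for q in scores])  # roughly 0.0001, 0.01, 1.0
```

A score of 20 therefore means about a 1 in 100 chance that the called nucleotide is wrong, and a score of 30 about 1 in 1000.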
Technology
Resequencing
Computation
If there is a reference genome or transcriptome, reads are matched to the reference.
reference sequence
AACGTTACCTGAATTGTGTGACCTAAACTGGAGATCATATCGAATGGTACCAGTAC
reads
    TTACCTG
         TGAATTGT
                     CCTAAACTG
                                        CGAATGGT
                      CTAAACTGGA
                    ACCTAAAC
Technology
Resequencing
Computation
For gene expression, this is called RNA-seq.
• for gene and exon expression, if reads are mappable, more reads are better than long reads
• the number of reads falling in a region (e.g. an exon) can be used to quantify expression
from Mortazavi et al, 2008
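Matching reads to the reference and counting reads per region can be sketched with the reference sequence and reads from the slide. Two things here are assumptions for illustration only: the three region boundaries are hypothetical (the slide does not define any), and a read is assigned to the region containing its start position. Real aligners also tolerate mismatches, which exact string search does not.

```python
REFERENCE = "AACGTTACCTGAATTGTGTGACCTAAACTGGAGATCATATCGAATGGTACCAGTAC"
READS = ["TTACCTG", "TGAATTGT", "CCTAAACTG", "CGAATGGT", "CTAAACTGGA", "ACCTAAAC"]

# Hypothetical regions (0-based start, exclusive end); NOT taken from the slide.
REGIONS = {"region_1": (0, 19), "region_2": (19, 38), "region_3": (38, 56)}

def map_read(read, reference):
    """Return the 0-based position of the first exact match, or None if unmappable."""
    pos = reference.find(read)
    return pos if pos >= 0 else None

def count_reads(reads, reference, regions):
    """Count reads per region, assigning each read by its start position."""
    counts = {name: 0 for name in regions}
    unmapped = 0
    for read in reads:
        pos = map_read(read, reference)
        if pos is None:
            unmapped += 1
            continue
        for name, (start, end) in regions.items():
            if start <= pos < end:
                counts[name] += 1
                break
    return counts, unmapped

counts, unmapped = count_reads(READS, REFERENCE, REGIONS)
print(counts, unmapped)
# {'region_1': 2, 'region_2': 3, 'region_3': 1} 0
```

The per-region counts are the raw material for expression estimates; comparing them across samples still requires normalization for the total number of mappable reads.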
Technology
Resequencing
Computation
For gene expression, this is called RNA-seq.
• reads spanning noncontiguous regions of the genome provide direct evidence of splicing
• reads are too short to provide direct splice variant information
Technology
Resequencing
Computation
Typically, in resequencing studies many reads do not match the reference.
Technology
The Internet
Computation
A large percentage of all the microarray and sequencing data collected in academic research settings is freely available on the internet along with:
• documentation
• documentation tools
• sequence matching tools
• statistical analysis tools
• visualization tools
• tools to organize the tools
Technology
The Internet
Computation
Some examples of what is out there:
• NCBI website: reference genomes, transcriptomes, microarray data, sequencing data, analysis tools ....
• Gene Ontology (GO) Database: standardized vocabulary to annotate genes for many species
• Kyoto Encyclopaedia of Genes and Genomes (KEGG): diagrams of known gene networks, genomic information, software tools for network analysis
• Bioconductor: hundreds of R packages for bioinformatics work along with useful databases and tools to download information from, e.g. NCBI, GO and KEGG
• GALAXY: a data and software management system to keep track of analyses
• UCSC Genome Browser: a visualization tool
Technology
The Internet
Computation
NCBI website: reference genomes, transcriptomes, microarray data, sequencing data, software tools
Technology
The Internet
Computation
NCBI Gene Expression Omnibus: 433,240 samples including 7342 platforms
There is another database for "short read" data
Technology
The Internet
Computation
KEGG has a variety of resources including scanned network diagrams
Statistics
Statistical Analysis
• normalization
• differential expression
• cross-platform analysis
• combining information
• utilizing known error structure
• eQTL and high-dimensional response problems
• expression networks
• peak finding
• metagenomics
• isoform expression
• from genome to phenome
Statistics
Normalization
Main focus:
Remove some of the sample-specific noise to improve signal detection.
Well-studied problem (but improvement still on-going)
Gene Expression Microarrays
• platform-specific methods abound
• work well within study when all arrays should have the same mean expression level (averaged over genes)
Other types of microarrays
• less well studied
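The assumption that all arrays in a study should share the same mean expression level (averaged over genes) suggests the simplest possible normalization: shift each array's log intensities so every array has the same mean. The sketch below does exactly that. It is a deliberately minimal stand-in for the platform-specific methods mentioned above, and the function name and the toy intensities are made up for illustration.

```python
import math

def mean_center_log2(arrays):
    """Shift each array's log2 intensities so all arrays share one mean.

    arrays: dict mapping array name -> list of raw intensities
            (same genes in the same order on every array).
    """
    logged = {name: [math.log2(v) for v in values] for name, values in arrays.items()}
    array_means = {name: sum(vals) / len(vals) for name, vals in logged.items()}
    grand_mean = sum(array_means.values()) / len(array_means)
    # Remove each array's own mean, then add back the common mean.
    return {
        name: [x - array_means[name] + grand_mean for x in vals]
        for name, vals in logged.items()
    }

# Toy data: array_b is simply twice as bright as array_a on every gene.
raw = {"array_a": [100, 200, 400], "array_b": [200, 400, 800]}
normalized = mean_center_log2(raw)
print(normalized)
```

After centering, the two toy arrays agree gene by gene, because their only difference was an overall brightness factor. Nonlinear array effects need more flexible methods than a single shift.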
Statistics
Normalization
Main focus:
Remove some of the sample-specific noise to improve signal detection.
New problems
RNA-seq normalization
• the hidden problem: low quality data not returned to the user
• quality scores
• total reads versus mappable reads
• other features depending on sample preparation
Statistics
Normalization
New problems
Within-platform cross-study batch effects
• lots of studies use the same type of microarray but large nonlinear batch effects are evident
Statistics
Normalization
Main focus:
Remove some of the sample-specific noise to improve signal detection.
New problems
Cross-platform effects
• different arrays, RNA-seq and qPCR have been applied to the same samples with different results
• we should be able to correct for this so that we can do combined inference
Statistics
Differential Expression
Main focus:
Determine if there are treatment effects
Well-studied problem (but improvements are possible)
Gene expression microarrays
• ANOVA-type analysis of means
• Bayes, empirical Bayes and shrinkage methods are often used
• Normal theory tests, permutation tests and bootstrapping are commonly used
• FWER and FDR are commonly applied to control for highly multiple comparisons
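As one concrete example of FDR control, the Benjamini-Hochberg step-up procedure is a widely used choice; the slides do not name a specific method, so treating it as the example here is an assumption. With m tests, it sorts the p-values and rejects the k smallest, where k is the largest rank with p_(k) <= k * alpha / m.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans in the input order: True where the
    hypothesis is rejected at false discovery rate level alpha.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
# [True, False, False, False]
```

With thousands of genes tested at once, a per-gene cutoff of 0.05 would produce many false positives; procedures like this one trade a small, controlled proportion of false discoveries for much better power than FWER control.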
Statistics
Differential Expression
Main focus:
Determine if there are treatment effects
Studied but less thoroughly
microarrays
• ANOVA-type problems with complex designs
RNA-seq
• 2-sample tests assuming multinomial or Poisson distributions
Statistics
Differential Expression
Main focus:
Determine if there are treatment effects
New problems
RNA-seq
• treatments with no expression
• extra-Poisson or Binomial variation
• Bayes and friends
• other types of differential measurement
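A 2-sample test under the Poisson assumption can be sketched as an exact conditional test: if a gene's counts x1 and x2 are Poisson with rates proportional to the library sizes n1 and n2, then given the total x1 + x2, x1 is Binomial with success probability n1 / (n1 + n2). The function name and the two-sided p-value rule (summing outcomes no more likely than the observed one) are choices made here for illustration. The sketch is suited only to small counts, and it ignores the extra-Poisson variation listed among the new problems.

```python
from math import comb

def poisson_two_sample_pvalue(x1, x2, n1, n2):
    """Exact conditional test of equal expression rates in two samples.

    x1, x2: read counts for one gene in samples 1 and 2
    n1, n2: total (mappable) reads in samples 1 and 2
    """
    total = x1 + x2
    p = n1 / (n1 + n2)  # expected share of reads in sample 1 under H0

    def pmf(k):
        # Binomial(total, p) probability of k reads landing in sample 1.
        return comb(total, k) * p**k * (1 - p) ** (total - k)

    observed = pmf(x1)
    # Two-sided p-value: total probability of outcomes no more likely
    # than the observed one (small tolerance for floating point ties).
    tail = sum(pmf(k) for k in range(total + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

print(poisson_two_sample_pvalue(10, 0, 1000, 1000))  # 0.001953125
print(poisson_two_sample_pvalue(5, 5, 1000, 1000))   # 1.0 up to rounding
```

When biological replicates show more variability than the Poisson model allows, p-values from a test like this will be too small, which is one reason extra-Poisson variation appears on the list of open problems.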
Statistics
Sequencing Error
Main focus:
Account for sequence error as an integral part of the analysis
New problems
Massively parallel sequencing
• account for quality scores during other analyses
Statistics
High-Dimensional Predictors and Response
Main focus:
Predict association between 2 types of "omics" data
Examples
• eQTLs: find locations on chromosomes which are associated with gene expression (for every gene, for every location)
• methylation: find methylation sites associated with gene expression
• GWAS: find genes associated with multiple phenotypes (e.g. disease, growth patterns, etc)
Statistics
Metagenomics
Main focus:
Determine what organisms are in an environmental sample
Examples
• environmental sampling: scoop up sea water in oil spill area, extract DNA, and try to document all organisms in the water (including unknown organisms)
• ancient DNA: extract all DNA from mammoth hair and separate into mammoth and contaminants
Statistics
Expression Networks
Main focus:
Understand how genes work together
Expt. Design, Network Analysis, PDEs
Inputs: protein binding, gene expression
Experiments: time course; gene knock-out, silencing or down-regulation; gene recovery, enhancing or up-regulation
Statistics
Peak Finding
Main focus:
Understand "interesting" locations on chromosome
Example
Where are the binding sites for protein X?
Peak finding.
[Diagram: fragments piling up at a binding site along the chromosome]
Statistics
Isoform Expression
Main focus:
Identify and quantify splice variant expression
Example
isoform      ex 1   ex 2   ex 3   ex 4   ex 5   ...   ex K   isoform count
iso1                                                         n+1
iso2                                                         n+2
:                                                            :
isoI                                                         n+I
read total   T1+    T2+    T3+    T4+    T5+    ...   TK+    N
We observe the number of reads in each exon. We want to infer the number of reads originating from each isoform. We may not know all the exons or isoforms.
From Genome to Phenome
Main focus:
How do all the pieces fit together to form complex traits?
Combining data from many sources
Genotype affects binding affects expression.
Epigenetics affects expression.
Genotype affects epigenetics.
How do we model the differing types of data together to understand biology on the cellular and larger level?
How can we validate the models?
The Challenge Ahead
Genomics data are challenging because
• the scientific questions are deep
• the possible impact on human life is great (cancer, crops, ...)
• analysis of the data requires at least some knowledge of a very
rapidly changing technology
• analysis of the data requires at least some knowledge of a
rapidly evolving web-based knowledge repository (some of
which has self-replicating errors)
• the jargon is impenetrable (and biologists take for granted that
you know what they are talking about)
• the amounts of data are daunting (e.g. RNA-seq is 20 million
short sequences per sample)
• the data are very noisy with noise depending on the technology
• p>>n always (and p is growing while n is not)
The Challenge Ahead
It is worth the effort of learning genomics because
• the scientific questions are deep
• the possible impact on human life is great (cancer, crops, ...)
• the rapid rate of technological change ensures that there will be
new statistical problems for many years to come
• huge amounts of data are available
• the work is inherently collaborative
We are definitely in the race.
[Figure labels: molecular biology, computer science, statistics]
Understanding bias, variance and the need to validate slows us down, but we get to the right nest!
with many thanks to:
The Huck Institute of Life Sciences (Penn State)
my biology collaborators at PSU who patiently taught me:
dePamphilis Lab
Federoff Lab
Ma Lab
Pugh Lab
Vandenberg Lab
Baums Lab
McSteen Lab
and the students who waded through the material with me:
Statistical Genomics Journal Club
Bioinformatics II - Microarrays
Readings in Statistical Genomics
For the Seriously Interested
Molecular biology of the cell: Reference edition [Book]
by Bruce Alberts - Science - Garland Science (2008) - Hardback - 1601 pages