Introduction to Quantitative Biology: 2015 Teaching

An introduction to quantitative biology and R
David Quigley, Ph.D.
dquigley@cc.ucsf.edu
Helen Diller Comprehensive Cancer Center, UCSF
genetics of skin cancer
Balmain (UCSF)
genetics of breast cancer
Børresen-Dale (U Oslo)
genetic interactions
synthetic lethality
Ashworth (UCSF)
2007
2009
David Quigley dquigley@cc.ucsf.edu
2011
2013
2015
What’s quantitative biology?
The process of data analysis
Reproducible research
A glance at R
analysis walk-through
High-performance computing at UCSF
What’s quantitative biology?
Quantitative Biology
Studying biology by integrating molecular,
genetic, computational, and statistical methods.
cf. molecular biology, developmental biology
Data Science
Statistics with venture capital funding
Genetics has always been quantitative
evolutionary genetics
population genetics
epidemiology
linkage analysis
association tests
Molecular biology 30 years ago
Wet lab
quantitative
Suzuki Med Mol Morph 2010
Oh PNAS 1996
Mao Genes Dev 2004
Molecular biology now
Wet lab
quantitative
Nik-Zainal Cell 2012
Fullwood Nature 2009
CGAN Nature 2012
Challenges
requires statistical sophistication
in study design
in interpretation
many data points
1,000 to 1,000,000 measurements per sample
many false positives which look like great stories
software becomes part of the experiment
divide between engineering, biology culture & thinking
The process of quantitative data analysis
Quantitative Biology is biology.
Start with questions.
motivation
approach
statistical power
analysis
First comes bioinformatics
engineering: instrument → native output format
native output formats: spectrographs, qPCR cycle files, microarray images, short sequences
What did the machine say?
engineering: instrument → native output format
bioinformatics: native output format → standardized output
native output formats: spectrographs, qPCR cycle files, microarray images, short sequences
standardized outputs: protein assignments, expression matrices, genome variants
Considerations during primary analysis
batch effects
sample quantity
biological artifacts (e.g. GC content)
individual assay quality
sample quality
platform effects
operator effects
Normalization challenges vary
solved problems
microarray expression level
taqman expression level
genotypes from SNP chips
best practices
SNV calling from sequence
gene-level RNA-seq
ChIP-seq
open problems
mRNA isoform reconstruction
tumor clonality analysis from sequence
Secondary analysis addresses the biological question
To which DNA sequences does TP53 bind?
What mutations are frequent in basal-like breast cancer?
Which kinases does my tool compound target?
Primary analysis
specialized tools and packages
standardized pipelines develop over time
driven by methods
Secondary analysis
general tools
open-ended
driven by statistics and biology
Choosing quantitative tools
Cost
Learning curve
Ease of use
Flexibility
ecosystem
people
other tools
Traditional programming languages
Python, C++, Java, others
can solve any computable problem
creates the fastest tools
free
requires programming expertise
complex to write and test
 high effort
Specialized single-purpose programs
command line tools
academic research
type commands at a prompt or run scripts
PLINK, bowtie, GATK, bedtools
GUI (point and click)
commercial software for a vendor’s platform
slick, opaque, hard/impossible to automate
Commercial statistics programs
STATA, SPSS, GraphPad, others
1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report
may have a built-in language
mature tools
Not free
Web-based tools
Galaxy
string together pre-defined analysis steps
very easy to use
reproducible
R: a “software environment”
Using R is like writing and using software
Traditionally, biologists did not do this.
Why is R popular?
Open-ended, open-source
Large library of packages
package: easy-to-use published methods
like a Qiagen kit
Free!
You use R by typing at the prompt
There is no pull-down menu of statistical commands
What’s good about this approach?
chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers
What’s hard about this approach?
hard to get started
cryptic commands
built-in help is hard to use
RStudio makes it easier
bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
bioconductor.org
packages for common tasks
limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data
downloaded for free from a central repository
Reproducible research
Replicate a wet lab experiment
detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience
Replicate a dry lab experiment
published algorithms (if novel)
published source code
sometimes “available from the authors”
well-specified input and deterministic output
no reagents
Okay, maybe a supercomputer or cloud
How hard could it be?
Many chances to make honest errors
Bookkeeping errors
Transposed column headers
Out-of-date/changed annotations
Off-by-one
Misunderstood sample labels
Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
Your notebook should be the final product
hand-curate metadata; automate the analysis
primary data + metadata → analysis script → figures, tables
R Markdown
Learning R data types by comparing them to Excel spreadsheets
Comparing Excel and R
Excel
Easy tasks are easy
non-trivial tasks impossible or expensive
No paper trail
Mangles gene names
Plots look terrible
R
Easy jobs are hard at first
Non-trivial things are possible
Easy to make a paper trail
Biostatistics researchers publish tools in R
Can create publication-ready plots
Organizing data in Excel
Each subject has a row.
Each column has a feature of your subjects.
R calls the data points variables
variables
numbers and characters (letters, words)
numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”
R calls the columns vectors
vectors
ordered collections of a variable
name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
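A minimal sketch of the two vectors above as R code. The variable names name and age come from the slide; the particular lookups shown are only illustrative.

```r
# Two vectors describing the same four rabbits, in the same order.
name <- c("Flopsy", "Mopsy", "Cottontail", "Peter")
age  <- c(2.5, 2.6, 2.5, 4)

length(name)            # number of elements in the vector
name[2]                 # second element of name
age[name == "Peter"]    # age of the rabbit whose name is Peter
mean(age)               # a whole vector can be passed to a function
```

Square brackets pick elements by position or by a TRUE/FALSE test, which is the R equivalent of selecting cells in a spreadsheet column.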
R calls the data set a data frame
data frame
a list of vectors (columns) that have names
elements can be read and written by row & column
I can slice and dice the data frame
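A short sketch of building and slicing a data frame, reusing the rabbit names and ages from the earlier slides. The data frame name rabbits and the age cutoff are assumptions made for illustration.

```r
# A data frame is a set of named columns (vectors) of equal length.
rabbits <- data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age  = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

rabbits$age                 # read one column as a vector
rabbits[2, ]                # read one row (the second rabbit)
rabbits[2, "age"]           # read one cell by row and column

# Keep only the rows that pass a test (hypothetical cutoff of 2.5)
older <- rabbits[rabbits$age > 2.5, ]

# Write a new value into one cell
rabbits[4, "age"] <- 5
```

The pattern is always data[rows, columns]; leaving one side blank means "all of them".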
Tell R to do things using functions
function_name( details about how to do it )
generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
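The call described above might look like this. Storing the result in my.data is an assumption, chosen because later slides refer to a variable with that name.

```r
# seq() builds a regular sequence; its parameters are given by name.
my.data <- seq(from = 1, to = 5, by = 0.5)

my.data           # print the sequence
length(my.data)   # how many values were generated
```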
report the mean of my.data. Result of one function is fed into another one.
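A sketch of feeding one function into another, assuming my.data holds the sequence from 1 to 5 counting by 0.5.

```r
my.data <- seq(from = 1, to = 5, by = 0.5)

# Two steps: build the data, then summarize it
data.mean <- mean(my.data)

# One step: the output of seq() is passed directly to mean()
nested.mean <- mean(seq(from = 1, to = 5, by = 0.5))

data.mean
nested.mean
```

R evaluates the innermost call first, so nesting reads from the inside out.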
define a new function that adds 2 to whatever it’s passed
compare to original value of my.data
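One way to write such a function. The name add.two is hypothetical, and my.data is assumed to be the sequence used on the previous slides.

```r
my.data <- seq(from = 1, to = 5, by = 0.5)

# Define a function: the value of the last expression is returned
add.two <- function(x) {
  x + 2
}

shifted <- add.two(my.data)

shifted    # every element is 2 larger
my.data    # the original vector is unchanged
```

Functions in R return new values rather than modifying their inputs, which is why my.data still holds the original numbers.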
Code is a protocol for the computer
A program is a series of operations on data
Short programs (scripts) are often linear
Large programs have decision points
“flow control”
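A small sketch of flow control: a loop that visits each value and a decision point that chooses between two outcomes. The ages are the rabbit ages from earlier; the cutoff of 3 and the labels are made up for illustration.

```r
ages <- c(2.5, 2.6, 2.5, 4)
labels <- character(length(ages))   # empty vector to hold the results

for (i in seq_along(ages)) {        # loop: repeat for each element
  if (ages[i] >= 3) {               # decision point
    labels[i] <- "adult"
  } else {
    labels[i] <- "juvenile"
  }
}

labels
```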
Most jobs: data preparation & scripts
tools that manipulate text
text editing programs (TextPad, BBEdit, Emacs)
Python
Old-school command line tools (awk)
Walk-through of a straightforward analysis
Primary data from METABRIC study
gene expression
TP53 sequence
1,400 samples from 5 hospitals
Is there an association between breast
cancer subtype and TP53 mutation?
Tasks
Normalize data
batch effects
unwanted inter-sample variation
Identify outliers
associations between p53 and subtype
Quantile Normalization (limma)
Force every array to have the same distribution of expression intensities
> library(limma)
> raw = read.table('raw_extract.txt', ...)
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )
> normalized = log2( raw.normalized )
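To show what quantile normalization does under the hood, here is a base-R sketch on a toy matrix. It is not limma's implementation: the function name quantile.normalize and the toy values are assumptions, and ties are broken by position rather than averaged.

```r
# Rows are genes, columns are arrays.
quantile.normalize <- function(m) {
  # Rank of each value within its own array (ties broken by position)
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank
  target <- rowMeans(apply(m, 2, sort))
  # Replace each value with the average value for its rank
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(m)
  normalized
}

toy <- cbind(array1 = c(5, 2, 3, 4),
             array2 = c(4, 1, 4, 2),
             array3 = c(3, 4, 6, 8))

toy.normalized <- quantile.normalize(toy)

# After normalization every array contains the same set of values
apply(toy.normalized, 2, sort)
```

Each array keeps its own ordering of genes, but the values themselves are replaced by a shared distribution.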
Identify batch effects in microarrays
Principal Components Analysis
Identify axes of maximal variation in a matrix
[Figure: samples plotted by expression of gene 1 and gene 2, with group A and group B labeled]
PCA identifies a batch effect
[Figure: samples plotted on the first and second components; hospital 3 shown in yellow]
> my.pca = prcomp( t( expression.data ) )
> plot( my.pca$x[,1], my.pca$x[,2], ... )
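A self-contained sketch of the same idea on simulated data, since the METABRIC matrix is not included here. All numbers, the hospital labels, and the size of the artificial shift are assumptions chosen to make the batch effect obvious.

```r
set.seed(1)
n.genes   <- 200
n.samples <- 30

# Simulated expression matrix: genes in rows, samples in columns
expression.data <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
hospital <- rep(c("hospital1", "hospital2", "hospital3"), each = 10)

# Simulate a batch effect: shift half of the genes in hospital3 samples
from.h3 <- hospital == "hospital3"
expression.data[1:100, from.h3] <- expression.data[1:100, from.h3] + 2

# prcomp() expects samples in rows, so transpose the matrix
my.pca <- prcomp(t(expression.data))
scores <- my.pca$x[, 1:2]

# Plot the first two components, highlighting hospital3
plot(scores,
     col = ifelse(from.h3, "gold", "grey40"), pch = 19,
     xlab = "first component", ylab = "second component")
```

If samples cluster by hospital rather than by biology on the leading components, that is a hint that a batch effect dominates the variation.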
batch correction reduces bias (ComBat)
ComBat (in the sva package) reduces user-defined batch effects
[Figure: samples plotted on the first and second principal components]
Molecular subtypes of breast carcinoma, defined by gene expression
[Figure: ER status by subtype. Luminal A (N=507), Luminal B (N=379), Her2 (N=161), Basal (N=234)]
> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ),
col=c("red","green"), ... )
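The "convert counts to percentages" step can be sketched with prop.table(). The eight patients below are invented toy data, not METABRIC values; only the column names ER.status and PAM50Subtype come from the slide.

```r
# Toy sample annotation (made-up values for illustration only)
sa <- data.frame(
  ER.status    = c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg"),
  PAM50Subtype = c("LumA", "LumA", "Basal", "LumB", "Basal", "Her2", "LumB", "Basal")
)

# Cross-tabulate: rows are ER status, columns are subtypes
tumor.counts <- table(sa$ER.status, sa$PAM50Subtype)

# Convert counts to percentages within each subtype (each column sums to 100)
tumor.percent <- 100 * prop.table(tumor.counts, margin = 2)

barplot(tumor.percent, col = c("red", "green"),
        legend.text = rownames(tumor.percent),
        ylab = "percent of tumors")
```

Using margin = 2 normalizes within columns, so subtypes with different sample sizes can be compared directly.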
Find interactions: TP53 and subtype
Fit a linear model:
> fitted.model = lm( dependent ~ independent )
Perform Analysis of Variance:
> anova( fitted.model )
general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )
18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
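A sketch of this model for a single simulated gene. The sample sizes and the planted effect (higher expression only in TP53-WT Basal tumors) are assumptions made so the interaction term has something to detect.

```r
set.seed(2)
n.per.group <- 25

# Balanced design: 4 subtypes x 2 TP53 states
PAM  <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 2 * n.per.group))
TP53 <- factor(rep(rep(c("WT", "mutant"), each = n.per.group), times = 4))

# Simulated log2 expression with no real signal...
gene.expression <- rnorm(length(PAM), mean = 8)

# ...except a planted interaction: raise expression in TP53-WT Basal only
is.wt.basal <- PAM == "Basal" & TP53 == "WT"
gene.expression[is.wt.basal] <- gene.expression[is.wt.basal] + 2

fitted.model <- lm(gene.expression ~ PAM * TP53)
result <- anova(fitted.model)
result

# P value for the interaction term (rows are named after model terms)
interaction.p <- result["PAM:TP53", "Pr(>F)"]
interaction.p
```

In the formula, PAM * TP53 expands to both main effects plus their interaction, and the PAM:TP53 row of the ANOVA table tests the interaction.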
Automate with loops
Calculate anova for 18,000 genes by looping
through each gene and storing result.
> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){
result[counter] = anova(...)
}
sort results
identify significant interaction
repeat 18,000 times
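The loop, sort, and identify steps might be sketched as follows on simulated data. Fifty genes stand in for 18,000, one gene (gene7) has a planted interaction, and the Benjamini-Hochberg adjustment is an addition that the slides do not mention; all names and sizes are assumptions.

```r
set.seed(3)
n_genes     <- 50     # the slides loop over 18,000 genes
n.per.group <- 20

PAM  <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 2 * n.per.group))
TP53 <- factor(rep(rep(c("WT", "mutant"), each = n.per.group), times = 4))

# Simulated expression matrix: genes in rows, samples in columns
expression.data <- matrix(rnorm(n_genes * length(PAM), mean = 8), nrow = n_genes,
                          dimnames = list(paste0("gene", 1:n_genes), NULL))

# Plant an interaction in one gene so there is something to find
is.wt.basal <- PAM == "Basal" & TP53 == "WT"
expression.data[7, is.wt.basal] <- expression.data[7, is.wt.basal] + 3

# Loop through each gene, storing the interaction P value
result <- rep(NA_real_, n_genes)
names(result) <- rownames(expression.data)
for (counter in 1:n_genes) {
  fitted.model <- lm(expression.data[counter, ] ~ PAM * TP53)
  result[counter] <- anova(fitted.model)["PAM:TP53", "Pr(>F)"]
}

# Adjust for testing many genes, then sort so the strongest hits come first
adjusted <- p.adjust(result, method = "BH")
sorted <- sort(adjusted)
head(sorted)
```

Storing one number per gene keeps the result vector simple to sort and filter; with thousands of tests, some form of multiple-testing correction helps guard against the false positives mentioned earlier.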
Immune infiltration in TP53-WT Basal
[Figure: two panels with log2 expression on the y-axis: CD3E, and infiltration (absent, mild, severe)]
Does p53 have a role in immune surveillance?
High-performance computing resources
Cluster computing
1 computer: 20 hours
20 computers: 1 hour
Clusters available on campus
Institute for Human Genetics
Recharge
~800 cores, plenty of disk space
HDFCC Cluster
Free for small jobs to cancer center members
Contribute resources for big jobs
~800 cores, plenty of disk space
QB3
Free for small jobs to QB3 members
Lots of cores, not much disk space
Amazon AWS
Infinite capacity, but bring a credit card
Next steps: getting help and learning more
online forums: expert help for free
biostars.org: all of bioinformatics
seqanswers.com: Nextgen sequencing
stats.stackexchange.com: statistics
UCSF resources
Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group
Online classes and blogs
Free courses on data analysis
http://jhudatascience.org
simplystatistics.org
Coursera etc...
Good tutorials on sequence analysis
http://evomics.org/learning
Questions?
dquigley@cc.ucsf.edu