An introduction to quantitative biology and R
  David Quigley, Ph.D.
  dquigley@cc.ucsf.edu
  Helen Diller Comprehensive Cancer Center, UCSF

  genetics of skin cancer, Balmain (UCSF)
  genetics of breast cancer, Børresen-Dale (U Oslo)
  genetic interactions, synthetic lethality, Ashworth (UCSF)
  timeline: 2007, 2009, 2011, 2013, 2015

David Quigley dquigley@cc.ucsf.edu

  What’s quantitative biology?
  The process of data analysis
  Reproducible research
  A glance at R
  Analysis walk-through
  High-performance computing at UCSF

What’s quantitative biology?

Quantitative Biology
  Studying biology by integrating molecular, genetic, computational, and statistical methods.
  c.f. molecular biology, developmental biology
Data Science
  Statistics with venture capital funding

Genetics has always been quantitative
  evolutionary genetics
  population genetics
  epidemiology
  linkage analysis
  association tests

Molecular biology 30 years ago
  Wet lab
  quantitative
  Suzuki Med Mol Morph 2010; Oh PNAS 1996; Mao Genes Dev 2004

Molecular biology now
  Wet lab
  quantitative
  Nik-Zainal Cell 2012; Fullwood Nature 2009; CGAN Nature 2012

Challenges
  requires statistical sophistication
    in study design
    in interpretation
  many data points
    1,000 to 1,000,000 measurements per sample
  many false positives
    which look like great stories
  software becomes part of the experiment
  divide between engineering, biology
    culture & thinking

The process of quantitative data analysis

Quantitative Biology is biology. Start with questions.
  motivation
  approach
  statistical power analysis

First comes bioinformatics
  engineering: instrument
  native output format
    spectrographs
    qPCR cycle files
    microarray images
    short sequences

What did the machine say?
  engineering: instrument
  bioinformatics: native output format to standardized output
  native output format
    spectrographs
    qPCR cycle files
    microarray images
    short sequences
  standardized output
    protein assignments
    expression matrices
    genome variants

Considerations during primary analysis
  batch effects
  sample quantity
  biological artifacts (e.g. GC content)
  individual assay quality
  sample quality
  platform effects
  operator effects

Normalization challenges vary
  solved problems
    microarray expression level
    TaqMan expression level
    genotypes from SNP chips
  best practices
    SNV calling from sequence
    gene-level RNA-seq
    ChIP-seq
  open problems
    mRNA isoform reconstruction
    tumor clonality analysis from sequence

Secondary analysis addresses the biological question
  To which DNA sequences does TP53 bind?
  What mutations are frequent in basal-like breast cancer?
  Which kinases does my tool compound target?

Primary analysis
  specialized tools and packages
  standardized pipelines develop over time
  driven by methods
Secondary analysis
  general tools
  open-ended
  driven by statistics and biology

Choosing quantitative tools
  Cost
  Learning curve
  Ease of use
  Flexibility
  ecosystem
    people
    other tools

Traditional programming languages
  Python, C++, Java, others
  can solve any computable problem
  creates the fastest tools
  free
  requires programming expertise
  complex to write and test
  high effort

Specialized single-purpose programs
  command line tools
    academic research
    type commands at a prompt or run scripts
    PLINK, bowtie, GATK, bedtools
  GUI (point and click)
    commercial software for a vendor’s platform
    slick, opaque, hard/impossible to automate

Commercial statistics programs
  STATA, SPSS, GraphPad, others
  1) Load one dataset
  2) Select analysis by clicking on a GUI
  3) Generate a report
  may have a built-in language
  mature tools
  Not free

Web-based tools
  Galaxy
  string together pre-defined analysis steps
  very easy to use
  reproducible

R: a “software environment”
  Using R is like writing and using software
  Traditionally, biologists did not do this.

Why is R popular?
  Open-ended, open-source
  Large library of packages
    package: easy-to-use published methods
    like a Qiagen kit
  Free!

You use R by typing at the prompt
  There is no pull-down menu of statistical commands

What’s good about this approach?
  chain analyses
  work with multiple datasets
  use packages of code
  easy to reproduce
  runs on anything
  makes sense to computer programmers

What’s hard about this approach?
  hard to get started
  cryptic commands
  built-in help is hard to use

RStudio makes it easier

Bioconductor
  Curated collection of R packages
  Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
  bioconductor.org

packages for common tasks
  limma: microarray normalization and analysis
  samr: differential expression
  impute: dealing with missing data
  downloaded for free from a central repository

Reproducible research

Replicate a wet lab experiment
  detailed protocols (not printed in the methods)
  extensive optimization
  reagents that might be unique or hard to get
  techniques that require years of experience

Replicate a dry lab experiment
  published algorithms (if novel)
  published source code
    sometimes “available from the authors”
  well-specified input and deterministic output
  no reagents
    Okay, maybe a supercomputer or cloud
  How hard could it be?

Many chances to make honest errors
  Bookkeeping errors
    Transposed column headers
    Out-of-date/changed annotations
    Off-by-one
    Misunderstood sample labels
  Batch effects
  Cryptic cohort stratification
  Inappropriate analytical methods

Your notebook should be the final product
  hand-curate metadata; automate the analysis
  [diagram: primary data, metadata, analysis script, figures, tables]

R Markdown

Learning R data types by comparing them to Excel spreadsheets

Comparing Excel and R
  Excel
    Easy tasks are easy
    non-trivial tasks impossible or expensive
    No paper trail
    Mangles gene names
    Plots look terrible
  R
    Easy jobs are hard at first
    Non-trivial things are possible
    Easy to make a paper trail
    Biostatistics researchers publish tools in R
    Can create publication-ready plots

Organizing data in Excel
  Each subject has a row.
  Each column has a feature of your subjects.
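
The spreadsheet layout described here (one row per subject, one column per feature) carries over to R more or less directly. The sketch below is a minimal illustration, not code from the original slides. The rabbit names and ages echo the examples used in this deck; the `coat` values other than Flopsy's, and the helper name `add.two`, are invented for illustration.

```r
# Hypothetical records: one row per subject, one column per feature.
name <- c("Flopsy", "Mopsy", "Cottontail", "Peter")        # character vector
age  <- c(2.5, 2.6, 2.5, 4)                                 # numeric vector
coat <- c("white, brown paws", "brown", "grey", "brown")    # mostly invented values

# A data frame is a set of equal-length, named columns.
rabbits <- data.frame(name = name, age = age, coat = coat,
                      stringsAsFactors = FALSE)

# Read by row and column.
mopsy.age <- rabbits[2, "age"]            # a single cell
all.names <- rabbits$name                 # a whole column
older     <- rabbits[rabbits$age > 2.5, ] # rows that match a condition

# Write by row and column.
rabbits[4, "age"] <- 4.5

# Functions take named parameters and return a value.
mean.age   <- mean(rabbits$age)
half.steps <- seq(from = 1, to = 5, by = 0.5)

# A user-defined function (hypothetical name) that adds 2 to its input.
add.two <- function(x) {
  x + 2
}
shifted <- add.two(half.steps)
```

Because every step is typed rather than clicked, the same script can be rerun unchanged when the table gains rows.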

R calls the data points variables
  variables
    numbers and characters (letters, words)
    numbers: 2.6, 4
    characters: “Flopsy”, “white, brown paws”

R calls the columns vectors
  vectors
    ordered collections of a variable
    name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
    age: [2.5, 2.6, 2.5, 4]

R calls the data set a data frame
  data frame
    a list of vectors (columns) that have names
    elements can be read and written by row & column

I can slice and dice the data frame

Tell R to do things using functions
  function_name( details about how to do it )
  generate sequence from 1 to 5 counting by 0.5
  parameters for seq are named from, to, and by

Tell R to do things using functions
  function_name( details about how to do it )
  report the mean of my.data.
  Result of one function is fed into another one.

Tell R to do things using functions
  function_name( details about how to do it )
  define a new function that adds 2 to whatever it’s passed
  compare to original value of my.data

Code is a protocol for the computer
  A program is a series of operations on data
  Short programs (scripts) are often linear
  Large programs have decision points
    “flow control”

Most jobs: data preparation & scripts
  tools that manipulate text
    text editing programs (TextPad, BBEdit, Emacs)
    Python
    Old-school command line tools (awk)

Walk-through: a straightforward analysis
  Primary data from METABRIC study
    gene expression
    TP53 sequence
    1,400 samples from 5 hospitals
  Is there an association between breast cancer subtype and TP53 mutation?
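
One simple way to frame a question like this in R is a contingency table of subtype by mutation status, followed by a test of independence. The sketch below is an assumption-laden illustration and is not the method used in the walk-through, which fits linear models to gene expression. The labels are simulated, the per-subtype mutation rates are arbitrary numbers chosen only to make the example non-trivial, and nothing here is METABRIC data.

```r
set.seed(1)  # fix the random seed so the simulation is repeatable

n <- 400
subtype <- sample(c("LumA", "LumB", "Her2", "Basal"), n, replace = TRUE)

# Arbitrary, invented mutation rates per subtype (illustration only).
p.mut <- c(LumA = 0.1, LumB = 0.3, Her2 = 0.6, Basal = 0.8)
tp53 <- ifelse(runif(n) < p.mut[subtype], "mutant", "WT")

# Cross-tabulate the two categorical variables.
counts <- table(subtype, tp53)

# Chi-squared test of independence between subtype and mutation status.
test <- chisq.test(counts)

# Row-wise proportions are often easier to read than raw counts.
proportions <- prop.table(counts, margin = 1)
```

Inspecting `counts`, `proportions`, and `test$p.value` shows whether mutation status is distributed evenly across the simulated subtypes.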

Tasks
  Normalize data
    batch effects
    unwanted inter-sample variation
  Identify outliers
  associations between p53 and subtype

Quantile Normalization (limma)
  Force every array to have the same distribution of expression intensities
  > library(limma)
  > raw = read.table('raw_extract.txt', ...)
  > # normalizeQuantiles() is limma's quantile normalizer; normalize.quantiles() is in preprocessCore
  > raw.normalized = normalizeQuantiles( as.matrix( raw ) )
  > normalized = log2( raw.normalized )

Identify batch effects in microarrays
  Principal Components Analysis
  Identify strongest variation in a matrix (the axes of maximal variation)
  [figure: samples plotted by gene 1 vs. gene 2, group A and group B]

PCA identifies a batch effect
  hospital 3 (yellow)
  [figure: first component vs. second component]
  > my.pca = prcomp( t( expression.data ) )
  > plot( my.pca, ... )

batch correction reduces bias (ComBat)
  ComBat package reduces user-defined batch effects
  [figure: first principal component vs. second principal component]

Molecular subtypes of breast carcinoma, defined by gene expression
  ER status
  Luminal A N=507
  Luminal B N=379
  Her2 N=161
  Basal N=234
  > sa = read.table('patients.txt', ...)
  > tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
  (convert counts to percentages)
  > barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )

Find interactions: TP53 and subtype
  Fit a linear model:
  > fitted.model = lm( dependent ~ independent )
  Perform Analysis of Variance:
  > anova( fitted.model )
  general form of my analysis:
  > anova( lm( gene.expression ~ PAM * TP53 ) )
  18,000 genes
  PAM: {LumA, LumB, Her2, Basal}
  TP53: {mutant, WT}

Automate with loops
  Calculate anova for 18,000 genes by looping through each gene and storing result.
  > n_genes = 18000
  > result = rep( 0, n_genes )
  > for( counter in 1:n_genes ){
  >     result[counter] = anova(...)
  > }
  repeat 18,000 times
  sort results
  identify significant interaction

Immune infiltration in TP53-WT Basal
  Does p53 have a role in immune surveillance?
  [figure: CD3E log2 expression by infiltration (absent, mild, severe)]

High-performance computing resources

Cluster computing
  1 computer: 20 hours
  20 computers: 1 hour

Clusters available on campus
  Institute for Human Genetics
    Recharge
    ~800 cores, plenty of disk space
  HDFCC Cluster
    Free for small jobs to cancer center members
    Contribute resources for big jobs
    ~800 cores, plenty of disk space
  QB3
    Free for small jobs to QB3 members
    Lots of cores, not much disk space
  Amazon AWS
    Infinite capacity, but bring a credit card

Next steps: getting help and learning more

online forums: expert help for free
  biostars.org: all of bioinformatics
  seqanswers.com: Nextgen sequencing
  stats.stackexchange.com: statistics

UCSF resources
  Library classes and information
  Formal courses (BMI, Biostatistics)
  Cores (Computational Biology, Genomics)
  QGDG monthly methods discussion group

Online classes and blogs
  Free courses on data analysis
    http://jhudatascience.org
    simplystatistics.org
    Coursera etc...
  Good tutorials on sequence analysis
    http://evomics.org/learning

Questions?
  dquigley@cc.ucsf.edu
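
Looking back at the walk-through, the normalize, PCA, and per-gene ANOVA steps can be strung together in one short script. The sketch below is a hedged illustration on a small simulated matrix, not the original analysis. It assumes genes in rows and samples in columns, uses a hypothetical base-R helper `quantile.normalize` as a stand-in for a package normalizer so that it runs without extra packages, and assigns subtype and TP53 labels in a balanced, made-up design.

```r
set.seed(42)

n.genes   <- 50
n.samples <- 40

# Simulated raw intensities (genes in rows, samples in columns).
raw <- matrix(rexp(n.genes * n.samples, rate = 1 / 200) + 1,
              nrow = n.genes,
              dimnames = list(paste0("gene", 1:n.genes),
                              paste0("s", 1:n.samples)))

# Hypothetical helper: force every column to share one distribution.
quantile.normalize <- function(x) {
  ranks        <- apply(x, 2, rank, ties.method = "first")
  sorted.means <- rowMeans(apply(x, 2, sort))  # mean value at each rank
  out          <- apply(ranks, 2, function(r) sorted.means[r])
  dimnames(out) <- dimnames(x)
  out
}
normalized <- log2(quantile.normalize(raw))

# PCA on samples: prcomp() expects observations in rows, hence t().
my.pca <- prcomp(t(normalized))
first.two <- my.pca$x[, 1:2]  # coordinates to plot and colour by batch

# Made-up, balanced sample labels (5 mutant and 5 WT per subtype).
PAM  <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 10))
TP53 <- factor(rep(c("mutant", "WT"), times = 20))

# One ANOVA per gene; store the p-value of the interaction term.
interaction.p <- rep(NA_real_, n.genes)
for (i in seq_len(n.genes)) {
  fit <- lm(normalized[i, ] ~ PAM * TP53)
  interaction.p[i] <- anova(fit)["PAM:TP53", "Pr(>F)"]
}

# Genes ordered from strongest to weakest interaction evidence.
ranked.genes <- rownames(normalized)[order(interaction.p)]
```

With thousands of genes the interaction p-values would need a multiple-testing adjustment (for example `p.adjust`) before any gene is called significant, since many false positives are expected by chance.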