Reflections on teaching a bioinformatics programming and gene expression analysis course

Garrett Dancik, PhD
Eastern Connecticut State University
http://bioinformatics.easternct.edu
May 29, 2015

Bioinformatics Programming and Analysis (Course Background)
• Supports a forthcoming Bioinformatics minor at Eastern Connecticut State University
• A computer science-based course
  • Open to students who have taken Programming I (C++)
• An upper-level elective course for computer science majors, who will
  • Have little / no biological background
  • Have little / no background in statistics
  • But are experienced programmers (C++, Java)

Course content
R programming
• Basic programming concepts in R – variables, vectors, functions
Biology / Bioinformatics
• Gene expression profiling
• Data processing
• Clustering
• Identification of differentially expressed genes
• Heatmaps
• Predictive modeling
Probability and Statistics
• Descriptive statistics
• Probability and probability distributions
• Hypothesis testing and p-values
Note: a separate Introduction to Bioinformatics course is also offered (covers sequence databases, sequence alignment, etc.)

R programming
• R (http://cran.r-project.org) is an open source, freely available programming language for data analysis and visualization
• Bioconductor (http://www.bioconductor.org) is an open source repository of (primarily) R packages for the analysis of genomic data
  • GEOquery – downloads data from the Gene Expression Omnibus (GEO)
  • affy – processing and analysis of Affymetrix oligonucleotide arrays
  • limma (Linear Models for Microarray Data) – analysis of gene expression data
• RStudio (http://www.rstudio.com) is an integrated development environment (IDE) for R
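
A minimal sketch, not taken from the original slides, of how one might check before class that the three Bioconductor packages named above are installed. The helper name `check.packages` is hypothetical; `requireNamespace()` is base R. Installation itself is left as a comment, since the procedure depends on the R and Bioconductor versions in use.

```r
# Hypothetical helper: report which of the listed packages can be loaded.
check.packages <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

pkgs <- c("GEOquery", "affy", "limma")
installed <- check.packages(pkgs)
print(installed)

# List anything that still needs to be installed
not.installed <- names(installed)[!installed]
if (length(not.installed) > 0) {
  message("Not installed: ", paste(not.installed, collapse = ", "))
  # One option with current Bioconductor releases (assumption; not run here):
  # BiocManager::install(not.installed)
}
```

Running a check like this on classroom machines ahead of time is one way to surface installation problems early.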
[Figure: RStudio screenshot with panes labeled Run History, R Script, Plots, R Console]
• R Notebooks
  • Creates a document (e.g., web page) interweaving R code and output
  • Requires the knitr package
  • Serves as a study guide linking R code with output
  • Confirms that code is error-free

Probability and Statistics
• Why is this necessary?
  • Summary of gene expression data and sample characteristics
  • Quality assessment of gene expression data
  • Identification of differentially expressed genes
  • What is a p-value? What is an adjusted p-value (or FDR)?
• How do students learn this?
  • Students are surveyed to collect data that will be fun to analyze
  • R is used as a tool for illustrating important concepts in probability and statistics
    • Sample space
    • Probability distributions
    • Central limit theorem
    • P-values

Summary of Student Survey
• Data is analyzed in class or in an assignment

R as a tool: Understanding sample space (all possible poker hands)

R as a tool: Random sampling and the Central Limit Theorem
Sample R code
    # mean of a random sample of size n from an exponential distribution
    get.sample.mean <- function(n) mean(rexp(n))
    x.population = rexp(1000)
    # 5000 sample means, each from a sample of size 10
    x.10 = replicate(5000, get.sample.mean(10))
    …

Bioinformatics and Gene Expression Analysis
• Gene expression and genetics, microarrays
• Processing raw microarray data (CEL files), using the affy package
• Downloading data from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/), using the GEOquery package
• Clustering and heatmaps
• Identification of differentially expressed genes
  • Extends concepts of
    • Linear models to understand limma
    • p-value to understand false discovery rate (FDR)
• Predictive modeling and classification, e.g., k-NN
• Learning by example through case studies…

Sample Assignment
1. Confirm that high EGFR expression is associated with sensitivity to erlotinib.
2. Identify the top 10 probes differentially expressed between sensitive and resistant samples.
3. Generate a heatmap of these 10 probes.
4. Calculate the accuracy of k-NN, using leave-one-out classification.

Heatmap of differentially expressed genes between erlotinib-sensitive and -resistant patients
Sample R code
    # design, col.response, and col.heat are defined elsewhere (not shown)
    # fit a linear model to each probe
    fit = lmFit(GSE31625.X, design)
    # compare resistant vs. sensitive samples
    contrast.matrix <- makeContrasts(resistant - sensitive, levels = design)
    fit = contrasts.fit(fit, contrast.matrix)
    fit = eBayes(fit)
    # top probes (10 by default) with adjusted p-value below 0.05
    tt = topTable(fit, sort.by = "p", p.value = 0.05)
    probes = rownames(tt)
    m = match(probes, rownames(GSE31625.X))
    X = GSE31625.X[m,]
    # scale each probe (row) to mean 0 and standard deviation 1
    X = t(scale(t(X)))
    heatmap(X, ColSideColors = col.response, col = col.heat)

Lessons learned…
• Students enjoyed learning R and using RStudio, and found R to be a very intuitive language
• Students enjoyed the class survey and were able to learn important statistical concepts
• Students liked the real-world examples (cancer, Alzheimer's disease)
• Technical classroom issues – package installation
• Future ideas:
  • Include a small research project where a student analyzes a GEO dataset of their choice
  • Include a lab on the analysis of sequencing data
• Can anything be cut out?
  • Less in-class time on "R basics"
  • Some statistical concepts (e.g., hypothesis testing for a proportion) can be skipped

THANK YOU!
Garrett Dancik, PhD
Eastern Connecticut State University
E-mail: dancikg@easternct.edu
http://bioinformatics.easternct.edu
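
Supplementary sketch (not from the original slides): the last step of the sample assignment, computing the accuracy of k-NN under leave-one-out classification, might look roughly like the following. This is only an illustration under stated assumptions: the data are simulated rather than taken from GSE31625, the choice of k = 3 is arbitrary, and `knn.cv()` from the `class` package (a recommended package shipped with R) is one way to do this, not necessarily the approach used in the course.

```r
library(class)  # provides knn.cv(), which performs leave-one-out k-NN

set.seed(1)

# Simulated stand-in for an expression matrix (NOT the GSE31625 data):
# 10 probes (rows) x 40 samples (columns); resistant samples are shifted by 2.
n.probes <- 10
n.per.group <- 20
response <- factor(rep(c("sensitive", "resistant"), each = n.per.group))
X <- matrix(rnorm(n.probes * 2 * n.per.group), nrow = n.probes)
X[, response == "resistant"] <- X[, response == "resistant"] + 2

# knn.cv() expects samples in rows, so transpose the probe-by-sample matrix.
# Each sample is classified using all of the other samples.
predicted <- knn.cv(train = t(X), cl = response, k = 3)

# Accuracy = proportion of samples whose predicted class matches the true class
accuracy <- mean(predicted == response)
print(table(predicted, response))
print(accuracy)
```

With a real dataset, `t(X)` would be replaced by the transposed matrix of selected probes and `response` by the observed sensitive/resistant labels.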