BIOST 2055 Introductory high-throughput genomic data analysis I: data mining and applications. Spring/2012 Class location: Room A622 Crabtree Hall Computer lab location: Room 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue Class schedule: Wednesday, Friday 9:30-10:45AM Course homepage: use Blackboard Lecturer: George C. Tseng and Yan Lin Office hour: by appointment Office: 303 Parran Hall Email address: ctseng@pitt.edu Telephone number: 412-624-5318 Lecturer’s homepage: http://www.pitt.edu/~ctseng TA: Shaowu Tang Office hour: Monday 1:30-3:00pm and Friday: 12:30pm-2:00pm. Office: Parran Hall 325A (or Parran 309) Email address: biost2055pitt@gmail.com Course Description: This course is a graduate level aimed at introducing modern statistical methods and applications for high-throughput genomic data. The first half of the course contains sessions to introduce fundamental statistical and computational methods. The second half covers topics on various highthroughput genomic applications and related data analysis issues. This course is designed for graduate students or researchers from both quantitative fields (statistics and computer science) and qualitative biological fields. All students require basic statistical training (i.e. two elementary statistics courses, basic calculus and linear algebra) and programming proficiency (R programming is required for homework and final project). The visions of the course include: (1) to motivate students from quantitative fields into genomic research (2) to familiarize students from biological fields with related statistical methods (3) to promote inter-disciplinary collaboration habits in class. Tentative Schedule of Sessions and Assignments: The first 18 sessions (75 minutes each session) are designed to introduce fundamental statistical methods used in genomic data analysis. Then another 10 sessions are devoted to selected special topics of highthroughput genomic data, some of which will be taught by invited guest speakers. The last two sessions are for student presentations on their final projects. Part I: Fundamental statistical methods 1/4 Introduction of the entire course and basic molecular biology and genetics. (George & Yan) 1/6 Introduction More on molecular biology and biological database. (Yan) 1/11 Introduction microarray and next-generation sequencing (NGS) technology. (Yan) 1/13 Data preprocessing 1/18 1/20 1/25 1/27 2/1 2/3 2/8 2/10 2/15 2/17 2/22 2/24 2/29 3/2 Data summarization, data transformation, data filtering and missing value imputation. (Yan) Detecting differentially expressed (DE) genes Empirical Bayes. Comparative analysis of two or more conditions; permutation methods; SAM; control false discovery rate (FDR). (Yan) (Lab1) Introduction Bioconductor and NCBI database. Up-stream analysis analysis on real Affymetrix and cDNA array data sets. Homework 1 distributed. (Yan) Supervised learning (classification) basic concepts in machine learning; feature selection, overfitting and cross-validation. sensitivity and specificity. (George) Supervised learning (classification) Bayes classifier; popular machine learning methods: Logistic regression, LDA/QDA/Fisher’s criterion, KNN, CART, bagging, boosting, random forest, SVM, ANN, nearest shrunken centroid. (George) Supervised learning (classification) cont’d (George) (Lab2) DE analysis and classification Data analysis on detecting DE genes and classification problem. Homework 2 distributed. (George) Dimension reduction data visualization; principal component analysis (PCA); multidimensional scaling (MDS). (Yan) Unsupervised learning (clustering) hierarchical clustering, K-means, selforganizing maps (SOM), model-based clustering; estimate number of clusters. (Yan) Unsupervised learning (clustering) tight clustering; penalized and weighted Kmeans; cluster stability and tightness; bi-clustering. (Yan) (Lab3) Dimension reduction and Clustering analysis Homework 3 distributed. (Yan) Pathway analysis microarray and gene annotation databases (GO, KEGG and more); enrichment analysis; motif finding. (George) Genetic regulatory network Genomic regulatory network inference using microarray data (Boolean network and Baysian network). (George) Horizontal genomic meta-analysis microarray meta-analysis (random effects model, Fisher’s method, maxP, rank-based methods etc). (George) (Lab4) Genomic meta-analysis, gene annotation and pathway analysis Homework 4 distributed. (George) Spring break 3/7 & 3/9 Part II: Selected topics 3/14 Copy number variation (CNV) and loss of heterozygosity (LOH) array CGH, SNP array (Eleanor) 3/16 Next generation sequencing I (Dr. Wei Chen) 3/21 Genome-wide association (GWAS) (Eleanor Feingold) 3/23 Next generation sequencing II (Dr. Wei Chen) 3/28 Copy number variation (CNV) and loss of heterozygosity (LOH) hidden Markov model (HMM), change-point model (George) 3/30 Gene regulation and miRNA regulation (Dr. Takis Benos) 4/4 Gene regulation and miRNA regulation (Dr. Takis Benos) 4/6 DNA-protein interaction (ChIP-chip, ChIP-seq and ENCODE) (Dr. Xinghua Lu) 4/11 Epigenomics methylation, histone modification, methylation array, bisulfite sequencing (Dr. Xinghua Lu) 4/13 4/18 4/20 Vertical integrative analysis integration of multi-dimensional genomic experimental data (transcriptome, microRNA, methylation, ChIP, SNP, phenome) (George) Proteomics Introduction to 2D-gel, mass spectrometry, protein array and proteomics. (George) Student final project presentation Student final project presentation Handout: Course information and handouts will be posted to the Blackboard. You are encouraged to print out the slides before each lecture. Computer Lab: There will be four lab sessions for hand-on experiences on programming and software usage during the first half of the course. R is the major language used and ability of programming in R is a prerequisite. Four homework sets are distributed after each computer lab. Final project: Final projects are conducted by groups of 3 students. We will encourage/enforce mixture of quantitative (statisticians) and qualitative (biologists) students in the final projects. The lecturer will provide a list of topics/references at the beginning of the semester and the major goal is to apply statistical techniques learned in class to analyze the real data sets and solve the problems. A presentation and a final report are expected from each group in the end of the semester. Grade: Homework 1~4: 52% Final project: 48% (mid-term progress report due 3/14 for 8%; final presentation for 20%; final paper due on 4/22 for 20%)