(Lab4) Genomic meta-analysis, gene annotation and pathway analysis

BIOST 2055
Introductory high-throughput genomic data analysis I: data mining and
applications.
Spring/2013
Class location: Room A622 Crabtree Hall
Computer lab location: Room 3073 (3rd floor), Department of Computational Biology, BST3,
3501 Fifth Avenue
Class schedule: Wednesday, Friday 9:30-10:45AM
Course homepage: use Blackboard
Lecturers: George C. Tseng, Yan Lin and Wei Chen
Office hours: by appointment
Office: 303 Parran Hall (George Tseng)
Email address: ctseng@pitt.edu
Telephone number: 412-624-5318
Lecturer’s homepage: http://www.pitt.edu/~ctseng (George Tseng)
TA: Serena Liao
Office hours: TBD
Office: TBD
Email address:
Course Description:
This course is designed for graduate students (both doctoral and master's students) and researchers
from both quantitative fields (statistics, information and computer science) and qualitative biological
fields who are interested in high-throughput genomic data analysis. The course aims to introduce
modern statistical and computational methods in high-throughput genomic data analysis. It focuses
mainly on methods and applications, while in-depth methodological and theoretical details
are left to the second course in the fall (BIOST 2078 Introductory high-throughput genomic data
analysis II: theories and algorithms). The first half of the course covers fundamental statistical
and computational methods applicable to virtually all kinds of high-throughput genomic data. The
second half covers selected topics that are subject to change every year. Students are required to
have basic statistical training (i.e. two elementary statistics courses, basic calculus and linear algebra)
and basic programming proficiency (R programming is required for the homework and final project and
can be learned during the class). The goals of the course are: (1) to motivate students from
quantitative fields to engage in genomic research; (2) to give students from biological fields a deeper
understanding of statistical methods; and (3) to promote an atmosphere of interdisciplinary
collaboration in class.
Tentative Schedule of Sessions and Assignments:
The first 18 sessions (75 minutes each) are designed to introduce fundamental statistical methods
used in genomic data analysis. An additional 8 sessions are devoted to selected special topics in the field.
The last two sessions are for student presentations of their final projects.
Part I: Fundamental statistical methods
1/9
Introduction to the course; basic molecular biology and genetics. (Lin)
1/11
Introduction to microarray and next-generation sequencing (NGS) technology.
(Lin)
1/16
Data preprocessing
Data summarization, data transformation, data filtering and missing value
imputation. (Lin)
1/18
Detecting differentially expressed (DE) genes: empirical Bayes; comparative
analysis of two or more conditions; permutation methods; SAM; controlling the false
discovery rate (FDR). (Lin)
1/23
(Lab1) Introduction to Bioconductor and NCBI databases.
Upstream analysis of real Affymetrix and cDNA array data sets.
Homework 1 distributed. (Lin)
1/25
Supervised learning (classification): basic concepts in machine learning; feature
selection, overfitting and cross-validation, sensitivity and specificity. (Tseng)
1/30
Supervised learning (classification): Bayes classifier; popular machine learning
methods: logistic regression, LDA/QDA/Fisher’s criterion, KNN, CART,
bagging, boosting, random forest, SVM, ANN, nearest shrunken centroid.
(Tseng)
2/1
Supervised learning (classification) cont’d (Tseng)
2/6
(Lab2) DE analysis and classification: data analysis for detecting DE genes and
for classification problems.
Homework 2 distributed. (Tseng)
2/8
Dimension reduction: data visualization; principal component analysis (PCA);
multidimensional scaling (MDS). (Lin)
2/13
Unsupervised learning (clustering): hierarchical clustering, K-means, self-organizing maps (SOM), model-based clustering; estimating the number of clusters.
(Lin)
2/15
Unsupervised learning (clustering): tight clustering; penalized and weighted K-means; cluster stability and tightness; bi-clustering. (Lin)
2/20
(Lab3) Dimension reduction and clustering analysis
Homework 3 distributed. (Lin)
2/22
Pathway analysis: microarray and gene annotation databases (GO, KEGG and
more); enrichment analysis; motif finding. (Tseng)
2/27
Genetic regulatory network: genomic regulatory network inference; Bayesian
network, hidden Markov model and general network analysis. (Tseng)
3/1
Horizontal genomic meta-analysis: microarray meta-analysis (random effects
model, Fisher’s method, maxP, rank-based methods, etc.). (Tseng)
3/6
Vertical integrative analysis (Tseng)
3/8
(Lab4) Genomic meta-analysis, gene annotation and pathway analysis
Homework 4 distributed. (Tseng)
3/13 & 3/15
Spring break
Part II: Selected topics
3/20
Copy number variation (CNV) and loss of heterozygosity (LOH): array CGH,
SNP array. (Chen)
3/22
Genome-wide association (GWAS) (Chen)
3/27
Next-generation sequencing I: introduction to the technology. (Chen)
3/29
Next-generation sequencing II: DNA-seq analysis. (Chen)
4/3
Next-generation sequencing III: RNA-seq analysis. (Chen)
4/5
Next-generation sequencing IV: ChIP-seq analysis; bisulfite sequencing and
methylation array. (Chen)
4/10
Gene regulation and miRNA regulation (guest lecturer: Dr. Takis Benos)
4/12
Gene regulation and miRNA regulation (guest lecturer: Dr. Takis Benos)
4/17
Student final project presentation
4/19
Student final project presentation
Handout:
Course information and handouts will be posted on Blackboard. Students are encouraged to print out
the slides before each lecture.
Computer Lab:
There will be four lab sessions for hands-on experience with programming and software usage during the
first half of the course. R is the main language used, and the ability to program in R is a prerequisite
(students who are not familiar with R programming before the semester begins are
expected to catch up in the first few weeks). A homework set is distributed after each
computer lab, for four sets in total.
Final project:
Final projects are conducted in groups of 3 students. We will encourage/enforce a mix of quantitative
(statisticians) and qualitative (biologists) students in each group. The lecturer will provide a list of
topics/references at the beginning of the semester. The major goal is to apply statistical techniques
learned in class to analyze real data sets and solve real-world problems. A presentation and a final report
are expected from each group at the end of the semester.
Grade:
Homework 1-4: 52%
Final project: 48% (mid-term progress report due 3/17 for 8%; final presentation for 20%; final paper due
on 4/21 for 20%)
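As a rough illustration of how the weights above combine, the sketch below computes an overall course percentage from component scores. It assumes every component is scored on a 0-100 scale and that the four homework sets share the 52% equally (13% each); the syllabus states neither point, and the function and constant names are hypothetical.

```python
# Weights taken from the grade breakdown above (percent of the final grade).
HOMEWORK_TOTAL_WEIGHT = 52.0    # Homework 1-4 combined
PROGRESS_REPORT_WEIGHT = 8.0    # mid-term progress report
PRESENTATION_WEIGHT = 20.0      # final presentation
FINAL_PAPER_WEIGHT = 20.0       # final paper


def overall_grade(homework_scores, progress_report, presentation, final_paper):
    """Return the weighted course percentage.

    All scores are assumed to be on a 0-100 scale. Splitting the homework
    weight equally across the four sets is an assumption, not course policy.
    """
    if len(homework_scores) != 4:
        raise ValueError("expected scores for exactly four homework sets")
    per_homework_weight = HOMEWORK_TOTAL_WEIGHT / len(homework_scores)
    total = sum(score * per_homework_weight for score in homework_scores)
    total += progress_report * PROGRESS_REPORT_WEIGHT
    total += presentation * PRESENTATION_WEIGHT
    total += final_paper * FINAL_PAPER_WEIGHT
    return total / 100.0


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(overall_grade([90, 85, 80, 95], progress_report=100,
                        presentation=90, final_paper=88))
```

The three final-project components sum to 48%, so together with the homework the weights total 100%.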