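Dragon Star Course: Cancer Bioinformatics Hands-on Session
Cao Sha (曹莎)
Email: scaorobin@sina.com

Course outline
• Introduction to the data types; a brief introduction to R
• Correlation between gene expression data and protein expression data
• Testing for differential expression, false discovery rate (FDR), batch effects
• Gene mutation data and pathway enrichment analysis
• Correlation of gene expression data and biclustering analysis
• Integration of data types: gene expression data with metabolic profiling data; gene expression data with epigenetic data

Introduction to data types: gene expression data
• Microarray
 – High-throughput measurement of tens of thousands of probes
 – Relatively low precision
• How to obtain it?
 – GEO DataSets, ArrayExpress, TCGA
• What information do these data contain?
• What to check before using microarray data
 – Organism
 – Experimental design
 – Sample list (sample distribution, sample size)
 – Platform
• Important!!!!

Introduction to data types: gene expression data
• RNA-seq
• How to obtain it?
 – TCGA, SRA
• What information do these data contain?

Data levels and data types
• https://tcgadata.nci.nih.gov/tcga/tcgaDataType.jsp

Introduction to data types: genomic data
• Somatic point mutation
• How to obtain it?
 – TCGA, GEO, SRA
• What do these data measure, and what information do they contain?

Introduction to data types: epigenetic data
• DNA methylation data
• How to obtain it?
 – TCGA, GEO DataSets
• What do these data measure, and what information do they contain?

Introduction to data types: epigenetic data
• Histone modification data
• How to obtain it?
 – Very limited
• What do these data measure, and what information do they contain?

Introduction to data types: proteomics data
• Protein array
• How to obtain it?
 – TCGA, literature search
• What do these data measure, and what information do they contain?

Introduction to data types: metabolomics data
• Metabolic profiling
• How to obtain it?
 – Literature search
• What do these data measure, and what information do they contain?
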
A brief introduction to R
• Simple data processing
• Statistical tests
• Statistical modeling (regression, matrix factorization, etc.)
• Visualization

Print
• print(matrix(c(1,2,3,4), 2, 2))
• print(list("a","b","c"))

Basic functions
• ls()       # list the objects in the workspace
• rm()       # remove objects from the workspace
• c()        # create a vector; c() is a function
• mode()     # storage mode of an object
• class()    # class of an object
• mean(x)
• median(x)
• sd(x)
• var(x)
• cor(x, y)  # correlation
• cov(x, y)  # covariance

Creating sequences
• 1:5
• 5:1
• seq(from=0, to=20, by=5)
• 1.1:10.1
• 1.1:10.3
• a<-rep(0,3)
• rep(c(1,2,a),2)

Basic calculations
• +
• -
• *
• /
• %%
• ^
• %*%  # matrix multiplication
• log(x)
• sin(x)
• exp()
• exp(1)  # the constant e (R has no built-in variable named e)
• pi
• Inf
• NA

Data mode: physical type
    > mode(3.1415)              # Mode of a number
    [1] "numeric"
    > mode(c(2.7182, 3.1415))   # Mode of a vector of numbers
    [1] "numeric"
    > mode("Moe")               # Mode of a character string
    [1] "character"

Data class: abstract type
• scalar
• array (vector)
• matrix
 – From array to matrix
• factor (looks like a vector but has special properties; used for categorical variables or grouping)
• data.frame

data.frame
• Same data mode within each column
• Unique row/column names (rownames, colnames)
• One row of a data.frame is a data.frame
• as.data.frame(****)

matrix
• Same data mode in the whole matrix
• Can have repeated row/column names
• One row of a matrix is an array (vector)
• as.matrix(****)

Data types handled in this course
• Clinical data -> data.frame
• Experimental data -> data.frame or matrix
 – Microarray data
 – RNA-seq data
 – Somatic mutation data
 – Protein array
 – DNA methylation data

Data combining
• cbind
 – Combine data by column
• rbind
 – Combine data by row
• Example:
    a<-matrix(0,2,2)
    b<-matrix(1,2,2)
    cbind(a,b)
    rbind(a,b)

length
• a<-c(1:5)
• length(a)

apply
• Apply functions over array margins
• apply(DATA, MARGIN, FUNCTION, ...)
 – MARGIN = 1 for rows; 2 for columns
• Example:
    m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
    apply(m, 1, mean)   # mean of each row
    apply(m, 2, mean)   # mean of each column

Finding patterns
• The which command
• which(****): **** should be a logical expression
• which(****) returns the indices of the TRUE elements of that logical expression
• Example:
    x<- floor(10*runif(10))
    x
    which(x<5)
    x[which(x<5)]

For loop
• For loop: http://en.wikipedia.org/wiki/For_loop
• In computer science, a for loop is a programming language statement that allows code to be executed repeatedly.
• Question: calculate the sum of all the values in the vector x<- floor(10*runif(10))

For loop
• A real computer program!
• Example:
    for(i in 1:100){
      print("Hello world!")
      print(i*i)
    }

For loop
• for(*** in ***){}
• for(VARIABLE in TARGETSET){}
• for(i in 1:100){}
• Answer to the question above:
    x <- floor(10*runif(10))
    total_x <- 0
    for(i in seq_along(x)) {        # seq_along(x) is 1:length(x), but safe when x is empty
      print(i)
      print(x[i])
      total_x <- total_x + x[i]     # running sum
    }

Working directory
• getwd()
• setwd("****")
• list.files()
• load("****")
• save.image("****")

Worked example
• Extract all stage I and stage II samples from the colon cancer clinical information

Steps
• Load the data
• Find all the stage labels present in the data
• Use a for loop to pull out all stage I and stage II samples and combine them into one data set

R code
    setwd("D:\\DragonStar\\dragon_star_data\\TCGA_colon_cancer_data")
    rm(list=ls())                       # clear the workspace
    list.files()
    load("COAD_clinical_data.RData")    # provides COAD_clinical_data
    data_clinical<-COAD_clinical_data
    # convert the stage column from factor to character so it can be compared with ==
    data_clinical$pathologic_stage<-as.character(data_clinical$pathologic_stage)
    head(data_clinical)
    table(data_clinical$pathologic_stage)
    all_stages<-unique(sort(data_clinical$pathologic_stage))
    # select the stage I and II labels by position (positions depend on the labels present in the data) ...
    all_I_II_stages<-all_stages[c(2,3,4,5,6,7)]
    # ... or name them explicitly, which does not depend on the sort order
    all_I_II_stages<-c("Stage I","Stage IA","Stage II","Stage IIA","Stage IIB","Stage IIC")
    data_stage_I_II<-c()
    for(i in 1:length(all_I_II_stages)){
      # rows whose stage equals the i-th label
      id<-which(data_clinical$pathologic_stage==all_I_II_stages[i])
      data_stage_I_II<-rbind(data_stage_I_II,data_clinical[id,])
    }
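
The loop in the worked example can also be written as a single vectorized subset using %in%. The sketch below is not part of the original slides. It assumes a small made-up data frame, toy_clinical, standing in for COAD_clinical_data (the patient IDs and stage values are hypothetical), and checks that the loop and the vectorized subset pick out the same samples.

```r
# Toy stand-in for COAD_clinical_data; all values here are hypothetical
toy_clinical <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4", "P5"),
  pathologic_stage = c("Stage I", "Stage IIA", "Stage III", "Stage IV", "Stage II"),
  stringsAsFactors = FALSE
)
early_stages <- c("Stage I", "Stage IA", "Stage II", "Stage IIA", "Stage IIB", "Stage IIC")

# Loop version, following the slides: collect matching rows one stage label at a time
loop_result <- c()
for (i in seq_along(early_stages)) {
  id <- which(toy_clinical$pathologic_stage == early_stages[i])
  loop_result <- rbind(loop_result, toy_clinical[id, ])
}

# Vectorized version: keep rows whose stage is any of the labels
vec_result <- toy_clinical[toy_clinical$pathologic_stage %in% early_stages, ]

# Both versions select the same patients; only the row order may differ
print(sort(loop_result$patient_id))
print(sort(vec_result$patient_id))
```

The loop returns rows grouped by stage label, while the %in% subset keeps the original row order, so compare the two by sorted ID rather than row by row.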