2014

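Dragon Star Course: Cancer Bioinformatics Hands-on Lab
Cao Sha (曹莎)
Email: scaorobin@sina.com
Course schedule
• Overview of the data types; a brief introduction to R
• Correlation between gene expression data and protein expression data
• Testing for differential expression, false discovery rate (FDR) control, and batch effects
• Gene mutation data and pathway enrichment analysis
• Correlation of gene expression data and biclustering analysis
• Data integration: gene expression data with metabolic profiling data; gene expression data with epigenetic data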
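Data types: gene expression data
• Microarray
– High throughput: measures tens of thousands of probes
– Relatively low precision
• How to obtain?
– GEO DataSets, ArrayExpress, TCGA
• What information do these data contain?
What to know when using microarray data
• Organism
• Experimental design
• Sample list (sample distribution, sample size)
• Platform
• Important!!!!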
Data types: gene expression data
• RNA-seq
• How to obtain?
– TCGA, SRA
• What information do these data contain?
Data levels and data types
• https://tcgadata.nci.nih.gov/tcga/tcgaDataType.jsp
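Data types: genomic data
• Somatic point mutation
• How to obtain?
– TCGA, GEO, SRA
• What do these data measure? What information do they contain?
Data types: epigenetic data
• DNA methylation data
• How to obtain?
– TCGA, GEO DataSets
• What do these data measure? What information do they contain?
Data types: epigenetic data
• Histone modification data
• How to obtain?
– Very limited
• What do these data measure? What information do they contain?
Data types: proteomics data
• Protein array
• How to obtain?
– TCGA, literature search
• What do these data measure? What information do they contain?
Data types: metabolomics data
• Metabolic profiling
• How to obtain?
– Literature search
• What do these data measure? What information do they contain?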
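A brief introduction to R
• Basic data handling
• Statistical tests
• Statistical modeling (regression, matrix factorization, etc.)
• Visualization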
Print
• print(matrix(c(1,2,3,4), 2, 2))
• print(list("a","b","c"))
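Basic functions
• ls() # list the objects in the workspace
• rm() # remove objects from the workspace
• c() # create a vector; c() is itself a function
• mode() # storage mode of an object
• class() # class of an object
• mean(x)
• median(x)
• sd(x)
• var(x)
• cor(x, y) # correlation
• cov(x, y) # covariance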
Creating Sequences
• 1:5
• 5:1
• seq(from=0, to=20, by=5)
• 1.1:10.1
• 1.1:10.3
• a<-rep(0,3)
• rep(c(1,2,a),2)
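The colon operator starts at its left operand and steps by 1 until it would pass the right operand, so the two non-integer examples above should give the same sequence. A minimal sketch to check this, and to show how rep() treats a vector argument (the names s1, s2, and r are illustrative only):

```r
# Colon with non-integer endpoints: start at 1.1 and step by 1
s1 <- 1.1:10.1
s2 <- 1.1:10.3
print(s1)
print(isTRUE(all.equal(s1, s2)))   # both stop at 10.1

# c() flattens a into the vector, then rep() repeats the whole vector twice
a <- rep(0, 3)
r <- rep(c(1, 2, a), 2)
print(r)
print(length(r))
```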
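Basic calculations
• +
• -
• *
• /
• %% # remainder (modulo)
• ^ # exponentiation
• %*% # matrix multiply
• log(x)
• sin(x)
• exp(x)
• exp(1) # the constant e; R has no built-in variable named e
• pi
• Inf
• NA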
Data mode: Physical Type
> mode(3.1415) # Mode of a number
[1] "numeric"
> mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
[1] "numeric"
> mode("Moe") # Mode of a character string
[1] "character"
Data Class: Abstract type
• scalar
• array (vector)
• matrix
• From array to matrix
• factor (looks like a vector, but has special properties; used for categorical variables or grouping)
• data.frame
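To make the difference between mode (physical type) and class (abstract type) concrete, here is a small sketch. The objects m and f and the labels "tumor" and "normal" are made up for illustration.

```r
# A numeric matrix and a factor built from made-up group labels
m <- matrix(c(1, 2, 3, 4), 2, 2)
f <- factor(c("tumor", "normal", "tumor"))

print(mode(m))               # physical type of the matrix
print(inherits(m, "matrix")) # abstract type: it is a matrix
print(mode(f))               # a factor is stored as integer codes
print(class(f))              # but its class is "factor"
print(levels(f))             # the distinct categories
```

A factor reports mode "numeric" because R stores it as integer codes plus a table of levels, which is why it looks like a vector but behaves differently.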
data.frame
• Same data mode within each column
• Unique row/column names (rownames, colnames)
• One row of a data.frame is a data.frame
• as.data.frame(****)
matrix
• Same data mode in the whole matrix
• Can have repeated row/column names
• One row of a matrix is an array (vector)
• as.matrix(****)
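The contrast above can be checked directly. This is a sketch with made-up sample IDs and ages; df, mx, and mn are illustrative names.

```r
# A data.frame can hold a character column next to a numeric one
df <- data.frame(sample = c("s1", "s2"), age = c(61, 47))

# Converting to a matrix forces a single mode for every cell
mx <- as.matrix(df)
print(mode(mx))

# One row of a data.frame is still a data.frame
print(is.data.frame(df[1, ]))

# One row of a matrix is a plain vector, and row names may repeat
mn <- matrix(1:4, 2, 2)
rownames(mn) <- c("g1", "g1")
print(is.vector(mn[1, ]))
```

Because as.matrix() coerces everything to one mode, converting a mixed clinical table to a matrix turns the numeric columns into character strings.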
Data types handled in this course
• Clinical data -> data.frame
• Experimental data -> data.frame or matrix
– Microarray data
– RNA seq data
– Somatic mutation data
– Protein array
– DNA methylation data
Data combining
• cbind
– Combine data by column
• rbind
– Combine data by row
• Eg.
a<-matrix(0,2,2)
b<-matrix(1,2,2)
cbind(a,b)
rbind(a,b)
length
• a<-c(1:5)
• length(a)
apply
• Apply Functions Over Array Margins
• apply(DATA, MARGIN, FUNCTION, ...)
– MARGIN= 1 for rows; 2 for columns
• Eg.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
apply(m, 2, mean)
Finding patterns
• The which command
• which(****): **** should be a logical expression
• which(****) returns the indices of the TRUE elements of that expression
• E.g.
x<- floor(10*runif(10))
x
which(x<5)
x[which(x<5)]
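R also accepts the logical vector itself as an index, so when there are no missing values x[x<5] selects the same elements as x[which(x<5)]. A minimal sketch; the seed is an assumption added only to make the run reproducible.

```r
set.seed(1)                      # assumed seed, for reproducibility only
x <- floor(10 * runif(10))

idx <- which(x < 5)              # positions where the condition is TRUE
print(idx)
print(x[idx])                    # values picked by position
print(x[x < 5])                  # values picked by the logical vector
print(identical(x[idx], x[x < 5]))
```

which() is still the one to use when the positions themselves are needed, for example to pick the matching rows of another table.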
For loop
For loop: http://en.wikipedia.org/wiki/For_loop
In computer science, a for loop is a programming language statement that allows code to be executed repeatedly.
Question:
Calculate the sum of all the values in the vector x<- floor(10*runif(10))
For loop
Real computer program!
Eg.
for(i in 1:100){
print("Hello world!")
print(i*i)
}
For loop
for(*** in ***){}
for(VARIABLE in TARGETSET){}
for(i in 1:100){}
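# Sum the elements of x one at a time
x <- floor(10*runif(10))
total_x <- 0
for(i in seq_along(x))   # seq_along(x) is 1, 2, ..., length(x)
{
  print(i)      # current index
  print(x[i])   # current value
  total_x <- total_x + x[i]
}
total_x   # the sum of all values in x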
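As a sanity check, the loop total can be compared with R's built-in sum(). This sketch repeats the loop without the print calls; the seed is an assumption added only for reproducibility.

```r
set.seed(1)                      # assumed seed, for reproducibility only
x <- floor(10 * runif(10))

# Accumulate the total element by element
total_x <- 0
for (i in seq_along(x)) {
  total_x <- total_x + x[i]
}

print(total_x)
print(total_x == sum(x))         # the loop and sum() should agree
```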
Working directory
• getwd()
• setwd("****")
• list.files()
• load("****")
• save.image("****")
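Example
• Extract all stage I and stage II samples from the colon cancer clinical information
Steps
• Load the data
• Find all the stage values present in the data
• Use a for loop to extract all stage I and stage II samples, and combine them into one data set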
R code
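setwd("D:\\DragonStar\\dragon_star_data\\TCGA_colon_cancer_data")
rm(list=ls())   # clear the workspace
list.files()    # check that the data file is present
load("COAD_clinical_data.RData")
data_clinical<-COAD_clinical_data
# Store the stage as character so it can be compared with strings
data_clinical$pathologic_stage<-as.character(data_clinical$pathologic_stage)
head(data_clinical)
table(data_clinical$pathologic_stage)   # number of samples per stage
all_stages<-unique(sort(data_clinical$pathologic_stage))
# Either pick the stage I and II labels by position in all_stages ...
all_I_II_stages<-all_stages[c(2,3,4,5,6,7)]
# ... or list them explicitly (this overwrites the line above and does not depend on ordering)
all_I_II_stages<-c("Stage I","Stage IA","Stage II","Stage IIA","Stage IIB","Stage IIC")
# Collect the matching rows one stage at a time
data_stage_I_II<-c()
for(i in seq_along(all_I_II_stages)){
  id<-which(data_clinical$pathologic_stage==all_I_II_stages[i])
  data_stage_I_II<-rbind(data_stage_I_II,data_clinical[id,])
}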
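The same selection can also be written without a loop by using %in%. This is an alternative sketch, not the approach the slides use. It runs on a made-up stand-in table (toy_clinical, with hypothetical patient IDs) so that it works without the course data file.

```r
# Toy stand-in for the clinical table; all values are made up
toy_clinical <- data.frame(
  patient = paste0("P", 1:6),
  pathologic_stage = c("Stage I", "Stage IIA", "Stage III",
                       "Stage IV", "Stage II", "Stage IIIB"),
  stringsAsFactors = FALSE
)

wanted <- c("Stage I", "Stage IA", "Stage II",
            "Stage IIA", "Stage IIB", "Stage IIC")

# %in% tests every row against the whole set of wanted labels at once
toy_stage_I_II <- toy_clinical[toy_clinical$pathologic_stage %in% wanted, ]
print(toy_stage_I_II)
```

One difference from the loop: %in% keeps the rows in their original order, while the loop groups them stage by stage.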