Lab 1_F2013.docx

Lab 1:
You need to download R. Go to: http://cran.r-project.org/
Choose your computer and operating system, then download and install R. The binary versions are the
fastest to install. Also install as many of the packages as you can. R will ask for a CRAN site (mirror)
that is close to you. I always use USA (WA) as my CRAN site.
Also, to get started on the things we will need in lab, install the Bioconductor packages:
YOU NEED TO LOAD BIOCONDUCTOR TO GET SOME OF THE PACKAGES WE WILL NEED.
Go to http://www.bioconductor.org/ and download biocLite from the website under Install. We will need
several packages from this site. (In current Bioconductor releases the installer is the BiocManager
package, which replaces biocLite.)
Before we get into actually analyzing genomic/proteomic/RNA-seq data, let us first learn to find our
way around R.
Here are some simple things to do in R to get acclimatized before we run packaged programs:
1. Data input:
a. Reading from the keyboard (you physically typing in data)
b. Reading from an existing file (I will use csv or txt files mostly in this class).
2. Doing descriptive statistics:
a. Mean
b. Median
c. SD
d. Variance
e. Quartiles
f. Confidence intervals
g. Simple t-test (paired t, pooled t)
h. Basic regression
i. Basic ANOVA
3. Simple graphics
a. Histogram
b. Pie charts
c. Scatter plot.
Data:
Data in R can be of various types.
1. Vectors: one dimensional data set.
Example:
x1=c(1,2,3,5,7)
x2=c(2,7,8,4,1)
2. Matrix: 2 dimensional data, has rows and columns.
Example: m1=cbind(x1,x2)
OR
m2=matrix(c(1,2,3,5,7,2,7,8,4,1),nrow=5,ncol=2)
3. Data frames (the one we will use the most): idea of a data set with various elements.
mydata= data.frame(x1,x2,m1,m2)
4. Arrays: collection of matrices, multidimensional matrices
5. List: a collection of data frames, matrices, vectors etc. Variables can be different sizes and not
all numerical.
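Arrays and lists have no example in the list above, so here is a minimal sketch that reuses x1, x2 and m1. The names a1 and l1, and the label "lab 1", are made up for illustration.

```r
# Vectors and matrix from the examples above
x1 <- c(1, 2, 3, 5, 7)
x2 <- c(2, 7, 8, 4, 1)
m1 <- cbind(x1, x2)

# Array: two 5 x 2 matrices stacked together (rows, columns, layers)
a1 <- array(c(x1, x2, x2, x1), dim = c(5, 2, 2))
dim(a1)         # dimensions of the array
a1[ , , 2]      # the second 5 x 2 layer

# List: components can have different sizes and types
l1 <- list(numbers = x1, mat = m1, label = "lab 1")
str(l1)             # compact overview of what the list holds
l1$label            # access a component by name
l1[["mat"]][2, ]    # second row of the matrix stored in the list
```

Use str() on any object when you are not sure what type of data you are holding.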
Inputting data from a file: I will show you how to read from a csv file (which will be common for us).
Reading data from a csv file.
Put the data in the correct directory and make sure that is your working directory. I created a directory
called myRfolder on my C drive and will use it as my default directory. I put in a data set called
classdata.csv (class data with the students' height, age, shoe size, eye color and hair color).
Use the commands:
>setwd("/myRfolder")
>mydata<-read.table("classdata.csv",header=TRUE,sep=",",row.names="id",na.strings=" ")
> mydata
   height age shoe eyecolor   hair
1      62  18  6.5    brown blonde
2      69  19 11.0    hazel auburn
3      71  20 13.0    brown  brown
4      69  20  6.5    hazel blonde
5      66  20  8.0     blue blonde
6      63  18  7.5    brown  brown
7      66  20  9.5     blue blonde
8      65  20  8.5     blue blonde
9      67  19  9.5    brown  brown
10     67  23  8.5    brown  brown
11     67  19  9.5    green  brown
12     69  NA  8.5    hazel blonde
13     66  20 10.0     blue blonde
14     70  20  9.5    hazel  brown
15     69  20  9.5    hazel  brown
16     72  20 11.0     blue blonde
17     70  19 10.5     blue  brown
18     71  24 12.5     blue  brown
19     63  19  8.0    green blonde
20     65  44  7.0     blue  brown
21     68  22  9.5    hazel  brown
22     69  20 10.0    hazel  brown
23     70  35 10.0    brown  brown
24     70  22 10.0     blue  brown
25     64  19  7.5     blue  brown
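If you do not have classdata.csv at hand, the sketch below lets you try the same read step on a small stand-in file. It writes three rows to a temporary file (the file and the name small are assumptions for illustration) and reads them back with read.csv(), which is read.table() with header=TRUE and sep="," as defaults.

```r
# Write a three-row CSV to a temporary file (stand-in for classdata.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,height,age,shoe,eyecolor,hair",
             "1,62,18,6.5,brown,blonde",
             "2,69,19,11.0,hazel,auburn",
             "12,69, ,8.5,hazel,blonde"), tmp)

# Same options as in the lab command: id becomes the row names,
# and a single blank is read as a missing value
small <- read.csv(tmp, row.names = "id", na.strings = " ")
small
str(small)          # column types R chose
is.na(small$age)    # TRUE where age is missing
```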
The data is saved in R as a data frame called mydata. It lives only in the current R session and will
be lost when we quit R unless we save the workspace.
A data frame can be indexed like a matrix, by row and column. So if we want a summary measure for
height, all we have to do is refer to it as mydata[ ,1], the first column of the data, or use the $ sign
and say mydata$height
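A few more ways of picking rows and columns, sketched on the first four rows of the class data typed in by hand (the name df is an assumption for illustration):

```r
# First four rows of the class data, numeric columns only
df <- data.frame(height = c(62, 69, 71, 69),
                 age    = c(18, 19, 20, 20),
                 shoe   = c(6.5, 11.0, 13.0, 6.5))

df[ , 1]                     # first column, matrix style
df$height                    # same column by name
df[["height"]]               # same column, list style
df[2, ]                      # second row, all columns
df[df$age >= 20, ]           # rows where age is at least 20
df[ , c("height", "shoe")]   # two columns by name
```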
Simple summaries:
>summary(mydata[,1])
Min. 1st Qu. Median Mean 3rd Qu. Max.
62.00 66.00 68.00 67.52 70.00 72.00
> summary(mydata$height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
62.00 66.00 68.00 67.52 70.00 72.00
> mean(mydata$height)
[1] 67.52
> sd(mydata$height)
[1] 2.77068
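The median, variance, quartiles and a confidence interval for the mean are obtained in the same way. The sketch below types the 25 heights from the table into a vector so that it runs on its own. The confidence interval is taken from t.test(), which is one convenient way to get it, and is then recomputed by hand as a check.

```r
height <- c(62, 69, 71, 69, 66, 63, 66, 65, 67, 67, 67, 69, 66,
            70, 69, 72, 70, 71, 63, 65, 68, 69, 70, 70, 64)

median(height)
var(height)                      # variance; equals sd(height)^2
quantile(height)                 # minimum, quartiles, maximum
quantile(height, c(0.1, 0.9))    # any other percentiles

# 95% confidence interval for the mean, based on the t distribution
ci <- t.test(height)$conf.int
ci

# The same interval by hand: mean +/- t quantile * standard error
n  <- length(height)
se <- sd(height) / sqrt(n)
ci_manual <- mean(height) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci_manual

# Missing values (like the NA in age) must be dropped explicitly
mean(c(18, NA, 20), na.rm = TRUE)
```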
SIMPLE GRAPHICS
> barplot(table(mydata$hair))
> pie(table(mydata$hair))
> hist(mydata$shoe)
> plot(mydata$shoe,mydata$height)
> table(mydata$hair,mydata$eye)
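qqnorm() draws a normal Q-Q plot. It is part of the stats package, which R loads by default, so the
library(stats) call below is optional. The car package is only needed if you want its qqPlot() function.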
> library(stats)
> qqnorm(mydata$shoe)
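The plots above use R's default titles and axis labels. A minimal sketch of the same plots with labels added, using the shoe sizes and heights from the table typed in as vectors (the title and label text is my own choice):

```r
shoe   <- c(6.5, 11, 13, 6.5, 8, 7.5, 9.5, 8.5, 9.5, 8.5, 9.5, 8.5, 10,
            9.5, 9.5, 11, 10.5, 12.5, 8, 7, 9.5, 10, 10, 10, 7.5)
height <- c(62, 69, 71, 69, 66, 63, 66, 65, 67, 67, 67, 69, 66,
            70, 69, 72, 70, 71, 63, 65, 68, 69, 70, 70, 64)

# Histogram with a title and axis label; hist() also returns the bin counts
h <- hist(shoe, main = "Shoe size", xlab = "Shoe size", col = "grey")
h$counts

# Scatter plot with labelled axes and filled points
plot(shoe, height, xlab = "Shoe size", ylab = "Height", pch = 19)

# Normal Q-Q plot with a reference line through the quartiles
qqnorm(shoe)
qqline(shoe)
```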
> t.test(mydata$height,mydata$shoe)
Welch Two Sample t-test
data: mydata$height and mydata$shoe
t = 90.2019, df = 39.25, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 56.95384 59.56616
sample estimates:
mean of x mean of y
    67.52      9.26
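> t.test(mydata$shoe, mu=8)
The call for t tests:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
Set paired = TRUE for a paired t-test and var.equal = TRUE for a pooled t-test.
The Wilcoxon signed rank test is the nonparametric counterpart of the one-sample t-test (R is case
sensitive, so the function name is all lower case):
> wilcox.test(mydata$shoe, mu=8)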
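To see what the paired and var.equal arguments do, here is a small sketch. The before and after values are hypothetical numbers made up for illustration; they are not class data.

```r
# Hypothetical measurements on the same six subjects (made up)
before <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5)
after  <- c(11.6, 11.5, 12.2, 12.0, 11.3, 12.4)

# Paired t-test: identical to a one-sample t-test on the differences
paired <- t.test(before, after, paired = TRUE)
onesam <- t.test(before - after, mu = 0)
paired$statistic
onesam$statistic

# Treating the two samples as independent groups instead
pooled <- t.test(before, after, var.equal = TRUE)  # pooled variance, df = n1 + n2 - 2
welch  <- t.test(before, after)                    # default: Welch, unequal variances
pooled$parameter
welch$parameter
```

The pooled test always has n1 + n2 - 2 degrees of freedom, while the Welch degrees of freedom are estimated from the data and are usually not a whole number.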
Regression:
> mymodel=lm(height~age,data=mydata)
> class(mymodel)
> summary(mymodel)
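Once the model is fitted, the pieces of it can be pulled out with extractor functions. The sketch below refits height on age using the values from the table typed in as vectors, so it runs without the csv file. The name fit and the age of 21 used for the prediction are assumptions for illustration.

```r
height <- c(62, 69, 71, 69, 66, 63, 66, 65, 67, 67, 67, 69, 66,
            70, 69, 72, 70, 71, 63, 65, 68, 69, 70, 70, 64)
age    <- c(18, 19, 20, 20, 20, 18, 20, 20, 19, 23, 19, NA, 20,
            20, 20, 20, 19, 24, 19, 44, 22, 20, 35, 22, 19)

fit <- lm(height ~ age)

nobs(fit)                 # rows actually used; the row with age NA is dropped
coef(fit)                 # intercept and slope
confint(fit)              # 95% confidence intervals for the coefficients
summary(fit)$r.squared    # proportion of variance explained

# Predicted height for a (hypothetical) 21-year-old
pred <- predict(fit, newdata = data.frame(age = 21))
pred
```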
ANOVA:
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
anova(lm.D9)
summary(lm.D9)
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1)      # Residuals, Fitted, ...
par(opar)
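With only two groups, the one-way ANOVA above is equivalent to the pooled t-test: the F statistic is the square of the t statistic and the p-values agree. A short sketch checking this on the same plant weight data (the names tt and aov_tab are my own):

```r
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group  <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

tt      <- t.test(ctl, trt, var.equal = TRUE)   # pooled two-sample t-test
aov_tab <- anova(lm(weight ~ group))            # one-way ANOVA table

tt$statistic^2              # t squared ...
aov_tab[["F value"]][1]     # ... equals the ANOVA F statistic
tt$p.value
aov_tab[["Pr(>F)"]][1]      # same p-value
```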
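The gl() function used above generates factors by specifying the pattern of their levels.
Usage:
gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
Here n is the number of levels, k is the number of replications of each level, and length is the length
of the result.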
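A few calls show how n, k and length shape the result. This is a small sketch; g is just a name for the factor used in the ANOVA example.

```r
gl(2, 3)        # levels 1 and 2, each repeated 3 times: 1 1 1 2 2 2
gl(2, 1, 6)     # pattern 1 2 recycled to length 6:      1 2 1 2 1 2

# The grouping factor from the ANOVA example: 10 "Ctl" followed by 10 "Trt"
g <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
g
table(g)        # number of observations per level
levels(g)
```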
R code:
setwd("/myRfolder")
mydata<-read.table("classdata.csv",header=TRUE,sep=",",row.names="id",na.strings=" ")
mydata
summary(mydata[,1])
summary(mydata$height)
mean(mydata$height)
sd(mydata$height)
barplot(table(mydata$hair))
pie(table(mydata$hair))
hist(mydata$shoe)
plot(mydata$shoe,mydata$height)
table(mydata$hair,mydata$eye)
library(stats)
qqnorm(mydata$shoe)
t.test(mydata$height,mydata$shoe)
#Regression:
mymodel=lm(height~age,data=mydata)
class(mymodel)
summary(mymodel)