Exercises_in_R

An introduction to R
This is a short introduction to the different kinds of data that can be explored in R.
More information can be found in the R Data Vignette at:
http://cran.r-project.org/doc/manuals/R-data.pdf
Simple commands in R
R is an object-oriented language that makes statistical analysis of large data sets
straightforward. To create an object in R containing data, use the assignment arrow '<-'. The c
function simply combines (concatenates) its arguments into a vector.
> My_data <- c(1,2,3,4,5)
To get simple summary data (quantiles, mean and median, min and max) on an object, use
the command summary().
> summary(My_data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       3       3       4       5
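As a small sketch of where these numbers come from, the values reported by summary() can also be computed one at a time. The names s and q are chosen only for this illustration.

```r
My_data <- c(1, 2, 3, 4, 5)

s <- summary(My_data)    # named values: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
q <- quantile(My_data)   # the 0%, 25%, 50%, 75% and 100% quantiles

print(s)
print(q)
print(mean(My_data))
```

Individual entries can be picked out by name, for example s[["Median"]] or q[["25%"]].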
To quit R use the command:
> q()
To check the working directory (where files are read/placed):
> getwd()
To set the working directory:
> setwd("/home/alistair/")
To get help on a function:
> ?setwd
Data structures in R
Vectors
Use the c function to concatenate numbers into a vector:
> x <- c(1,2,3,4,5)
Generate a numeric sequence:
> x <- 1:5
Generate a repeating sequence of numbers:
> x <- rep(c(3,4,5),4)
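A few variations on sequences and repeats, as a sketch. The variable names are made up for this example; seq() and the times and each arguments of rep() are part of base R.

```r
a  <- seq(1, 5)                   # the same values as 1:5
b  <- seq(0, 1, by = 0.25)        # a sequence with a chosen step size
r1 <- rep(c(3, 4, 5), times = 4)  # repeat the whole vector 4 times
r2 <- rep(c(3, 4, 5), each = 2)   # repeat each element twice

print(a)
print(b)
print(r1)
print(r2)
```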
Character vectors require quotes:
> ord <- c("one","two","three","four","five")
You can extract a subset of a vector using square brackets, [ ].
The first element of a vector has the index 1.
> x <- c(0,1,2,3,4,7,8,9,10)
> x[3:6]
[1] 2 3 4 7
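Square brackets accept more than a range of positions. The sketch below reuses the same values and shows a negative index and a logical condition; the result names are chosen only for illustration.

```r
x <- c(0, 1, 2, 3, 4, 7, 8, 9, 10)

first_three   <- x[1:3]     # positions 1 to 3
without_first <- x[-1]      # a negative index drops that position
large_values  <- x[x > 5]   # a logical condition keeps the matching elements

print(first_three)
print(without_first)
print(large_values)
```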
Data Frames
Complicated data sets consisting of characters and numbers can be represented in an R data
frame, which is essentially a list of equal-length vectors, one per column.
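To make this concrete, here is a minimal sketch that builds a data frame by hand. The data are invented for the example and are not part of the course files.

```r
# Invented example data
people <- data.frame(
  name   = c("a", "b", "c"),
  height = c(160, 172, 181),
  stringsAsFactors = FALSE
)

is_list   <- is.list(people)         # a data frame is a list underneath
n_columns <- length(people)          # length() counts the columns
col_types <- sapply(people, class)   # each column keeps its own type

str(people)
```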
Reading data into R
Getting data into R can often be the hardest part, especially if you are dealing with strange
formats.
Data files, such as text files exported from Excel, can be easily read into R using the read.table
command and its variants, such as read.csv.
First set the working directory to the directory holding the files:
> setwd("/media/disk-7/CoursePresentations/KI_stem_bioinformatics_course_2008/Exercises/")
Here we read in a csv file which was saved from Excel, removing white spaces as we go.
> human_genes <- read.csv("human_genes.csv", strip.white = TRUE)
Check the input data with:
> human_genes[1, ]
To browse for the file on your computer, use the command file.choose() in place of the file name:
> zz <- read.csv(file.choose(),strip.white = TRUE)
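Here we read in a data file, taking the column names from the first line (header = TRUE) and the
row names from the first column (row.names = 1). You can also supply the column names yourself
with col.names. In R 4.0.0 and later, text columns are read as character vectors by default, so
stringsAsFactors = TRUE is added here to read them as factors, as the later exercises assume.
> shoes <- read.table("shoes.txt", header = TRUE, row.names = 1, stringsAsFactors = TRUE)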
Check that R read the file correctly (objects can be printed just by typing their name):
> shoes
You can also print the column headers only (sometimes the whole table does not fit on the
screen, and this might be more helpful):
> names(shoes)
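If you do not have the course files at hand, the same steps can be tried on a small made-up table. This sketch writes an invented file to a temporary location and reads it back; the contents and the names tmp and toy are assumptions made for the example, not the real shoes.txt.

```r
# Write a small invented table to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id height shoesize gender",
  "p1 160 37 female",
  "p2 181 44 male",
  "p3 172 41 male"
), tmp)

# header = TRUE: the first line gives the column names
# row.names = 1: the first column gives the row names
toy <- read.table(tmp, header = TRUE, row.names = 1, stringsAsFactors = TRUE)

print(names(toy))      # column headers only
print(rownames(toy))   # row names taken from the first column
```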
Prepare the data for analysis
Individual columns can be called using the following syntax: type the name of the object,
followed by a dollar sign, then the name of the column:
> shoes$height
Typing the object name every time is laborious. The attach command makes the columns of a
data frame available directly by name. So, prepare the data for analysis by attaching it:
> attach(shoes)
Now you can call each column by its header. For example:
> height
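As an aside, base R also provides with(), which makes the columns visible only inside a single expression. The sketch below uses an invented data frame and is an alternative for illustration only; the exercises in this handout continue to use attach.

```r
# Invented example data
toy <- data.frame(height = c(160, 181, 172), shoesize = c(37, 44, 41))

mean_dollar <- mean(toy$height)        # explicit column access with $
mean_with   <- with(toy, mean(height)) # columns are visible only inside with()

print(mean_dollar)
print(mean_with)
```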
Simple statistics
What are the mean height and shoe size?
> mean(height)
[1] 169.7647
> mean(shoesize)
[1] 40.47059
Check the standard deviations:
> sd(height)
> sd(shoesize)
How are the observations distributed across the groups? That is, how many observations are in
each group?
> table(gender)
> table(population)
The table command can also be used for cross-tabulations:
> table(gender,population)
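The sketch below shows what table returns, using five invented observations. The toy_ names are made up so that they do not clash with the attached columns.

```r
# Invented observations
toy_gender     <- c("female", "male", "male", "female", "male")
toy_population <- c("lund", "lund", "boden", "boden", "lund")

counts <- table(toy_gender)                  # one count per group
cross  <- table(toy_gender, toy_population)  # rows: gender, columns: population

print(counts)
print(cross)
```

Single cells can be read by name, for example cross["male", "lund"].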
Plotting the data
Usually graphical inspection gives an easier interpretation. How are the heights distributed?
To use a histogram, type:
> hist(height)
That's the distribution for the whole population, but is there a difference in heights between
the two genders? This can be studied using a box plot. In this case the variable height is divided into
two groups using the variable gender, and a separate box is drawn for each of these
groups:
> boxplot(height~gender)
So, there is a large difference in height between the genders. Does the same apply to the
sampling sites?
Write the code for this yourself.
How are height and shoe size related? You can get a graphical view of this by making a
scatter plot:
> plot(height, shoesize)
Let's add gender to the same plot using different colors. Color 1 is always black, and
higher numbers give other colors.
> plot(height, shoesize, col=as.numeric(gender))
In the plotting command above, as.numeric converts gender to a numeric code. How do we
know which observations are coded with 1 and which are coded with 2? We can compare the
original variable and the recoded variable by placing them side by side as columns with the
following command:
> data.frame(gender, as.numeric(gender))
It appears that males are coded with 2 and females with 1. Thus, in the plot above, females
are colored with black and males with red.
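The codes follow the order of the factor levels, which R sorts alphabetically by default. A minimal sketch with an invented factor:

```r
# Invented factor, not the course data
toy_gender <- factor(c("male", "female", "female", "male"))

lv    <- levels(toy_gender)      # levels are sorted alphabetically
codes <- as.numeric(toy_gender)  # each value is replaced by the position of its level

print(lv)
print(data.frame(toy_gender, codes))
```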
We can even show the sampling site using different plotting symbols (argument pch):
> plot(height, shoesize, col=as.numeric(gender),
pch=as.numeric(population))
Plotting symbols are numbered from 1, which is an open circle (o). Higher numbers
give different plotting symbols. Find out which population is plotted using the triangles.
A legend can be added to a figure after it has been plotted. To add a legend for the plot
above, use the following command:
> legend(x="bottomright", legend=c("males Lund","females Lund",
"males Boden", "females Boden"), pch=c(2,2,1,1), col=c(2,1,2,1))
First, the position of the legend is given (argument x). Argument legend allows the text to be
added to the legend, and arguments pch and col give details on the plotting symbols and their
colors.
Recoding variables
Variables gender and population are factors. Sometimes we need to convert factors to
numeric vectors (for instance, for certain statistical analyses). This is accomplished as follows:
> gen<-as.numeric(gender)
> pop<-as.numeric(population)
> class(gen)
> class(pop)
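Or with the more general command ifelse, which lets you choose the codes yourself. Note that
here males are coded 1 and females 2, the reverse of the coding given by as.numeric: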
> gen<-ifelse(gender=="male", 1, 2)
> pop<-ifelse(population=="lund", 1, 2)
> class(gen)
> class(pop)
Check the help file for the arguments of the ifelse command.
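In short, ifelse(test, yes, no) checks the test for every element and returns the yes value where it is TRUE and the no value where it is FALSE. A small sketch with invented values:

```r
# Invented values, not the course data
toy_gender <- c("male", "female", "female", "male")

gen_toy <- ifelse(toy_gender == "male", 1, 2)  # 1 where the test is TRUE, 2 otherwise
label   <- ifelse(gen_toy == 1, "M", "F")      # the results can be text as well

print(gen_toy)
print(label)
```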
Making a new dataset
Make a new dataset from the variables height, shoesize, gen and pop:
> shoes.new<-data.frame(height, shoesize, gen, pop)
Check that the new dataset is OK:
> shoes.new
> class(shoes.new)
Close the previous dataset, and take this new one into use:
> detach(shoes)
> attach(shoes.new)
Extracting a subset from a dataset
Make two subsets of the dataset shoes.new by splitting it according to gender.
First, check which individuals are males (coded 1 by the ifelse command above):
> which(gen==1)
Based on that, use subscripts to select the correct subset (take only the rows for which gender is
male):
> shoes.male<-shoes.new[which(gen==1),]
Similarly, make a new dataset from females.
We may want to split the dataset using a continuous variable, such as height. We could do
this using the median of the variable. Follow the example below to make two new datasets
that contain individuals below and above the median height. Check as you go.
> median(height)
> shoes.short<- shoes.new[which(height<=median(height)),]
> shoes.short
> shoes.long<- shoes.new[which(height>median(height)),]
> shoes.long
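The same split can be tried on a small invented data frame. This sketch uses a logical condition directly inside the square brackets, which selects the same rows as which() when there are no missing values; the names toy, cutoff, short and tall are made up for the example.

```r
# Invented example data
toy <- data.frame(height = c(160, 181, 172, 168, 190))

cutoff <- median(toy$height)

# drop = FALSE keeps the result a data frame even with a single column
short <- toy[toy$height <= cutoff, , drop = FALSE]
tall  <- toy[toy$height >  cutoff, , drop = FALSE]

print(cutoff)
print(short)
print(tall)
```

Because the two conditions are complementary, every row ends up in exactly one of the two subsets.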
Quit R
To quit R, type:
> q( )
You will be prompted to save your workspace, so that you can return later to the work you
have performed. It's usually a very good idea to do this.
When you next start R, load this workspace to continue where you left off; you can also
load the command history to reproduce the same analysis results.
Visualisation
Visualising and displaying your raw data can be very informative. It should be performed on any
data you collect.
Scatterplots can be made with the plot command. Put your data into two vectors, x and y.
Then to plot x versus y, use the command plot(x,y).
> x <- seq(0,10,0.1) # making some data
> x <- x[sample(1:length(x))] # randomize for fun
> y <- -1.5 + 4*x + rnorm(length(x),0,2)# linear function + noise
> plot(x,y)
If you provide only one vector to plot, the entries of the vector will be plotted against their
indices. For example, let us plot the results of 100 rolls of a fair die.
> z <- sample(1:6,100,prob=rep(1/6,6),replace=TRUE)
> z
> plot(z)
Box plots are useful for comparing similar data. Let's look at the die rolling example from
above:
> boxplot(z,col="orange")
The same data can also be displayed as a histogram. The histogram can take the form of a
frequency histogram, in which the height of each bar is the number of occurrences of the
corresponding value or class, or a density histogram, in which the area of each bar is the
relative frequency of the corresponding value or class.
Frequency histogram:
> hist(z, col="orange")
Density histogram:
> hist(z,freq=FALSE, col="orange")
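The difference between the two forms can be checked numerically. In this sketch, hist is called with plot = FALSE so that it only returns the computed values; the breaks argument, the seed and the name z_toy are choices made for the example, not part of the exercise.

```r
set.seed(1)                                  # fixed seed so the rolls are repeatable
z_toy <- sample(1:6, 100, replace = TRUE)    # 100 rolls of a fair die

# One bin per face of the die; plot = FALSE returns the numbers without drawing
h <- hist(z_toy, breaks = seq(0.5, 6.5, by = 1), plot = FALSE)

total_count <- sum(h$counts)                      # bar heights of the frequency histogram
total_area  <- sum(h$density * diff(h$breaks))    # bar areas of the density histogram

print(total_count)
print(total_area)
```

The counts add up to the number of rolls, while the areas of the density bars add up to 1.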
Multiple plots in one frame
Multiple plots can be put into a single frame using the par command.
Give this command first:
> par(mfrow=c(2,3))
Then continue with your plots as normal.
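A minimal sketch of the whole sequence, assuming invented die-roll data. par() returns the previous settings, so they can be saved and restored afterwards. The call pdf(NULL) opens a graphics device that writes no file, which keeps the example self-contained; in an interactive session you would leave it out, together with dev.off().

```r
pdf(NULL)                          # null device: nothing is written to disk
old_par <- par(mfrow = c(1, 2))    # 1 row, 2 columns; par() returns the old settings
layout_now <- par("mfrow")

set.seed(1)
z_toy <- sample(1:6, 100, replace = TRUE)
hist(z_toy, col = "orange")        # drawn in the left panel
boxplot(z_toy, col = "orange")     # drawn in the right panel

par(old_par)                       # restore the single-plot layout
layout_restored <- par("mfrow")
invisible(dev.off())
```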