A brief introduction to R

R is a software environment for data manipulation, calculation and graphical display. It is
widely used among statisticians, and it is free, open-source software.
Many new statistical methods are implemented in R and distributed as packages, which makes R attractive to many practitioners. It is developed
from the S language, which was developed at Bell Laboratories by Rick Becker, John
Chambers and Allan Wilks.
In this short introduction, some basic commands and functions are provided to
help you get started with R. However, I believe that the most efficient way to learn
a language is to use it. To get more information on any specific named function in
R, for example solve, use the command
> help(solve)
An alternative is
> ?solve
The help.search command (alternatively ??) allows searching for help in various
ways. For example,
> ??solve
For more detail and a comprehensive introduction to R, please refer to
http://cran.r-project.org/manuals.html
1 Vectors
R operates on objects, such as vectors of real or complex values, which are the
basic units we work with. To set up a vector named x, consisting of four numbers (1.2,
1.5, 5.6, 2.5), use the R command
> x<-c(1.2, 1.5, 5.6, 2.5)
In this example, the left arrow ‘<-’ is the traditional assignment operator, which assigns
a value to an object. Recent versions of R also recognize ‘=’ as an assignment operator. The function
c() creates a vector by combining all its arguments. To assign x a
sequence of values with some pattern, for example equally spaced values, you can
do it in the following ways:
> x<-1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x<-seq(1,10,length=10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x<-seq(1,10,by=1)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
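As a further sketch of patterned vectors, the snippet below uses seq() with a non-integer step and with a requested length, and rep() to repeat values. Note that length.out is the full name of the argument abbreviated as length above; rep() and the variable names s, e, r1 and r2 are not part of the original notes and are used for illustration only.

```r
# Equally spaced values from 0 to 1 in steps of 0.25
s <- seq(0, 1, by = 0.25)
print(s)

# Five equally spaced values between 2 and 10
e <- seq(2, 10, length.out = 5)
print(e)

# Repeat a pattern: the whole vector twice, then each element twice
r1 <- rep(c(1, 2, 3), times = 2)
r2 <- rep(c(1, 2, 3), each = 2)
print(r1)
print(r2)
```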
The elementary arithmetic operators, such as +, -, * and /, are applied to vectors element by element. In addition, many commonly used mathematical functions
can be applied to vectors, for example log, exp, sin, cos, tan and sqrt.
> x<-c(1.2, 1.5, 5.6, 2.5)
> x^2    # square each component in x
[1]  1.44  2.25 31.36  6.25
> x^4    # x to the power of 4
[1]   2.0736   5.0625 983.4496  39.0625
> sqrt(x)
[1] 1.095445 1.224745 2.366432 1.581139
> tan(x)
[1] 2.5721516 14.1014199 -0.8139433 -0.7470223
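To make the element-by-element rule concrete, here is a minimal sketch with two vectors of the same length, followed by a scalar that R recycles to the length of the vector. The second vector y and the result names are hypothetical, chosen only for this illustration.

```r
x <- c(1.2, 1.5, 5.6, 2.5)
y <- c(1, 2, 3, 4)          # hypothetical second vector

# Arithmetic between two vectors of the same length is element by element
s <- x + y
p <- x * y
print(s)
print(p)

# A single number (a vector of length 1) is recycled to match the longer vector
d <- 2 * x
print(d)

# Summary functions reduce a vector to a single number
print(sum(x))
print(mean(x))
```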
R provides many convenient ways to manipulate vectors. x[1] represents the first
component of the vector x, where [] encloses the index set of the vector. To select
a sub-vector from the vector x, we append to x an index vector in square brackets
[]. Such an index vector can be (a) a logical vector, (b) a vector of positive integers within the
range of the length of the vector, or (c) a vector of negative integers.
> x[c(T,F,F,T)]    # select the 1st and 4th components in x.
[1] 1.2 2.5        # T for TRUE and F for FALSE
> x>2
[1] FALSE FALSE  TRUE  TRUE
> x[x>2]           # select the sub vector greater than 2
[1] 5.6 2.5
> x[c(1,4)]        # select the 1st and 4th components in x
[1] 1.2 2.5
> x[-1]            # exclude the 1st component
[1] 1.5 5.6 2.5
2 Matrices
A matrix can be easily created in R using the function matrix(). This greatly simplifies
array manipulation compared with lower-level languages such as C. For example,
> amatrix<-matrix(c(1:6),2,3)    # Create a 2 by 3 matrix named amatrix
> amatrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
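As the output above shows, matrix() fills the matrix column by column. The following sketch contrasts this default with the byrow argument and uses dim() to check the shape; the names a and b are illustrative and do not appear elsewhere in these notes.

```r
# By default matrix() fills column by column
a <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills row by row instead
b <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(a)
print(b)

# dim() returns the number of rows and columns
print(dim(b))
```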
The following examples show how to extract rows or columns from the matrix
amatrix:
> amatrix[1,]
[1] 1 3 5
> amatrix[,2]
[1] 3 4
> amatrix[,2:3]
     [,1] [,2]
[1,]    3    5
[2,]    4    6
We can also bind a vector or matrix to another matrix or vector to form new vectors
or matrices.
> cmat<-c(2,3)
> cmat
[1] 2 3
> dmat<-cbind(amatrix,cmat)
> dmat
           cmat
[1,] 1 3 5    2
[2,] 2 4 6    3
> emat<-c(2,3,8)
> dmat<-rbind(amatrix,emat)
> dmat
     [,1] [,2] [,3]
        1    3    5
        2    4    6
emat    2    3    8
R contains many operators and functions that are available only for matrices.
For example, t(X) is the matrix transpose function. The functions nrow(A) and
ncol(A) give the number of rows and columns in the matrix A, respectively.
Matrix multiplication is done with the operator %*%.
> bmatrix<-matrix(c(1:6),3,2)
> amatrix%*%bmatrix
     [,1] [,2]
[1,]   22   49
[2,]   28   64
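The sketch below puts t(), nrow(), ncol() and %*% side by side, and contrasts %*% with the ordinary * operator, which multiplies element by element. The matrices A and B are the same as amatrix and bmatrix above, rebuilt under shorter names so that the snippet runs on its own.

```r
A <- matrix(1:6, 2, 3)
B <- matrix(1:6, 3, 2)

# Transpose swaps rows and columns
tA <- t(A)
print(dim(tA))

# %*% is the matrix product; the inner dimensions must agree
AB <- A %*% B
print(AB)

# * is element by element and needs matrices of the same shape
had <- A * A
print(had)

print(nrow(A))
print(ncol(A))
```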
To obtain the inverse, the eigenvalues and eigenvectors, the determinant, and the singular value
decomposition of a matrix, use the functions solve(), eigen(), det() and
svd(), respectively.
> solve(dmat)
                  emat
[1,] -1.75  1.125  0.25
[2,]  0.50  0.250 -0.50
[3,]  0.25 -0.375  0.25
> eigen(dmat)
$values
[1] 12.1204947  1.3635608 -0.4840555
$vectors
           [,1]       [,2]       [,3]
[1,] -0.4572209 -0.3248032 -0.9607523
[2,] -0.5985983 -0.8212344  0.2381445
[3,] -0.6577455  0.4691235  0.1422753
> det(dmat)
[1] -8
> svd(dmat)
$d
[1] 12.883717  1.339795  0.463458
$u
           [,1]       [,2]       [,3]
[1,] -0.4572923  0.2742784  0.8459640
[2,] -0.5767985  0.6325630 -0.5168824
[3,] -0.6768953 -0.7243172 -0.1310628
$v
           [,1]        [,2]       [,3]
[1,] -0.2301106  0.06774931 -0.9708033
[2,] -0.4431762  0.88083306  0.1665171
[3,] -0.8663971 -0.46855431  0.1726641
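A few quick checks can confirm that these functions behave as expected. The sketch below rebuilds the same 3 by 3 matrix under the hypothetical name M, verifies that the inverse times the matrix is the identity, solves a linear system with an illustrative right-hand side b, and compares the determinant with the product of the eigenvalues.

```r
# Same entries as dmat in the text, rebuilt so the snippet is self-contained
M <- rbind(c(1, 3, 5), c(2, 4, 6), c(2, 3, 8))

# The inverse times the matrix should be the identity (up to rounding error)
Minv <- solve(M)
print(round(Minv %*% M, 10))

# solve(M, b) solves the linear system M z = b without forming the inverse
b <- c(1, 2, 3)          # hypothetical right-hand side
z <- solve(M, b)
print(M %*% z)

# The determinant equals the product of the eigenvalues
ev <- eigen(M)$values
print(prod(ev))
print(det(M))
```

For solving a linear system, calling solve(M, b) directly is usually preferable to computing solve(M) first and then multiplying.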
3 Reading data from external files
There are several ways to read an external data file into R. Here we introduce the
functions read.table() and scan(). For more detail about reading external data files
and exporting data, please see the R Data Import/Export manual.
The function read.table() is used to import data arranged as a table. For example, consider the following airline passenger data:
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
The first row contains the names of the columns, each following row is one sample, and all data rows
contain the same number of fields. Suppose the file name of the above data is
‘AirPassengers.txt’. To read this file into R, we could use the following commands:
# set the working directory to be the same as AirPassengers.txt
> AirPassenger<-read.table(file="AirPassengers.txt", header=T)
> AirPassenger
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
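The following self-contained sketch writes a small file in the same layout to a temporary location and reads it back, so it can be run without AirPassengers.txt. The three-month excerpt and the names tmp and air are illustrative only. Because the header line has one fewer field than the data rows, read.table() uses the first column (the year) as row names.

```r
# Write a small data file in the same layout to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("Jan Feb Mar",
             "1949 112 118 132",
             "1950 115 126 141"), tmp)

# The header has one fewer field than the data rows,
# so read.table() uses the first column as row names
air <- read.table(file = tmp, header = TRUE)
print(air)

# Columns can be accessed by name, rows by row name
print(air$Feb)
print(air["1950", ])

# dim() gives the number of rows and columns of the data frame
print(dim(air))

unlink(tmp)  # remove the temporary file
```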
If the data file does not contain column names, you should change header=T to
header=F. The above file can also be read using scan().
> AirData<-scan("AirPassengers.txt",skip=1)    # skip the column name
Read 65 items
> AirData
 [1] 1949  112  118  132  129  121  135  148  148  136  119  104  118
[14] 1950  115  126  141  135  125  149  170  170  158  133  114  140
[27] 1951  145  150  178  163  172  178  199  199  184  162  146  166
[40] 1952  171  180  193  181  183  218  230  242  209  191  172  194
[53] 1953  196  196  236  235  229  243  264  272  237  211  180  201
4 Probability distributions
As a statistical package, R provides simple commands to generate random variables
and to evaluate density or mass functions, distribution functions, and quantile functions for a comprehensive list of probability distributions. Some of the commonly
used distributions are listed in Table 1.
To generate random variables from the above distributions, prefix the R name by
'r'. For example, to generate 10 random variables from binomial(5, 0.5),
use the following command
> rbinom(10,5,0.5)
[1] 3 1 2 4 3 3 4 1 1 4
Table 1: Commonly used distributions
Distribution    R name     Additional arguments
beta            beta       shape1, shape2, ncp
binomial        binom      size, prob
chi-squared     chisq      df, ncp
exponential     exp        rate
F               f          df1, df2, ncp
gamma           gamma      shape, scale
geometric       geom       prob
normal          norm       mean, sd
Poisson         pois       lambda
Student’s t     t          df, ncp
uniform         unif       min, max
Weibull         weibull    shape, scale
In the rbinom() call, the first argument is the number of random variables to generate, and the second and third
arguments are the binomial parameters n (size) and p (prob), respectively.
In addition, prefix the R name given here by d for the density, p for the CDF, and q
for the quantile function. The first argument is x for dxxx, q for pxxx, p for qxxx
and n for rxxx. For example,
> dbinom(4,5,0.5)
[1] 0.15625
The above command gives the probability mass at 4 for the binomial distribution with
parameters 5 and 0.5.
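To see how the d, p, q and r prefixes fit together, here is a minimal sketch using the binomial, normal and uniform distributions. The variable names and the seed value are arbitrary choices for this illustration; set.seed() is used only so that the simulated values are reproducible.

```r
# P(X = 4) and P(X <= 4) for X ~ binomial(5, 0.5)
d4 <- dbinom(4, size = 5, prob = 0.5)
p4 <- pbinom(4, size = 5, prob = 0.5)
print(d4)
print(p4)

# The CDF is the running sum of the mass function
print(sum(dbinom(0:4, size = 5, prob = 0.5)))

# Quantile function: the 97.5% point of the standard normal
q <- qnorm(0.975, mean = 0, sd = 1)
print(q)

# pnorm() inverts qnorm()
print(pnorm(q))

# set.seed() makes simulated values reproducible
set.seed(1)
u <- runif(3, min = 0, max = 1)
print(u)
```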
5 Graphics
Graphical facilities are important components of the R environment. Plotting commands are divided into two basic groups:
• High-level plotting functions create a new plot on the graphics device, possibly
with axes, labels, titles and so on.
• Low-level plotting functions add more information to an existing plot, such as
extra points, lines and labels.
Some of the often used high-level plotting functions are:
plot(x, y): If x and y are vectors, plot(x, y) produces a scatter plot of y against
x.
hist(x): Produces a histogram of the numeric vector x.
qqnorm(x): Produces a normal quantile-quantile plot of x, for comparing its distribution with the normal distribution.
contour(x, y, z, ...) : Draw a contour plot for z, as a function of x and y.
Useful low-level plotting functions are:
points(x, y), lines(x, y): Adds points or connected lines to the current plot.
text(x, y, labels, ...): Add text to a plot at points given by x, y.
abline(a, b): Adds a line of slope b and intercept a to the current plot.
legend(x, y, legend, ...): Adds a legend to the current plot at the specified
position.
To access and modify the list of graphics parameters for the current graphics device,
use the par() function.
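The sketch below combines the two groups of plotting functions with par(). It draws to a temporary PDF file so that it also runs without a display; the simulated data, the seed and the file name out are assumptions made for this illustration only.

```r
# Draw to a temporary PDF file so the sketch also runs without a display
out <- tempfile(fileext = ".pdf")
pdf(out)

# par() with arguments changes settings and returns the old values
old <- par(mfrow = c(1, 2))   # two plots side by side

set.seed(1)                   # arbitrary seed, for reproducibility
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# High-level call: starts a new scatter plot
plot(x, y, main = "Scatter plot")
# Low-level calls: add to the existing plot
abline(a = 0, b = 2)                        # line with intercept 0 and slope 2
points(mean(x), mean(y), pch = 19)          # mark the centre of the data
legend("topleft", legend = "y = 2x", lty = 1)

# Second high-level call: fills the second panel
hist(x, main = "Histogram of x")

par(old)    # restore the previous settings
dev.off()   # close the device and finish the file
```

In an interactive session the pdf() and dev.off() calls can be left out, in which case the plots appear in the default graphics window.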
6 Loops and apply() functions
Like many other languages, R provides three ways to construct loops.
(a) for (name in expr1) expr2
where name is the loop variable, expr1 is a vector expression (often a sequence
like 1:20), and expr2 is often a grouped expression. expr2 is repeatedly
evaluated as name ranges through the values in the vector result of expr1.
(b) repeat expr
where expr is evaluated repeatedly until a break statement is reached, for example repeat {expr1; if (expr2) break}
(c) while (condition) expr
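The three constructs can be compared on a small task. The sketch below sums the integers 1 to 10 with each kind of loop; the variable names are illustrative only.

```r
n <- 10

# (a) for loop
total_for <- 0
for (i in 1:n) {
  total_for <- total_for + i
}

# (b) repeat loop: runs until break is reached
total_repeat <- 0
i <- 1
repeat {
  total_repeat <- total_repeat + i
  i <- i + 1
  if (i > n) break
}

# (c) while loop
total_while <- 0
i <- 1
while (i <= n) {
  total_while <- total_while + i
  i <- i + 1
}

# All three totals are the same
print(c(total_for, total_repeat, total_while))
```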
For example, the following code is for “calculating averages for Bootstrap samples” given in Lecture 2.
> x<-c(1,4,9)
> myvec<-NULL
> times<-100000
> for (i in 1:times)
+ {
+ y<-sample(x, 3, replace=T)
+ my<-mean(y)
+ myvec<-c(myvec,my)
+ }
> table(myvec)/times
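The output tabulates the relative frequency of each of the ten distinct sample means
(1, 2, 3, 3.67, 4, 4.67, 5.67, 6.33, 7.33 and 9). In the run shown, the means 1, 4 and 9
each have relative frequency about 0.037, the mean 4.67 has about 0.226, and the
remaining six means have about 0.11 each.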
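The same simulation can be written without growing a vector inside a loop. The sketch below is one possible alternative using replicate(), not the version from the lecture; the seed and the smaller number of replications are arbitrary choices made here to keep the run short, and the means are rounded to two decimals only to give compact labels in the table.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
x <- c(1, 4, 9)
times <- 10000                   # fewer replications than in the text, to keep it quick

# replicate() evaluates the expression `times` times and collects the results
myvec <- replicate(times, mean(sample(x, 3, replace = TRUE)))

# Relative frequency of each distinct bootstrap mean
freq <- table(round(myvec, 2)) / times
print(freq)
```

Appending to myvec with c() on every iteration copies the vector each time, which becomes slow for a large number of replications; replicate() avoids this.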
However, loops are used in R code much less often than in compiled languages. Code
that takes a whole object view is likely to be both clearer and faster in R. Typically,
functions such as apply() and tapply() can be used to replace loops.
## Compute row means for a matrix, first with a loop and then with apply():
> x <- cbind(x1 = rep(3,1000), x2 = c(500:1, 2:501))
> rmean<-rep(0,nrow(x))
> for (i in 1:nrow(x))
+ {rmean[i]<-mean(x[i,])}
> row.means <- apply(x, 1, mean)
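As a further sketch of the whole-object style, the snippet below checks apply() against the built-in rowMeans(), applies a function over columns, and shows tapply() computing group means. The score and group vectors are hypothetical data invented for this illustration.

```r
x <- cbind(x1 = rep(3, 1000), x2 = c(500:1, 2:501))

# apply() over rows (MARGIN = 1) and the dedicated rowMeans() agree
row.means <- apply(x, 1, mean)
row.means2 <- rowMeans(x)
print(all.equal(row.means, row.means2))

# apply() over columns (MARGIN = 2)
col.sums <- apply(x, 2, sum)
print(col.sums)

# tapply() applies a function within groups defined by a second vector
score <- c(10, 12, 9, 15, 11, 14)             # hypothetical measurements
group <- c("a", "b", "a", "b", "a", "b")      # hypothetical group labels
group.means <- tapply(score, group, mean)
print(group.means)
```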