Note-1 - Department of Statistics | Rajshahi University

Training on R
For 3rd and 4th Year Honours Students, Dept. of Statistics, RU
Installation and Data Structures of R
Empowered by
Higher Education Quality Enhancement Project (HEQEP)
Department of Statistics
Rajshahi University, Rajshahi-6205, Bangladesh
March 21-23, 2013
History of R
Statistical Programming Language S developed at Bell Labs, 1976.
Licensed as S-Plus in 1983.
1990 : R
An open source program similar to S
Developed by Robert Gentleman and Ross Ihaka (Auckland, NZ)
1997: An international “R-core” team was formed
Updated versions are released every few months
For more: http://cran.r-project.org/mirrors.html
Advantages of R
• R is a free programming language, developed by renowned statisticians.
• It is open source and runs on Windows, Linux, and Macintosh.
• R has excellent graphing capabilities.
• R has an excellent built-in help system.
• R's language has a powerful, easy-to-learn syntax with many built-in statistical functions.
• The language is easy to extend with user-written functions.
To obtain and install R on your computer
• Go to http://cran.r-project.org/mirrors.html to choose a mirror near you
• Click on your operating system (Windows, Linux, or Mac)
• Download and install from the “base” link
To install additional packages
• Start R on your computer
• Choose the appropriate item from the “Packages” menu
Here, CRAN = Comprehensive R Archive Network.
The R Environment
The R console window consists of a menu bar, a toolbar, and the command prompt (>).
To clear the console screen, press Ctrl + L.
Creating a Script File
Working in R: As Calculator
Numeric Operators
Operator         Symbol     Example
Addition         +          4 + 2 = 6
Subtraction      -          4 - 2 = 2
Multiplication   *          4 * 2 = 8
Division         /          4 / 2 = 2
Power            ^ or **    4 ^ 2 = 16
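The operators in the table can be tried directly at the prompt. The following is a minimal sketch; the names results and grouped are illustrative only and do not come from the slides.

```r
# Use R as a calculator and store the answers in a named vector
results <- c(
  add       = 4 + 2,
  subtract  = 4 - 2,
  multiply  = 4 * 2,
  divide    = 4 / 2,
  power     = 4 ^ 2,
  power_alt = 4 ** 2   # ** is accepted as another spelling of ^
)
print(results)

# Parentheses control the order of evaluation
grouped <- (4 + 2) * 3
print(grouped)
```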
Variables & Assignment Operator
• Numeric: 5, 5.76, etc.
• Logical: values corresponding to TRUE or FALSE
• Character strings: sequences of characters ("blue", "male", "Rahim", etc.)
• Variables are assigned with the operator <- or =
• Data types need not be declared.
> a = 5          # or, a <- 5
> b = "blue"
> c = a^2 + 5
> c > a
[1] TRUE
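As a small sketch of the point that types are not declared, class() reports the type R has inferred for each value. The variable names c_val and is_larger are illustrative choices, not part of the slides.

```r
a <- 5                    # numeric
b <- "blue"               # character string
c_val <- a^2 + 5          # numeric result of an expression
is_larger <- c_val > a    # logical result of a comparison

# R infers the type of each variable from the value assigned to it
print(class(a))
print(class(b))
print(class(is_larger))
```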
Data Structure
• Vectors
• Matrices
• Arrays
• Factors
• Lists
• Data frames
Vector
Here we introduce three functions, c, seq, and rep, that are used to
create vectors in various situations.
c() to concatenate elements or sub-vectors
rep() to repeat elements or patterns
seq() to generate sequences
> c(2, 7, 9)
[1] 2 7 9
> a = c(2, 7, 9)
> b = c(3, 5, 8, a)
> b
[1] 3 5 8 2 7 9
rep(value(s), number of repetitions)
> rep(5, 10)
[1] 5 5 5 5 5 5 5 5 5 5
> rep(c(2, 4, 6), 3)
[1] 2 4 6 2 4 6 2 4 6
seq(initial value, terminal value, increment)
> seq(2, 10, 2)
[1] 2 4 6 8 10
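The three functions also accept named arguments. The sketch below uses each = in rep() and length.out = in seq(), which are standard arguments of those functions but are not covered on the slide; the variable names are illustrative.

```r
a <- c(2, 7, 9)
b <- c(3, 5, 8, a)                      # sub-vector a is appended: 3 5 8 2 7 9

r_times <- rep(c(2, 4, 6), times = 3)   # repeat the whole pattern three times
r_each  <- rep(c(2, 4, 6), each = 2)    # repeat each element twice: 2 2 4 4 6 6

s_by  <- seq(2, 10, by = 2)             # fixed increment
s_len <- seq(0, 1, length.out = 5)      # fixed number of points: 0 0.25 0.5 0.75 1

print(b)
print(r_each)
print(s_len)
```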
Vector
> h = c(21, 25, 19, 22, 23, 20)       # Numeric vector
> h
[1] 21 25 19 22 23 20
> name = c("Rahim", "Rani", "Raju")   # Character vector
> name
[1] "Rahim" "Rani" "Raju"
> c = h > 22                          # Logical vector
> c
[1] FALSE TRUE FALSE FALSE TRUE FALSE
> a = c(1, 2, 3, 4, 5)
> a
[1] 1 2 3 4 5
> a = 1:5                             # the same values, using the colon operator
> a
[1] 1 2 3 4 5
Vector
Indexing
> w = c(1, 3, 5, 2, 10)
> w[3]                 # the third element of w
[1] 5
> w[3:5]               # the third to fifth elements of w, inclusive
[1] 5 2 10
> w[w > 3]             # elements of w greater than 3
[1] 5 10
> w[-2]                # all except the second element
[1] 1 5 2 10
> w[w > 2 & w <= 5]    # greater than 2 and less than or equal to 5
[1] 3 5
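Conditional indexing works because the condition itself is a logical vector of the same length as w. A minimal sketch, in which the name keep and the use of which() are additions for illustration:

```r
w <- c(1, 3, 5, 2, 10)

keep <- w > 2 & w <= 5     # FALSE TRUE TRUE FALSE FALSE
selected  <- w[keep]       # elements where keep is TRUE
positions <- which(keep)   # the positions of those elements
dropped   <- w[-c(1, 2)]   # negative indices drop elements 1 and 2

print(keep)
print(selected)
print(positions)
print(dropped)
```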
Vector
Vector used in functions
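> w = c(1, 3, 5, 2, 10)
> length(w)                     # number of elements
> sum(w)                        # total
> cumsum(w)                     # cumulative sums
> min(w)
> max(w)
> range(w)                      # minimum and maximum
> mean(w)
> median(w)
> var(w)                        # sample variance
> sd(w)                         # sample standard deviation
> summary(w)                    # five-number summary and mean
> sort(w)                       # increasing order
> sort(w, decreasing=TRUE)      # decreasing order
> abs(10-50)                    # absolute value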
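To see what var() and sd() compute, the sketch below rebuilds them from sum(), mean(), and length(). It relies on the fact that R uses the n - 1 divisor for the sample variance; the names manual_var and manual_sd are illustrative.

```r
w <- c(1, 3, 5, 2, 10)
n <- length(w)

# Sample variance with the n - 1 divisor, then its square root
manual_var <- sum((w - mean(w))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

print(c(manual = manual_var, builtin = var(w)))
print(c(manual = manual_sd,  builtin = sd(w)))
```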
Working in R: Using help
Help on a specific R keyword: ?keyword or help(keyword)
Full manual in HTML: help.start()
Finding a "vague" topic: help.search("topic") or ??topic
> ?mean            # information on the mean command
> help(mean)
> help(median)
> help.start()
Array & Matrix
A matrix in mathematics is just a two-dimensional array of numbers. Matrices
and arrays are represented as vectors with dimensions:
# Generate a 3 by 4 array
> x <- 1:12
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Generate a 4 by 5 array
> A <- array(1:20, dim = c(4,5))
> A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
• The dim assignment function sets or changes the dimension attribute of x, causing R to treat the vector of 12 numbers as a 3 × 4 matrix.
• Notice that the storage is column-major; that is, the elements of the first column are followed by those of the second, etc.
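The same column-major rule extends to arrays with more than two dimensions. A small sketch, in which the three-dimensional array arr and the other variable names are illustrative additions:

```r
# A vector with a dim attribute and matrix() give the same 3 x 4 layout
x <- 1:12
dim(x) <- c(3, 4)
m <- matrix(1:12, nrow = 3)
same_layout <- all(x == m)

# Column-major storage: element [2, 3] is the 8th value of the vector
elem <- x[2, 3]

# A three-dimensional array: 4 slices, each a 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))
last_value <- arr[2, 3, 4]

print(same_layout)
print(elem)
print(dim(arr))
print(arr[, , 1])   # the first 2 x 3 slice
```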
Array & Matrix
# Generate a 3 by 4 matrix, filled row by row
> A = matrix(1:12, nrow=3, byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# 3 x 2 matrix of 0
> Y <- matrix(0, nrow=3, ncol=2)
> Y
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0
> A[ ,2]        # 2nd column of matrix A
[1] 2 6 10
> A[3, ]        # 3rd row of matrix A
[1] 9 10 11 12
> A[2, 2]       # (2, 2)th element of matrix A
[1] 6
Basic operations – Matrix
R command    Purpose (output)
A + B        addition of A and B matrices
A * B        element by element products
A %*% B      product of A and B matrices
t(A)         transpose of matrix A
solve(A)     inverse of matrix A
cbind()      forms matrices by binding together matrices horizontally, or column-wise
rbind()      forms matrices by binding together matrices vertically, or row-wise
Basic operations – Matrix
> A.mat <- matrix(c(19,8,11,2,18,17,15,19,10), nrow=3)
> A.mat
     [,1] [,2] [,3]
[1,]   19    2   15
[2,]    8   18   19
[3,]   11   17   10
> inv.A <- solve(A.mat)    # inverse of matrix A.mat
> t(A.mat)                 # transpose of matrix A.mat
> A.mat %*% inv.A          # gives the identity matrix, up to rounding error
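A quick way to check an inverse is to compare the product with the identity matrix using a tolerance, since floating-point results are rarely exact. The sketch below also solves a linear system with solve(A, b); the right-hand side rhs and the other new names are illustrative assumptions.

```r
A.mat <- matrix(c(19, 8, 11, 2, 18, 17, 15, 19, 10), nrow = 3)
inv.A <- solve(A.mat)

# The product should equal the 3 x 3 identity matrix up to rounding error
check <- A.mat %*% inv.A
is_identity <- isTRUE(all.equal(check, diag(3)))
print(round(check, 10))
print(is_identity)

# solve() with two arguments solves A x = rhs without forming the inverse
rhs <- c(1, 2, 3)
x_sol <- solve(A.mat, rhs)
print(A.mat %*% x_sol)   # reproduces rhs
```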
Basic operations – Matrix
> a = matrix(1:9, nrow=3)
> b = matrix(2:10, nrow=3)
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> b
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    3    6    9
[3,]    4    7   10
> cbind(a,b)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7    2    5    8
[2,]    2    5    8    3    6    9
[3,]    3    6    9    4    7   10
> rbind(a,b)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]    2    5    8
[5,]    3    6    9
[6,]    4    7   10
> Cov.matrix = cov(b)             # covariance matrix of the columns of b
> Cor.matrix = cor(b)             # correlation matrix of the columns of b
> Row.mean = apply(b, 1, mean)    # mean of each row
> Col.mean = apply(b, 2, mean)    # mean of each column
NOTE: apply(X, MARGIN, FUN) applies the function FUN over the rows (MARGIN = 1) or the columns (MARGIN = 2) of X.
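As a sketch of how MARGIN works, the example below applies mean() and a small user-written function to the matrix b from the slide. The names row_mean, col_mean, and col_range are illustrative; rowMeans() is a built-in shortcut not mentioned on the slide.

```r
b <- matrix(2:10, nrow = 3)

row_mean <- apply(b, 1, mean)   # MARGIN = 1: one value per row
col_mean <- apply(b, 2, mean)   # MARGIN = 2: one value per column

# FUN can be any function, including one written on the spot
col_range <- apply(b, 2, function(v) max(v) - min(v))

print(row_mean)
print(col_mean)
print(col_range)

# rowMeans() gives the same result as apply(b, 1, mean)
print(all.equal(row_mean, rowMeans(b)))
```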
List
vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
list: an ordered collection of data of arbitrary types.
> a = list(Name="Rahim", age=c(12, 23, 10), Married=FALSE)
> a
$Name
[1] "Rahim"
$age
[1] 12 23 10
$Married
[1] FALSE
• Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string).
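A short sketch of the access styles just described, using the list from the slide. The $ and [[ ]] operators are standard R; the variable names on the left-hand side are illustrative.

```r
a <- list(Name = "Rahim", age = c(12, 23, 10), Married = FALSE)

by_name     <- a$Name          # access by name with $
by_string   <- a[["age"]]      # access by name with [[ ]]
by_position <- a[[3]]          # access by position also works
second_age  <- a$age[2]        # index into the vector stored in the list

print(names(a))
print(by_name)
print(by_string)
print(by_position)
print(second_age)
```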
Data frames
• A data frame represents the typical data table that researchers come up with, like a spreadsheet.
• It is a rectangular table whose columns all have the same length; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
  localisation tumorsize progress
1     proximal       6.3    FALSE
2       distal       8.0     TRUE
3     proximal      10.0    FALSE
Making data frames
We illustrate how to construct a data frame from the following
car data.
Make         Model           Cylinder   Weight   Mileage   Type
Honda        Civic           V4         2170     33        Sporty
Chevrolet    Beretta         V4         2655     26        Compact
Ford         Escort          V4         2345     33        Small
Eagle        Summit          V4         2560     33        Small
Volkswagen   Jetta           V4         2330     26        Small
Buick        Le Sabre        V6         3325     23        Large
Mitsubishi   Galant          V4         2745     25        Compact
Dodge        Grand Caravan   V6         3735     18        Van
Chrysler     New Yorker      V6         3450     22        Medium
Acura        Legend          V6         3265     20        Medium
Making data frames
> Make <- c("Honda","Chevrolet","Ford","Eagle","Volkswagen","Buick","Mitsubishi",
+ "Dodge","Chrysler","Acura")
> Model <- c("Civic","Beretta","Escort","Summit","Jetta","Le Sabre","Galant",
+ "Grand Caravan","New Yorker","Legend")
> Cylinder <- c(rep("V4",5),"V6","V4",rep("V6",3))
> Weight <- c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265)
> Mileage <- c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20)
> Type <- c("Sporty","Compact",rep("Small",3),"Large","Compact","Van",
+ rep("Medium",2))
Making data frames
Now the data.frame() function combines the six vectors into a single data frame.
> Car <- data.frame(Make, Model, Cylinder, Weight, Mileage, Type)
> Car
         Make         Model Cylinder Weight Mileage    Type
1       Honda         Civic       V4   2170      33  Sporty
2   Chevrolet       Beretta       V4   2655      26 Compact
3        Ford        Escort       V4   2345      33   Small
4       Eagle        Summit       V4   2560      33   Small
5  Volkswagen         Jetta       V4   2330      26   Small
6       Buick      Le Sabre       V6   3325      23   Large
7  Mitsubishi        Galant       V4   2745      25 Compact
8       Dodge Grand Caravan       V6   3735      18     Van
9    Chrysler    New Yorker       V6   3450      22  Medium
10      Acura        Legend       V6   3265      20  Medium
Making data frames
> names(Car)
[1] "Make" "Model" "Cylinder" "Weight" "Mileage" "Type"
> Car[1,]
   Make Model Cylinder Weight Mileage   Type
1 Honda Civic       V4   2170      33 Sporty
> Car[10,4]
[1] 3265
> Car$Mileage
[1] 33 26 33 33 26 23 25 18 22 20
> mean(Car$Mileage)    # average mileage of the 10 vehicles
[1] 25.9
> min(Car$Weight)
[1] 2170
Making data frames
> table(Car$Type)    # gives a frequency table
Compact   Large  Medium   Small  Sporty     Van
      2       1       2       3       1       1
> table(Car$Make, Car$Type)    # Cross tabulation
             Compact Large Medium Small Sporty Van
  Acura            0     0      1     0      0   0
  Buick            0     1      0     0      0   0
  Chevrolet        1     0      0     0      0   0
  Chrysler         0     0      1     0      0   0
  Dodge            0     0      0     0      0   1
  Eagle            0     0      0     1      0   0
  Ford             0     0      0     1      0   0
  Honda            0     0      0     0      1   0
  Mitsubishi       1     0      0     0      0   0
  Volkswagen       0     0      0     1      0   0
Making data frames
> Make.Small <- Car$Make[Car$Type == "Small"]
> summary(Car$Mileage)    # gives summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   22.25   25.50   25.90   31.25   33.00
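The logical-indexing idea used for Make.Small also selects whole rows of a data frame. The sketch below rebuilds four columns of the car data so that it runs on its own; subset(), order(), and tapply() are standard R functions that the slides do not cover, and the result names are illustrative.

```r
Car <- data.frame(
  Make = c("Honda", "Chevrolet", "Ford", "Eagle", "Volkswagen",
           "Buick", "Mitsubishi", "Dodge", "Chrysler", "Acura"),
  Weight = c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265),
  Mileage = c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20),
  Type = c("Sporty", "Compact", rep("Small", 3), "Large", "Compact", "Van",
           rep("Medium", 2)),
  stringsAsFactors = FALSE
)

# Rows where Type is "Small" (leave the column index empty to keep all columns)
small_cars <- Car[Car$Type == "Small", ]

# subset() does the same job and can also pick columns
efficient <- subset(Car, Mileage > 30, select = c(Make, Mileage))

# Reorder the rows by increasing weight
by_weight <- Car[order(Car$Weight), ]

# Mean mileage within each vehicle type
mean_by_type <- tapply(Car$Mileage, Car$Type, mean)

print(small_cars)
print(efficient)
print(head(by_weight, 3))
print(mean_by_type)
```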
Making data frames
> b = data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10))
> b
            x            y           z
1  -1.7651180  0.462309932  0.09230914
2  -0.7340731 -1.681826091  0.66648791
3  -0.4968900  1.728658405 -0.68281664
4  -1.3217873  0.307030157  0.24192745
5  -0.2070019  0.003892192  1.19591807
6  -0.9633084  0.060328696 -1.40424843
7  -1.1323626  1.079521099  1.63552915
8  -0.7301976 -1.422012899 -0.16695860
9   0.2979073  0.528152338  0.65995778
10 -0.5759655  0.655296337 -0.39156127
> cor(b)
             x             y           z
x 1.0000000000  0.0007151043  0.12151913
y 0.0007151043  1.0000000000 -0.05770153
z 0.1215191317 -0.0577015345  1.00000000
> apply(b,1,var)
[1] 1.42472853 1.39573092 1.80047438 0.85041478 0.57226442 0.56454121
[7] 2.14379987 0.39516798 0.03357767 0.44098693
Making data frames
> attach(b)
> lm.D9 <- lm(y ~ x)         # Regression of y on x
> lm.D90 <- lm(y ~ x - 1)    # Regression of y on x, omitting intercept
> anova(lm.D9)
> summary(lm.D9)
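The regression calls can also be written with the data = argument, which avoids attach(). This is a minimal sketch on simulated data; the seed value and the names fit, fit0, and slope_by_hand are arbitrary choices for illustration, so the numbers will not match the slide.

```r
set.seed(1)   # arbitrary seed, only to make the random numbers repeatable
b <- data.frame(x = rnorm(10), y = rnorm(10))

fit  <- lm(y ~ x, data = b)       # regression of y on x with intercept
fit0 <- lm(y ~ x - 1, data = b)   # same regression, omitting the intercept

print(coef(fit))                  # intercept and slope
print(coef(fit0))                 # slope only
print(anova(fit))                 # analysis of variance table
print(summary(fit)$r.squared)     # proportion of variance explained

# With an intercept, the least-squares slope equals cov(x, y) / var(x)
slope_by_hand <- cov(b$x, b$y) / var(b$x)
print(slope_by_hand)
```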
Data Entry using Data Editor
• R has a Data Editor with a spreadsheet-like interface.
• The interface is quite useful for small data sets.
• Suppose we want to construct a data frame based on the following data:
Roll   Bstat101   Bstat102
4701   78         60
4702   80         70
4703   75         72
4704   65         68
Data Entry using Data Editor
• To do this, type
> result <- data.frame(Roll=integer(0), Bstat101=numeric(0),
+ Bstat102=numeric(0))
> result <- edit(result)
• Then enter the data in the Data Editor and close the Editor
> result                    # To see the data
> result <- edit(result)    # To modify the data
Reading data from File
An entire data frame can be read directly with the read.table() function.
# Reading data from a .csv file saved from Excel
> data1 <- read.table(file="d:/RFiles/data1.csv", header=TRUE, sep=",")
> data1 <- read.csv(file="d:/RFiles/data1.csv", header=TRUE)    # read.csv() uses sep="," by default
> data1
# Reading data from a tab-delimited text file
> data2 <- read.table(file="d:/RFiles/data3.txt", header=TRUE, sep="\t")
> data2
> attach(data1)    # make the columns of data1 accessible by name
> detach(data1)    # undo attach(data1)
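To try the reading functions without an existing file, one can first write a small data frame to a temporary file. In this sketch the data frame scores, its values, and the temporary path are all hypothetical; write.csv() and tempfile() are standard R functions not shown on the slide.

```r
# Hypothetical data, written to a temporary .csv file
scores <- data.frame(Roll = c(1L, 2L, 3L), Score = c(70.5, 81, 64.25))
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, file = csv_path, row.names = FALSE)

# Both calls read the same file; read.csv() presets sep = "," and header = TRUE
data1     <- read.csv(file = csv_path, header = TRUE)
data1_alt <- read.table(file = csv_path, header = TRUE, sep = ",")

str(data1)                  # column names and types as read from the file
print(mean(data1$Score))

unlink(csv_path)            # remove the temporary file
```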
Importing from other statistical systems
Package foreign on CRAN provides import facilities for files
produced by the following statistical software.
> read.mtp        # imports a 'Minitab Portable Worksheet'
> read.xport      # reads a file in SAS XPORT format
> read.spss       # reads files created by SPSS
Package Rstreams on CRAN contains functions
> readSfile       # reads binary objects produced by S-PLUS
> data.restore    # reads S-PLUS data dumps (created by data.dump)
Thanks