Introduction to R

Data Set: Galapagos Islands
Variables:
Species: the number of plant species found on the island
Endemics: the number of endemic species
Area: the area of the island (km2)
Elevation: the highest elevation of the island (m)
Nearest: the distance from the nearest island (km)
Scruz: the distance from Santa Cruz (km)
Adjacent: the area of the adjacent island (km2)
Reading the data in
The first step is to read the data in. You'll need to get the data file gala.data and save it in R's working directory.
> gala <- read.table("gala.data")
> gala
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Baltra            58       23   25.09       346     0.6   0.6     1.84
Bartolome         31       21    1.24       109     0.6  26.3   572.33
Caldwell           3        3    0.21       114     2.8  58.7     0.78
Champion          25        9    0.10        46     1.9  47.4     0.18
Coamano            2        1    0.05        77     1.9   1.9   903.82
Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
Darwin            10        7    2.33       168    34.1 290.2     2.85
Eden               8        4    0.03        71     0.4   0.4    17.95
Enderby            2        2    0.18       112     2.6  50.2     0.10
Espanola          97       26   58.27       198     1.1  88.3     0.57
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Gardner1          58       17    0.57        49     1.1  93.1    58.27
Gardner2           5        4    0.78       227     4.6  62.2     0.21
Genovesa          40       19   17.35        76    47.4  92.2   129.49
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
Marchena          51       23  129.49       343    29.1  85.9    59.56
Onslow             2        2    0.01        25     3.3  45.9     0.10
Pinta            104       37   59.56       777    29.1 119.6   129.49
Pinzon           108       33   17.95       458    10.7  10.7     0.03
Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
Rabida            70       30    4.89       367     4.4  24.4   572.33
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
SantaFe           62       28   24.08       259    16.5  16.5     0.52
SantaMaria       285       73  170.92       640     2.6  49.2     0.10
Seymour           44       16    1.84       147     0.6   9.6    25.09
Tortuga           16        8    1.24       186     6.8  50.9    17.95
Wolf              21       12    2.85       253    34.1 254.7     2.33
If your data file is stored in a folder such as C:/Stat242, and the file was created using an
editor with tabs between the columns, you may enter
> gala <- read.table("C:/Stat242/gala.data", sep="\t", quote="", header=TRUE, row.names=NULL)
The "<-" is an assignment operator; here it stores the data returned by read.table() in the object gala. You can
use "=" as an alternative to "<-".
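For readers without the data file at hand, the same call can be tried on a few rows pasted inline. This is only a sketch: the object names txt, mini and mini2 are illustrative, the three rows are copied from the listing above, and the text= argument of read.table() stands in for a file name.

```r
# Three rows copied from the gala listing; 'txt', 'mini', 'mini2' are
# hypothetical names used only for this illustration.
txt <- "Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78"

# The header line has one fewer field than the data rows, so read.table()
# uses the first column (the island names) as row names.
mini <- read.table(text = txt, header = TRUE)

# "=" performs the same assignment as "<-" here.
mini2 = read.table(text = txt, header = TRUE)

print(mini)
print(identical(mini, mini2))
```

With a real file, text = txt would be replaced by the path to the file, as in the call above.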
We can check the dimension of the data:
> dim(gala)
[1] 30 7
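A small sketch of related inspection functions may help here. The three-row data frame mini is a hypothetical stand-in built from values in the listing above; nrow(), ncol() and str() are standard base R functions.

```r
# 'mini' is a hypothetical three-row excerpt of the gala data.
mini <- data.frame(
  Species  = c(58, 31, 3),
  Endemics = c(23, 21, 3),
  Area     = c(25.09, 1.24, 0.21),
  row.names = c("Baltra", "Bartolome", "Caldwell")
)

d <- dim(mini)     # number of rows, then number of columns
n_rows <- nrow(mini)
n_cols <- ncol(mini)
print(d)

# str() gives a compact overview: one line per column with its type.
str(mini)
```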
If we don’t remember the variable (column) names we can enter:
> names(gala)
[1] "X"         "Species"   "Endemics"  "Area"      "Elevation" "Nearest"
[7] "Scruz"     "Adjacent"
We can access the variables in gala directly by entering
> attach(gala)
Then, by entering the name of a variable, e.g. Species, we see all the Species values:
> Species
[1] 58 31 3 25 2 18 24 10 8 2 97 93 58 5 40 347 51 2 104
[20] 108 12 70 280 237 444 62 285 44 16 21
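attach() is convenient, but it leaves the columns on the search path until detach() is called. A commonly used alternative is with(), which evaluates one expression using the columns and leaves nothing attached. The sketch below uses a hypothetical three-row data frame mini rather than the full gala data.

```r
# 'mini' is a hypothetical excerpt; the values are the first three rows of gala.
mini <- data.frame(Species = c(58, 31, 3), Endemics = c(23, 21, 3))

# with() looks up Species inside 'mini' for this one expression only.
sp_mean <- with(mini, mean(Species))
print(sp_mean)

# The attach()/detach() pair gives the same access for a whole session segment.
attach(mini)
print(Species)
detach(mini)
```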
Numerical Summaries
One easy way to get the basic numerical summaries is:
> summary(gala)
    Species          Endemics          Area              Elevation
 Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100   Min.   :  25.00
 1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575   1st Qu.:  97.75
 Median : 42.00   Median :18.00   Median :   2.5900   Median : 192.00
 Mean   : 85.23   Mean   :26.10   Mean   : 261.7000   Mean   : 368.00
 3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2400   3rd Qu.: 435.30
 Max.   :444.00   Max.   :95.00   Max.   :4669.0000   Max.   :1707.00
    Nearest          Scruz            Adjacent
 Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
 1st Qu.: 0.80   1st Qu.: 11.02   1st Qu.:   0.52
 Median : 3.05   Median : 46.65   Median :   2.59
 Mean   :10.06   Mean   : 56.98   Mean   : 261.10
 3rd Qu.:10.02   3rd Qu.: 81.08   3rd Qu.:  59.24
 Max.   :47.40   Max.   :290.20   Max.   :4669.00
We can compute these numbers separately also (gala$Sp works because $ accepts any unambiguous abbreviation of a column name):
> gala$Species
 [1]  58  31   3  25   2  18  24  10   8   2  97  93  58   5  40 347  51   2 104
[20] 108  12  70 280 237 444  62 285  44  16  21
> mean(gala$Sp)
[1] 85.23333
> median(gala$Sp)
[1] 42
> min(gala$Sp)
[1] 2
> range(gala$Sp)
[1]   2 444
> quantile(gala$Sp)
  0%  25%  50%  75% 100%
   2   13   42   96  444
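quantile() also accepts a probs argument when other quantiles are wanted. The sketch below retypes the Species values from the output above into a vector named species (an illustrative name) so that it runs without the data file.

```r
# Species counts copied from the gala$Species output.
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40, 347, 51,
             2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)

# Only the lower and upper quartiles.
q <- quantile(species, probs = c(0.25, 0.75))
print(q)

# The interquartile range is the distance between them.
iqr <- IQR(species)
print(iqr)

# Deciles, as an example of an arbitrary set of probabilities.
print(quantile(species, probs = seq(0.1, 0.9, by = 0.1)))
```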
We can get the variance and sd:
> var(gala$Sp)
[1] 13140.74
> sqrt(var(gala$Sp))
[1] 114.6331
R already has a built-in sd() function, but as an illustration we can write our own function to compute sd's:
> sd <- function(x) sqrt(var(x))
> sd(gala$Sp)
[1] 114.6331
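As a check on what var() computes, the sample variance can be written out from its definition, the sum of squared deviations from the mean divided by n - 1. In this sketch the function is named my_sd (a hypothetical name) so that it does not mask R's built-in sd(), and the Species values are retyped from the output above.

```r
# Species counts copied from the gala$Species output.
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40, 347, 51,
             2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)

# Sample variance from the definition.
n <- length(species)
var_by_hand <- sum((species - mean(species))^2) / (n - 1)

# A user-written standard deviation, named so it does not hide stats::sd().
my_sd <- function(x) sqrt(var(x))

print(var_by_hand)
print(my_sd(species))
print(sd(species))   # built-in version, for comparison
```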
The correlations:
> cor(gala)
              Species     Endemics       Area   Elevation      Nearest
Species    1.00000000  0.970876516  0.6178431  0.73848666 -0.014094067
Endemics   0.97087652  1.000000000  0.6169791  0.79290437  0.005994286
Area       0.61784307  0.616979087  1.0000000  0.75373492 -0.111103196
Elevation  0.73848666  0.792904369  0.7537349  1.00000000 -0.011076984
Nearest   -0.01409407  0.005994286 -0.1111032 -0.01107698  1.000000000
Scruz     -0.17114244 -0.154264319 -0.1007849 -0.01543829  0.615410357
Adjacent   0.02616635  0.082658026  0.1800376  0.53645782 -0.116247885
                Scruz    Adjacent
Species   -0.17114244  0.02616635
Endemics  -0.15426432  0.08265803
Area      -0.10078493  0.18003759
Elevation -0.01543829  0.53645782
Nearest    0.61541036 -0.11624788
Scruz      1.00000000  0.05166066
Adjacent   0.05166066  1.00000000
Or more neatly
> round(cor(gala),3)
          Species Endemics   Area Elevation Nearest  Scruz Adjacent
Species     1.000    0.971  0.618     0.738  -0.014 -0.171    0.026
Endemics    0.971    1.000  0.617     0.793   0.006 -0.154    0.083
Area        0.618    0.617  1.000     0.754  -0.111 -0.101    0.180
Elevation   0.738    0.793  0.754     1.000  -0.011 -0.015    0.536
Nearest    -0.014    0.006 -0.111    -0.011   1.000  0.615   -0.116
Scruz      -0.171   -0.154 -0.101    -0.015   0.615  1.000    0.052
Adjacent    0.026    0.083  0.180     0.536  -0.116  0.052    1.000
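Each entry of this matrix is a Pearson correlation, the covariance of two variables divided by the product of their standard deviations. The sketch below recomputes the Species/Endemics entry that way; the vectors species and endemics are retyped from the data listing and their names are illustrative.

```r
# Columns copied from the gala listing.
species  <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40, 347, 51,
              2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)
endemics <- c(23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19, 89, 23,
              2, 37, 33, 9, 30, 65, 81, 95, 28, 73, 16, 8, 12)

# Correlation from its definition.
r_by_hand <- cov(species, endemics) / (sd(species) * sd(endemics))

# The same quantity from cor() on two vectors.
r_builtin <- cor(species, endemics)

print(round(r_by_hand, 3))
print(round(r_builtin, 3))
```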
Another numerical summary with a graphical element is the stem and leaf plot:
> gala$En
 [1] 23 21  3  9  1 11  0  7  4  2 26 35 17  4 19 89 23  2 37 33  9 30 65 81 95
[26] 28 73 16  8 12
> stem(gala$En)

  The decimal point is 1 digit(s) to the right of the |

  0 | 01223447899
  1 | 12679
  2 | 13368
  3 | 0357
  4 |
  5 |
  6 | 5
  7 | 3
  8 | 19
  9 | 5
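The number of leaves on each stem is just the count of values sharing the same tens digit. A rough sketch of that bookkeeping follows, with the Endemics values retyped into a vector named endemics (an illustrative name).

```r
# Endemics counts copied from the gala$En output.
endemics <- c(23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19, 89, 23,
              2, 37, 33, 9, 30, 65, 81, 95, 28, 73, 16, 8, 12)

# The stem of each value is its tens digit (integer division by 10).
tens <- endemics %/% 10

# Count values per stem; factor() keeps the empty stems 4 and 5 in the table.
counts <- table(factor(tens, levels = 0:9))
print(counts)

# The leaves on one stem are the units digits, sorted.
leaves_stem2 <- sort(endemics[tens == 2] %% 10)
print(leaves_stem2)
```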
Graphical Summaries
We can make histograms and boxplots, and specify the labels if we like:
> hist(gala$Sp)
> hist(gala$Sp,main="Histogram of Species",xlab="number of Species")
> boxplot(gala$Sp)
Scatterplots are easier - here we rescale the X-axis because of the skewness of area:
> plot(gala$Area,gala$Sp)
> plot(log(gala$Area),gala$Sp,xlab="log(Area)",ylab="Species")
We can make a scatterplot matrix:
> pairs(gala)
> plot(gala) # also a scatterplot matrix
We can put several plots in one display:
> par(mfrow=c(2,2))
> boxplot(gala$Ar)
> boxplot(gala$Adj)
> boxplot(gala$Elev)
> boxplot(gala$Sc)
> par(mfrow=c(1,1)) # back to 1 plot display
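par() returns the previous settings when it changes them, so the old layout can be saved and restored without retyping it. The sketch below assumes a null PDF device (pdf(NULL)) so that it runs without opening a window, and uses the Area values retyped into a vector named area; both choices are for illustration only.

```r
# Area values copied from the gala listing.
area <- c(25.09, 1.24, 0.21, 0.10, 0.05, 0.34, 0.08, 2.33, 0.03, 0.18,
          58.27, 634.49, 0.57, 0.78, 17.35, 4669.32, 129.49, 0.01, 59.56,
          17.95, 0.23, 4.89, 551.62, 572.33, 903.82, 24.08, 170.92, 1.84,
          1.24, 2.85)

pdf(NULL)                      # null device: nothing is written to disk

op <- par(mfrow = c(1, 2))     # switch to 1 row, 2 columns; keep old settings
boxplot(area, main = "Area")
boxplot(log(area), main = "log(Area)")   # the log scale tames the skewness
par(op)                        # restore the saved settings

restored <- par("mfrow")
print(restored)
invisible(dev.off())

# The skewness shows up numerically too: the mean is far above the median.
print(c(mean = mean(area), median = median(area)))
```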
Selecting subsets of the data
Second row:
> gala[2,]
          Species Endemics Area Elevation Nearest Scruz Adjacent
Bartolome      31       21 1.24       109     0.6  26.3   572.33
Third column:
> gala[,3]
 [1]   25.09    1.24    0.21    0.10    0.05    0.34    0.08    2.33    0.03
[10]    0.18   58.27  634.49    0.57    0.78   17.35 4669.32  129.49    0.01
[19]   59.56   17.95    0.23    4.89  551.62  572.33  903.82   24.08  170.92
[28]    1.84    1.24    2.85
The 2,3 element:
> gala[2,3]
[1] 1.24
c() is a function for making vectors, e.g.
> c(1,4,8)
[1] 1 4 8
Select the first, fourth and eighth rows:
> gala[c(1,4,8),]
         Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra        58       23 25.09       346     0.6   0.6     1.84
Champion      25        9  0.10        46     1.9  47.4     0.18
Darwin        10        7  2.33       168    34.1 290.2     2.85
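Rows and columns can also be picked out by name instead of by position, which is often easier to read. A minimal sketch follows, using a hypothetical four-row data frame mini built from values in the listing.

```r
# 'mini' is a hypothetical excerpt of gala with island names as row names.
mini <- data.frame(
  Species = c(58, 25, 10, 8),
  Area    = c(25.09, 0.10, 2.33, 0.03),
  row.names = c("Baltra", "Champion", "Darwin", "Eden")
)

# Rows by position, using a vector built with c().
by_pos <- mini[c(1, 3), ]

# The same rows by name.
by_name <- mini[c("Baltra", "Darwin"), ]

# A single element by row name and column name.
darwin_area <- mini["Darwin", "Area"]

print(by_pos)
print(by_name)
print(darwin_area)
```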
The : operator is good for making sequences, e.g.
> 3:11
[1]  3  4  5  6  7  8  9 10 11
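For sequences that do not step by one, base R provides seq(). The lines below are a brief sketch of how it relates to the : operator; the variable names are illustrative.

```r
# The : operator always steps by 1.
a <- 3:11

# seq() with only two arguments gives the same values.
b <- seq(3, 11)

# A 'by' argument changes the step size.
odd <- seq(3, 11, by = 2)

# rev() reverses a sequence, which is handy for selecting rows in reverse order.
down <- rev(3:11)

print(a)
print(odd)
print(down)
```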
We can select the third through eleventh rows:
> gala[3:11,]
             Species Endemics  Area Elevation Nearest Scruz Adjacent
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0  0.08        93     6.0  12.0     0.34
Darwin            10        7  2.33       168    34.1 290.2     2.85
Eden               8        4  0.03        71     0.4   0.4    17.95
Enderby            2        2  0.18       112     2.6  50.2     0.10
Espanola          97       26 58.27       198     1.1  88.3     0.57
We can use "-" to indicate "everything but", e.g. all the data except the first two columns is:
> gala[,-c(1,2)]
                Area Elevation Nearest Scruz Adjacent
Baltra         25.09       346     0.6   0.6     1.84
Bartolome       1.24       109     0.6  26.3   572.33
Caldwell        0.21       114     2.8  58.7     0.78
Champion        0.10        46     1.9  47.4     0.18
Coamano         0.05        77     1.9   1.9   903.82
Daphne.Major    0.34       119     8.0   8.0     1.84
Daphne.Minor    0.08        93     6.0  12.0     0.34
Darwin          2.33       168    34.1 290.2     2.85
Eden            0.03        71     0.4   0.4    17.95
Enderby         0.18       112     2.6  50.2     0.10
Espanola       58.27       198     1.1  88.3     0.57
Fernandina    634.49      1494     4.3  95.3  4669.32
Gardner1        0.57        49     1.1  93.1    58.27
Gardner2        0.78       227     4.6  62.2     0.21
Genovesa       17.35        76    47.4  92.2   129.49
Isabela      4669.32      1707     0.7  28.1   634.49
Marchena      129.49       343    29.1  85.9    59.56
Onslow          0.01        25     3.3  45.9     0.10
Pinta          59.56       777    29.1 119.6   129.49
Pinzon         17.95       458    10.7  10.7     0.03
Las.Plazas      0.23        94     0.5   0.6    25.09
Rabida          4.89       367     4.4  24.4   572.33
SanCristobal  551.62       716    45.2  66.6     0.57
SanSalvador   572.33       906     0.2  19.8     4.89
SantaCruz     903.82       864     0.6   0.0     0.52
SantaFe        24.08       259    16.5  16.5     0.52
SantaMaria    170.92       640     2.6  49.2     0.10
Seymour         1.84       147     0.6   9.6    25.09
Tortuga         1.24       186     6.8  50.9    17.95
Wolf            2.85       253    34.1 254.7     2.33
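One detail of negative indexing is worth a sketch: if only a single column is left, R returns a plain vector unless drop = FALSE is given. The data frame mini below is a hypothetical three-row excerpt with four of the gala columns.

```r
# 'mini' is a hypothetical excerpt of gala.
mini <- data.frame(
  Species   = c(58, 31, 3),
  Endemics  = c(23, 21, 3),
  Area      = c(25.09, 1.24, 0.21),
  Elevation = c(346, 109, 114),
  row.names = c("Baltra", "Bartolome", "Caldwell")
)

# Everything but the first two columns: still a data frame.
kept <- mini[, -c(1, 2)]

# Everything but the first three columns: one column left, so a vector.
one_col <- mini[, -c(1, 2, 3)]

# drop = FALSE keeps the data frame structure even for one column.
one_col_df <- mini[, -c(1, 2, 3), drop = FALSE]

# Negative indices work on rows as well.
no_first_row <- mini[-1, ]

print(kept)
print(one_col)
print(one_col_df)
```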
We may also want to select subsets on the basis of some criterion, e.g. which islands
exceed 500 in area:
> gala[gala$Area > 500,]
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
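The expression inside the brackets is an ordinary logical vector, so conditions can be combined with & (and) and | (or). The sketch below uses a hypothetical five-row excerpt named mini and also shows subset(), a base R function that does the same job with slightly less typing.

```r
# 'mini' is a hypothetical excerpt of gala.
mini <- data.frame(
  Species   = c(93, 347, 280, 58, 104),
  Area      = c(634.49, 4669.32, 551.62, 25.09, 59.56),
  Elevation = c(1494, 1707, 716, 346, 777),
  row.names = c("Fernandina", "Isabela", "SanCristobal", "Baltra", "Pinta")
)

# The criterion is a logical vector, one TRUE/FALSE per row.
is_big <- mini$Area > 500
print(is_big)

# Rows where the criterion holds.
big <- mini[is_big, ]

# Two criteria combined with &.
big_high <- mini[mini$Area > 500 & mini$Elevation > 1000, ]

# subset() evaluates the condition inside the data frame and can pick columns.
big_sub <- subset(mini, Area > 500, select = c(Species, Area))

print(big)
print(big_high)
print(big_sub)
```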
Learning more about R
While running R you can get help about a particular command. For example, if you want help
about the stem() command, just type
> help(stem)
If you don't know the name of the command that you want to use, then type
> help.start()
and then browse.
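A few other base R functions can narrow the search from the console. The sketch below is only a suggestion of what to try; the help calls are left as comments because they open a help viewer in an interactive session.

```r
# Exact-name lookups (interactive use):
# help(stem)
# ?stem

# apropos() lists visible objects whose names match a regular expression,
# which helps when only part of a command's name is remembered.
hits <- apropos("^stem$")
print(hits)

# args() shows the arguments a function accepts without opening the help page.
print(args(stem))

# Keyword search across installed help pages (can be slow, so commented out):
# help.search("stem and leaf")
```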
A short introduction to R is given at
http://gd.tuwien.ac.at/languages/R/doc/contrib/Rdebuts_en.pdf.
A detailed introduction to R can be found at http://spider.stat.umn.edu/R/doc/manual/Rintro.html.