Introduction to R

Data Set: Galapagos Islands

Variables:
Species: the number of plant species found on the island
Endemics: the number of endemic species
Area: the area of the island (km^2)
Elevation: the highest elevation of the island (m)
Nearest: the distance from the nearest island (km)
Scruz: the distance from Santa Cruz (km)
Adjacent: the area of the adjacent island (km^2)

Reading the data in

The first step is to read the data in. You'll need to get the data file and save it in R's working directory.

> gala <- read.table("gala.data")
> gala
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Baltra            58       23   25.09       346     0.6   0.6     1.84
Bartolome         31       21    1.24       109     0.6  26.3   572.33
Caldwell           3        3    0.21       114     2.8  58.7     0.78
Champion          25        9    0.10        46     1.9  47.4     0.18
Coamano            2        1    0.05        77     1.9   1.9   903.82
Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
Darwin            10        7    2.33       168    34.1 290.2     2.85
Eden               8        4    0.03        71     0.4   0.4    17.95
Enderby            2        2    0.18       112     2.6  50.2     0.10
Espanola          97       26   58.27       198     1.1  88.3     0.57
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Gardner1          58       17    0.57        49     1.1  93.1    58.27
Gardner2           5        4    0.78       227     4.6  62.2     0.21
Genovesa          40       19   17.35        76    47.4  92.2   129.49
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
Marchena          51       23  129.49       343    29.1  85.9    59.56
Onslow             2        2    0.01        25     3.3  45.9     0.10
Pinta            104       37   59.56       777    29.1 119.6   129.49
Pinzon           108       33   17.95       458    10.7  10.7     0.03
Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
Rabida            70       30    4.89       367     4.4  24.4   572.33
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
SantaFe           62       28   24.08       259    16.5  16.5     0.52
SantaMaria       285       73  170.92       640     2.6  49.2     0.10
Seymour           44       16    1.84       147     0.6   9.6    25.09
Tortuga           16        8    1.24       186     6.8  50.9    17.95
Wolf              21       12    2.85       253    34.1 254.7     2.33

If your data file is stored in the folder Stat242, for example, and the file is tab-delimited (as it may be if it was created using an editor), you may enter

> gala <- read.table("C:/Stat242/gala.data", sep="\t", quote="", header=TRUE, row.names=NULL)

The "<-" is the assignment operator; here it assigns the data that were read to the object gala.
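
The reading step can be tried without the real file. The following is only a minimal sketch: it writes the header and three rows copied from the listing above to a temporary file and reads them back. The names path and demo are made up for the illustration, and the small file is a stand-in for gala.data, not the data set itself.

```r
# Hypothetical stand-in for gala.data: the header line plus three rows
# copied from the listing above, written to a temporary file.
path <- tempfile(fileext = ".data")
writeLines(c(
  "Species Endemics Area Elevation Nearest Scruz Adjacent",
  "Baltra 58 23 25.09 346 0.6 0.6 1.84",
  "Bartolome 31 21 1.24 109 0.6 26.3 572.33",
  "Caldwell 3 3 0.21 114 2.8 58.7 0.78"
), path)

# The header has one fewer field than the data rows, so read.table()
# treats the first line as column names and the first column as row names.
demo <- read.table(path)

dim(demo)       # 3 rows, 7 columns
rownames(demo)  # the island names
str(demo)       # column types at a glance

unlink(path)    # remove the temporary file
```

Checking dim() and str() straight after reading is a cheap way to catch a wrong separator or a header that was read as data.
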
You can use "=" (the equals sign) as an alternative to "<-".

We can check the dimension of the data:

> dim(gala)
[1] 30 7

If we don't remember the variable (column) names we can enter:

> names(gala)
[1] "X" "Species" "Endemics" "Area" "Elevation" "Nearest"
[7] "Scruz" "Adjacent"

We can get direct access to the variables in gala by entering

> attach(gala)

(detach(gala) undoes this.) Then by entering the name of a variable, e.g. Species, we see all the Species values:

> Species
 [1]  58  31   3  25   2  18  24  10   8   2  97  93  58   5  40 347  51   2 104
[20] 108  12  70 280 237 444  62 285  44  16  21

Numerical Summaries

One easy way to get the basic numerical summaries is:

> summary(gala)
    Species          Endemics          Area              Elevation
 Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100   Min.   :  25.00
 1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575   1st Qu.:  97.75
 Median : 42.00   Median :18.00   Median :   2.5900   Median : 192.00
 Mean   : 85.23   Mean   :26.10   Mean   : 261.7000   Mean   : 368.00
 3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2400   3rd Qu.: 435.30
 Max.   :444.00   Max.   :95.00   Max.   :4669.0000   Max.   :1707.00
    Nearest          Scruz            Adjacent
 Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
 1st Qu.: 0.80   1st Qu.: 11.02   1st Qu.:   0.52
 Median : 3.05   Median : 46.65   Median :   2.59
 Mean   :10.06   Mean   : 56.98   Mean   : 261.10
 3rd Qu.:10.02   3rd Qu.: 81.08   3rd Qu.:  59.24
 Max.   :47.40   Max.   :290.20   Max.   :4669.00

We can also compute these numbers separately. Note that "$" lets us abbreviate a variable name as long as the abbreviation is unambiguous, so gala$Sp refers to gala$Species:

> gala$Species
 [1]  58  31   3  25   2  18  24  10   8   2  97  93  58   5  40 347  51   2 104
[20] 108  12  70 280 237 444  62 285  44  16  21
> mean(gala$Sp)
[1] 85.23333
> median(gala$Sp)
[1] 42
> min(gala$Sp)
[1] 2
> range(gala$Sp)
[1]   2 444
> quantile(gala$Sp)
  0%  25%  50%  75% 100%
   2   13   42   96  444

We can get the variance and sd:

> var(gala$Sp)
[1] 13140.74
> sqrt(var(gala$Sp))
[1] 114.6331

We can write a function to compute sd's (R also has a built-in sd() that gives the same result):

> sd <- function(x) sqrt(var(x))
> sd(gala$Sp)
[1] 114.6331

The correlations:

> cor(gala)
              Species     Endemics       Area   Elevation      Nearest
Species    1.00000000  0.970876516  0.6178431  0.73848666 -0.014094067
Endemics   0.97087652  1.000000000  0.6169791  0.79290437  0.005994286
Area       0.61784307  0.616979087  1.0000000  0.75373492 -0.111103196
Elevation  0.73848666  0.792904369  0.7537349  1.00000000 -0.011076984
Nearest   -0.01409407  0.005994286 -0.1111032 -0.01107698  1.000000000
Scruz     -0.17114244 -0.154264319 -0.1007849 -0.01543829  0.615410357
Adjacent   0.02616635  0.082658026  0.1800376  0.53645782 -0.116247885
                Scruz    Adjacent
Species   -0.17114244  0.02616635
Endemics  -0.15426432  0.08265803
Area      -0.10078493  0.18003759
Elevation -0.01543829  0.53645782
Nearest    0.61541036 -0.11624788
Scruz      1.00000000  0.05166066
Adjacent   0.05166066  1.00000000

Or more neatly

> round(cor(gala),3)
          Species Endemics   Area Elevation Nearest  Scruz Adjacent
Species     1.000    0.971  0.618     0.738  -0.014 -0.171    0.026
Endemics    0.971    1.000  0.617     0.793   0.006 -0.154    0.083
Area        0.618    0.617  1.000     0.754  -0.111 -0.101    0.180
Elevation   0.738    0.793  0.754     1.000  -0.011 -0.015    0.536
Nearest    -0.014    0.006 -0.111    -0.011   1.000  0.615   -0.116
Scruz      -0.171   -0.154 -0.101    -0.015   0.615  1.000    0.052
Adjacent    0.026    0.083  0.180     0.536  -0.116  0.052    1.000

Another numerical summary with a graphical element is the stem and leaf plot:

> gala$En
 [1] 23 21  3  9  1 11  0  7  4  2 26 35 17  4 19 89 23  2 37 33  9 30 65 81 95
[26] 28 73 16  8 12
> stem(gala$En)

  The decimal point is 1 digit(s) to the right of the |

  0 | 01223447899
  1 | 12679
  2 | 13368
  3 | 0357
  4 |
  5 |
  6 | 5
  7 | 3
  8 | 19
  9 | 5

Graphical Summaries

We can make histograms and boxplots and specify the labels if we like:

> hist(gala$Sp)
> hist(gala$Sp,main="Histogram of Species",xlab="number of Species")
> boxplot(gala$Sp)

Scatterplots are just as easy. Here we put the X-axis on a log scale because of the skewness of area:

> plot(gala$Area,gala$Sp)
> plot(log(gala$Area),gala$Sp,xlab="log(Area)",ylab="Species")

We can make a scatterplot matrix:

> pairs(gala)
> plot(gala) # also a scatterplot matrix

We can put several plots in one display:

> par(mfrow=c(2,2))
> boxplot(gala$Ar)
> boxplot(gala$Adj)
> boxplot(gala$Elev)
> boxplot(gala$Sc)
> par(mfrow=c(1,1)) # back to 1 plot display

Selecting subsets of the data

Second row:

> gala[2,]
          Species Endemics Area Elevation Nearest Scruz Adjacent
Bartolome      31       21 1.24       109     0.6  26.3   572.33

Third column:

> gala[,3]
 [1]   25.09    1.24    0.21    0.10    0.05    0.34    0.08    2.33    0.03
[10]    0.18   58.27  634.49    0.57    0.78   17.35 4669.32  129.49    0.01
[19]   59.56   17.95    0.23    4.89  551.62  572.33  903.82   24.08  170.92
[28]    1.84    1.24    2.85

The 2,3 element:

> gala[2,3]
[1] 1.24

c() is a function for making vectors, e.g.

> c(1,4,8)
[1] 1 4 8

Select the first, fourth and eighth rows:

> gala[c(1,4,8),]
         Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra        58       23 25.09       346     0.6   0.6     1.84
Champion      25        9  0.10        46     1.9  47.4     0.18
Darwin        10        7  2.33       168    34.1 290.2     2.85

The : operator is good for making sequences, e.g.
> 3:11
[1]  3  4  5  6  7  8  9 10 11

We can select the third through eleventh rows:

> gala[3:11,]
             Species Endemics  Area Elevation Nearest Scruz Adjacent
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0  0.08        93     6.0  12.0     0.34
Darwin            10        7  2.33       168    34.1 290.2     2.85
Eden               8        4  0.03        71     0.4   0.4    17.95
Enderby            2        2  0.18       112     2.6  50.2     0.10
Espanola          97       26 58.27       198     1.1  88.3     0.57

We can use "-" to indicate "everything but", e.g. all the data except the first two columns is:

> gala[,-c(1,2)]
                Area Elevation Nearest Scruz Adjacent
Baltra         25.09       346     0.6   0.6     1.84
Bartolome       1.24       109     0.6  26.3   572.33
Caldwell        0.21       114     2.8  58.7     0.78
Champion        0.10        46     1.9  47.4     0.18
Coamano         0.05        77     1.9   1.9   903.82
Daphne.Major    0.34       119     8.0   8.0     1.84
Daphne.Minor    0.08        93     6.0  12.0     0.34
Darwin          2.33       168    34.1 290.2     2.85
Eden            0.03        71     0.4   0.4    17.95
Enderby         0.18       112     2.6  50.2     0.10
Espanola       58.27       198     1.1  88.3     0.57
Fernandina    634.49      1494     4.3  95.3  4669.32
Gardner1        0.57        49     1.1  93.1    58.27
Gardner2        0.78       227     4.6  62.2     0.21
Genovesa       17.35        76    47.4  92.2   129.49
Isabela      4669.32      1707     0.7  28.1   634.49
Marchena      129.49       343    29.1  85.9    59.56
Onslow          0.01        25     3.3  45.9     0.10
Pinta          59.56       777    29.1 119.6   129.49
Pinzon         17.95       458    10.7  10.7     0.03
Las.Plazas      0.23        94     0.5   0.6    25.09
Rabida          4.89       367     4.4  24.4   572.33
SanCristobal  551.62       716    45.2  66.6     0.57
SanSalvador   572.33       906     0.2  19.8     4.89
SantaCruz     903.82       864     0.6   0.0     0.52
SantaFe        24.08       259    16.5  16.5     0.52
SantaMaria    170.92       640     2.6  49.2     0.10
Seymour         1.84       147     0.6   9.6    25.09
Tortuga         1.24       186     6.8  50.9    17.95
Wolf            2.85       253    34.1 254.7     2.33

We may also want to select subsets on the basis of some criterion, e.g.
which islands exceed 500 in area:

> gala[gala$Area > 500,]
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52

Learning more about R

While running R you can get help about a particular command. For example, if you want help about the stem() command, just type

> help(stem)

If you don't know the name of the command that you want to use, then type

> help.start()

and then browse.

A short introduction to R is given at http://gd.tuwien.ac.at/languages/R/doc/contrib/Rdebuts_en.pdf.
A detailed introduction to R can be found at http://spider.stat.umn.edu/R/doc/manual/Rintro.html.
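
To experiment with the criterion-based selection shown earlier without the data file at hand, the following minimal sketch may help. It assumes a small data frame, here called small (a made-up name), holding three columns for four islands copied from the gala listing; it is not the full data set.

```r
# Four rows copied from the gala listing (hypothetical object name 'small').
small <- data.frame(
  Species   = c(58, 93, 347, 25),
  Area      = c(25.09, 634.49, 4669.32, 0.10),
  Elevation = c(346, 1494, 1707, 46),
  row.names = c("Baltra", "Fernandina", "Isabela", "Champion")
)

# A comparison on a column gives one TRUE/FALSE value per row ...
big <- small$Area > 500
big

# ... and using that vector as the row index keeps only the TRUE rows.
small[big, ]

# Conditions can be combined with & (and) or | (or).
small[small$Area > 500 & small$Elevation > 1500, ]

# which() converts the logical vector into row numbers.
which(big)
```

With these four rows, the first selection keeps Fernandina and Isabela, and the combined condition keeps Isabela only. The indexing rules themselves are described in help("[.data.frame").
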