Chapter 1: Introduction to using R with Mind on Statistics

Introduction to using R
The purpose of this manual is to help you learn how to use the free R software to produce graphs, run simulations,
and perform data analyses. This manual describes how to obtain R and gives a brief introduction to the software.
Two methods will be used to demonstrate the R code. The first method uses screen copies. The screen copy shows
exactly what you will see when running R. The red text shown in the screen copy is what you type. You will not
need to type the initial “>” prompt at the start of each line. The blue text in the screen copy is output produced by R.
A simple example is the first bullet under “Basic Commands” of this lab.
The second method is simply a display of the text you would type. Type the command after the “>”
prompt; R responds when you press the [Enter] or [Return] key. An example of this is the third bullet under
“Basic Commands” of this lab.
What is R?
R is a computer language and environment that was developed with statistical graphics and analysis in mind.
Consequently it is commonly thought of as a statistical software package, like the proprietary Minitab and SPSS
packages. In the growing atmosphere of free software, scientists are constantly making available new packages that
enable R to perform very advanced modern statistics.
There are several consequences of R being free software developed by scientists for scientists. First of all, it is very
powerful. If you decide to continue in a career that depends heavily on statistics such as economics, biology,
medicine, marketing, etc., R will allow you to develop your own statistical functions specific to your own immediate
needs. Secondly, it was created as a tool for scientists rather than as a mass-market commercial product. Thus it is
command-line driven and lacks features such as pull-down menus and point-and-click commands. This results in
software that has a high “nerd factor,” as you will notice when looking at the help commands and manuals. S-Plus,
a proprietary software package, is almost identical to R with respect to line commands, but includes pull down
menus and some point-and-click commands.
Obtaining R
R is freely available from the website www.r-project.org, as are its online manuals. To install R on your
computer, follow the links which originate from the “Download CRAN” link found on the left-hand side of the R
project homepage. This link leads you to a web page where you select the mirror site closest to your
location; e.g., University of California at Berkeley, California, USA. If you are using Microsoft Windows, for
example, you would then click on the “Windows (95 and later)” link, followed by the “base” link, and
then the “rw2010.exe” link to install R.
Basic commands
The most essential features and commands to keep in mind to use R are:
 R is a line command driven software with the commands typed at the > prompt followed by the [Enter] or
[Return] key. Values are assigned using the two-character arrow <-, which is created by typing < followed by
a dash (-). To see the value of a variable, simply type its name and press [Enter]. The following screen copy shows the
basic command of assigning the value 10 to the variable n.
> n <- 10
> n
[1] 10
(Note: The bracketed [1] is used to count the values in the output as you will notice when working with
larger datasets.)
The # sign is used to add comments to a command line. R ignores everything on the command line typed
after #. For example, at the prompt, type the following command to assign 4, 6, and 3 to the variable
“heights”.
> heights <- c(4,6,3) # c() combines the values into a vector
Although comments are provided throughout this manual to give extra instruction, it is not necessary
for you to type the comments to run the commands.
R commands are followed by parentheses with variables and options put within the parentheses. Example:
> sort( heights )
[1] 3 4 6
Typing a command name without the parentheses causes R to print the source code of that function on the screen;
this is harmless.
R is case sensitive which means that R will distinguish between the variables Heights and heights.
To list what objects you have available, use the ls() command. Example:
> ls( )
[1] "heights" "Heights" "n"
To delete an object use the remove command, remove(). Example:
> remove( heights )
> ls()
[1] "Heights" "n"
If you want to change a previous command you can hit the up arrow key and edit your old commands.
If you press [Enter] before a command is completed, R will go to the next line and respond with a “+” to
denote that the command is not finished and you can continue typing. To cancel the command, press the
[Escape] or [Esc] key.
To quit R, type q().
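The commands in this list can be tried together in one short session. The sketch below is only an illustration: the second variable Heights and its values are made up here to show case sensitivity, and exists() is used simply to check whether an object is still in the workspace.

```r
# Two different objects whose names differ only in case
heights <- c(4, 6, 3)      # values from the example above
Heights <- c(40, 60, 30)   # hypothetical values, for illustration only

sort(heights)              # prints a sorted copy; heights itself is unchanged

# remove() deletes one object by name; rm() does the same thing
remove(heights)

exists("heights")          # FALSE: the lowercase object is gone
exists("Heights")          # TRUE: the capitalized object is untouched
```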
Importing data into R
R is best designed to import and export text data. Go to www.humboldt.edu/~mar13/datasets.html and click on the
Trillium dataset. Save the dataset to your computer on the C drive, for example, in a folder called Temp,
“c:/Temp/trillium.txt”. Using the Notepad program, edit the trillium.txt file to remove the header.
The read.table( ) command is used to import data. R uses forward slashes rather than backslashes to denote a
folder location. Try importing the trillium.txt dataset.
> trillium <- read.table("c:/Temp/trillium.txt", header=T)
The header=T option lets R know that the first row in the dataset represents the variable names. To confirm that
the data were read, type trillium and watch the data flash by. Also try the following command to see the names of
the variables for the trillium data.
> names(trillium)
[1] "leaf"   "stem"   "flower" "site"
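If the download is not at hand, the same import step can be tried on a small file created from within R. This is a sketch under assumptions: the four rows of values are invented for illustration (only the column names come from the trillium output above), the object names demo_file and demo are hypothetical, and tempfile() is used so that nothing needs to be saved in C:/Temp.

```r
# Write a tiny whitespace-delimited text file with a header row.
# The data values are made up; only the column names match trillium.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("leaf stem flower site",
             "12.1 9.8 w 1",
             "14.5 11.2 p 1",
             "10.3 8.7 w 2",
             "13.0 10.4 s 2"),
           demo_file)

# header=TRUE tells read.table() that row 1 holds the variable names
demo <- read.table(demo_file, header = TRUE)

names(demo)   # column names read from the header row
nrow(demo)    # number of data rows (the header is not counted)
str(demo)     # storage type of each column
```

In R 4.0.0 and later, text columns are read as character strings unless stringsAsFactors=TRUE is given, so a column such as flower may print without the "Levels:" line shown in this manual.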
Storing data: Objects, vectors, matrices, and data frames.
Anything stored by R is called an object. Thus the function sort( ) is an object, as were the variables heights
and n in the previous examples. We will focus primarily on objects that store data, as heights and n did.
Data will most commonly be stored in either vectors or data frames.
A vector is simply a string of values. n is a vector of length 1 and heights is a vector of length 3. If you are
entering a single variable with only a few data values, you will often simply type the values into a vector. R allows
mathematical operations to be carried out on an entire vector. Example:
> x <- c( 5,2,4 )
> x + 6
[1] 11 8 10
Elements of a vector can be selected with square brackets [ ]. Either a single element or more than one can
be selected. Examples:
> x[3]
[1] 4
> x[2:3]
[1] 2 4
> x[c(1,3)]
[1] 5 4
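Square brackets also accept negative positions and logical conditions; the dart-plotting code at the end of this lab relies on the logical form. A brief sketch using the same vector x:

```r
x <- c(5, 2, 4)

x[-1]        # drop the first element: 2 4
x[x > 3]     # keep only the elements greater than 3: 5 4
x > 3        # the logical vector doing the selecting: TRUE FALSE TRUE
sum(x > 3)   # TRUE counts as 1, so this counts the matches: 2
```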
A matrix is a set of vectors of the same mode (for example, numeric or character) and of the same length bound
together. The dimension of a matrix is typically described by its number of rows and columns. Analogous to a
vector, the elements within a matrix are described by their row and column position in the matrix; e.g., X[ row,
column ]. Leaving either the row or column unspecified is the same as specifying them all. Examples:
> X <- c(2,4,6,8,10,12,14,16,18,20,22,24) #vector
> X <- matrix( X, nrow=3, byrow=F ) # turn into matrix
> X
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24
> X[2,3] # row 2, column 3
[1] 16
> X[1:2, 3:4] # rows 1 through 2 and columns 3 and 4.
     [,1] [,2]
[1,]   14   20
[2,]   16   22
> X[1,] # row 1, all columns
[1] 2 8 14 20
> X[,3] # all rows, column 3
[1] 14 16 18
> X+3 # add 3 to all values of X
     [,1] [,2] [,3] [,4]
[1,]    5   11   17   23
[2,]    7   13   19   25
[3,]    9   15   21   27
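The byrow option controls the order in which the vector fills the matrix. As a small sketch (the names v, by_col, and by_row are illustrative only), the same twelve values give different matrices under the two settings, and dim() reports the number of rows and columns:

```r
v <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24)

by_col <- matrix(v, nrow = 3, byrow = FALSE)  # fill down the columns (as above)
by_row <- matrix(v, nrow = 3, byrow = TRUE)   # fill across the rows instead

dim(by_col)      # 3 4  (rows, columns)
by_col[2, 3]     # 16
by_row[2, 3]     # 14
```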
A data frame is like a matrix, but one column of the data frame may consist of numbers and another of words.
The read.table() function automatically puts the data into a data frame. Try the following commands:
> trillium # scroll back up to see the column headers
> trillium$flower
The first column (1, 2, ..., 582) is simply the row labels, while the data are in the columns listed under “leaf”, “stem”,
“flower”, and “site”. Columns within a data frame can be accessed directly using the $ symbol.
The variables within a data frame can be accessed more directly by attaching the data frame. For example, suppose
you typed flower. R would say that it could not find the variable. That is because it looks in the workspace
(the objects listed by ls()) and sees the data frame trillium, but does not look inside trillium for the variable
flower. The attach(trillium) command tells R to also look inside the data frame trillium if it
cannot find flower in the workspace. Once finished using the dataset, detach the data frame. Examples:
> attach(trillium)
> flower
  [1] p w ...
 [38] p p ...
 [75] w w ...
[112] w w ...
[149] w w ...
[186] w w ...
[223] w w ...
[260] p s ...
[297] w s ...
[334] w w ...
[371] w p ...
[408] w s ...
[445] w p ...
[482] p w ...
[519] w w ...
[556] w p ...
Levels: p s w
(Note: The output is abbreviated here; R prints all 582 values.)
> detach("trillium")
> flower
Error: Object "flower" not found
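An alternative that avoids attaching is with(), which evaluates one expression inside a data frame. The sketch below uses a small made-up data frame (hypothetical values, stored under the illustrative name demo) so that it runs without the trillium file.

```r
# Hypothetical stand-in for trillium; the values are invented
demo <- data.frame(leaf   = c(12.1, 14.5, 10.3, 13.0),
                   stem   = c(9.8, 11.2, 8.7, 10.4),
                   flower = c("w", "p", "w", "s"))

demo$leaf               # column access with $
mean(demo$leaf)         # 12.475
with(demo, mean(leaf))  # same result, no attach()/detach() needed
table(demo$flower)      # counts of each flower code
```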
Of course, basic statistical commands can be carried out on objects.
> attach(trillium)
> mean(leaf)
[1] 12.78847
> sd(leaf)
[1] 2.521759
> summary(leaf)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.80   11.10   12.70   12.79   14.50   21.50
> plot(leaf,stem)
> cor(leaf,stem)
[1] 0.6173458
Editing data
Objects in R can be edited using the edit() function, with the output assigned either to a new object or to the
original object. A spreadsheet editor will appear for data frames and a simple text editor for vectors. To end the
editing session, simply click on the X icon in the upper right-hand corner of the window. Examples:
> newtrillium <- edit( trillium )
> x <- edit( x )
Exporting data
The function write.table() is used to export a data frame to a text file on your computer. The command has many
advanced optional features, but you must at least provide an R object to export and a file destination for the
object. The row.names=F option avoids writing the row numbers at the start of each row. Example:
write.table(newtrillium,file="c:/Temp/newtrillium.txt",row.names=F)
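To check what an export produces, the file can be written and then read straight back in. This is a sketch only: the data frame demo and its values are invented, and tempfile() stands in for a path such as c:/Temp/newtrillium.txt.

```r
# Hypothetical small data frame; values are invented for illustration
demo <- data.frame(leaf = c(12.1, 14.5, 10.3),
                   site = c(1, 1, 2))

out_file <- tempfile(fileext = ".txt")        # stand-in for a real file path
write.table(demo, file = out_file, row.names = FALSE)

readLines(out_file)                           # inspect the raw text that was written
back <- read.table(out_file, header = TRUE)   # read the file back in
all.equal(demo$leaf, back$leaf)               # compare a column before and after
```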
Learning more about R
The potential for using R for statistical analysis is almost endless. Whether you are a beginner or an advanced
user, there is always more to learn about this software and about statistics. Besides the R
manuals available through the Help icon at the top of R, there are a number of books written at introductory and
advanced levels which describe how to use R and the similar S-Plus package.
Learning R through application
The best way to learn software is to try to use it to accomplish a goal. Goal oriented computing can initially be
frustrating, but ultimately is the best teacher for the user to learn advanced commands. Probability simulation will
be used to introduce for-loops, simulations, and graphics.
Birthday Game. To calculate the probability that at least two people share the same birthday, it is easier to
calculate the probability that no two people share a birthday and subtract this probability from one. For
example, with four people in the room, the probability would be $1 - \frac{365 \times 364 \times 363 \times 362}{365 \times 365 \times 365 \times 365}$. The equivalent R code
1-prod((365:362)/365) returns the probability 0.016. The general formula for n people in a room is
$1 - \prod_{i=1}^{n} \frac{366 - i}{365}$. Notice how the term $\frac{366 - i}{365}$ approaches 0 as i approaches 366. This is because 366 is one more
person than there are days in a year. Try the following R code and inspect the probabilities.
for ( n in 1:366)
{
  nobody <- prod( ( 365:(366-n) )/365 )  # probability that nobody shares a birthday
  p <- 1 - nobody                        # probability of at least one shared birthday
  print ( c(p,n) )
}
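The same probabilities can be computed without a loop. As a sketch of one alternative (not the approach this lab asks you to use), cumprod() returns the running product of a vector, so its n-th element equals the prod() term in the loop above for n people. The name ratios is illustrative.

```r
ratios <- (365:0) / 365        # 365/365, 364/365, ..., 0/365
nobody <- cumprod(ratios)      # nobody[n]: probability that n people share no birthday
p      <- 1 - nobody           # p[n]: probability of at least one shared birthday

round(p[4], 3)                 # 0.016, matching the four-person example
plot(1:366, p, type = "l",
     xlab = "number of people", ylab = "P(shared birthday)")
```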
Question 1: What are the probabilities for 22 people in a room? 30? 41? 57?
Estimating π. Remember that the area of a circle is $\pi r^2$, where r is the radius of the circle. To estimate π, we
can throw “darts” randomly into a two-by-two unit grid and count the proportion that fall into a circle centered
inside the grid with a radius of one. The area of the circle is π and the area of the grid is four. Consequently the
proportion of darts that fall into the circle, $p = \frac{\text{number of darts in circle}}{\text{number of darts thrown}}$, should be about equal to $\frac{\pi}{4}$. Using
algebra, we get $\pi \approx 4p$.
In R we can create “darts” by randomly choosing x and y coordinates, each from the uniform distribution with a -1 to
+1 range, thus having the circle centered at the origin. If the distance from the origin, $d = \sqrt{x^2 + y^2}$, is less than one, consider the
dart to be within the circle. The following R code throws 100,000 darts and estimates π. Copy and paste it repeatedly
to watch the variability of the estimate.
N <- 100000
xdarts <- runif(N, min=-1, max=1 )
ydarts <- runif(N, min=-1, max=1 )
d <- sqrt( xdarts^2 + ydarts^2 )
4 * sum( d < 1 ) / N
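Because the darts are random, every run gives a slightly different estimate. The following is a sketch of two optional refinements, neither required by the lab: set.seed() makes a run repeatable (the seed value 1 is arbitrary), and comparing squared distances avoids the square root, since d < 1 exactly when d^2 < 1. The names inside and p_hat are illustrative.

```r
set.seed(1)                           # arbitrary seed, only for repeatability
N <- 100000
xdarts <- runif(N, min = -1, max = 1)
ydarts <- runif(N, min = -1, max = 1)

inside <- (xdarts^2 + ydarts^2) < 1   # TRUE for darts within the unit circle
p_hat  <- mean(inside)                # proportion of darts in the circle
4 * p_hat                             # estimate of pi
```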
Question 2: Create two vectors of 100 zeroes each and call them piA and piB. (The rep function will be useful.)
Using a for-loop around the code and changing N to 5000, save 100 estimates of π to vector piA. Repeat the
process, but change N to 100,000 and save the 100 estimates of π to vector piB. Create side by side boxplots of
the two vectors (boxplot(piA, piB)). Use the function summary() on each vector to find the first and
second quartiles of each vector. For this problem you only need to show the boxplots and state the quartiles.
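The pattern of creating a vector of zeroes and filling it inside a for-loop can be sketched on an unrelated example. Here the average of ten simulated die rolls is saved five times; the names rolls and dice, the seed, and the choice of a die are all illustrative assumptions, not part of the question.

```r
set.seed(2)                      # arbitrary seed, only for repeatability
rolls <- rep(0, 5)               # vector of five zeroes to hold the results
for (i in 1:5)
{
  dice     <- sample(1:6, size = 10, replace = TRUE)  # ten rolls of a die
  rolls[i] <- mean(dice)                              # save the i-th result
}
rolls                            # the five saved averages
```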
Writing functions in R. A nice feature in R is that you can write functions to perform sequences of commands
that are frequently used and where options may be altered. A simple function to add two numbers x and y would be:
add <- function( x, y )
{
  xandy <- x + y
  return( xandy )
}
Suppose we wanted defaults to be available in case x and/or y is not specified by the user. The following alterations
would set x=1 and y=3 unless specified otherwise. Experiment with both functions, sometimes providing only one or
neither of the x and y values.
add2 <- function( x=1, y=3 )
{
  xandy <- x + y
  return( xandy )
}
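A few calls show how the defaults behave. The sketch below restates both functions in compact form so that it runs on its own; the argument values are arbitrary.

```r
add  <- function(x, y)         { return(x + y) }
add2 <- function(x = 1, y = 3) { return(x + y) }

add(2, 5)      # 7
add2()         # 4: both defaults are used
add2(10)       # 13: x supplied by position, y defaults to 3
add2(y = 10)   # 11: y supplied by name, x defaults to 1
# add(2)       # error: argument "y" is missing, with no default
```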
Question 3: Create a function that will throw N darts into a grid with -1 and 1 being the lower and upper bounds for
the x and y coordinates. Let the default for N be 100. Calculate and return the proportion of darts that were inside
the circle with radius one. Also plot the darts, drawing those inside the circle as circles and those outside the
circle as crosses. Provide your function and a graph produced from N=50. The following graphing code will be
useful.
plot( xdarts, ydarts, type="n" ) # plot nothing, but create frame
points( xdarts[d<1], ydarts[d<1], pch=1 ) #pch=point character
points( xdarts[d>=1], ydarts[d>=1], pch=3 )