
An Introductory Course for R

By Dallas J. Bateman

I. Introduction

What is R?

“R is a free software environment for statistical computing and graphics.”[1] R is a free, open-source implementation of the S language, the same language used by the commercial S-PLUS. It is widely known for its ability to produce high quality, customized graphics. Because R is free software, many people contribute “packages” that extend what R can do. Each package may contain its own functions and/or datasets. Many of the packages are written for specific areas of expertise, such as randomForest, scatterplot3d, or maps. Some authors have written packages to accompany textbook material, such as the faraway and TSA packages. These packages tend to be named after the author (as in faraway for Julian Faraway) or after the subject matter of the book (as in TSA for time series analysis).

In order to work in R, one must be willing to learn a new coding language. All commands in R are typed at a command prompt. There are no point-and-click options available in R for running analyses. Some basic commands will be introduced later.

R is an interpreted language, meaning that as soon as a command is sent to the command prompt, it is immediately evaluated and executed. Because R is an interpreted language (as opposed to a compiled language), it is not wise to write functions that require large loops. R is vectorized, which allows it to work efficiently with whole vectors at once; large explicit loops, by comparison, can take up a great deal of time and memory.

For cases where large loops cannot be avoided, R can call code written in C, a compiled language.
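As a minimal sketch of the point about loops and vectorization, the following computes the same sum of squares twice, once with an explicit loop and once with a vectorized expression. The variable names (n, x, total_loop, total_vec) are chosen here for illustration only, and the timings will vary by machine and R version.

```r
n <- 100000
x <- 1:n

# Explicit loop: each iteration is interpreted one step at a time
total_loop <- 0
for (value in x) {
  total_loop <- total_loop + value^2
}

# Vectorized: the whole vector is squared and summed in a single call
total_vec <- sum(x^2)

# Both approaches give the same answer
all.equal(total_loop, total_vec)

# system.time() reports how long an expression takes to run
system.time(for (value in x) value^2)
system.time(x^2)
```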

How to obtain R

R is available for Linux, Macintosh, and Windows operating systems. One can download R for free by visiting this link: http://cran.r-project.org/ . There are instructions that can be followed after selecting the operating system of your choosing.

After you open R

Once R is open, there will be a console stating the version of R being used and copyright information.

There are also instructions for viewing license information and contributors, and for citing R in publications.

R has a couple of methods for finding help to get started. help.search() is a command that allows one to search the R help system for specific words or phrases. For example, if you cannot remember how to fit a linear model in R, it may be helpful to type help.search("fitting linear models"). This will bring up a window displaying a list of functions that may be helpful in performing linear regression analysis, along with the package that would need to be loaded to use each function.

If you can remember the name of the function, but you cannot remember the syntax or the parameters required to run the function, you can simply type ?functionName. Say you know that the lm() function fits a linear model, but now you cannot remember what goes inside the parentheses. By typing ?lm at the command prompt, R will open up a help file giving (among other information) the purpose of the function, the syntax accepted by the function, the meaning of each parameter, the values that are stored as a result of the function (such as residuals, fitted values, test statistics, p-values, etc.), and finally a few examples on using the function.
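The help commands described above can be tried directly at the prompt. This is only a short sketch; args() and example() are additional base R helpers not mentioned in the text, and the search phrase is an arbitrary choice.

```r
# Search the help system for a phrase (lists matching help topics)
help.search("linear models")

# Open the help file for a function whose name you already know
?lm
help("lm")    # equivalent to ?lm

# Show only the arguments of a function
args(lm)

# Run the examples found at the bottom of the help file
example(lm)
```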

There are several PDF files within R that may be of benefit to the beginner in R . To read such files, click on the Help drop-down menu and choose Manuals (in PDF) . There will be a list of documents through which one may read to gain additional help.

Installing Packages

After downloading the software, only a select few packages are already loaded into R. To install additional packages, select Packages from the drop-down menus and then Install Package(s)… This brings up a new window called “CRAN mirror.” Here, the user must select a mirror, usually the one closest to the user. After selecting a CRAN mirror, a new pop-up window called “Packages” appears. This allows the user to select any package to install.

Once the package has been installed, it is still not available for use. The package now sits in R's library of installed packages, and it must be loaded into the current session before its functions can be called. This is done using the following command: library(packagename)
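The same two steps can also be done from the command prompt instead of the menus. This is a sketch under a couple of assumptions: the install line is left commented out because it needs an internet connection, and MASS is used for the loading step only because it normally ships with R.

```r
# Download and install a package from CRAN (needed only once per computer).
# Left commented out here on purpose; remove the # to run it.
# install.packages("scatterplot3d")

# Load an installed package for the current session
library(MASS)

# List the packages that are currently loaded
search()
```

Note that install.packages() takes the package name in quotes, while library() accepts it with or without quotes.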

Reading data into R

The first thing anyone wants to do when opening a statistical software package is read in some data. R has the ability to read in several types of data: csv files, text files, SPSS data files, or Stata data files. Common files that are used are either text files or csv files created in Excel. To read in such files, use the following respective commands:

read.table("filename.txt")

read.csv("filename.csv")

Please note that if your file contains a header (meaning a line that has the names of the variables), then you need to say so in the read-in functions (read.csv() assumes a header by default, while read.table() does not):

read.table("filename.txt", header=TRUE)

read.csv("filename.csv", header=TRUE)

This will tell R that the first line should not be considered data.
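To see the effect of the header argument, the sketch below writes a tiny csv file to a temporary location and reads it back both ways. The file contents and the names csv_file, with_header, and without_header are made up for illustration.

```r
# Create a small csv file to read (stands in for a file saved from Excel)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("height,weight", "60,115", "65,140", "70,170"), csv_file)

# With header=TRUE the first line supplies the variable names
with_header <- read.csv(csv_file, header = TRUE)
with_header
names(with_header)

# With header=FALSE the first line is treated as a row of data
without_header <- read.csv(csv_file, header = FALSE)
without_header
```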

As mentioned before, there are many datasets already available in R. To view a list of these datasets, simply type the command data(). This will only list the datasets available in the currently loaded packages. To load such a dataset, type data(datasetName). The data is now available for use.

Important basic syntax to know

Much of the basic syntax needed to get around comfortably in R can be found in a handy reference card found at the R website.[2] This document has four pages of functions that can be used in getting help, importing or exporting data, creating data, converting variables, selecting or manipulating data, doing basic math, using matrices, using dates and times, plotting, fitting models, and much more.

Another helpful document also found on the R website is titled “An Introduction to R”.[3] This document goes into greater detail on using vectors, manipulating data, creating data frames, reading in files, probability distributions, function coding, graphical procedures, and many other topics.

II. Common Statistical Analyses in R

Most software packages that are made for statistical analysis can do basic statistics such as ANOVA, linear regression, or hypothesis testing. This section will go through the basics of a few of the common analyses and procedures in R . Please be aware that these examples are not meant to show significant results or best-fit models. They are only to show one how to obtain the needed information.

For additional examples on higher level analysis in R, please reference the examples at http://www.ats.ucla.edu/stat/r/dae/default.htm.

The examples included here will include notes or comments. To comment out a section in R , use the # sign followed by any information that you do not want run as code. Anything to the right of the # sign will be considered a comment until it hits a new line.

Data Manipulations

This section will show you how to create variables and how to do simple manipulations.

# create a variable called x from the numbers 1 through 10:
x=1:10
x

# create a variable y from 2 to 20 by 2's:
y=seq(2,20,by=2)
y

# create a variable z of 10 normal random numbers with mean 5 and standard deviation 9:
z=rnorm(10,5,9)
z

# let's multiply x by 3 and take the square root of y:
x*3
sqrt(y)

# here are some basic commands that should be clear from the function names.
# for help on each function use the help() command.
sort(z)
sum(z)
cumsum(x)

# create a matrix by combining x,y,&z:

Data=cbind(x,y,z)

Data

# Transpose the data:

t(Data)

# display only the first 4 rows of Data:

Data[1:4,]

# display only the second column of Data:

Data[,2]
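Rows and columns can also be selected by name or by a condition rather than by position. The sketch below rebuilds a smaller version of the matrix so that it runs on its own; the name big_x is hypothetical.

```r
x <- 1:10
y <- seq(2, 20, by = 2)
Data <- cbind(x, y)

# select a column by its name instead of its position
Data[, "y"]

# keep only the rows where x is greater than 7
big_x <- Data[Data[, "x"] > 7, ]
big_x

# number of rows and columns
dim(Data)
```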

Initial Data Analysis

The first thing one would like to do with data is view scatterplots and summary statistics. Here is a tutorial on how to do so in a dataset called trees :

# open the dataset called "trees"

data(trees)

# this dataset will give three variables:

# Girth - the diameter of the tree in inches

# Height - the height of the tree in feet

# Volume - the volume of the tree in cubic feet

# to display the dataset, simply type the name of the dataset:

trees

pairs(trees)    # produces a scatterplot matrix

summary(trees)  # produces summary statistics of the data

Linear Regression

Linear regression uses the lm() function, which stands for linear model. Within the function, the only true information required is the model and sometimes the dataset being used. Here we will use the same trees dataset used above.

# The formula says that Volume is the response and Girth and Height are predictors

# data=trees is saying that we are taking the variables from the trees dataset

# We will store the model into a variable called "model1"
# (R is case-sensitive, so the name must be typed the same way every time):

model1=lm(Volume~Girth+Height,data=trees)

# To view a summary of the model type:

summary(model1)

# To view some of the values that are stored, type the following:

model1$coef    # gives the model coefficients

model1$resid   # gives the residuals of the model

model1$fitted  # gives the fitted values for the regression line
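The stored values can also be retrieved with R's extractor functions, which avoid relying on the abbreviated names after the $ sign. This is a sketch rather than part of the original tutorial; the object names fit and new_tree, and the tree measurements used for prediction, are made up for illustration.

```r
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)

coef(fit)             # model coefficients
head(residuals(fit))  # first few residuals
head(fitted(fit))     # first few fitted values

# predict the volume of a hypothetical tree
new_tree <- data.frame(Girth = 12, Height = 75)
predict(fit, newdata = new_tree)
```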

Histograms & QQ Plots

Suppose we want to check the model assumption of normality for our model. Here are the simple commands to check the normality of the residuals for the trees model.

resid=model1$resid

hist(resid)

# We can change axis labels and add a title:

hist(resid,xlab="Residuals",ylab="Frequency",main="Histogram of Trees Data")

# We can even change the color of the plot:

hist(resid,xlab="Residuals",ylab="Frequency",main="Histogram of Trees Data",col="blue")

qqnorm(resid)

# We can change the look of the plotting dots:

qqnorm(resid,pch=24)

qqnorm(resid,pch=19)

# type ?points and search under 'pch' values for a list of different shapes

# If we want to add a reference line, use qqline():

qqline(resid)

# We can even change the color and thickness of the line:

qqline(resid,col="red",lwd=3)

Categorical Variable Analysis and Chi-squared Testing

R also fits generalized linear models and performs other analyses involving categorical data. Here we will use a dataset called HairEyeColor, in which the sex, hair color, and eye color of 592 people were recorded.

# Open the HairEyeColor dataset:

data(HairEyeColor)

# View the data (it will be displayed as a table):

HairEyeColor

# We can store the table as a dataframe:

HairEyeColor=data.frame(HairEyeColor)

HairEyeColor

# We can go back to a table, but not separate it by gender:

Table=xtabs(Freq~Hair+Eye,data=HairEyeColor)

Table

# Make a mosaic plot of the data:

mosaicplot(Table)

# Produce a generalized linear model in a similar manner as a linear model:

model2=glm(Freq~Hair+Eye,data=HairEyeColor)

summary(model2)

# View the same kinds of information stored:

model2$coef

model2$resid

model2$fitted
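The chi-squared test named in the section title can be sketched as follows. This is an illustration rather than the author's own analysis: the names hair_eye, test, counts, and model_pois are hypothetical, and the Poisson family is shown only as one common choice for count data (glm() uses the gaussian family when none is given).

```r
data(HairEyeColor)

# collapse the table over Sex, leaving a Hair by Eye table of counts
hair_eye <- margin.table(HairEyeColor, c(1, 2))
hair_eye

# chi-squared test of independence between hair color and eye color
test <- chisq.test(hair_eye)
test
test$expected   # expected counts under independence

# a Poisson log-linear model fitted to the same counts
counts <- as.data.frame(hair_eye)
model_pois <- glm(Freq ~ Hair + Eye, family = poisson, data = counts)
summary(model_pois)
```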

There are many other analyses available in R such as ANOVA, multivariate analysis, time series, and much much more.

III. Working with Graphics in R

There simply is not enough space in this paper to explore all of the wonderful details that R can provide with graphics. Here, I will merely provide a handful of commands that will allow one to see the types of plots R can produce. To learn how to change plot appearance and settings, a wonderful source can be found at http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html. This webpage provides multiple graphs with R code so that one can play around with the different settings and see the possibilities of R.

A few of the possible graphing features were discussed with scatterplot matrices, histograms, QQ-plots, and mosaic plots. Much more can be learned by reading the plotting help file: ?plot. On top of regular plotting features, R also has a large mapping library that allows the user to produce maps of the world, countries, states of the United States, and counties of US states, and that includes a database of locations for major cities worldwide.
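As a small sketch of the basic plotting commands, the following uses the trees dataset from earlier. The labels, colors, and the name height_box are arbitrary choices made for illustration.

```r
data(trees)

# Scatterplot with custom labels, point shape, and color
plot(trees$Girth, trees$Volume,
     xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
     main = "Tree Volume by Girth", pch = 19, col = "blue")

# Add a least-squares line to the existing plot
abline(lm(Volume ~ Girth, data = trees), col = "red", lwd = 2)

# Boxplot; the returned list holds the five plotted summary values
height_box <- boxplot(trees$Height, ylab = "Height (feet)",
                      main = "Tree Heights")
height_box$stats
```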

IV. Miscellaneous R Issues

Saving files

R code can be saved in script files. To open a new script, select File from the drop-down menu and then New script. This opens a window that allows you to write your code and save it so that you do not have to retype all of the commands next time you open R.

Aside from saving code, when you close R, it will ask if you want to save the workspace image. This means that you can save a file (named .RData) that, when opened, starts R with all of the data and variables defined during your session. That file will be saved in the current working directory. To change the directory, choose Change dir… from the File drop-down menu and select where you would like the workspace image saved.
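Objects can also be saved and restored from the command prompt. In this sketch the object heights and the file name my_session.RData are hypothetical, and a temporary directory is used so that nothing is written to your own folders.

```r
# Show the current working directory (where the workspace image goes by default)
getwd()

heights <- c(60, 65, 70)   # hypothetical data created during a session

# Save the object to a file; save.image(file = ws_file) would save
# everything in the workspace instead of just the named objects
ws_file <- file.path(tempdir(), "my_session.RData")
save(heights, file = ws_file)

# Remove the object, then restore it from the file
rm(heights)
load(ws_file)
heights
```

setwd() changes the working directory from the prompt, in the same way as the Change dir… menu item.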

Missing Values

Some functions in R can deal with missing values while others cannot. If you receive an error or an unexpected NA result when calling a function, it may have something to do with missing values in your data. For example, with its default settings cor(x,z) returns NA instead of a correlation if there are missing values in either variable. ?NA may be a helpful command in reading up on dealing with missing values. If missing values must be removed from a single variable, na.omit(variable) is sufficient. If an entire observation must be removed because of a missing value in one variable, as is the case for cor(x,z), na.omit(dataframe) or na.exclude(dataframe) will drop every row that contains a missing value.
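The sketch below shows these behaviors on two short vectors with made-up values. The na.rm and use arguments are standard options of mean() and cor(); the name obs is hypothetical.

```r
x <- c(1, 2, NA, 4, 5)
z <- c(2, 4, 6, NA, 10)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # drops the missing value first

cor(x, z)                        # NA with the default settings
cor(x, z, use = "complete.obs")  # uses only pairs with no missing value

# remove missing values from a single variable
na.omit(x)

# remove every observation (row) that has a missing value
obs <- data.frame(x, z)
na.omit(obs)
```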

Exporting Files

Exporting files is as easy as importing data. Depending on the type of file that you are trying to export, certain commands are needed. Let us return to our examples of importing a text file and a csv file. Exporting them is as simple as replacing read with write and including the dataset that you want to export:

write.table(dataframe,"filename.txt")

write.csv(dataframe,"filename.csv")
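A quick way to check an export is to read the file straight back in. This sketch writes the trees dataset to a temporary file; the names out_file and trees_back are hypothetical, and row.names=FALSE is an optional argument that keeps the row numbers out of the file.

```r
data(trees)

# write the data frame to a csv file without the row numbers
out_file <- tempfile(fileext = ".csv")
write.csv(trees, out_file, row.names = FALSE)

# read it back in and compare with the original
trees_back <- read.csv(out_file)
dim(trees_back)
names(trees_back)
all.equal(trees_back$Volume, trees$Volume)
```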

Notes

1. http://www.r-project.org/ (accessed 4/13/2010).

2. http://cran.r-project.org/doc/contrib/Short-refcard.pdf (accessed 4/13/2010).

3. http://cran.r-project.org/doc/manuals/R-intro.pdf (accessed 4/13/2010).
