Controlling and Managing Your R Workspace
This guide provides an overview of the R Studio Workspace—including its basic logic and structure. The
goal is to develop an understanding of how to best control and manage your Workspace so that you can
avoid or quickly diagnose several common errors.
Basic Overview
The R Environment consists of all the files necessary for running the R Program as well as data sets and
other objects that you have created or loaded into your Workspace. These files can be broken down into
three basic types:
1. The base packages that run all the standard analyses that we use in this course. These files are
installed automatically when you first download and install the R program.
2. Additional packages you can install on your own and which allow for more advanced statistical
analysis or additional commands.
3. The datasets, functions and other objects you create or import.
R-Studio Console
The four default windows in R-Studio are:
1. R scripts (top left) – this is where you’ll type the code you want to save. You can have multiple
scripts open at the same time.
2. Environment/History windows (top right) – the Environment tab gives you the list of datasets
and variables you’ve assigned and the History tab gives you the list of commands you’ve recently
executed.
3. Console (bottom left) – where commands are executed and their output is printed.
4. Bottom Right Window – Five tabs: Files (shows the files and folders in your working directory), Plots
(where your plots are displayed), Packages (shows the packages you’ve installed; the loaded ones are
checked), Help (use it to clarify syntax and learn about the options and variations of
commands), and Viewer for viewing local web content.
Getting Started
Your working directory tells R-Studio where the files you’ll be accessing are stored. It is also where
everything you save will go by default (e.g., plots or pdfs), unless you specify the path in the filename.
To set your working directory, either go to the ‘Session’ menu at the top of the screen and click on
‘Set Working Directory’, which opens a file browser where you can choose your directory, or use the code:
Windows: setwd("C:/MyDocuments/….")
Macs: setwd("/Users/Rachel/….")
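A minimal sketch of how the working directory decides where files end up. It uses tempdir() purely as a stand-in for your own folder, and the file name wd_example.csv is made up for illustration:

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# tempdir() is only a stand-in for your own project folder.
setwd(tempdir())

# A file written without a path lands in the working directory.
write.csv(data.frame(x = 1:3), "wd_example.csv", row.names = FALSE)
saved_here <- file.exists("wd_example.csv")
saved_here

# Clean up and return to the original directory.
unlink("wd_example.csv")
setwd(old_wd)
```

Running getwd() at any time shows which directory R is currently using.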
Importing data
After you’ve set your working directory, you’re ready to import your data. Multiple file types can be
imported, but comma separated value (.csv) files are most common. When importing data, it’s best to
give it a (short) name. When assigning names to datasets and variables (any object), you use the <- symbol:
Name <- read.csv("File_name.csv")
The read.csv command tells R it’s reading a .csv file. If you’re importing another type of file you’ll use a
different read command (e.g. read.dcf, read.table, etc.)
Example: load InsectSprays – a pre-existing data frame in R – using: data(InsectSprays)
Note – you only use data() with datasets that come with R or with an installed package. For everything
else you read it in using read.csv() or read.table().
*If using read.table(), be sure you assign your separator using sep=" " (for spaces), sep="," (for commas),
etc.
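The import step can be sketched as follows. The file contents and the names csv_path, sites and sites2 are hypothetical; a temporary file is written first so the example runs on its own:

```r
# Write a small comma-separated file to a temporary location (made-up data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,count", "north,4", "south,7"), csv_path)

# read.csv() assumes commas as separators and a header row.
sites <- read.csv(csv_path)

# read.table() needs the separator and the header stated explicitly.
sites2 <- read.table(csv_path, sep = ",", header = TRUE)

# Both calls should give the same two-row data frame.
sites
all.equal(sites, sites2)
```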
*Be sure you don’t name objects the same as functions (e.g. data<-read.csv()). If you decide you want to
rename your dataset or make a copy, be sure you have the new name on the left side of the arrow.
To have a look at the data, we can use View(InsectSprays) or just type the dataset name InsectSprays.
You can also view portions of the data using head(InsectSprays) or tail(InsectSprays). Head prints the first
6 lines by default, and tail does the same thing, but from the end of the dataset. You can also specify a
specific number of rows – e.g. head(InsectSprays, 10).
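For instance, a short sketch of these inspection commands (the object names first_rows and last_rows are only for illustration):

```r
# Load the built-in dataset into the workspace.
data(InsectSprays)

# First six rows (the default) and last three rows.
first_rows <- head(InsectSprays)
last_rows <- tail(InsectSprays, 3)
first_rows
last_rows

# Number of rows and columns in the full data frame.
dim(InsectSprays)
```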
Syntax
Syntax is everything in R. You may have your files loaded, the code written and ready to go, but a comma
out of place can set you back an hour trying to get it to work. Capitals vs. lower case letters are also very
different, so ‘Sample.csv’ is totally different from ‘sample.csv’.
R-Studio helps a lot with this. When you type in a command it often automatically creates a close bracket
for your open bracket, and if you forget to close it somehow, it will start your next line indented.
However, it only goes so far. If you’re having trouble getting code to work, take the time to look over
your syntax and make sure everything is right.
Remember this when you’re creating column names in your csv. All spaces will be converted to periods
when imported. It’s often best to look at the data set after you’ve imported it – you can do so by clicking
on the name in the ‘Environment’ window (top right) or by using the code:
View(InsectSprays)
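A small sketch of the renaming behaviour, using made-up column names and the text argument of read.csv() so that no file is needed:

```r
# Two column headers that contain spaces (hypothetical data).
csv_text <- "plot id,insect count\n1,10\n2,7"

# With the default settings, read.csv() turns the spaces into periods.
dat <- read.csv(text = csv_text)
names(dat)

# check.names = FALSE keeps the headers exactly as written in the file.
dat_raw <- read.csv(text = csv_text, check.names = FALSE)
names(dat_raw)
```

Names with spaces have to be wrapped in backticks when used with $, which is one reason to avoid spaces in column headers in the first place.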
In your script window (top left), you will write/save your code. You can write notes to yourself using #
sign:
# note to self: R is awesome
# subset of InsectSprays (rows 1 to 12)
spray_subset <- InsectSprays[1:12, ]
# everything written after the # sign is a note and is ignored when the code runs
You can also create indexes by using four # signs:
e.g.
#### anovas section ####
There is a small drop down window in the bottom left corner of the script window that allows you to
easily move from one index to another.
To run your code from the script window you can either hit the Run button at the top right of the script
window, or simply press Ctrl + Enter in Windows, or Command + Return on a Mac.
Editing Data
You can make minor edits using edit(InsectSprays), but try to avoid editing your original data frame. It’s
better to make changes with code so you can track exactly what you changed.
R most efficiently refers to data by position. By using square brackets you can designate a specific row
and/or column (the default is all, so if you don’t assign one, it uses all the rows/columns). In general:
InsectSprays[row,column]
For example, InsectSprays[1, 2] refers to the cell in the first row, second column of InsectSprays.
InsectSprays[ , 2] refers to the entire second column (by leaving the row assignment blank it includes all
rows).
InsectSprays[3, 1] <- 8 will change the cell in the third row of the first column to the number 8.
(InsectSprays has only two columns, count and spray, so the column number must be 1 or 2.)
InsectSprays[ , 1] <- InsectSprays[ , 1]*10 will multiply every cell in the first column by 10, and replace it
with that number.
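A sketch of indexing by position, run on a copy (the name sprays_copy is an assumption for illustration) so the original data stay untouched. Because InsectSprays has only two columns, the numeric count column is column 1:

```r
data(InsectSprays)

# Work on a copy so the original data frame is not modified.
sprays_copy <- InsectSprays

sprays_copy[1, 2]        # single cell: first row, second column
head(sprays_copy[ , 1])  # start of the entire first column

# Overwrite one cell, then rescale the whole column.
sprays_copy[3, 1] <- 8
sprays_copy[ , 1] <- sprays_copy[ , 1] * 10
head(sprays_copy)
```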
Assigning Variables
To assign variables, you use the same <- as when you assigned your data frame a name:
variable1 <- sqrt(InsectSprays$column)
This creates a vector of the square root of whatever data column you select. You can now use this
transformed data in a t-test or graph, etc.
However, it is often better practice to create a new variable within your data frame, rather than in your
workspace, which can get overloaded:
InsectSprays$new_column <- sqrt(InsectSprays$column)
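As a concrete sketch with the count column of InsectSprays (the names sprays_copy and sqrt_count are chosen here for illustration):

```r
data(InsectSprays)

# Copy the data frame, then store the transformed values as a new column.
sprays_copy <- InsectSprays
sprays_copy$sqrt_count <- sqrt(sprays_copy$count)

# The new column travels with the data frame instead of sitting in the workspace.
names(sprays_copy)
head(sprays_copy)
```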
Subsetting Data
To remove all objects from your workspace:
rm(list=ls())
To remove a single object you’ve created:
rm(object_name)
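A minimal sketch of removing one object; scratch_value is a throwaway name used only here:

```r
# Create an object, confirm it exists, then remove it.
scratch_value <- 42
exists("scratch_value")

rm(scratch_value)
exists("scratch_value")

# rm(list = ls()) would clear every object in the workspace,
# so it is left out of this sketch on purpose.
```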
If you want to subset a portion of your data, there are a few ways:
subset1 <- subset(InsectSprays, column == "value", drop = TRUE)
Subsets your data for all rows that have the specified value in the column
subset2 <- InsectSprays[1:12, ]
For rows 1 through 12 of your data frame (includes all columns)
subset3 <- InsectSprays[ , 1:2]
For columns 1 through 2 of your data frame (includes all rows)
Note that the row and column numbers must exist in the data frame: InsectSprays has 72 rows and 2 columns.
Example: sprayA <- subset(InsectSprays, spray == "A", drop = TRUE)
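The subsetting options can be sketched on the built-in data as follows. The object names are illustrative, and the row and column ranges are chosen to fit InsectSprays, which has 72 rows and 2 columns:

```r
data(InsectSprays)

# Rows where the spray column equals "A".
spray_a <- subset(InsectSprays, spray == "A")

# Rows 1 through 12, all columns.
first_dozen <- InsectSprays[1:12, ]

# Column 1 only, all rows; drop = FALSE keeps the result a data frame.
counts_only <- InsectSprays[ , 1, drop = FALSE]

nrow(spray_a)
dim(first_dozen)
dim(counts_only)
```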
Common Errors
+ - if you’re typing a line of code in the R console or trying to run something from your script, you
may get a + sign instead of your desired output. This generally means you haven’t closed a bracket,
so R is waiting for the rest of the command. You’ll need to go back and correct your code.
Object not found – this means that you are telling R to use something (e.g. a column) it can’t find.
e.g. What is the mean insect count in the InsectSprays data?
Try:
1. mean(count) This fails because count only exists inside the data frame, not in your workspace.
2. mean(InsectSprays$count) With this syntax we tell R where to find the count data.
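A sketch of the two attempts, with tryCatch() added only so the script keeps running after the error. It assumes there is no object called count in your workspace; err_msg and mean_count are illustrative names:

```r
data(InsectSprays)

# Attempt 1: R looks for an object called count and cannot find one.
err_msg <- tryCatch(
  mean(count),
  error = function(e) conditionMessage(e)
)
err_msg

# Attempt 2: the $ tells R to look for count inside InsectSprays.
mean_count <- mean(InsectSprays$count)
mean_count
```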
Could not find function – this most often happens when you don’t have the right package loaded.
e.g. Quick plot of the spray data (not using base graphics)
Try:
1. qplot(InsectSprays$spray, InsectSprays$count) This fails if the ggplot2 package is not loaded.
2. install.packages("ggplot2")
library(ggplot2)
qplot(InsectSprays$spray, InsectSprays$count)
Basic Functions
Some basic functions you may need:
mean() – mean
sd() – standard deviation
log() – log
sqrt() – square root
sd(x)/sqrt(length(x)) – standard error
Try:
mean(dataset$variable, na.rm = TRUE)
For our example: mean(InsectSprays$count, na.rm = TRUE)
na.rm = TRUE tells R to ignore missing values (NA) – otherwise the mean will be NA whenever the data
contain a missing value.
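A short sketch with a made-up vector x that contains one missing value. Note that length() counts NA entries, so they are dropped before computing the standard error:

```r
# A small vector with one missing value (hypothetical data).
x <- c(4, 8, NA, 6)

mean(x)                 # NA, because of the missing value
mean(x, na.rm = TRUE)   # mean of the three observed values

# Standard error: drop the NA first so length() counts only real values.
x_complete <- x[!is.na(x)]
se <- sd(x_complete) / sqrt(length(x_complete))
se
```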
Graphing:
plot(dataset$predictor, dataset$response)
or:
with(InsectSprays, plot(spray, count))
This second option tells R, I am going to use InsectSprays to do the following commands, so look there for
my data.
Normality tests:
1. shapiro.test(InsectSprays$column)
2. qqnorm(InsectSprays$column)
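T-test: t.test(response ~ predictor, data = dataset), where the predictor is a grouping variable with exactly
two levels. To compare two separate numeric vectors instead, use t.test(x, y).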
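A sketch of a normality check followed by a t-test. Because a t-test compares two groups, the example keeps only sprays A and B; that choice, and the names two_sprays, shapiro_result and t_result, are assumptions made for illustration:

```r
data(InsectSprays)

# Keep two of the six sprays and drop the unused factor levels.
two_sprays <- droplevels(subset(InsectSprays, spray %in% c("A", "B")))

# Shapiro-Wilk normality test on the response.
shapiro_result <- shapiro.test(two_sprays$count)
shapiro_result

# Two-sample t-test of count by spray (Welch's version by default).
t_result <- t.test(count ~ spray, data = two_sprays)
t_result
```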
Anova:
1. Create a linear model:
lm1 <- lm(response~predictor, data=InsectSprays)
summary(lm1)
2. Conduct anova:
summary.aov(lm1)
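For the InsectSprays data this could look like the sketch below, with count as the response and spray as the predictor. It uses anova() on the fitted model, which is one common way to get the ANOVA table:

```r
data(InsectSprays)

# Step 1: fit the linear model.
lm1 <- lm(count ~ spray, data = InsectSprays)
summary(lm1)

# Step 2: ANOVA table for the fitted model.
anova_table <- anova(lm1)
anova_table
```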
Also good to know:
class(InsectSprays$column) – tells you if it’s numeric, character, factor, etc.
str(InsectSprays) – tells you lots about the data; useful to see if you have a data frame or a list. If
it’s a list pretty much nothing will work, so it’s important to check before running any tests.
summary(InsectSprays$column) – gives you mean, quartiles, etc. for the column
levels(InsectSprays$column) – will tell you all the levels of the factor
names(InsectSprays) – will tell you all the column names.
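Applied to the actual columns of InsectSprays, these checks can be sketched as:

```r
data(InsectSprays)

class(InsectSprays$count)    # type of the count column
class(InsectSprays$spray)    # type of the spray column
str(InsectSprays)            # overall structure of the data frame
summary(InsectSprays$count)  # mean, quartiles, etc.
levels(InsectSprays$spray)   # levels of the spray factor
names(InsectSprays)          # column names
```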
Useful note: you can use Tab to autocomplete your command in both your script and the console. If you
want to run a t.test() for example, type t and then hit Tab – it will give you a list of possible commands
that start with t (there will be a lot). If you type t. then Tab you’ll get a shorter list. This is a nice way to
help you find the correct command without having to search the help tab each time.
Loading Packages
You may end up needing to do analyses that are not included in the default R library. In order to install
packages, you’ll need to know the name of the package you’re interested in (Google helps). R-Studio
makes it rather easy to load packages through the packages tab in the bottom right window - just click on
the check box next to the package you want and it will load.
It’s important to note that using a package is a two-step process. First you install the package via
install.packages("package_name")
You only need to install a package once, but each time you want to use the package, you have to load it,
either by checking the box in your packages tab, or by using either:
library(package_name) or require(package_name)
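One possible pattern for the two steps is sketched below. The package tools is used only as a stand-in because it ships with R, so the install step is skipped; replace it with the package you actually need:

```r
# Stand-in package name; "tools" ships with R, so nothing is downloaded here.
pkg <- "tools"

# Step 1: install only if the package is not already available.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2: load it for this session (character.only lets us pass a string).
library(pkg, character.only = TRUE)

# Confirm the package is attached to the search path.
is_attached <- paste0("package:", pkg) %in% search()
is_attached
```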
Projects
Creating projects is a good way to keep your analyses grouped together appropriately (by project,
chapter, etc.). To create a project, just go to the File menu (top left) and select ‘New Project’.
Projects allow you to switch between workspaces without having to re-load everything
or clutter your environment with too many objects.
Accessing help within and outside of R
If you need help with a command, you can search the help tab on the bottom right window, run
help(command) in the console or put a question mark before it (?command), or run an example of the
command via example(command).
The R Book – Second Edition, Michael Crawley (in PDF form you can easily ctrl+F to search for what
you’re looking for)
Google! - R-project, R bloggers, stack overflow, CRAN-R project, Quick-R, rseek.org, code school, other
statistics departmental websites, etc.
Help tab within R – takes a while to learn how to read, but can be helpful, especially with syntax
CRAN Task view - http://cran.r-project.org/web/views/
Practice:
1. Import mtcars
2. Look at data; change Mazda RX4 to 8 cylinder
3. Find min, max, and mean of mpg, hp, and qsec
4. Create subset of 8 cylinder cars; find min, max, and mean of mpg, hp, and qsec again
5. Check for normality of qsec and run a t test on the effect of cylinder size on qsec
Extra Practice – R Script Liv sent you