Intro to R Stat 430 Fall 2011

Outline
• Installation
• Learning a new language
  • Grammar
  • Vocabulary
• Graphics
Installing R
• The main website for R is http://www.r-project.org/
• Follow the link to CRAN (Comprehensive R Archive Network), pick a server close to you (try http://streaming.stat.iastate.edu/CRAN/ ), and download R for your platform
• The newest version of R is 2.13.2 (as of Fall 2011)
  • R is updated twice a year (in October and April)
Start R
• Double-click the icon
• or type R in your command-line environment
Learning a new language
Learning a language
• Grammar / Syntax
• Vocabulary
• “Thinking in that language”
Grammar (like mathematics)
• Basic algebra is the same
  • but write 2*x, not 2x, and 2^p for 2 raised to the power p
• Applying a function is similar
• Making a variable, use <- instead of =
• Everything in R is a vector
  • Index a vector using [ ]
Examples
• x = 2/3
• a = 2(x + 3)^2
• y = (1 2 3 5)^T
  • y_1
  • ∑y
  • 2y
  • f(y, 2)
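The examples above might be typed into R roughly as follows. This is only a sketch: the slide does not say what f is, so the f defined below is a hypothetical stand-in used purely for illustration.

```r
# Scalars: arithmetic works as in algebra, but multiplication must be written out
x <- 2 / 3
a <- 2 * (x + 3)^2

# c() builds a vector; a plain vector has no row/column orientation,
# so the transpose from the slide is not needed
y <- c(1, 2, 3, 5)

y[1]      # first element (R indexes from 1)
sum(y)    # sum of all elements
2 * y     # arithmetic is applied element by element

# Hypothetical function standing in for f on the slide:
# raises each element of v to the power p
f <- function(v, p) v^p
f(y, 2)
```

Assignment uses <- throughout, and every operation on y acts on the whole vector at once.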
You try
• x = (4 1 3 9)^T
• y = (1 2 3 5)^T
• d = √(x^2 - y^2)
• 2(d_1 + d_4)
Vocabulary
• What verbs (=functions) do you need to know?
  • Loading data
  • Accessing parts of things
  • Statistical summaries & models
  • Graphical summaries
  • ...
Reference card: PDF
Loading data
• Import data with:
  • read.csv() for csv files (and use file.choose() to help find your file)
• Save from Excel as csv files
• Stored in a data.frame
  • a list of variables with the same length
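A minimal sketch of the import step, assuming a small made-up file: the column names and values below are invented for illustration and are not the course data. The file is written to a temporary location so the example runs without any download.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
# Column names and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Day.of.Week,ArrDelay",
             "1,5",
             "2,-3",
             "3,42"), csv_path)

# read.csv() returns a data.frame; the header row becomes the column names
flights_demo <- read.csv(csv_path)

# Interactively, file.choose() opens a file picker instead of a typed path:
# flights_demo <- read.csv(file.choose())

class(flights_demo)    # "data.frame"
names(flights_demo)    # "Day.of.Week" "ArrDelay"
```

Each column of the result is one variable, and all columns have the same number of rows.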
Your turn
• Download flights-train data
• Load it into R
  (use flights <- read.csv(file.choose()))
• Use head(flights) to check it worked
Examining variables
• a
• head(a)
• summary(a)
• str(a)
• dim(a)
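To see what each of these commands reports, they can be tried on a small data.frame. The data.frame below is a hypothetical stand-in with made-up values; typing a on its own would print the whole object.

```r
# Hypothetical stand-in for the data on the slide (values are made up)
a <- data.frame(
  Day.of.Week = c(1, 2, 3, 4, 5, 6, 7, 1),
  ArrDelay    = c(5, -3, 42, 0, 17, -8, 120, 3)
)

head(a)      # first six rows
summary(a)   # min, quartiles, median, mean and max of each numeric column
str(a)       # compact structure: column types and first few values
dim(a)       # number of rows, then number of columns
```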
What do we have?
• A data.frame = a list of variables of the same length (but may be different types)
• Has row and column names
Extracting bits of a data.frame
• x$variable
• x[, "variable"]
• x[rows, columns]
  • x[1:5, 2:3]
  • x[c(1,5,6), c("Day.of.Week","X.capital")]
• x$variable[rows]
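The indexing forms above can be tried on a small example. The data.frame here is hypothetical: Distance is an invented column, and all values are made up.

```r
# Hypothetical data.frame; Distance is an invented column for illustration
x <- data.frame(
  Day.of.Week = c(1, 2, 3, 4, 5, 6),
  ArrDelay    = c(5, -3, 42, 0, 17, -8),
  Distance    = c(300, 450, 1200, 300, 800, 450)
)

x$ArrDelay                                    # one column, as a vector
x[, "ArrDelay"]                               # the same column, selected by name
x[1:3, 2:3]                                   # rows 1 to 3, columns 2 to 3
x[c(1, 5, 6), c("Day.of.Week", "ArrDelay")]   # chosen rows, columns by name
x$ArrDelay[2:4]                               # elements 2 to 4 of one column
```

Leaving the row position empty, as in x[, "ArrDelay"], selects all rows.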
Statistical summaries
• mean, median, min, max, range
• sd, var, cor
• table
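A short sketch of these functions on made-up vectors. The numbers are invented for illustration and are not taken from the flights data.

```r
# Made-up values for illustration
delay <- c(5, -3, 42, 0, 17, -8, 120, 3)               # delays in minutes
dist  <- c(300, 450, 1200, 300, 800, 450, 1200, 300)   # distances
day   <- c(1, 2, 3, 4, 5, 6, 7, 1)                     # day-of-week codes

mean(delay)        # 22
median(delay)      # 4
min(delay)
max(delay)
range(delay)       # smallest and largest value, as a vector of length 2

sd(delay)          # standard deviation
var(delay)         # variance, the square of the standard deviation
cor(delay, dist)   # Pearson correlation between two variables

table(day)         # how often each value occurs
```

With a single very large value in the data, the mean sits well above the median.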
Your turn
• Compare mean and median of ArrDelay. How can we interpret the difference?
• Does day of the week have an impact on the number of landings? Use the command table to find out
• Do delays depend on weekday? How could we find out?
Packages
• One of the great advantages of R is that it encourages development of ‘packages’
• i.e. modules with additional functionality made by users for users
Plotting package
• install.packages("ggplot2")
• See http://had.co.nz/ggplot2/ for more info
Your turn
• Load ggplot2 into your workspace:
  library(ggplot2)
• Plot arrival delay in a histogram:
  qplot(ArrDelay, data=flights)
Histograms
• Divide data into bins
• Count number of observations in each bin
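The two steps can be sketched by hand in base R. The delays and the bin width of 30 are made-up choices for illustration; plotting functions do this binning for you, and their exact bin edges may differ.

```r
# Made-up delays in minutes
delay <- c(5, -3, 42, 0, 17, -8, 120, 3, 12, 28)

# Step 1: bin edges of width 30 that cover the whole range of the data
breaks <- seq(-30, 120, by = 30)

# cut() assigns each observation to a bin of the form (lower, upper]
bins <- cut(delay, breaks = breaks)

# Step 2: table() counts the observations in each bin
counts <- table(bins)
counts
```

The bar heights of a histogram are these counts, one bar per bin.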
Histograms
• qplot(ArrDelay, data=flights, geom="histogram")
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=10)
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=60)
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=1)
Interpreting Histograms
• Big Pattern: Shape of the data
  • peaked vs flat,
  • skew vs symmetric
• Small Pattern:
  • location/number of modes
  • gaps (or areas of low density)
Investigating relationships
Variables are
• both continuous: use scatterplot
  qplot(X,Y, data=flights)
• continuous and discrete: use multiple boxplots
  qplot(factor(X),Y, data=flights, geom="boxplot")
• both discrete: ?
Your turn
• Explore relationships between arrival delay and other variables
• Are there any interesting patterns?
• What does alpha=I(0.25) do?
Continuous vs discrete
• If we use a scatterplot, there is a lot of overplotting
• Some solutions:
  • jitter points randomly so they don’t overlap
  • summarise the distribution using boxplots or histograms
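Jittering can be sketched in base R with jitter(). The data and the jitter amount of 0.2 are made-up choices for illustration; ggplot2 also offers a jitter geom for the same purpose.

```r
set.seed(1)   # make the random jitter reproducible

# Made-up data: a discrete variable and delays with many tied values
day   <- rep(1:3, each = 4)
delay <- c(5, 5, 7, 5, 0, 0, 2, 0, 10, 10, 10, 12)

# jitter() adds uniform random noise; amount = 0.2 keeps it within +/- 0.2
day_jittered <- jitter(day, amount = 0.2)

# Points that shared the same (day, delay) pair are now spread out sideways
plot(day_jittered, delay, xlab = "day (jittered)", ylab = "delay")
```

Only the discrete variable is jittered here, so the delay values themselves stay exact.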