Intro to R
Stat 430, Fall 2011

Outline
• Installation
• Learning a new language
• Grammar
• Vocabulary
• Graphics

Installing R
• The main website for R is http://www.r-project.org/
• Follow the link to CRAN (Comprehensive R Archive Network), pick a server close to you (try http://streaming.stat.iastate.edu/CRAN/ ), and download R for your platform
• The newest version of R is 2.13.2
• R is updated twice a year (in October and April)

Start R
• Double-click the icon
• or type R in your command-line environment

Learning a new language

Learning a language
• Grammar / Syntax
• Vocabulary
• "Thinking in that language"

Grammar
• Basic algebra is the same as in mathematics
  • but write 2*x, not 2x, and 2^p instead of 2ᵖ
• Applying a function is similar
• To make a variable, use <- instead of =
• Everything in R is a vector
  • Index a vector using [ ]

Examples
• x = 2/3
• a = 2(x + 3)²
• y = (1 2 3 5)ᵀ
  • y₁
  • ∑y
  • 2y
  • f(y, 2)

You try
• x = (4 1 3 9)ᵀ
• y = (1 2 3 5)ᵀ
• d = √(x² − y²)
• 2(d₁ + d₄)

Vocabulary
• What verbs (= functions) do you need to know?
  • Loading data
  • Accessing parts of things
  • Statistical summaries & models
  • Graphical summaries
  • ...
• Reference card: PDF

Loading data
• Import data with:
  • read.csv() for csv files (and use file.choose() to help find your file)
• Save from Excel as csv files
• Stored in a data.frame
  • a list of variables with the same length

Your turn
• Download the flights-train data
• Load it into R (use flights <- read.csv(file.choose()))
• Use head(flights) to check it worked

Examining variables
• a
• head(a)
• summary(a)
• str(a)
• dim(a)

What do we have?
• A data.frame = a list of variables of the same length (but possibly of different types)
• Has row and column names

Extracting bits of a data.frame
• x$variable
• x[, "variable"]
• x[rows, columns]
  • x[1:5, 2:3]
  • x[c(1,5,6), c("Day.of.Week","X.capital")]
• x$variable[rows]

Statistical summaries
• mean, median, min, max, range
• sd, var, cor
• table

Your turn
• Compare mean and median of ArrDelay.
  How can we interpret the difference?
• Does day of the week have an impact on the number of landings? Use the command table to find out
• Do delays depend on weekday? How could we find out?

Packages
• One of the great advantages of R is that it encourages development of 'packages'
  • i.e. modules with additional functionality made by users for users

Plotting package
• install.packages("ggplot2")
• See http://had.co.nz/ggplot2/ for more info

Your turn
• Load ggplot2 into your workspace:
  library(ggplot2)
• Plot arrival delay in a histogram:
  qplot(ArrDelay, data=flights)

Histograms
• Divide data into bins
• Count number of observations in each bin

Histograms
• qplot(ArrDelay, data=flights, geom="histogram")
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=10)
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=60)
• qplot(ArrDelay, data=flights, geom="histogram", binwidth=1)

Interpreting Histograms
• Big Pattern: Shape of the data
  • peaked vs flat, skewed vs symmetric
• Small Pattern:
  • location/number of modes
  • gaps (or areas of low density)

Investigating relationships
Variables are
• both continuous: use scatterplot
  qplot(X, Y, data=flights)
• continuous and discrete: use multiple boxplots
  qplot(factor(X), Y, data=flights, geom="boxplot")
• both discrete: ?

Your turn
• Explore relationships between arrival delay and other variables
• Are there any interesting patterns?
• What does alpha=I(0.25) do?

Continuous vs discrete
• If we use a scatterplot, there is a lot of overplotting
• Some solutions:
  • jitter points randomly so they don't overlap
  • summarise the distribution using boxplots or histograms
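
Example: jitter vs boxplot
The two solutions to overplotting can be sketched as follows. This is a minimal illustration, not the course's own code. It assumes ggplot2 is installed, and it uses a small simulated data frame (flights_demo, a hypothetical stand-in) because the real flights file is not reproduced here. ArrDelay is the variable name from the slides; DayOfWeek is an assumed column name. The plots are built with ggplot() rather than qplot(), since qplot() is deprecated in recent ggplot2 releases.

```r
library(ggplot2)

# Hypothetical stand-in for the flights data (simulated, not real delays).
# ArrDelay comes from the slides; DayOfWeek is an assumed column name.
set.seed(430)
n <- 700
flights_demo <- data.frame(
  DayOfWeek = rep(1:7, each = n / 7),
  ArrDelay  = round(rexp(n, rate = 1 / 20) - 10)
)

# Plain scatterplot: with a discrete x variable the points stack up
# in seven vertical strips, so many observations hide each other.
p_points <- ggplot(flights_demo, aes(x = factor(DayOfWeek), y = ArrDelay)) +
  geom_point()

# Solution 1: jitter the points sideways only (height = 0 leaves
# the delay values themselves unchanged).
p_jitter <- ggplot(flights_demo, aes(x = factor(DayOfWeek), y = ArrDelay)) +
  geom_jitter(width = 0.25, height = 0)

# Solution 2: summarise each weekday's distribution with a boxplot.
p_box <- ggplot(flights_demo, aes(x = factor(DayOfWeek), y = ArrDelay)) +
  geom_boxplot()
```

Typing p_points, p_jitter or p_box at the R console draws the corresponding plot. To try this on the real data, replace flights_demo with flights and DayOfWeek with whatever the weekday column is called in your file.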