Stat 579: Introduction and Preliminaries

Ranjan Maitra
2220 Snedecor Hall
Department of Statistics
Iowa State University
Phone: 515-294-7757
maitra@iastate.edu
August 25, 2011


What is R?

- Statistical software fulfilling functions similar to those of SAS, SPSS and S-PLUS, unlike numerical computation software (Maple, Matlab, Mathematica)
- Like any statistical software package, provides functions to perform non-trivial statistical operations:
  - classical (regression, logistic regression, analysis of variance (ANOVA), decision trees, principal component analysis, etc.)
  - more modern (neural networks, bootstrap, generalized additive models (GAM), mixed models, etc.)
- Freely available for download under the GNU General Public License (GPL) at www.R-project.org
- Free (as in "free speech") software: free to use, modify, distribute and extend, as long as the rights and contributions of the contributors are acknowledged and protected


GUI, Speed and Memory

- Real programming language, not point-and-click software
  - powerful: we are not limited by the software designers' imagination; we can use it to do whatever we want it to do
- Interpreted language
  - advantage: less time writing code
  - drawback: computations are slower than in lower-level programming languages such as C or Fortran
  - adequate for many needs, including most graduate class work
  - not so for most research, for which we may want to program in C/Fortran (Stat 580)
    - we can still use some of R's built-in C functions to help
    - combine the two: use C/Fortran for the computer-intensive parts, and R for the tedious-coding parts


Operating Systems

- R exists for all major OS's
  - Windows: click on the icon
  - Mac: same as above
  - Linux: type R at the prompt
- Starting R in any of these ways brings up a banner such as:

    R version 2.13.1 (2011-07-08)
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: x86_64-redhat-linux-gnu (64-bit)
    ......


Getting out of R

- Simply type:

    > q()
    Save workspace image? [y/n/c]: y

- Some systems will bring up a dialog box, others a text prompt, to which you can respond (y)es, (n)o or (c)ancel (a single-letter abbreviation will do) to save the data before quitting, quit without saving, or return to the R session.
- Saved data will be available in future R sessions.
- Every command in R is a function call with arguments; the argument list may be empty, as in q() above.


A Demonstration of Graphics, Images and Math-plotting capabilities

- Let us try a test run of R as intended by its developers:

    > demo(graphics)
    > demo(plotmath)
    > demo(image)

- Each demo steps through a list of commands pertaining to that set of capabilities.
- Some functions also come with examples. An example of simple least-squares fitting of a linear regression model:

    > example(lsfit)

- Not all functions have an example(): this depends on the developer(s).


A Sample Session and Some Capabilities

- Some helpful features:

    > help.start()

  starts the HTML interface to on-line help (using a web browser available on your machine). Explore the features of this facility with the mouse.
- Want to change the browser?

    > help.start(browser = "firefox")

- "An Introduction to R" is really a very comprehensive manual. Master it: very little need to come to class!
- "Search Engine and Keywords" will become more useful as the class and our careers progress.
  - Search on "plot": this returns the functions that have anything to do with plotting.
  - Some are useful, some not so, but the search reminds us of relevant functions that we may have forgotten.


Some simple examples

- Generate two pseudo-random normal vectors of x- and y-coordinates [1]:

    > x <- rnorm(n = 50)
    > y <- rnorm(n = x)

- What does rnorm do? What do the arguments in the function do? Let us look at the following:

    > help(rnorm)

  which is the same as

    > ?rnorm

  and study the function details.
- Plot y against x:

    > plot(x = x, y = y)

  This plots the points in the plane. A graphics window appears automatically.
- See which R objects are now in the R workspace:

    > ls()

- Let us remove objects no longer needed:

    > rm(x, y)

[1] Pseudo-random? Random means unpredictable, not arbitrary, as is the colloquial interpretation. Pseudo- means "fake" or simulated. So, a


Some more introductory examples – I

- Let us make a vector containing the sequence 1 through 20:

    > x <- 1:20

- How do we display this object? We simply type its name:

    > x

- Let us try a simple operation on this object:

    > w <- 1 + sqrt(x)/2

  This takes the element-wise square root of the vector x, halves it, and adds 1 to each coordinate.
- Moving on, can we work out what this does?

    > dummy <- data.frame(x = x, y = x + rnorm(x)*w)
    > dummy

  We make a "data frame" of two columns, x and y, and look at it.


Some more introductory examples – II

- Consider the following:

    > fm <- lm(y ~ x, data = dummy)
    > summary(fm)

    Call:
    lm(formula = y ~ x, data = dummy)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.6315 -0.8137  0.2134  0.8470  5.0178

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.63569    0.97234   1.682     0.11
    x            0.84072    0.08117  10.358 5.19e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.093 on 18 degrees of freedom
    Multiple R-squared: 0.8563, Adjusted R-squared: 0.8483
    F-statistic: 107.3 on 1 and 18 DF, p-value: 5.187e-09

- We fit a simple linear regression of y on x, store the fitted model in the object fm, and look at the results.


Some more introductory examples – III

    > attach(dummy)

- Make the columns in the data frame visible as variables.

    > plot(x = x, y = y)
    > abline(a = 0, b = 1, lty = 3)   # The true regression line (intercept 0, slope 1).
    > abline(coef(fm))                # The fitted simple linear regression line.
    > detach()

- Remove the data frame from the search path.

    > plot(x = fitted(fm), y = resid(fm), xlab = "Fitted values",
           ylab = "Residuals", main = "Residuals vs Fitted")

- A standard regression diagnostic plot to check for heteroscedasticity. Can you see it?

    > rm(fm, w, x, dummy)
    > q()
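
A reproducible variant of the session, as a minimal sketch: the numbers in the summary above change on every run because rnorm draws fresh values each time. Fixing a seed first makes the session repeatable. The seed value 579 and the object name noise are assumptions made for illustration only; they do not come from the session above. The sketch also checks two points that are easy to miss: rnorm(n = x) draws length(x) values when x is a vector, and lm returns a fitted-model object rather than a data frame.

```r
# Fix the random number generator so that reruns give identical draws.
set.seed(579)                 # arbitrary seed, chosen only for illustration

x <- 1:20
w <- 1 + sqrt(x) / 2          # noise standard deviations, increasing with x

# When n is a vector of length > 1, rnorm() draws length(n) values.
noise <- rnorm(n = x)
length(noise)                 # 20

dummy <- data.frame(x = x, y = x + noise * w)
fm <- lm(y ~ x, data = dummy)

class(fm)                     # "lm": a fitted-model object, not a data frame
names(coef(fm))               # "(Intercept)" "x"

# coef(fm) is a length-2 vector (intercept, slope), which is why
# abline(coef(fm)) can draw the fitted line directly.
coef(fm)

# with() evaluates an expression using the columns of the data frame,
# an alternative to attach()/detach() that leaves the search path untouched.
with(dummy, cor(x, y))
```

Rerunning the whole block reproduces the same coefficients, because set.seed is called before any random draw. The noise standard deviation w grows with x, which is the source of the heteroscedasticity that the residual plot is meant to reveal.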