Stat 579: Introduction and Preliminaries Ranjan Maitra

2220 Snedecor Hall
Department of Statistics
Iowa State University.
Phone: 515-294-7757
maitra@iastate.edu
August 25, 2011
What is R?
Statistical software serving much the same purpose as SAS, SPSS
and S-PLUS,
unlike general-purpose numerical computation software (Maple, MATLAB,
Mathematica)
Like any statistical software package, R provides functions to
perform non-trivial statistical operations,
classical (regression, logistic regression, analysis of
variance (anova), decision trees, principal component
analysis, etc.)
more modern (neural networks, bootstrap, generalized
additive models (GAM), mixed models, etc.)
Freely available for download under the GNU General Public
License (GPL) at www.R-project.org.
Free (as in "free speech") software
free to use, modify, distribute and extend, as long as the
rights and contributions of the contributors are
acknowledged and protected.
GUI, Speed and Memory
Real programming language, not point-and-click software
powerful: we are not limited by the software designers’
imagination, we can use it to do whatever we want it to do.
Interpreted language:
advantage: less time writing code
drawback: computations slower than in lower-level
programming languages such as C or Fortran
adequate for many needs, including for most graduate class
work
not so for most research, for which we may want to program
in C/Fortran (Stat 580)
we can still use some of R's built-in C functions to help
combine the two: C/Fortran for the computer-intensive parts,
and R for the parts that are tedious to code
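The speed trade-off above can be sketched in base R. This is only an illustration: the helper name slow_sum is made up here, while sum() is R's built-in, which runs in compiled code. Timings vary by machine, so none are quoted.

```r
# Hypothetical helper (not from the slides): add up a vector with an
# explicit loop that the R interpreter executes step by step.
slow_sum <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value
  }
  total
}

v <- as.numeric(1:100000)

# Both approaches give the same answer ...
loop_result    <- slow_sum(v)
builtin_result <- sum(v)

# ... but the built-in does its work in compiled code and is usually
# much faster. system.time() reports the elapsed time of each call.
loop_time    <- system.time(slow_sum(v))
builtin_time <- system.time(sum(v))
print(loop_time)
print(builtin_time)
```

Comparing the two printed timings on your own machine gives a feel for when it pays to push a computation down into compiled code.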
Operating Systems
R is available for all major operating systems
Windows: click on the icon
Mac: same as above
Linux: type R at the prompt
Any of the above starts R, which prints a banner such as:
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for
Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)
......
Getting out of R
Simply type:
> q()
Save workspace image?
[y/n/c]:
y
Some systems will bring up a dialog box, others a text
prompt to which you can respond (y)es, (n)o or
(c)ancel (a single letter abbreviation will do) to save the
data before quitting, quit without saving, or return to the R
session.
Saved data will be available in future R sessions.
Every command in R is a function call, usually with arguments.
The argument list may be empty, as in q() above.
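A small sketch of the "everything is a function call" idea, using only base R. The variable names are illustrative and not part of the slides.

```r
# A call with an empty argument list, just like q():
objects_now <- ls()

# Operators are functions too; these two lines compute the same thing.
a <- 1 + 2
b <- `+`(1, 2)

# q() itself has arguments with default values, which is why it can be
# called with none. formals() lists a function's arguments.
q_args <- names(formals(q))
print(q_args)
```

Typing a function's name without the parentheses (for example, q) prints its definition instead of calling it.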
A Demonstration of Graphics, Images and
Math-plotting capabilities
Let us take R for a test run using the demos supplied by its developers.
Start R, then type:
> demo(graphics)
> demo(plotmath)
> demo(image)
Each demo displays the commands it runs for that set of
capabilities
Some functions also come with examples:
An example of simple least-squares fitting of a linear
regression model:
> example(lsfit)
Not all functions have example(): depends on
developer(s)
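To see which demos are actually available, one option is to ask R for the list. The sketch below assumes a standard R installation with the base graphics package; the variable name graphics_demos is illustrative.

```r
# demo() with a package argument returns a listing of that package's
# demos; the "Item" column of its results holds the demo names.
graphics_demos <- demo(package = "graphics")$results[, "Item"]
print(graphics_demos)

# Each name printed above can be passed to demo(), e.g. demo(image).
```

Calling demo() with no arguments lists the demos in all attached packages.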
A Sample Session and Some Capabilities
Some helpful features:
> help.start()
starts the HTML interface to on-line help (using a web
browser available on your machine). Explore the features of
this facility with the mouse.
Want to use a different browser?
> help.start(browser = "firefox")
“An Introduction to R” is really a very comprehensive
manual.
Master it: very little need to come to class!
“Search Engine and Keywords” will become more useful as
the class and our careers progress.
search on “plot” – this lists the functions that
have anything to do with plotting.
some are useful, some less so, but the list reminds us of relevant
functions that we may have forgotten
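The same kind of search can be done from the R prompt. A minimal sketch with base R's apropos(), which matches object names on the search path; the variable name plot_functions is illustrative.

```r
# Names of all objects on the search path whose name contains "plot".
plot_functions <- apropos("plot")
head(plot_functions)
length(plot_functions)

# help.search("plot"), or equivalently ??plot, searches the help pages
# (titles, keywords, ...) rather than object names, so it also finds
# functions in installed packages that are not currently loaded.
```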
Some simple examples
Generate two pseudo-random normal vectors of x- and
y-coordinates (see footnote 1):
> x <- rnorm(n = 50)
> y <- rnorm(n = x) # n is a vector here, so length(x) = 50 values are drawn
What does rnorm do? What do the arguments in the
function do? Let us look at the help page:
> help(rnorm)
which is the same as
> ?rnorm
and study the function details.
Plot y against x:
> plot(x = x, y = y)
This plots the points in the plane. A graphics window appears
automatically.
See which R objects are now in the R workspace:
> ls()
Let us remove objects no longer needed:
> rm(x, y)
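Because the numbers are pseudo-random, they can be reproduced by fixing the generator's seed. A minimal sketch; the seed value 579 is arbitrary and chosen only for illustration.

```r
set.seed(579)                 # fix the state of the generator
first_draw <- rnorm(n = 5)

set.seed(579)                 # same seed ...
second_draw <- rnorm(n = 5)   # ... so the same "random" numbers

identical(first_draw, second_draw)
```

Without the calls to set.seed(), two successive calls to rnorm() give different values.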
Footnote 1: Pseudo-random? Random means unpredictable, not arbitrary, as is the
colloquial interpretation. Pseudo- means “fake” or simulated. So, a
Some more introductory examples – I
Let us make a vector containing the sequence 1 through
20:
> x <- 1:20
How do we display this object? We simply type its name:
> x
Let us try a simple operation on this object:
> w <- 1 + sqrt(x)/2
This operation takes the element-wise square root of the
vector x, halves it, and adds 1 to each element.
Moving on, can we work out what this does?
> dummy <- data.frame(x = x, y = x + rnorm(x)*w)
> dummy
We make a “data frame” with two columns, x and y, and
look at it.
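The vectorized arithmetic and the data frame above can be checked piece by piece. A minimal sketch; w4_by_hand is an illustrative name, and the y values differ from run to run because they contain random noise.

```r
x <- 1:20
w <- 1 + sqrt(x) / 2

# The same computation written out for a single element (x[4] is 4):
w4_by_hand <- 1 + sqrt(x[4]) / 2
w[4] == w4_by_hand

# rnorm(x) draws length(x) values, so y gets one noisy value per x,
# with noise that grows with x because it is scaled by w.
dummy <- data.frame(x = x, y = x + rnorm(x) * w)

dim(dummy)     # number of rows and columns
names(dummy)   # column names
str(dummy)     # compact description of the structure
```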
Some more introductory examples – II
Consider the following:
> fm <- lm(y ~ x, data = dummy)
> summary(fm)
Call:
lm(formula = y ~ x, data = dummy)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6315 -0.8137  0.2134  0.8470  5.0178

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.63569    0.97234   1.682     0.11
x            0.84072    0.08117  10.358 5.19e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.093 on 18 degrees of freedom
Multiple R-squared: 0.8563, Adjusted R-squared: 0.8483
F-statistic: 107.3 on 1 and 18 DF, p-value: 5.187e-09
We fit a simple linear regression of y on x, store the fitted model
in the object fm, and look at a summary of the results.
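Individual pieces of the fit can be pulled out of fm with extractor functions instead of being read off the printed summary. A sketch using base R only; the seed is arbitrary, and the numbers will not match the output shown above, which came from a different random draw.

```r
set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
x <- 1:20
w <- 1 + sqrt(x) / 2
dummy <- data.frame(x = x, y = x + rnorm(x) * w)

fm <- lm(y ~ x, data = dummy)

class(fm)                         # a fitted-model object of class "lm"
estimates  <- coef(fm)            # intercept and slope
intervals  <- confint(fm)         # 95% confidence intervals by default
coef_table <- coef(summary(fm))   # the "Coefficients" table as a matrix
print(coef_table)
```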
Some more introductory examples – III
> attach(dummy)
Make the columns in the data frame visible as variables.
> plot(x = x, y = y)
> abline(a = 0, b = 1, lty = 3) # The true regression line (intercept 0, slope 1).
> abline(coef(fm)) # The simple linear regression line.
> detach()
Remove the data frame from the search path.
> plot(x = fitted(fm), y = resid(fm), xlab =
"Fitted values", ylab = "Residuals",
main="Residuals vs Fitted")
A standard regression diagnostic plot to check for
heteroscedasticity. Can you see it?
> rm(fm, x, w, dummy)
> q()
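The same plots can be drawn without attach() and detach() by passing the data frame explicitly. This is a sketch of one alternative, not the approach used in the slides; the seed is arbitrary, and the plots are sent to a temporary PDF only so that the code also runs non-interactively (drop the pdf() and dev.off() lines to draw on screen).

```r
set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
x <- 1:20
w <- 1 + sqrt(x) / 2
dummy <- data.frame(x = x, y = x + rnorm(x) * w)
fm <- lm(y ~ x, data = dummy)

# Residuals are the observed values minus the fitted values.
manual_resid <- dummy$y - fitted(fm)

pdf(file = tempfile(fileext = ".pdf"))  # throwaway graphics device

# The formula interface looks up x and y inside 'dummy' directly.
plot(y ~ x, data = dummy)
abline(a = 0, b = 1, lty = 3)   # true line: intercept 0, slope 1
abline(coef(fm))                # fitted regression line

# with() evaluates an expression using the columns of 'dummy'.
with(dummy, plot(x, abs(resid(fm)), xlab = "x", ylab = "|Residual|"))

invisible(dev.off())
```

Because nothing is attached, there is nothing to detach afterwards and the search path is left unchanged.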