Document

advertisement

R Lecture 5

Naomi Altman

Department of

Statistics

Example: Regression

The data are available at http://www.stat.psu.edu/~jls/stat511/homework/body.dat

?read.table

body=read.table("body.txt",header=T) plot(body$hips,body$weight) plot(body$waist,body$weight)

?formula

lm.out=lm(weight~hips+waist,data=body) attributes(lm.out)

Formulas

lm fits the regression of Y on a set of X variables.

The variable for Y and the predictors are denoted by a formula of the form.

You can also use formulas in other contexts. e.g.

plot(weight~waist, data=body)

Object Oriented Programming in R

or how a bunch of smart programming types made R easier to use and harder to program - at least in the eyes of a statistician

In the bad old days

If I wanted to write a function similar to something already in R, I would edit the R code: myFun=edit(Rfun) myDensity=edit(density)

Sometimes the R code would call a C or C++ program, but the code for that is also available.

plot boxplot rnorm

But now ...

Classes and Generic Functions

I have already mentioned that one of the attributes a R object can have is a class.

A generic function is a function that captures the class of an object and then calls another function to do the actual work. If the function is called fun and the class is called cls, the function that does the work is (almost always) called fun.cls.

If there is no suitable fun.cls, then fun.default is used.

e.g.

plot(body$hips,body$weight) plot(lm.out) plot.default

plot.lm

Classes

Actually, a class can be a pair c("first","second") in which the "first"

"inherits from" i.e. is a special case of

"second". In practise, this means that it has all the components of class "first" objects but possibly some additional ones.

If there is no fun.first, then the generic function will search for fun.second. Only if there is also no fun.second will fun.default be used.

e.g. plot uses plot.lm on an object with class "lm" and also on an object with class ("glm","lm")

'inherits' indicates whether its first argument inherits from any of the classes specified in the 'what' argument glm.out=glm(weight~hips+waist,data=body) class(glm.out)

"glm" "lm" inherits(lm.out,"lm") inherits(glm.out,"lm") inherits(lm.out,"glm") inherits(glm.out,"glm") plot.lm

plot.glm

plot(glm.out)

unclass

If you remove the class, most objects are just lists. lm.out

unclass(lm.out)

For example, the "lm" objects are lists with the following components:

"coefficients" "residuals" "effects" "rank" "fitted.values"

"assign" "qr" "df.residual"

"xlevels" "call" "terms" "model"

Some of these components are obvious.

Some of them are matrix computations that can be used to compute, e.g. the leverages and Cook's Distance (notice that these have not been stored).

Some of them are only empty - they are used primarily when the predictor variable is a factor (ANOVA).

Why use classes

For the user: less to think about e.g. you can try generic functions like plot and summary with any output

For the programmer: provides a framework e.g. you might think about having a plot.myfun and summary.myfun for the function you are writing also, you can use inheritance so that you do not need to write your own functions

Learning to Use Objects and other Extensions

Calling C or C++ from R:

Writing R extensions

Object oriented programming in R

(S3 protocol)

R Language Definition

(S4 protocol)

R Internals

Odds and Ends

 speeding things up (a bit)

 errors in functions and loops

 source and sink

 attach and detach

 resampling

Speeding Things Up

(I am assuming that R behaves like Splus)

Try to minimize the number of expressions evaluated.

R is an "interpreted" language, not a "compiled" language. This means that each time R encounters an expression, it has to decipher it and convert it to machine-readable microcode.

If you write a loop, then R has to decipher the expressions within the loop again and again.

Looking at the code for apply , it seems to create a loop, so apply does not get around this.

Speeding Things Up

Create space for new objects.

Every time you change the dimensions of an object, R reallocates space.

So my loop:

 intercepts=NULL

 for (i in 1:1000){y=line+rnorm(25,0,3) lmout=lm(y~x)

 intercepts=c(intercepts,getIntercept(lmout))} would be better as:

 intercepts=rep(0,1000) for (i in 1:1000){y=line+rnorm(25,0,3)

 lmout=lm(y~x) intercepts[i]=getIntercept(lmout))}

Errors in Functions and Loops

If an error occurs somewhere within a function or loop, R aborts the whole process and destroys the frame. Nothing will change in your .Rdata directory.

There is no such thing as "partial execution"; if even one mistake occurs the whole process is aborted.

If you want to "trap" errors so that your routine can recover, the " try " command can be used to return a flag.

"source" and "sink"

source("myfile.txt") runs the commands in the text file "myfile" until they are completed source("myOutput.txt") send everything that would be written to the screen to the text file "myOutput" until sink() is typed in

Attach and Detach

" attach " and " detach " expand and contract the search path for R.

You can attach either a dataframe or a

.Rdata workspace.

Any new assignments are stored in the current workspace.

When you want to stop searching the dataframe or workspace, you can detach it.

Session: Attach and Detach

myData=read.table("body.txt",header=T) lm.out=lm(weight~hips+waist,data=myData) hips[1:5] attach(myData) hips[1:5] waist[1:5] waist=hips waist[1:5] detach(myData) hips[1:5] waist[1:5] myData$waist[1:5]

Resampling

Many statistical methods require sampling from the data already collected.

Examples include the bootstrap and permutation methods.

sample(x, size, replace = FALSE, prob =

NULL) allows you to sample with or without replacement from x, with unequal probabilities if you want

Resampling

sample(x,length(x),replace=F) creates a random permutation of the data

Session: Resampling

a=c("H","T") sample(a,10,replace=T) sample(a,10,replace=T,prob=c(.2,.8)) sample(1:5,5,replace=T) sample(1:5,5,replace=F)

#bootstrap distribution of the intercept attach(myData) intercept=rep(0,1000) l=length(weight) for (i in 1:1000){ samp=sample(1:l,l,replace=T) intercept[i]=coefficients(lm(weight[samp]~ hip[samp]+waist[samp]))[1]

} hist(intercept)

Download