R Lecture 5
The data are available at http://www.stat.psu.edu/~jls/stat511/homework/body.dat
?read.table
body=read.table("body.txt",header=T) plot(body$hips,body$weight) plot(body$waist,body$weight)
?formula
lm.out=lm(weight~hips+waist,data=body) attributes(lm.out)
lm fits the regression of Y on a set of X variables.
The variable for Y and the predictors are denoted by a formula of the form.
You can also use formulas in other contexts. e.g.
plot(weight~waist, data=body)
or how a bunch of smart programming types made R easier to use and harder to program - at least in the eyes of a statistician
If I wanted to write a function similar to something already in R, I would edit the R code: myFun=edit(Rfun) myDensity=edit(density)
Sometimes the R code would call a C or C++ program, but the code for that is also available.
plot boxplot rnorm
I have already mentioned that one of the attributes a R object can have is a class.
A generic function is a function that captures the class of an object and then calls another function to do the actual work. If the function is called fun and the class is called cls, the function that does the work is (almost always) called fun.cls.
If there is no suitable fun.cls, then fun.default is used.
plot(body$hips,body$weight) plot(lm.out) plot.default
plot.lm
Actually, a class can be a pair c("first","second") in which the "first"
"inherits from" i.e. is a special case of
"second". In practise, this means that it has all the components of class "first" objects but possibly some additional ones.
If there is no fun.first, then the generic function will search for fun.second. Only if there is also no fun.second will fun.default be used.
e.g. plot uses plot.lm on an object with class "lm" and also on an object with class ("glm","lm")
'inherits' indicates whether its first argument inherits from any of the classes specified in the 'what' argument glm.out=glm(weight~hips+waist,data=body) class(glm.out)
"glm" "lm" inherits(lm.out,"lm") inherits(glm.out,"lm") inherits(lm.out,"glm") inherits(glm.out,"glm") plot.lm
plot.glm
plot(glm.out)
If you remove the class, most objects are just lists. lm.out
unclass(lm.out)
For example, the "lm" objects are lists with the following components:
"coefficients" "residuals" "effects" "rank" "fitted.values"
"assign" "qr" "df.residual"
"xlevels" "call" "terms" "model"
Some of these components are obvious.
Some of them are matrix computations that can be used to compute, e.g. the leverages and Cook's Distance (notice that these have not been stored).
Some of them are only empty - they are used primarily when the predictor variable is a factor (ANOVA).
For the user: less to think about e.g. you can try generic functions like plot and summary with any output
For the programmer: provides a framework e.g. you might think about having a plot.myfun and summary.myfun for the function you are writing also, you can use inheritance so that you do not need to write your own functions
Calling C or C++ from R:
Writing R extensions
Object oriented programming in R
(S3 protocol)
R Language Definition
(S4 protocol)
R Internals
speeding things up (a bit)
errors in functions and loops
source and sink
attach and detach
resampling
(I am assuming that R behaves like Splus)
Try to minimize the number of expressions evaluated.
R is an "interpreted" language, not a "compiled" language. This means that each time R encounters an expression, it has to decipher it and convert it to machine-readable microcode.
If you write a loop, then R has to decipher the expressions within the loop again and again.
Looking at the code for apply , it seems to create a loop, so apply does not get around this.
Create space for new objects.
Every time you change the dimensions of an object, R reallocates space.
So my loop:
intercepts=NULL
for (i in 1:1000){y=line+rnorm(25,0,3) lmout=lm(y~x)
intercepts=c(intercepts,getIntercept(lmout))} would be better as:
intercepts=rep(0,1000) for (i in 1:1000){y=line+rnorm(25,0,3)
lmout=lm(y~x) intercepts[i]=getIntercept(lmout))}
If an error occurs somewhere within a function or loop, R aborts the whole process and destroys the frame. Nothing will change in your .Rdata directory.
There is no such thing as "partial execution"; if even one mistake occurs the whole process is aborted.
If you want to "trap" errors so that your routine can recover, the " try " command can be used to return a flag.
source("myfile.txt") runs the commands in the text file "myfile" until they are completed source("myOutput.txt") send everything that would be written to the screen to the text file "myOutput" until sink() is typed in
" attach " and " detach " expand and contract the search path for R.
You can attach either a dataframe or a
.Rdata workspace.
Any new assignments are stored in the current workspace.
When you want to stop searching the dataframe or workspace, you can detach it.
myData=read.table("body.txt",header=T) lm.out=lm(weight~hips+waist,data=myData) hips[1:5] attach(myData) hips[1:5] waist[1:5] waist=hips waist[1:5] detach(myData) hips[1:5] waist[1:5] myData$waist[1:5]
Many statistical methods require sampling from the data already collected.
Examples include the bootstrap and permutation methods.
sample(x, size, replace = FALSE, prob =
NULL) allows you to sample with or without replacement from x, with unequal probabilities if you want
sample(x,length(x),replace=F) creates a random permutation of the data
a=c("H","T") sample(a,10,replace=T) sample(a,10,replace=T,prob=c(.2,.8)) sample(1:5,5,replace=T) sample(1:5,5,replace=F)
#bootstrap distribution of the intercept attach(myData) intercept=rep(0,1000) l=length(weight) for (i in 1:1000){ samp=sample(1:l,l,replace=T) intercept[i]=coefficients(lm(weight[samp]~ hip[samp]+waist[samp]))[1]
} hist(intercept)