Department of Statistics

Stockholm University
R - Study Group - Session V
Statistical procedures
Anders Fornell-Söderberg
This is a highly selective summary of a few topics from chapter 11 of “An
Introduction to R” by Venables and Smith, with an attempt at practical use on some
real data and a short description of running R in batch mode.
Formulas
The basis of statistical procedures in R is the formula. It is written in what can be
described as variable-only mode. (The template is ordinary least squares regression.)
y ~ emal36
where y in this case is the response (dependent, target) variable and emal36 is a
predictor (independent, auxiliary) variable, or whatever you want to call it. The formula
does not state anything about the nature of the relationship. The actual model is produced
by a function call that takes the formula and the data. The model function usually works
out how the model must be built from the formula and the data. For instance,
attach(x);                      ## lift some data
fkt <- as.formula(y ~ emal36);  ## create a formula
model_1 <- lm(fkt);             ## is an ordinary linear regression model
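The same steps can be tried on data that ships with R. This is only a minimal sketch: the built-in mtcars data frame and its columns mpg and wt stand in for x, y and emal36 (an assumption, they are not the data used in the session), and the data is passed through the data argument instead of attach().

```r
## Assumption: mtcars stands in for the data frame x used in the text.
fkt <- as.formula(mpg ~ wt)         # create a formula
model_1 <- lm(fkt, data = mtcars)   # ordinary linear regression

class(fkt)            # the class of the formula object
names(coef(model_1))  # the intercept is added by default
```

Passing data = mtcars keeps the variables out of the search path, which avoids name clashes that attach() can cause.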
The function lm, which we have seen previously, completes the model with an intercept
term by default. The model automatically uses numeric predictors as covariates
and character variables as factors. Mixing covariates and factors is no problem (at least
not for the lm function). One can create more elaborate formulas, such as
y ~ geog + emal36 + geog:emal36;  ## the interaction term
y ~ geog/(1+emal36) - 1;          ## separate regression lines on geog classes
y ~ I(emal36 + ant_amal36);       ## where I() gives + its arithmetic meaning
y ~ factor(emal36) + ant_amal36;  ## tell the model that you want to use the variable
                                  ## emal36 as categorical even though it is numerical
More examples are collected in chapter 11 in “An Introduction to R” by Venables and
Smith.
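One way to see what each of the formulas above does is to look at the design matrix that R builds from it. The sketch below uses a small made-up data frame: the variable names are taken from the text, but the values are hypothetical.

```r
## Hypothetical toy data: geog is a factor, the other two are numeric.
d <- data.frame(
  geog       = factor(rep(c("north", "south"), each = 4)),
  emal36     = c(1, 2, 3, 4, 1, 2, 3, 4),
  ant_amal36 = c(2, 1, 4, 3, 5, 7, 6, 8)
)

## Main effects plus interaction
mm_inter <- model.matrix(~ geog + emal36 + geog:emal36, data = d)
## Separate intercept and slope within each geog class
mm_sep <- model.matrix(~ geog/(1 + emal36) - 1, data = d)
## I() makes + mean arithmetic addition: one column holding the sum
mm_sum <- model.matrix(~ I(emal36 + ant_amal36), data = d)
## factor() turns the numeric emal36 into dummy variables
mm_fac <- model.matrix(~ factor(emal36) + ant_amal36, data = d)

colnames(mm_inter)
colnames(mm_sep)
colnames(mm_sum)
colnames(mm_fac)
```

Comparing the column names shows which terms become single covariate columns and which are expanded into dummy variables.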
You can create and/or edit a formula at the prompt or through fix() or edit().
fkt <- as.formula(y ~ emal36);
edit(fkt);       ## this will not work (the edited result is not stored)
fkt2=edit(fkt);  ## store the edited result in fkt2
fix(fkt);        ## ok, but it is no longer of class formula; however, in lm it works
However, the edited formula fkt2 is no longer a formula, and hence one needs to use
fkt2<-as.formula(fkt2);
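Since edit() and fix() open an interactive editor, they are awkward in scripts. As a non-interactive alternative (a swapped-in technique, not the one used above), a formula can be modified with update() or built from a character string. A minimal sketch, reusing the variable names from the text:

```r
fkt <- as.formula(y ~ emal36)

## Add a term to an existing formula; "." means "what was there before"
fkt2 <- update(fkt, . ~ . + ant_amal36)

## Build a formula from a character string
fkt3 <- as.formula("y ~ emal36 + geog")

## Check that both results are still of class formula
inherits(fkt2, "formula")
inherits(fkt3, "formula")
```

No data is needed for these calls, because a formula only records variable names until it is handed to a model function.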
Model functions
So far we have used the lm() function. One could use other built-in functions such as
glm(), which fits the generalized linear model and is defined by:
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart,
offset, control = glm.control(...), model = TRUE,
method = "glm.fit", x = FALSE, y = TRUE, contrasts =
NULL,...)
For example,
glm(fkt2, family=binomial(link = "logit"));
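For a runnable illustration of the call above, the sketch below fits a logistic regression on the built-in mtcars data. The choice of am (a 0/1 variable) as response and wt as predictor is an assumption made only for the example.

```r
## Assumption: mtcars stands in for the session data; am is binary.
fkt_bin <- as.formula(am ~ wt)
model_glm <- glm(fkt_bin, family = binomial(link = "logit"), data = mtcars)

coef(model_glm)            # coefficients on the logit scale
head(fitted(model_glm))    # fitted probabilities
family(model_glm)$link     # the link function that was used
```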
Or one could use non-standard models, such as the logistic regression model from the
function lrm() in the Design package.
library(Design);     ## requires Hmisc
library(nlme);
lrm(fkt2, data=x);   ## a vanilla version
Declaring a model object
One can declare an object via:
model_2 <- lm(fkt2);
This will not produce any output on screen, but you can display the model by typing its
name, or get a more exhaustive presentation via
summary(model_2);
Storing the model as an object gives you an opportunity to access its individual
components, for example the coefficients:
model_2$coef;
which gives a vector of the coefficients, and hence individual coefficients through
model_2$coef[1];
which is the intercept.
One can also use a number of built-in accessor functions, such as the one for getting
the residuals:
Res <-resid(model_2);
These include
add1 deviance formula predict step
alias drop1 kappa print summary
anova effects labels proj vcov
coef family plot residuals
For more details see chapter 11 in “An Introduction to R”.
If one stores the results of these functions as objects, one can access the components
within them and use them for other purposes,
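A few of the accessor functions listed above can be tried as follows. This is a sketch on the built-in mtcars data (an assumption, not the session data), and new_cars is a hypothetical data frame made up for the prediction step.

```r
## Assumption: mtcars stands in for the session data.
model_2 <- lm(mpg ~ wt + hp, data = mtcars)

coef(model_2)["(Intercept)"]   # a single coefficient, picked by name
res <- resid(model_2)          # one residual per observation
v <- vcov(model_2)             # covariance matrix of the coefficients

## Hypothetical new observations to predict for
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(model_2, newdata = new_cars)

summary(model_2)$r.squared     # components of the summary object
```

Picking coefficients by name rather than by position keeps the code correct if the formula changes later.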
model_1=lm(fkt2,data=x);
add_model=add1(model_1,scope=x, test="F"); ## F-tests for adding each of the
                                           ## variables in x omitted from the model
add_model; ## print the result on screen
## now you can add the best one with a loop
for(i in 2:length(add_model$F)){ ## F is the F-statistic
  if(add_model$F[i] == max(add_model$F[!is.na(add_model$F)])){
    fkt2=update.formula(fkt2, paste(' ~. + ', rownames(add_model)[i]));
  } #end if
} #end for
model_3=lm(fkt2,data=x); ### re-estimate and rename, for comparison
summary(model_3); ## print the model with the best (through F-test selected)
                  ## variable added
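The selection step can also be written without a loop. The sketch below is one possible variant, not the code used in the session: it uses the built-in mtcars data, a hypothetical set of candidate variables given as a formula in scope, and which.max() (which skips missing values) to pick the row with the largest F statistic.

```r
## Assumption: mtcars and the candidates hp, qsec, drat are illustrative.
model_1 <- lm(mpg ~ wt, data = mtcars)
add_model <- add1(model_1, scope = ~ wt + hp + qsec + drat, test = "F")
add_model   # one row per candidate, plus a <none> row

## Row name of the candidate with the largest F statistic
best <- rownames(add_model)[which.max(add_model[["F value"]])]

## Add that variable to the formula and re-estimate
fkt2 <- update(formula(model_1), paste("~ . +", best))
model_3 <- lm(fkt2, data = mtcars)
summary(model_3)
```

The built-in step() function from the list above automates repeated additions of this kind, although it selects by AIC rather than by F-test.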
Batch
If one has written R programs that each do a specific task, one way of re-using them is
to put them in a batch file and then run the file. This is a convenient way of monitoring
the launch sequence, and one can also alternate between different programs. Even though it
is possible to embed R in C/C++ and vice versa, if the tasks that you have chosen are
independent, one can launch them from a batch file in the order of your choosing. It is
also a convenient way of launching programs after office hours, by scheduling the batch.
An example of a batch file:
cd c:\batch_R\
datalift.exe
cd C:\Program Files\R\R-2.6.0\bin\
Rterm.exe --no-restore --no-save < c:/batch_R/testR.R > c:/batch_R/datalyft.outR
This bat file first runs the C/C++ program datalift, which in this instance lifts a lot of
data from a data warehouse via SQL and writes the data to a flat file.
It then invokes R, which takes the R file testR.R from c:/batch_R/,
reads in the data produced by the datalift program, and does its R things. The output is
written to c:/batch_R/datalyft.outR.
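The contents of testR.R are not shown here. As a rough sketch of what such a batch script might look like, the code below reads a flat file, fits a model and writes the result to a file. Everything in it is an assumption: temporary files stand in for the flat file and the output file, and mtcars stands in for the lifted data.

```r
## Hypothetical stand-ins for the flat file and the result file.
flat_file <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")

## Simulate the flat file that a program like datalift would produce
write.table(mtcars[, c("mpg", "wt")], flat_file,
            sep = "\t", row.names = FALSE, quote = FALSE)

## The batch job proper: read the data, fit, save the summary
x <- read.table(flat_file, header = TRUE, sep = "\t")
model <- lm(mpg ~ wt, data = x)
capture.output(summary(model), file = out_file)
```

Writing results explicitly to a file, rather than relying only on redirected console output, makes it easier to pick them up in a later step of the batch.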