Department of Statistics
Stockholm University

R - Study Group - Session V
Statistical procedures
Anders Fornell-Söderberg

This is a highly selective summary of a few topics from chapter 11 of “An Introduction to R” by Venables and Smith, with an attempt at practical usage on some real data and a short description of running R in batch mode.

Formulas

The basis of statistical procedures in R is the formula. It is written in what can be described as variables-only mode. (The template is ordinary least squares regression.)

    y ~ emal36

Here y is the response (dependent, target) variable and emal36 is a predictor (independent, auxiliary, or whatever you want to call it) variable. The formula does not state anything about the nature of the relationship. The actual model is specified by a function call, which takes the formula and the data and fits the model. The function usually works out how the model must be built from the formula and the data. For instance,

    attach(x);                       ## lift some data
    fkt <- as.formula(y ~ emal36);   ## create a formula
    model_1 <- lm(fkt);              ## is an ordinary linear regression model

The function lm, which we have seen previously, completes the model with an intercept term by default. The model automatically uses numeric independent variables as covariates and character variables as factors. There is no problem (at least not for the lm function) with mixing covariates and factors.

One can create more elaborate formulas, such as

    y ~ geog + emal36 + geog:emal36   ## the interaction term
    y ~ geog/(1 + emal36) - 1         ## separate regression lines within the geog classes
    y ~ I(emal36 + ant_amal36)        ## where I() gives + its arithmetic meaning
    y ~ factor(emal36) + ant_amal36   ## tell the model to use emal36 as categorical
                                      ## even though it is numerical

More examples are collected in chapter 11 of “An Introduction to R” by Venables and Smith.

You can create and/or edit a formula at the prompt or through fix() or edit().
    fkt <- as.formula(y ~ emal36);
    edit(fkt);           ## this will not work: the edited result is not assigned
    fkt2 = edit(fkt);    ## assign the edited result to fkt2
    fix(fkt);            ## ok, but it is no longer of class formula; however, it works in lm

However, the edited formula fkt2 is no longer a formula, and hence one needs to use

    fkt2 <- as.formula(fkt2);

Model functions

That was the lm() function. One could use other built-in functions such as glm(), which fits the generalized linear model and is defined by:

    glm(formula, family = gaussian, data, weights, subset,
        na.action, start = NULL, etastart, mustart, offset,
        control = glm.control(...), model = TRUE,
        method = "glm.fit", x = FALSE, y = TRUE,
        contrasts = NULL, ...)

For example,

    glm(fkt2, family = binomial(link = "logit"));

Or one could use non-standard models, such as the logistic regression model from the function lrm() in the Design package.

    library(Design);     ## requires Hmisc
    library(nlme);       ## a vanilla version
    lrm(fkt2, data = x);

Declaring a model object

One can declare an object via:

    model_2 <- lm(fkt2);

This does not produce any output on screen, but you can display the model by typing its name, or get a more exhaustive presentation via

    summary(model_2);

This gives you an opportunity to access individual components of the model, for example the coefficients:

    model_2$coef;

which gives a vector of the coefficients, and hence individual coefficients through

    model_2$coef[1];

which is the intercept. One can also use a number of built-in data-accessing functions, such as getting the residuals:

    Res <- resid(model_2);

These include

    add1     deviance   formula   predict     step
    alias    drop1      kappa     print       summary
    anova    effects    labels    proj        vcov
    coef     family     plot      residuals

For more details see chapter 11 of “An Introduction to R”.
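A few of the accessor functions listed above can be tried on a small made-up data set. This is only a sketch: the data frame demo, the object model_demo and the simulated values are assumptions for illustration and are not the data set x used in the session.

```r
## Made-up data for illustration only; not the session's data set x.
set.seed(1)
demo <- data.frame(emal36 = 1:20)
demo$y <- 2 + 0.5 * demo$emal36 + rnorm(20, sd = 0.3)

## Fit an ordinary linear regression and keep it as a model object.
model_demo <- lm(y ~ emal36, data = demo)

coefs <- coef(model_demo)    ## named vector: (Intercept) and emal36
res   <- resid(model_demo)   ## one residual per observation
vc    <- vcov(model_demo)    ## covariance matrix of the estimates
pred  <- predict(model_demo, newdata = data.frame(emal36 = c(21, 22)))

print(coefs)
print(anova(model_demo))     ## analysis of variance table for the fit
```

The results are ordinary R objects (vectors, matrices, data frames), so they can be indexed, plotted or passed on to other functions like any other object.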
If one stores the results of these functions in objects, one can access the elements within them and use them for other purposes:

    model_1 = lm(fkt2, data = x);
    add_model = add1(model_1, scope = x, test = "F");   ## F-tests for adding each of the omitted variables in x
    add_model;                                          ## print the result on screen
    ## now you can add one with a loop (row 1 is <none>, so start at 2)
    for (i in 2:length(add_model$F)) {                  ## F is the F-statistic
      if (add_model$F[i] == max(add_model$F[!is.na(add_model$F)])) {
        fkt2 = update.formula(fkt2, paste(' ~. + ', rownames(add_model)[i]));
      } # end if
    } # end for
    model_3 = lm(fkt2, data = x);   ### re-estimate and rename, for comparison
    summary(model_3);               ## print the model with the best (F-test selected) variable added

Batch

If one has written R programs that do specific tasks, one way of reusing them is to put them in a batch file and then run the file. This is a convenient way of controlling the launch sequence, and one can also alternate between different programs. Even though it is possible to embed R in C/C++ and vice versa, if the tasks you have chosen are independent, one can launch them from a batch file in the order of your choosing. It is also a convenient way of launching programs after office hours, by scheduling the batch.

An example of a batch file:

    cd c:\batch_R\
    datalift.exe
    cd C:\Program Files\R\R-2.6.0\bin\
    Rterm.exe --no-restore --no-save < c:/batch_R/testR.R > c:/batch_R/datalyft.outR

This bat file first runs the C/C++ program datalift, which in this instance lifts a lot of data from a data warehouse via SQL and prints the data to a flat file. It then invokes R, which reads the R file testR.R from c:/batch_R/testR.R, reads in the data produced by the datalift program, and does its R things. The output from R is written to c:/batch_R/datalyft.outR.
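A minimal sketch of what an R script run in this way might contain. The contents of testR.R are not shown in these notes, so everything here is an assumption: the tab-separated file layout, the column names y and emal36, the object names, and the temporary files that stand in for the flat file written by datalift.

```r
## Hypothetical stand-in for the flat file written by datalift.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
set.seed(1)
fake <- data.frame(emal36 = 1:30)
fake$y <- 1 + 0.2 * fake$emal36 + rnorm(30)
write.table(fake, infile, sep = "\t", row.names = FALSE, quote = FALSE)

## What the batch script itself would do:
x <- read.table(infile, header = TRUE, sep = "\t")      ## read the flat file
model_b <- lm(y ~ emal36, data = x)                     ## fit the model
writeLines(capture.output(summary(model_b)), outfile)   ## save the summary as text
```

When the script is run through Rterm with output redirection, anything printed also ends up in the redirected output file, so writing an explicit result file as here is optional.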