“WORKING WITH S-PLUS” Michael Elliott Assistant Professor of Biostatistics University of Michigan School of Public Health Overview • S-Plus is a commercial statistical software package – Plots or other graphics – Vector and matrix computation – Non-parametric estimation such as kernel density estimation and cubic spline regression – Simulations. • • • • • • • Importing and exporting data Executing standard statistical procedures Graphics Computation Robust statistics R: an improved(?) alternative to S-Plus Where to go for further help 2 Importing and Exporting Data •Can import or export ASCII data plus most major database management and statistical package data files (dBASE, Lotus, Excel, SAS, SPSS, Stata) •S-plus creates data objects: –Scalar –Vector –Matrix –Array –Factor (vector of categorical variables) –Frame (matrix of different types) –List (general collection of different types) 3 Example dataset • 1,000 young, first-time-licensed drivers in Michigan Age when Time ID licensed licensed Gender 001 17.5 3.4 0 002 16.1 6.2 0 . . . . . . . . . . . . N of Offenses 1 1 . . . Time to 1st Offense 1.0 3.3 . . . 4 Two ways to import and export data: • Use menu-driven commands • Use batch commands in the “Commands Window” 5 Menu-Driven Import/Export • Click on File – Import Data • From File • Fill in location and file type information 6 Batch Command Import/Export • Use the read.table or scan commands – Scan is a quick way to read in a small vector – Read.table generally works better for importing 2dimensional tables: splus_read.table(file="h:/splus/splus.txt", na.strings='.', col.names=c('id',"age","tol","male","noff","ttof")) Generally better to manage complex hierarchical data structures in, e.g., SAS and generate flat files for S-plus. 7 Basic operations • The usual operators are used: also ^ for exponentiation, %% for modulo, and _ or <- for assignment. • c(1,2,5) creates a 3x1 vector with the elements 1, 2, and 5; c(0:10) creates an 11x1 vector with elements 0 through 10; rep(1,10) creates a 10x1 vector of 1. • matrix(c(1,1,2,2),2,2) creates a 2x2 matrix with the 1st column being 1s and the 2nd column 2s. • What would array(rep(c(1,1,2,2),2),dim=c(2,2,2) generate? • To extract a portion of a vector, use name[a:b]; similarly, to read a row or column of a matrix, use name[a,] or name[,b] respectively. 8 • Logical vectors can be generated by conditional statements: >tf_(c(1,2,5)<=c(2,4,1)) >tf [1] T T F • Since these vectors have the property that T=1, F=0, they are very useful. For example, they could construct a linear spline design matrix: yi=0+1ti+2 (ti-1)+ + 3 (ti-2)+ +i X_cbind(t,c((t-theta1)*(t>theta1)), c((t-theta2)*(t>theta2))) 9 • sort(object) provides the values in object in alphanumeric order > sort(c(3,2,1,5)) [1] 1 2 3 5 • order(object) provides the location of elements in object in alphanumeric order > order(c(3,2,1,5)) [1] 3 2 1 4 10 Common statistical procedures • S-plus commands are functions of the form function(parameter1,parameter2,…) • Descriptive functions – summary(object) – hist(object) – stem(object) • Univariate and bivariate statistical tests – t.test(object,mu=value) – chisq.test(object) 11 • Regression models – Linear regression – Analysis of variance – Generalized linear models • Survival analysis – Nonparametric – Parametric – Semiparametic (Cox proportional hazard model) • Other – Random effects models – Factor analysis 12 • Example fitting a (Gaussian) linear regression model: yi=0+1xi1+2xi2+i where yi is the age at time of license, xi1 is the number of offenses, and xi2 is the total time of license for the ith subject. The ’s are fixed but unknown parameters, and the are (also unknown) normally distributed error terms. • The function lm() provides the least-square (maximum likelihood) estimates of together with numerous related diagnostics: myreg_lm(age~noff+tol,data=splus) 13 • myreg is a list containing a number of items – – – – myreg$coefficients myreg$residuals myreg$fitted.values see help(lm.object) for a complete list of available items • Use summary.lm(myreg) to obtain conventional regression analysis output. 14 • summary.lm(object) also provides useful items: – cov.unscaled: a p x p matrix of (unscaled) covariances of the regression coefficents – sigma: the square root of the estimated variance of the random error > summary.lm(myreg)$sigma*summary.lm(myreg)$cov.unscaled (Intercept) noff tol (Intercept) 0.01584473684 0.00005315618 -0.00250628666 noff 0.00005315618 0.00014034184 -0.00004702984 tol -0.00250628666 -0.00004702984 0.00042518954 – r.squared: the multiple R-squared statistic for the model. – fstatistic: numeric vector of length three giving the F test for the regression – see help(summary.lm) for a complete list of available items 15 ANOVA: aov(formula, data = <name of data>, projections = F, contrasts = NULL, ...) GLM: glm(formula, family = gaussian, data = <data>, weights = <weight vector>, na.action = na.fail, contrasts = NULL, ...) Survival: KM: kaplanMeier(Surv(time,cens), data= <name of data>, weights=NULL, na.action=na.fail, se.fit = T, conf.interval = "log", coverage = 0.95, …) G-rho family: survdiff(Surv(ttof,cens)~male,rho=<rho>) Cox PH: coxph(Surv(time,cens), method= <ties>, …) 16 Graphics • S-plus easily produces nice graphics, with fine control available for little effort. • Plot age at time of license versus number of offenses: tol_splus[,2] ttof_splus[,6] plot(tol,ttof,xlab=“Age of time of license”, ylab="Time to First Offense") 17 • You can easily control a variety of features, including – – – – – – Labels Plotting symbol Axis type Legends Titles Size. (xlab=“xxxx”, ylab=“yyyy”) (pch=‘symbol’, pch=n) (log=“x”, log=“y”, log=“xy”) (legend(x,y,legend=…,pch=…,lty=…) ) (title()) (pty=“s”, mfrow=c(2,2)) • It is also relatively easy to produce – – – – – – multiple graphs on a single page (par) Overlays of graphs (par) Barplots (barplot) Scatterplot matricies (brush) Contour plots (contour) 3-D plots (use menu) 18 Matrix Functions and Other Computations • Vector and matrix computation is very simply in S-plus. • To obtain the product of a scalar with a vector or matrix, use ‘*’; this also will give you element-by-element multiplication of two equal-size vectors, matricies, or arrays: >c(1,2,3)*c(4,5,6) [1] 4 10 18 • Vector or matrix multiplication can be done using %*%: • Transpose using t(). 19 • solve() inverts a square matrix or solves a set of linear equations: – solve(A)%*%A=I – A%*%solve(A,b)=b 20 • chol() returns the Choleski decomposition of a nonnegative symmetric matrix (note this is the transpose of the usual lower-triangular form): – t(chol(A))%*%chol(A)=A 21 • QR decomposition of an n x p matrix into an n x n orthonormal matrix Q and an n x p upper triangular matrix R is given by qr.Q(qr(A)) and qr.R(qr(A)). 22 • eigen() calculates the eigenvalues and eigenvectors of a square matrix: for A k x k square symmetric matrix, eigen(A)$vector and eigen(A)$values yield a k x k matrix and k x 1 vector such that, for each element j in eigen(A)$values and each column ej in eigen(A)$vector: Aej=jej <-> A=jejejT , where ejTej=1 and eiTej=0 if i<>j. 23 • Numerical optimization procedures are also available: nlmin(f, x, …) – f is an S-PLUS function which takes as its only argument a real vector of length p. – x is a vector of length p used as the starting point for the optimizer. Also nlminb for constrained optimization, nlregb for minimizing sums of squares of nonlinear functions subject to bound-constrained parameters, and nls for general nonlinear regression. 24 User-Defined Functions • You may define your own functions using the function() function: mydet_function(A,B){ detA_Re(prod(eigen(A)$values)) detB_Re(prod(eigen(B)$values)) detAB_detA*detB return(detA,detB,detAB) } 25 Robust Statistics • S-plus’ “niche” market is its extensive package of robust statistical procedures. 26 • As a measure of central tendency, the mean is highly sensitive to skewed or outlying observations, whereas the median is unaffected, but less optimal if the distribution is truly symmetric. – A simple compromise between the median and mean is the trimmed mean, which is the mean of the 1-2 central part of the distribution. This is implemented in S-plus via the trim parameter in the mean function: mean(x,trim=). – A major subspeciality in statistics is devoted to this area and S-plus has the function location() which can be used to implement them. 27 • Similarly, for the linear regression model yi=xi+i the least-square parameter estimate of that solve minb (yi-xib)2 are not as robust to outliers as the median (L1) estimator minb |yi-xib|. • Fit in S-plus using l1fit(X,y) where X is the design matrix and y the vector of dependent observations. – note this is the reverse of the usual order: design variables first, then outcome. 28 • Kernel density estimation provides a probability density function estimate from the histogram of observations: density(object,kernel=<>,bandwidth=<>) kernel=kernel-density smoother: "box" - a rectangular box (the default). "triangle" - a triangle (a box convolved with itself). "parzen" - the parzen function (a box convolved with a triangle). "normal" - the gaussian density function. bandwidth=the kernel bandwidth smoothing parameter. 29 • Robust and additive regression models extend the assumption of a linear relationship between E(Y) and X to the more general case of a smooth non-linear relationship. loess(formula,data=<dataset>,span=<>,degree=<1,2>,…) span=width over which tricube weights calculated degree=X linear or quadratic plot using plot.loess(loess(…)) smooth.spline(x, y, df = <>, spar = <>, all.knots = <T,F>, df.offset = 0, penalty = 1) Parameters allow knot choice and smoothing parameters to be pre-set or data-driven. 30 Simulations • A number of built-in procedures are available to generate random data. – rnorm(n,mu=0,sd=1) generates n N(0,1) random variables. – qnorm(p,mu=0,sd=1) and pnorm(q,mu=0,sd=1) gives q such that P(X<q)=p for X~N(0,1). – dnorm(x,mu=0,sd=1) gives the PDF of the N(0,1) distribution at x. – Similar procedures are available for 19 other standard distributions. • The for and if commands implement loops and conditional execution of statements. • Together these procedures allow for easy implementation of simulations. 31 betalb_rep(0,200) betaub_rep(0,200) count_0 for(i in 1:200){ x_runif(30,0,10) y_1+2*x+rnorm(30) myreg_lm(y~x) betalb[i]_myreg$coefficients[2]qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x)) betaub[i]_myreg$coefficients[2]+ qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x)) if (betalb[i]<2&&betaub[i]>2) count_count+1 } 32 Could delete the if statement and just compute 100-(sum(betalb>2)+sum(betaub<2))/2 33 Memory allocation in S-Plus • In loops, RAM is not released until the loop is completed. – In large simulations, later iterations will slow relative to earlier iterations. – This problem is made worse in nested loops (loops within loops). • Try to avoid loops, and especially nested loops, whenever possible – Use vector and matrix manipulations, instead. – Have yik, i=1,…,n, k=1,…,ni. Want yi.=Σyik. – Could loop across i and k, or be clever and use cumsum: cumsum(y)[c(n1,…, nn)]-c(0,cumsum(y)[c(n1,…,nn-1)]) 34 R An alternative to S-Plus is R. • Part of the GNU ("GNU's Not Unix") constellation of free Linux-based software. • Runs under Linux, UNIX and Windows. • Big difference: dynamic allocation of memory • Can run loops without chewing up memory. 35 • Most basic code is virtually interchangeable. A few differences: – Generalized linear models are implemented differently in R. (The same functionality is available but the components have different names.) – Missing data usually leads to failure in S unless the “na.omit” is used; "na.omit" is the default in R. – No-intercept models are defined by y~x+0 instead of y~x-1. – Write out “TRUE” and “FALSE” instead of using “T” and “F”. • More advanced functions (linear and non-linear mixed models, smoothing splines, adaptive quadrature, SEMs) have packages that need to be downloaded. 36 Where to go to get help • This has been a very brief introduction. • The on-line help menu in S-plus is very useful. • Venables and Ripley (1997) Modern Applied Statistics with S-Plus (2nd edition). • Websites: Splus: – www.mathsoft.com – Statlib site: lib.stat.cmu/S/ R: – http://www.R-project.org/ – http://www.gnu.org/gnu/the-gnu-project.html See http://cceb.med.upenn.edu/elliott/ for this presentation. 37