Presentation - University of Michigan

advertisement
“WORKING WITH S-PLUS”
Michael Elliott
Assistant Professor of Biostatistics
University of Michigan School of Public
Health
Overview
• S-Plus is a commercial statistical software package
– Plots or other graphics
– Vector and matrix computation
– Non-parametric estimation such as kernel density estimation and
cubic spline regression
– Simulations.
•
•
•
•
•
•
•
Importing and exporting data
Executing standard statistical procedures
Graphics
Computation
Robust statistics
R: an improved(?) alternative to S-Plus
Where to go for further help
2
Importing and Exporting Data
•Can import or export ASCII data plus most major database
management and statistical package data files (dBASE, Lotus,
Excel, SAS, SPSS, Stata)
•S-plus creates data objects:
–Scalar
–Vector
–Matrix
–Array
–Factor (vector of categorical variables)
–Frame (matrix of different types)
–List (general collection of different types)
3
Example dataset
•
1,000 young, first-time-licensed drivers in Michigan
Age when Time
ID licensed licensed Gender
001 17.5
3.4
0
002 16.1
6.2
0
.
.
.
.
.
.
.
.
.
.
.
.
N of
Offenses
1
1
.
.
.
Time to
1st Offense
1.0
3.3
.
.
.
4
Two ways to import and export data:
• Use menu-driven commands
• Use batch commands in the “Commands Window”
5
Menu-Driven Import/Export
• Click on File
– Import Data
• From File
• Fill in location and file type information
6
Batch Command Import/Export
• Use the read.table or scan commands
– Scan is a quick way to read in a small vector
– Read.table generally works better for importing 2dimensional tables:
splus_read.table(file="h:/splus/splus.txt",
na.strings='.',
col.names=c('id',"age","tol","male","noff","ttof"))
Generally better to manage complex hierarchical data
structures in, e.g., SAS and generate flat files for S-plus.
7
Basic operations
• The usual operators are used: also ^ for exponentiation,
%% for modulo, and _ or <- for assignment.
• c(1,2,5) creates a 3x1 vector with the elements 1, 2, and 5;
c(0:10) creates an 11x1 vector with elements 0 through 10;
rep(1,10) creates a 10x1 vector of 1.
• matrix(c(1,1,2,2),2,2) creates a 2x2 matrix with the 1st
column being 1s and the 2nd column 2s.
• What would array(rep(c(1,1,2,2),2),dim=c(2,2,2)
generate?
• To extract a portion of a vector, use name[a:b]; similarly,
to read a row or column of a matrix, use name[a,] or
name[,b] respectively.
8
• Logical vectors can be generated by conditional
statements:
>tf_(c(1,2,5)<=c(2,4,1))
>tf
[1] T T F
• Since these vectors have the property that T=1, F=0, they
are very useful. For example, they could construct a linear
spline design matrix:
yi=0+1ti+2 (ti-1)+ + 3 (ti-2)+ +i
X_cbind(t,c((t-theta1)*(t>theta1)),
c((t-theta2)*(t>theta2)))
9
• sort(object) provides the values in object in alphanumeric
order
> sort(c(3,2,1,5))
[1] 1 2 3 5
• order(object) provides the location of elements in object
in alphanumeric order
> order(c(3,2,1,5))
[1] 3 2 1 4
10
Common statistical procedures
• S-plus commands are functions of the form
function(parameter1,parameter2,…)
• Descriptive functions
– summary(object)
– hist(object)
– stem(object)
• Univariate and bivariate statistical tests
– t.test(object,mu=value)
– chisq.test(object)
11
• Regression models
– Linear regression
– Analysis of variance
– Generalized linear models
• Survival analysis
– Nonparametric
– Parametric
– Semiparametic (Cox proportional hazard model)
• Other
– Random effects models
– Factor analysis
12
• Example fitting a (Gaussian) linear regression model:
yi=0+1xi1+2xi2+i where yi is the age at time of license,
xi1 is the number of offenses, and xi2 is the total time of
license for the ith subject. The ’s are fixed but unknown
parameters, and the  are (also unknown) normally
distributed error terms.
• The function lm() provides the least-square (maximum
likelihood) estimates of  together with numerous related
diagnostics:
myreg_lm(age~noff+tol,data=splus)
13
• myreg is a list containing a number of items
–
–
–
–
myreg$coefficients
myreg$residuals
myreg$fitted.values
see help(lm.object) for a complete list of available items
• Use summary.lm(myreg) to obtain conventional
regression analysis output.
14
• summary.lm(object) also provides useful items:
– cov.unscaled: a p x p matrix of (unscaled) covariances of the
regression coefficents
– sigma: the square root of the estimated variance of the random
error
> summary.lm(myreg)$sigma*summary.lm(myreg)$cov.unscaled
(Intercept)
noff
tol
(Intercept) 0.01584473684 0.00005315618 -0.00250628666
noff 0.00005315618 0.00014034184 -0.00004702984
tol -0.00250628666 -0.00004702984 0.00042518954
– r.squared: the multiple R-squared statistic for the model.
– fstatistic: numeric vector of length three giving the F test for the
regression
– see help(summary.lm) for a complete list of available items
15
ANOVA:
aov(formula, data = <name of data>, projections = F,
contrasts = NULL, ...)
GLM:
glm(formula, family = gaussian, data = <data>,
weights = <weight vector>, na.action = na.fail, contrasts = NULL, ...)
Survival:
KM: kaplanMeier(Surv(time,cens), data= <name of data>,
weights=NULL, na.action=na.fail, se.fit = T, conf.interval = "log",
coverage = 0.95, …)
G-rho family: survdiff(Surv(ttof,cens)~male,rho=<rho>)
Cox PH: coxph(Surv(time,cens), method= <ties>, …)
16
Graphics
• S-plus easily produces nice graphics, with fine control
available for little effort.
• Plot age at time of license versus number of offenses:
tol_splus[,2]
ttof_splus[,6]
plot(tol,ttof,xlab=“Age of time of license”,
ylab="Time to First Offense")
17
• You can easily control a variety of features, including
–
–
–
–
–
–
Labels
Plotting symbol
Axis type
Legends
Titles
Size.
(xlab=“xxxx”, ylab=“yyyy”)
(pch=‘symbol’, pch=n)
(log=“x”, log=“y”, log=“xy”)
(legend(x,y,legend=…,pch=…,lty=…) )
(title())
(pty=“s”, mfrow=c(2,2))
• It is also relatively easy to produce
–
–
–
–
–
–
multiple graphs on a single page (par)
Overlays of graphs (par)
Barplots (barplot)
Scatterplot matricies (brush)
Contour plots (contour)
3-D plots (use menu)
18
Matrix Functions and Other Computations
• Vector and matrix computation is very simply in S-plus.
• To obtain the product of a scalar with a vector or matrix,
use ‘*’; this also will give you element-by-element
multiplication of two equal-size vectors, matricies, or
arrays:
>c(1,2,3)*c(4,5,6)
[1] 4 10 18
• Vector or matrix multiplication can be done using %*%:
• Transpose using t().
19
• solve() inverts a square matrix or solves a set of linear
equations:
– solve(A)%*%A=I
– A%*%solve(A,b)=b
20
• chol() returns the Choleski decomposition of a nonnegative symmetric matrix (note this is the transpose of the
usual lower-triangular form):
– t(chol(A))%*%chol(A)=A
21
• QR decomposition of an n x p matrix into an n x n
orthonormal matrix Q and an n x p upper triangular matrix
R is given by qr.Q(qr(A)) and qr.R(qr(A)).
22
• eigen() calculates the eigenvalues and eigenvectors of a
square matrix: for A k x k square symmetric matrix,
eigen(A)$vector and eigen(A)$values yield a k x k matrix
and k x 1 vector such that, for each element j in
eigen(A)$values and each column ej in eigen(A)$vector:
Aej=jej <-> A=jejejT , where ejTej=1 and eiTej=0 if i<>j.
23
• Numerical optimization procedures are also available:
nlmin(f, x, …)
– f is an S-PLUS function which takes as its only argument a real
vector of length p.
– x is a vector of length p used as the starting point for the optimizer.
Also nlminb for constrained optimization, nlregb for
minimizing sums of squares of nonlinear functions subject
to bound-constrained parameters, and nls for general nonlinear regression.
24
User-Defined Functions
• You may define your own functions using the function()
function:
mydet_function(A,B){
detA_Re(prod(eigen(A)$values))
detB_Re(prod(eigen(B)$values))
detAB_detA*detB
return(detA,detB,detAB)
}
25
Robust Statistics
• S-plus’ “niche” market is its extensive package of robust
statistical procedures.
26
• As a measure of central tendency, the mean is highly
sensitive to skewed or outlying observations, whereas the
median is unaffected, but less optimal if the distribution is
truly symmetric.
– A simple compromise between the median and mean is the
trimmed mean, which is the mean of the 1-2 central part of the
distribution. This is implemented in S-plus via the trim parameter
in the mean function: mean(x,trim=).
– A major subspeciality in statistics is devoted to this area and S-plus
has the function location() which can be used to implement them.
27
• Similarly, for the linear regression model
yi=xi+i
the least-square parameter estimate of  that solve
minb (yi-xib)2 are not as robust to outliers as the median
(L1) estimator minb |yi-xib|.
• Fit in S-plus using l1fit(X,y) where X is the design matrix
and y the vector of dependent observations.
– note this is the reverse of the usual order: design variables first,
then outcome.
28
• Kernel density estimation provides a probability density
function estimate from the histogram of observations:
density(object,kernel=<>,bandwidth=<>)
kernel=kernel-density smoother:
"box" - a rectangular box (the default).
"triangle" - a triangle (a box convolved with itself).
"parzen" - the parzen function (a box convolved with a triangle).
"normal" - the gaussian density function.
bandwidth=the kernel bandwidth smoothing parameter.
29
• Robust and additive regression models extend the
assumption of a linear relationship between E(Y) and X to
the more general case of a smooth non-linear relationship.
loess(formula,data=<dataset>,span=<>,degree=<1,2>,…)
span=width over which tricube weights calculated
degree=X linear or quadratic
plot using plot.loess(loess(…))
smooth.spline(x, y, df = <>, spar = <>, all.knots = <T,F>,
df.offset = 0, penalty = 1)
Parameters allow knot choice and smoothing parameters to
be pre-set or data-driven.
30
Simulations
• A number of built-in procedures are available to generate
random data.
– rnorm(n,mu=0,sd=1) generates n N(0,1) random variables.
– qnorm(p,mu=0,sd=1) and pnorm(q,mu=0,sd=1) gives q such that
P(X<q)=p for X~N(0,1).
– dnorm(x,mu=0,sd=1) gives the PDF of the N(0,1) distribution at x.
– Similar procedures are available for 19 other standard
distributions.
• The for and if commands implement loops and conditional
execution of statements.
• Together these procedures allow for easy implementation
of simulations.
31
betalb_rep(0,200)
betaub_rep(0,200)
count_0
for(i in 1:200){
x_runif(30,0,10)
y_1+2*x+rnorm(30)
myreg_lm(y~x)
betalb[i]_myreg$coefficients[2]qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))
betaub[i]_myreg$coefficients[2]+
qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))
if (betalb[i]<2&&betaub[i]>2) count_count+1
}
32
Could delete the if statement and just compute
100-(sum(betalb>2)+sum(betaub<2))/2
33
Memory allocation in S-Plus
• In loops, RAM is not released until the loop is completed.
– In large simulations, later iterations will slow relative to earlier
iterations.
– This problem is made worse in nested loops (loops within loops).
• Try to avoid loops, and especially nested loops, whenever
possible
– Use vector and matrix manipulations, instead.
– Have yik, i=1,…,n, k=1,…,ni. Want yi.=Σyik.
– Could loop across i and k, or be clever and use cumsum:
cumsum(y)[c(n1,…, nn)]-c(0,cumsum(y)[c(n1,…,nn-1)])
34
R
An alternative to S-Plus is R.
• Part of the GNU ("GNU's Not Unix") constellation of free
Linux-based software.
• Runs under Linux, UNIX and Windows.
• Big difference: dynamic allocation of memory
• Can run loops without chewing up memory.
35
• Most basic code is virtually interchangeable. A few
differences:
– Generalized linear models are implemented differently in R. (The
same functionality is available but the components have different
names.)
– Missing data usually leads to failure in S unless the “na.omit” is
used; "na.omit" is the default in R.
– No-intercept models are defined by y~x+0 instead of y~x-1.
– Write out “TRUE” and “FALSE” instead of using “T” and “F”.
• More advanced functions (linear and non-linear mixed
models, smoothing splines, adaptive quadrature, SEMs)
have packages that need to be downloaded.
36
Where to go to get help
• This has been a very brief introduction.
• The on-line help menu in S-plus is very useful.
• Venables and Ripley (1997) Modern Applied Statistics
with S-Plus (2nd edition).
• Websites:
Splus:
– www.mathsoft.com
– Statlib site: lib.stat.cmu/S/
R:
– http://www.R-project.org/
– http://www.gnu.org/gnu/the-gnu-project.html
See http://cceb.med.upenn.edu/elliott/ for this presentation.
37
Download