Splus - Purdue University

S-Plus Statistical Software Package
S-Plus originated from a programming language called S, created at AT&T Bell Labs. It was not
until the late 1980s, when a company known as Statistical Sciences (later part of MathSoft)
began distributing a commercial version of S, that it became known as S-Plus. S-Plus was created to be
a powerful interactive computing tool for many types of statistical and graphical analysis. It is
primarily a command-line interface driven by code, but it also offers some pull-down menu
options. It is not as user-friendly as some other statistical packages; however, because it is
more programming based, it offers much more freedom to manipulate data than other packages.
S-Plus is closely related to the R statistical software package and shares many of R's commands
and functions. This report gives an overview of some key features of the S-Plus statistical
package. The methods may differ depending on the version you are using, and the report
discusses only some basic statistical techniques. It should be used as a briefing on
some of the many capabilities of S-Plus, not as a sole source. We focus on four basic areas
of statistical analysis: data set-up, graphics, data summary, and regression. A more detailed
explanation of these features can be found in the resources listed in the References section.
Creating, Importing, and Manipulating Data
There are numerous ways to set up a dataset in S-Plus. As in many other statistical
software packages, you can import a data set (from Excel, a text file, etc.) or enter the data
manually. To import a data set you can use the scan() command, giving the location and name
of the file as the argument. S-Plus also has numerous built-in data sets that can be used for analysis.
You can create your own matrices, vectors, and data frames in S-Plus to be used for analysis.
A name is assigned to an object or dataset with the <- or _ symbol, with the name
preceding the symbol and the object to be named following it. To look at the observations for a
specific variable in a dataset, type the name of the dataset, a dollar sign, and the name of
the variable of interest; this displays all of the observations for that variable. Variables in a
dataset can be named with the names() or dimnames() command; which one applies depends on
the type of object (vector, matrix, or data frame). You can name several variables at once
using the c() command. You can also look at a particular value, row, or column using
dataname[x, y], where dataname is the name of the dataset, x is the index or name of the
desired row, and y is the index or name of the desired column. This displays the value in
row x, column y. A matrix or data frame takes two indices, whereas a vector needs only one.
If you leave x blank, all rows are displayed for column y; if you leave y blank, all columns
are displayed for row x. You can look at multiple rows or columns by supplying the desired
rows or columns with the c() command.
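The naming and indexing rules above can be sketched as follows. This is a minimal illustration in the S syntax that S-Plus shares with R; the data frame dice and its values are hypothetical, and the code was checked against R behavior rather than a particular S-Plus release.

```r
# Hypothetical data: six rolls of two dice (names are illustrative only)
dice <- data.frame(first  = c(3, 6, 1, 4, 2, 5),
                   second = c(2, 2, 5, 6, 1, 3))

# Rename both columns at once with names() and c()
names(dice) <- c("red", "blue")

red.rolls <- dice$red           # all observations of one variable
one.value <- dice[2, "blue"]    # value in row 2, column "blue"
third.row <- dice[3, ]          # column index left blank: all columns of row 3
blue.col  <- dice[, "blue"]     # row index left blank: all rows of column "blue"
some.rows <- dice[c(1, 4), ]    # several rows selected with c()

# A vector needs only one index
v <- c(10, 20, 30)
second.element <- v[2]
```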
Creating subsets is easy using a combination of these commands and specifying the rows
and/or columns you are interested in. New (transformed) variables can be created by adding
a column to the dataset and writing the appropriate operations on
the command line. You can also edit your data using the data.ed() command. As with any software, you
need to be sure to classify your variables as either categorical or numeric. You can use the class()
command to determine how a variable is classified in the program. If you want to make a
variable categorical you can use the factor() command to change it. With this command you can also
create or edit your own categorical variable by specifying the levels and labels. S-Plus denotes missing
values as NA. Missing values can sometimes cause problems when fitting models.
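A short sketch of these steps, assuming a hypothetical data frame dat with a numeric height and a numerically coded sex variable (all names and values are invented for illustration, and the code follows R behavior):

```r
# Hypothetical data set with one missing height
dat <- data.frame(height = c(150, 160, NA, 175),
                  sex    = c(1, 2, 2, 1))

# Add a transformed variable as a new column
dat$height.m <- dat$height / 100

# Subset: rows with a recorded height above 155
tall <- dat[!is.na(dat$height) & dat$height > 155, ]

# sex is numeric until it is converted to a factor
class.before <- class(dat$sex)
dat$sex <- factor(dat$sex, levels = c(1, 2), labels = c("female", "male"))
class.after <- class(dat$sex)

# Count missing values before fitting a model
n.missing <- sum(is.na(dat$height))
```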
Graphics
S-Plus is known for its high-quality and diverse graphics. You can generate almost any
type of graph with the right code and parameter settings. Among the general plotting
commands, par(mfrow=c()) saves space by placing multiple plots on one page; the number of
rows and columns you supply determines how many graphs appear on the page. Using the title()
command you can specify a main plot title as well as the X and Y labels. If you do not specify
axis labels using the xlab and ylab arguments, the axes are labeled with the variable names by
default. With most plotting functions you can also use additional commands to add text, color,
shading, points, lines, arrows, and even axes to a plot. You can control the limits of the X and
Y axes, the width of lines, and the type of plotting character. A scatterplot of two continuous
variables is made with the plot() command, which lets you select and plot an X and a Y variable.
You can use the abline() command to add a horizontal, vertical, or regression line to the plot.
Histograms and boxplots are made using the hist() and boxplot() commands, respectively. With
these functions you only need to specify the one variable of interest. Legends are added using
the legend() command, in which you specify the location on the plot where the legend should
appear and the text to be displayed.
The qqnorm() command can be used to create QQ plots to check the normality assumption.
When you pass the name of your dataset to the pairs() command, it generates a pairwise
scatterplot matrix of all the variables in the dataset. Contour plots can be created by specifying the X, Y,
and Z variables. Additional and more complex graphics can also be generated, such as time series
plots, mesh plots, image plots, multivariate bar plots, pie charts, cluster trees, survival curves, and
even a USA map with state abbreviations and colors added. Transferring graphs to other programs is
quite easy. You can copy the desired graphic to the clipboard and paste it into a separate
application, export the graph, save it as an S-Plus graph sheet, or use one of the
graphics device commands offered in S-Plus (for example, pdf.graph for Adobe PDF format, or
conversion to a .BMP or .JPEG file). Some of the graphics in S-Plus can be seen in Figures 1 and 2.
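The basic plotting commands described in this section might be combined as in the sketch below. The data are simulated and hypothetical. The pdf(NULL) call is an R-specific assumption that opens a device without writing a file; in S-Plus you would open whichever graphics device you normally use.

```r
# Hypothetical data: 30 simulated dice rolls and a noisy linear relationship
set.seed(1)
rolls <- sample(1:6, 30, replace = TRUE)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)

pdf(NULL)                        # assumption: R's null PDF device (no file is written)
par(mfrow = c(2, 2))             # 2 rows by 2 columns of plots on one page

plot(x, y, xlab = "X value", ylab = "Y value")
title(main = "Scatterplot with fitted line")
abline(lm(y ~ x))                # regression line
abline(h = mean(y), lty = 2)     # horizontal reference line at the mean of y
legend(1, max(y), legend = c("fit", "mean"), lty = c(1, 2))

hist(rolls, main = "Histogram of rolls")
boxplot(rolls)
qqnorm(y)                        # normal QQ plot
qqline(y)                        # reference line for the QQ plot

dev.off()                        # close the device
```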
Summarizing Data with Continuous and Categorical Variable Analysis
Summary statistics can be produced in S-Plus using the Data Summaries command under the
Statistics menu. After you select the variables of interest and check the boxes
for the summary statistics you want (mean, median, standard error, etc.), the output is displayed with
those results. For grouped variables you can specify the grouping using the Group Variables command in
the Statistics menu. A dialog box will prompt you to specify which variables to group the
others by. Other quantities of interest, such as the correlation matrix or covariance matrix of a
dataset, can be obtained from the command line with cor() and var().
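The same summaries can also be computed from the command line. The sketch below is one way to do so, using a hypothetical data frame scores; it follows R behavior and is not meant to reproduce the menu output exactly.

```r
# Hypothetical scores for two groups
scores <- data.frame(group = factor(c("a", "a", "a", "b", "b", "b")),
                     pre   = c(10, 12, 14, 20, 22, 24),
                     post  = c(11, 14, 17, 19, 23, 27))

summary(scores$pre)                                     # quartiles, mean, extremes
pre.mean <- mean(scores$pre)
pre.se   <- sqrt(var(scores$pre) / length(scores$pre))  # standard error of the mean
by.group <- tapply(scores$pre, scores$group, mean)      # mean of pre within each group
cor.mat  <- cor(scores[, c("pre", "post")])             # correlation matrix
cov.mat  <- var(scores[, c("pre", "post")])             # covariance matrix
```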
Regression and Diagnostics
Simple linear regression can be accomplished using the lm() command. The arguments of this
command include the model formula and the dataset. Various other options can also
be used with this command to extend its features. You can name the linear model you fit using
the assignment symbols mentioned previously; this makes the model easier to recall later. Using the
summary() command on your linear model provides most of the standard regression output
from the model. The plot() command for a linear model provides various diagnostic plots, such as
residuals, fitted vs. observed values, a QQ plot, and Cook's distance. To create a polynomial model
you can use the poly() command, in which you specify the variable and the degree you want for your
model. You can fit a nonlinear regression model using the nls() command, specifying both the desired
equation and the dataset to be used. A number of other model fits are available through
various commands. To produce an analysis of variance table you can use the aov() command. With this
command you can analyze a variety of designs, including factorials, split-plot designs, and both
balanced and unbalanced designs.
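A minimal sketch of these model-fitting commands follows. The data frame dat, its variables, and the deterministic "noise" term are hypothetical, and the code was checked against R behavior.

```r
# Hypothetical data: quadratic trend plus a small deterministic wobble
x   <- 1:12
y   <- 1 + 2 * x + 0.5 * x^2 + rep(c(0.3, -0.3), 6)
trt <- factor(rep(c("a", "b", "c"), c(4, 4, 4)))   # three groups of four
dat <- data.frame(x = x, y = y, trt = trt)

fit.lin  <- lm(y ~ x, data = dat)            # simple linear regression, stored by name
summary(fit.lin)                             # coefficients, standard errors, R-squared
fit.quad <- lm(y ~ poly(x, 2), data = dat)   # second-degree polynomial model
fit.aov  <- aov(y ~ trt, data = dat)         # one-way analysis of variance
summary(fit.aov)                             # ANOVA table
```

Because each fit is stored under a name, later commands such as summary() can be applied to it without refitting.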
Figure 1. Above are images from the “S-Plus Tutorial” website of basic scatterplots.
[Figure panels: Histogram of Dice Data; Boxplot of Dice Rolls; plot with axis z; Imaginary barplot of asset weights (Asset 1 to Asset 4, Scenario A and Scenario B); largecap.ts@data[, 1] against Quantiles of Standard Normal]
Figure 2. Above are images from the “Graphics, Part 1” website of basic scatterplots.
A generalized linear model can be fit with the glm() command. In this command you can
use link functions to model how the mean response relates to a linear combination of its
predictor variables. Logistic regression and log-linear models are also fit with this command. For this
function a distribution (family) needs to be specified. For survival analysis you can
use the coxph() command. A Pearson's chi-square test can be conducted on a table of counts using the
chisq.test() function. The standard t-test can be performed using the t.test() command. To assist in
selecting appropriate variables, the step() function can be used to implement stepwise regression over
the variables specified. You can specify the method desired: forward, backward, or stepping in
both directions. Nearby models can be checked using the add1() and drop1() functions to see how
different variables affect the model.
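As a rough illustration of these commands, the sketch below fits a logistic regression and runs the tests mentioned above. The data are invented, and the argument names (for example trace in step()) follow R; the S-Plus versions of these functions may take different arguments.

```r
# Hypothetical binary outcome with two candidate predictors
dat <- data.frame(
  dose = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  age  = c(30, 45, 28, 52, 39, 41, 33, 48, 36, 50, 29, 44),
  resp = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1))

# Logistic regression: binomial family with its default logit link
fit <- glm(resp ~ dose + age, family = binomial, data = dat)

drop1(fit)                         # effect of removing each term in turn
fit.step <- step(fit, trace = 0)   # stepwise selection (trace = 0 is an R argument)

# Pearson chi-square test on a 2 x 2 table of counts
tab <- matrix(c(12, 5, 7, 14), nrow = 2)
chi <- chisq.test(tab)

# Two-sample t-test comparing age between responders and non-responders
tt <- t.test(dat$age[dat$resp == 1], dat$age[dat$resp == 0])
```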
To investigate the numerous variables in your model, principal component analysis can be done
using the Principal Component option located in the Statistics menu. In the dialog box you will be
asked to specify the dataset and the variables you are interested in analyzing. If you are interested
in nested, random, or mixed effects in a model, there are various functions and options
that allow for this. Several standard functions can be used
regardless of the type of fit you have. The coef() command applied to a model fit returns the
model coefficients. The predict() command generates predicted values. To report the
fitted values you can use the fitted() command. To modify an existing model you can use the
update() command (lowercase, since S-Plus is case sensitive). Residuals can be displayed with the resid() command.
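These extractor functions can be tried on any fitted model. The sketch below uses a hypothetical data frame and a simple lm() fit, and follows R behavior.

```r
# Hypothetical data and model
dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  z = c(2, 1, 4, 3, 6, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9))
fit <- lm(y ~ x, data = dat)

b     <- coef(fit)                                  # intercept and slope
fv    <- fitted(fit)                                # fitted values for the observed rows
res   <- resid(fit)                                 # residuals: observed minus fitted
new.y <- predict(fit, newdata = data.frame(x = 7))  # prediction at a new x value
fit2  <- update(fit, . ~ . + z)                     # same model with z added
```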
Other Notes
There are a few important things to note that were not previously mentioned. S-Plus is a
case-sensitive programming language. This applies to all variables, datasets, commands, arguments, etc.
You can quit S-Plus at any time using the q() command. The program also allows you to create functions
for later use with the function() command. With this command you first specify the
parameters the function should have, then in the body of the function you write how those
parameters are to be manipulated. The ls() command provides a list of data objects. Data
objects that you rarely use can and should be removed using the rm() command, specifying which
objects to permanently delete. Objects created in S-Plus are usually
permanent, so it is important to free space occasionally by deleting objects. S-Plus has a long
list of expressions and operators that make computation easy. Power and sample size
for a study can be determined using the Power and Sample Size command under the Statistics menu.
A dialog box will prompt you to specify several things, such as the alpha level, mean, and
standard deviation. Overall, the S-Plus language gives you the freedom to manipulate and analyze data
as you please. It is a capable programming language and is especially worth using when you want
high-quality graphics and the ability to explore your data freely.
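The function(), ls(), and rm() commands mentioned in these notes can be sketched as follows. The function std.err and the object scratch are hypothetical names chosen for illustration, and the code follows R behavior.

```r
# A user-defined function: parameters first, then the body
std.err <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values
  sqrt(var(x) / length(x))     # standard error of the mean
}

se <- std.err(c(2, 4, 4, 6, NA))   # call it like any built-in command

scratch <- 1:10                # a temporary object
ls()                           # lists objects, including std.err and scratch
rm(scratch)                    # delete the object that is no longer needed
```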
References
1. "S-PLUS Tutorial." Biology at the University of Alaska Fairbanks. N.p., n.d. Web. 30 Apr.
2010. <http://mercury.bio.uaf.edu/mercury/splus/splus.html>.
2. "Graphics, Part 1." University of Washington, Department of Statistics. N.p., n.d. Web.
30 Apr. 2010. <www.stat.washington.edu/.../spluscourse/UWCF%20SPLUS%20Lecture8.ppt>.
3. "Commonly Used Commands in Splus." Duke University, Department of Statistical
Science. N.p., n.d. Web. 30 Apr. 2010.
<http://www.stat.duke.edu/courses/Spring02/sta242/labs/SPlus.html#import>.
4. Spector, Phil. "Statistical Models and Graphics in Splus." University of California,
Berkeley, Department of Statistics. N.p., n.d. Web. 30 Apr. 2010. <statwww.berkeley.edu/users/spector/stat_s.pdf>.
5. StatSci. S-Plus: User's Manual Vol. 1. 1988. Reprint. Seattle: Statistical Sciences, Inc.,
1991. Print.