S-Plus Statistical Software Package

S-Plus originated from a programming language called S, created at AT&T Bell Labs. It became known as S-Plus when a company called Statistical Sciences (which later merged with MathSoft) began distributing a commercial version of S in the late 1980s. S-Plus was created as a powerful interactive computing tool for many types of statistical and graphical analysis. It is primarily a command-line interface driven by code, but it also offers some pull-down menu options. It is not as user-friendly as some other statistical packages; however, because it is more programming based, it offers far more freedom to manipulate data than other packages do. S-Plus is closely related to the R statistical software package and shares many of R's commands and functions.

This report gives an overview of some key features of the S-Plus statistical package. The methods may differ depending on which version you are using, and this report discusses only some of the basic statistical techniques. It should be used as a briefing on some of the many capabilities of S-Plus, not as a sole source. We focus on four basic areas of statistical analysis: graphics, data set-up, data summary, and regression. More detailed explanations of these features can be found in the resources listed in the References section.

Creating, Importing, and Manipulating Data

There are numerous ways to set up a dataset in S-Plus. As in many other statistical software packages, you can import a dataset (from Excel, a text file, etc.) or enter the data manually. To import data from a text file you can use the scan() command, passing the location and name of the file as the parameter. S-Plus also has numerous built-in datasets that can be used for analysis.
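As a rough illustration of the import step described above, the sketch below reads a small text file with scan() and enters a second vector by hand. It is written in S syntax that should also run in R; it has not been checked against any particular S-Plus version, and the file and object names are made up for the example.

```r
# Write a tiny text file so the example is self-contained.
# tempfile() just supplies a throwaway file name (an assumption for this sketch).
datafile <- tempfile()
cat("2 4 6 8 10 12\n", file = datafile)

# scan() reads the whitespace-separated values into a numeric vector.
x <- scan(datafile)

# Data can also be typed in directly with c().
y <- c(1.5, 2.5, 3.5)

length(x)   # number of values read from the file
mean(y)     # a quick check on the manually entered data
```

In practice the argument to scan() would be the path to your own data file rather than a temporary one.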
You can create your own matrices, vectors, and data frames in S-Plus for analysis. Names are assigned to objects and datasets with the <- or _ symbol, with the name preceding the symbol and the object to be named following it. To look at the observations for a specific variable in a dataset, type the name of the dataset, a dollar sign, and the name of the variable of interest. This displays all of the observations for that variable.

Naming the variables in your dataset is straightforward with the names() or dimnames() command; which one applies depends on the type of object you are working with (matrix, vector, or data frame). You can name multiple variables at one time by supplying the names with the c() command.

You can also look at a particular value, row, or column using dataname[x, y], where dataname is the name of the dataset, x is the index or name of the desired row, and y is the index or name of the desired column. A matrix or data frame takes two indices; a vector takes only one. The expression displays the value in row x, column y. If you leave x blank, all rows of column y are displayed; if you leave y blank, all columns of row x are displayed. You can look at multiple rows or columns by listing the desired ones with the c() command. Creating subsets is easy using a combination of these commands and specifying the rows and/or columns you are interested in.

New (transformed) variables can be created by adding a column to the dataset and writing the appropriate variable operations on the command line. You can also edit your data using the data.ed() command. As with any software, be sure to classify your variables as either categorical or numeric.
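The indexing, naming, and subsetting commands above can be sketched on a small made-up data frame. All object and variable names below are hypothetical, and the code is S syntax that should also run in R; behavior in a given S-Plus version may differ slightly.

```r
# A small made-up data frame; every name here is hypothetical.
scores <- data.frame(id = c(1, 2, 3, 4),
                     height = c(150, 160, 170, 180),
                     weight = c(50, 60, 70, 80))

scores$height                            # all observations of one variable
scores[2, 3]                             # value in row 2, column 3
scores[2, ]                              # all columns of row 2
scores[, "weight"]                       # all rows of the weight column
scores[c(1, 3), c("height", "weight")]   # several rows and columns at once

# Subset: keep only the rows where height exceeds 155.
tall <- scores[scores$height > 155, ]

# Transformed variable added as a new column.
scores$ratio <- scores$weight / scores$height

# Rename all columns of the data frame at once with c().
names(scores) <- c("id", "ht", "wt", "ratio")

# Row and column names of a matrix are set with dimnames().
m <- matrix(1:6, nrow = 2)
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))
m["r2", "b"]
```

The logical condition used to build the subset is one common way of "specifying the rows"; listing row numbers with c() works equally well.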
You can use the class() command to determine how a variable is classified in the program. If you want to make a variable categorical, you can use the factor() command to convert it. With this command you can also create or edit your own categorical variable by specifying the levels and labels. S-Plus denotes missing values as NA, which can sometimes cause problems when fitting models.

Graphics

S-Plus is known for its high-quality and diverse graphics. You can generate almost any type of graph with the right code and parameter set-up. Among the general plotting commands, par(mfrow=c()) saves space by placing multiple plots on one page; the numbers of rows and columns you supply determine how many graphs appear on the page. With the title() command you can specify a main plot title as well as the X and Y axis labels. By default, if you do not name the X and Y axes using the xlab and ylab arguments, the axes are labeled with the variable names. With most of the plotting functions you can also use various commands to add text, color, shading, points, lines, arrows, and even axes to plots. You can control the length of the X and Y axes, the width of lines, and the type of character plotted.

An X-Y plot of continuous variables is made with the plot() command, which lets you select and plot an X and a Y variable. You can then use the abline() command to add a horizontal, vertical, or regression line to the plot. Histograms and boxplots are made with the hist() and boxplot() commands, respectively; these functions need only the one variable of interest as a parameter. Legends are useful in graphs and can be added using the legend() command.
In this particular function you need to specify the location on the plot where you want the legend to appear as well as the text to be displayed. The qqnorm() command creates QQ plots for checking the normality assumption. When you pass the name of your dataset to the pairs() command, it generates a pairwise scatterplot matrix of all the variables in the dataset. Contour plots can be created by specifying the X, Y, and Z variables. Additional and more complex graphics can also be generated, such as time series plots, mesh plots, image plots, multivariate bar plots, pie charts, cluster trees, survival curves, and even a USA map with state abbreviations and colors added.

Transferring graphs for use in other programs is quite easy. You can copy the desired graphic to the clipboard and paste it into a separate application, export the graph, save the graphic as an S-Plus graph sheet, or use one of the graphics device commands offered in S-Plus (pdf.graph for Adobe PDF format, or conversion to a .BMP or .JPEG file). Some of the graphics available in S-Plus are shown in Figures 1 and 2.

Summarizing Data with Continuous and Categorical Variable Analysis

Summary statistics can be produced in S-Plus using the Data Summaries command under the Statistics menu. After you select the variables to focus on and check the boxes for the summary statistics you want (mean, median, standard error, etc.), the results are displayed as output. For grouped variables you can specify the grouping using the Group Variables command in the Statistics menu. A dialog box will prompt you to specify which variables to group the others by. Other quantities of interest, such as the correlation matrix or covariance matrix of a dataset, can be viewed with the right commands.

Regression and Diagnostics

Simple linear regression can be accomplished using the lm() command.
The parameters of this command include the model formula and the dataset. Various other options can be used with this command to extend its features. You can name the linear model you fit using the assignment symbols mentioned previously, which makes it easier to recall the fit later. Applying the summary() command to your linear model provides most of the standard regression output. The plot() command for a linear model provides various diagnostic plots, such as residuals, fitted vs. observed, a QQ plot, and Cook's distance. To create a polynomial model you can use the poly() command, in which you specify the variable and the degree you want for your model. You can fit a nonlinear regression model using the nls() command, specifying both the desired equation and the dataset to be used. A number of other model fits are available through various commands. To produce an analysis of variance table you can use the aov() command. With this command you can analyze a variety of designs, including factorials, split-plot designs, and both balanced and unbalanced designs.

Figure 1. Above are images from the “S-Plus Tutorial” website of basic scatterplots.

[Figure images not reproduced. Recoverable panel titles and axis labels: Histogram of Dice Data; Boxplot of Dice Rolls; a plot with a z axis; largecap.ts@data[, 1]; Imaginary barplot of asset weights (Asset 1 to Asset 4, Scenario A and Scenario B); Quantiles of Standard Normal.]

Figure 2. Above are images from the “Graphics, Part 1” website of basic scatterplots.

A generalized linear model can be fit with the glm() command. With this command you can use link functions to model how the mean response relates to a linear combination of its predictor variables. Logistic regression and log-linear models are also fit with this command. For this function a distribution (family) needs to be specified.
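The regression commands described above can be sketched as follows. The data are made up purely for illustration, all variable names are hypothetical, and the code is S syntax that should also run in R; printed output and available options may differ between S-Plus versions.

```r
# Made-up data; every variable name here is hypothetical.
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1),
  grp = factor(c("a", "a", "a", "a", "b", "b", "b", "b")),
  success = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Simple linear regression: formula first, then the dataset.
fit.lm <- lm(y ~ x, data = dat)
summary(fit.lm)          # standard regression output

# Polynomial model of degree 2 in x.
fit.poly <- lm(y ~ poly(x, 2), data = dat)

# One-way analysis of variance on the grouping factor.
fit.aov <- aov(y ~ grp, data = dat)
summary(fit.aov)         # analysis of variance table

# Logistic regression: a generalized linear model with the binomial family.
fit.glm <- glm(success ~ x, family = binomial, data = dat)
summary(fit.glm)
```

Naming each fit (fit.lm, fit.glm, and so on) is what allows it to be passed to summary() or plot() later.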
For survival analysis you can use the coxph() command. A Pearson's chi-square test can be conducted on categorical data (for example, a table of counts) using the chisq.test() function. The standard t-test can be performed using the t.test() command. To assist in selecting appropriate variables, the step() function can be used to perform stepwise regression over the variables specified. You can specify the method desired: forward, backward, or both directions. Nearby models can be checked using the add1() and drop1() functions to see how different variables affect the model. To investigate the numerous variables in your model, principal component analysis can be done using the Principal Component option in the Statistics menu. In the dialog box you will be asked to specify the dataset and the variables you are interested in analyzing. If you are interested in nested, random, or mixed effects in a model, there are various functions and options that allow for this.

Quite a few standard commands can be used regardless of the type of fit you have. The coef() command applied to a model fit returns all of the model coefficients. The predict() command generates predicted values. To report the fitted values you can use the fitted() command. To modify an existing model you can use the update() command. Residuals can be displayed with the resid() command.

Other Notes

There are a few important things to note that were not previously mentioned. S-Plus is a case-sensitive programming language. This applies to all variables, datasets, commands, arguments, etc. You can quit S-Plus at any time using the q() command. The program also allows you to create functions for later use with the function() command. With this command you first specify the parameters the function should have; then, in the body of the function, you write how those parameters are to be manipulated.
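A minimal sketch of two ideas from this section follows: the standard extractor commands applied to a fitted model, and a user-defined function built with function(). The data, the extra predictor z, and the helper name rms.resid are all made up for illustration; the code is S syntax that should also run in R.

```r
# Made-up data for illustration only.
dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  z = c(3, 1, 4, 1, 5, 9),
                  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.0))
fit <- lm(y ~ x, data = dat)

coef(fit)      # model coefficients
fitted(fit)    # fitted values
resid(fit)     # residuals
predict(fit, newdata = data.frame(x = c(7, 8)))   # predictions at new x values

# update() refits a model with a modified formula; ". ~ . + z" adds z.
fit2 <- update(fit, . ~ . + z)

# User-defined function (hypothetical name): the parameter list comes first,
# and the body states how the parameter is manipulated.
rms.resid <- function(model) {
  r <- resid(model)
  sqrt(mean(r^2))   # root mean squared residual
}

rms.resid(fit)
rms.resid(fit2)
```

Because names are case sensitive, update() must be typed in lower case, and rms.resid and RMS.resid would be two different objects.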
The ls() command provides a list of your data objects. Data objects that you rarely use can and should be removed with the rm() command, specifying which objects to permanently delete. Note that objects created in S-Plus are usually permanent, so it is important to free space occasionally by deleting objects. S-Plus has a whole range of expressions and operators that make computation easy. Power and sample size can be determined using the Power and Sample Size command under the Statistics menu. A dialog box will prompt you to specify several things, such as the alpha, standard mean, and standard deviation.

Overall, the S-Plus language gives you the freedom to manipulate and analyze data as you please. It is a great programming language, especially when you are interested in high-quality graphics and the ability to explore your data freely.

References

1. "S-PLUS Tutorial." Biology at the University of Alaska Fairbanks. N.p., n.d. Web. 30 Apr. 2010. <http://mercury.bio.uaf.edu/mercury/splus/splus.html>.
2. "Graphics, Part 1." University of Washington, Department of Statistics. N.p., n.d. Web. 30 Apr. 2010. <www.stat.washington.edu/.../spluscourse/UWCF%20SPLUS%20Lecture8.ppt>.
3. "Commonly Used Commands in Splus." Duke University, Department of Statistical Science. N.p., n.d. Web. 30 Apr. 2010. <http://www.stat.duke.edu/courses/Spring02/sta242/labs/SPlus.html#import>.
4. Spector, Phil. "Statistical Models and Graphics in Splus." University of California, Berkeley, Department of Statistics. N.p., n.d. Web. 30 Apr. 2010. <statwww.berkeley.edu/users/spector/stat_s.pdf>.
5. StatSci. S-Plus: User's Manual Vol. 1. 1988. Reprint. Seattle: Statistical Sciences, Inc., 1991. Print.