S-PLUS Course Notes for STAT 462/862
2000 Edition for S-Plus 2000 for Windows
1996, 1997 Deanna Rothwell
Updated 2000 by Andrew Day

Table of Contents

Introduction to S-PLUS
    Background
    S-PLUS Capabilities
    Available Documentation
    Starting and Ending an S-PLUS Session
    S-PLUS Language and Syntax
    Getting Help
Objects in S-PLUS
    Intro to Objects
    Assigning Objects
    Storing Objects
    Listing or Removing Objects
    Object Names
    S-PLUS Namespace
    Displaying Objects
    Object Attributes
Data Values
    Common Values
    Special Values
    Coercion
    Notes on Logical Values
    Notes on Missing Values
Saving S-PLUS Source Code and Output in External Files
    S-PLUS Source Code
    S-PLUS Output
Vectors
    Properties
    Attributes
    Creating Vectors
    Naming Elements
    Subsetting
    Vector Arithmetic
Matrices
    Properties
    Attributes
    Creating Matrices
    Subsetting
    Matrix Arithmetic
    Matrix Manipulation
    Row and Column Names
Lists
    Properties
    Attributes
    Creating Lists
    Naming Components
    Subsetting
    Attaching and Detaching Lists
Data Frames
    Properties
    Attributes
    Creating a Data Frame
    Subsetting
    Attaching and Detaching Frames
    Matrix Manipulation
    Changing a Data Frame to a Matrix or Vice Versa
Factors
    Properties
    Attributes
    Creating a Factor from a Vector
    Labeling Levels of a Factor
    Categorizing Continuous Data
    Some Useful Functions Utilizing Factors
Reading Data into S-PLUS
    Reading From the Keyboard
    Reading from ASCII Files
Functions in S-PLUS
    Expressions
    Groups
    The IF/ELSE Conditional
    The FOR Loop
    The WHILE Loop
    Functions
    Creating, Updating, Editing Functions
    Returning More than one Object
    Arguments to Functions
    The .FIRST Function
Probability and Random Numbers
    Function Syntax
    Selecting a Random Sample
Graphics in S-PLUS
    Opening a Graphics Device
    Simple Plots
    Setting the Plot Shape
    Creating Multiple Plots per Page
    Adding Titles
    Adding Axis Labels
    Setting Axis Limits
    Specifying Logarithmic Axes
    Selecting Plot Types
    Selecting Line Types
    Selecting the Plotting Character
    Adding Straight Lines to an Existing Plot
    Adding Points/lines to an Existing Plot
    Adding Text to a Plot
    Adding Legends
    Custom Graphics Parameters
Introduction to Statistics in S-PLUS
    T-Tests
    Other Hypotheses Tests
    Summarizing Data in S-PLUS
    Classical Linear Models
    Updating Models
    Options to lm()
    Categorical Variables as Predictors
    Interactions
    Adding or Dropping Terms
    Summaries of Fits
    Designed Experiments and ANOVA

Introduction to S-PLUS

BACKGROUND
S-PLUS is an extension of "S", which was developed by AT&T's Bell Labs in the late 80's. It is currently produced by StatSci in Seattle, a division of MathSoft. S-PLUS is a programming environment, like MATLAB, but for statistics. S-PLUS is an interpreted language (not compiled): it lets you enter commands one by one and executes them as you enter them.

S-Plus 2000 for Windows, released in the summer of 1999, adds a GUI (graphical user interface) to the S-Plus environment. Many of the common S-Plus tasks can now be accomplished interactively through functions accessed via the toolbar and menus. This point-and-click environment is not available for UNIX S-Plus or earlier versions of PC S-Plus. The command line, which is the only interface in the other S-Plus versions, is still available in S-Plus 2000.

S-PLUS CAPABILITIES
Data management
S-PLUS allows for easy storage, organization, retrieval, and manipulation of data. All data objects are defined as constants, vectors, matrices, etc., so that calculations and data manipulation are quite simple.
Graphics
S-PLUS has many built-in functions for 2- and 3-dimensional plotting, interactive graphics, data visualization, multivariate graphics, survival curves, and custom graphics. S-PLUS does graphics very nicely and it is fairly easy to customize the graphs to your liking.

Statistics
S-PLUS has a rich set of built-in functions for hypothesis testing, ANOVA, design of experiments, multiple regression, time series, survival analysis, quality control, generalized linear models, generalized additive models, local regression, tree-based regression, discriminant analysis, cluster analysis, principal component analysis, non-linear least squares, etc.

Extensibility
S-PLUS is a true programming language, so you can write your own functions to automate calculations or analyses that the built-in functions or procedures don't do. S-PLUS can also interface with Fortran, C, SPSS, and other software, and it can directly import data in many formats, including SAS, Excel, SPSS, Quattro Pro, Paradox, and Microsoft Access.

AVAILABLE DOCUMENTATION
There is a large body of references both in print and online. For a list of available references in print, see page 8 of the Users Guide (accessible through the S-Plus Help menu). S-Plus 2000: Programmer's Guide is available at the Queens Bookstore. It is a recommended text for this course.

STARTING AND ENDING AN S-PLUS SESSION
Start S-Plus by selecting it from the Programs list of the Windows Start menu. The data and objects that you create will be stored in a specified directory on the D drive. If this directory does not exist at startup, S-Plus will ask to create it. To get to the command line (which is where we will spend most of our time), choose Commands Window from the pull-down Window menu. The Commands window may also be opened by selecting the appropriate icon from the toolbar.
To exit S-PLUS, close the outer S-Plus window or, on the command line, type
    > q()
This is a function call, calling the function that quits S-PLUS. In S-PLUS all commands are functions that take arguments, so you must always use parentheses when calling a function, even if it takes no arguments. If you type a function's name without the parentheses, S-PLUS does not call the function; it prints the function's definition by listing its code.

S-PLUS LANGUAGE AND SYNTAX
Expressions
You use the S-PLUS Commands window by typing expressions after the prompt. Expressions are evaluated when you press the return key. Most expressions will be function calls. To type a series of expressions on the same line, separate them with semicolons.

Spaces
S-PLUS ignores spaces except in the middle of numbers or names. However, you may want to add spaces for aesthetics and for ease of debugging.

Case
S-PLUS is case-sensitive; this goes for S-PLUS objects, arguments, names, etc.

Line Continuation
If you enter an incomplete expression and press return, you will see a '+' (continuation prompt) on the next line, giving you the opportunity to finish the command. For example,
    > 6*
    + 2
    [1] 12

Canceling an Expression
Hit the Esc key to stop S-Plus from evaluating an expression.

GETTING HELP
There are numerous ways to get help in S-Plus 2000.
1. From the command line, you can get help on a function or item by calling the help function with the function as its argument, or by prefacing the function name with a ?. e.g.
    > help(rev)
    > ?rev
both provide help for the rev function.
2. args(function name) tells you what arguments a function will take.
3. From the pull-down Help menu you will find a thorough indexed help for all functions and language elements, as well as a complete set of online reference manuals, visual demos, tips, and direct links to the S-Plus web home page and S-Plus technical support.

Objects in S-PLUS

INTRO TO OBJECTS
S-Plus can be thought of as an object-oriented language in that it consists of objects.
An object is a very general term that can be used to describe any component in a software system that can be defined by its type or class along with a set of properties (or attributes) and methods (which describe the actions or functionality of an object).

There are three basic types of objects in S-Plus 2000. Computational engine objects include data sheets and data frames, functions, lists, matrices, and vectors. Interface objects are objects such as menu items, toolbars, and dialogs. Finally, document objects include graph sheets, reports, and scripts. Only computational engine objects exist in S-Plus for UNIX and earlier versions of Windows S-Plus. In this course, when we refer to objects we mean computational engine objects.

In S-Plus you will create and manipulate computational engine objects. Each of the objects you create will have certain attributes, depending on its object type. The objects we will use include vectors, matrices, lists, data frames, arrays, factors, and functions.

The simplest object is the vector. To create a vector you can use the combine function c() to combine a set of numbers (e.g. 2,5,7,10), logical values (e.g. T,F,T,T), or character strings (e.g. 'Jane','Henry','Bridgett'). The combine function forms a vector out of its arguments: e.g.
    > c(1,2,3)
    [1] 1 2 3

ASSIGNING OBJECTS
The assignment operator (<-) is used to name and store data. For example, to store the value 20 in the variable x type:
    > x <- 20
The underscore ( _ ) can also be used for assignment:
    > x_20
but I suggest not using this method because a) it isn't as obvious or as readable; and b) it looks like a SAS variable name.

STORING OBJECTS
Objects you create are permanently stored in a designated directory called _Data. The location of the _Data directory depends on your installation and can be modified. S-PLUS searches for objects first in the designated _Data directory.
If the object is not found, S-PLUS looks through the directories or objects in its search path. To list the path, type search(). To attach other directories to the search path, use the attach() function. To remove extra search directories from the search path, use the detach() function.

For the S-Plus 2000 installed on the first floor computers of Jeffry Hall, the _Data directory resides on the D drive of the machines. Anyone can modify or remove the contents of this directory, and it is only accessible from the given machine. If you wish to save your data (including functions), copy your _Data directory (or selected objects from it) to a floppy disc, or FTP it to your UNIX account. You may also save ASCII (text) versions of objects to files by using the write() or dump() functions.

LISTING OR REMOVING OBJECTS
The functions objects() and ls() both list all the objects in your working directory.
    > objects()
    > ls()
To remove objects from the working directory, use the rm() function:
    > rm(x,y,z)
You can also remove objects from Windows by going into the _Data directory and deleting the corresponding files. The _Data directory has files with the same names as S-PLUS objects, but the files are readable only in S-PLUS.

OBJECT NAMES
Object names can be any combination of upper and lower case letters, numbers, and periods, but they must all start with a letter. The following are all legal names:
    data
    data.cancer
    CancerData
    cancer.data.version.1

S-PLUS NAMESPACE
When several objects share the same name, S-PLUS retrieves the first one it finds on the search path. That is, if you give an object the same name as a pre-defined S-PLUS object, you won't overwrite the S-PLUS object, but you will prevent S-PLUS from finding the pre-defined object, since it will see yours first in the working directory. Therefore, you should avoid choosing names that are the same as pre-existing S-PLUS functions, or you will be prevented from accessing that function.
Use the masked() function to list objects which mask other objects. Avoid using the following single-character names, as they are already defined as S-PLUS functions: C, D, c, q, s, and t.

DISPLAYING OBJECTS
To look at the contents of any object, just type its name at the S-PLUS prompt:
    > names
    [1] "Matt" "Sam" "Lori"

OBJECT ATTRIBUTES
Every data object has a set of attributes, including:
    class  = what kind of object class, if any (factor, data frame, list)
    length = # of components
    mode   = what kind of values ('logical', 'numeric', 'complex', 'character')
    etc. (depending on the data structure)

Helpful functions are:
    mode(<obj>)         returns the mode of an object
    length(<obj>)       returns the length of an object
    attributes(<obj>)   returns all other attributes of an object
    data.class(<obj>)   returns the type of object

Data Values

COMMON VALUES
logical
    TRUE or T and FALSE or F represent binary data.
numeric
    • ordinary decimal numbers (e.g. 27, -6.28, 81.02)
    • S-PLUS expressions that generate real values (e.g. pi, exp(2), 2/3)
    • scientific notation (e.g. 3.12e-4 represents 0.000312)
    • the special value Inf to represent infinity
complex
    Similar to numeric except for complex numbers (e.g. a+bi)
character
    Any character string enclosed in single or double quotes. There are some special characters:
    \t   tab
    \n   new line
    \"   double quotes
    \'   apostrophe
    \\   backslash

SPECIAL VALUES
NA
    Represents 'Not Available'. It is the code for 'missing' for logical, numeric, and complex data. A missing character value is represented by "".
NULL
    Represents a 'non-value'. For example, if no values can be returned from a function call, S-PLUS will return NULL.

COERCION
Many objects in S-PLUS can only contain one kind of value (e.g. vectors, matrices). If you try to input different kinds of values into one of these objects, S-PLUS will 'coerce' them into a single mode. In doing so, S-PLUS tries to retain as much information as possible.
When mixed values are entered into one of these objects, all the values are coerced to the mode of the value with the most information. The values with the most information are characters, followed by complex, numeric, then logical. e.g.
    > c('names',34,T)
    [1] "names" "34" "TRUE"
    > c(F,1,T,0)
    [1] 0 1 1 0
S-PLUS converts TRUE and FALSE to 1 and 0 respectively when coerced to numeric.

NOTES ON LOGICAL VALUES
Logical values in S-PLUS are written as T, F, TRUE, or FALSE. These are neither numbers nor character strings, so you don't have to write them in quotes.

Comparisons between Objects
You can compare objects using the following operators:
    <    less than
    <=   less than or equal to
    >    greater than
    >=   greater than or equal to
    ==   equal to
    !=   not equal to
The result of a comparison is either a logical or a vector of logicals.
    > x<-1:3
    > x>1
    [1] F T T
    > x[x>1]
    [1] 2 3

Logical Operators
Logical operators operate on logical objects:
    !          not
    & or &&    and
    | or ||    or
where !, &, and | return vectors when possible, and && and || are 'control operators' that always return a single logical value. e.g.
    > (c(3,5,2) > c(1,6,0)) & (1:3)>2
    [1] F F T

Functions Acting on Logical Vectors
    any(x)   returns T if any elements in x are T
    all(x)   returns T if all elements in x are T
e.g.
    > if( all( x>0 ) ) y <- sqrt(x)

NOTES ON MISSING VALUES
S-PLUS uses NA to denote a 'missing' or 'not available' value. Like logical values, NA is neither a number nor a character string. Any arithmetic or comparison operation involving NA yields NA, so you cannot test for missing values with ==. e.g.
    > x <- 1:3
    > x == NA
    [1] NA NA NA
The function is.na(x) works component-wise to yield a logical vector indicating which elements of x are NA.

Saving S-PLUS Source Code and Output in External Files

S-PLUS SOURCE CODE
1. Creating Source Files
You will often want to save your code so that it can be referenced or re-used later. In fact, it is good to save code frequently to avoid losing work. There are many ways to save code in S-Plus 2000.
    • You can create and edit the code in an external editor such as Notepad and save the contents as a file.
    • You can open the history log (choose the accordion-like icon from the toolbar) and save its contents to a file.
    • You can use the script editor and save the file.
    • You can copy and paste the contents of the history log or script editor to an external editor such as Notepad.

2. Reading Source Files
There are several ways to open and execute previously saved code.
    • You can copy and paste code from a text editor into the script editor or command line.
    • You can open the code file in the script editor. Choose Open from the pull-down File menu.
    • You can use the source command:
    > source('code.txt', auto.print=T)
reads and executes a file named code.txt from the S-Plus default work directory. The auto.print=T option is required to echo the commands and output to the Commands window.
    > source('a:\\code.txt', auto.print=T)
runs code.txt from the a: drive. Note that the double backslash is required because a single backslash is used to identify special characters.

When the source() function is used, S-PLUS reads in the file and executes the commands one at a time, outputting the results to the S-PLUS window as usual. When done, the prompt returns. If there are errors in the code, none of the assignments made in the source file are kept.

Hints
    • build and test source files incrementally, re-editing the source file after finding errors
    • use comments liberally, both to clarify and to temporarily omit some code

COMMENTING CODE
Commenting means making code invisible to S-PLUS so that it doesn't read it in as executable code.

Reasons for Commenting
It is useful for documenting what your commands are doing, so that you have a record of what you did and why, and so that others can follow your code. You should get in the habit of commenting your code.

Commenting is also helpful for temporarily making some commands invisible to S-PLUS so that it doesn't read them in.
This is useful when debugging code and playing around with it. To comment in S-PLUS, use the number sign (#). S-PLUS ignores everything that comes after # on a line. e.g.
    # the following code defines x
    y <- 144
    z <- 2
    x <- sqrt(y) / z
    #x <- sqrt(y)
    print(x)    # display the value of x
    [1] 6

S-PLUS OUTPUT
By default S-PLUS only writes output to the window that you're running S-PLUS in. Chances are you'll want to save some or all of that output to an external file so you have a permanent copy. You may save output by copying it directly from the command line or script window to a text editor and saving the file. Or you may use the sink() function:
    > sink("a:\\output")
sends the output from that point on to the external file named "output" on your a: (floppy) drive. If you want S-PLUS to record the commands as well as the output, use the command options(echo=T) in S-PLUS before sinking the output.
To cancel, type
    > sink()
To append to an existing file, type
    > sink("a:\\output",append=T)

Vectors
A set of elements in a specified order

PROPERTIES
    class-less
    all elements must be of the same mode
    not a special case of a matrix

ATTRIBUTES
    length = # elements
    mode   = kind of values
    names  = value labels

CREATING VECTORS
You can create vectors in several ways. You may use the Fill Numeric Columns feature of the S-PLUS 2000 interface by selecting Fill from the Data pull-down menu. There are also several functions that allow you to create vectors.
The most common of these functions are:

    Function    Description                    Examples
    scan        read values (any mode)         scan(), scan("myfile")
    c           combine values (any mode)      c(1,3,2,6), c("yes", "no")
    rep         repeat values (any mode)       rep(c(1,2), 3)
    :           numeric sequences              1:5, 1:-1
    seq         numeric sequences              seq(-pi, pi, .5)
    vector      initialize vectors             vector('complex', 5)
    logical     initialize logical vectors     logical(3)
    numeric     initialize numeric vectors     numeric(4)
    complex     initialize complex vectors     complex(5)
    character   initialize character vectors   character(6)

Get help on the above functions for more information.

NAMING ELEMENTS
The names() function assigns names to elements of a vector. e.g.
    names(marks) <- c("HW1","HW2","Mid","Final")

SUBSETTING
Suppose x is a vector of any mode, then:
1. x[4] returns the 4th element of x.
2. If v is a vector of positive integers then x[v] returns the elements of x indicated by the integers in v. e.g.
    > x <- c(4,34,7,2,8,3)
    > v <- c(2,5)
    > x[v]
    [1] 34 8
3. If v is a vector of negative integers then x[v] returns all the elements of x except those indicated by the integers in v. e.g.
    > x <- c(4,34,7,2,8,3)
    > v <- c(-2,-5)
    > x[v]
    [1] 4 7 2 3
4. If v is a vector of logicals then x[v] returns the elements x[i] for which v[i] is T. e.g.
    > x <- c(4,34,7,2,8,3)
    > v <- c(T,F,F,F,T,T)
    > x[v]
    [1] 4 8 3
5. If v is a vector of character strings and x is a named vector then x[v] returns the elements of x which have the indicated names. e.g.
    > x <- c(4,34,7,2,8,3)
    > names(x) <- c('c1','c2','c3','c4','c5','c6')
    > v <- c('c3','c6')
    > x[v]
    c3 c6
     7  3
You can also change selected elements of a vector by using any of the above rules: e.g.
    > x <- c(4,3,2,6,7,8)
    > x[ c(2,5) ] <- 0
    > x
    [1] 4 0 2 6 0 8

VECTOR ARITHMETIC
Arithmetic operations on vectors are performed element-by-element. e.g.
    > x <- c(1,2,3)
    > y <- c(2,3,4)
    > x * y
    [1]  2  6 12
'Short' vectors are 'recycled'. Whatever is missing is supplied by recycling the vector as often as needed.
You may get a warning if the vectors' lengths are not exact multiples of one another. e.g.
    > z <- c(8,9)
    > x + z
    [1]  9 11 11

Operators
    +   addition
    -   subtraction
    *   multiplication
    /   division
    ^   exponentiation
To list the precedence of operators, type
    > ?Syntax

Elementary Functions
The following familiar functions also work element-by-element: log(), log10(), exp(), sin(), cos(), tan(), sqrt(), abs(), ceiling(), floor(), trunc(), round(), signif(), etc.

Summary Functions
Some useful summary functions include:
    max(x)         maximum element of x
    min(x)         minimum element of x
    range(x)       range of elements of x
    length(x)      number of elements of x
    sum(x)         sum of the elements of x
    prod(x)        product of the elements of x
    mean(x)        mean of the elements of x
    sort(x)        elements of x sorted in ascending order
    rev(x)         elements of x in reverse order
    sort.list(x)   returns a vector of integers containing the indices of the elements of x in ascending order
    > x<-c(4,3,8,1,9)
    > sort.list(x)
    [1] 4 2 1 3 5

Examples:
1. To compute $\left(\sum_{i} x_i^2\right)^{1/2}$, use
    sqrt( sum( x^2 ) )
2. To compute the sample variance of x1, x2, ... , xn, i.e. $\frac{1}{n-1}\sum_{i}(x_i-\bar{x})^2$, use
    sum( ( x-mean(x) )^2 ) / ( length(x)-1 )

Matrices
Two-dimensional array of elements of the same mode

PROPERTIES
    class-less
    all elements must be of the same mode

ATTRIBUTES
    dim      = dimensions (# rows, # columns)
    length   = number of values
    mode     = kind of values
    dimnames = row and column names

CREATING MATRICES
To create a matrix, use the matrix() function. By default values are stored by column and the number of columns is 1. e.g.
    > matrix(1:12, ncol=3)
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
    > matrix(1:12, nrow=4)
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
(Produces the same thing)
To store values by row, use the optional argument byrow=T. e.g.
    > matrix(1:12, nrow=4, byrow=T)
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9
    [4,]   10   11   12
e.g. To create a 2 x 5 matrix of 0's:
    > matrix(0, nrow=2, ncol=5)
e.g.
To create a 1-column matrix:
    > matrix(1:12)

SUBSETTING
To get specific elements of a matrix, use either
1. matrix[i,j] to specify the ith row and jth column.
2. matrix[i] to specify the ith element, counting down by column.
e.g.
    > A <- matrix(1:12, nrow=4)
    > A
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
    > A[2,1]
    [1] 2
    > A[8]
    [1] 8
    > A[1:3, 2:3]
         [,1] [,2]
    [1,]    5    9
    [2,]    6   10
    [3,]    7   11
Note: an empty subscript equals all subscripts. e.g.
    > A[1:2,]
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
Note: subsetting can sometimes lead to a vector. e.g.
    > A[,2]
    [1] 5 6 7 8
To keep this as a matrix, you must use the following option:
    > A[,2, drop=F]

MATRIX ARITHMETIC
Arithmetic on matrices is performed element-by-element. Therefore the matrices must be conformable (have the same dimensions).

MATRIX MANIPULATION
    nrow(A)          # rows
    ncol(A)          # columns
    dim(A)           c(nrow(A),ncol(A))
    t(A)             transpose
    A%*%B            matrix multiplication
    cbind(A,B,z)     binds the listed vectors/matrices by column
    rbind(A,B,z)     binds the listed vectors/matrices by row
    crossprod(A,B)   t(A)%*%B
    crossprod(A)     crossprod(A,A)
    diag(matrix)     main diagonal of indicated matrix
    diag(vector)     diagonal matrix with the vector on the diagonal
    diag(n)          n x n identity matrix
    solve(A,b)       solution x to the equation Ax=b
    solve(A)         matrix inverse

ROW AND COLUMN NAMES
To assign names to the rows and columns, you need to create a list (we'll learn more about lists later) with two components, the first for the row names and the second for the column names. Each component is a vector of character values of the appropriate length:
    > dimnames(M) <- list( c('row1','row2','row3'),
    + c('col1','col2','col3','col4') )

Lists
An ordered collection of objects ('components')

PROPERTIES
    Each list component can be any data object.
    Components can be of different modes and lengths.
    You can even have lists of lists.
    The most general and flexible data object in S-PLUS.
    Most S-PLUS functions return a list.
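As a small illustration of the list properties above, the following sketch builds a list whose components differ in mode and length and which contains another list. The object and component names (student, marks, info, and so on) are made up for the example, and the code is written with the assumption that it behaves the same way in S-PLUS and in R.

```r
# Hypothetical example data: all names and values are illustrative only
student <- list(name = "Jane",                            # character, length 1
                marks = c(82, 91, 77),                    # numeric, length 3
                passed = T,                               # logical, length 1
                info = list(year = 2, program = "STAT"))  # a list inside a list

n.components <- length(student)     # number of top-level components
list.mode <- mode(student)          # mode of the whole object
second.mark <- student$marks[2]     # 2nd element of the marks component
program <- student[[4]]$program     # component of the nested list
```

The last two lines use the $ and double-bracket notation that the SUBSETTING part of this section describes.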
ATTRIBUTES
    mode/class = "list"
    length     = # of top-level components
    names      = names of each top-level component

CREATING LISTS
To create a list, use the list() function.
    > list(names, levels, groups)
creates a list with the objects names, levels, and groups
    > list(index=1:8, bar=rnorm(100), mat=B)
creates a list and names its components

NAMING COMPONENTS
You can name components of a list directly using the above method or you can use the names() function. e.g.
    > names(mylist) <- c("index","bar",'mat')

SUBSETTING
To specify list components you can use one of two methods:
1. Use the index number of the component enclosed in double brackets.
    > data.list[[3]]
2. Specify the name of the list followed by a $ and the name of the component.
    > data.list$name
Once you've specified the component, you can access parts of that component using the usual single bracket method.

If you are specifying a component by using its name, it is useful to know that you don't actually need to type in the full name of the component, but only enough of the component name for S-PLUS to be able to distinguish it from the other components. For example, in the list mylist above, you could specify the index component by:
    > mylist$i

ATTACHING AND DETACHING LISTS
You can 'attach' a list to the search path so that you need only refer to the list's components by name:
    > attach(mylist)
You can also detach an attached list:
    > detach("mylist")

Data Frames
A special list whose components are vectors of the same length

PROPERTIES
    vector components are bound column-wise into a matrix-like structure
    the preferred object for storing datasets
    each column acts as a variable

ATTRIBUTES
    class     = "data.frame"
    names     = column names (required)
    row.names = row names (optional)

CREATING A DATA FRAME
To create a data frame, use the data.frame() function. e.g.
    > x <- 1:5
    > myframe <- data.frame(x=x, x2=x*x, x3=x^3)
    > myframe
      x x2  x3
    1 1  1   1
    2 2  4   8
    3 3  9  27
    4 4 16  64
    5 5 25 125

SUBSETTING
A data frame can be treated as either a matrix or a list when extracting components.
    > myframe[,2]     (picks out the 2nd column)
    > myframe[1,]     (picks out the 1st row)
    > myframe$x3      (picks out the 3rd column named x3)

ATTACHING AND DETACHING FRAMES
You can attach and detach data frames as you would a list because all data frames are essentially special cases of lists.

MATRIX MANIPULATION
Matrix operators (e.g. %*%, etc.) unfortunately do not work on data frames until you coerce the data frame into a matrix.

CHANGING A DATA FRAME TO A MATRIX OR VICE VERSA
To change a data frame to a matrix, use the data.matrix() function.
    > mymatrix <- data.matrix(myframe)
To change a matrix to a data frame, use the as.data.frame() function.
    > myframe <- as.data.frame(mymatrix)

Factors
A vector of categorical/discrete data

PROPERTIES
    the set of allowed categorical values is finite
    values are called categories or levels

ATTRIBUTES
    class  = "factor" (generic factor)
             "ordered" (ordinal categorical data)
    levels = possible levels of the factor

Examples of a generic factor:
    experimental status = treatment/control
    gender = male/female
Examples of an ordered factor:
    educational status
    income class

CREATING A FACTOR FROM A VECTOR
If gender is a vector of length 100 with a bunch of 'M's and 'F's, then the following creates a gender factor: e.g.
    > gen.factor <- factor(gender)
    > gen.factor
    [1] F M M M F M F F ....
    attr(, "levels"):
    [1] "F" "M"
To create an ordered factor, use the ordered() function. Suppose educ is a vector of length 100 with various values of education level: "E", "H", "U", and "P". Then the following creates an ordered education factor: e.g.
    > educ.ord.fac <- ordered(educ, levels = c('E','H','U','P'))

LABELING LEVELS OF A FACTOR
You can replace the default labels of a factor's levels with your own by using the labels= option: e.g.
    > gen.factor <- factor(gender, labels = c("Female","Male"))

CATEGORIZING CONTINUOUS DATA
Use the cut() function to produce categorical data from continuous numerical data. e.g.
    > income <- 0:40
    > income.cat <- cut(income, breaks = c(0,10,30))
    > income.cat
     [1] NA  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2
    [18]  2  2  2  2  2  2  2  2  2  2  2  2  2  2 NA NA NA
    [35] NA NA NA NA NA NA NA
    attr(, "levels"):
    [1] " 0+ thru 10" "10+ thru 30"
Values falling outside the limits receive NA.
Or alternatively,
    > income.cat <- cut(income, breaks = 3)
    > income.cat
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
    [27] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
    attr(, "levels"):
    [1] " -0.4+ thru 13.2" " 13.2+ thru 26.8"
    [3] " 26.8+ thru 40.4"

SOME USEFUL FUNCTIONS UTILIZING FACTORS
split()
    Converts a vector into a list of components according to the values in another vector (usually a factor). Both vectors must be of the same length. e.g.
    > income.gender <- split(income, gen.factor)
    returns a list having two components "F" and "M", each containing incomes.
table()
    Computes frequency tables from factors of equal lengths. e.g.
    > table(educ.ord.fac, gen.factor)
    produces a two-way table classifying and counting people into the levels of these two factors.
tapply()
    Partitions a data vector according to the levels of a factor and applies a function to each partition. The vector and factor must be of the same length. e.g.
    > tapply(income, gen.factor, mean)
    returns a vector with two elements, the mean income for females and the mean income for males.
apply()
    Partitions a matrix into either rows or columns and applies a function to each partition. e.g.
    > colmeans <- apply(mymatrix, 2, mean)
    returns column means. e.g.
    > rowmeans <- apply(mymatrix, 1, mean)
    returns row means.
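The four functions above can be tried together on a small dataset. The following is a minimal sketch, not part of the original notes: the six incomes and genders are invented so that the results are easy to check by hand, and the code is written with the assumption that it runs unchanged in S-PLUS and in R.

```r
# Hypothetical data for six people; the values are made up for illustration
income <- c(30, 42, 25, 51, 38, 48)
gender <- c("F", "M", "F", "M", "F", "M")
gen.factor <- factor(gender)

by.gender <- split(income, gen.factor)      # list with components F and M
counts <- table(gen.factor)                 # number of people in each level
means <- tapply(income, gen.factor, mean)   # mean income per level: F 31, M 47

mymatrix <- matrix(1:6, nrow = 2)           # 2 x 3 matrix, filled by column
colmeans <- apply(mymatrix, 2, mean)        # one mean per column: 1.5 3.5 5.5
rowmeans <- apply(mymatrix, 1, mean)        # one mean per row: 3 4
```

Note that the second argument of apply() selects the dimension that is kept: 1 gives one result per row, 2 gives one result per column.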
Reading Data into S-PLUS

READING FROM THE KEYBOARD
Vectors
You can read in data from the keyboard in your S-PLUS session by using the c() function:
    > x <- c(3,6,5,8,3,9,4,5)
but this is not very convenient. A better option, when you have several data points, is to use the scan() function: e.g.
    > x <- scan()
    1: 3 6 5
    4: 8
    5: 3 9 4 5
    9:
    > x
    [1] 3 6 5 8 3 9 4 5
Each prompt shows the index of the next value to be read; pressing return on an empty line ends the input.

By default, scan expects numeric values. To specify character values, use:
    > x <- scan(what = "")
If character values have internal spaces then you need to use quotes around each value. For example, "Joe Smith" reads as one value but Joe Smith reads as two. Optionally you can use the sep= argument to indicate a special separator, like tabs or commas.

You may edit vectors by using the Select Data utility from the Data menu. However, new data objects created from this utility will be defined as data sheets rather than vectors.

Matrices
For reading in long matrices, you can call scan from inside the matrix function by using the byrow=T option. e.g.
    > mymatrix <- matrix(scan(), byrow=T, ncol=3)
    1: 80 75
    3: 60 90 100
    6: 50
    7:
You may use the Select Data utility to edit, but not create, matrices.

READING FROM ASCII FILES
For a large dataset, you'll likely want to read in the data from an ASCII file. To do so, use the file= option in the scan function.

Vectors
e.g.
    > my.vector <- scan(file="datafile")
    > my.char.vector <- scan(file="names", what="")
Note: S-PLUS will look in the default data directory for the file. If you want to read in a file from a different directory, specify the path name in quotes as well.

Matrices
e.g.
    > my.matrix <- matrix(scan(file='results'), byrow=T, nrow=12)

Data Frames
e.g.
    > read.table("infile")
    > read.table("infile", header=T)
The second example uses the first row in the file to assign column names. Get help on read.table for more details when you have column names inside the data file.
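The file-reading calls above can be tried without an existing data file. The sketch below is an illustration only: it writes a small made-up file with cat(), then reads it back both as a data frame and as a matrix. The column names id and score and the use of tempfile() are assumptions for the example, and the skip= argument of scan() is assumed to behave as it does in R.

```r
# Assumption: a temporary file stands in for a real ASCII data file
infile <- tempfile()
cat("id score\n1 80\n2 75\n3 60\n", file = infile)  # first row holds column names

scores <- read.table(infile, header = T)   # data frame with columns id and score

score.vec <- scan(infile, skip = 1)        # skip the header row, read the numbers
score.mat <- matrix(score.vec, byrow = T, ncol = 2)  # one file row per matrix row

unlink(infile)                             # remove the temporary file
```

Because the file is laid out one record per row, byrow=T is needed when the scanned values are folded into a matrix; without it the values would fill the matrix column by column.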
Functions in S-PLUS

EXPRESSIONS

In S-PLUS, everything you type is an expression. Each expression has a value which can be assigned to other S-PLUS objects.

e.g. z <- 1:4 is an expression, but so is 1:4. This means you can have multiple assignments:

e.g.
> y <- z <- 1:4

GROUPS

Expressions can be collected by means of a matching pair of braces. Semicolons are needed between expressions on the same line.

e.g.
> {temp <- x; x <- y; y <- temp}
or
> {
+ temp <- x
+ x <- y
+ y <- temp
+ }
Both swap x with y.

The value of a group equals the value of the last expression in the group.

THE IF/ELSE CONDITIONAL

if (condition) expression1 [ else expression2 ]

where the condition is a logical value, the expressions are usually groups, and the else statement is optional. The if statement is itself an expression whose value is the value of either expression1 or expression2.

e.g.
> x <- 10
> z <- if( x>11 ) x else 0
> z
[1] 0

THE FOR LOOP

for (name in expression1) expression2

where name is a variable (e.g. i) that takes each of the values in expression1 in turn. The for loop is an expression whose value is the value of the last expression executed.

e.g.
> f <- 1
> for( i in 1:8 ) { f <- f*i }
Here f ends up equal to 8 factorial (8!).

Vectorizing

e.g.
> x <- 1:8
> for (i in 1:8)
+ { x[i] <- 2*x[i]
+ x[i] <- x[i]+1 }
> x
[1] 3 5 7 9 11 13 15 17

Note however that
> x <- 2*x + 1
does exactly the same thing as the above loop. This is called 'vectorizing' and is always preferable to looping.

Important: S-PLUS is very slow at doing loops. You should avoid using loops whenever possible.

THE WHILE LOOP

while (condition) expression

The while loop is an expression whose value is the value of the last evaluation of expression.

e.g.
> n <- 8
> f <- 1
> while(n>0){ f <- f*n; n <- n-1 }
This results in f = 8!

Watch out for infinite loops!!

FUNCTIONS

name <- function(arguments) expression

Arguments are objects brought into the function from outside and are separated by commas.
All objects used in a function must be either brought in as an argument or defined inside the function. Functions return the value of the expression (i.e. the value of the last line of the expression). To return nothing, use the expression invisible() as the last line in the expression.

e.g.
> fact <- function(n){
+ f <- 1
+ for(i in 1:n) f <- f*i
+ f
+ }
This is the factorial function. It returns the value of f.

Assignments within functions are temporary and are lost upon exiting the function.

e.g.
> fact(4)
[1] 24

e.g.
> fact()
Error in fact: Argument "n" is missing, with no default: fact()
. . .
Dumped

e.g.
> fact
function(n)
{
        f <- 1
        for(i in 1:n)
                f <- f * i
        f
}

CREATING, UPDATING, EDITING FUNCTIONS

You can create, update or edit a function using the fix() function, which opens a text editor.

fix(function)

If the function doesn't exist, it creates a blank template which you can edit. If the function already exists, it displays the current function for you to make changes. If the edits you make have incorrect syntax, then no changes are saved and S-PLUS gives you a warning. To resume editing, use
> fix()
This returns you to the last function you were 'fixing'.

To return to S-PLUS you must first quit the editor. You cannot use both at the same time.

RETURNING MORE THAN ONE OBJECT

To return more than one object from a function, put them all in a list and return the list.

e.g.
stats <- function(x) {
        n <- length(x)
        m <- mean(x)
        s <- sqrt(var(x))
        list(n=n, mean=m, sd=s)
}
Then
> z <- stats(1:10)
> z
$n:
[1] 10

$mean:
[1] 5.5

$sd:
[1] 3.02765

ARGUMENTS TO FUNCTIONS

Default Values

To set a default value for an argument, assign it in the function definition.

e.g.
fact <- function(n=1) { prod(1:n) }
This assigns the default value of 1 to the argument n.

Arguments that have default values can be omitted in function calls; arguments without default values must be supplied.

e.g.
> fact()
[1] 1

In Function Calls

Arguments may be specified either by position (e.g. fact(8)) or by name (e.g. fact(n=8)). If supplied by position, arguments must be given in the same order as in the function definition. Otherwise, arguments must be specified by name. Note that argument names may be abbreviated so long as S-PLUS can still tell them apart.

Variable Number of Arguments

The special name ... is used when there can be an arbitrary number of arguments. It can be used inside an argument list and inside the body of the function to stand for those arbitrary arguments.

e.g.
> first<-function(...){ c(...)[1] }
This function combines all its arguments and returns the first element of the result.

THE .FIRST FUNCTION

The .First function is used to customize the S-PLUS session. Commands inside the .First function are executed each time S-PLUS starts up.

e.g.
.First<-function() { options(prompt='Yes master?') }

To run the .First function after you've edited it, type
> .First()

There is also a .Last function which is automatically run at the end of each session.

Probability and Random Numbers

FUNCTION SYNTAX

The probability and random number functions all begin with one of the following letters:

r    random number generator
p    cumulative distribution function
d    density function
q    quantiles

The letter is followed by the name of the distribution, e.g. rnorm or qt.

Possible Distributions

beta binom cauchy chisq exp f gamma geom hyper lnorm logis nbinom norm pois stab t unif weibull wilcox

Examples

1. How do I generate a vector of 5 normal random numbers with mean = 4 and standard deviation = 4?
> rnorm(5,mean=4,sd=4)
[1] 8.169303 1.253711 1.783170 1.878477 6.829833

2. What's the probability that a standard normal is greater than 1.96?
> 1-pnorm(1.96)
[1] 0.0249979

3. What is the result of the standard normal density function evaluated at 0?
> dnorm(0)
[1] 0.3989423

4. What is the quantile of the standard normal distribution for the probability 0.5?
> qnorm(.5)
[1] 0

5.
What is the quantile of the standard normal distribution for the probability 0.95?
> qnorm(.95)
[1] 1.644854

6. What is the quantile for the t-distribution on 8 degrees of freedom for the probability 0.975?
> qt(.975,8)
[1] 2.306004

7. What are the quantiles of the uniform distribution going from 0 to 4 for probabilities of 0.25 and 0.75?
> qunif(c(.25,.75),0,4)
[1] 1 3

SELECTING A RANDOM SAMPLE

To randomly select items from a finite population use the sample() function. Use the S-PLUS help to show you how.

Graphics in S-PLUS

OPENING A GRAPHICS DEVICE

When a graph is created in S-PLUS 2000, a graphics window is opened if one is not already open. Subsequent calls to plotting functions will produce graphics in this window. To turn off the graphics window you may close the window or issue the command:
> dev.off()

SIMPLE PLOTS

Plotting Vectors

1. Index Plots
To plot the elements of a numeric vector x, type
> plot(x)
to plot x[i] against i.

2. Scatter Plots
To plot the vector y on the vertical and x on the horizontal, type
> plot(x,y)
where x and y are both numeric and of the same length.

Plotting a Matrix

Suppose M is a matrix with two numeric columns. Then
> plot(M)
plots the 1st column of M on the horizontal and the 2nd column of M on the vertical.

Plotting Complex Numbers

If z is a vector of complex values, then
> plot(z)
plots the real part of z on the horizontal and the imaginary part of z on the vertical.

Plotting Mathematical Functions

To plot a mathematical function, you need to create two vectors:
* one to hold a range of values over which you want to display the function
* another to hold the result of the function over that range.

e.g. To plot the sine function from 0 to 20
> x <- seq(0, 20, by=.1)
> y <- sin(x)
> plot(x, y)
Or similarly
> plot(x, sin(x))

Hint: Vary the number of points to obtain smoother or rougher plots.

SETTING THE PLOT SHAPE

The default shape of the plotting box is rectangular.
To make the shape of the plotting box square for subsequent plots, type
> par(pty='s')
To return to the default (rectangular) plotting shape, type
> par(pty='')

CREATING MULTIPLE PLOTS PER PAGE

The default number of plots displayed on the screen is one. To create a display with more than one plot, use the mfrow or mfcol arguments to the par() function.

To create a 3x2 matrix of figures filled in by row, type
> par(mfrow=c(3,2))
To create a 3x2 matrix of figures filled in by column, type
> par(mfcol=c(3,2))
To start a new screen before a multiple plot is finished, just issue another par(mfrow=...) or par(mfcol=...) command.
To return to the default one figure per page, type
> par(mfrow=c(1,1))

ADDING TITLES

You can add either main titles or subtitles to a plot in either of two ways:
* directly in the plot() function, using the argument main or sub
* after the plot() function, using the function title() and the argument main or sub

e.g.
> plot(time, pct.fat, main='Percent Body Fat over Time',
+ sub='Patient #14')

e.g.
> plot(age, ps)
> title(main='Performance Status by Age', sub='Placebo Group')

Note: The main title appears above the plot and the subtitle appears below the plot.

ADDING AXIS LABELS

S-PLUS will automatically label your axes by the names of the variables you are plotting. This usually doesn't look very nice and you'll probably want to add your own labels for clarity. To do so, use the xlab= and/or ylab= arguments.

e.g.
> plot(wk, no.cig, xlab='Time (in weeks)',
+ ylab='Number of Cigarettes')

You can also use these options inside the title() function.

Hint: If you don't want any labels to appear, use xlab='' or ylab=''.

SETTING AXIS LIMITS

The limits are automatically set by the S-PLUS plotting functions, but you can override this and choose your own, using the xlim= and ylim= arguments. S-PLUS will round your specified limits to 'sensible' limits.

e.g.
> plot(time, pct.fat, xlim=c(0,100), ylim=c(0,1))

To maintain the same axis limits over future plots, type
> par(xaxs='d', yaxs='d')
to freeze the axis limits to those of the last plot. If you only want to control one axis, drop one of the arguments as appropriate. To return to the default, type
> par(xaxs='', yaxs='')

SPECIFYING LOGARITHMIC AXES

To put the log scale on the x-axis, type
> plot(time, pct.fat, log='x')
To do so for the y-axis, use log='y', and for both axes, use log='xy'.

SELECTING PLOT TYPES

Inside the plot() or other graphics function you can specify any of the following plot types to display your data, using the type= option.

type='p'    points
type='l'    lines
type='b'    both lines and points (isolated points)
type='o'    both lines and points (overstruck points)
type='h'    high-density plot (vertical line for each data point)
type='s'    stairstep plot
type='n'    empty plot (axes and labels only)

SELECTING LINE TYPES

When your plot involves lines you can also select the type of line you want to display. By default S-PLUS will plot a solid line. To change this, use the lty= option in the plotting function you are using.

e.g.
> plot(time, rate, type='l', lty=2)
plots a dotted line.

Line types are numbered lty=0,...,lty=8, where lty=0 is 'no line', lty=1 is the default solid line and the others are variations on dotted lines.

SELECTING THE PLOTTING CHARACTER

The default plotting character is '*' or a dot, depending on the graphics device and plotting function. To select a different character, use the pch= option in the plotting function, either
* by typing pch=n where n is an integer from 0 to 18 (results in squares, circles, ...), or
* by specifying the symbol in quotation marks (e.g. pch='#')

ADDING STRAIGHT LINES TO AN EXISTING PLOT

Sometimes it's useful to be able to display straight lines on your plot.
To overlay a line with a given slope and intercept, use
> abline(intercept, slope)
To add a least-squares regression line you can do the following
> plot(x,y)
> abline(lm(y~x), lty=2)
(The function lm() fits a linear model using the method of least squares.)

ADDING POINTS/LINES TO AN EXISTING PLOT

Often you want to show several lines on a plot or add additional data to a plot.
* The lines() function adds lines to the current plot.
* The points() function adds points to the current plot.
Both functions work almost exactly like the plot() function. All the optional arguments above (lty= , type= , pch= ) can also be used with these functions.

e.g.
> plot(time, group1, type='l', lty=2, xlim=c(0,100),
+ ylim=c(1,20), xlab='Days', ylab='Body Fat')
> lines(time, group2, lty=4)
> lines(time, group3, lty=5)

Note: If the data you add to a plot have a greater range than the limits in the existing plot, you will receive a warning message and those points outside the range will not be plotted. To solve this problem, set up appropriate axis limits in the first call to the plotting function.

ADDING TEXT TO A PLOT

To add text to an existing plot, use the text() function.

e.g.
> text(x=10, y=2, 'Placebo Group')
uses x- and y-coordinates to place the text.

e.g.
> text(locator(1), 'Experimental Group')
allows you to select the location of the text using the mouse.

Hint: The default positioning of the text is centered at the point you choose. You can change this using the adj= option.

ADDING LEGENDS

If you're making graphs with several sets of data and line types or characters, you'll generally want to provide a legend for clarity. The legend() function does this.

e.g.
> plot(year,series1,pch='*')
> lines(year,series2,lty=2)
> lines(year,series3,type='o',lty=5,pch='+')
> legend(locator(1),c('Series 1','Series 2','Series 3'),
+ pch='* +',lty=c(0,2,5))

Note the deliberate space in the pch= option to indicate there is no plotting character for Series 2.
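The pieces above (axis limits set in the first call, lines(), abline() and legend()) can be combined as in the following sketch. The three series are invented for illustration, the name y.limits is hypothetical, and the legend is placed by coordinates rather than with locator() so that no mouse click is needed. The code was written so that it should run in R as well as in S-PLUS.

```r
# Hypothetical series, invented for illustration only
year <- 1990:1999
series1 <- c(3, 4, 6, 5, 7, 8, 9, 8, 10, 11)
series2 <- series1 + 2
series3 <- series1 - 1

# Compute limits from all three series so that no added line is clipped
y.limits <- range(series1, series2, series3)

# First call sets up the axes, labels and title
plot(year, series1, type = 'l', lty = 1, ylim = y.limits,
     xlab = 'Year', ylab = 'Value', main = 'Three Series')

# Add the other series with different line types
lines(year, series2, lty = 2)
lines(year, series3, lty = 4)

# Least-squares line through the first series
abline(lm(series1 ~ year), lty = 3)

# Legend with its top-left corner at (1990, top of the y-axis)
legend(1990, y.limits[2], c('Series 1', 'Series 2', 'Series 3'),
       lty = c(1, 2, 4))
```

Computing the limits with range() before the first plot() call avoids the out-of-range warning described in the note on adding lines.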
CUSTOM GRAPHICS PARAMETERS

To personalize your plots you have to change the graphics parameters. There are three kinds of parameter.

Layout
* affect the entire page
* can be changed only using par()
* changes last until the next change or until the end of the session
* changes affect only the current device

High-Level
* used only in high-level graphics functions, never inside par()
* changes are only in effect for the function call
e.g. plot(x,y,log='xy')

General
* can be changed either inside a high-level graphics function or inside par()
* if set inside par(), changes stay in effect until the next change or the end of the session
e.g. plot(x,y,lty=4)
     par(lty=4)

Look up the par() function in the online help to get a complete list of graphical parameters.

Introduction to Statistics in S-PLUS

T-TESTS

One-Sample t-Test

Question: Given the data in the vector x, how do I test $H_0: \mu = 44$ vs. $H_1: \mu \neq 44$?
Answer: t.test(x,mu=44)
* The default value for mu is 0.
* The default confidence level is 95%.
* The default action is a two-sided test.

Two-Sample t-Test

Question: Given data in the vectors x and y, how do I test $H_0: \mu_x = \mu_y$ vs. $H_1: \mu_x < \mu_y$?
Answer: t.test(x,y,alternative='less')
* The default assumes equal variances.

Paired t-Test

e.g. t.test(x,y,paired=T)

Note: All these tests produce an object of the class 'htest' containing details of the test and its conclusion.

OTHER HYPOTHESIS TESTS

The following functions also return objects of the class 'htest'.

var.test binom.test prop.test wilcox.test kruskal.test friedman.test cor.test chisq.test fisher.test mcnemar.test mantelhaen.test

SUMMARIZING DATA IN S-PLUS

General Idea: Before beginning modeling, you should examine the data first.

plot()
summary()

These two functions are 'generic' functions in the sense that they can recognize the class of object that they're given and react accordingly.
summary(myframe)
* produces a printed summary of all variables
* produces mean, median, quartiles, extremes for numeric variables
* produces frequency tables (counts per level) for factors
* lists counts of missing values

plot(myframe)
* summarizes the distribution of the variables
* shows a quantile plot for numeric variables
* shows a graph of counts for each level of factors

plot(formula, myframe)
* produces scatter plots of the variables specified in the formula
  e.g. plot(z~x+y, myframe) produces scatter plots of z vs x and z vs y.
* if the left side of the formula is omitted it produces distribution plots
  e.g. plot(~x+y, myframe) produces two distribution plots

pairs(~x+y+z, myframe)
* produces a 'matrix' of scatter plots

scatter.smooth(myframe$z ~ myframe$x)
* plots a scatter plot and overlays a smooth curve using non-parametric regression.

CLASSICAL LINEAR MODELS

lm(formula, dataframe)

e.g. Suppose you have a dataframe called 'myframe' containing variables w,x,y,z. The formula y~x+z+w is interpreted by lm to mean 'y is modeled as a linear combination of x, z, and w plus an intercept'. Mathematically this means y = a + bx + cz + dw. To fit this model in S-PLUS, type
lm(y~x+z+w,myframe)

Note: lm() uses least squares to fit models. It creates an 'lm' object which can be used by other functions to analyze and modify the fit.

The call
fit<-lm(y~x+z+w,myframe)
doesn't produce any output. Instead the object fit holds all the essential information. It is a list with a number of components, for example
fit$coefficients
fit$residuals
fit$fitted.values

You can extract information from fit using the functions
coef(fit)
resid(fit)
fitted(fit)

To print the lm object just type its name. This gives a brief summary of the fit.
For a more technical statistical description, use
summary(fit)

More on the Formula

The individual terms on the right-hand side can be numeric vectors (one coefficient), factors (one coefficient for every level), or numeric matrices (one coefficient for every column). Logical and character vectors are turned into factors. The response cannot be a factor.

The special name "." can be used on the right-hand side to stand for all the variables in the data frame other than the response.
e.g. lm(y~., myframe)

Operators (+ : * ^ / -) have special meaning on the right-hand side. To include terms that use these operators in the usual arithmetic sense, you have to protect them with the identity function I().
e.g. y~I(x+z) has a single predictor = x+z

To omit the intercept, put a term "-1" in the model formula.
e.g. y~x+z-1

If you omit the data argument from the call, S-PLUS will search its path for any variables that it needs. It is often useful to attach a dataframe before fitting.

You can save a formula as an object.
e.g. myform<-formula(y~x+z-1)

UPDATING MODELS

e.g.
1. Fit the initial model:
   fit1 <- lm( y ~ x + z, myframe)
2. Add the predictor w to the model:
   fit3 <- update( fit1, . ~ . + w)
3. Get rid of the predictor w:
   fit.old <- update( fit3, . ~ . - w)
4. Add w^2 to the model:
   fit4 <- update( fit1, . ~ . + I(w^2))
5. Change the response to sqrt(y):
   fit5 <- update( fit1, sqrt(.) ~ .)

OPTIONS TO lm()

1. subset = <index vector indicating rows of data frame>
   This fits the model to only the indicated subset of the data frame.
2. weights = <vector of non-negative weights, same length as the number of rows in the data frame>
   This performs weighted regression.
3. na.action=na.omit
   Drops any rows of the data frame for which any of the variables included in the fit have missing values. (S-PLUS can't deal with missing values in the predictors or response.)
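A short session tying together lm(), update() and the subset= option might look like the sketch below. The data frame is invented for illustration (it reuses the name myframe from the text), and the fit names fit.w, fit.noint and fit.sub are hypothetical. The code was written so that it should run in R as well as in S-PLUS.

```r
# Hypothetical data frame, invented for illustration only
myframe <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  z = c(2, 1, 4, 3, 6, 5, 8, 7),
  w = c(5, 3, 8, 1, 9, 2, 7, 4),
  y = c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8))

# Initial fit: y modeled by x and z plus an intercept
fit1 <- lm(y ~ x + z, myframe)

# Add the predictor w, keeping everything else the same
fit.w <- update(fit1, . ~ . + w)

# Drop the intercept
fit.noint <- update(fit1, . ~ . - 1)

# Fit only to the rows where x exceeds 2
fit.sub <- lm(y ~ x + z, myframe, subset = x > 2)

# Extract the estimated coefficients of the initial fit
b <- coef(fit1)
```

Typing summary(fit.w) afterwards gives the more detailed description of the enlarged model.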
CATEGORICAL VARIABLES AS PREDICTORS

Suppose I'm fitting the model
salary ~ age + gender
Then the lm() function fits dummy variables for the categories of gender.

INTERACTIONS

a) gender:age     Indicates the interaction of gender and age.
b) gender*age     Equivalent to gender + age + gender:age.
c) (x+y+z)^2      Symbolizes all terms involving x,y,z of order 2 or less, i.e. x + y + z + x:y + x:z + y:z

ADDING OR DROPPING TERMS

Another way to explore changes to the lm object:

drop1(fit)
Produces statistics obtained by dropping each term out of the model, one at a time.

add1(fit, c('v','log(v)'))
Produces statistics obtained by adding the indicated terms one at a time.

SUMMARIES OF FITS

plot(fit)             Diagnostics of the fit
summary(fit)          Printed summary of the fit
qqnorm(resid(fit))    Quantile-quantile plot of residuals

DESIGNED EXPERIMENTS AND ANOVA

1. One-Factor Experiments

Given: A data frame with 2 components, a factor holding treatments and a numeric variable holding the responses. The number of rows equals the number of experimental units.

plot.design(myframe)
Plots mean response for each factor level and overall mean response

plot.design(myframe,fun=median)
Plots median responses

plot.factor(myframe)
Shows boxplots of response, one for each factor level

aovfit <- aov(response ~ treatment, myframe)
Runs an analysis of variance

summary(aovfit)                       Displays ANOVA table
fitted(aovfit)                        Returns fitted values
resid(aovfit)                         Returns residuals
hist(resid(aovfit))                   Makes histogram of residuals
qqnorm(resid(aovfit))                 Makes Q-Q plot of residuals
plot(fitted(aovfit),resid(aovfit))    Plots residuals against fitted values

2. Two-Factor Experiments

Given: A data frame with 3 components, a factor holding treatments, a factor holding block, and a numeric variable holding the responses. The number of rows equals the number of experimental units.
plot.design(myframe)
Plots mean response for each level of each factor and overall mean response

plot.factor(myframe)
Shows two sets of boxplots, one for each factor

interaction.plot(treatment,block,response)
Plots the response against treatment for each level of block on the same graph.

aovfit <- aov(response ~ treatment + block, myframe)
Runs an analysis of variance
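As a rough worked example of the two-factor case, the sketch below builds a small randomized block data set (3 treatments in 4 blocks, with invented responses) and runs the analysis of variance. The name treat.means is hypothetical. The code was written so that it should run in R as well as in S-PLUS.

```r
# Hypothetical randomized block data, invented for illustration only:
# 3 treatments (A, B, C) observed once in each of 4 blocks
myframe <- data.frame(
  treatment = factor(rep(c("A", "B", "C"), 4)),
  block = factor(rep(1:4, rep(3, 4))),
  response = c(10, 12, 15, 11, 14, 16, 9, 12, 14, 12, 15, 18))

# Additive two-factor model: treatment effect plus block effect
aovfit <- aov(response ~ treatment + block, myframe)

# ANOVA table
summary(aovfit)

# Mean response for each treatment
treat.means <- tapply(myframe$response, myframe$treatment, mean)

# Residuals against fitted values, to check the fit
plot(fitted(aovfit), resid(aovfit))
```

With 12 observations, 2 degrees of freedom for treatment and 3 for block, the ANOVA table should show 6 residual degrees of freedom.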