Chapter 8 – R

Developed and maintained by the Mathematics and Statistics Department of Kennesaw State University

What is R?

Unlike MS Excel, SPSS, and Minitab, yet similar to SAS, R is a command-driven programming environment for statistical analysis. Unlike all of the other software packages we have discussed, which are proprietary (including SAS) and therefore very expensive, R is an open-source program that is free and readily available for download from the internet. Of all the packages, we acknowledge that both R and SAS represent substantial challenges for students. However, like SAS, R is among the most analytically comprehensive and most flexible of the statistical software applications. Furthermore, R is becoming quite popular for quantitative analysis in many fields, including statistics, social science research (Psychology, Sociology, Education, etc.), marketing research, and business intelligence.

R is an implementation of the S programming language, which was originally developed at Bell Labs in the 1970s; S-Plus is a commercial implementation of the same language. Therefore, S-Plus and R code are most often interchangeable, and instructions for one program will usually be applicable to the other.

Obtaining R

Before importing the WidgeOne example data into R and subsequently tackling the core STAT 3010 statistical concepts, we will first discuss how to obtain R. R is available on the Citrix server; however, it is also available for free download from the internet. We recommend that you download R and install it on your local computer so that you will not be negatively impacted by large demand on the server during peak times. Follow these steps to download and install R.

Step 1. The official download site for R is called CRAN (the Comprehensive R Archive Network). Therefore, search for CRAN in your favorite internet search engine (see Figure 8.1). The URL for CRAN is http://cran.r-project.org/

Figure 8.1: Searching for CRAN, the R Website.
Step 2. From the main CRAN page, select "Download R for MacOS X" or "Download R for Windows".

Step 3. Next, select "base" to download the basic R program.

Step 4. Next, select "Download R XX for XX" to download the R installation program (where XX is the version number).

Step 5. Save and then run the R-XX.exe file (where XX is the version number).

Step 6. Follow the steps of the R setup wizard.

R Basics & Orientation

After installing and launching R for the first time, the user is presented with the basic R interface (see Figure 8.2). The main component of this interface is the R Console. This is where the user submits commands to the program AND where R prints the results of those commands. However, commands are often not typed directly into the console, because it is easy to make an error and difficult to re-create later what you did. Therefore, one can also write, develop (debug), and submit R code from a separate savable file called a script. If you are a SAS user, an R script is very much like the SAS programming file that you develop in the Enhanced Editor (i.e., the .sas file). One can create a new script or open an existing script from the File option in the R main menu (see Figure 8.2).
A few important facts about R scripts include:

1) Files with a .R file extension are associated with (and easily recognized by) the R program.

2) R script files, regardless of the file extension, are simple text files. You can open and view them with any text editing software; however, they will only run in R.

3) When you save an R script (it is highly recommended that you SAVE OFTEN, no matter what software package you are using), R does NOT automatically add the .R extension, unlike most software packages. The user actually has to type the .R extension at the end of the file name in the File name field when saving the script file.

We will discuss script files and using them in more detail later on.

Figure 8.2: The R Interface (callout: the R Console).

Figure 8.3: The R Interface.

R as a Calculator

For some, it may be easy to feel overwhelmed by the R environment at first. However, it is simple. At first, just think of R as a super graphing calculator, much like your old Texas Instruments TI-83, but more. Notice that one can simply type mathematical expressions into the R console, hit "Enter", and the result is printed in the R console for your viewing (see Figure 8.4).

Figure 8.4: R as a Calculator.

Much like any good calculator, R makes a large number of mathematical and statistical functions available to the user. Table 8.1 presents a few of these. Figure 8.5 shows some of these examples implemented in R. Try them out. Notice that the main argument in the mathematical functions in Table 8.1 is a single real number.
However, the statistical functions have multiple real numbers as the main argument. This brings up a very important point in understanding how R operates and/or "thinks", if you will. R is often called an object-oriented programming language. This means that all of the data are stored in objects and that all R functions operate on objects. Objects can be single numbers or character strings, a list of numbers or character strings (conceptualized as a vector, but you can simply think of it as a column in a data set), or multiple lists of numbers or character strings (conceptualized as a matrix, very much like a data set in SPSS or SAS and a worksheet in MS Excel). Put these ideas on "hold" for the moment and we will return to them shortly.

Table 8.1: Some Basic Functions in R.

Function/Operation   Description                              Example            Result

Mathematical
+                    Addition                                 3+4                7
-                    Subtraction                              3-4                -1
*                    Multiplication                           3*4                12
/                    Division                                 3/4                0.75
x^2                  The power function                       2^2                4
sqrt(x)              The square root of x                     sqrt(4)            2
log(x)               The natural logarithm of x               log(100)           4.60517
                     (default base of e = 2.718281...)
log(x,base=10)       The logarithm of x (base of 10)          log(100,base=10)   2
exp(x)               The exponential of x                     exp(10)            22026.47
sin(x)               The sine function of x                   sin(100)           -0.50637
cos(x)               The cosine function of x                 cos(100)           0.862319
tan(x)               The tangent function of x                tan(100)           -0.58721
asin(x)              The arc-sine function of x               asin(.5)           0.523599
round(x)             The rounding function                    round(4.60517)     5

Statistical
mean(x)              The mean of x                            mean(c(3,4,5))     4
median(x)            The median of x                          median(c(3,4,5))   4
sd(x)                The standard deviation of x              sd(c(3,4,5))       1
var(x)               The variance of x                        var(c(3,4,5))      1
min(x)               The minimum of x                         min(c(3,4,5))      3
max(x)               The maximum of x                         max(c(3,4,5))      5

Figure 8.5: Basic expressions and functions in R.

For now, realize that, like any good graphing calculator, R can be used to make variable assignments. These assignments allow the user to generalize and re-use code (less typing for us!!). Variable assignment is done using an assignment statement with the "<-" (pronounced "gets") operator. The gets operator is nothing special: it literally is the less-than sign (<) followed immediately by the hyphen (-). Therefore, the statement a <- 4 is read/pronounced "a gets 4".

Now, anytime "a" is used in an expression or function call, 4 is substituted for a. So, "a" is an object. In particular, it is called a scalar (a special mathematical name for a single real number).

Now let's create a list of numbers (a vector). Once again, this is done with the gets operator; however, now we need to tell R that the variable (or object) is a list of numbers. We do this using, think about... yes!: a function (everything in R is performed using functions). In this case it is the concatenate function, or simply "c" for short. For example, let's enter the first five values of the years on the job (YRONJOB) variable from the example WidgeOne data set into a new variable simply named "b", using a statement of the form b <- c(11.10, 11.00, ...). Essentially, this statement reads: "b gets the list of values of 11.10, 11.00...". We hit the "Enter" button on our keyboard after typing this. Notice that we do not get any feedback from R. Nothing happens. This is actually a good thing. If we did it wrong, we would get an error. For example, if we forgot the concatenate function (the "c"), then R would print an error message instead. Not good. So, the fact that we did not get any feedback earlier, when we entered the statement correctly, is ok. The object (or vector or list of values or variable) has been properly saved in R's working memory with the name "b". If we want to actually see it, we must type its name.
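The assignment steps described above can be sketched in a few lines of R. This is only an illustration: the first two values of b come from the text, while the last three are made-up placeholders and should be replaced with the actual YRONJOB values from the WidgeOne data.

```r
# Scalar assignment: "a gets 4"
a <- 4
a * 2          # a is replaced by 4 in any expression that uses it

# Vector assignment with the concatenate function c().
# NOTE: the last three values are made-up placeholders, not WidgeOne data.
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)

b              # typing an object's name prints it
length(b)      # number of elements in the vector
ls()           # names of all objects currently in the workspace
```

Leaving out c() and typing only the comma-separated numbers after the gets operator is a syntax error, which is why no feedback after a correct assignment is good news.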
As a side note, the user can always get a list of all objects currently saved in the workspace using the ls() function.

So, right now, we have two objects saved in the workspace: a and b. Once again, they are saved in R's temporary working memory. If we were to close the program, they would be erased. We will talk about saving a session permanently later on. As another side note, the user can always click on the R console and press "Ctrl+L" to clear the console (when it gets cluttered).

Now, realize that since we have defined b as a list of numbers, we can use the statistical functions in Table 8.1 and specify "b" as the main argument, as in mean(b). This saves us from having to type all the data again! We can also save these values as variables and then use them in subsequent expressions. For instance, we can save the mean of the vector b as a new variable called simply "m" for short: m <- mean(b).

Ok, it is not part of the STAT 3010 curriculum, but you most likely remember Z-scores from elementary statistics (one of the prerequisites for 3010, check your transcripts!). For a refresher, remember that we subtract the mean from each value of the variable of interest and then divide by its standard deviation to get a Z-score for each value. Here is the formula (that I'm sure you know and love): z = (x - mean of x) / (standard deviation of x). We are using this example because it is SO EASY to do in R and really showcases R's power and utility. Check this out: the statement z <- (b - mean(b)) / sd(b) computes the Z-scores, and typing z prints them. It really is that simple. Once again, the first statement reads "a new vector (variable) called z gets the value of b minus the mean of b, divided by the standard deviation of b". You would not believe how difficult this is to do in a SAS DATA step... (of course, there is a special SAS procedure for this; however, it is still WAY too complicated to do in a DATA step...).
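As a minimal sketch of the Z-score calculation just described, the code below uses a placeholder vector b (the values are illustrative, not the actual WidgeOne data). The name s for the saved standard deviation is our assumption; only m appears in the text.

```r
# Placeholder years-on-the-job values (illustrative, not the WidgeOne data)
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)

# Z-scores in one statement; the parentheses around (b - mean(b)) matter,
# because division is evaluated before subtraction
z <- (b - mean(b)) / sd(b)
z

# The same calculation using previously saved summary values
m <- mean(b)   # "m" is the name used in the text
s <- sd(b)     # "s" is an assumed name for the standard deviation
z2 <- (b - m) / s

all.equal(z, z2)   # both routes give the same Z-scores
mean(z)            # essentially 0 (up to rounding error)
sd(z)              # 1
```

Because the subtraction and division are applied to every element of b, z has the same length as b.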
Also, this showcases how R performs operations element-wise. This means that R performs a given operation on each value of a vector separately and produces an entire vector of results whose length (the number of values or elements in a vector) is equal to the length of the input vector. This is true unless specific matrix algebra operations are called, which you do not need to worry about for purposes of this course.

Now pretend that we had already saved both the mean and standard deviation of b before we wanted to calculate the Z-scores. Then the statement to calculate the Z-scores is even simpler, because the saved objects take the place of the calls to mean() and sd().

Note: R is case-sensitive. That means that objects named "m" and "M" are different. For example, typing m prints the mean of b, whereas typing M produces an error. The object m was previously defined as the mean of the vector b. However, the object M has not been previously defined; therefore, R produces an error message.

Working with Scripts

As mentioned previously, typing R code directly into the console is often not the easiest way to work, for a number of reasons. Therefore, we use scripts. Scripts allow the user to develop, debug, and save code for later use during an R session. To open a brand new script, select New Script from the File drop-down menu. A new window will appear within the R session, entitled simply "Untitled - R Editor" (see Figure 8.6). We suggest resizing the script window and placing it side by side with the console (see Figure 8.7: the blank script is on the right). Now you can write R code and double-check it before submitting it to the console.
To submit code to the console from the script, you have two options: 1) highlight the desired piece of code (most often one does not want to submit a whole script at once) and copy and paste it into the console (we suggest using "Ctrl+C" and "Ctrl+V"), or 2) highlight the desired piece of code and press "Ctrl+R". We recommend option #2.

Oftentimes, R users will not write brand-new code for a new project, but will instead work from existing code that they developed in the past. For example, there is a sample script entitled stat.3010.R that contains all the code necessary to perform a full STAT 3010-style analysis of the WidgeOne data. In order to open an existing script, select Open script... from the File drop-down menu, navigate to where the desired script is saved, and either double-click on the file or single-click on the file and then select Open (see Figure 8.8).

Note: you probably noticed that there are several lines in the stat.3010.R file that begin with the hash mark (#). The hash mark in R signifies the beginning of a comment. A comment in typical computer programming is a note to human users that aids in understanding the purpose of the code. Comments are not processed by the computer. In R, a comment begins with a hash mark and continues for the rest of that line.

Figure 8.6: A New Script in R.

Figure 8.7: A Resized Script in R.

Figure 8.8: Opening an Existing Script.

Getting Help in R

Obtaining help documentation in R is rather simple; however, the usefulness of that documentation is debatable.
Because almost everything in R is accomplished using functions, the typical R user will have questions about the use of one or more functions. In order to obtain the help page for a given function, submit one of the two options below to the R console:

help(function-name)
?function-name

For example, to obtain the R help page for the log function, we submit either help(log) or ?log. When either of these commands is submitted to the R console, the appropriate help page is opened in your primary internet browser (however, you do not have to be connected to the internet; R just uses the browser as a document viewer). Our example of the log help page is presented in Figure 8.9.

Figure 8.9: A Typical R Help Page.

Now, as hinted at earlier, the utility of these help pages is debatable. It has been our experience that they are often written for people who already know a great deal about R, and therefore are not very useful to the nascent user. Consequently, it is a good idea to have more "help resources" in your toolbox. The most powerful of these is the official R-Help listserv, and we highly recommend that you use it. You can either browse the existing discussions for a situation like the one you are encountering (see https://stat.ethz.ch/pipermail/r-help/) or email the listserv a specific question or issue that you are dealing with. Most often when you email the listserv, you will obtain top-notch assistance for your specific problem from half a dozen "R professionals" within a short amount of time. For more information about the listserv, go to http://www.r-project.org/mail.html. Warning: be sure to read the posting guide before emailing the listserv (see http://www.r-project.org/posting-guide.html). There are standards for online etiquette.
Saving and Loading a Workspace

There are actually several options here; however, we have a lot to cover, so we are only going to present the easiest approach to saving your work in R and returning to it at a later date. To save your work, left-click on an empty area of the R console so that it becomes active. Next, from the File drop-down menu, select Save Workspace... Now, specify the desired physical location and file name for the file and select Save. This is a very nice function: it saves all objects (data) in the current working memory. Note that a workspace file holds objects only; a script is saved separately, from the script window, as described earlier.

To load or re-start a previously saved R session, yup, you guessed it: launch a new session of R, select Load Workspace... from the File drop-down menu, navigate to the appropriate sub-directory (folder), select the desired file, and select Open.

Getting Data into R (Importing Data)

From what we have discussed so far, it would be possible for us to enter our data into R one column at a time using the concatenate function. However, we have better things to do with our time (besides, if you graduate with a minor in statistics, you will be over-qualified for simple data entry...). So, what do we do? This is R: we use a function! The base R package has a number of functions that can be used to import data. The most common one is read.table(). However, our data (i.e., the WidgeOne example data set) are saved in MS Excel. Admittedly, importing data from Excel is something that R does not do very well. There are some special add-on packages for this task (see xlsx and xlsReadWrite); however, it is our experience that they are not very reliable (in other words, sometimes they work and sometimes they don't...).
However, R is very good at importing nonproprietary file formats (*.txt, *.csv, *.dbf, etc.). Therefore, the most reliable and stable method for importing MS Excel data into R is to open the file in MS Excel, save it as a .csv file (comma-separated values file), and then use the proper function in R to import the .csv file. To save an MS Excel file as a .csv file, do the following:

1) Open the file in MS Excel.
2) Select the Office Button.
3) Select "Save As..." (see Figure 8.10).
4) In the resulting dialog box, select CSV (Comma delimited) (*.csv) from the Save as type drop-down menu (see Figure 8.11).

Figure 8.10: The Save As Option in MS Excel.

Figure 8.11: The CSV File Type in the Save As Type Field.

Now we are ready to submit the proper function call to R to import these data. We could use the read.table() function; however, we would need to customize the function call, and it is easier to use the read.csv() function, which is already customized to import comma-delimited data. The call to read.csv() is presented in the STAT 3010 example script entitled stat.3010.R. We highly recommend that you open that file and follow along with this discussion on your own computer from here on out.

Notice that in the example script, we have specified the pathway (the physical location where the CSV file resides), which is specific and unique to each computer setup. You will need to customize this pathway to your situation. The easiest way to do this is to open My Computer, navigate to the sub-directory (i.e., the folder) where you saved the CSV file, then copy the pathway from the address bar in that dialog box (see Figure 8.12) and paste it into the proper location in the R script.
Note: You have to specify both the pathway and the file name in the call to read.csv(), so when pasting the pathway into the R script, do not highlight the file name. That way you replace only the pathway during this process, not the file name.

Next, and this is VERY IMPORTANT: the backslash character (\) is a special character in R, so after you copy and paste the pathway, you WILL NEED to add a second backslash next to each one for the pathway to be correctly specified in R parlance. In other words, every \ in the pathway needs to become \\.

Once you have made the necessary changes to the call to read.csv(), notice what it does: you are giving R instructions to read in data from the WidgeOne.csv file and save it to an R object named widge (remember, R is case-sensitive, so widge is not the same as Widge or WIDGE). When you are ready, highlight the code and press "Ctrl+R" to submit it to the R console (see Figure 8.13). Notice that the command is copied to the console; however, nothing else happens. This is ok. Most often with assignment statements, no feedback from the console is good news.

Your next step should be to verify that the data were correctly imported into R. The easiest way to do this is to simply view the data. As mentioned previously, we view objects in R by typing their name and pressing the "Enter" key (see Figure 8.14). So far, everything looks good!

Figure 8.12: Copying the Pathway to the Raw Data using the Address Bar in My Computer (callout: the address bar with the pathway highlighted).

Figure 8.13: Importing Data into R.

Figure 8.14: Viewing the WidgeOne Data After Importation in the R Console.
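The import step can be sketched as follows. The Windows pathways in the comments are hypothetical and must be replaced with your own. So that the example runs on any computer, it first writes a tiny made-up CSV file to a temporary folder; the file name, column names, and values are invented for illustration and are not the WidgeOne data.

```r
# On Windows, the call in your script would look something like this
# (hypothetical pathway; note that every backslash is doubled):
#   widge <- read.csv("C:\\Users\\yourname\\Documents\\WidgeOne.csv")
# Forward slashes are also accepted by R and need no doubling:
#   widge <- read.csv("C:/Users/yourname/Documents/WidgeOne.csv")

# Runnable stand-in: write a tiny made-up CSV file, then import it.
csv_path <- file.path(tempdir(), "WidgeOne_demo.csv")
writeLines(c("EmployeeID,Gender,YRONJOB",
             "1,M,11.1",
             "2,F,11.0",
             "3,F,4.5"),
           csv_path)

widge_demo <- read.csv(csv_path)   # the first line is read as column names
widge_demo                         # print the whole data frame to verify it
str(widge_demo)                    # column names and data types
```

Printing the object, or inspecting it with str(), is a quick way to confirm that the columns and data types came through as expected.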
Oftentimes, when you are working with very large data sets, it is not useful to print the entire data set at once. R has a very nice function called head() that, by default, prints only the first 6 rows of data with the corresponding column names.

A few side notes are important to be aware of from here on out:

1) The read.table() and read.csv() functions return a special kind of R object: the data frame. In other words, the widge data as currently saved in R's working memory is a data frame. A data frame can be thought of as a special kind of matrix. A matrix can be thought of as a collection of column vectors (or simply columns of data). However, in R, a matrix must consist of all numeric or all character vectors. Statistical data, however, are most often a combination of both numeric and character data. A data frame, unlike a matrix, may consist of a mixture of numeric and character column vectors: exactly what we need for most statistical applications.

2) Oftentimes we need to work with only parts of a data frame (or matrix, or vector). There are a number of ways to subset objects in R.

a) We may want to perform an operation (using a function!) on just one column of the widge data frame (in other words, just one variable in the WidgeOne data). We may do this using a combination of the data frame name and the column name. The two are delimited by the special character $. For example, earlier we obtained the mean of the first five observations of the variable years on the job (YRONJOB). Now, let's obtain the mean for all N = 40 observations of that variable with mean(widge$YRONJOB). Once again, notice that we delimited the object name from the column name with a $. Try this with any numeric variable in the WidgeOne data.
b) We can perform the same operation using explicit subsetting of the parent data frame (the source of the data, in this case the widge data frame). To do so, we specify the widge data frame name followed by square brackets [ ]. R expects two arguments within the square brackets: the rows to be used and the columns to be used. These are delimited within the brackets by a single comma (,). Furthermore, if we leave one (or both) of these blank, R assumes we want to select all rows and/or all columns. Let's look at some examples.

YRONJOB is the eighth column or variable in the widge data frame (counting from left to right). Therefore, in order to select (in this case print) all N = 40 observations of YRONJOB, we submit widge[,8] to the R console. If we want the mean of YRONJOB for all N = 40, then we submit mean(widge[,8]). Notice that we obtain exactly the same result that we obtained when we specified YRONJOB by name in item a) above.

Now, perhaps we want the mean of only the first five observations of YRONJOB. One option is mean(widge$YRONJOB[1:5]). This instructs R to obtain the mean of YRONJOB for observations 1 through (:) 5. Notice that there is only one argument within the square brackets (there is no comma separating rows and columns). In other words, 1:5 is considered a single row specification by R. Furthermore, because widge$YRONJOB is a single column, we do not need to specify a column number as in the example above, where the object to subset (the widge data frame) had multiple columns.

Alternatively, we could subset the data frame. Here we need to supply both a row and a column argument within the square brackets: mean(widge[1:5,8]). Notice that here we have specified the first five observations (1:5) of the 8th column of widge.
We obtain the same results.

c) Now, we often want to work with variables in R and, let's face it, typing the data frame name along with the $ character is a pain. We can make temporary copies of all columns in an object (either a data frame or matrix) available in R's working memory. Then we can refer to them just by the column name. This is easily done using the attach() function. If we attempt to access the YRONJOB variable BEFORE attaching the widge data frame, R essentially tells us that it does not exist. Now, let's attach the data frame with attach(widge) and attempt to access the data using the exact same call to the column name: this time the values are printed. AFTER attaching the widge data frame, we can obtain the mean of YRONJOB with mean(YRONJOB).

3) We talked about subsetting a moment ago. In a similar vein, you can always obtain the total number of rows and the total number of columns of a data frame or matrix by using the dim() function, as in dim(widge). The dim() function returns an object (i.e., a vector) of length 2: the first element is the total number of rows, the second the total number of columns. Therefore, we now know that the widge data consist of N = 40 employees and 9 characteristics (traits, variables, columns, etc.) for those individuals. The dim() function is appropriate for multi-dimensional arrays (i.e., matrices and data frames). In order to obtain the length of a single column (vector), we use the length() function in like manner, for example length(widge$YRONJOB).

4) Before moving on, you should be aware that R uses the placeholder "NA" for missing data. This is much like a period for missing numeric data in SAS or SPSS. Therefore, do not be alarmed if you see "NA" values peppered throughout your data.
5) We have already discussed how to create a new object using the gets operator. FYI: in order to remove or delete an object from R's working memory, we use the remove() or rm() function (either one works!), for example rm(a). If we want to remove multiple objects at once, we delimit their names with commas inside the call, for example rm(a, b).

Free R Documentation & Manuals

There are a number of free, readily available manuals for R on the internet. We recommend the following:

1) This manual!
2) R for SAS and SPSS Users by Bob Muenchen at: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
3) The Quick-R website at: http://www.statmethods.net/

8.1 Using R for: Measurements of Central Tendency

We have already seen a demonstration of the mean() function. We can obtain the median using the median() function in similar fashion. We can also obtain the mean or median (or any other summary function; see the note below) for multiple variables at once. To do this, we simply specify the appropriate columns from the widge data frame using the subsetting operations we discussed previously.

Note on terminology: Generally in statistics, a summary function is any statistical function that "summarizes" a variable of length N in fewer than N values, usually 1 and at most N-1. Essentially, it is a dimension reduction. Examples include the mean, median, standard deviation, range, quartiles, etc. The use of the term summary function here should not be confused with the actual summary() function in R (the next topic of discussion).
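The multi-column summaries described in this section can be sketched as follows. The small data frame is a made-up stand-in for widge (the Salary column and all values are invented for illustration). Note that in current versions of R, calling mean() directly on a multi-column data frame does not return column means, so this sketch uses sapply() and colMeans(); that choice is ours and is not necessarily the call used in stat.3010.R.

```r
# Made-up stand-in for the widge data frame (names and values are illustrative)
widge_demo <- data.frame(
  Gender  = c("M", "F", "F", "M", "F"),
  YRONJOB = c(11.1, 11.0, 4.5, 7.2, 9.8),
  Salary  = c(52, 61, 40, 47, 58)
)

# One column, selected three equivalent ways
mean(widge_demo$YRONJOB)        # by name with $
mean(widge_demo[, 2])           # by column number
mean(widge_demo[, "YRONJOB"])   # by column name inside [ ]

# Several numeric columns at once (columns 2 and 3 in this stand-in)
col_means   <- sapply(widge_demo[, 2:3], mean)
col_medians <- sapply(widge_demo[, 2:3], median)
col_means
col_medians
colMeans(widge_demo[, 2:3])     # built-in shortcut for column means

dim(widge_demo)                 # number of rows, then number of columns
```

sapply() applies the named function to each selected column in turn, so the same pattern works for sd(), var(), min(), and max().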
However, if we want the measures of central tendency AND other distributional information for several columns at once, then this approach is inefficient. Alternatively, we can use the summary() function, as in summary(widge). This is very nice: we get the mean, median, first and third quartiles, and the minimum and maximum for all numeric variables in the data set, and a basic frequency count for all character variables.

IMPORTANT: WARNING: CAUTION: Notice that R analyzes the Employee ID numbers. Is this an appropriate/meaningful/useful analysis? Obviously the computer does not know any better; however, you, as the analyst, are held to a higher standard. If this is confusing, you need to read Part 2 of the STAT 3010 Supplemental Text for a discussion of identifier variables.

Notice that the TRUE quantitative variables in the WidgeOne data reside in columns 5 through 9 of the widge data frame. Therefore, using what we learned about subsetting objects in the last section, we can obtain summary results for ONLY the quantitative variables with the following call to the summary() function: summary(widge[,5:9]).

A Stratified Analysis in R

Remember that we had a detailed discussion of a stratified analysis in SAS in the STAT 3010 Supplemental Text. Stratified analyses can also be obtained in R using the by() function, with a call of the form by(widge[,8], Gender, mean, na.rm=TRUE). The by() function has four arguments here. The first argument is the numeric vector to be analyzed. In this example, we are interested in estimating the average number of years on the job (YRONJOB). Therefore, we specify the 8th column of the widge data frame, which is the variable YRONJOB. Next, we specify the stratification factor.
Here we want a separate analysis for each of two groups, males and females. Therefore, we specify Gender as the stratification factor. We could have also typed widge[,3] because Gender is the third column vector in the widge data frame. Here, Gender works because we previously attached the widge data frame (we would have received an error otherwise!). Next, we specify which summary function is of interest. Here we instruct R to return the mean. Last, the na.rm argument, which by() passes along to the summary function, instructs R how to deal with missing values. This argument takes one of two values: TRUE (or T) removes missing values of the analysis variable before the summary function is computed, while FALSE (or F) leaves them in (in this case, if missing values exist within a group, R returns NA (missing) as the value of the function for that group). FALSE is the default.

Consider the following call to the by() function. What is being asked? See the end of this chapter for the answer.

Now, in order to obtain frequency tables for categorical variables outside of the summary() function, we use the table() function:

8.2 Using R for: Measurements of Dispersion

Much like the mean() and median() functions, we can obtain measures of dispersion in R. The standard deviation and the variance of a variable are obtained with the sd() and var() functions. Just like with the other summary functions, we can obtain the measures of dispersion for multiple variables at once using subsetting operations on the data frame of interest:

Using R to Categorize a Continuous Variable

Often it is of interest to categorize or create meaningful groups or "bins" out of a continuous variable.
This is often done in applied biomedical and social science research. For example, researchers often take continuous attributes like age, income, etc. and create groups from them. This can easily be accomplished in R using assignment statements with the subsetting operator [ ]. See the example code below.

Here is what is happening with these lines of code. First, we are creating a new column in the widge data frame called Jobten. Moreover, the values of Jobten are conditional on the value of YRONJOB. The first statement essentially says, any row (and remember rows in this data frame represent employees) with a value less than 5 for YRONJOB gets the value "New" for Jobten. Next, any row with a value of YRONJOB between 5 and just less than 10 gets the value "Experienced" for Jobten. Finally, any row with a value greater than or equal to 10 for YRONJOB gets the value "Mature" for Jobten. (REMEMBER, we can reference YRONJOB directly here because we attached the parent data frame (widge); otherwise we would need to specify widge$YRONJOB.)

Now, we double-check our work by printing the data frame:

Notice that the new variable Jobten was added as the 10th column to the widge data frame and the values of Jobten are conditional on the corresponding values of YRONJOB. Look back at the code: We didn't have to type much code in order to do this. R is very efficient at operations like this.

Since we added a new column to widge, we re-attach the data frame so that Jobten is available via column name only, and then we obtain a frequency table of the newly created variable in order to summarize the amount of professional experience of these 40 employees.
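The binning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the data frame is a six-row stand-in for widge with invented YRONJOB values, columns are referenced as widge$YRONJOB instead of relying on attach(), and the line that pre-creates the Jobten column is a defensive addition of this sketch, not something the text describes.

```r
# Hypothetical stand-in for widge: only the column needed here, invented values
widge <- data.frame(YRONJOB = c(1, 4.5, 5, 9.9, 10, 23))

# Defensive step (an addition in this sketch): create the new column first
widge$Jobten <- NA_character_

# Conditional assignment using the subsetting operator [ ]
widge$Jobten[widge$YRONJOB < 5] <- "New"
widge$Jobten[widge$YRONJOB >= 5 & widge$YRONJOB < 10] <- "Experienced"
widge$Jobten[widge$YRONJOB >= 10] <- "Mature"

# Double-check the work by printing the data frame
widge

# Frequency table of the newly created variable
jobten_counts <- table(widge$Jobten)
jobten_counts
```

Note that table() lists the categories alphabetically (Experienced, Mature, New), not in their natural order.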
Notice, as we re-attach the data frame, R gives us a warning that it is copying over the old attached versions of the column vectors.

8.3 Using R for: Visualization/Organization of Univariate Data

Unlike all of the other software packages discussed (with the possible exception of Minitab), R has excellent graphing capabilities and allows the user to create and customize presentation-quality graphics.

To replicate the pie chart developed in Chapter 2, execute the following code:

Notice that nothing happens in the R console; instead, a separate graphics window opens and the pie chart is drawn in the new window.

To replicate the bar chart in Chapter 2, execute the following code. Here we add an informative x-axis label using the xlab argument. This argument can be used in almost every call to an R graphing function.

Now, what do you think Tufte would say about this graphic? Is it appropriate? Is it correct? NO!! Why not? Because the variable Jobten is an ordinal variable and this graphic does not reflect the natural order of the categories. Therefore, more revision is necessary in order to get this right. BTW, read Part 5 of the STAT 3010 Supplemental Text for more information on this topic. In order to specify any variable as an ordinal variable in R, we specify it as an ordered factor. A factor is a special variable type that instructs R that a variable is categorical by nature.
We specify a variable as an ordered factor using the ordered() function:

Notice that the old reference to the variable Jobten is now replaced by:

ordered(Jobten,c("New","Experienced","Mature"))

This is the beauty of R: you don't even have to create a new variable in order to do this (although you could...), and because functions can be called within other functions (this is called nesting, or nested functions), you can do all of this in a few simple lines of code 5. For the ordered() function, the first argument is the input variable that you want to be treated as an ordinal variable. The second argument is a character vector (notice the values are enclosed in quotes and delimited by commas) built using the concatenate function (c()). This character vector communicates the proper order of the ordinal variable values to R. See the resulting figure on the next page.

5 Calling a function within another function call is often done in more advanced R programming. When one function call resides within another function call these are "called" (HA!) nested functions.

The histogram is generated in R using the hist() function.

A simple box plot is generated using the boxplot() function.

Side by side box plots are also generated using the boxplot() function.
However, the structure of the arguments is quite different here. If you want side by side box plots, boxplot() expects an expression of the form "quantitative variable ~ categorical variable (or stratification factor)", where the tilde (~) in R is read as "is modeled as". So, in the example below, we are obtaining side by side box plots of job satisfaction stratified by job position. JOBSAT~POSITION is read as "job satisfaction is modeled as (or by) job position". Notice now we must include the data= argument in the call to boxplot().

Now what would Tufte think about this graphic? Does it stand on its own? No! There are abbreviations (for Hourly & Management in the x-axis tick mark labels)! These abbreviations are an unnecessary source of confusion that should be avoided at all costs. Professional, presentation-quality statistical evidence (usually in the form of tables and graphs) should not be confusing. Instead, it should be clear, concise, easily digestible for the audience, and informative! We can correct this graphic using the following, where we explicitly tell R what we want printed as the x-axis tick mark labels.

8.4 Using R for: Visualization/Organization of Multivariate Data

We can also obtain 2-way contingency tables using the table() function: we simply add another column name as a second argument (and, of course, arguments are delimited by commas). Remember, N-way contingency tables are appropriate for summarizing the joint and marginal distributions of 2 or more categorical variables.
Here notice that the first column will be the row variable (Plant) and the second column will be the column variable (Gender) in the resulting contingency table:

Likewise, we can obtain total percents 6 for the 2-way table above by specifying the table() function as the argument to the prop.table() function. This is an excellent example of nested functions, which we introduced earlier.

6 We still call them percents even though prop.table() returns proportions. REMEMBER: In order to transform a proportion into a percent, simply multiply it by 100.

In order to obtain row percents for this table, we add an optional second argument to the prop.table() function (REMEMBER: You can learn more about prop.table() by submitting either help(prop.table) or ?prop.table to the R console).

Alternatively, you could assign the results of the table() function to an object called t1, for example, and then submit the call to prop.table() using t1 as the first argument:

You can also obtain the column percents in like manner:

The stacked bar chart is an excellent visualization of a 2-way contingency table. Like the simple bar chart, the stacked bar chart is also generated in R using the barplot() function. Notice here that the first argument to this call to barplot() is not the raw widge data, but rather the results of the table() function: another example of nested functions. Notice, also, that a legend is necessary for this graphic to be meaningful, and we are supplying information for the legend to be extracted from the row names of the results of the table() function.
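The contingency-table steps above can be sketched as follows, assuming a small made-up stand-in for widge (the plant labels "A" and "B" and all values are invented). The second argument to prop.table() selects the margin: 1 for row proportions, 2 for column proportions; omitting it gives total proportions. The legend.text argument of barplot() is used here to pull the legend labels from the row names of the table.

```r
# Hypothetical stand-in for widge (values invented for illustration)
widge <- data.frame(
  Plant  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Gender = c("F", "F", "F", "M", "F", "M", "M", "M")
)

# 2-way contingency table: first argument = rows, second = columns
t1 <- table(widge$Plant, widge$Gender)
t1

# Total, row, and column proportions (multiply by 100 for percents)
total_prop <- prop.table(t1)
row_prop   <- prop.table(t1, 1)
col_prop   <- prop.table(t1, 2)
100 * row_prop

# Stacked bar chart built from the table, with a legend from its row names
barplot(t1, xlab = "Gender", legend.text = rownames(t1))
```

Each bar corresponds to a column of the table (here Gender) and each stacked segment to a row (here Plant), which is why the legend is taken from the row names.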
Notice we have abbreviation issues again. Therefore, we do it again, and explicitly tell R what we want printed in the legend using the concatenate function (c()). Realize, however, it is helpful to generate the incorrect graph once so we know for sure the order of the groups in the legend. Then, we refine it and generate a final product appropriate for our audience. Now, we redo it:

Now, this graphic is incorrect for the same reason that our first univariate bar chart was incorrect: It misrepresents the ordered nature of the ordinal variable Jobten. Just like before, we use the ordered() function nested within barplot() to instruct R how to order the categories:

Notice that the old reference to the variable Jobten is now replaced by:

ordered(Jobten,c("New","Experienced","Mature"))

The resulting graphic is printed on the next page.

Now, we are not as draconian about this, but you will notice that the printing of the legend looks a little less than ideal here. We can actually tell R where to print the legend (do this in your assignments and REALLY impress us!). Consider the following code:

Here we are telling R to suppress the printing of the legend through the barplot() function and using a separate call to the legend() function where we have more control. Obtain the R help page on legend() for more details on how this works.
BTW, we figured out the appropriate x and y coordinates for the placement of the legend here just by trial and error. The final graphic is printed on the next page.

Now, we can easily generate a 100% stacked bar chart simply by nesting the table() function within the prop.table() function in the call to barplot() (Yes, there is a lot of nesting going on here. Don't forget a parenthesis!!). So, essentially what we are doing is generating our 100% stacked bar chart from the column percents. The only problem is that prop.table() returns these in the form of proportions, not percents. As a result, the y-axis of our resulting graphic ranges between 0 and 1.0. See the next page.

Now, here is how cool R is: You can actually specify a mathematical expression within the call to barplot(). Therefore, all we have to do to correct this is to multiply the column proportions from prop.table() by 100 WITHIN the call to barplot(). Notice we also added a y-axis label using the ylab argument, and we were forced to change the y coordinate specification (the 2nd argument) in the call to the legend() function.

A scatterplot is generated using the plot() function. The first argument is the x-axis variable, the second the y-axis variable. In order to obviate abbreviations from the start, we use the xlab and ylab arguments to provide proper labeling for the audience.
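A minimal sketch of the last two graphics, the 100% stacked bar chart and the scatterplot, is given below. The data frame is a made-up stand-in for widge, and the choice of Gender as the stacked variable, the axis labels, and the legend labels are assumptions for illustration; the legend is drawn by barplot() itself here, not by a separate legend() call.

```r
# Hypothetical stand-in for widge (values invented for illustration)
widge <- data.frame(
  Gender  = c("F", "F", "M", "F", "M", "M", "F", "M"),
  Jobten  = c("New", "New", "New", "Experienced", "Experienced",
              "Mature", "Mature", "Mature"),
  YRONJOB = c(1, 3, 4, 6, 8, 11, 15, 20),
  JOBSAT  = c(5.5, 6.0, 6.2, 6.8, 7.1, 7.4, 8.0, 8.3)
)

# Ordered factor so the bars follow the natural order of the categories
jobten_ord <- ordered(widge$Jobten, c("New", "Experienced", "Mature"))

# Column percents: nest table() in prop.table() and multiply by 100
col_pct <- 100 * prop.table(table(widge$Gender, jobten_ord), 2)

# 100% stacked bar chart with a y-axis label and spelled-out legend labels
barplot(col_pct, xlab = "Job Tenure", ylab = "Percent",
        legend.text = c("Female", "Male"))

# Scatterplot: first argument is the x-axis variable, second the y-axis variable
plot(widge$YRONJOB, widge$JOBSAT,
     xlab = "Years on the Job", ylab = "Job Satisfaction")
```

The legend labels are listed in the same order as the rows of the table (F, then M), which is why it helps to inspect the table once before relabeling.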
8.5 Using R for: Random Number Generation and Simple Random Sampling

As we have seen through our previous STAT 3010 studies, there is great utility in the ability to generate random numbers, ranging from sampling applications to random assignment of observations and developing computer simulations (ok, simulations are beyond the scope of 3010, but you will encounter these if you continue on your journey in studying statistics). R is extraordinarily effective and efficient as a random number generator. Like the other packages, R uses the computer clock time as the default seed for all random number functions. To generate uniformly distributed random numbers, we use the runif() function:

In the example above, we generate 40 random numbers and store them in the vector named Ran and then print them to the console 7. The runif() function has one mandatory argument, the number of random numbers to generate. The default is to generate numbers between 0 and 1 (which is nice). Pretend for a moment that we really wanted a set of N = 40 random whole numbers that varied between 0 and 100. We could obtain this by multiplying Ran by 100 and using the round() function in order to round the numbers to the nearest whole number (here we named the result R100, but this is completely arbitrary):

7 Obviously, you should not expect to obtain the same exact results as we do here due to the use of the default seed.

Verify that the seed is set to the clock time by re-submitting the same code. You should obtain different values for your N = 40 generated numbers.
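The two steps just described can be sketched as follows, using the object names Ran and R100 from the text. No seed is set, so the printed values will differ from run to run.

```r
# 40 uniformly distributed random numbers between 0 and 1 (the default range)
Ran <- runif(40)
Ran

# Rescale to 0-100 and round to the nearest whole number
R100 <- round(Ran * 100)
R100
```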
Next, use the set.seed() function to set the seed so that you can obtain the same exact results at a later date (this is often desirable):

Note: it is necessary to call the set.seed() function with the same initial value (here the value 974) before each new call to the random number generating function. Also, it is important to keep in mind that set.seed() only uses the integer portion of the initial seed value. Therefore, if a fractional value is supplied to the function, set.seed() automatically truncates it to an integer (be mindful!).

Now if we desired to create statistically independent groups from the WidgeOne data, we use simple assignment statements much like we did when we categorized a continuous variable. Notice that the first assignment statement in the example above reads "the new variable Group appended to the widge data frame gets a value of 1 if the associated random number is less than .5". The second statement is read in similar manner. We then print the results in order to confirm the effectiveness of our code.

After performing random group assignment, it is often desirable to sort the data by the new group membership. This is easily done in R using the order() function specified within the square bracket operators:

Notice that we began by re-attaching the data frame (otherwise we would have to specify widge$Group instead of simply Group when referencing the new group membership). In this example, we not only sorted the data by group membership, but then within groups we sorted by employee ID. Notice that we created a new version of the widge data frame (widge2) that is sorted.
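The seeding, group assignment, and sorting steps can be sketched as follows. This is an illustration under stated assumptions: the data frame is a six-row stand-in for widge, the column name EmpID and its values are invented, columns are referenced with widge$ instead of attach(), and the line that pre-creates the Group column is a defensive addition of this sketch.

```r
# Same seed before each call gives the same random numbers
set.seed(974)
first_draw <- runif(5)
set.seed(974)
second_draw <- runif(5)
identical(first_draw, second_draw)

# Hypothetical stand-in for widge (employee IDs invented for illustration)
widge <- data.frame(EmpID = c(104, 101, 103, 102, 106, 105))

# One random number per employee
set.seed(974)
Ran <- runif(nrow(widge))

# Random assignment to two groups
widge$Group <- NA           # defensive step: create the column first
widge$Group[Ran < .5] <- 1
widge$Group[Ran >= .5] <- 2
widge

# Sort rows by Group, then by employee ID within Group (ascending by default)
widge2 <- widge[order(widge$Group, widge$EmpID), ]
widge2
```

The call to order() sits before the comma inside the square brackets, so it rearranges rows and leaves the columns as they are.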
The operative statement reads something like "a new data frame named widge2 gets the old version of widge after it is sorted in ascending order (the default) by Group and then by employee ID within Group". Also, it is important to realize that the order() function is called in the area within the square brackets that is associated with rows. Therefore, we are sorting rows, not columns. Packages like MS Excel, SPSS, and SAS only allow for sorting of rows in this way; R, however, is much more flexible in this regard.

In order to obtain a simple random sample of the WidgeOne data, we use: Yes! the sample() function! In the example below, we desire to sample the rows of the parent data frame, so the sample() function is specified just like the order() function in the example above (i.e., in the row area within the square brackets):

Here we create a new data frame named sam1. The first argument of the sample() function is the row numbers of the parent object to sample from. Therefore, we want to sample from 1 through 40 (the nrow() function returns the number of rows of a 2-dimensional R object, like a matrix or data frame). The second argument is the size of the sample. So in this example, we want a sample of 30 employees from the original data containing N = 40 employees. Finally, we specify not to perform sampling with replacement so that the same employee cannot be chosen twice for inclusion in the sample.

8.6 Using R for: Confidence Intervals

Unlike SPSS and SAS, we are not aware of a "canned" (i.e., ready-made) function in R that calculates confidence intervals (CIs) for the user. HOWEVER, this is a great opportunity to showcase how easily this kind of thing can be done with a little bit of user-generated code.
The following code performs the CI calculation and generates a little report:

There is a lot going on here. First, we set the alpha level to .05 which, of course, corresponds with a 95% confidence level. Notice that alpha is not a function or an argument to a function. Here it is a simple user-defined (which means that we made it up...) R object (in this case, it is a scalar). Then, we count the number of non-missing values of the vector JOBSAT. FYI: The is.na() function returns a logical vector of the same length as the input vector with a TRUE or FALSE for each element answering the question "is this value/element missing (NA)?":

We obtained 40 FALSE's because there are no missing values of the variable JOBSAT. Now, the sum() function works here because, just like SAS, R interprets TRUE as 1 and FALSE as 0. So, sum(is.na(JOBSAT)) counts the number of missing values in the JOBSAT vector. Now, we want the number of non-missing values, so we add the ! operator to the expression. The ! operator means NOT in R.

As a result, we are now counting the number of non-missing values of the input vector. Of course, we use this information to determine the degrees of freedom in the calculation of the margin of error of these CIs. Next, we calculate the lower confidence limit for the mean (lclm) using a number of R functions (e.g., round(), mean(), sd(), sqrt(), and qt()). Thus far, we have discussed all of these except qt(). Like any good statistical package, R contains a number of functions to obtain values of reference statistical distributions (like the normal, t, chi-square, and F-distributions). The qt() function returns the appropriate quantile from Student's t-distribution given a probability value (here, 1-alpha/2) and the correct degrees of freedom (here, n-1).
We then do the same for the upper limit of this interval. Next, we calculate the associated sample mean value. Finally, we use the cbind() (short for column bind) function to "paste" or bind the four computed scalars into a little matrix (with only 1 row, sort of like a row vector) for ease of printing and viewing. This is very much like the output one would obtain from SAS; however, we customized it to exactly the information we wanted. REMEMBER: When reporting CIs ALWAYS, ALWAYS, ALWAYS provide the appropriate interpretation of the results. For example, “Based on a representative sample of 40 employees, we are 95% confident that job satisfaction for all employees is between 6.53 and 7.17”.

8.7 R Lagniappe

Part 1: Writing Your Own Functions

As we have seen so far, R's utility and power are a result of its efficiency and the ease of customizing your own programs and results. Let's take this a step further. We introduced and discussed several functions that are available to the user through the base package. Additionally, there are a number of add-on packages that allow you to use functions that other users have written and developed (see http://www.statmethods.net/interface/packages.html for more information on R packages). Now, we can also write functions of our own...cool. Let's use our code for generating CIs in the previous section. What if we could generalize and package that code so that all the user had to do is type 1 line of code to call all of our source code and compute and print the CIs for any variable they want? It's actually pretty easy to do in R (If you are a SAS user, this would be like writing your own procedure; however, that is not an option in SAS). How do we write our own function? This is R!: We use a function! And, in this case, it is actually called function (ok, we did not mean to be confusing here...).
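A minimal sketch of how the CI calculation from section 8.6 might be wrapped in a user-written function is shown below. It follows the ingredients named in the text (sum(!is.na()), mean(), sd(), sqrt(), qt(), round(), cbind(), and the names lclm and uclm), but the exact layout is an assumption, the name me for the margin of error and the fixed 2-decimal rounding are choices made for this sketch, and the JOBSAT values are invented.

```r
# CI: t-based confidence interval for a mean, returned as a 1-row matrix
CI <- function(x, alpha) {
  n    <- sum(!is.na(x))            # number of non-missing values
  xbar <- mean(x, na.rm = TRUE)     # sample mean
  # margin of error: t quantile times the standard error of the mean
  me   <- qt(1 - alpha / 2, n - 1) * sd(x, na.rm = TRUE) / sqrt(n)
  lclm <- round(xbar - me, 2)       # lower confidence limit for the mean
  uclm <- round(xbar + me, 2)       # upper confidence limit for the mean
  cbind(n, mean = round(xbar, 2), lclm, uclm, me = round(me, 2))
}

# Hypothetical job satisfaction scores (invented, with one missing value)
JOBSAT <- c(6.5, 7.0, 8.2, 5.9, 7.4, NA, 6.8)

# 95% confidence interval (alpha = .05)
ci_jobsat <- CI(JOBSAT, .05)
ci_jobsat
```

As a cross-check that is not part of the text, the limits returned here should agree, up to rounding, with the interval reported by t.test(JOBSAT)$conf.int.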
Check this out:

So, here CI gets or is defined as a function (not a data object like a vector or a data frame!) with 2 arguments: x (which we assume is a continuous random variable 8) and alpha, the significance level associated with the desired confidence level. Then the curly braces are used to instruct R that everything within the braces is the body of the function. Notice we made some slight changes (added a field for the variable name, the confidence level, and the margin of error (me)). Now, after we define the function, all we or anyone else with this function loaded into their R session has to do is call the CI function while supplying the appropriate information for the 2 arguments, and the function returns the desired confidence limits and all the information associated with them. We provide 3 instances of calling the function and obtaining the results in the example below. Pretty sweet!

8 Here the term "random variable" is used as it is used in statistical theory: "random variable" or stochastic variable refers to a variable whose value results from a measurement on some type of random process. It should not be confused with random number generation, the topic of the previous section.

Now, let's put this in hyper-drive. Let's add a default value to the alpha argument and another argument, an optional argument, that allows the user to specify the decimal precision of the results (i.e., the number of decimal places used in the results). Here we add alpha=.05. Then .05 becomes the default value of alpha. The user can change it; however, if they don't specify anything, they get 95% CIs (just like SAS!).
Also we add the dec=3 specification in the call to the function() function (HA!) and replace the value of the digits argument with dec. Now look at the sample calls to this function on the next page.

Part 2: Outputting Results from R

Ok, if one were working on a... homework, for example, one may desire to output the results they receive to a format that can easily be used in a homework document. We will discuss options for outputting two kinds of R results: 1) tabular output and 2) graphics.

Outputting Tabular Output in R

Arguably, this is another major shortcoming of R: There is no function at the present time that allows the user to easily create properly formatted tables from R output 9. The best way to create presentation-quality tables from R output is to copy and paste the results from the console into MS EXCEL and then properly format the tabular information in EXCEL (e.g., adding titles, table lines, replacing abbreviations, etc.). Unfortunately, even this approach requires several steps.

9 In other words, there is no analog to SAS's ODS RTF statement in R.

1) Highlight and copy output from the R console.
2) Paste R output into MS EXCEL. Unfortunately, these "pastes" are often pasted into a single cell in EXCEL. Therefore, the user will often have to use the Text to Columns function in the Data tab.
3) Select Finish from the resulting dialog box.
4) The information is now separated into separate columns.
5) Next, use basic MS EXCEL functionality to properly format the table.
6) Finally, copy and paste this formatted table into a word processing document.

Outputting Graphics in R

The easiest way (this is R, so there are a number of ways to do this) to incorporate a graphic generated in R into a word processing document (we are going to use MS Word for this example) is to instruct R to generate a JPEG (*.jpg) image file of the graphic of interest and then insert that image into MS Word. First, to instruct R to generate the JPEG file, we use the jpeg() function:

Notice that we specify only 1 argument in the jpeg() function: the physical location (pathway) and filename in quotes. If you copy and paste the pathway from My Computer (as discussed at the very beginning of this chapter), REMEMBER you will need to change the single backslashes (\) to double backslashes (\\) for R to read the pathway correctly. Next, specify the desired call to an R graphing function (we highly recommend that you develop, debug, and confirm that this function call is error-free BEFORE attempting to use it to generate the JPEG file). It is best to specify only 1 graphics call with each call to the jpeg() function. Next, we turn off the jpeg output stream using the dev.off() function.
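The three steps above (open the JPEG device, draw one graphic, close the device) can be sketched as follows. To keep the sketch portable, it writes to R's temporary directory instead of a hand-typed Windows pathway; the file name scatterplot.jpg and the plotted values are invented for illustration.

```r
# Where to save the image: a file in R's temporary directory (an assumption
# of this sketch; in practice, supply your own pathway and filename in quotes)
out_file <- file.path(tempdir(), "scatterplot.jpg")

# 1) Open the JPEG output stream
jpeg(out_file)

# 2) One call to a graphing function (invented values for illustration)
plot(c(2, 7, 12, 4, 9), c(6.5, 7.0, 8.2, 5.9, 7.4),
     xlab = "Years on the Job", ylab = "Job Satisfaction")

# 3) Turn off the JPEG output stream so the file is written and closed
dev.off()

# Confirm that the file now exists
file.exists(out_file)
```

Until dev.off() is called, the image file may be incomplete, so the closing call should not be skipped.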
One can view the contents of the sub-directory (folder) where the file was saved using My Computer (in Windows) (This is optional):

The *.jpg file is a simple image file which is accessible via a number of software packages. When we double-click on the file, the file opens using HP MediaSmart Photo (this depends on the particular software that is set as the default image-viewing software on your machine).

To insert this image into MS Word, navigate to the location within the Word document where you want the image and select the Insert tab. From the Insert tab, select Picture in the Illustrations group:

In the resulting Insert Picture dialog box, navigate to the sub-directory where you instructed R to save the JPEG file, select the file, and select the Insert button.

The image is inserted into the document.

It is usually most appropriate to center these. With the graphic highlighted (the default after inserting it), navigate to the Home tab.

Select the Center button in the Paragraph group. Note: The lines around the graphic will disappear when you select a null space in the document.

Nice = presentation quality.

8.8 R Chapter Answers (Actually there is only one...)
This call to by() requests a stratified analysis of the standard deviation of the Productivity Scores by Plant while removing rows with missing values.

Congratulations. You are now a Geek. Take a bow.