Tutorial for Applied Statistics

Wendy Post, Marijtje van Duijn, Anne Boomsma, Mark Huisman
Faculty of Behavioural and Social Sciences
University of Groningen
September 3, 2013

Contents

Preface
1 Introduction to R
  1.1 How to get started with R in the student environment
  1.2 How to execute R commands
  1.3 Datasets, Packages and Libraries
  1.4 How to create and assign values to variables, and how to perform operations on them
  1.5 How to generate data
  1.6 How to create factors
  1.7 How to use script files
2 Input and output files
  2.1 How to check the data and retain output
  2.2 How to work with missing values
3 Descriptive statistics
  3.1 How to load a data file and to attach variable names
  3.2 How to summarize categorical variables
  3.3 How to explore continuous data
    3.3.1 Location measures
    3.3.2 Dispersion measures
    3.3.3 Displaying frequency distributions and outlier detection
    3.3.4 Exploring bivariate relations
4 Simple null hypothesis tests
  4.1 From research question to null hypothesis
  4.2 Descriptive analyses to compare two groups
  4.3 Tests for comparing continuous variables in two groups
  4.4 Tests for comparing categorical variables in two groups
  4.5 Analysis of variance (ANOVA)
    4.5.1 Optional: The order of the factors in ANOVA
    4.5.2 Optional: Multiple comparisons
    4.5.3 Optional: Contrasts
  4.6 Checking assumptions in ANOVA
    4.6.1 Normally distributed data
    4.6.2 Homogeneity of variance
5 Linear models
  5.1 The linear regression model
    5.1.1 Checking assumptions in regression analysis and outlier detection
  5.2 Logistic regression
  5.3 Some remarks on more advanced linear models
References
Appendix A
Appendix B

Preface

Part of the Research Master Behavioral and Social Sciences is learning to work with R, the software that will be used for the statistical analyses in the compulsory course Applied Statistics, which starts in February. R is a software environment for data manipulation, simulation, calculation and graphical display. R analyzes data very effectively and has the graphical capabilities for very sophisticated graphs and displays. R can be used interactively and has the option to execute programs from script files. R is made available through the Internet. It can be downloaded and used for free.
It is also installed on the central computer system of the University of Groningen and can be found under the RuG menu Mathematics & Statistics. Appendix A shows how to download R to a personal computer and how to access it from the central server of the University of Groningen.

Using R is different from working with SPSS, the familiar statistical software package for social scientists. SPSS nowadays works with fixed, preprogrammed statistical procedures (modules) that are accessible through pull-down menus, with syntax files giving some more flexibility. Performing statistical analyses in R, in contrast, is based on the S programming language. The use of R implies structuring one's own statistical analysis with the help of open-source statistical functions. Such functions have been developed by different statisticians all over the world and stored in different statistical libraries: collections of functions for mathematical operations, data manipulation, statistical modeling, graphics, and more. R requires a deeper understanding of statistical procedures on the one hand, while providing more flexibility on the other.

As with any new software, it takes time to learn and master it. This learning process is worth the effort, however, because R is the statistical and graphical tool of the future, as recent statistical publications and textbooks clearly show. Appendix B gives a limited overview of relevant introductory books and sites, focused on statistical analysis and probability theory.

In this tutorial we will sometimes use Crawley (2005), a statistics book using R. Some of the exercises come from this book, as well as some of the data sets that are used. The data sets are freely available from the accompanying website http://www.bio.ic.ac.uk/research/crawley/statistics/.

This tutorial is structured in five practical sessions, treating standard and slightly more advanced statistical techniques and features of R.

Outline of the sessions
1. Introduction to R.
2. Input and output files.
3. Descriptive statistics.
4. Simple null hypothesis tests.
5. Linear models.

Each session ends with a set of assignments in which R commands are practiced, based on a data set constructed in the first session and on data sets coming from Crawley (2005). They were partly prepared by José Piest. The assignments will prepare you for analyzing your own data with R during the Applied Statistics course. During the weeks of the tutorial, the assignments have to be completed and sent to the teaching assistant, who will grade and return them. More information on the course, the tutorial, the assignments and the teaching assistant will be provided during the first lecture and can be found on Nestor.

1 Introduction to R

Outline of the session
1. How to get started with R in the student environment.
2. How to execute R commands.
3. Datasets, packages and libraries.
4. How to create and assign values to variables, and how to perform operations on them.
5. How to generate data.
6. How to create factors.
7. How to use script files.

After completing the assignment at the end of this session, you will have learned how to create a data set that is ready to be analyzed with R.

1.1 How to get started with R in the student environment

To start the R program, click on the Start icon of the lower tool bar of the personal computer, go to the RUG Menu, then to Mathematics & Statistics, and click on the icon R for Windows. A window similar to the one below (the console window) then pops up.

First determine where on the network (i.e., in which directory) you are allowed to work and write. Note that students do not have permission to write on network drive Z:, where the R program is located. Therefore, create your own working directory, for example X:\R. The working directory can be changed by clicking on File and Change Dir in the upper toolbar of R's graphical user interface (GUI).
This produces a window in which one can modify the working directory with the Browse function. After choosing drive location X:\R and clicking the OK button, the working directory has been changed to X:\R. It is very important to change the working directory each time an R session is started. Note that a directory should exist before it can be assigned as working directory. So, if necessary, create a working directory before starting R. Later in this session it will be explained how to use the R function setwd() in a script file (similar to a syntax file); the working directory can then ‘automatically’ be set at the start of an R session.

1.2 How to execute R commands

In the console window, commands can be entered on the line after the prompt >. After pressing Enter, the program executes the command that has been typed. The program then displays the results (if relevant) and is ready for more input once it returns the prompt > again. If a command is too long to fit on a line or if your command is incomplete, the plus sign + appears.

The first objective is to check whether the working directory is now indeed X:\R. To that purpose type the command getwd() after the prompt >, as follows:

> getwd()

getwd() is a function; its name means get working directory. R responds by displaying

[1] "X:/R"

This result confirms that the working directory has been successfully changed.

Help commands are important and frequently used functions. The question mark ? followed by a keyword, i.e., any function or command known to R, provides the available help information. An alternative is the function help(). The general commands are

> ?keyword
> help(keyword)

In case the program indicates that the requested keyword is unknown, it is recommended to use a double question mark ?? followed by a keyword. This function is much easier to apply than the equivalent help.search() function.
Both help functions return all R libraries or packages in which a specific keyword is found. The general command is

> ??keyword

Arrow keys of the keyboard can be used to recover and edit previous commands. For example, by pressing the key ↑ a number of times, previously entered commands will reappear on the display.

One can leave R by closing the large RGui window or by typing q() in the RConsole window. R will respond with the question Save workspace image? It is best to reply with No, thus preventing potential problems in later R sessions (cf. Braun & Murdoch, 2007, p. 31).

1.3 Datasets, Packages and Libraries

R comes with a number of sample datasets that are available for analysis; type data() to see the available datasets. The result of this command depends on the packages that are loaded. To obtain details on a sample data set use the command

> help(data set name)

The data sets that are used in the book of Crawley (2005) can be found on the accompanying website http://www.bio.ic.ac.uk/research/crawley/statistics/.

Packages are collections of R functions, data, and compiled code in a well-defined format. R comes with a standard set of packages, which are stored in a directory called the library (the location of the library can be found by typing .libPaths()). The command

> library()

opens a new window listing all packages that are installed. Once installed, they have to be loaded into the session to be used. By typing search() one can see which packages are currently loaded and ready to use.

Other packages are available for download and installation. A complete list of contributed packages is available from CRAN, containing a large amount of state-of-the-art (statistical) analysis techniques. By clicking on Packages and Install package(s) in the upper toolbar of R's graphical user interface (GUI) a new package can be installed.
The program asks to select a CRAN mirror (e.g., Netherlands (Utrecht)) and the package to be installed. After installation, give the command

> library(package name)

to load it into the current session. Note that a package has to be installed only once, but it has to be loaded (using the given command) in every session in which it is used.

1.4 How to create and assign values to variables, and how to perform operations on them

One of the simplest possible tasks in R is to enter an arithmetic expression (i.e., to use R as a calculator). The R language includes the usual arithmetic operations: addition (+), subtraction (-), multiplication (*), division (/), and powers (^). For example, the following calculations can be performed in R:

> 2 + 5
[1] 7
> 6 - 2
[1] 4
> 4^2 - 3 * 2
[1] 10

Note that the usual arithmetic rules are applied: powers and multiplication come before subtraction.

In addition to these arithmetic operators, R includes many other functions, including functions for statistical analysis. Function arguments have to be specified within parentheses after the function name. For example, to calculate the natural logarithm of 200, that is ln(200), give the command:

> log(200)
[1] 5.298317

Numerous other arithmetic functions and constants are available in R, such as sqrt(x) (the square root of x), abs(x) (the absolute value of x), pi (the number π ≈ 3.141593), exp(x) (the exponential function e^x), and log10(x) (the logarithm of x with base 10). [In the electronic version of this document commands are displayed in red, output in dark blue.]

The function c(), which stands for concatenate (in Dutch ‘aaneenschakelen’), provides a simple way to create variables. This function combines terms together in a numeric vector. A vector is a one-dimensional data structure. Suppose, for example, that estimated intelligence scores of five subjects are available: 104, 140, 125, 89 and 110.
These estimated values can be assigned to a vector variable IQ, and it can be checked whether this operation was successful, by entering two commands:

> IQ <- c(104, 140, 125, 89, 110)
> IQ

R returns the following output:

[1] 104 140 125 89 110

In the first command line, five values are assigned to the object IQ, as a variable is called in R. The assignment operator is an arrow formed by <-. The equals sign = is also allowed, but most users prefer the arrow sign (cf. Braun & Murdoch, 2007, p. 7). The values of object IQ are returned by simply typing its name, as shown in the second command line. Note that even though R displays these numbers in a row, IQ is a column vector.

It should be noted that R makes a distinction between uppercase and lowercase letters: R commands are case-sensitive. If iq were requested instead of IQ, the message Error: object ‘iq’ not found would be displayed.

Exercise 1.1
Manipulating or operating on variable IQ is rather simple. Try the following examples.

1. Add 5 to each value of IQ and assign these values to a new variable IQp5.
> IQp5 <- IQ + 5
> IQp5

2. Subtract 10 from each value of IQ and assign the results to a new variable IQm10.
> IQm10 <- IQ - 10
> IQm10

3. Take the square of each value and assign these values to a new variable IQsq.
> IQsq <- IQ * IQ
or, equivalently,
> IQsq <- IQ^2
> IQsq

4. Take the mean of the variable IQ and assign it to a new variable IQmean.
> IQmean <- mean(IQ)
> IQmean

Apply the functions sqrt() (gives the square root), median() (gives the median), and var() (gives the variance) to the variable IQ.
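The operations in Exercise 1.1 all rely on R applying arithmetic element by element to a vector. The following is a minimal sketch of that behaviour; the vector weights and its values are hypothetical and are not part of the tutorial data.

```r
# Hypothetical example vector (not part of the tutorial data)
weights <- c(60, 72, 81)

weights_plus2 <- weights + 2     # adds 2 to every element
weights_sq    <- weights^2       # squares every element
weights_mean  <- mean(weights)   # a single number: the arithmetic mean

weights_plus2
weights_sq
weights_mean
```

Functions such as mean() reduce the whole vector to one value, whereas operators such as + and ^ return a vector of the same length as the input.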
To get the number of observations of a variable, that is, the number of elements of the vector, the length() function can be used:

> length(IQ)
[1] 5

Other useful functions that can be applied to variables like IQ (i.e., column vectors in R) are sum() (sums all scores of the variable), prod() (calculates the product of all values of the variable), cumsum() and cumprod() (cumulative sums and products), and sort() (sorts the values of the variable).

In R, vectors can contain numbers, character strings, or logical values. This means that vectors (and therefore variables) can be of different types. Some examples are given by the following commands

> a <- c(1,2,5.3,6,-2,4)
> a   # numeric vector
[1] 1.0 2.0 5.3 6.0 -2.0 4.0
> b <- c("one","two","three")
> b   # character vector
[1] "one" "two" "three"
> c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
> c   # logical vector
[1] TRUE TRUE TRUE FALSE TRUE FALSE

Consider the variable IQ, as constructed before. Each element of IQ (i.e., each individual score on the variable) can be manipulated separately. For example, in order to change the value of the second element of IQ to 100, give the command

> IQ[2] <- 100
> IQ
[1] 104 100 125 89 110

To display only the first two elements of IQ enter the command

> IQ[c(1,2)]
[1] 104 100

This command combines different features. The square brackets select elements of the vector. Within these brackets the function c(1,2) defines a vector of length 2 with values 1 and 2. The combination thus produces the first two elements of object IQ. Note that

> IQ[1:2]

produces the same result.

Exercise 1.2
Which command(s) will result in displaying the third and fifth element of IQ?

Note that all variables that are entered in R are stored in the workspace. To see which variables are stored in the workspace, use the function ls() to list them.
Currently, there are

> ls()
[1] "IQ" "IQp5" "IQm10" "IQsq" "IQmean" "IQsqrt" "IQmedian" "IQvar"

To create a vector with a fixed sequence of values the function seq() can be used. Type, for example,

> seq(0, 25, 5)
[1] 0 5 10 15 20 25

The first argument of the function seq() is the starting value, the second is the last value, and the third is the step size or increment. Similarly, a vector with elements (1,2,3,4,5) can be constructed and assigned to object x:

> x <- seq(1, 5, 1); x   # Notice: two commands on one line
[1] 1 2 3 4 5

An alternative, more efficient way of assigning values to the vector x, and having the result returned, is to use the command

> (x <- 1:5)   # Notice: two implicit commands
[1] 1 2 3 4 5

One last type of operator treated in this section is the logical operator, illustrated with two examples. The double equality sign == indicates whether the elements of an object satisfy a certain condition. For example, if one wants to know which elements of IQ are equal to 100, give the command

> IQ == 100
[1] FALSE TRUE FALSE FALSE FALSE

The result shows that only the second element of IQ is equal to 100. The function which() can be used to obtain the positions of the elements of IQ that are equal to 100, as follows

> which(IQ == 100)
[1] 2

The operators > and < can be used to examine inequalities.

Exercise 1.3
Give the command to investigate which elements of IQ are larger than 100, and check whether the result is correct.

Data can be structured in several different ways. Thus far we have used vectors: one-dimensional arrays of numbers, character strings, or logical values (when different types are combined in one vector, R converts all elements to a single type). Note that R treats these vectors as column vectors and that (single) numbers are treated as vectors of length 1 (so-called scalars). Other data structures that are available in R are factors, data frames, and matrices. Factors are one-dimensional arrays of classification levels.
Data frames and matrices are two-dimensional arrays. The former are data tables that consist of a collection of (possibly different types of) vectors, where the rows represent the observations and the columns the variables. Matrices are two-dimensional arrays of elements of the same type (i.e., a matrix of numbers, or a matrix of characters). Examples of all data structures will be given in the remainder of the chapter.

Matrices can be constructed with the function matrix()

> A <- matrix(1:9, 3, 3)

This creates a matrix from the given vector of values, which is assigned to object A here. The vector within the parentheses, 1:9, gives the values of the elements of this matrix, and the last two arguments define the dimensions of the matrix (3 × 3), that is, we have a matrix of 3 rows and 3 columns. By default, the matrix is filled column by column; more information about the function matrix() can, of course, be obtained with the command ?matrix. It is good practice to check the correctness of matrix A

> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

An alternative way to create a matrix is by using the function dim() to define the dimensions of a vector, as follows

> A <- 1:9
> dim(A) <- c(3,3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

By using the argument byrow = TRUE in the function matrix(), the matrix will be filled row-wise rather than column-wise

> A <- matrix(1:9, 3, 3, byrow = TRUE); A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Other useful matrix functions are %*% (matrix multiplication), t() (matrix transpose), det() (determinant of a square matrix), solve() (inverse of a square matrix), and eigen() (eigenvalues and eigenvectors). The function as.matrix() coerces an object into a matrix object.

1.5 How to generate data

In R it is easy to generate random numbers from several probability distributions. The names of the basic functions consist of r (for random) followed by the first letters of the distribution.
For example, if 100 numbers from the standard normal distribution need to be generated, the following command can be applied

> rnorm(100, 0, 1)

The result consists of 100 pseudo-random numbers drawn from the standard normal distribution. They are pseudo-random numbers because the sequence of numbers is deterministic, conditional upon a starting or seed value. The arguments of the function rnorm() are the sample size (the number of draws), the population mean, and the population standard deviation, respectively.

Random numbers from a member of the family of binomial distributions can be generated by the function rbinom(n,k,p), where n is the sample size to be drawn from the binomial distribution with parameters k (the number of experiments) and p (the probability of success). Note that for k=1 the binomial distribution equals a Bernoulli distribution with parameter p.

Exercise 1.4
Generate 10 draws from a binomial distribution with k=4 and success probability 0.7, and 10 draws from a Bernoulli distribution with success probability 0.5.

A random sample from a multinomial distribution can be generated with the function rmultinom(n,k,p), which has the same arguments or parameters as rbinom(n,k,p). Because a multinomial distribution usually has more than two categories, for instance small, medium and large, p is a vector with length equal to the number of categories and with values corresponding to the probability that an outcome falls in each category.

Exercise 1.5
Execute the following command, and interpret the results.

> rmultinom(10, 12, c(0.1, 0.6, 0.3))

What are the differences with the generation of binomial random samples?

To generate a sample from the standard uniform distribution, that is, to draw real numbers within the range [0,1], the function runif(n,0,1) can be used. This function thus generates n pseudo-random real numbers having values uniformly distributed between 0 and 1.
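The text above notes that pseudo-random numbers are deterministic once a seed value is given. The sketch below illustrates what this means in practice, using the function set.seed(); the seed 123 and the sample sizes are arbitrary choices made for illustration only.

```r
# Fix the seed so that the same pseudo-random numbers are drawn each time
set.seed(123)                 # 123 is an arbitrary choice
first <- rnorm(5, 0, 1)       # 5 draws from the standard normal distribution

set.seed(123)                 # reset to the same seed
second <- rnorm(5, 0, 1)      # the same 5 numbers are drawn again

identical(first, second)      # TRUE

# Draws from other distributions follow the same calling pattern
flips <- rbinom(6, 1, 0.3)    # 6 Bernoulli draws, each 0 or 1
u     <- runif(4, 0, 1)       # 4 draws between 0 and 1
```

Without a call to set.seed(), each run of these commands would normally produce different numbers.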
Exercise 1.6
Generate 10 standard uniformly distributed numbers, and evaluate the results.

Not all sample generating functions are available from the standard R packages or libraries (i.e., those available after R has been installed). The function that generates random numbers from a multivariate normal distribution, mvrnorm(), for example, is part of the MASS package (Venables & Ripley, 1999). To load this package, and to subsequently obtain documentation on the function mvrnorm(), use the commands

> library(MASS)
> ?mvrnorm

The mvrnorm() function has three main arguments: the sample size n, the vector of population means mu, and the population covariance matrix (the argument is named Sigma in the function itself; the object passed to it is called sigma below). To generate samples from a bivariate normal distribution, values have to be assigned to the vector mu and to the matrix sigma first. Define the mean vector and the covariance matrix, for example, as follows

> mu <- c(10,20)
> sigma <- matrix(c(10,-3,-3,15), 2, 2)

The function matrix() creates a (2 × 2) matrix from the given set of values, which is assigned to object sigma here. Check the correctness of the matrix sigma:

> sigma
     [,1] [,2]
[1,]   10   -3
[2,]   -3   15

The result shows that sigma[1,1] = 10, sigma[1,2] = sigma[2,1] = -3 and that sigma[2,2] = 15. The diagonal elements of sigma are the population variances of the two random variables: sigma[1,1] = 10 and sigma[2,2] = 15. The off-diagonal elements of sigma are the covariances between the row-column pairs of random variables. For instance, sigma[1,2] is the population covariance between the first and second random variable. The covariance matrix is symmetric, by definition. Recall that the correlation between two variables is defined as their covariance divided by the product of the standard deviations of the two variables.
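As a small worked example of this definition, the population correlation implied by the covariance matrix sigma can be computed directly. This is only a sketch; the object name rho is introduced here for illustration and does not appear in the tutorial.

```r
# Covariance matrix as defined in the text
sigma <- matrix(c(10, -3, -3, 15), 2, 2)

# Correlation = covariance / (sd of variable 1 * sd of variable 2)
rho <- sigma[1, 2] / (sqrt(sigma[1, 1]) * sqrt(sigma[2, 2]))
rho                 # -3 / sqrt(150), approximately -0.245

# cov2cor() converts a whole covariance matrix into a correlation matrix
cov2cor(sigma)
```

The off-diagonal elements of the matrix returned by cov2cor() equal rho, and its diagonal elements equal 1.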
Now that values have been assigned to the arguments or parameters of mvrnorm(), 10 random draws from the specified bivariate normal distribution can be obtained by the command

> y <- mvrnorm(10,mu,sigma)

The result is a sample of 10 pairs of bivariate normally distributed values, stored in the matrix y (10 rows and 2 columns). In the population distribution, the first variable has mean 10 and variance 10, the second variable has mean 20 and variance 15. The population covariance between the two variables equals -3, the population correlation -0.24.

Exercise 1.7
Find out what y looks like. Check whether the variances, covariance and correlation resemble the ‘true’ values in this small sample. Use the commands var(y), cov(y), cov(y[,1],y[,2]), and cor(y). Do the same after taking a (much) larger sample from the bivariate normal distribution. How large does the sample need to be before accurate estimates are obtained?

1.6 How to create factors

In statistics it is very important to distinguish categorical variables such as gender (male and female) from quantitative or numerical variables such as body height and intelligence. In this section some attention will be paid to working with categorical variables. Take, for example, the variable education with three categories: 0="low", 1="middle", and 2="high". If the education scores of five persons are to be stored and assigned to object educ, this can be accomplished by using the numerical values or by using the categories or levels of educ. Take the following example

> educ <- c(1,2,2,0,1); educ
[1] 1 2 2 0 1

Here, the entries of the vector educ have a numerical code or value. R treats the object educ as a numerical variable; the mean of educ can be calculated. (Simply type mean(educ).)

If, for some reason, a researcher wants the object educ to be treated as a categorical variable or factor with, say, three ordinal categories or levels, this has to be specified accordingly.
The function factor() can be used to encode the variable educ as a factor as follows, defining names or labels for the numerical outcomes

> educf <- factor(educ,labels=c("low","middle","high")); educf
[1] middle high   high   low    middle
Levels: low middle high

The numerical variable educ has been transformed to a categorical variable educf; the mean of educf cannot be calculated. (Try typing mean(educf).) The labels are specified by assigning the names of the categories to the argument labels. For some statistical functions and models, the specification of factors is essential for a proper statistical analysis, as will be seen in the next sessions.

1.7 How to use script files

Not only to work efficiently with R on the assignments of the Applied Statistics course, but as a general principle, it is strongly advised to use so-called script files. An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R. Script files contain a list of executable commands, comparable to syntax files in SPSS, which can be edited and saved for future applications.

To use a script file, after having started R, go to File in the upper toolbar and click on New script. A fresh window, Untitled - R editor, then pops up (see the figure on the next page). This editor window can be used to create a script file. Enter the commands into this new window instead of in the console window. The commands in scripts should not be preceded by the R prompt >.

To check whether the commands are correct, it is recommended to execute the commands sequentially. This can be done by marking the first command or a set of commands, clicking on the right mouse button and clicking Run line, or by pressing Ctrl+R. The marked command is then pasted to the R console and executed. Do this for all commands, and when each of them is executed correctly, the script file can be saved, using a file name with extension .R.
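As an illustration, a very small script file might look like the sketch below. The file name and the choice of commands are hypothetical. The working directory is set with setwd(), the function announced in Section 1.1; tempdir() is used here only so that the sketch runs on any computer, whereas in the student environment the call would presumably be setwd("X:/R").

```r
# example_script.R  (hypothetical file name)
# Lines starting with # are comments and are not executed by R.

# Set the working directory at the start of the session.
# tempdir() is an assumption made so that this sketch runs anywhere;
# replace it by your own directory, e.g. setwd("X:/R").
setwd(tempdir())
getwd()

# Commands are written without the prompt >
IQ <- c(104, 140, 125, 89, 110)   # intelligence scores from Section 1.4
IQmean <- mean(IQ)                # mean of the five scores
IQmean
```

Running the script line by line with Ctrl+R gives the same results as typing the commands in the console window.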
In a script file, on each line text after # is ignored. Thus, comments documenting the commands can be easily added, either on the same line or by using lines starting with #.

Assignment 1. How to create a data set

Before starting with the assignment, it is important to have studied the text of this session and to have tried the rather simple exercises. All elements practiced above can be used in this assignment, including the help function.

The objective of the first assignment is to create a data set by generating values of a number of variables, and to store them in a file called mydata.

Data have to be generated for 150 fourth grade children in a primary school, each with a unique identification number (ID). These children are taught arithmetic by one of two instruction methods, either method A or method B (the numerical and factor versions are method and methodf, respectively). The arithmetic skills of the children are measured before the instructions start (the baseline measurement) and after one year, at the end of the fourth grade (arith0 and arith1). The intelligence score at baseline is also measured (IQ). Relevant background variables are gender (sex and sexf) and the highest education (in three categories: low, middle or high) of either parent (educ and educf).

The table below summarizes the variables of the data file to be generated. The properties of the variables necessary for constructing the data set are given in the detailed instructions. Note the research question under study: in general, researchers want to know whether the two instruction methods have a differential effect on the arithmetic scores of pupils in the fourth grade of elementary school.

Detailed instructions

1. It is recommended to set the random seed to a fixed number, the seed, before the data are generated.
By taking such a fixed seed you can replicate the data generation process exactly, that is, if you restart with the same seed and execute the same consecutive commands, you will get exactly the same results. For convenience you might take the last three numbers of your student identification number, for example 631; fixing the seed is then accomplished by the R command set.seed(631).

Variable   Description
ID         identification number
arith0     arithmetic test score at baseline or pretest level
arith1     arithmetic test score at posttest level
IQ         intelligence score
method     arithmetic instruction method (numerical version)
methodf    arithmetic instruction method (factor version)
educ       educational level of parents (numerical version)
educf      educational level of parents (factor version)
sex        gender (numerical version)
sexf       gender (factor version)

2. Create a variable to identify children in the sample data by making a vector of increasing numbers up to 150, the sample size. Give this variable the name ID.

3. Each child has to be randomly assigned to one of the two instruction methods. To that purpose, generate a variable method, indicating whether the child was assigned to method A or method B of arithmetic instruction. Use pseudo-random sampling from a binomial distribution with success probability 0.5. Make two versions of this variable, a numerical version with variable method having values 0 and 1, and a qualitative version with factor variable methodf having factor levels A and B.

4. Do the same for the gender of the child, except that the binomial success probability for boys equals 0.4. Use the variable names sex and sexf for the numerical and factor version, respectively.

5. The highest education of the parents can be generated by sampling from a multinomial distribution with three categories with success probabilities 0.3, 0.5 and 0.2, respectively. The results of this generating process are stored in a matrix. Take a look at this matrix before you proceed.
Note that R labels the rows and columns of a matrix by number (e.g., [3,] is the third row, [,7] is the seventh column). To summarize the matrix in a column vector having the values 0 (low education), 1 (middle) or 2 (high) according to the multinomial probabilities, you have to do some matrix manipulations. To that purpose, you may use the following features: (i) to apply an arithmetic operator to each element of row i of a matrix, you can use matrix[i,], and (ii) to apply arithmetic to each element of column j you can use matrix[,j] and the function t() to transpose the matrix.

Hint. After having inspected the data, work out for the first few rows which number should be assigned to the corresponding elements in the column vector. Then think of how this number can be obtained by summing the row elements, with appropriate weights or multiplication factors. Next apply this procedure to the (columns of the) matrix.

Make a numerical version of this variable named educ, and a factor version named educf.

6. Take a random sample from a bivariate normal distribution. The mean and variance of the first variable, intelligence, abbreviated as IQ, are 100 and 64, respectively. For the second variable, the baseline or pretest arithmetic score arith0, the mean and the variance are 130 and 100, respectively. Let the covariance between the two variables be 40, which implies a moderate correlation of 40 / (8 × 10) = 0.50 between IQ and arith0. First assign the bivariate normal random variables to a matrix, say X. Next, assign the two columns of that matrix X to the variables IQ and arith0, respectively.

7. Generate three normally distributed variables to represent measurement errors, each having population mean 0 and population standard deviation 3. Name these independent random variables e1, e2 and e3.

8.
Finally, compute the arithmetic score after one year of arithmetic instruction, arith1, using the linear equation

arith1 = 0.8 × method + 0.3 × (arith0 + e1) + 1.5 × (IQ + e2) + 0.6 × sex + e3.

9. Store all the variables generated in this exercise, except for the measurement error variables e1, e2 and e3, in a data frame named mydata using the command
> mydata <- data.frame(ID, arith0, arith1, IQ, method, methodf, educ, educf, sex, sexf)

10. To save the data frame mydata in R format to a file, use the command
> save(mydata, file="mydata.Rdata")
This command writes an external representation of the R object mydata to the file mydata.Rdata in the working directory. The extension .Rdata is comparable to .sav in SPSS. For practical reasons it is important to give this file the same name as the data frame. The data can be read back as an R object from this file, using the command load("mydata.Rdata").

Note

• Do not forget to save the script file before leaving the R editor, since the script files will be used again in the next sessions of Applied Statistics. Do not paste incorrect commands into the script, because they would cause problems in the next session. Send the script file to the lecturers after completion of the assignment.

• For any statistical analysis of the generated sample data in subsequent sessions, your personal data frame first has to be loaded into R by using the command load("mydata.Rdata").
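
The ten steps of the detailed instructions can be sketched in R as follows. This is one possible approach under stated assumptions, not a model solution. The helper names n, edu_mat, mu and Sigma are illustrative and do not appear in the assignment. The coding 0 = A and 1 = B for method, the coding 1 = boy for sex, and the factor labels "girl" and "boy" are assumptions, since the assignment does not fix them. The bivariate normal sample is drawn with mvrnorm() from the MASS package, which is assumed to be installed.

```r
# Sketch of steps 1-10; n, edu_mat, mu and Sigma are illustrative names.
library(MASS)                     # provides mvrnorm(); assumed to be installed

set.seed(631)                     # step 1: fixed seed for reproducibility
n <- 150                          # sample size

ID <- 1:n                         # step 2: identification numbers 1..150

# Step 3: random assignment; coding 0 = A, 1 = B is an assumption
method  <- rbinom(n, size = 1, prob = 0.5)
methodf <- factor(method, levels = c(0, 1), labels = c("A", "B"))

# Step 4: gender; coding 1 = boy (probability 0.4) and labels are assumptions
sex  <- rbinom(n, size = 1, prob = 0.4)
sexf <- factor(sex, levels = c(0, 1), labels = c("girl", "boy"))

# Step 5: one multinomial draw per child gives a 3 x n indicator matrix,
# with exactly one 1 in each column
edu_mat <- rmultinom(n, size = 1, prob = c(0.3, 0.5, 0.2))
print(edu_mat[, 1:7])             # inspect the first seven columns
# Weighted sum of the rows: weights 0, 1, 2 for low, middle, high
educ  <- 0 * edu_mat[1, ] + 1 * edu_mat[2, ] + 2 * edu_mat[3, ]
educf <- factor(educ, levels = 0:2, labels = c("low", "middle", "high"))

# Step 6: bivariate normal sample for IQ and arith0
mu    <- c(100, 130)                              # means
Sigma <- matrix(c(64, 40,
                  40, 100), nrow = 2, byrow = TRUE)  # variances and covariance
X      <- mvrnorm(n, mu = mu, Sigma = Sigma)
IQ     <- X[, 1]
arith0 <- X[, 2]

# Step 7: three independent measurement errors with mean 0 and sd 3
e1 <- rnorm(n, mean = 0, sd = 3)
e2 <- rnorm(n, mean = 0, sd = 3)
e3 <- rnorm(n, mean = 0, sd = 3)

# Step 8: posttest score from the linear equation
arith1 <- 0.8 * method + 0.3 * (arith0 + e1) + 1.5 * (IQ + e2) + 0.6 * sex + e3

# Step 9: collect everything except e1, e2, e3 in a data frame
mydata <- data.frame(ID, arith0, arith1, IQ, method, methodf,
                     educ, educf, sex, sexf)

# Step 10: write the data frame to the working directory
save(mydata, file = "mydata.Rdata")
```

Afterwards, commands such as str(mydata), table(mydata$educf) and cor(mydata$IQ, mydata$arith0) give a quick check that the factor levels, category proportions and correlation are roughly in line with the specified population values. The sample values will differ somewhat from those population values because of sampling variation.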
