Tutorial for Applied Statistics

Wendy Post
Marijtje van Duijn
Anne Boomsma
Mark Huisman
Faculty of Behavioural and Social Sciences
University of Groningen
September 3, 2013
Contents
Preface
1 Introduction to R
  1.1 How to get started with R in the student environment
  1.2 How to execute R commands
  1.3 Datasets, Packages and Libraries
  1.4 How to create and assign values to variables, and how to perform operations on them
  1.5 How to generate data
  1.6 How to create factors
  1.7 How to use script files
2 Input and output files
  2.1 How to check the data and retain output
  2.2 How to work with missing values
3 Descriptive statistics
  3.1 How to load a data file and to attach variable names
  3.2 How to summarize categorical variables
  3.3 How to explore continuous data
    3.3.1 Location measures
    3.3.2 Dispersion measures
    3.3.3 Displaying frequency distributions and outlier detection
    3.3.4 Exploring bivariate relations
4 Simple null hypothesis tests
  4.1 From research question to null hypothesis
  4.2 Descriptive analyses to compare two groups
  4.3 Tests for comparing continuous variables in two groups
  4.4 Tests for comparing categorical variables in two groups
  4.5 Analysis of variance (ANOVA)
    4.5.1 Optional: The order of the factors in ANOVA
    4.5.2 Optional: Multiple comparisons
    4.5.3 Optional: Contrasts
  4.6 Checking assumptions in ANOVA
    4.6.1 Normally distributed data
    4.6.2 Homogeneity of variance
5 Linear models
  5.1 The linear regression model
    5.1.1 Checking assumptions in regression analysis and outlier detection
  5.2 Logistic regression
  5.3 Some remarks on more advanced linear models
References
Appendix A
Appendix B
R Tutorial for Applied Statistics
Preface
Part of the Research Master Behavioral and Social Sciences is learning to work with R, the software that will be used for the statistical analyses in the compulsory course Applied Statistics starting in
February. R is a software environment for data manipulation, simulation, calculation and graphical
display. R analyzes data very effectively and it has the graphical capabilities for very sophisticated
graphs and displays. R can be used interactively and has the option to execute programs from script
files. R is made available through the Internet. It can be downloaded and used for free. It is also
installed on the central computer system of the University of Groningen and can be found under the
RuG menu Mathematics & Statistics. In Appendix A it is shown how to download R to a personal
computer and how to access it from the central server of the University of Groningen.
Using R software is different from working with SPSS, the familiar statistical software package for
social scientists. While SPSS nowadays works with fixed, preprogrammed statistical procedures
(modules), accessible through pull-down menus, and using syntax files giving some more flexibility,
performing statistical analyses in R is based on the S programming language. The use of R implies
structuring one’s own statistical analysis with the help of open-source statistical functions. Such
functions have been developed by different statisticians all over the world and stored in different statistical libraries, collections of functions for mathematical operations, data manipulation, statistical
modeling, graphics, and more.
R requires a deeper understanding of statistical procedures on the one hand, while providing more
flexibility on the other. As with any new software, it takes time to learn and master it. This learning
process is worth the effort, however, because R is the statistical and graphical tool of the future, as
recent statistical publications and textbooks clearly show. Appendix B gives a limited overview of
relevant introductory books and sites, focused on statistical analysis and probability theory. In this
tutorial we will sometimes use Crawley (2005), a statistics book using R. Some of the exercises come
from this book, as well as some of the data sets that are used. The data sets are freely available
from the accompanying website http://www.bio.ic.ac.uk/research/crawley/statistics/.
This tutorial is structured in five practical sessions, treating standard and slightly more advanced
statistical techniques and features of R.
Outline of the sessions
1. Introduction to R.
2. Input and output files.
3. Descriptive statistics.
4. Simple null hypothesis tests.
5. Linear models.
Each session ends with a set of assignments in which R commands are practiced, based on a data
set constructed in the first session and on data sets from Crawley (2005). They were partly
prepared by José Piest. The assignments will prepare you for analyzing your own data with R during the
Applied Statistics course. During the weeks of the tutorial, the assignments have to be completed
and sent to the teaching assistant, who will grade and return them. More information
on the course, the tutorial, the assignments and the teaching assistant will be provided during the first
lecture and can be found on Nestor.
1 Introduction to R
Outline of the session
1. How to get started with R in the student environment.
2. How to execute R commands.
3. Datasets, packages and libraries.
4. How to create and assign values to variables, and how to perform operations on them.
5. How to generate data.
6. How to create factors.
7. How to use script files.
After completing the assignment at the end of this session, you will have learned how to create a
data set that is ready to be analyzed with R.
1.1 How to get started with R in the student environment
To start the R program, click on the Start icon of the lower tool bar of the personal computer, go
to the RUG Menu, to Mathematics & Statistics, and click on the icon R for Windows. A window
similar to the one below (the console window) then pops up.
First determine where on the network (i.e., in which directory) you are allowed to work and write.
Note that students do not have permission to write on network drive Z: where the R program is
located. Therefore, create your own working directory, for example X:\R.
The working directory can be changed by clicking on File and Change Dir of the upper toolbar of
R’s graphical user interface (GUI). This produces a window in which one can modify the working
directory with the Browse function.
After choosing drive location X:\R and clicking the OK button, the working directory has been
changed to X:\R. It is very important to change the working directory each time an R session is
started. Note that a directory should exist before it can be assigned as working directory. So,
if necessary, create a working directory before starting R.
Later in this session it will be explained how to use the R function setwd() in a script file (similar
to a syntax file); the working directory can then ‘automatically’ be set at the start of an R session.
1.2 How to execute R commands
In the console window, commands can be entered on the line after the prompt >. After pressing
Enter, the program executes the command that has been typed. The program then displays the
results (if relevant) and is ready for more input after it returns the prompt > again. If a command
is too long to fit on a line or if your command is incomplete, the plus sign + appears.
The first objective is to check whether the working directory is now indeed X:\R. To that purpose
type the command getwd() after the prompt >, as follows:
> getwd()
getwd() is a function, and it means get working directory. The R program returns by displaying
[1] "X:/R"
This result confirms that the working directory has been successfully changed.
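As a minimal sketch of the same check done entirely from the console, the function setwd() (mentioned in Section 1.1) sets the working directory and getwd() reports it. The directory used here is R's temporary session directory, returned by tempdir(); it is an assumption chosen only so that the example runs on any computer. On the university system X:/R would be used instead.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is used purely for illustration; replace it by, e.g., "X:/R".
work_dir <- tempdir()
setwd(work_dir)

# getwd() now reports the new working directory.
current_dir <- getwd()
print(current_dir)

# Restore the original working directory.
setwd(old_dir)
```

Note that setwd() fails with an error when the directory does not exist, which is why the directory has to be created first.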
Rather important and frequently used functions are help commands. The question mark ? followed
by a keyword, i.e., any function or command known to R, provides available help information. An
alternative help function is the function help(). The general commands are
> ?keyword
> help(keyword)
In case the program indicates that the requested keyword is unknown, it is recommended to use a
double question mark ?? followed by a keyword. This function is much easier to apply than the
equivalent help.search() function. Both help functions return all R libraries or packages in which
a specific keyword is found. The general command is
> ??keyword
Arrow keys of the keyboard can be used to recover and edit previous commands. For example, by
pressing the key ↑ a number of times, previously entered commands will reappear on the display.
One can leave R by closing the large RGui window or by typing q() in the R Console window. R
will respond with the question Save workspace image? It is best to reply with No, thus preventing
potential problems in later R sessions (cf. Braun & Murdoch, 2007, p. 31).
1.3 Datasets, Packages and Libraries
R comes with a number of sample datasets that are available for analysis; type data() to see the
available datasets. The result of this command depends on the packages that are loaded. To obtain
details on a sample data set use the command
> help(data set name)
The data sets that are used in the book of Crawley (2005) can be found on the accompanying
website http://www.bio.ic.ac.uk/research/crawley/statistics/.
Packages are collections of R functions, data, and compiled code in a well-defined format. R comes
with a standard set of packages, which are stored in a directory called the library (the location of
the library can be found by typing .libPaths()). The command
> library()
opens a new window listing all packages that are installed. Once installed, they have to be loaded
into the session to be used. By typing search() one can see which packages are currently loaded
and ready to use.
Other packages are available for download and installation. A complete list of contributed packages
is available from CRAN, containing a large number of state-of-the-art (statistical) analysis techniques. By clicking on Packages and Install package(s) of the upper toolbar of R’s graphical user
interface (GUI) a new package can be installed. The program asks to select a CRAN mirror (e.g.,
Netherlands (Utrecht)) and the package to be installed. After installation, give the command
> library(package name)
to load it into the current session. Note that a package has to be installed only once, but for every
session in which it is used the package has to be loaded (using the given command).
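The install-once, load-every-session pattern can also be written as commands, which is convenient in script files. The sketch below is one possible way to do this, not the only one: requireNamespace() and install.packages() are standard R functions that are not discussed in the text above, and MASS (used in Section 1.5) serves only as an example package name.

```r
# Name of the package to use; MASS is only an example.
pkg <- "MASS"

# Install the package only if it is not yet present in the library.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # may ask to select a CRAN mirror
}

# Load the package into the current session.
library(pkg, character.only = TRUE)

# Confirm that the package now appears on the search path.
is_loaded <- paste0("package:", pkg) %in% search()
print(is_loaded)
```

The argument character.only = TRUE is needed here because the package name is stored in a variable; when typing the name directly, library(MASS) suffices.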
1.4 How to create and assign values to variables, and how to perform operations on them
One of the simplest possible tasks in R is to enter an arithmetic expression (i.e., use R as a calculator). The R language includes the usual arithmetic operations: addition (+), subtraction (-),
multiplication (*), division (/), powers (^). For example, the following calculations can be performed
in R:
> 2 + 5
[1] 7
> 6 - 2
[1] 4
> 4^2 - 3 * 2
[1] 10
Note that the usual arithmetic rules are applied: powers are evaluated first, then multiplication, then subtraction.
In addition to these arithmetic operators, R includes many other functions, including functions for
statistical analysis. Function arguments have to be specified within parentheses after the function
name. For example, to calculate the natural logarithm of 200, that is ln(200), give the command:
> log(200)
[1] 5.298317
Numerous other arithmetic functions are available in R, such as sqrt(x) (the square root of x),
abs(x) (the absolute value of x), exp(x) (the exponential function e^x), and log10(x) (the logarithm
of x with base 10); the constant pi holds the number π = 3.141593. [In the electronic version of this
document commands are displayed in red, output in dark blue.]
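The following short sketch tries out the functions just listed. The input values are arbitrary and chosen so that the results are easy to verify by hand.

```r
x <- 16

root       <- sqrt(x)       # square root: 4
absolute   <- abs(-x)       # absolute value: 16
log_base10 <- log10(1000)   # logarithm with base 10: 3
back       <- log(exp(2))   # exp() and log() are inverse functions: 2
circle     <- pi * 2^2      # area of a circle with radius 2

print(c(root, absolute, log_base10, back, circle))
```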
The function c(), which stands for concatenate (in Dutch ‘aaneenschakelen’), provides a simple
way to create variables. This function combines terms together in a numeric vector. A vector is
a one-dimensional data structure. Suppose, for example, that estimated intelligence scores of five
subjects are available: 104, 140, 125, 89 and 110. These estimated values can be assigned to a
vector variable IQ, and it can be checked whether this operation was successful, by entering two
commands:
> IQ <- c(104, 140, 125, 89, 110)
> IQ
R returns the following output:
[1] 104 140 125 89 110
In the first command line, five values are assigned to the object IQ, as a variable is called in R. The
assignment operator is an arrow formed by <-. The equals sign = is also allowed, but most users
prefer the arrow sign (cf. Braun & Murdoch, 2007, p. 7). The values of object IQ are returned by
simply typing its name, as shown in the second command line. Note that even though R represents
these numbers in a row, IQ is a column vector.
It should be noted that R makes a distinction between uppercase and lowercase letters: R commands
are case-sensitive. If iq were requested instead of IQ, the message Error: object ‘iq’ not
found would be displayed.
Exercise 1.1
Manipulating or operating on variable IQ is rather simple. Try the following examples.
1. Add 5 to each value of IQ and assign these values to a new variable IQp5.
> IQp5 <- IQ + 5
> IQp5
2. Subtract 10 from each value of IQ and assign them to a new variable IQm10.
> IQm10 <- IQ - 10
> IQm10
3. Take the square and assign these values to a new variable IQsq.
> IQsq <- IQ * IQ    # or, equivalently, IQsq <- IQ^2
> IQsq
4. Take the mean of the variable IQ and assign it to a new variable IQmean.
> IQmean <- mean(IQ)
> IQmean
Apply the functions sqrt() (gives the square root), median() (gives the median),
and var() (gives the variance) to the variable IQ.
To get the number of observations of a variable, that is, the number of elements of the vector, the
length() function can be used:
> length(IQ)
[1] 5
Other useful functions that can be applied to variables like IQ (i.e., column vectors in R) are
sum() (sums all values of the variable), prod() (calculates the product of all values of the variable),
cumsum() and cumprod() (cumulative sums and products), and sort() (sorts the values of the
variable).
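A brief illustration of these functions is given below. The vector v is a hypothetical example with small values, so that each result can be checked by hand.

```r
v <- c(3, 1, 4, 2)   # hypothetical scores, for illustration only

total    <- sum(v)       # 3 + 1 + 4 + 2 = 10
product  <- prod(v)      # 3 * 1 * 4 * 2 = 24
run_sum  <- cumsum(v)    # 3  4  8 10
run_prod <- cumprod(v)   # 3  3 12 24
ordered  <- sort(v)      # 1  2  3  4

print(run_sum)
print(run_prod)
```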
In R, vectors can contain numbers, character strings, or logical values. This means that vectors (and
therefore variables) can be of different types. Some examples are given by the following commands:
> a <- c(1,2,5.3,6,-2,4)
> a                                   # numeric vector
[1] 1.0 2.0 5.3 6.0 -2.0 4.0
> b <- c("one","two","three")
> b                                   # character vector
[1] "one" "two" "three"
> c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
> c                                   # logical vector
[1] TRUE TRUE TRUE FALSE TRUE FALSE
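The type of a vector can be inspected with the function class(), which is not discussed in the text above. The sketch below also shows what happens when values of different types are combined in a single call to c(): R converts all elements to one common type. The object names are hypothetical.

```r
num_vec  <- c(1, 2, 5.3)
char_vec <- c("one", "two")
log_vec  <- c(TRUE, FALSE)

class(num_vec)    # "numeric"
class(char_vec)   # "character"
class(log_vec)    # "logical"

# Mixing types forces R to convert everything to one common type.
mixed_chr <- c(1, "two", TRUE)   # all elements become character strings
mixed_num <- c(1, TRUE, FALSE)   # logical values become 1 and 0

print(mixed_chr)   # "1" "two" "TRUE"
print(mixed_num)   # 1 1 0
```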
Consider the variable IQ, as constructed before. Each element of IQ (i.e., the individual scores
on the variable) can be manipulated separately. For example, in order to change the value of the
second element of IQ to 100, give the command
> IQ[2] <- 100
> IQ
[1] 104 100 125 89 110
To display only the first two elements of IQ enter the command
> IQ[c(1,2)]
[1] 104 100
This command combines different features. The brackets indicate the elements of the vector. Within
these brackets the function c(1,2) defines a vector of length 2 with values 1 and 2. The combination
thus produces the first two elements of object IQ. Note that IQ[1:2] produces the same result.
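Two further indexing features, sketched here on a hypothetical vector w, are negative indices (which drop elements) and the replacement of several elements at once.

```r
w <- c(10, 20, 30, 40, 50)   # hypothetical vector

picked  <- w[c(2, 4)]   # elements 2 and 4: 20 40
dropped <- w[-1]        # all elements except the first: 20 30 40 50

# Several elements can be replaced in one assignment.
w[c(1, 5)] <- c(0, 99)
print(w)                # 0 20 30 40 99
```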
Exercise 1.2
Which command(s) will result in displaying the third and fifth element of IQ?
Note that all variables that are entered in R are stored in the workspace. To see which variables
are stored in the workspace, use the function ls() to list them. Currently, there are
> ls()
[1] "IQ" "IQp5" "IQm10" "IQsq" "IQmean" "IQsqrt" "IQmedian" "IQvar"
To create a vector with a fixed sequence of values the function seq() can be used. Type, for
example,
> seq(0, 25, 5)
[1] 0 5 10 15 20 25
The first element in the arguments of the function seq() is the starting value, the second is the last
value, and the third is the step size or increment.
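The arguments of seq() can also be given by name. The sketch below shows the named arguments by and length.out, a decreasing sequence, and the related function rep(), which repeats values and is not discussed in the text above.

```r
by_step   <- seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
by_length <- seq(0, 1, length.out = 5)   # same values, requested by length
downward  <- seq(10, 2, by = -4)         # 10 6 2
repeated  <- rep(c(0, 1), times = 3)     # 0 1 0 1 0 1

print(by_step)
print(downward)
```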
Similarly, a vector with elements (1,2,3,4,5) can be constructed and assigned to object x:
> x <- seq(1, 5, 1); x                # Notice: two commands on one line
[1] 1 2 3 4 5
An alternative, more efficient way of assigning values to the vector x, and having the result returned,
is to use the command
> (x <- 1:5)                          # Notice: two implicit commands
[1] 1 2 3 4 5
The last type of operator treated in this section is the logical operator, illustrated with two examples. The
double equality sign == indicates whether the elements of an object satisfy a certain condition. For
example, if one wants to know which elements of IQ are equal to 100, give the command
> IQ == 100
[1] FALSE TRUE FALSE FALSE FALSE
The result shows that only the second element of IQ is equal to 100.
The function which() can be used to detect which elements of IQ are equal to 100, as follows
> which(IQ == 100)
[1] 2
The operators > and < can be used to examine inequalities.
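Conditions can be combined and used directly as an index. The sketch below uses a hypothetical vector v and the operators != (not equal), >= (larger than or equal), & (and), and | (or), which are standard R operators not listed in the text above.

```r
v <- c(3, 8, 5, 8)   # hypothetical vector

v != 8               # not equal:            TRUE FALSE  TRUE FALSE
v >= 5 & v < 8       # both conditions hold: FALSE FALSE  TRUE FALSE
v < 4 | v > 7        # at least one holds:   TRUE  TRUE FALSE  TRUE

n_eight  <- sum(v == 8)             # TRUE counts as 1, so this counts matches: 2
position <- which(v >= 5 & v < 8)   # index of the matching element: 3
selected <- v[v > 4]                # logical vector used as index: 8 5 8
```

Using a logical vector inside the brackets returns exactly those elements for which the condition is TRUE.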
Exercise 1.3
Give the command to investigate which elements of IQ are larger than 100, and check
whether the result is correct.
Data can be structured in several different ways. Thus far we have used vectors: one-dimensional
arrays of numbers, character strings, or logical values (when types are combined, R converts all
elements to a single type). Note that R treats these vectors as column vectors and that (single)
numbers are treated as vectors of length 1 (so-called scalars). Other data structures that are
available in R are factors, data frames, and matrices.
Factors are one-dimensional arrays of classification levels. Data frames and matrices are
two-dimensional arrays. The former are data tables that consist of a collection of (possibly different
types of) vectors, where the rows represent the observations and the columns the variables. Matrices
are two-dimensional arrays of elements of the same type (i.e., a matrix of numbers, or a matrix of
characters). Examples of all data structures will be given in the remainder of the chapter.
Matrices can be constructed with the function matrix():
> A <- matrix(1:9, 3, 3)
It creates a matrix from the given vector of values, which is assigned to object A here. The vector
within the parentheses, 1:9, shows the values of the elements of this matrix, and the last two
arguments define the dimensions of the matrix (3 × 3), that is, we have a matrix of 3 rows and 3
columns. By default, the columns of A are filled first, then the rows; more information about the
function matrix() can, of course, be obtained with the command ?matrix.
It is good practice to check the correctness of matrix A:
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
An alternative way to create a matrix is by using the function dim() to define the dimensions of a
vector, as follows
> A <- 1:9
> dim(A) <- c(3,3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
By using the argument byrow = TRUE in the function matrix(), the matrix will be filled row-wise
rather than column-wise:
> A <- matrix(1:9, 3, 3, byrow = TRUE); A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
Other useful matrix functions are %*% (matrix multiplication), t() (matrix transpose), det() (determinant of a square matrix), solve() (inverse of a square matrix), and eigen() (eigenvalues and
eigenvectors). The function as.matrix() coerces an object into a matrix object.
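A small sketch of some of these matrix functions follows. The (2 × 2) matrix M is a hypothetical example, chosen so that the determinant is easy to compute by hand.

```r
M <- matrix(c(2, 1, 1, 3), 2, 2)   # hypothetical symmetric 2 x 2 matrix

Mt   <- t(M)         # transpose (equal to M here, because M is symmetric)
d    <- det(M)       # determinant: 2 * 3 - 1 * 1 = 5
Minv <- solve(M)     # inverse of M
I2   <- M %*% Minv   # matrix product; the 2 x 2 identity matrix up to rounding

print(d)
print(round(I2, 10))
```

Note that * multiplies two matrices element by element, whereas %*% performs matrix multiplication.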
1.5 How to generate data
In R it is easy to generate random numbers from several probability distributions. The names of the
basic functions consist of r (for random) followed by an abbreviation of the distribution name. For
example, if 100 numbers from the standard normal distribution need to be generated, the following
command can be applied:
> rnorm(100, 0, 1)
The result consists of 100 pseudo-random numbers drawn from the standard normal distribution.
They are pseudo-random numbers because the sequence of numbers is deterministic, conditional
upon a starting or seed value. The arguments of the function rnorm() are the sample size (the
number of draws), the population mean, and the population standard deviation, respectively.
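The dependence on the seed can be made visible with the function set.seed(), which fixes the starting value of the random number generator. In the sketch below the seed value 123 is arbitrary; any fixed number works in the same way.

```r
set.seed(123)              # the seed value 123 is arbitrary
first  <- rnorm(5, 0, 1)

set.seed(123)              # the same seed again ...
second <- rnorm(5, 0, 1)   # ... gives exactly the same numbers

third  <- rnorm(5, 0, 1)   # without resetting the seed, the sequence continues

identical(first, second)   # TRUE
identical(first, third)    # FALSE
```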
Random numbers from a member of the family of binomial distributions can be generated by the
function rbinom(n,k,p), where n is the sample size to be drawn from the binomial distribution
with parameters k (the number of experiments) and p (the probability of success). Note that for
k=1 the binomial distribution equals a Bernoulli distribution with parameter p.
Exercise 1.4
Generate 10 draws from a binomial distribution with k=4 and success probability 0.7, and
10 draws from a Bernoulli distribution with success probability 0.5.
A random sample from a multinomial distribution can be generated with function rmultinom(n,k,p),
which has the same arguments or parameters as rbinom(n,k,p). Because a multinomial distribution has usually more than two categories, like for instance small, medium, large, p is a vector
with length equal to the number of categories and values corresponding to the probability that an
outcome falls in that category.
Exercise 1.5
Execute the following command, and interpret the results.
> rmultinom(10, 12, c(0.1, 0.6, 0.3))
What are the differences with the generation of binomial random samples?
To generate a sample from the standard uniform distribution, that is, drawing real numbers within
the range [0,1], the function runif(n,0,1) can be used. This function thus generates n pseudo-random real numbers having values uniformly distributed between 0 and 1.
Exercise 1.6
Generate 10 standard uniformly distributed numbers, and evaluate the results.
Not all sample generating functions are available from the standard R packages or libraries (i.e.,
those available after R has been installed). The function that generates random numbers from a
multivariate normal distribution, mvrnorm(), for example, is part of the MASS package (Venables
& Ripley, 1999). To load this library, and to subsequently obtain documentation on the function
mvrnorm(), use the commands
> library(MASS)
> ?mvrnorm
The mvrnorm() function has three arguments: the sample size n, the vector of population means
mu, and the population covariance matrix sigma. To generate samples from a bivariate normal
distribution, values have to be assigned to the vector mu and to the matrix sigma first. Define the
mean vector and the covariance matrix, for example, as follows
> mu <- c(10,20)
> sigma <- matrix(c(10,-3,-3,15), 2, 2)
The function matrix() creates a (2 × 2) matrix from the given set of values, which is assigned to
object sigma here. Check the correctness of the matrix sigma:
> sigma
     [,1] [,2]
[1,]   10   -3
[2,]   -3   15
The result shows that sigma[1,1] = 10, sigma[1,2] = sigma[2,1] = -3 and that sigma[2,2]
= 15. The diagonal elements of sigma are the population variances of the two random variables:
sigma[1,1] = 10 and sigma[2,2] = 15. The off-diagonal elements of sigma are the covariances
between the row-column pairs of random variables. For instance, sigma[1,2] is the population
covariance between the first and second random variable. The covariance matrix is symmetric, by
definition. Recall that the correlation between two variables is defined as their covariance divided
by the product of the standard deviations of the two variables.
Now that values have been assigned to the arguments or parameters of mvrnorm(), 10 random
draws from the specified bivariate normal distribution can be obtained by the command
> y <- mvrnorm(10,mu,sigma)
The result is a sample of 10 pairs of values drawn from the bivariate normal distribution, stored in
the (10 × 2) matrix y. In the population distribution, the first variable has mean 10 and variance 10,
the second variable has mean 20 and variance 15. The population covariance between the two variables
equals -3, the population correlation -0.24.
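The population correlation follows from the definition recalled earlier: the covariance divided by the product of the two standard deviations. The sketch below carries out this calculation for the matrix sigma and compares it with the standard R function cov2cor(), which is not discussed in the text above and converts a covariance matrix into a correlation matrix.

```r
sigma <- matrix(c(10, -3, -3, 15), 2, 2)

# Correlation = covariance / (sd of variable 1 * sd of variable 2)
rho <- sigma[1, 2] / (sqrt(sigma[1, 1]) * sqrt(sigma[2, 2]))
print(round(rho, 2))   # -0.24

# cov2cor() converts the whole covariance matrix to a correlation matrix.
cor_matrix <- cov2cor(sigma)
print(round(cor_matrix, 2))
```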
Exercise 1.7
Find out what y looks like. Check whether the variances, covariance and correlation
resemble the ‘true’ values in this small sample. Use the commands var(y), cov(y),
cov(y[,1],y[,2]), and cor(y). Do the same after taking a (much) larger sample from
the bivariate normal distribution. How large does the sample need to be before accurate
estimates are obtained?
1.6 How to create factors
In statistics it is very important to distinguish categorical variables such as gender (male and
female) from quantitative or numerical variables such as body height and intelligence. In this
section some attention will be paid to working with categorical variables. Take, for example, the
variable education with three categories: 0="low", 1="middle", and 2="high". If the education
scores of five persons are to be stored and assigned to object educ, this can be accomplished by
using the numerical values or by using the categories or levels of educ. Take the following example
> educ <- c(1,2,2,0,1); educ
[1] 1 2 2 0 1
Here, the entries of the vector educ have a numerical code or value. R treats the object educ as a
numerical variable; the mean of educ can be calculated. (Simply type > mean(educ).)
If, for some reason, a researcher should want the object educ to be treated as a categorical variable
or factor with, say, three ordinal categories or levels, this has to be specified accordingly. The
function factor() can be used to encode the variable educ as a factor as follows, defining names
or labels for the numerical outcomes
> educf <- factor(educ,labels=c("low","middle","high")); educf
[1] middle high high low middle
Levels: low middle high
The numerical variable educ has been transformed to a categorical variable educf; the mean of
educf cannot be calculated. (Try typing > mean(educf).) The names of the labels are specified
by assigning the names of the categories to the argument labels.
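A factor can be inspected with a few standard functions that are not discussed in the text above: levels() lists the categories, table() counts how often each category occurs, and as.integer() shows the internal codes. The sketch below applies them to the educ example.

```r
educ  <- c(1, 2, 2, 0, 1)
educf <- factor(educ, labels = c("low", "middle", "high"))

levels(educf)            # "low" "middle" "high"

counts <- table(educf)   # frequency of each category
print(counts)            # low 1, middle 2, high 2

# Internally the levels are stored as integer codes 1, 2, 3 (not 0, 1, 2).
codes <- as.integer(educf)
print(codes)             # 2 3 3 1 2
```

The internal codes therefore differ from the original numerical values of educ, which is one reason to keep both a numerical and a factor version of a variable.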
For some statistical functions and models, the specification of factors is essential for a proper
statistical analysis, as will be seen in the next sessions.
1.7 How to use script files
Not only to work efficiently with R on the assignments of the Applied Statistics course, but as a
general principle, it is strongly advised to use so-called script files. An R script is simply a text file
containing (almost) the same commands that you would enter on the command line of R. Script
files contain a list of executable commands, comparable to syntax files in SPSS, which can be edited
and saved for future applications. For the use of a script file, after having entered the R system, go
to File in the upper toolbar and click on New script. A fresh window, Untitled - R editor, then
pops up (see the figure on the next page). This editor window can be used to create a script file.
Enter the commands into this new window instead of in the console window. The commands in
scripts should not be preceded by the R prompt >. To check whether the commands are correct,
it is recommended to execute the commands sequentially. This can be done by marking the first
command or a set of commands, clicking on the right mouse button and clicking Run line, or by
pressing Ctrl+R. The marked command is then pasted to the R console and executed. Do this for all
commands, and when each of them is executed correctly, the script file can be saved, using a file
name with extension .R.
In a script file, on each line text after # is ignored. Thus, comments documenting the commands
can be easily added, either on the same line or by using lines starting with #.
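A saved script can also be executed in one go with the function source(), which is not discussed in the text above. The sketch below is self-contained: it first writes a small, hypothetical script to a temporary file (so that the example runs on any computer) and then executes it. In practice the script would be created in the R editor and saved in the working directory.

```r
# Create a small example script in a temporary file (for illustration only).
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Example script: text after # is ignored by R",
  "IQ <- c(104, 140, 125, 89, 110)   # assign five scores",
  "IQmean <- mean(IQ)                # compute their mean"
), script_file)

# Execute all commands in the script file.
source(script_file)

# Objects created by the script are now available in the workspace.
print(IQmean)   # 113.6
```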
Assignment 1. How to create a data set
Before starting with the assignment, it is important to have studied the text of this session and to
have tried the rather simple exercises. All elements practiced above can be used in this assignment,
including the help function.
The objective of the first assignment is to create a data set by generating values of a number of
variables, and to store them in a file called mydata. Data have to be generated for 150 fourth-grade
children in a primary school, each with a unique identification number (ID). These
children are taught arithmetic by one of two instruction methods, either method A or method B
(the numerical and factor versions method and methodf, respectively). The arithmetic skills of the
children are measured before the instructions start, the baseline measurement, and after one year at
the end of the fourth grade (arith0 and arith1). The intelligence score at baseline is also measured
(IQ). Relevant background variables are gender (sex and sexf) and the highest education (in three
categories, low, middle or high) of either parent (educ and educf). The table on the next page
summarizes the variables of the data file to be generated. The properties of the variables necessary
for constructing the data set are given below.
Note the research question under study: in general, researchers want to know whether the two
instruction methods have a differential effect on the arithmetic scores of pupils in the fourth grade
of elementary school.
Detailed instructions
1. It is recommended to set the random seed to a fixed number before the data are
generated. By taking such a fixed seed you can replicate the data generation process exactly,
that is, if you restart with the same seed and execute the same consecutive commands, you
will get exactly the same results. For convenience you might take the last three numbers of
your student identification number, for example 631; fixing the seed is then accomplished by
the R command set.seed(631).
Variable   Description
ID         identification number
arith0     arithmetic test score at baseline or pretest level
arith1     arithmetic test score at posttest level
IQ         intelligence score
method     arithmetic instruction method (numerical version)
methodf    arithmetic instruction method (factor version)
educ       educational level of parents (numerical version)
educf      educational level of parents (factor version)
sex        gender (numerical version)
sexf       gender (factor version)
2. Create a variable to identify children in the sample data by making a vector of increasing
numbers up to 150, the sample size. Give this variable the name ID.
3. Each child has to be randomly assigned to one of the two instruction methods. To that
purpose, generate a variable method, indicating whether the child was assigned to method A or
method B of arithmetic instruction. Use pseudo-random sampling from a binomial distribution
with success probability 0.5.
Make two versions of this variable, a numerical version with variable method having values 0
and 1, and a qualitative version with factor variable methodf having factor levels A and B.
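A sketch of this step follows, assuming that the value 0 stands for method A and 1 for method B (the text does not fix this coding) and that the seed has already been set once at the start of the script.

```r
n <- 150

# One Bernoulli trial per child: each value is 0 or 1 with probability 0.5.
method <- rbinom(n, size = 1, prob = 0.5)

# Assumption: 0 is labelled "A" and 1 is labelled "B".
methodf <- factor(method, levels = c(0, 1), labels = c("A", "B"))

# Cross-tabulate both versions to verify that the labels line up.
table(method, methodf)
```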
4. Do the same for the gender of the child, except that the binomial success probability for
boys equals 0.4. Use the variable names sex and sexf for the numerical and factor version,
respectively.
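The same pattern might look as follows for gender. The coding 1 = boy and the labels "girl" and "boy" are assumptions made for this sketch; only the probability 0.4 for boys comes from the text.

```r
n <- 150

# Assumption: 1 codes a boy, so P(sex == 1) = 0.4.
sex <- rbinom(n, size = 1, prob = 0.4)

# Assumed labels; the text only asks for a factor version named sexf.
sexf <- factor(sex, levels = c(0, 1), labels = c("girl", "boy"))

table(sexf)
```

Resetting the seed to the same value just before this step is best avoided, since the draws for sex would then depend strongly on those for method.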
5. The highest education of the parents can be generated by sampling from a multinomial
distribution with three categories having probabilities 0.3, 0.5 and 0.2, respectively. The
results of this generating process are stored in a matrix. Take a look at this matrix before
you proceed. Note that R labels the rows and columns by their index (e.g., [3,] is the third row,
[,7] is the seventh column).
To summarize these in a column vector having the values 0 (low education), 1 (middle) or 2
(high) according to the multinomial probabilities, you have to do some matrix manipulations.
To that purpose, you may use the following features: (i) to apply an arithmetic operator to
each element of row i of a matrix, you can use matrix[i,], and (ii) to apply arithmetic to
each element of column j you can use matrix[,j] and the function t() to transpose the
matrix.
Hint. After having inspected the data, conceptualize for the first few rows which number
should be assigned to the corresponding elements in the column vector. Then think of how
this number can be obtained by summing the row elements, with appropriate weights or
multiplication factors. Next apply this procedure to the (columns of the) matrix.
Make a numerical version of this variable labeled as educ, and a factor version identified as
educf.
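One way to carry out this step is sketched below; it is not necessarily the route intended by the exercise. The names p and M and the factor labels are assumptions. With size = 1, rmultinom() returns a matrix with one row per category and one column per child, each column containing a single 1.

```r
n <- 150
p <- c(0.3, 0.5, 0.2)                    # probabilities for low, middle, high

# 3 x 150 indicator matrix: exactly one 1 in every column.
M <- rmultinom(n, size = 1, prob = p)
M[, 1:7]                                 # inspect the first seven children

# Weight the rows by 0, 1 and 2 and sum within each child:
# t(M) is 150 x 3, so the product is a 150 x 1 column of category codes.
educ <- as.vector(t(M) %*% c(0, 1, 2))

# Assumed labels for the factor version.
educf <- factor(educ, levels = 0:2, labels = c("low", "middle", "high"))
table(educf)
```

An equivalent formulation without the transpose is colSums(M * c(0, 1, 2)), which relies on R recycling the weights down each column.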
6. Take a random sample from a bivariate normal distribution. The mean and variance of the
first variable intelligence, abbreviated as IQ, are 100 and 64, respectively. For the second
variable, the baseline or pretest arithmetic score arith0, the mean and the variance are 130
and 100, respectively. Let the covariance between the two variables be 40, which implies
a moderate correlation of 40/(8 × 10) = 0.50 between IQ and arith0. First assign the bivariate normal
random variables to a matrix, say X. Next, assign the two columns of that matrix X to the
variables IQ and arith0, respectively.
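A possible sketch of this step uses mvrnorm() from the MASS package. That choice is an assumption, since the text does not name a function, and it requires MASS to be installed. The names mu, Sigma and rho are illustrative.

```r
library(MASS)    # provides mvrnorm(); assumed to be installed

n <- 150
mu <- c(100, 130)                        # means of IQ and arith0
Sigma <- matrix(c(64,  40,
                  40, 100),              # variances on the diagonal,
                nrow = 2, byrow = TRUE)  # covariance 40 off the diagonal

# Correlation implied by the covariance matrix: 40 / (8 * 10).
rho <- Sigma[1, 2] / sqrt(Sigma[1, 1] * Sigma[2, 2])

# Each row of X is one child; column 1 is IQ, column 2 is arith0.
X <- mvrnorm(n, mu = mu, Sigma = Sigma)
IQ <- X[, 1]
arith0 <- X[, 2]

cor(IQ, arith0)    # sample correlation, close to but not exactly rho
```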
7. Generate three normally distributed variables to represent measurement errors, each having
a population mean 0 and population standard deviation 3. Label these independent random
variables as e1, e2 and e3.
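The error terms could be drawn as in the sketch below. Note that rnorm() takes the standard deviation, not the variance, as its third argument.

```r
n <- 150

# Three independent sets of measurement errors, each with mean 0 and sd 3.
e1 <- rnorm(n, mean = 0, sd = 3)
e2 <- rnorm(n, mean = 0, sd = 3)
e3 <- rnorm(n, mean = 0, sd = 3)

# Sample standard deviations should be in the neighbourhood of 3.
c(sd(e1), sd(e2), sd(e3))
```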
8. Finally, compute the arithmetic score after one year of arithmetic instruction, arith1,
using the linear equation
arith1 = 0.8 × method + 0.3 × (arith0 + e1) + 1.5 × (IQ + e2) + 0.6 × sex + e3.
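Because R arithmetic is vectorized, the equation can be typed almost literally. The sketch below applies it to invented toy values for two hypothetical children, so that the result can be checked by hand; in the assignment the vectors generated in the earlier steps would be used instead.

```r
# Toy inputs for two hypothetical children (illustrative values only).
method <- c(0, 1)
sex    <- c(1, 0)
arith0 <- c(130, 120)
IQ     <- c(100, 110)
e1     <- c(0, 1)
e2     <- c(0, -2)
e3     <- c(0, 0.5)

# The linear equation from the text, evaluated element by element.
arith1 <- 0.8 * method + 0.3 * (arith0 + e1) + 1.5 * (IQ + e2) + 0.6 * sex + e3
arith1
```

For the first child this gives 0.3 × 130 + 1.5 × 100 + 0.6 = 189.6, and for the second 0.8 + 0.3 × 121 + 1.5 × 108 + 0.5 = 199.6.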
9. Store all the variables generated in this exercise, except for the measurement error variables e1,
e2 and e3, in a data frame labeled as mydata using the command mydata <- data.frame(ID,
arith0, arith1, IQ, method, methodf, educ, educf, sex, sexf).
10. To save the data frame mydata in R-format to file, use the command
> save(mydata, file="mydata.Rdata")
This command writes an external representation of the R object mydata to the file
mydata.Rdata in the working directory. The extension .Rdata is comparable to .sav in SPSS. For
practical reasons it is important to give this file the same name as the data frame. The data
can be read back as an R object from this file, using the command load("mydata.Rdata").
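The round trip can be tried out with a sketch like the following. It uses a small stand-in data frame with invented values and writes to a temporary directory (both assumptions, chosen to avoid overwriting real work). It also shows that load() restores the object under its original name, which is one reason for matching the file name to the data frame.

```r
# Small stand-in for the full data frame (illustrative values only).
mydata <- data.frame(ID = 1:3, arith0 = c(128, 131, 140))

# Assumption: a temporary location instead of the working directory.
path <- file.path(tempdir(), "mydata.Rdata")
save(mydata, file = path)

rm(mydata)                 # remove the object from the workspace

# load() recreates the saved object and returns its name as a string.
loaded <- load(path)
loaded
exists("mydata")
```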
Note
• Do not forget to save the script file before leaving the R editor, since the script files will be
used again in later sessions of Applied Statistics. Do not paste faulty commands into the
script, because that would create problems in the next session! Send the script file
to the lecturers after completion of the assignment.
• For any statistical analysis of the generated sample data in subsequent sessions, the personal
data frame must first be loaded into R with the command load("mydata.Rdata").