R Practical Exercise 2 - Mathematical & Computer Sciences

R Exercise Practical 2
Follow the instructions described in this sheet. Type or paste any answers you are asked for
into a Microsoft Word document.
Note: You should have this sheet on your computer for reference and to copy and paste some
text.
1.
Obtain and attach the data set “cats” by typing the following:
>library(MASS)
>data(cats)
>attach(cats)
Have a look at the data by typing cats and also use the ?cats command.
Plot Hwt (Heart Weight) against Bwt (Body Weight). Choose Bwt to be on the horizontal axis.
>plot(Hwt~Bwt)
Obtain the best straight line through the points using
>abline(lm(Hwt~Bwt))
So far, the analysis has not looked at all at the gender of the cat.
Type in the R code
>boxplot(Hwt~Sex, xlab='gender', ylab='heart weight')
which suggests that heart weight differs between male and female cats
(use the argument horizontal=TRUE if you want a different orientation).
Similarly, the R code
> plot(Hwt~Bwt, xlab='body weight (kg)', ylab='heart weight (g)', type='n')
> points(Hwt[Sex=='M']~Bwt[Sex=='M'])
> points(Hwt[Sex=='F']~Bwt[Sex=='F'], pch=3)
produces a plot of heart weight against body weight for male cats (denoted by o) and female
cats (denoted by +).
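The same plot can also be drawn without attach(). The lines below are a minimal sketch, assuming only that the MASS package is installed: the data= and subset= arguments pick out the columns and rows, and the legend call is an optional addition that is not part of the original instructions.

```r
library(MASS)  # provides the cats data frame

# Empty axes first, then one set of points per sex (no attach() needed)
plot(Hwt ~ Bwt, data = cats, type = "n",
     xlab = "body weight (kg)", ylab = "heart weight (g)")
points(Hwt ~ Bwt, data = cats, subset = Sex == "M")            # males: o
points(Hwt ~ Bwt, data = cats, subset = Sex == "F", pch = 3)   # females: +

# Optional key so the two symbols are identified on the plot itself
legend("topleft", legend = c("male", "female"), pch = c(1, 3))
```

Passing data= keeps the column names out of the global search path, which avoids clashes if another attached data set has a column with the same name.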
2.
You can use R to plot some well known mathematical functions. Try the following:
(a)
> curve(sin(x),from =0, to =2*pi)
(b)
> curve(2*x^3+x^2-2*x+10,from =-2, to =2)
(c)
> curve(log(x),from =0, to =10,col="magenta")
(d)
> curve(5-x,from =0, to =10,col="red", add=T)
3.
Download a new library to your R package using the command:
>install.packages("UsingR")
Choose any mirror site you like (St. Andrews worked for me).
Obtain and attach the data set “cfb” by typing the following:
>library(UsingR)
>data(cfb)
>attach(cfb)
One of the vectors in this data frame is called INCOME. It represents the yearly income
figures of households in a survey of consumer finances in the US in 2001.
(a)
Obtain a histogram of the INCOME data. What does it tell you?
(b)
Calculate the mean income using
>mean(INCOME)
How good is this statistic in describing the data?
(c)
An argument of the mean( ) command is “trim”. Use ?mean to investigate what this
argument does, and experiment with various values of “trim”.
(d)
Calculate the median of the INCOME data using
>median(INCOME)
Does this give a better “feel” for the data set than the mean?
(e)
Calculate the coefficients of symmetry and kurtosis of the INCOME data by defining
the functions discussed in the lectures (See lecture 3 on Vision).
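The lecture definitions are not reproduced on this sheet, so the sketch below assumes the common moment-based forms (third and fourth central moments scaled by powers of the second). The function names skewness and kurtosis are illustrative choices, and the lecture notes may use a different divisor or subtract 3 from the kurtosis.

```r
# Assumed definition: third central moment / (second central moment)^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Assumed definition: fourth central moment / (second central moment)^2
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Quick check on a small symmetric vector (illustrative values only)
skewness(c(1, 2, 3, 4, 5))
kurtosis(c(1, 2, 3, 4, 5))
```

With the cfb data attached as above, the same functions would be applied as skewness(INCOME) and kurtosis(INCOME).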
4.
Consider the following three random samples from a population. Enter each dataset
into R using the “scan” command.
(i)
The data below are the number of hours between charging on a particular type of
mobile phone.
45.8 54.9 41.7 41.1 48.5 52.8 55.9 40.4 60.5 46.6
44.4 38.5 57.0 51.0 60.4 45.0 44.2 53.8 58.5 59.1
47.3 46.7 46.9 50.2 49.3 50.7 58.8 52.7 43.7 50.7
(ii)
The data below are the number of hours spent watching television per week for a
sample of 34 households.
23.1 20.7 17.0 22.5 15.9 15.3 21.2 8.3 21.0 17.7
17.9 2.5 26.0 19.1 24.7 30.4 25.1 22.7 21.1 14.7
21.9 17.2 24.2 14.6 19.1 16.6 26.3 22.7 18.2 25.8
24.0 16.5 9.4 24.7
(iii)
The data below are the number of miles travelled to and from work each day by a
sample of 12 company employees.
3.7 40.7 14.3 5.3 11.0 26.5 5.2 4.8 24.2 16.9 8.2 26.5
(a)
For each data set find the mean, median, standard deviation, quartiles and interquartile
range.
Use the commands
>summary(hours)
>sd(hours)
(b)
Obtain a boxplot for each dataset.
(c)
Obtain a histogram for each dataset.
(d)
Obtain a smooth curve through the histogram using the commands:
> hist(hours,freq=F)
> lines(density(hours))
(e)
Recall from your statistics lectures that when samples are taken the statistic measured
varies from sample to sample. Use the command
>t.test(hours)
to find a 95% confidence interval for the population mean for each sample.
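As a rough sketch of parts (a) to (e) for one sample, the lines below work through data set (iii). The variable name miles is an illustrative choice, and scan(text = ...) is used so the values can be pasted as a string rather than typed interactively. IQR() is an addition here, since summary() reports the quartiles but not their difference.

```r
# Data set (iii): miles travelled to and from work, 12 employees
miles <- scan(text = "3.7 40.7 14.3 5.3 11.0 26.5 5.2 4.8 24.2 16.9 8.2 26.5")

summary(miles)   # minimum, quartiles, median, mean, maximum
sd(miles)        # sample standard deviation
IQR(miles)       # interquartile range (upper quartile minus lower quartile)

boxplot(miles, ylab = "miles per day")

hist(miles, freq = FALSE)   # density scale, so the smooth curve is comparable
lines(density(miles))

ci <- t.test(miles)$conf.int   # 95% confidence interval for the population mean
ci
```

The same commands apply to the other two samples once each has been read into its own vector.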
5.
The built in data set islands contains the size of the world’s land masses that exceed
10 000 square miles. Make a stem-and-leaf plot, then compare the mean, median and 25%
trimmed mean. Are they similar?
Hints:
>data(islands)
>mean(islands, 0.25)
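To see what trimming does before applying it to islands, it may help to try a small vector first. The sketch below uses made-up values (not from any data set on this sheet) with one extreme observation, then computes the three summaries asked for.

```r
# Toy vector (illustrative values only) with one extreme observation
x <- c(1, 2, 3, 4, 100)
mean(x)               # pulled upwards by the value 100
mean(x, trim = 0.2)   # drops the smallest and largest value, then averages 2, 3, 4
median(x)

# The same three summaries for the built-in islands data
stem(islands)
c(mean = mean(islands), median = median(islands),
  trimmed = mean(islands, trim = 0.25))
```

Writing trim = 0.25 by name is equivalent to the positional form mean(islands, 0.25) in the hint.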
6.
The median absolute deviation is defined as
mad(x) = 1.4826 × median(|x_i - median(x)|)
This is a resistant measure of spread and is implemented using the mad( ) function. Explain in
words what it measures. Compare the values of the sample standard deviation, interquartile
range and median absolute deviation for the exec.pay data set in the UsingR library.
Hints:
>library(UsingR)
>data(exec.pay)
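The definition can be checked against the built-in function on a small example. This is a minimal sketch using made-up values (not the exec.pay data); it computes the formula by hand and compares the result with mad( ), whose default scaling constant is 1.4826.

```r
# Toy vector (illustrative values only) with one large outlier
x <- c(2, 4, 4, 5, 7, 9, 30)

# Formula from the sheet: 1.4826 * median of absolute deviations from the median
by_hand <- 1.4826 * median(abs(x - median(x)))

by_hand
mad(x)     # built-in version, should agree with by_hand
sd(x)      # for comparison: strongly affected by the outlier
IQR(x)     # for comparison: another resistant measure of spread
```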
7.
Consider the results below that give data on highway traffic flow. They show the
relationship between traffic flow, x (in thousands of vehicles per day) and lead content, y
(micrograms per gram) of the bark of nearby trees.
(a) Enter these data into two columns in R, using for example, x=c(8.3,8.3,12.1,…………)
(b) Plot the data using plot(y~x), labelling the diagram appropriately.
(c) You now wish to find the regression line of y on x. This has the equation
y= a + bx
The values a and b can be obtained in R using the lm command
>lm(y~x)
This gives the equation for y in terms of x (or at least the gradient, b, and the point at
which the line cuts the y axis, a). Assign the fitted model to an object called “regress”.
>regress=lm(y~x)
Write down the equation of the best straight line and copy and paste the output from the
command
>summary(regress)
(d) The calculated straight-line equation can be drawn on the graph using
>plot(y~x)
>abline(lm(y~x))
Copy and paste this graph into your Word document.
(e) Find the residuals using the command
>residuals(regress)
It is often useful to look at a plot of the residuals against x. Do this using
>plot(residuals(regress)~x)
Do they follow any pattern?
(f) Obtain the Product-Moment Correlation Coefficient, r, and comment on what its value
means.
>cor(x,y)
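The same steps can be rehearsed on the cats data from question 1 before entering the traffic figures. The sketch below is only an illustration on that data set, not the answer for the traffic data; it shows how the fitted coefficients, the residual plot and r relate to one another.

```r
library(MASS)  # cats data frame from question 1

regress <- lm(Hwt ~ Bwt, data = cats)
coef(regress)                 # intercept a and gradient b of y = a + bx

# Residuals against the explanatory variable, with a reference line at zero
plot(residuals(regress) ~ cats$Bwt,
     xlab = "body weight (kg)", ylab = "residual")
abline(h = 0, lty = 2)

r <- cor(cats$Bwt, cats$Hwt)
r
r^2                           # matches the Multiple R-squared in summary(regress)
```

For a regression with a single explanatory variable, the square of r equals the R-squared value reported by summary( ), which gives one way to interpret its size.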
8.
A series of numbers x_1, x_2, x_3, ... can be created from constants a, b and c using the
formula
x_{n+1} = (a x_n + c) mod b
This is called the Linear Congruential Method of random number generation.
Note that x mod y is the remainder obtained when x is divided by y. So 20 mod 3 = 2.
Use a “for” loop to generate the first 100 numbers of such a series for the values
(a) x_1=1, a=5, b=94 and c=0
(b) x_1=2, a=4, b=99 and c=3
Hint: x mod y is obtained in R by x%%y
Which is the more efficient at producing random numbers?
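One possible shape for the loop is sketched below. The helper name lcg and the variable names series_a and series_b are illustrative choices, not part of the sheet, and counting the distinct values is offered only as one way to start comparing the two generators.

```r
# Linear congruential sequence: x[n+1] = (a * x[n] + c) mod b
# lcg() is a hypothetical helper name, not part of base R
lcg <- function(x1, a, b, c, n = 100) {
  x <- numeric(n)
  x[1] <- x1
  for (i in seq_len(n - 1)) {
    x[i + 1] <- (a * x[i] + c) %% b
  }
  x
}

series_a <- lcg(x1 = 1, a = 5, b = 94, c = 0)
series_b <- lcg(x1 = 2, a = 4, b = 99, c = 3)

head(series_a)
head(series_b)

# A series that revisits a value repeats from that point onwards,
# so the number of distinct values indicates how soon each one cycles
length(unique(series_a))
length(unique(series_b))
```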