R Exercise Practical 2

Follow the instructions described in this sheet. Type or paste any answers you are asked for into a Microsoft Word document.

Note: You should have this sheet on your computer for reference and to copy and paste some text.

1. Obtain and attach the data set “cats” by typing the following:

> library(MASS)
> data(cats)
> attach(cats)

Have a look at the data by typing cats, and also use the ?cats command.

Plot Hwt (heart weight) against Bwt (body weight), with Bwt on the horizontal axis.

> plot(Hwt~Bwt)

Obtain the best straight line through the points using

> abline(lm(Hwt~Bwt))

So far, the analysis has not considered the gender of the cat. Type in the R code

> boxplot(Hwt~Sex, xlab='gender', ylab='heart weight')

which clearly shows that heart weight depends on gender. (Use the argument horizontal=T if you want a different orientation.)

Similarly, the R code

> plot(Hwt~Bwt, xlab='body weight (kg)', ylab='heart weight (g)', type='n')
> points(Hwt[Sex=='M']~Bwt[Sex=='M'])
> points(Hwt[Sex=='F']~Bwt[Sex=='F'], pch=3)

produces a plot of heart weight against body weight for male cats (denoted by o) and female cats (denoted by +).

2. You can use R to plot some well-known mathematical functions. Try the following:

(a) > curve(sin(x), from=0, to=2*pi)
(b) > curve(2*x^3+x^2-2*x+10, from=-2, to=2)
(c) > curve(log(x), from=0, to=10, col="magenta")
(d) > curve(5-x, from=0, to=10, col="red", add=T)

3. Install a new package in R using the command:

> install.packages("UsingR")

Choose any mirror site you like (St. Andrews worked for me). Obtain and attach the data set “cfb” by typing the following:

> library(UsingR)
> data(cfb)
> attach(cfb)

One of the vectors in this data frame is called INCOME. It represents the yearly household income figures from a survey of consumer finances in the US in 2001.

(a) Obtain a histogram of the INCOME data. What does it tell you?
(b) Calculate the mean income using

> mean(INCOME)

How good is this statistic in describing the data?

(c) One argument of the mean( ) command is “trim”. Use ?mean to investigate what this argument does, and experiment with various values of “trim”.

(d) Calculate the median of the INCOME data using

> median(INCOME)

Does this give a better “feel” for the data set than the mean?

(e) Calculate the coefficients of symmetry and kurtosis of the INCOME data by defining the functions discussed in the lectures (see lecture 3 on Vision).

4. Consider the following three random samples from a population. Enter each data set into R using the “scan” command.

(i) The data below are the number of hours between charging on a particular type of mobile phone.

45.8 54.9 41.7 41.1 48.5 52.8 55.9 40.4 60.5 46.6
44.4 38.5 57.0 51.0 60.4 45.0 44.2 53.8 58.5 59.1
47.3 46.7 46.9 50.2 49.3 50.7 58.8 52.7 43.7 50.7

(ii) The data below are the number of hours spent watching television per week for a sample of 34 households.

23.1 20.7 17.0 22.5 15.9 15.3 21.2 8.3 21.0 17.7
17.9 2.5 26.0 19.1 24.7 30.4 25.1 22.7 21.1 14.7
21.9 17.2 24.2 14.6 19.1 16.6 26.3 22.7 18.2 25.8
24.0 16.5 9.4 24.7

(iii) The data below are the number of miles travelled to and from work each day by a sample of 12 company employees.

3.7 40.7 14.3 5.3 11.0 26.5 5.2 4.8 24.2 16.9 8.2 26.5

(a) For each data set find the mean, median, standard deviation, quartiles and interquartile range. Use the commands

> summary(hours)
> sd(hours)

(b) Obtain a boxplot for each data set.

(c) Obtain a histogram for each data set.

(d) Obtain a smooth curve through the histogram using the commands:

> hist(hours, freq=F)
> lines(density(hours))

(e) Recall from your statistics lectures that when samples are taken, the statistic measured varies from sample to sample. Use the command

> t.test(hours)

to find a 95% confidence interval for the population mean for each sample.

5.
The built-in data set islands contains the sizes of the world’s land masses that exceed 10 000 square miles. Make a stem-and-leaf plot, then compare the mean, median and 25% trimmed mean. Are they similar?

Hints:

> data(islands)
> mean(islands, 0.25)

6. The median absolute deviation is defined as

mad(x) = 1.4826 * median(|x_i - median(x)|)

This is a resistant measure of spread and is implemented using the mad( ) function. Explain in words what it measures. Compare the values of the sample standard deviation, interquartile range and median absolute deviation for the exec.pay data set in the UsingR package.

Hints:

> library(UsingR)
> data(exec.pay)

7. Consider the results below that give data on highway traffic flow. They show the relationship between traffic flow, x (in thousands of vehicles per day), and lead content, y (micrograms per gram), of the bark of nearby trees.

(a) Enter these data into two vectors in R, using for example, x=c(8.3,8.3,12.1,…………)

(b) Plot the data using plot(y~x), labelling the diagram appropriately.

(c) You now wish to find the regression line of y on x. This has the equation

y = a + bx

The values a and b can be obtained in R using the lm command

> lm(y~x)

This gives the equation for y in terms of x (or at least the gradient b and the intercept a, the point at which the line cuts the y axis). Assign this to an object called “regress”.

> regress=lm(y~x)

Write down the equation of the best straight line and copy and paste the output from the command

> summary(regress)

(d) The calculated straight-line equation can be drawn on the graph using

> plot(y~x)
> abline(lm(y~x))

Copy and paste this graph into your Word document.

(e) Find the residuals using the command

> residuals(regress)

It is often useful to look at a plot of the residuals against x. Do this using

> plot(residuals(regress)~x)

Do they follow any pattern?

(f) Obtain the Product-Moment Correlation Coefficient, r, and comment on what its value means.

> cor(x,y)

8.
A series of numbers x_1, x_2, x_3, ... can be created from constants a, b and c using the formula

x_{n+1} = (a x_n + c) mod b

This is called the Linear Congruential Method of random number generation. Note that x mod y is the remainder obtained when x is divided by y, so 20 mod 3 = 2.

Use a “for” loop to generate the first 100 numbers of such a series for the values

(a) x_1 = 1, a = 5, b = 94 and c = 0
(b) x_1 = 2, a = 4, b = 99 and c = 3

Hint: x mod y is obtained in R by x%%y

Which is the more efficient at producing random numbers?
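
As a rough illustration of the recurrence above, here is a minimal sketch in R. It is not the solution to parts (a) or (b): the values x1 = 1, a = 3, b = 7, c = 1 and the length n = 10 are assumptions chosen only to keep the output short, and the function name lcg is hypothetical.

```r
# Sketch of the Linear Congruential Method: x[n+1] = (a * x[n] + c) mod b
# lcg() is a hypothetical helper name; the parameter values used below are
# illustrative assumptions, not the values asked for in the exercise.
lcg <- function(x1, a, b, c, n) {
  x <- numeric(n)                      # pre-allocate space for n terms
  x[1] <- x1                           # starting value (seed)
  for (i in seq_len(n - 1)) {
    x[i + 1] <- (a * x[i] + c) %% b    # %% gives the remainder on division by b
  }
  x
}

s <- lcg(x1 = 1, a = 3, b = 7, c = 1, n = 10)
print(s)

# Count the distinct values to see how soon the series starts repeating
print(length(unique(s)))
```

Counting the distinct values in a series, as in the last line, is one possible way to compare how quickly two choices of constants begin to repeat themselves.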