
STAT202 Assignment 1 Solutions (Markers' Copy)

Assignment 1
Due 11:59pm Monday, October 16th - to be submitted through Crowdmark
General instructions: You will be asked to answer questions using R, by hand, or a combination of the two.
It is expected that you answer the questions by hand unless otherwise requested in the question. When
using R, please present all R code you used as well as the results obtained. Someone reading your work must
be able to take your code, copy it into R, and obtain identical results. You must always show your work
unless otherwise noted.
Please note that the TAs may not be marking all questions in this assignment. The exact
questions that will be marked will not be determined until after the due date. Because of this,
all questions will say 10 marks until we start marking (this is a Crowdmark default).
Question 1 [2 marks]
Please use R to answer this question.
Take a random sample of size 50 (n = 50) from the population of favourite numbers in the survey_data
dataset. Note that the first row of the data is a header row (i.e., it contains the names of the variables). To
do this in R, use the following code but you must replace 123456789 in set.seed() with your student ID
number. This will generate a sample uniquely attached to your ID.
data <- read.csv("survey_data.csv", header=T)
set.seed(123456789)
fav_number_sample <- sample(data$fav_number, 50)
Now complete the following steps in order for your sample of favourite numbers. Your solutions should
include the code you used to solve the problem along with the final answer.
Steps:
1. Add 10 to each data value.
2. Remove the 5th data value.
3. Divide the remaining numbers by 3 more than the standard deviation of the original data values.
4. Calculate the median of the resulting numbers.
# Step 1: fav_number_sample + 10
# Step 2: (fav_number_sample + 10)[-5]
# Step 3: (fav_number_sample + 10)[-5] / (sd(fav_number_sample) + 3)
# Step 4: take the median of the result from Step 3
median((fav_number_sample + 10)[-5] / (sd(fav_number_sample) + 3))
The final answer I get is 364.4617709. Yours will be different.
2 marks for correct answer, 0 otherwise (see excel spreadsheet for final answers by IDs)
Question 2 [6 marks]
(a) [2 marks] Using R and the survey_data dataset, create a pie chart to show the distribution of students
who were born in each month.
table(data$month_born)
##     April    August  December  February   January      July      June     March
##        20        11        23        17        16        17        19         9
##       May  November   October September
##        16        18        17        18
pie(c(16, 17, 9, 20, 16, 19, 17, 11, 18, 17, 18, 23),
labels=c("Jan", "Feb", "Mar", "Apr", "May", "June",
"July", "Aug", "Sept", "Oct", "Nov", "Dec"))
[Figure: pie chart of birth-month counts, one slice per month from Jan to Dec]
Deduct 1 mark for simple coding mistakes
(b) [2 marks] Using R and the survey_data dataset, create a bar chart to show the distribution of students
who were born in each month.
barplot(c(16, 17, 9, 20, 16, 19, 17, 11, 18, 17, 18, 23),
names.arg=c("Jan", "Feb", "Mar", "Apr", "May", "June",
"July", "Aug", "Sept", "Oct", "Nov", "Dec"))
[Figure: bar chart of birth-month counts, one bar per month from Jan to Dec; frequencies range from about 9 to 23]
Deduct 1 mark for simple coding mistakes
(c) [2 marks] Compare your charts from parts (a) and (b). Which chart is better at displaying the
information? Explain your reasoning.
Most of the wedges/slices from the pie chart appear to be somewhat similar in size (the eye is bad at judging
relative areas and reading angles - both of which are needed for pie charts). The bar chart is better at
displaying this information because it is easier to see the differences.
1 mark for stating bar chart is better, 1 mark for appropriate reason
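As an aside (not required for full marks), the counts do not need to be typed by hand. A sketch of a table-driven version, assuming month_born stores full English month names as the table() output suggests:
month_counts <- table(data$month_born)
# Reorder chronologically using R's built-in month.name constant
month_counts <- month_counts[month.name]
pie(month_counts)
barplot(month_counts)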
Question 3 [10 marks]
Below is a plot of how long it took 20 students to eat one cookie (in seconds). Use this data to answer the
following questions.
##
##   The decimal point is 1 digit(s) to the right of the |
##
##   0 | 24555779
##   1 | 00
##   2 | 045
##   3 | 000
##   4 | 55
##   5 | 0
##   6 | 0
(a) [4 marks] Calculate the mean, median, and mode.
Mean = (2 + 4 + 5 + ··· + 60)/20 = 21.15, median = Q2 = 15, mode = {5, 30}.
1 mark for mean, 1 mark for median, 2 marks for mode
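As a check for markers (the question asks for hand calculations), the 20 values can be read off the stem-and-leaf plot and verified in R; this is a sketch, not part of the expected student answer:
cookie_times <- c(2, 4, 5, 5, 5, 7, 7, 9, 10, 10,
                  20, 24, 25, 30, 30, 30, 45, 45, 50, 60)
mean(cookie_times)     # 21.15
median(cookie_times)   # 15
table(cookie_times)    # the counts show 5 and 30 each appearing three times: the modes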
(b) [2 marks] Compare the mean and the median from part (a). What does this tell us about the shape of
the data?
The mean is larger than the median. This suggests that the data is right skewed (we also see this in the plot).
1 mark for comparison, 1 mark for right skewed
(c) [2 marks] Calculate the first and third quartiles.
Q1 = 6, Q3 = 30
1 mark for each
(d) [2 marks] Determine whether 60 is an outlier.
IQR = Q3 − Q1 = 30 − 6 = 24
UL = Q3 + 1.5 × IQR = 30 + 1.5 × 24 = 66
Since 60 < 66, it is not an outlier.
1 mark for UL, 1 mark for correct answer (not outlier)
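A marker-side check in R, again using the values read off the stem-and-leaf plot (type = 2 makes quantile() average the two middle order statistics, matching the hand method used above):
cookie_times <- c(2, 4, 5, 5, 5, 7, 7, 9, 10, 10,
                  20, 24, 25, 30, 30, 30, 45, 45, 50, 60)
q <- quantile(cookie_times, c(0.25, 0.75), type = 2)   # Q1 = 6, Q3 = 30
q[2] + 1.5 * diff(q)                                   # upper limit = 66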
Question 4 [18 marks]
The data from Question 3 was actually a random sample taken from the survey_data variable containing
the recorded times of how long it took students to eat one cookie (in seconds). Use the original dataset for
this variable, in its entirety, to answer the following questions (i.e., eat_cookies_secs).
(a) [2 marks] Using R, create a histogram of this dataset.
hist(data$eat_cookie_secs)
[Figure: histogram of data$eat_cookie_secs; x-axis from 0 to 600 seconds, y-axis Frequency up to about 120]
Deduct 1 mark for simple coding mistakes
(b) [2 marks] What does the second bar (from the y-axis) of the histogram from part (a) represent? Explain
using non-statistical language (i.e., use language that a wide audience will understand).
The bar represents the number of students who ate one cookie in 50 to 100 seconds. It appears that about 40
students were able to do so.
Deduct 1 mark for simple mistakes (not using non-statistical language; choosing a different bar, etc.)
(c) [2 marks] Using R, create a boxplot of this dataset.
boxplot(data$eat_cookie_secs)
[Figure: boxplot of data$eat_cookie_secs; y-axis from 0 to over 500 seconds]
Deduct 1 mark for simple coding mistakes
(d) [3 marks] Describe the shape, center and spread of the boxplot from part (c) using appropriate numerical
measures.
Shape: skewed right
Center: the median is hard to tell visually here (any guess between 20 and 60 seconds is fine)
Spread: IQR is about 50 and the range is 600 seconds
1 mark for shape, 1 for center, 1 for spread (IQR or range)
median(data$eat_cookie_secs)
## [1] 30
IQR(data$eat_cookie_secs)
## [1] 50
range(data$eat_cookie_secs)
## [1]   0 600
(e) [2 marks] Using R and one line of code, give the five-number summary of this dataset. What extra
information is provided?
summary(data$eat_cookie_secs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   10.00   30.00   43.26   60.00  600.00
The mean is extra.
1 mark for summary function, 1 mark for noting the mean
(f) [2 marks] It appears that one student ate one cookie in 600 seconds. Without calculating the upper
limit, is this value an outlier? Explain your reasoning.
Yes, it is an outlier, as shown in both the histogram (the bar is much further to the right than the rest of the bars) and the boxplot (it appears as a dot above the upper whisker, which extends to the largest non-outlier value).
1 mark for stating yes, 1 mark for appropriate reason (using histogram OR boxplot OR both)
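Not required by the question (which asks for a visual argument), but markers can confirm numerically using the quartiles already reported by summary() above:
q <- quantile(data$eat_cookie_secs, c(0.25, 0.75), na.rm = TRUE)   # 10 and 60, as in summary()
q[2] + 1.5 * diff(q)                                               # upper limit = 60 + 1.5 * 50 = 135; 600 > 135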
(g) [2 marks] Provide two possible reasons why the value from part (f) is in the dataset.
It could be a typo (perhaps they meant to type 60 seconds), a very large cookie, etc.
1 mark for each appropriate reason
(g) [3 marks] Is the random sample from Question 3 a “good” representation of the population from
Question 4? Explain your reasoning using statistics.
No, it is not a good representation. In Question 3, x̄ = 21.15 and Q2 = 15. But in Question 4, x̄ = 43.26
and Q2 = 30. The random sample failed to capture values from the right tail, so it’s under-representing the
population.
1 mark for stating no, 1 mark for comparing statistics, 1 mark for recognizing underrepresentation
Question 5 [16 marks]
To answer the parts of this question, use the following line of code:
Q5_data <- data[data$exercise_per_week <=7, ]
For this question only (i.e., Question 5), you are going to use “Q5_data” instead of “data”.
(a) [2 marks] Using R, calculate the correlation between pulse rate and average coffee consumed per day.
Note that you need to include use="complete.obs" in the function to get an answer. Interpret the
number you calculate.
cor(Q5_data$pulse_rate, Q5_data$avg_coffee_per_day, use="complete.obs")
## [1] 0.1893126
The correlation is 0.1893126. It is weak and positive.
1 mark for correlation, 1 mark for correct interpretation
(b) [1 mark] Explain what the argument use="complete.obs" from part (a) does. Type help(cor) or
?cor to open the help file which provides you with more information about the function. Note that you
may need to include this argument again when calculating correlations.
The argument handles missing values by casewise deletion. That is, only complete observations (rows with no missing value in either variable) are used to calculate the correlation.
1 mark for appropriate explanation
(c) [5 marks] Using R, create side-by-side boxplots of the pulse rate (p) against the number of exercise days
per week (e) using the command boxplot(p~e). Note that using this command “as is” will not work
for you – you need to fill in the appropriate terms for p and e. Compare the pulse rates for students
who do not exercise and students who exercise every day of the week. Be sure to mention shape, center,
and spread, along with any outliers.
boxplot(Q5_data$pulse_rate ~ Q5_data$exercise_per_week)
[Figure: side-by-side boxplots of Q5_data$pulse_rate (roughly 40 to 140) by Q5_data$exercise_per_week (0 through 7 days)]
Shape: No exercise days (0) is symmetric and all exercise days (7) is skewed left.
Center: The median pulse rate for all exercise days is slightly lower than the median pulse rate for no exercise
days. The mean pulse rate for all exercise days should be lower than the mean pulse rate for no exercise days
(due to the shape).
Spread: The IQR for no exercise days is larger than the IQR for all exercise days, but the range for no exercise days is smaller than the range for all exercise days.
Outliers: Both groups have outliers, although for no exercise days they appear only above the upper whisker.
1 mark for boxplot, 1 mark for shape, 1 for center, 1 for spread (IQR or range), 1 for outliers
(d) [2 marks] Explain what the line of code at the start of this question (Question 5) does and why we
needed to use it. Hint: Create the boxplot from part (c) using “data” instead of “Q5_data” to help
you.
boxplot(data$pulse_rate ~ data$exercise_per_week)
[Figure: side-by-side boxplots of data$pulse_rate by data$exercise_per_week; x-axis categories 0 through 7 plus 30 and 150]
It removed the values of 30 and 150 for the number of exercise days per week. We had to use it since there
are only 7 days in a week (i.e., the maximum can only be 7).
1 mark for recognizing the removes values, 1 mark for explaining why (max of 7)
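A quick marker-side check of what the filter drops (a sketch; the exact counts depend on the dataset):
table(data$exercise_per_week)                   # shows the impossible values such as 30 and 150
sum(data$exercise_per_week > 7, na.rm = TRUE)   # how many observations record more than 7 exercise days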
(e) [2 marks] Using R, calculate the correlation between pulse rate and average coffee consumed per day,
for students who do not exercise.
no_exercise <- Q5_data[Q5_data$exercise_per_week=="0",]
cor(no_exercise$pulse_rate, no_exercise$avg_coffee_per_day)
## [1] 0.2869316
The correlation is 0.2869316.
1 mark for filtering data correctly, 1 mark for correlation
(f) [2 marks] Using R, calculate the correlation between pulse rate and average coffee consumed per day,
for students who exercise every day of the week.
most_exercise <- Q5_data[Q5_data$exercise_per_week=="7",]
cor(most_exercise$pulse_rate, most_exercise$avg_coffee_per_day, use="complete.obs")
## [1] 0.6251409
The correlation is 0.6251409.
1 mark for filtering data correctly, 1 mark for correlation
(g) [2 marks] Comment on your results from parts (e) and (f).
The correlation between pulse rate and average coffee consumed per day is larger (i.e., stronger) for students
who exercise every day of the week compared to students who do not exercise. Both of these are larger than
the correlation calculated in part (a).
Based on these results, it appears that coffee consumption may be affecting pulse rate more for
people who vigorously exercise (i.e., exercise every day of the week). In general, it is known that people who
exercise more (e.g., athletes) have lower pulse rates, so the coffee consumption may be "jolting" their pulse
rates. However, there could be other lurking/confounding variables that are influencing these results as well.
1 mark for comparing correlations correctly, 1 mark for recognizing/interpreting what it means contextually
Question 6 [16 marks]
Please use "data" (i.e., the original survey_data dataset) for this question. Please include the
argument xlim=c(0,250) in any plots you are asked to create for this question.
(a) [1 mark] Using R, plot a scatterplot of students’ height (response) against shoe size (explanatory
variate).
plot(data$shoe_size_cm, data$height_cm, xlim = c(0,250))
[Figure: scatterplot of data$height_cm (about 150 to 200) against data$shoe_size_cm, with xlim = c(0, 250)]
(b) [4 marks] Using R, build a linear model with height as the response variate and shoe size as the
explanatory variate. What is the intercept, slope, and coefficient of determination for this model?
summary(lm(height_cm ~ shoe_size_cm, data))
##
## Call:
## lm(formula = height_cm ~ shoe_size_cm, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.575  -7.603  -1.622   7.330  30.359
##
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)
## (Intercept)  168.43012    1.49272 112.834 <0.0000000000000002 ***
## shoe_size_cm   0.04769    0.05077   0.939               0.349
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.16 on 199 degrees of freedom
## Multiple R-squared:  0.004413, Adjusted R-squared:  -0.00059
## F-statistic: 0.8821 on 1 and 199 DF,  p-value: 0.3488
The intercept is a = 168.43012, slope is b = 0.04769, and coefficient of determination is r2 = 0.004413 =
0.4413%.
1 mark for fitting linear model correctly, 1 mark for intercept, 1 for slope, 1 for coef
(c) [1 mark] Using R, draw the least-squares regression line from part (b) on the scatterplot from part (a).
plot(data$shoe_size_cm, data$height_cm, xlim = c(0,250))
abline(168.43012, 0.04769, col="red")
[Figure: scatterplot of data$height_cm against data$shoe_size_cm with the fitted regression line drawn in red]
(d) [2 marks] Calculate (or predict) the average height of a student with a shoe size of 24.7cm.
ŷ = 168.43012 + 0.04769 × shoe size = 168.43012 + 0.04769(24.7) = 169.6081 cm
Deduct 1 mark for simple calculation errors
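An equivalent marker-side check uses predict(); this is a sketch that assumes the fitted model from part (b) is stored in an object:
fit <- lm(height_cm ~ shoe_size_cm, data)
predict(fit, newdata = data.frame(shoe_size_cm = 24.7))   # about 169.6 cm, matching the hand calculation up to rounding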
(e) [6 marks] Repeat parts (b) and (c) after removing the 160th row from the dataset.
Q6_data <- data[-160,]
summary(lm(height_cm ~ shoe_size_cm, Q6_data))
##
## Call:
## lm(formula = height_cm ~ shoe_size_cm, data = Q6_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.047  -7.595  -1.031   6.845  29.869
##
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)
## (Intercept)  150.4679     3.4475  43.645 < 0.0000000000000002 ***
## shoe_size_cm   0.7741     0.1361   5.689         0.0000000455 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.442 on 198 degrees of freedom
## Multiple R-squared:  0.1405, Adjusted R-squared:  0.1361
## F-statistic: 32.36 on 1 and 198 DF,  p-value: 0.00000004554
plot(Q6_data$shoe_size_cm, Q6_data$height_cm, xlim = c(0,250))
abline(150.4679, 0.7741, col="red")
[Figure: scatterplot of Q6_data$height_cm against Q6_data$shoe_size_cm with the new fitted regression line drawn in red]
The intercept is a = 150.4679, slope is b = 0.7741, and coefficient of determination is r2 = 0.1405 = 14.05%.
1 mark for removing row, 1 mark for fitting linear model correctly, 1 mark for intercept, 1 for slope, 1 for
coef, 1 mark for plot with line of best fit
(f) [2 marks] Is the 160th data point an influential observation? Explain your reasoning.
Yes, it is an influential observation: when we removed it from the dataset in part (e), the line of best fit changed markedly. Removing it also drastically changed the slope and the coefficient of determination (and hence the correlation).
1 mark for stating yes, 1 mark for appropriate reason
Question 7 [5 marks]
The contents of this assignment illustrate how we can use the methods learned in class to analyze the data
collected in the Course Survey! Imagine that you are asked to help create the next version of the Course
Survey – that is, the survey that future STAT 202 students will fill out and analyze in their first assignment.
What questions would you incorporate? Please provide a few ideas for questions and, potentially, the questions
themselves. If you can think of more questions and want to include them as well, that would be great (but
keep in mind that we are looking for quality over quantity)!
Note that it may help you to think of how you would analyze the data collected from your questions.
Answers will vary here.