DSA8001 Written Mock Assessment NS (2)

DSA8001
Exam Time Table Code DSA8001
MSc Examination
Data Science and Analytics
DSA8001
Data Analytics Fundamentals
Friday, 29th October 2021 2:30PM – 4:30PM
Examiners: Professor Andrew Parnell
and the internal examiners
Calculators are allowed, provided they are non-graphical, non-programmable, and have at most two lines of display
Answer all questions
Approximate marks for parts of questions are shown in brackets
You have TWO hours to complete this paper
Please enter your anonymous number: ________________________
1. [5 Marks]
Assume that the weights (in grams) of chocolates in boxes of 'celebrations' chocolates in the UK
are approximately normally distributed with N(μ=5.8, σ=1.3). Also assume that only
those chocolates that weigh between 5 grams and 7 grams can be included for sale in
the boxes.
State the probability that a randomly chosen chocolate can be included for sale in the
celebrations boxes.
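A minimal sketch of how the probability in question 1 could be checked numerically, assuming σ=1.3 is the standard deviation (not the variance). The helper name `normal_cdf` is illustrative and not part of the paper.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 5.8, 1.3      # mean and standard deviation stated in the question
lower, upper = 5.0, 7.0   # weight limits in grams

# P(5 < X < 7) is the difference of the two cumulative probabilities
prob_for_sale = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)
print(round(prob_for_sale, 3))
```

The same value can be read from standard normal tables after converting the limits to z-scores, z = (x − μ)/σ.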
2. [5 Marks]
You are given the following dataset:
Name    Gender  Income
Peter   0       £1278.45
Kim     1       £1478.46
Ann     1       £1350.00
John    0       £1450.12
State clearly the size and dimensionality of this dataset. Outline fully the type of each
variable present in the dataset.
3. [5 Marks]
A medical study collected information on 189 newborn babies at Baystate Medical
Center, Springfield, Mass during 1986. The figure below shows a plot of each baby's
birthweight (grams) against their mother's birthweight (lbs). The correlation coefficient
between these variables is ρ=0.18.
Indicate the direction and strength of the correlation between mother's birthweight and
baby's birthweight.
4. [10 Marks]
A company is deciding whether to set up a new book store in Belfast and wishes to
understand the reading habits of the population of students at Queen's University
Belfast. They decide to take a sample of the population to collect data on the number
and type of books read by each student in 2020.
You are tasked with advising on sampling strategies.
1. Present an outline of 3 different sampling approaches, explaining each in
context (using your own words).
2. Suppose the company wishes to stock both academic textbooks and fiction
books. Provide a recommendation to the company on a suitable sampling
strategy and fully explain your logic.
5. [10 Marks]
A forester records a dataset with observations on the diameter (inches) and volume
(cubic feet) of 70 pine trees. The first 6 observations are displayed in the table below.
Diameter  Volume
4.4       2.0
4.6       2.2
5.0       3.0
5.1       4.3
5.1       3.0
5.2       2.9
The forester wishes to model the relationship between diameter and volume of pine
trees and fits a linear regression model of volume on diameter.
The (i) fitted linear model parameters and (ii) a regression plot with least squares
regression line are displayed below.
1. State the estimated simple linear regression equation and provide an interpretation of
the intercept and slope parameters.
2. If the diameter of a certain pine tree is 52 inches, estimate its volume. Show your
working.
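The arithmetic behind part 2 can be sketched as follows, using the coefficient estimates reported in Output (i) and assuming volume is the response and diameter the predictor. The function name `predict_volume` is illustrative only.

```python
# Coefficient estimates taken from Output (i)
intercept = -48.5681   # estimated volume (cubic feet) at diameter 0
slope = 6.8367         # estimated change in volume per extra inch of diameter

def predict_volume(diameter_inches):
    """Fitted value from the simple linear regression line."""
    return intercept + slope * diameter_inches

predicted = predict_volume(52.0)
print(round(predicted, 2))
```

Note that a diameter of 52 inches may lie well outside the range of the data, so any such prediction is an extrapolation and should be treated with caution.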
Output (i)
           Estimate   Std. Error  t value  Pr(>|t|)
Intercept  -48.5681   3.4267      -12.13   <2e-16
Diameter   6.8367     0.2877      23.77    <2e-16
Residual standard error: 9.875 on 68 degrees of freedom.
Multiple R squared: 0.8926.
Adjusted R squared: 0.891.
Output (ii)
[Figure: regression plot with least squares regression line]
6. [10 Marks]
Suppose that you wish to test the assumption that 10% of smartphone users in Northern
Ireland use Huawei.
You have access to the following information collected from a sample of 2600
smartphone users:
Phone     Count
Apple     803
Samsung   436
Huawei    532
HTC       318
LG        410
Motorola  101
Carry out an appropriate test to investigate whether 10% of smartphone users in
Northern Ireland use Huawei at the 5% significance level. State the relevant hypotheses,
describe your approach, state your results, and provide an interpretation of your results
in the context of the question.
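One possible approach to question 6 is a two-sided one-sample z-test for a proportion (H0: p = 0.10 against H1: p ≠ 0.10). The sketch below assumes the normal approximation is adequate for a sample of this size; other tests could also be defended.

```python
import math

# Counts from the table in the question
counts = {"Apple": 803, "Samsung": 436, "Huawei": 532,
          "HTC": 318, "LG": 410, "Motorola": 101}

n = sum(counts.values())            # total sample size
p_hat = counts["Huawei"] / n        # observed proportion of Huawei users
p0 = 0.10                           # proportion under the null hypothesis

# Standard error is computed under H0
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(n, round(p_hat, 4), round(z, 2), p_value)
```

A p-value below 0.05 would lead to rejecting H0 at the 5% significance level.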
7. [10 Marks]
Consider an experiment to compare the effects of two sleeping drugs A and B.
There are 7 subjects and each subject receives treatment with each of the two drugs
(the order of treatment being randomised). The number of hours slept by each subject
is recorded and is given in the table below:
Subject  Hours slept using A  Hours slept using B
1        9.9                  8.7
2        8.8                  6.4
3        9.1                  7.8
4        8.1                  6.8
5        7.9                  7.9
6        12.4                 11.4
7        13.5                 11.7
At the 0.05 significance level, state whether there is evidence to suggest that there is a
difference between the effects of the two drugs.
NOTE: make sure that your answer contains the following:
1. The null and alternative hypotheses,
2. The corresponding test statistic for this hypothesis test,
3. The p-value of calculated test statistic,
4. The conclusion you would make about the null hypothesis based on your above
results in the context of this study.
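Because each subject receives both drugs, a paired t-test on the within-subject differences is one natural choice for question 7. The sketch below assumes the differences are approximately normal. The helper `t_two_sided_p` is a hypothetical name; it integrates the t density numerically with Simpson's rule, since the Python standard library has no t-distribution function.

```python
import math
import statistics

hours_a = [9.9, 8.8, 9.1, 8.1, 7.9, 12.4, 13.5]
hours_b = [8.7, 6.4, 7.8, 6.8, 7.9, 11.4, 11.7]

# Within-subject differences (A minus B)
diffs = [a - b for a, b in zip(hours_a, hours_b)]
n = len(diffs)
df = n - 1

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)             # sample standard deviation
t_stat = mean_diff / (sd_diff / math.sqrt(n))

def t_two_sided_p(t_value, dof, steps=2000):
    """Two-sided p-value for Student's t via Simpson's rule (steps must be even)."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    pdf = lambda x: c * (1 + x * x / dof) ** (-(dof + 1) / 2)
    b = abs(t_value)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_area = total * h / 3              # P(0 < T < |t|)
    return 1 - 2 * central_area

p_value = t_two_sided_p(t_stat, df)
print(round(t_stat, 3), df, round(p_value, 4))
```

In an exam setting the p-value would normally be bracketed using t tables with 6 degrees of freedom instead.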
8. [25 Marks]
A survey was conducted on a random sample of 1301 adult Americans in 2013 and found
that 147 people had donated blood in the previous year. Suppose that you want to test
whether more than 10% of all adult Americans had given blood in the previous 12
months.
1. Outline the appropriate statistical test including hypotheses and justify your answer.
2. How many standard deviations above the hypothesised value is the observed value of
the sample proportion? Describe your logic.
3. Using your answer from part 2, determine the p-value. Show how you reached this
value.
4. What conclusion would you draw at the 5% significance level about whether this data
provides evidence that more than 10% of all adult Americans had given blood in the
previous 12 months? Explain your answer.
5. Suppose that you cannot afford to take such a large sample next year and you must
use a smaller sample. What effect will this have on the size of your confidence
interval?
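Parts 2 and 3 can be sketched as a one-sided z-test for a proportion (H0: p = 0.10 against H1: p > 0.10), assuming the normal approximation holds. The variable names are illustrative.

```python
import math

n = 1301          # sample size
donors = 147      # respondents who donated blood in the previous year
p0 = 0.10         # hypothesised proportion

p_hat = donors / n
se = math.sqrt(p0 * (1 - p0) / n)     # standard deviation of p_hat under H0

# Number of standard deviations the observed proportion lies above p0
z_score = (p_hat - p0) / se

# One-sided (upper tail) p-value from the standard normal distribution
p_value_one_sided = 0.5 * math.erfc(z_score / math.sqrt(2))

print(round(p_hat, 4), round(z_score, 3), round(p_value_one_sided, 4))
```

For part 5, the standard error scales with 1/√n, so a smaller sample gives a larger standard error and hence a wider confidence interval.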
9. [20 Marks]
Consider the following data where 20 pigs are assigned at random among 4
experimental groups and each group is fed a different diet. A farmer wishes to
investigate whether there is a significant statistical difference in weight between pigs
on the different diets.
Diet  Weight
1     60.8
1     57.1
1     65.0
1     58.7
1     61.8
2     68.3
2     67.7
2     74.0
2     66.3
2     69.9
3     102.6
3     102.2
3     100.5
3     97.5
3     98.9
4     87.9
4     84.7
4     83.2
4     85.8
4     90.3
i. Which statistical test should the farmer use to answer their research question? Include
the relevant hypotheses in your answer.
ii. The output for the relevant statistical test for pig weight and diet type is provided below.
Interpret the p-value and provide a written explanation to the farmer what it means in
the context of the research question.
           Df  Sum sq  Mean sq  F value  Pr(>F)
Diet       3   4703    1567.7   206.7    5.28e-13
Residuals  16  121     7.6
iii. The farmer asks whether it is possible to identify which diets differ significantly. State
any modifications that need to be made to the significance level of the test. Assuming
that α = 0.05, show what the modified significance level should be for determining
which pairs of groups have significantly different means.
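A minimal sketch of the adjustment in part iii, assuming a Bonferroni correction is the intended modification (the paper does not name the method):

```python
import math

alpha = 0.05
n_groups = 4

# Number of distinct pairwise comparisons among the diets: C(4, 2)
n_pairs = math.comb(n_groups, 2)

# Bonferroni-adjusted significance level for each pairwise test
alpha_adjusted = alpha / n_pairs

print(n_pairs, round(alpha_adjusted, 5))
```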
iv. Perform a further investigation on whether diet categories 1 and 2 have statistically
significant differences between the pig mean weights and state your findings by
interpreting the resulting p-value.
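One way to carry out part iv is a pooled-variance two-sample t-test on diets 1 and 2, sketched below. This is an assumption: an alternative would pool the variance across all four groups using the residual mean square from the ANOVA table. The helper `t_two_sided_p` is a hypothetical name that integrates the t density with Simpson's rule.

```python
import math
import statistics

diet1 = [60.8, 57.1, 65.0, 58.7, 61.8]
diet2 = [68.3, 67.7, 74.0, 66.3, 69.9]

n1, n2 = len(diet1), len(diet2)
df = n1 + n2 - 2

# Pooled variance from the two sample variances
pooled_var = ((n1 - 1) * statistics.variance(diet1)
              + (n2 - 1) * statistics.variance(diet2)) / df
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (statistics.mean(diet1) - statistics.mean(diet2)) / se

def t_two_sided_p(t_value, dof, steps=2000):
    """Two-sided p-value for Student's t via Simpson's rule (steps must be even)."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    pdf = lambda x: c * (1 + x * x / dof) ** (-(dof + 1) / 2)
    b = abs(t_value)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)

p_value = t_two_sided_p(t_stat, df)
print(round(t_stat, 3), df, round(p_value, 4))
```

The resulting p-value would then be compared with the adjusted significance level for pairwise comparisons, not with 0.05 directly.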