DSA8001 Exam

Time Table Code: DSA8001

MSc Examination
Data Science and Analytics

DSA8001 Data Analytics Fundamentals

Friday, 29th October 2021
2:30PM – 4:30PM

Examiners: Professor Andrew Parnell and the internal examiners

Calculators are allowed, provided they are non-graphical, non-programmable, and have at most two lines of display.

Answer all questions.
Approximate marks for parts of questions are shown in brackets.
You have TWO hours to complete this paper.

Please enter your anonymous number: ________________________

1. [5 Marks] Assume that the weights of chocolates in boxes of 'celebrations' chocolates in the UK are approximately normally distributed, N(μ = 5.8, σ = 1.3). Also assume that only those chocolates that weigh between 5 grams and 7 grams can be included for sale in the boxes. State the probability that a randomly chosen chocolate can be included for sale in the celebrations boxes.

2. [5 Marks] You are given the following dataset:

   Name     Peter      Kim        Ann        John
   Gender   0          1          1          0
   Income   £1278.45   £1478.46   £1350.00   £1450.12

State clearly the size and dimensionality of this dataset. Outline fully the type of each variable present in the dataset.

3. [5 Marks] A medical study collected information on 189 newborn babies at Baystate Medical Center, Springfield, Mass during 1986. The figure below shows a plot of each baby's birthweight (grams) against their mother's birthweight (lbs). The correlation coefficient between these variables is ρ = 0.18. Indicate the direction and strength of the correlation between mother's birthweight and baby's birthweight.

4. [10 Marks] A company is deciding whether to set up a new book store in Belfast and wishes to understand the reading habits of the population of students at Queen's University Belfast. They decide to take a sample of the population to collect data on the number and type of books read by each student in 2020. You are tasked with advising on sampling strategies.

   1. Present an outline of 3 different sampling approaches, explaining each in context (using your own words).
   2. Suppose the company wishes to stock both academic textbooks and fiction books. Provide a recommendation to the company on a suitable sampling strategy and fully explain your logic.

5. [10 Marks] A forester records a dataset with observations on the diameter (inches) and volume (cubic feet) of 70 pine trees. The first 6 observations are displayed in the table below.

   Diameter   Volume
   4.4        2.0
   4.6        2.2
   5.0        3.0
   5.1        4.3
   5.1        3.0
   5.2        2.9

The forester wishes to model the relationship between diameter and volume of pine trees and fits a linear regression model of volume on diameter. The (i) fitted linear model parameters and (ii) regression plot with the least squares regression line are displayed below.

   1. State the estimated simple linear regression equation and provide an interpretation of the intercept and slope parameters.
   2. If the diameter of a certain pine tree is 52 inches, estimate its volume. Show your working.

Output (i)

               Estimate   Std. Error   t value   Pr(>|t|)
   Intercept   -48.5681   3.4267       -12.13    <2e-16
   Diameter    6.8367     0.2877       23.77     <2e-16

   Residual standard error: 9.875 on 68 degrees of freedom.
   Multiple R squared: 0.8926. Adjusted R squared: 0.891.

Output (ii)

   [Regression plot with least squares regression line]

6. [10 Marks] Suppose that you wish to test the assumption that 10% of smartphone users in Northern Ireland use Huawei. You have access to the following information collected from a sample of 2600 smartphone users:

   Phone      Count
   Apple      803
   Samsung    436
   Huawei     532
   HTC        318
   LG         410
   Motorola   101

Carry out an appropriate test to investigate whether 10% of smartphone users in Northern Ireland use Huawei at the 5% significance level. State the relevant hypotheses, describe your approach, state your results, and provide an interpretation of your results in the context of the question.

7. [10 Marks] Consider an experiment to compare the effects of two sleeping drugs A and B.
There are 7 subjects and each subject receives treatment with each of the two drugs (the order of treatment being randomised). The number of hours slept by each subject is recorded and is given in the table below:

   Subject   Hours slept using A   Hours slept using B
   1         9.9                   8.7
   2         8.8                   6.4
   3         9.1                   7.8
   4         8.1                   6.8
   5         7.9                   7.9
   6         12.4                  11.4
   7         13.5                  11.7

At the 0.05 significance level, state whether there is evidence to suggest that there is a difference between the effects of the two drugs.

NOTE: make sure that your answer contains the following:

   1. The null and alternative hypotheses,
   2. The corresponding test statistic for this hypothesis test,
   3. The p-value of the calculated test statistic,
   4. The conclusion you would make about the null hypothesis based on your above results in the context of this study.

8. [25 Marks] A survey was conducted on a random sample of 1301 adult Americans in 2013 and found that 147 people had donated blood in the previous year. Suppose that you want to test whether more than 10% of all adult Americans had given blood in the previous 12 months.

   1. Outline the appropriate statistical test including hypotheses and justify your answer.
   2. How many standard deviations above the hypothesised value is the observed value of the sample proportion? Describe your logic.
   3. Using your answer from part 2, determine the p-value. Show how you reached this value.
   4. What conclusion would you draw at the 5% significance level about whether this data provides evidence that more than 10% of all adult Americans had given blood in the previous 12 months? Explain your answer.
   5. Suppose that you cannot afford to take such a large sample next year and you must use a smaller sample. What effect will this have on the size of your confidence interval?

9. [20 Marks] Consider the following data where 20 pigs are assigned at random among 4 experimental groups and each group is fed a different diet.
A farmer wishes to investigate whether there is a statistically significant difference in weight between pigs on the different diets.

   Diet   Weight      Diet   Weight
   1      60.8        3      102.6
   1      57.1        3      102.2
   1      65.0        3      100.5
   1      58.7        3      97.5
   1      61.8        3      98.9
   2      68.3        4      87.9
   2      67.7        4      84.7
   2      74.0        4      83.2
   2      66.3        4      85.8
   2      69.9        4      90.3

   i. Which statistical test should the farmer use to answer their research question? Include the relevant hypotheses in your answer.
   ii. The output for the relevant statistical test for pig weight and diet type is provided below. Interpret the p-value and provide a written explanation to the farmer of what it means in the context of the research question.

               Df   Sum sq   Mean sq   F value   Pr(>F)
   Diet        3    4703     1567.7    206.7     5.28e-13
   Residuals   16   121      7.6

   iii. The farmer asks whether it is possible to identify which diets differ significantly. State any modifications that need to be made to the significance level of the test. Assuming that α = 0.05, show what the modified significance level should be for determining which pairs of groups have significantly different means.
   iv. Perform a further investigation into whether the mean pig weights for diet categories 1 and 2 differ significantly, and state your findings by interpreting the resulting p-value.
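
For checking working on parts iii and iv of Question 9, the following is a minimal Python sketch rather than a model answer. It assumes a Bonferroni correction over all pairwise comparisons of the 4 diets, and a pairwise t-test that reuses the residual mean square (7.6) and residual degrees of freedom (16) from the output in part ii. Both choices are assumptions, since the paper does not prescribe a method. The helper names `t_pdf` and `t_two_sided_p` are hypothetical, and the p-value is obtained by numerically integrating the t density instead of reading statistical tables.

```python
import math

# Values taken from the question; the method itself is an assumption.
ALPHA = 0.05      # overall significance level given in part iii
K_GROUPS = 4      # number of diets
MSE = 7.6         # residual mean square from the output in part ii
DF_RESID = 16     # residual degrees of freedom from the same output

diet1 = [60.8, 57.1, 65.0, 58.7, 61.8]
diet2 = [68.3, 67.7, 74.0, 66.3, 69.9]

# Bonferroni: divide alpha by the number of pairwise comparisons, C(4, 2).
n_pairs = math.comb(K_GROUPS, 2)
alpha_adj = ALPHA / n_pairs


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density on [0, |t|] with Simpson's rule."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 2 * (0.5 - area))


mean1 = sum(diet1) / len(diet1)
mean2 = sum(diet2) / len(diet2)

# Standard error of the difference, using the pooled ANOVA mean square.
se = math.sqrt(MSE * (1 / len(diet1) + 1 / len(diet2)))
t_stat = (mean2 - mean1) / se
p_value = t_two_sided_p(t_stat, DF_RESID)

print(f"pairwise comparisons: {n_pairs}, adjusted alpha: {alpha_adj:.5f}")
print(f"mean diet 1: {mean1:.2f}, mean diet 2: {mean2:.2f}")
print(f"t = {t_stat:.3f} on {DF_RESID} df, two-sided p = {p_value:.6f}")
print("significant at adjusted level:", p_value < alpha_adj)
```

A two-sample t-test that uses only the ten observations from diets 1 and 2 would give different degrees of freedom and a slightly different standard error, so the choice of pooled variance should be stated in any written answer.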