Student: Julia Clade ID: JCL186
Stat101 – 23S1 Assignment 1

Question 1
a) This is an observational study. The person conducting the study isn’t controlling any explanatory variables; they are merely observing and recording the values.
b) We can use the data in this study to make inferences about all patients who had an elective surgical procedure to biopsy or remove a lesion of the lung, colon, breast, ovary, or uterus that was found to be non-cancerous.
c) We are comparing two categorical variables. The three plots we can choose from to display the data are side-by-side (clustered) bar charts, segmented (stacked) bar charts, and 100% stacked bar charts. Which one is best suited depends on what we want to show. Stacked bar charts are good for comparing totals per category, while side-by-side bar charts allow better comparisons within a category. 100% stacked bar charts let the viewer compare percentages within categories easily.
d) To display the relationship between two quantitative variables, we use a scatterplot. The explanatory/independent variable (Fibre) will be on the x-axis and the response/dependent variable (Cholesterol) will be on the y-axis. We can add a regression line if a clear correlation above 0.4 exists. However, in this example the correlation is 0.154, so a regression line is not appropriate.
e) The curve above is skewed to the right. An appropriate measure of centre for this data is the median. The mean is pulled in the direction of the skew and will therefore be larger than the median. The mean is not resistant to skew, but it can be used as a measure of centre for symmetric, bell-shaped distributions.
f) The graph shows that, in general, calorie intake per day decreases with age. We can see this by comparing the median calorie intake per day between the age groups.
We also see a wider range and interquartile range of calorie intake in the groups aged 25 to 74 than in the 18–24 and 75+ year-olds.

Question 2
a) Sample size: n = 496
b) ^p = 15/496 = 0.030 (3dp)
c) ^p = 40/89 = 0.449 (3dp)
d) ^pe - ^pc = 21/496 - 89/496 = 0.042 - 0.179 = -0.137 (3dp)

Question 3
a) The explanatory variable is ‘CO2’ and the response variable is ‘Temp’.
b) There is a positive linear association between ‘Temp’ and ‘CO2’. The correlation is r = 0.749, which we consider a strong positive correlation because r falls between 0.6 and 0.8. The line slopes upwards, so we speak of a positive trend.
c)
d) Because there is a strong correlation between the ‘Temp’ and ‘CO2’ variables, we can fit a linear regression model to this association. It is important to always plot the data. A visual check confirms whether a linear regression model is appropriate and whether there are outliers that could influence the regression line.
e) ^Temp = -3.593 + 0.011⋅CO2
This formula uses the intercept and slope calculated by StatKey.
f) For every 1 ppm increase in ‘CO2’, the predicted ‘Temp’ increases by 0.011.
g) The y-intercept of -3.593 is the predicted temperature difference when zero CO2 is observed. This is not a meaningful value, because observed CO2 concentrations in the atmosphere are well above zero at all times. The lowest observed CO2 value in our dataset is around 340 ppm, so zero ppm CO2 is well outside our range of measurements.
h) We could make a prediction outside of this data range, although this isn’t usually recommended. If we try to predict the temperature difference when CO2 is 420 ppm, we can use our linear regression formula from e):
^y = -3.593 + 0.011⋅420 = 1.027
As we are looking at a value outside the range of values observed in the dataset, this is an extrapolation.
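
As a check on the arithmetic in e), g) and h), the prediction can be sketched in Python. This is only a minimal sketch under stated assumptions: the function names predict_temp and is_extrapolation are illustrative and are not part of StatKey, the coefficients are the rounded values from e), and the lower bound of 340 ppm is the approximate value mentioned in g). The upper end of the observed range is not stated in these answers, so it is left as an optional argument.

```python
import math

# Rounded coefficients as reported in answer e).
INTERCEPT = -3.593
SLOPE = 0.011


def predict_temp(co2_ppm: float) -> float:
    """Predicted temperature difference for a CO2 concentration in ppm."""
    return INTERCEPT + SLOPE * co2_ppm


def is_extrapolation(co2_ppm: float,
                     observed_min: float,
                     observed_max: float = math.inf) -> bool:
    """True if co2_ppm lies outside the observed range [observed_min, observed_max].

    observed_max defaults to infinity because the largest observed CO2 value
    is not given here; supply it from the dataset for a full check.
    """
    return co2_ppm < observed_min or co2_ppm > observed_max


# Prediction from h): -3.593 + 0.011 * 420
print(round(predict_temp(420), 3))            # 1.027

# Intercept from g): zero CO2 is below the lowest observed value (about 340 ppm)
print(predict_temp(0))                        # -3.593
print(is_extrapolation(0, observed_min=340))  # True
```

Passing the dataset's largest CO2 value as observed_max would let the same check flag the 420 ppm prediction in h).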