Uploaded by juliaclade

Statistics Assignment: Observational Studies & Regression

Student: Julia Clade
ID: JCL186
Stat101 – 23S1
Assignment 1
Question 1
a) This is an observational study. The person conducting the study isn’t controlling
any explanatory variables; they are merely observing and recording the values.
b) We can use the data in this study to make inferences on all patients that had an
elective surgical procedure to biopsy or remove a lesion of the lung, colon,
breast, ovary, or uterus that was found to be non-cancerous.
c) We are comparing two categorical variables. The three plots we can choose from to
display the data are Side-by-Side/Clustered Bar Charts, Segmented/Stacked Bar
Charts or 100% Stacked Bar Charts. Which one is best suited depends on the
comparison we want to make.
Stacked Bar Charts are great for comparing totals per category, while Side-by-Side
Bar Charts allow better comparisons within a category. 100% Stacked Bar
Charts allow the viewer to compare percentages within categories easily.
d) To display the relationship between two quantitative variables, we use a
scatterplot. The explanatory/independent variable (Fibre) will be on the x-axis
and the response/dependent variable (Cholesterol) will be displayed on the y-axis.
We can add a regression line if a clear correlation above 0.4 exists.
However, in this example the correlation is 0.154 and a regression line is not
The curve above is skewed to the right. An appropriate measure of centre for this
data is the median. The mean is pulled in the direction of the skew and
will therefore be larger than the median. The mean is not resistant to skew,
but it can be used as a measure of centre for symmetric,
bell-shaped distributions.
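
The effect of right skew on the mean can be illustrated with a small sketch. The sample values below are hypothetical and chosen only to produce a long right tail; they are not taken from the assignment data.

```python
from statistics import mean, median

# Hypothetical right-skewed sample (illustrative values only, not the study data).
# Most values are small, with a few large values forming the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]

sample_mean = mean(sample)      # pulled upwards by the large values 9 and 20
sample_median = median(sample)  # middle of the sorted values, unaffected by the tail

print(f"mean = {sample_mean}, median = {sample_median}")
```

In this made-up sample the mean lies above the median, which is the pattern described for a right-skewed distribution.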
We can see in this graph that, in general, Calorie intake per day decreases with age.
This is seen by comparing the median Calorie intake per day between the
age groups. In general, we also see a wider range and interquartile range of
Calorie intake in the groups from age 25 to 74 compared to the 18–24 and 75+ groups.
Question 2
Sample number = n = 496
b) $\hat{p} = 15/496 = 0.030$ (3 dp)
$\hat{p} = 40/89 = 0.449$ (3 dp)
d) $\hat{p}_e - \hat{p}_c = 21/496 - 89/496 = 0.042 - 0.179 = -0.137$ (3 dp)
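
The arithmetic for these sample proportions can be checked with a short sketch. The counts are the ones given in the answers above; the helper name `proportion` and the variable names are illustrative assumptions.

```python
def proportion(count, total):
    """Return a sample proportion rounded to 3 decimal places."""
    return round(count / total, 3)

n = 496  # sample size from Question 2

p_b = proportion(15, n)    # proportion in part b)
p_c = proportion(40, 89)   # proportion from the following line (out of 89)

# Difference in proportions from part d), rounded only at the end
diff = round(21 / n - 89 / n, 3)

print(p_b, p_c, diff)
```

Rounding only at the final step avoids carrying rounding error from the intermediate values 0.042 and 0.179 into the difference.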
Question 3
The explanatory variable is ‘CO2’ and the response variable is ‘Temp’.
There is a positive linear association between ‘Temp’ and ‘CO2’.
The correlation is r = 0.749, which we consider a strong positive correlation as
the value of r falls between 0.6 and 0.8.
The line slopes upwards, so we speak of a positive trend.
d) As there is a strong correlation between the ‘Temp’ and ‘CO2’ variables, we can
fit a linear regression model to this association. It is important to always plot
your data. A visual check can confirm whether a linear regression
model is appropriate and whether there are outliers that could influence the
regression line.
e) $\widehat{\text{Temp}} = -3.593 + 0.011 \cdot \text{CO2}$
This formula uses the intercept and slope calculated by StatKey.
f) For every 1 ppm increase in the x-value ‘CO2’, the predicted y-value ‘Temp’
increases by 0.011.
g) The y-intercept (temperature difference) of -3.593 is the predicted value when
zero CO2 is observed. This is not a meaningful value because observed CO2
concentrations in the atmosphere are well above zero at all times.
The lowest observed CO2 value in our dataset is around 340 ppm, so zero ppm CO2
in the atmosphere is well outside of our range of measurements.
h) We could make a prediction outside of this data range; however, this isn’t usually
recommended. If we try to predict the temperature difference when CO2 is
420 ppm, we can use our linear regression formula from e):
$\hat{y} = -3.593 + 0.011 \cdot 420 = 1.027$
As we are looking at a value outside of the observed range of values provided in
the dataset, this is an extrapolation.
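
The slope interpretation, the prediction and the extrapolation check can be sketched in a few lines. The intercept, slope and the lower bound of about 340 ppm come from the answers above. The upper bound `OBSERVED_MAX` is a placeholder assumption, because the largest CO2 value in the dataset is not stated here, and the function names are illustrative.

```python
INTERCEPT = -3.593   # from the regression equation in e)
SLOPE = 0.011        # from the regression equation in e)

OBSERVED_MIN = 340   # approximate lowest CO2 value (ppm) mentioned in g)
OBSERVED_MAX = 410   # ASSUMPTION: placeholder; replace with the dataset's actual maximum

def predict_temp(co2_ppm):
    """Predicted temperature difference for a given CO2 concentration (ppm)."""
    return INTERCEPT + SLOPE * co2_ppm

def is_extrapolation(co2_ppm, observed_min=OBSERVED_MIN, observed_max=OBSERVED_MAX):
    """True when the x-value lies outside the observed CO2 range."""
    return not (observed_min <= co2_ppm <= observed_max)

# A 1 ppm increase in CO2 changes the prediction by the slope
slope_check = round(predict_temp(401) - predict_temp(400), 3)

# Prediction from part h)
prediction = round(predict_temp(420), 3)

print(slope_check, prediction, is_extrapolation(420), is_extrapolation(0))
```

With the placeholder upper bound, both 420 ppm and 0 ppm are flagged as extrapolations, matching the reasoning in g) and h). The flag for 420 ppm depends on the assumed maximum and should be rechecked against the real data.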