Tutorial 3 - scatter plots

Data Handling & Analysis
BD7054
Scatter Plots
Andrew Jackson
a.jackson@tcd.ie
Scatter plot data
• How are two measures
related?
• Are they correlated?
• Does one cause an
effect in the other?
• What is the
relationship?
Develop a hypothesis
• What is the hypothesis
about these data?
• What is the null
hypothesis?
Covariance and Correlation
• Both the x and y data vary
in some way
• The question is: do they co-vary?
– Are large x values associated
with large y values? (positive
covariance)
– Or large x with small y
(negative covariance)
• Calculate a statistic called
the “correlation coefficient”
(r) which takes values
-1 <= r <= +1
• Test r against a statistical
distribution
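As a rough sketch of the calculation described above, the covariance, the correlation coefficient r and a test statistic can be computed by hand. The data and the function names (`covariance`, `pearson_r`, `t_statistic`) are made up for illustration and are not part of the tutorial. The sketch assumes the usual test of r against zero, which compares t = r * sqrt((n - 2) / (1 - r^2)) with a t distribution on n - 2 degrees of freedom.

```python
import math


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Correlation coefficient r: covariance scaled to lie between -1 and +1."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Made-up illustrative data (not from the tutorial)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov = covariance(x, y)    # positive: large x tends to go with large y
r = pearson_r(x, y)       # roughly 0.77
t = t_statistic(r, len(x))  # compare against a t distribution, 3 d.f.

print(f"covariance = {cov:.2f}, r = {r:.3f}, t = {t:.3f}")
```

The p-value itself needs the t distribution, which a statistics package would normally supply.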
Let's ask a different question
• Instead of...
• Is there a relationship
between x and y?
• I want to know...
• What is the relationship
between x and y?
– Fitting a mathematical
line to the data will tell
us what the relationship
likely is
The equation of a line
• Mathematicians use
– Y = mX + c
• Statisticians use
– y = b1x + b0
– b1 is the slope of the line
– b0 is the intercept (the value of y when x=0)
• To calculate the coefficients use
– b1 = (y2-y1)/(x2-x1), for any two points (x1, y1) and (x2, y2) on the line
– b0 = y-b1x, for any point (x, y) on the line
To calculate the coefficients
• b1 = (y2-y1)/(x2-x1), for any two points (x1, y1) and (x2, y2) on the line
• b0 = y-b1x, for any point (x, y) on the line
• NB b0 can often be
estimated visually from
the graph
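A minimal sketch of these two formulas, assuming we are given two points that lie exactly on the line. The function name `line_through` and the example points are hypothetical.

```python
def line_through(p1, p2):
    """Return (b1, b0) for the straight line passing through points p1 and p2."""
    x1, y1 = p1
    x2, y2 = p2
    if x2 == x1:
        raise ValueError("a vertical line has no finite slope")
    b1 = (y2 - y1) / (x2 - x1)  # slope: change in y per unit change in x
    b0 = y1 - b1 * x1           # intercept: value of y when x = 0
    return b1, b0


# Two made-up points on the line y = 2x + 3
b1, b0 = line_through((1, 5), (3, 9))
print(b1, b0)  # 2.0 3.0
```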
Different slopes
• Y = b1X + b0
[Figure: three lines on axes X (0 to 10) and Y (-10 to 10), labelled b1 > 0, b1 = 0 and b1 < 0]
Different intercepts
• Y = b1X + b0
• Parallel lines
[Figure: parallel lines differing only in their intercept b0; axes X (0 to 10) and Y (-5 to 20)]
Sample data
Return to Interaction Strengths
• The computer fits the
line by minimising the
residuals off the line
• Strictly it (usually)
minimises the sum of
the squares of the
residuals
• $\sum_i (\hat{y}_i - y_i)^2$
– $\hat{y}_i$ is the predicted y value
– $y_i$ is the observed y value
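To make the idea of minimising the sum of squared residuals concrete, here is a small sketch. It uses the standard closed-form least-squares solution for a straight line (slope = sum of cross-products of deviations divided by sum of squared x deviations), which is an assumption about what "the computer" does, since the slides do not say which software or algorithm is used. The data and the function names `fit_line` and `sum_squared_residuals` are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares estimates (b1, b0) for the line y = b1*x + b0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the means
    return b1, b0


def sum_squared_residuals(x, y, b1, b0):
    """Sum over all points of (predicted y - observed y) squared."""
    return sum(((b1 * xi + b0) - yi) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (not from the tutorial)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, b0 = fit_line(x, y)
ss_fit = sum_squared_residuals(x, y, b1, b0)

# Any other line, for example y = 1.0*x + 1.0, gives a larger sum of squares
ss_other = sum_squared_residuals(x, y, 1.0, 1.0)

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, SS = {ss_fit:.2f}, other SS = {ss_other:.2f}")
```

Comparing `ss_fit` with `ss_other` shows the sense in which the fitted line is "best": no other straight line gives a smaller sum of squared residuals for these data.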
Residuals
• Informative as it tells us which data are larger
than predicted, and which are lower
• Should ideally be normally distributed around
the line
– Test this with visual plots like histograms or q-q
plots
• Should be evenly spread around the line with
no obvious trend
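One way to sketch the q-q check is to pair the sorted, standardised residuals with the quantiles expected from a normal distribution. The residual values are made up, the function name `qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are just one common convention. Here a residual is taken as observed minus predicted, so a positive value means the observation is larger than predicted.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each sorted, standardised residual with its expected normal quantile."""
    n = len(residuals)
    centre = mean(residuals)
    spread = stdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    # Expected quantiles of a standard normal at positions (i + 0.5) / n
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Made-up residuals (observed y minus predicted y)
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

points = qq_points(residuals)
for expected, actual in points:
    print(f"expected {expected:6.2f}   observed {actual:6.2f}")
```

If the residuals are close to normally distributed, these pairs fall roughly along a straight line when plotted against each other.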
Regression model assumptions
• Inherently assume a straight line relationship
• The residuals, or errors, are assumed to be
normally distributed
– Need to test this
– And make sure they are evenly spread above and
below the line along its length
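A rough sketch of what a violated straight-line assumption looks like in the residuals. A line is fitted to deliberately curved, made-up data (y = x squared), and the mean residual is then compared across the lower, middle and upper thirds of x. The function names and the choice of thirds are illustrative assumptions, not a formal test.

```python
def fit_line(x, y):
    """Least-squares estimates (b1, b0) for the line y = b1*x + b0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return b1, mean_y - b1 * mean_x


def mean_residual_by_third(x, y, b1, b0):
    """Mean of (observed - predicted) in the lower, middle and upper thirds of x."""
    pairs = sorted((xi, yi - (b1 * xi + b0)) for xi, yi in zip(x, y))
    n = len(pairs)
    k = n // 3
    groups = [pairs[:k], pairs[k:n - k], pairs[n - k:]]
    return [sum(res for _, res in group) / len(group) for group in groups]


# Made-up curved data: a straight line is the wrong model here
x = list(range(1, 10))
y = [xi ** 2 for xi in x]

b1, b0 = fit_line(x, y)
means = mean_residual_by_third(x, y, b1, b0)
print([round(m, 2) for m in means])  # positive, negative, positive
```

Residuals that sit above the line at both ends and below it in the middle are not evenly spread along its length, which suggests the relationship is not a straight line.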
Computer Session