Data Handling & Analysis BD7054
Scatter Plots
Andrew Jackson
a.jackson@tcd.ie

Scatter plot data
• How are two measures related?
• Are they correlated?
• Does one cause an effect in the other?
• What is the relationship?

Develop a hypothesis
• What is the hypothesis about these data?
• What is the null hypothesis?

Covariance and Correlation
• Both the x and y data vary in some way
• The question is: do they co-vary?
– Are large x values associated with large y values? (positive covariance)
– Or large x with small y? (negative covariance)
• Calculate a statistic called the “correlation coefficient” (r), which takes values -1 ≤ r ≤ +1
• Test r against a statistical distribution

Let's ask a different question
• Instead of...
• Is there a relationship between x and y?
• I want to know...
• What is the relationship between x and y?
– Fitting a mathematical line to the data will tell us what the relationship likely is

The equation of a line
• Mathematicians use
– Y = mX + c
• Statisticians use
– y = b1x + b0
– b1 is the slope of the line
– b0 is the intercept (the value of y when x=0)
• To calculate the coefficients use
– b1 = (y2-y1)/(x2-x1), where (x1, y1) and (x2, y2) are two points on the line
– b0 = y-b1x, where (x, y) is any point on the line

To calculate the coefficients
• b1 = (y2-y1)/(x2-x1)
• b0 = y-b1x
• NB b0 can often be estimated visually from the graph

Different slopes
• Y = b1X + b0
[Figure: lines with b1 > 0, b1 = 0 and b1 < 0; Y (-10 to 10) against X (0 to 10)]

Different intercepts
• Y = b1X + b0
• Parallel lines
[Figure: parallel lines; Y (-5 to 20) against X (0 to 10)]

Sample data

Return to Interaction Strengths
• The computer fits the line by minimising the residuals off the line
• Strictly it (usually) minimises the sum of the squares of the residuals
• Σ (ŷi − yi)²
– ŷi is the predicted y value
– yi is the observed y value

Residuals
• Informative, as they tell us which data points are larger than predicted and which are smaller
• Should ideally be normally distributed around the line
– Test this with visual plots like histograms or q-q plots
• Should be evenly spread around the line with no obvious trend

Regression model assumptions
• Inherently assume a straight line relationship
• The residuals, or errors, are assumed to be normally distributed
– Need to test this
– And make sure they are evenly spread above and below the line along its length

Computer Session
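
A minimal sketch of the calculations covered in these slides, which could serve as a starting point for the computer session. It assumes Python, since the slides do not say which software the course uses. The x and y values are invented for illustration and are not the course data. The function names (correlation, fit_line, residuals) are hypothetical helpers written for this example. It computes the correlation coefficient r, fits y = b1x + b0 by least squares, and returns the residuals. Residuals are taken here as observed minus predicted, so positive values lie above the line; the sum of squares is the same whichever way round the difference is taken.

```python
import math

# Invented example data for illustration only (not the course data set).
x = [0, 2, 4, 6, 8, 10]
y = [1.0, 4.5, 9.5, 12.5, 17.5, 21.0]


def mean(values):
    return sum(values) / len(values)


def sums_of_squares(x, y):
    """Return (Sxy, Sxx, Syy): summed products of deviations from the means."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy, sxx, syy


def correlation(x, y):
    """Pearson correlation coefficient r, which lies between -1 and +1."""
    sxy, sxx, syy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope b1 and intercept b0 for the line y = b1*x + b0."""
    sxy, sxx, _ = sums_of_squares(x, y)
    b1 = sxy / sxx
    # The fitted line passes through the point (mean x, mean y).
    b0 = mean(y) - b1 * mean(x)
    return b1, b0


def residuals(x, y, b1, b0):
    """Observed y minus predicted y for each data point."""
    return [yi - (b1 * xi + b0) for xi, yi in zip(x, y)]


r = correlation(x, y)
b1, b0 = fit_line(x, y)
res = residuals(x, y, b1, b0)
rss = sum(e ** 2 for e in res)  # the quantity the fit minimises

print(f"r  = {r:.3f}")
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
print("residuals:", [round(e, 3) for e in res])
print(f"sum of squared residuals = {rss:.3f}")
```

Plotting a histogram or q-q plot of the residuals, and the residuals against x, would then be one way to check the assumptions listed above: roughly normal residuals, evenly spread along the line with no obvious trend.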