Chapter Three: Introduction to Regression

Summary

• To fit a straight line by eye to bivariate data, make sure there are an equal number of points above and below the fitted line.
• The 3-median regression line is used mainly as a graphical technique. It is useful when outliers exist, as they affect the result less than they would with least squares regression.
• To fit a 3-median regression line:
  o Count the points and split them, from left to right, into three groups. If the number of points is a multiple of 3, split them evenly. If not, split them symmetrically, e.g. 3 5 3 or 4 5 4.
  o In each group, find the median of the x-values and the median of the y-values.
  o This gives three ordered pairs (xL, yL), (xM, yM) and (xR, yR), where L = left, M = middle and R = right.
  o Draw a dotted line between (xL, yL) and (xR, yR).
  o If using the graphical approach: move this line, keeping it at the same slope, ⅓ of the way towards the point (xM, yM). The gradient of the line is the gradient of the line between the two points (xL, yL) and (xR, yR). Find the y-intercept by reading it off the graph or by using y = mx + c.
  o If using the arithmetic approach: calculate the gradient (m) of the line using the rule m = (yR - yL) / (xR - xL). Calculate the y-intercept (c) of the line using the rule c = ⅓ [(yL + yM + yR) - m(xL + xM + xR)]. The equation of the regression line is y = mx + c.
  o If using the graphics calculator, the function is labelled Med-Med. Enter the x-variable in L1 and the y-variable in L2. Press STAT, select CALC and then 3:Med-Med. Substitute the values given into the equation y = ax + b (or whichever form your calculator uses).
• Least squares regression can be used when the data show a linear relationship and have no obvious outliers.
• The least squares regression line can be found using LinReg(ax + b) on your graphics calculator. Remember to enter the x-values (independent variable) in L1 and the y-values (dependent variable) in L2.
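The arithmetic approach to the 3-median line described above can be sketched in Python as a way of checking hand working. This is only an illustrative sketch: the function name three_median_line and the sample data are made up, and the way leftover points are shared out when the count is not a multiple of 3 (one extra point goes to the middle group, two extra points go one to each outer group) is an assumption that may differ from the split your text uses.

```python
from statistics import median


def three_median_line(points):
    """Return (m, c) of the 3-median line y = m*x + c for a list of (x, y) pairs."""
    pts = sorted(points)  # order the points from left to right by x-value
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")

    base, extra = divmod(n, 3)
    # ASSUMPTION: symmetric split. A remainder of 1 goes to the middle group;
    # a remainder of 2 is shared between the two outer groups.
    outer = base + (1 if extra == 2 else 0)
    groups = (pts[:outer], pts[outer:n - outer], pts[n - outer:])

    # The median x and the median y are found separately within each group.
    (xL, yL), (xM, yM), (xR, yR) = [
        (median(x for x, _ in g), median(y for _, y in g)) for g in groups
    ]

    m = (yR - yL) / (xR - xL)                        # gradient from the outer points
    c = ((yL + yM + yR) - m * (xL + xM + xR)) / 3    # y-intercept rule
    return m, c


# Hypothetical data with an outlier at x = 9.
data = [(1, 2), (2, 1), (3, 4), (4, 5), (5, 4), (6, 7), (7, 9), (8, 8), (9, 20)]
m, c = three_median_line(data)
print(f"m = {m:.3f}, c = {c:.3f}")
```

For this data the three median points are (2, 2), (5, 5) and (8, 9), so m = 7/6 and c = ⅓ [16 - (7/6)(15)] = -0.5. The outlier (9, 20) does not change the result, because only the median of each group is used.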
• To calculate the least squares regression line y = mx + c by hand, use m = r × sy / sx and c = ȳ - m × x̄, where r is the correlation coefficient, sx and sy are the standard deviations of the x-values and y-values, and x̄ and ȳ are their means.
• Remember that if you are asked to state your response in terms of the variables, don’t just write, for example, y = 2x + 4. Instead write, for example, Height of wife = 2 × height of husband + 4.
• Interpretation means stating the results in terms of the actual variables rather than just x and y (as in the dot point above).
  o The slope (m) indicates the rate at which the y-variable increases or decreases for each one-unit increase in the x-variable.
  o The y-intercept (c) indicates the approximate value of the y-variable when x = 0.
• Interpolation is the use of the regression line to predict values ‘in between’ two values already in the data set.
• Extrapolation is the use of the regression line to predict values outside the range of the given data set.
• Interpolation is more reliable than extrapolation.
• A residual is the vertical distance from the regression line to a data point.
• To calculate residuals:
  o Calculate the predicted value of y from the regression equation.
  o Calculate the difference between the original value and this predicted value (residual = actual y - predicted y).
• When residuals have been plotted:
  o If the plot shows points randomly scattered above and below zero, then the original data probably have a linear relationship.
  o If the plot shows some sort of pattern, then there is probably not a linear relationship between the original data sets.
• Linear regression might produce a ‘good’ fit to a set of data, but the data may still be non-linear. To remove as much of this non-linearity as possible, the data can be transformed. The x-values, the y-values or both may be transformed to make the relationship more linear. This enables more accurate predictions.
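Returning to the residual calculation described above, the sketch below fits a least squares line and then lists the residuals. It is a sketch under stated assumptions: the function names and the data are hypothetical, and the slope is computed as Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², which is the standard least squares result but may be written in a different form from the rule in your text.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (m, c) of the least squares line y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope
    c = y_bar - m * x_bar    # the line passes through (x_bar, y_bar)
    return m, c


def residuals(xs, ys, m, c):
    """Residual = actual y minus the y predicted by the line."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]


# Hypothetical data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = least_squares_line(xs, ys)
for x, res in zip(xs, residuals(xs, ys, m, c)):
    print(f"x = {x}: residual = {res:+.2f}")
```

Here x̄ = 3 and ȳ = 4, giving m = 6/10 = 0.6 and c = 4 - 0.6 × 3 = 2.2. The residuals of a least squares line always add to zero (up to rounding), which is a quick check on hand calculations before plotting them.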
• We study six transformations:
  o Logarithmic transformations: y versus log10(x); log10(y) versus x
  o Quadratic transformations: y versus x²; y² versus x
  o Reciprocal transformations: y versus 1/x; 1/y versus x
• To choose the correct transformation: examine the points on a scatterplot with high values of x and/or y (i.e. away from the origin) and decide for each axis whether it needs to be stretched or compressed to make the points line up.
• As there are at least two possible transformations for any non-linear scatterplot, you choose which one to use by correlation. The least squares regression equation with r closest to -1 or 1 should be considered most appropriate.
• See p150-153 for examples on performing transformations.
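Choosing between transformations by correlation, as described above, can be sketched as follows. The function names, the labels and the data are hypothetical, and the sketch assumes all x-values and y-values are positive so that the logarithm and reciprocal are defined.

```python
from math import log10, sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def compare_transformations(xs, ys):
    """Return {label: r} for the six transformations (assumes x > 0 and y > 0)."""
    transformed = {
        "y vs log10(x)": ([log10(x) for x in xs], ys),
        "log10(y) vs x": (xs, [log10(y) for y in ys]),
        "y vs x^2": ([x ** 2 for x in xs], ys),
        "y^2 vs x": (xs, [y ** 2 for y in ys]),
        "y vs 1/x": ([1 / x for x in xs], ys),
        "1/y vs x": (xs, [1 / y for y in ys]),
    }
    return {label: pearson_r(tx, ty) for label, (tx, ty) in transformed.items()}


def best_transformation(xs, ys):
    """Pick the transformation whose r is closest to -1 or 1."""
    rs = compare_transformations(xs, ys)
    return max(rs, key=lambda label: abs(rs[label]))


# Hypothetical data that follow y = x^2 exactly.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
for label, r in compare_transformations(xs, ys).items():
    print(f"{label}: r = {r:.4f}")
print("Most appropriate:", best_transformation(xs, ys))
```

Because the sample data follow y = x² exactly, the "y vs x^2" transformation gives r = 1 and is selected. With real data, the values of r are often close together, so it is still worth checking the residual plot of the transformed data.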