– Working with relationships between two variables
• “Donation” made to teacher & Stats Test Score
[Scatterplot: Stats Test Score (0 to 100) vs. “donation” made to teacher ($0 to $80)]

Correlation & Regression
• Univariate & Bivariate Statistics
  – U: frequency distribution, mean, mode, range, standard deviation
  – B: correlation – two variables
• Correlation
  – linear pattern of relationship between one variable (x) and another variable (y)
  – an association between two variables
  – relative position of one variable correlates with relative distribution of another variable
• X: an explanatory variable attempts to explain the observed outcomes in Y
• Y: a response variable measures an outcome of a study
• Warning:
  – No proof of causality
  – Cannot assume x causes y

Scatterplot or Scatter Diagram
A plot of paired data to determine or show a relationship between two variables.

[Figure 3.1: Graduating Seniors by State in 2005 (percent taking SAT vs. score); labeled points: the state of Louisiana, the state of Rhode Island]

• Attributes of a good scatterplot
  – Consistent and uniform scale
  – Labels on both axes
  – Accurate placement of data
  – Data throughout the axes
  – Axis break lines if not starting at zero
• To achieve this goal you should try to do your scatterplots on graph paper.
AP Statistics, Section 3.1, Part 1

[Figure: Graduating Seniors by State in 2005; two groups: states from NE, Mid-Atlantic and West; states from Midwest, Mtn Central, and Southwest]

Paired Data
Miles traveled    2    5   12    7    7   15   10
Minutes           6    9   23   18   15   28   19

Scatter Diagram
[Scatter diagram: “Relationship between miles traveled and minutes”; x-axis: miles (0 to 20), y-axis: minutes (0 to 30)]

Linear Correlation
The general trend of the points seems to follow a straight line segment.
[Panels: Linear Correlation; Non-Linear Correlation; No Linear Correlation]

High Linear Correlation
Points lie close to a straight line.
[Panels: High Linear Correlation; Moderate Linear Correlation; Low Linear Correlation; Perfect Linear Correlation]

Questions Arising
• Can we find a relationship between x and y?
• How strong is the relationship?
[Scatter diagram: “Relationship between miles traveled and minutes”; x-axis: miles (0 to 20), y-axis: minutes (0 to 30)]

When there appears to be a linear relationship between x and y: attempt to “fit” a line to the scatter diagram.

When using x values to predict y values:
• Call x the explanatory variable
• Call y the response variable

Scatterplot!
• No Correlation
  – Random or circular assortment of dots
• Positive Correlation
  – ellipse leaning to right
  – GPA and SAT
  – Smoking and Lung Damage
  – Number of Whoppers eaten and Mr. Flynn’s weight
• Negative Correlation
  – ellipse leaning to left
  – Depression & Self-esteem
  – Studying & test errors
  – Vampire friends & Werewolf boyfriends

Interpreting Scatterplots
• Pattern/Shape: linear, parabola, bell shaped
  – Deviations from pattern: Are there areas where the data conform less to the pattern?
  – Form: Are there clusters of data?
  – Special data: Are there any influential points?
  – Is a transformation of data necessary?
• Trend/Direction: positive, negative, or WTF?
  – As x increases what happens to y?
• Strength/Association: weak, moderate, strong
  – If a line were drawn through the data, how close would the points be to the line?
  – Is there a small or large amount of variability within the y values?

Pearson’s Correlation Coefficient
• “r” indicates…
  – strength of relationship (strong, weak, or none)
  – the variation of the points around the model (linear)
  – direction of relationship
    • positive (direct) – variables move in same direction
    • negative (inverse) – variables move in opposite directions
• r ranges in value from –1.0 to +1.0
  [Scale: –1.0 = strong negative; 0.0 = no relationship; +1.0 = strong positive]
• Try quick estimates
  – Next slide and strange quiz

Practice with Scatterplots
[Four scatterplots; estimate r for each]
r = .__ __    r = .__ __    r = .__ __    r = .__ __

A relationship between the correlation coefficient, r, and the slope, b, of the least squares line:
$$r = b \, \frac{s_x}{s_y}$$
where $s_y$ = standard deviation of the y values and $s_x$ = standard deviation of the x values.

Linear correlation coefficient
$$-1 \le r \le +1$$

Calculating the Correlation Coefficient, r
$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$
where
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \qquad SS_x = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$$
and n = number of data pairs.

Find the Least Squares Line
x (Miles Traveled)   y (Minutes)    x²     xy
        2                 6           4     12
        5                 9          25     45
       12                23         144    276
        7                18          49    126
        7                15          49    105
       15                28         225    420
       10                19         100    190
Σx = 58   Σy = 118   Σx² = 596   Σxy = 1174

Finding the slope
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1174 - \frac{(58)(118)}{7} = 196.28571$$
$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 596 - \frac{58^2}{7} = 115.42857$$
$$\text{slope} = b = \frac{SS_{xy}}{SS_x} = \frac{196.28571}{115.42857} = 1.700495$$

Finding the y-intercept
$$\bar{y} = \text{mean of y values} = \frac{118}{7} = 16.857143$$
$$\bar{x} = \text{mean of x values} = \frac{58}{7} = 8.2857143$$
$$\text{y-intercept} = a = \bar{y} - b\bar{x} = 16.857143 - 1.700495\,(8.2857143) = 2.7673273$$

The equation of the least squares line is:
y = a + bx
y = 2.8 + 1.7x

To Compute r:
• Complete a table, with columns listing x, y, x², y², xy
• Compute SS_xy, SS_x, and SS_y
• Use the formula:
$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$

Find the Correlation Coefficient
x (Miles)   y (Min.)    x²     y²     xy
    2           6         4     36     12
    5           9        25     81     45
   12          23       144    529    276
    7          18        49    324    126
    7          15        49    225    105
   15          28       225    784    420
   10          19       100    361    190
Σx = 58   Σy = 118   Σx² = 596   Σy² = 2340   Σxy = 1174

Calculations:
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1174 - \frac{(58)(118)}{7} = 196.28571$$
$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 596 - \frac{58^2}{7} = 115.42857$$
$$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 2340 - \frac{118^2}{7} = 350.85714$$
$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} = \frac{196.28571}{\sqrt{(115.42857)(350.85714)}} = 0.9753643$$

The Correlation Coefficient, r = 0.9753643
r ≈ 0.98

Calculating Correlation
• The calculation of correlation is based on mean and standard deviation.
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
• Remember that both mean and standard deviation are not resistant measures.
AP Statistics, Section 3.2, Part 1

Calculating Correlation
• What do the contents of the parentheses look like?
  – The formula for calculating z-values.
• What happens when the values are both from the lower half of the population?
  – Both z-values are negative. Their product is positive.
• From the upper half?
  – Both z-values are positive. Their product is positive.

Calculating Correlation
• What happens when one value is from the lower half of the population but the other value is from the upper half?
  – One z-value is positive and the other is negative. Their product is negative.

Using the TI-83/84 to calculate r
• With Diagnostics ON:
  – Run LinReg(a+bx) [STAT > CALC > option 8] with the explanatory variable as the first list and the response variable as the second list.
• The results are the slope and vertical intercept of the regression equation (more on that later) and values of r and r².
  (More on r²: check next handout ;)

Predictive Potential
• Coefficient of Determination
  – r²
  – Amount of variance accounted for in y by x
  – Percentage increase in accuracy you gain by using the regression line to make predictions
  – Without correlation, you can only guess the mean of y
  – [Used with regression]
[Scale: 0% to 100% in steps of 20%]
Understanding r-squared activity

Limitations of Correlation
• linearity:
  – can’t describe (accurately) non-linear relationships
  – e.g., flavor and % eaten, thickness and strength
• truncation of range:
  – underestimate strength of relationship if you can’t see full range of x values
• no proof of causation
  – third variable problem:
    • could be a 3rd variable causing change in both variables
  – directionality: can’t be sure which way causality “flows”
• “We don’t get it”
  – what does it have to do with that f#$%@! line?
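The two formulas for r used in the correlation slides (the sums-of-squares version and the z-score version) can be checked against each other on the miles/minutes data. The following is a minimal sketch, assuming plain Python with only the standard library; the function names `sums_of_squares`, `r_from_sums`, and `r_from_z_scores` are illustrative and do not come from the slides.

```python
import math

# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]


def sums_of_squares(x, y):
    """Return (SS_xy, SS_x, SS_y) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi * xi for xi in x) - sum_x ** 2 / n
    ss_y = sum(yi * yi for yi in y) - sum_y ** 2 / n
    return ss_xy, ss_x, ss_y


def r_from_sums(x, y):
    """r = SS_xy / sqrt(SS_x * SS_y)."""
    ss_xy, ss_x, ss_y = sums_of_squares(x, y)
    return ss_xy / math.sqrt(ss_x * ss_y)


def r_from_z_scores(x, y):
    """r = (1 / (n - 1)) * sum of the products of paired z-values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    z_products = [
        ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)
    ]
    return sum(z_products) / (n - 1)


r_ss = r_from_sums(miles, minutes)
r_z = r_from_z_scores(miles, minutes)
print(f"r (sums of squares) = {r_ss:.4f}")
print(f"r (z-scores)        = {r_z:.4f}")
print(f"r squared           = {r_ss ** 2:.4f}")
```

Both routes should agree with the hand calculation (r ≈ 0.98), since the z-score formula is an algebraic rearrangement of the sums-of-squares formula. The z-score version makes it easier to see why points in the lower-left and upper-right of the plot push r toward +1.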
That is for another session…

Regression
• Regression: Correlation + Prediction
  – predicting y based on x
  – e.g., predicting…
    • throwing points (y)
    • based on distance from target (x)
• Regression equation
  – formula that specifies a line
  – y’ = a + bx
  – plug in an x value (distance from target) and predict y (points)
  – note
    • y = actual value of a score
    • y’ = predicted value
• Data Handout
  – Test takers, planets, darts

The Least-Square Regression
• Finds the best fit line by trying to minimize the areas of the squares formed by the differences between the real data and the values predicted by the model (the sum of the squared differences).
AP Statistics, Section 3.3, Part 1

The Least-Square Regression
• Statisticians use a slightly different version of “slope-intercept” form.
$$\hat{y} = a + bx$$
• Slope is the product of the r value and the std dev ratio:
$$b = r \, \frac{s_y}{s_x}$$
• Y-intercept is the value found using the avg x and avg y:
$$a = \bar{y} - b\bar{x}$$

Regression Graphic – Regression Line
[Scatterplot with regression line; x-axis: distance from target (8 to 26), y-axis: 0 to 120; Rsq = 0.6031]
• if x = 18 then… y’ = 47
• if x = 24 then… y’ = 20

Predicting Model
• To put the regression line on the graph, use Statistics:EQ:RegEQ from the Vars menu to paste the regression equation into Y1.
• Then you can use Trace or Table or Y1 to find response values that correspond to particular values of the explanatory variable.

Regression Equation
• y’ = a + bx
  – See STAT > CALC > LinReg(a+bx)
  – y’ = predicted value of y
  – b = slope of the line
  – x = value of x that you plug in
  – a = y-intercept (where line crosses y axis)
• In the dart throwing case…
  – y’ = 125.401 - 4.263(x)
• So if the distance is 20 feet
  – y’ = 125.401 - 4.263(20)
  – y’ = 125.401 - 85.26
  – y’ = 40.141

Drawing a Regression Line by Hand
Four steps
1. Use the y-intercept (if possible; does it have meaning = interval vs. ratio)
2. Plot the average point (mean x, mean y)
3. Choose a large value for x (just so it falls on the right end of the graph), plug it into the equation, then plot the resulting point
4. Connect the three points with a straight line!

Residuals
• It is important to note that the observed values almost never match the predicted values exactly
• The difference between the observed value and the predicted value has a special name: residual
$$\text{Residual} = y - \hat{y}$$
[Figure labels: Observed Value (y); Predicted Value (ŷ)]

Residual Plots
• You can plot the residuals to see if there are any trends in the quality of the predictive model
• Try looking in the List menu for “RESID”

Residual Plots
• This residual plot shows no tendencies. It is equally bad throughout.
• This suggests that the original relationship is linear.

“Pattern” = Not Linear
“Well Distributed” = Linear

Predictive Ability
• Mantra!!
  – As variability decreases, prediction accuracy __________
  – if we can account for variance, we can make better predictions
• As r increases (in absolute value):
  – r² increases
    • “variance accounted for” increases
    • the prediction accuracy increases
  – prediction error decreases (distance between y’ and y)
  – Sy decreases (here Sy is the standard error of the prediction, not the standard deviation of the y values)
    • the standard error of the residual/predictor
    • measures overall amount of prediction error
  – It can be thought of like this …

We like big r’s and we cannot lie!!!
You other brothers can’t deny!!!
Check out those residuals son
and plot em with your TI-84 on
Cause if they don’t look all scattered and patterned
then your least squared line is shattered
Then I only want that - if your scale and r squared is fat
So kick out those nasty outliers
When your correlation factor is on
BABY GOT STATS!

Thanks – Peace!
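As a recap of the regression slides, the least squares line, its predictions, and the residuals for the miles/minutes data can be reproduced in a few lines. This is a minimal sketch, assuming plain Python with only the standard library; the helper names `least_squares_line` and `predict` are illustrative and are not from the slides or the TI-83/84.

```python
# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]


def least_squares_line(x, y):
    """Return (a, b) for y' = a + b*x, with b = SS_xy / SS_x and a = mean_y - b*mean_x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi * xi for xi in x) - sum_x ** 2 / n
    b = ss_xy / ss_x
    a = sum_y / n - b * sum_x / n
    return a, b


def predict(a, b, x_value):
    """Predicted value y' for a given x on the line y' = a + b*x."""
    return a + b * x_value


a, b = least_squares_line(miles, minutes)

# Residual = observed y minus predicted y'.
residuals = [yi - predict(a, b, xi) for xi, yi in zip(miles, minutes)]

print(f"least squares line: y' = {a:.4f} + {b:.4f}x")
for xi, yi, res in zip(miles, minutes, residuals):
    print(f"x = {xi:2d}  y = {yi:2d}  y' = {predict(a, b, xi):6.3f}  residual = {res:+.3f}")

# Dart example from the slides: y' = 125.401 - 4.263x at a distance of 20 feet.
dart_points = predict(125.401, -4.263, 20)
print(f"dart example at 20 feet: y' = {dart_points:.3f}")
```

The residuals of a least squares line sum to zero up to rounding, so what matters in a residual plot is whether they show a pattern, not their total. Listing each residual next to its x value is a text-only stand-in for the calculator's RESID plot.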