Chapter 4: More About Relationships Between Two Variables

4.1 Transforming to Achieve Linearity
4.2 Relationship Between Categorical Variables
4.3 Establishing Causation

How do you determine if data is linear?
• Look at the graph (is it straight?)
• Look at the residual plot (is it scattered?)
• Look at the correlation coefficient, r (is it close to -1 or 1?)
If the answer to any of these questions is no, then a line is probably not a good fit and a curved function may be more appropriate.

Curved Functions Tested in AP Stats

Exponential Regression: $y = ab^x$
• (x, log y) is linear
• Linear regression on (x, log y) is $\log y = ax + b$
• Algebraically solve for $y$

Power Regression: $y = ax^b$
• (log x, log y) is linear
• Linear regression on (log x, log y) is $\log y = a\log x + b$
• Algebraically solve for $y$

In the linear equations, a and b are the slope and intercept of the fitted line. They are not the same numbers as the a and b in $y = ab^x$ or $y = ax^b$.

What if the data is not linear?
• Transform the data to determine whether exponential or power regression is appropriate.
• Run a linear regression on the transformed data.
• Perform an inverse transformation to turn the equation into exponential or power form.

Transform the Data to Determine if Exponential or Power Regression is Appropriate
1. Enter data into L1 and L2.
2. See that it is not linear (the scatterplot is curved, the residual plot is curved).
3. Transform the data into logarithms.
– Enter L3 = log x and L4 = log y.
4. Look at (x, log y) and (log x, log y) for linearity.
– If (x, log y) is linear, use exponential regression.
– If (log x, log y) is linear, use power regression.

Run a Linear Regression on the Transformed Data

Exponential
1. Run a linear regression on (x, log y): 4:LinReg L1, L4
2. Write as $\log y = ax + b$
3. Define your variables as they were originally (x = ?, y = predicted ?)

Power
1. Run a linear regression on (log x, log y): 4:LinReg L3, L4
2. Write as $\log y = a\log x + b$
3. Define your variables as they were originally (x = ?, y = predicted ?)
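The by-hand steps above can also be sketched off the calculator. The following is a minimal Python sketch, not the calculator's own routine: the data are made up to follow $y = 3 \cdot 2^x$ exactly, and the helper name `linreg` is hypothetical. It log-transforms the data, compares r for (x, log y) and (log x, log y), and back-transforms whichever line fits better.

```python
import math

def linreg(xs, ys):
    """Least-squares line ys = slope*xs + intercept, plus the correlation r."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Made-up illustrative data (an assumption, not from the notes): y = 3 * 2**x.
x = [1, 2, 3, 4, 5, 6]            # plays the role of L1
y = [3 * 2 ** xi for xi in x]     # plays the role of L2

log_x = [math.log10(v) for v in x]   # L3 = log x
log_y = [math.log10(v) for v in y]   # L4 = log y

# LinReg L1, L4 (exponential candidate) and LinReg L3, L4 (power candidate).
exp_slope, exp_intercept, r_exp = linreg(x, log_y)
pwr_slope, pwr_intercept, r_pwr = linreg(log_x, log_y)

if abs(r_exp) >= abs(r_pwr):
    # log y = slope*x + intercept  ->  y = 10**intercept * (10**slope)**x
    a, b = 10 ** exp_intercept, 10 ** exp_slope
    model = "exponential"
    equation = f"y = {a:.3f} * {b:.3f}^x"
else:
    # log y = slope*log x + intercept  ->  y = 10**intercept * x**slope
    a, b = 10 ** pwr_intercept, pwr_slope
    model = "power"
    equation = f"y = {a:.3f} * x^{b:.3f}"

print(f"r for (x, log y):     {r_exp:.4f}")
print(f"r for (log x, log y): {r_pwr:.4f}")
print(f"Better model: {model}, {equation}")
```

Comparing r alone is a shortcut. As the notes say, the scatterplot and residual plot of the transformed data should be checked as well.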
Perform an Inverse Transformation to Turn the Equation into Exponential or Power

In general, $\log y = ax + b$ becomes $y = 10^{b}(10^{a})^{x}$, which has the form $y = ab^x$, and $\log y = a\log x + b$ becomes $y = 10^{b}x^{a}$, which has the form $y = ax^b$.

Exponential Example
$\log y = 5x + 2$
$10^{\log y} = 10^{5x + 2}$
$y = 10^{5x} \cdot 10^{2}$
$y = (10^{5})^{x} \cdot 10^{2}$
$y = 100{,}000^{x} \cdot 100$
$y = 100(100{,}000)^{x}$

Power Example
$\log y = 5\log x + 2$
$\log y = \log x^{5} + 2$
$10^{\log y} = 10^{\log x^{5} + 2}$
$y = x^{5} \cdot 10^{2}$
$y = x^{5} \cdot 100$
$y = 100x^{5}$

Non-Linear Regression in the Calculator
1. Enter data into L1 and L2.
2. See that it is not linear (the scatterplot is curved, the residual plot is curved).
3. Run 0:ExpReg L1, L2, Y1
4. Run A:PwrReg L1, L2, Y1
5. See which fits the data better.
– Look at the graph and see which curve follows the data better.
– Look at $r$ and $r^2$ to see which line, (x, log y) or (log x, log y), fits the data better.
6. Write out the equation from the calculator (NO LOGS) and define x = ? and y = predicted ?

By Hand vs. The Calculator

x = L1, y = L2, log x = L3, log y = L4

| By hand | Calculator | Linear form | Final form |
|---|---|---|---|
| LinReg L1, L4 | ExpReg L1, L2 | $\log y = ax + b$ | $y = ab^x$ |
| LinReg L3, L4 | PwrReg L1, L2 | $\log y = a\log x + b$ | $y = ax^b$ |

Categorical Data in Two-Way Tables

| | Chocolate | Vanilla | Strawberry |
|---|---|---|---|
| Freshmen | 10 | 12 | 16 |
| Sophomores | 11 | 19 | 22 |
| Juniors | 25 | 7 | 13 |
| Seniors | 10 | 22 | 2 |

Marginal Distribution: the distribution of only one of the variables.
Find the marginal distribution of ice cream flavors.

Conditional Distribution: the distribution of one variable given a specific condition of the other variable.
Find the conditional distribution of grade level among those who prefer chocolate.

Use the table above to answer the following:
1. What percent of students like strawberry?
2. What percent of seniors like vanilla?
3. What percent of chocolate lovers are juniors?
4. What percent of students are freshmen?
5. What percent of students are vanilla-loving seniors?
6. What percent of upperclassmen like chocolate?
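The marginal and conditional distributions defined above come down to choosing the right denominator. The following is a minimal Python sketch using the counts from the ice cream table. The variable names are illustrative, and it reads "seniors who like vanilla" as a proportion of seniors and "vanilla-loving seniors" as a proportion of all students.

```python
# Counts copied from the two-way table (rows: grade level, columns: flavor).
flavors = ["Chocolate", "Vanilla", "Strawberry"]
counts = {
    "Freshmen":   [10, 12, 16],
    "Sophomores": [11, 19, 22],
    "Juniors":    [25, 7, 13],
    "Seniors":    [10, 22, 2],
}

# Grand total of all students in the table.
total = sum(sum(row) for row in counts.values())

# Marginal distribution of flavor: column totals divided by the grand total.
column_totals = {
    flavor: sum(row[j] for row in counts.values())
    for j, flavor in enumerate(flavors)
}
marginal_flavor = {f: column_totals[f] / total for f in flavors}

# Conditional distribution of grade level among chocolate lovers:
# each chocolate cell divided by the chocolate column total.
choc = flavors.index("Chocolate")
grade_given_chocolate = {
    grade: row[choc] / column_totals["Chocolate"]
    for grade, row in counts.items()
}

# Same cell, different denominators.
van = flavors.index("Vanilla")
seniors_vanilla = counts["Seniors"][van]
pct_seniors_who_like_vanilla = seniors_vanilla / sum(counts["Seniors"])  # of seniors
pct_students_vanilla_seniors = seniors_vanilla / total                   # of everyone

for flavor, p in marginal_flavor.items():
    print(f"{flavor}: {p:.1%}")
for grade, p in grade_given_chocolate.items():
    print(f"{grade} | Chocolate: {p:.1%}")
print(f"Seniors who like vanilla: {pct_seniors_who_like_vanilla:.1%}")
print(f"Students who are vanilla-loving seniors: {pct_students_vanilla_seniors:.1%}")
```

The last two values use the same count of 22 but divide by the senior row total in one case and by the grand total in the other, which is the difference between a conditional percent and a percent of the whole table.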
Simpson’s Paradox

Suppose two people, Lisa and Bart, are editors for the St. Louis Post Dispatch. Answer the following questions given the data below. Each entry is articles improved / articles edited.

| | Week 1 | Week 2 | Total |
|---|---|---|---|
| Lisa | 60 / 100 | 1 / 10 | 61 / 110 |
| Bart | 9 / 10 | 30 / 100 | 39 / 110 |

What percentage of the articles she edited did Lisa improve in Week 1? _________ Bart? _________
Who improved a higher percentage of articles in Week 1? ______________________

What percentage of the articles she edited did Lisa improve in Week 2? _________ Bart? _________
Who improved a higher percentage of articles in Week 2? ______________________

What percentage of the articles she edited did Lisa improve in total? _________ Bart? _________
Who improved a higher percentage of articles in total? ______________________

HOW CAN THIS BE??

In the first week, Lisa improves 60 percent of the articles she edits while Bart improves 90 percent of the articles he edits. In the second week, Lisa improves just 10 percent of the articles she edits, while Bart improves 30 percent. Both times, Bart improved a much higher percentage of articles than Lisa, yet when the two weeks are combined, Lisa has improved a much higher percentage than Bart!

| | Lisa | Bart |
|---|---|---|
| Week 1 | 60.0% | 90.0% |
| Week 2 | 10.0% | 30.0% |
| Total | 55.5% | 35.5% |

The totals are weighted by how many articles each editor handled each week. Lisa edited 100 of her 110 articles in Week 1, when both editors' percentages were high, while Bart edited 100 of his 110 articles in Week 2, when both were low. Lisa's total (61 / 110) is therefore pulled toward her Week 1 rate, and Bart's total (39 / 110) is pulled toward his Week 2 rate.

Establishing Causation