A Brief Introduction

A. Data (variables). Can be in three forms:
1. Interval: There is a common scale to measure the variable, so that the gap between 1 and 2 is the same size as the gap between 2 and 3. Where the scale also has a true zero, a value of two is actually twice a value of one (not true of degrees Fahrenheit, where 40 is not twice as warm as 20). Examples: % of vote, degrees Fahrenheit, number killed, duration of regime, number of soldiers, GDP
2. Ordinal: There is a rank-ordering to the variable, so 2 > 1, but the distance between ranks is not fixed, so 2 is not necessarily twice 1. Examples: Yes/No variables, how close a bill is to passage (no houses, one house, both houses, signature), war outcomes (win, lose, or draw)
3. Nominal: There are numbers, but they are completely arbitrary. Examples: country codes, leader names, strategy choices, apples and oranges.

Dependent variable (DV)
1. Examples include % of the two-party Presidential vote, % seats held by Dems, war/non-war, political (in)stability, etc.
2. Easiest to have a continuous (interval) DV, but techniques exist for all three types

Independent variables (IVs)
1. Can be either interval or ordinal. So…
2. Transform nominal into ordinal. Example: Is this country the US? A nominal variable (USA) becomes an ordinal one (Yes or No).
3. Again, examples in syllabus

The relationship between an IV and the DV can take three forms:
1. Positive (or direct) correlation: the values of the IV and DV move up and down together (poverty and crime, CO2 and global temperature, drug addiction and prostitution, geographic proximity and conflict)
2. Negative (or inverse) correlation: the values of the IV and DV move in opposite directions (alcohol and coordination, democracy and interstate conflict, war and development)
3. Conditional: Direction depends on the value of some other variable

Statistical Relationships
A. Simplest tool: the scatterplot or scatter diagram.
Example from medicine: A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns. The following data set provides information on 15 pregnant mothers who were contacted for this study:

BMI (kg/m²)    Birth weight (kg)
20             2.7
30             2.9
50             3.4
45             3.0
10             2.2
30             3.1
40             3.3
25             2.3
50             3.5
20             2.5
10             1.5
55             3.8
60             3.7
50             3.1
35             2.8

Scatter diagram plots bivariate observations (X, Y)
BMI (the IV) is X and birth weight (the DV) is Y
◦ Y is the dependent variable (Dependent goes Down the side)
◦ X is the independent variable (goes across the graph)
[Figure: Scatter diagram of BMI and Birthweight; X axis from 0 to 70, Y axis from 0 to 4]

People tend to mentally fit a line or curve to describe the shape of the scatterplot
[Figure: example scatterplots of Y against X showing strong relationships, weak relationships, no relationship, linear relationships, and curvilinear relationships]

The fitted line:
1. Intended to simplify the relationship.
2. The line is ultimately an estimate, usually known to be wrong (but close enough to be useful)
3. The line is probabilistic, not deterministic: otherwise it would pass perfectly through every point on the scatterplot. This is the key difference between predicting politics and predicting planetary orbits.
Kepler’s equations are deterministic, but econometric models are probabilistic

[Figure: sample scatterplot of Y against X, both axes from 0 to 60]
How would you draw a line through the points? How do you determine which line ‘fits best’?

Regression = using an equation to find the line (or curve) that most closely fits the data
a. Relationship between variables is a linear function:
Y = β0 + β1X + ε
◦ Y = dependent variable
◦ X = independent (explanatory or control) variable
◦ β0 = constant, or Y-intercept
◦ β1 = coefficient of X, or slope
◦ ε = random error

It should….
[Figure: "High School Teacher" slide showing the line Y = mX + b, where m = slope = (change in Y) / (change in X) and b = Y-intercept]
As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept
◦ One definition of m is that for every one-unit increase in x, there is an m-unit increase in y
◦ One definition of b is the value of y when x is equal to zero
[Figure: the line y = 1.5x + 4, plotted for x from 0 to 12]

Look at the data in this picture
◦ Does there seem to be a correlation (linear relationship) in the data?
◦ Is the data perfectly linear?
◦ Could we fit a line to this data?
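As a rough illustration of what fitting a line involves, the sketch below applies the standard closed-form least-squares formulas to the 15 BMI and birth-weight observations listed earlier. The function name fit_line is illustrative only, the pairing of each BMI value with the birth weight in the same position is assumed from the order of the data, and least squares is assumed as the fitting criterion.

```python
# BMI (kg/m^2) and birth weight (kg) for the 15 mothers in the example data set.
# Assumption: the i-th BMI value belongs with the i-th birth weight.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
birth_weight = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3,
                3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(bmi, birth_weight)
print(f"birth weight = {slope:.4f} * BMI + {intercept:.4f}")
```

Read in the high-school form y = mx + b, slope plays the role of m (the change in birth weight per one-unit change in BMI) and intercept plays the role of b (the predicted birth weight when BMI is zero, which lies outside the observed data).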
[Figure: scatterplot of the data, x from 0 to 12, y from 0 to 25]

Linear regression tries to find the best line (curve) to fit the data
◦ The equation of the line is Y = β0 + β1X
◦ The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points
[Figure: the same scatterplot with the fitted line y = 1.5x + 4]

a. Find the values of β0 and β1 that minimize the sum of the squared vertical distances from the line to each of the points.
b. This is the same as minimizing the sum of the εi²

Why minimize squared errors?
◦ ‘Best fit’ means the differences between actual Y values and predicted Y values are a minimum
◦ But positive differences offset negative ones! (errors of 10 and -10 add to zero)
◦ Squaring the errors solves the problem: 10 * 10 = 100 and -10 * -10 also = 100.

For each observation i, the equation is merely an estimate, not the actual value. There are errors (εi), and the line minimizes the sum of ε1², ε2², ε3², ε4², ε5², and so on.
Yi = β0 + β1Xi + εi
[Figure: regression line of Y on X with the estimated errors ε̂1 to ε̂4 drawn as vertical distances between the points and the line]

Regression formula: Y = a + bX, Y = α + βX, Y = α + β1X1, Y = β0 + β1X1, etc. are all the same formula!
• Y = the predicted value of the dependent variable (its estimated mean given X)
• a (or alpha: α, or beta-zero: β0) = the Y intercept, or the value of Y when X = 0 (constant)
• b (or beta: β) = the regression coefficient, the slope of the regression line, or the amount of change produced in Y by a unit change in X
  ◦ Positive sign of regression coefficient: positive direction of association
  ◦ Negative sign of regression coefficient: negative direction of association
• X = the value of the independent variable

What is:
◦ Y?
◦ X?
◦ β1?
◦ β0?

Typical formula: Y = β0 + β1X1 + β2X2 + β3X3, etc.
• DV, constant haven’t changed
• But now there are several independent variables
• Each IV has its own coefficient. So the first X may be positively related to Y, while the others might be negatively related to Y.
• Could plot the effect of any one independent variable on Y as a line, but can no longer plot the whole equation since there are now as many dimensions as there are independent variables (plus one, for Y).
• Multivariate regression is best interpreted by consulting tables of coefficients, evaluating the effect of each X separately (i.e. all else being equal)

1. R²: Proportion of the variation in the dependent variable (Y) that is explained by the independent variable (X)
◦ R² = Explained variation / Total variation
◦ Ranges between 0 (no reduction in error) and 1 (no errors remain: the model perfectly predicts the dependent variable)
◦ R² is a comparative measure: it compares the amount of error made by the linear regression to the amount of error made by guessing the mean (average) value of Y for every case (e.g. Y = 12 for every case)
◦ It compares how much unexplained variation remains when you know X (i.e. how well your line fits the data) to how much variation there is when you don’t know X (which means you just predict the mean of Y for every case).
[Figure: Internet use (Y, hours per week) against education level (X), shown first with the regression line ("Predicted Y") and then as the variance without regression]
[Figure: Good fit: y = 1.9599x + 0.2823, R² = 0.9369]
[Figure: Poorer fit: y = 1.9696x + 0.5683, R² = 0.811]

a. Statistical significance of the regression model uses one of a number of indicators (χ², for example). No need to understand the indicator to interpret it. Look for a “p value” associated with the indicator.
b. Statistical significance of each regression coefficient (β1, for example). Also measured by a p value.
c. Key is to find p and see if p < .05 (in the social sciences). If yes, statistically significant. If no, not statistically significant.
The p value is the probability that random data (i.e. no real relationship with Y) would have coincidentally given you an association at least this strong.
Hence, lower values of p are “better.”
• Authors sometimes say “significant at the .001 level.” This means p < .001.
• There may or may not be a table of p values for coefficients: authors frequently use asterisks to highlight coefficients at a given level of significance.
• If the model is not significant, the author has failed to discover a significant correlation between the model’s predicted values of Y and the actual values of Y.
• If a coefficient is not significant, then the author has failed to discover a significant correlation between that particular X and Y.

• “p < .23 so the relationship is statistically insignificant, and therefore I conclude that X doesn’t affect Y”: Not true, because p could be .001. All we know is that it is less than .23. In other words, absence of evidence is not evidence of absence. Indeed, when the number of cases is very small, all of the p values, even for real relationships, are likely to be too large to make the coefficients statistically significant
• “p < .000001 so the relationship between X and Y is very strong”: Not true, because the p value for any nonzero coefficient (no matter how tiny) tends to become smaller as the number of cases increases. With millions of cases, just about every relationship is “statistically significant,” but many are substantively trivial

Substantive significance: this depends on what you are looking for!
• What units are X and Y measured in?
• Does the coefficient mean that small increases in X lead to large increases in Y? If statistically significant, this is also substantively significant
• Does the coefficient mean that large increases in X only produce trivial changes in Y?
Then regardless of statistical significance, the relationship is substantively uninteresting
• This is a qualitative judgment based on your needs, but it takes into account the numbers

Research hypothesis: The level of economic development has a positive effect on civil liberties in countries of the world
Dependent variable: civil liberties
◦ Interval-ratio
Independent variable: GDP per capita ($1000)
◦ Measure of the level of economic development
◦ Interval-ratio

• Regression coefficient (beta) = .257
• Substantive significance
  ◦ An increase of $1000 in the level of GDP per capita is associated with an increase of .257 in the civil liberties score.
  ◦ On a 5-point scale, this is interesting. On a 1000-point scale it would not be interesting.
• Statistical significance: p < .001. Statistically significant at the .001 or .1% level
• R square = .525: GDP per capita explains 52.5% of the variation in civil liberties
• Research hypothesis: was not falsified by bivariate regression analysis (i.e. was consistent with the regression). The level of economic development has a positive and statistically significant effect on civil liberties

Linear regression predicts best near the mean values of X. Extreme values of X (low or high) are associated with greater error when predicting Y.
◦ Solution: Confidence intervals. A 95% confidence interval is where 95% of observations of Y at a given value of X are expected to fall, given the uncertainty in the estimated coefficient of X. (Strictly speaking, an interval for individual observations is called a prediction interval; a confidence interval for the line itself is narrower.)
◦ Example: Polls with “margins of error” (typically 95% confidence intervals)
[Figure: another example of confidence intervals]

Predicting Y from its own past values (also known as “time series analysis”)
A. Simplest form: Yt = Yt-1 + α
◦ Y is the DV, t is time, and α is a constant
◦ If Yt-1 is 100 and α is 1, then Y will be 101, 102, 103, etc. as time passes
◦ Note that this is simply a rearranged linear regression equation.
The DV is predicted by previous values of the DV (which fill in as the IVs in the model)
Form: Yt = βYt-1 + α
◦ β is the multiplicative relationship between Yt-1 and Yt
◦ Setting the constant α aside (α = 0) and taking Y to be positive: if β = 1, then Y never changes over time
◦ If β > 1 then Y increases over time
◦ If 0 < β < 1 then Y diminishes over time

1. Time’s arrow: Since cause must precede effect, time series analysis can be used to rule out the possibility that Y causes X
2. Autocorrelation: Sometimes we need to address the correlation of a variable with itself over time. Example: to predict the defense budget, the first thing to know is that it’s usually similar to last year’s budget. Then one can add IVs that might cause it to increase or decrease.
3. Omitted variable bias: Failing to “control” for a relevant IV (one that may correlate with both X and Y) can generate “false positives”: statistically significant relationships between variables that are causally unrelated (example: high correlation between Vietnam vets and supermarkets)

A. Is the relationship causal? Difficult to know for sure…
1. Possibility of coincidence: Addressed by requiring models to be statistically significant.
Chance remains, but is low.
2. Sources of bias:
a. Y causes X. That is, perhaps the researcher has reversed the DV and IV. Use time-series analysis to rule this out.
b. Faulty data: But only if the data is biased in some manner that makes X and Y correlate. Random noise is already accounted for. Example of bias = serial autocorrelation, or correlation across time. Many things (kids and dogs) grow larger over time. But the height of your kid does not cause your dog to get bigger!
c. Omitted variables: suppose Z causes X and Z causes Y. Then X and Y will appear to be causally related when in fact they are merely correlated. Adding Z to the model would reveal that X has no independent effect on Y.

Forecasting with regression
1. Requires either
a. The ability to forecast the IVs themselves, or
b. A model that forecasts Y(t) from IVs in t-1, t-10, etc.
2. Long-term forecasting models are rare. Why?

How to make a forecast:
1. Find a linear regression (OLS) that forecasts something
2. Find the future values of X
3. Plug these into the equation
4. Multiply each X with its corresponding β (order of operations)
5. Add it all together. Don’t forget the intercept.
6. Presto! You have a forecast for Y!
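The forecasting steps above can be sketched in a few lines of code. This is a minimal illustration, not a full forecasting tool: the function name forecast is made up for this example, the first call reuses the line y = 1.5x + 4 from the earlier figure, and the coefficients and future X values in the second call are hypothetical numbers rather than estimates from real data.

```python
def forecast(intercept, coefficients, future_xs):
    """Forecast Y as intercept + sum of (coefficient * future X value)."""
    if len(coefficients) != len(future_xs):
        raise ValueError("need exactly one future X value per coefficient")
    # Step 4: multiply each X by its corresponding coefficient.
    products = [b * x for b, x in zip(coefficients, future_xs)]
    # Step 5: add it all together, and don't forget the intercept.
    return intercept + sum(products)


# One IV, using the line y = 1.5x + 4: forecast Y when the future X is 10.
single = forecast(4.0, [1.5], [10.0])

# Three IVs with hypothetical coefficients and hypothetical future X values.
multi = forecast(2.0, [0.5, -1.0, 0.25], [4.0, 1.0, 8.0])

print(single, multi)
```

The forecast is only as good as the future X values fed into it, which is why the text notes that forecasting requires either forecasting the IVs themselves or using a model built on lagged IVs.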