Regression Analysis


UNC-Wilmington

Department of Economics and Finance

ECN 377

Dr. Chris Dumas

Regression Analysis

Regression Analysis is a method for using data to find the best estimates of the possible relationships among variables. Typically, we are interested in explaining, forecasting, or predicting the behavior of one or more variables, called the dependent, or "Y," variables, based on the behavior of one or more independent, or "X," variables. In Regression Analysis, the X variables are hypothesized to affect the Y variables. Regression Analysis is sometimes divided into two types, depending on the number of X variables involved in the analysis:

Simple Regression Analysis: Investigates the relationship between Y and one X variable.

Multiple Regression Analysis: Investigates the relationship between Y and two or more X variables.

The method used to conduct Regression Analysis is the same for both Simple and Multiple Regression Analysis; the terms "Simple" and "Multiple" are simply labels indicating the number of X variables involved.

OLS Regression Analysis

Several alternative mathematical techniques can be used to conduct Regression Analysis; in this handout we focus on the most commonly used technique, named Ordinary Least Squares regression analysis, or "OLS." (If you are investigating only one X variable, it would be Simple OLS Regression Analysis; if you have two or more X variables, it would be Multiple OLS Regression Analysis.)

Regression Analysis vs. Correlation Analysis

Regression Analysis is more general (can handle a wider variety of situations) than Correlation Analysis in that Regression Analysis can be used to investigate and quantify nonlinear as well as linear relationships among the variables; as you will recall, Correlation Analysis is limited to the investigation of linear relationships among the variables.

[Figure: example plots of Y against X illustrating a linear relationship among variables and several nonlinear relationships among variables.]


The OLS Regression Model

An OLS Regression Model is a formal, precise (that is, mathematical) description of a hypothesized relationship among the Y and X variables for a population of individuals under study. That is, we measure the values of the Y and X variables for each individual in a population under study, and then we use Regression Analysis to estimate the relationship among the Y and X variables. We are assuming that the relationship among the Y and X variables is the same for all individuals in the population: the particular values of Y and X may be different for the various individuals in the population, but the relationship between the Y and X variables is assumed to be the same for all individuals (in other words, the effect of variable X on variable Y is assumed to be the same for all individuals).

Although Regression Analysis can be used to investigate both linear and nonlinear relationships among the variables, as described above, the relationships are assumed to be linear in the parameters. Relationships that are linear in the parameters have the following mathematical form:

π‘Œ 𝑖

= 𝛽

0

The OLS Regression Model

+ 𝛽

1

βˆ™ 𝑋

1𝑖

+ 𝛽

2

βˆ™ 𝑋

2𝑖

+ 𝛽

3

βˆ™ 𝑋

3𝑖

+ β‹― + 𝑒 𝑖

• where Y_i is the dependent variable that we are trying to understand, predict, forecast, etc.,
• X_1i, X_2i, X_3i, etc. are independent variables that we can observe and that we hypothesize might have an effect on Y_i,
• β_0, β_1, β_2, etc. are parameters (that is, constants) that indicate the strength of the relationships between the various X variables and the Y variable,
• e_i is a "random error term," a variable that represents the combined effects on the Y variable of "variables outside the model," that is, variables that we do not observe (either because we can't, or just because we haven't taken the time or spent the money needed to observe them). In combination, the "variables outside the model" are assumed to have a random effect on Y, causing Y to vary randomly from what we would predict based on the values of the X's that are included in the model. For this reason, the e_i variable is also called the "error term" because it accounts for the errors in our prediction of Y that remain after we attempt to use the X's in the model to predict Y. The e_i variable accounts for the combined effects of:
  o errors due to the assumption that the relationship between the variables is linear in the parameters
  o errors due to X variables that are omitted from (left out of) the regression equation
  o errors in measuring/approximating/rounding the X and Y variables
  o errors in recording the data on X and Y
• and finally, subscript i is simply a label that is used to indicate that the relationship is hypothesized to hold for individual i in the population. Each individual in the population has its own, separate values for Y, the X's, and e, so each of these variables gets a subscript i; however, the β's are assumed to be the same for all individuals, so the β's do not receive i subscripts.

The OLS Regression Model equation above is said to be linear in the parameters because each parameter (that is, each β) enters the equation by either "standing alone" (for example, β_0 "stands alone") or by multiplying an associated X variable (for example, β_1 multiplies its associated variable X_1). In order for the equation to be linear in the parameters, no parameter can be located in an exponent position, inside a logarithm, inside a sine or cosine term in the equation, etc. However, the X variables can take on all kinds of crazy, nonlinear forms, and the equation will still remain linear in the parameters. For example, in place of variable X_1 in the equation above, we could insert log(X_1), (X_1)², (X_1)³, cosine(X_1), etc., and the equation would still meet the assumption of being linear in the parameters. To remain linear in the parameters, we must simply keep the β's out of any of these crazy forms that we might use for the X variables.
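To make the "linear in the parameters" idea concrete, here is a small illustrative sketch in Python (not part of the original handout; the data and variable names are made up). The model below is nonlinear in X_1, but it is still linear in the parameters, because each β either stands alone or multiplies a (possibly transformed) X column, so ordinary least squares can still estimate it.

import numpy as np

# Hypothetical sample data (n = 6 individuals)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y  = np.array([2.1, 3.9, 6.2, 8.8, 12.1, 15.7])

# The model Y = b0 + b1*log(X1) + b2*(X1)^2 + e is nonlinear in X1
# but linear in the parameters b0, b1, b2, so OLS applies.
Z = np.column_stack([np.ones_like(X1), np.log(X1), X1**2])

# Least-squares estimates of the parameters
b_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(b_hat)   # [estimate of b0, estimate of b1, estimate of b2]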


Deterministic vs. Stochastic Parts of the OLS Regression Model

In the OLS Regression Model equation below, the portion comprised of the β's and X's is called the deterministic part of the regression equation, and the portion comprised of simply e is called the stochastic part.

π‘Œ 𝑖

= 𝛽

0

+ 𝛽

1

βˆ™ 𝑋

1𝑖

+ 𝛽

2

βˆ™ 𝑋

2𝑖

+ 𝛽

3

βˆ™ 𝑋

3𝑖

+ β‹― + 𝑒 𝑖

“deterministic” part of “stochastic” part of regression model regression model

The deterministic part of the regression equation is simply the part that is not random. The stochastic part of the regression equation is simply the part that is random.

The Goal of OLS Regression Analysis

!!!! The goal of OLS Regression Analysis is to find the best possible estimates of the β's in the OLS Regression Model equation !!!!

Once we have the β's, we know the complete equation (because we already have the values of the X's and Y's from our data), except for the random error e, of course (which we hope will be small). (Actually, as described later, we will estimate the value of e and determine whether it is large or small.)

Population OLS Regression Model vs. Sample OLS Regression Model

The OLS regression model we have been discussing to this point is a model of the relationship between the Y and X variables based on data from all individuals in the population under study; this is the so-called population regression model. However, usually we don't have data for all individuals in a population; instead, we usually have data for only a sample of individuals. A regression model based on the data from a sample of individuals is called a sample regression model. From this perspective, the population regression model is the "true" relationship among the variables for the population, and the sample regression model is our estimate of the true relationship. Because our estimate could be (usually is) different from the truth, we use different symbols to represent the population regression model and the sample regression model:

π‘Œ 𝑖

Population OLS Regression Model

= 𝛽

0

+ 𝛽

1

βˆ™ 𝑋

1𝑖

+ 𝛽

2

βˆ™ 𝑋

2𝑖

+ 𝛽

3

βˆ™ 𝑋

3𝑖

+ β‹― + 𝑒 𝑖

π‘Œ 𝑖

Sample OLS Regression Model

= βΜ‚

0

+ βΜ‚

1

βˆ™ 𝑋

1𝑖

+ βΜ‚

2

βˆ™ 𝑋

2𝑖

+ βΜ‚

3

βˆ™ 𝑋

3𝑖

+ β‹― + eΜ‚ i

Notice that the only differences between the population regression model and the sample regression model are:

• the sample regression model uses "β̂'s" in place of "β's"
• the sample regression model uses "ê_i's" in place of "e_i's"

!!!! We use our sample data to estimate the "β̂'s" and "ê_i's" in the sample regression model. The "β̂'s" and "ê_i's" from the sample regression model are our estimates of the "β's" and "e_i's" in the population regression model !!!!


The OLS Regression Estimator Equations

An Estimator is a rule/procedure/method used to calculate an estimate of something.

The OLS Regression Estimator is the method/procedure used to calculate the β̂'s in the sample OLS regression equation. These β̂'s are our estimates of the β's in the population regression equation.

The OLS Regression Estimator method begins with the Sample OLS Regression Model equation:

π‘Œ 𝑖

= βΜ‚

0

+ βΜ‚

1

βˆ™ 𝑋

1𝑖

+ βΜ‚

2

βˆ™ 𝑋

2𝑖

+ βΜ‚

3

βˆ™ 𝑋

3𝑖

+ β‹― + eΜ‚ i

To see how the method works, let's consider an example with just one X variable:

Y_i = β̂_0 + β̂_1·X_1i + ê_i

The basic idea of the method is to find the values of β̂_0 and β̂_1 that make the errors in the equation, the ê_i's, as small as possible. (After all, if you're trying to estimate something, you want the error in your estimate to be as small as possible.) Because we are trying to make the ê_i's as small as possible, let's rearrange the equation above to focus on the ê_i's:

ê_i = Y_i − β̂_0 − β̂_1·X_1i

Now, for a given individual i from our sample, we could plug in the Y_i and X_1i values for this individual from our sample data and, if we plugged in "guesstimates" for β̂_0 and β̂_1, we could then calculate the ê_i. Our problem is to choose the best "guesstimates" for β̂_0 and β̂_1, that is, to choose values for β̂_0 and β̂_1 that make the ê_i's as small as possible. However, we want the ê_i's to be as small as possible for all individuals in the sample.

Remember that the same regression equation is assumed to apply to all individuals, so all individuals have the same values for β̂_0 and β̂_1. We could guess alternative values for β̂_0 and β̂_1, and, for each guess, calculate the value of ê_i for every individual in the sample, sum up (add up) the ê_i's, and then conclude that the "best" guesstimates for β̂_0 and β̂_1 are the ones that make the sum of the ê_i's as small as possible. However, if we were to do this, we would discover that the positive ê_i's cancel the negative ê_i's when we add them up, so the sum of the ê_i's can be zero (or very close to zero) even for badly chosen values of β̂_0 and β̂_1. The simple sum of the errors is therefore a poor measure of fit. So, to prevent the positive and negative ê_i's from canceling, we square them before we add them up . . .

So, in the end, the OLS Regression Estimator Equations are the solution to this problem: "Find the β̂'s that minimize the sum of the squared ê_i's for all individuals in the sample." Or, written as a math optimization problem for our simple example with just two β̂'s (β̂_0 and β̂_1):

min over β̂_0, β̂_1 of:  ∑_i (ê_i²)

The problem above is a nonlinear optimization problem without constraints, so we can use the classical calculus method of optimization to find a solution (recall the classical calculus method of optimization from your ECN321 course at UNCW).
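As an illustration (not from the handout), the same minimization can also be carried out numerically. The sketch below, using made-up data and hypothetical variable names, asks scipy.optimize.minimize to search for the β̂_0 and β̂_1 that make the sum of squared ê_i's as small as possible; the closed-form estimator equations given in the next section produce the same answer.

import numpy as np
from scipy.optimize import minimize

# Hypothetical sample data
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y  = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def sum_squared_errors(b):
    b0, b1 = b
    e_hat = Y - b0 - b1 * X1      # e_hat_i = Y_i - b0 - b1*X_1i
    return np.sum(e_hat ** 2)     # sum of squared errors across the sample

result = minimize(sum_squared_errors, x0=[0.0, 0.0])
print(result.x)   # numerical estimates of beta0_hat and beta1_hat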



After a lot of algebra and a bit of calculus (see the Appendix of this handout for details), we find:

The OLS Regression Estimator Equations for β̂_0 and β̂_1:

β̂_1 = [ ∑_i (Y_i · X_1i) − n · X̄_1 · Ȳ ] / [ ∑_i (X_1i²) − n · (X̄_1)² ]

β̂_0 = Ȳ − β̂_1 · X̄_1

where X̄_1 is the mean value of X_1, and Ȳ is the mean value of Y.

To use the OLS Estimator Equations, we calculate n, X̄_1, Ȳ, ∑_i (Y_i · X_1i), and ∑_i (X_1i²) from our sample data and insert these values into the equation for β̂_1, which gives an answer for β̂_1. Then, we insert the resulting value of β̂_1 (along with X̄_1 and Ȳ) into the equation for β̂_0 to find an answer for β̂_0.
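Here is a brief sketch of those steps in Python (added for illustration; the data are made up). It computes n, X̄_1, Ȳ, ∑(Y_i·X_1i), and ∑(X_1i²), plugs them into the equation for β̂_1, and then uses β̂_1 to get β̂_0.

import numpy as np

# Hypothetical sample data
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y  = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

n      = len(Y)
X1_bar = X1.mean()
Y_bar  = Y.mean()

# beta1_hat = [sum(Y_i*X_1i) - n*X1_bar*Y_bar] / [sum(X_1i^2) - n*(X1_bar)^2]
beta1_hat = (np.sum(Y * X1) - n * X1_bar * Y_bar) / (np.sum(X1**2) - n * X1_bar**2)

# beta0_hat = Y_bar - beta1_hat * X1_bar
beta0_hat = Y_bar - beta1_hat * X1_bar

print(beta0_hat, beta1_hat)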

Sidenote: If our regression equation had more X variables in it, such as X_2, X_3, etc., we could find similar equations for the corresponding β̂_2, β̂_3, etc., but this is much more easily done using methods from linear/matrix algebra, so for now just accept that this can be done (hey, we gotta save something for you to learn in grad school!).

Testing Hypotheses about β_0 and β_1

Recall that β̂_0 and β̂_1 are our best estimates of β_0 and β_1. The β̂_0 and β̂_1 estimates are calculated from only a sample of data (rather than from data for the full population), and so the values that we calculate for β̂_0 and β̂_1 depend on which particular individuals appear in the sample. Because β̂_0 and β̂_1 are estimates based on only a sample of data, they are subject to error and are probably not equal to the true values β_0 and β_1. However, using our estimates β̂_0 and β̂_1, we can test hypotheses about the true values β_0 and β_1 and develop confidence intervals for β_0 and β_1. To do so, we need to calculate the variances and standard errors of β̂_0 and β̂_1 (because the standard errors are needed in the t-test formula and the confidence interval formula). It can be shown that the variances are:

Variances of β̂_0 and β̂_1:

σ²_β̂0 = var(β̂_0) = σ²_e · [ ∑_i (X_1i²) ] / [ n · ∑_i (X_1i − X̄_1)² ]

σ²_β̂1 = var(β̂_1) = σ²_e · [ 1 / ∑_i (X_1i − X̄_1)² ]

In both equations above, σ²_e is the variance of the error term "e_i" in the population regression equation. Unfortunately, because the e_i's are unknown, σ²_e is also unknown. However, we can calculate the ê_i's from the sample regression equation (recall that the ê_i's are our estimates of the e_i's), and then, based on the ê_i's, we can calculate σ²_ê as shown below. σ²_ê is the variance of the error term ê_i in the sample regression equation.

!!!! We then use σ²_ê as our estimate of σ²_e !!!!

σ²_ê is our estimate of σ²_e:

σ²_ê = var(ê_i) = ∑_i (ê_i²) / (n − k)

where k is the number of β's in the regression equation.



So, we plug our estimates of β̂_0 and β̂_1 into our sample regression equation, then we use the sample regression equation to calculate the ê_i's for all of the individuals in our sample. We square these ê_i's, add them up, and divide by n − k; this gives us σ²_ê, which we use as our estimate of σ²_e. We substitute σ²_ê for σ²_e in the equations above (along with n, X_1i, and X̄_1) in order to calculate σ²_β̂0 and σ²_β̂1.

With the variances of β̂_0 and β̂_1 in hand, we can now calculate the standard errors of β̂_0 and β̂_1 by simply taking the square roots of the variances:

Standard Errors of β̂_0 and β̂_1:

s.e.(β̂_0) = √(σ²_β̂0)

s.e.(β̂_1) = √(σ²_β̂1)
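The sketch below (illustrative only, with made-up data) carries out those calculations: it computes the ê_i's from the fitted equation, estimates σ²_ê = ∑ê_i²/(n − k), and then forms the variances and standard errors of β̂_0 and β̂_1 using the formulas above.

import numpy as np

# Hypothetical sample data and OLS estimates (see the earlier estimator sketch)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y  = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n, k = len(Y), 2                      # k = number of betas (beta0 and beta1)

X1_bar, Y_bar = X1.mean(), Y.mean()
beta1_hat = (np.sum(Y * X1) - n * X1_bar * Y_bar) / (np.sum(X1**2) - n * X1_bar**2)
beta0_hat = Y_bar - beta1_hat * X1_bar

# Residuals and estimated error variance: sigma^2_ehat = sum(e_hat^2)/(n - k)
e_hat       = Y - beta0_hat - beta1_hat * X1
sigma2_ehat = np.sum(e_hat**2) / (n - k)

# Variances and standard errors of beta0_hat and beta1_hat
var_b0 = sigma2_ehat * np.sum(X1**2) / (n * np.sum((X1 - X1_bar)**2))
var_b1 = sigma2_ehat / np.sum((X1 - X1_bar)**2)
se_b0, se_b1 = np.sqrt(var_b0), np.sqrt(var_b1)
print(se_b0, se_b1)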

Now we can (finally) test hypotheses about β_0 and β_1 using t-tests. As is usually the case with t-tests, we calculate a t_test number and compare it with a t_critical number from the t-table. Typically, we are interested in testing the following two-sided hypotheses:

H0: β_0 = 0          H0: β_1 = 0
H1: β_0 ≠ 0          H1: β_1 ≠ 0

Note that we could test one-sided hypotheses instead of two-sided, if we wished. Also, we could test hypotheses about some other given number, instead of 0, by plugging in the given number in place of 0. As always, take the hypothesized values of the β's (in this case, zeros) from H0 and H1 above and insert them into the t_test formulas for β_0 and β_1.

t_test formulas for β̂_0 and β̂_1:

t_test = (β̂_0 − β_0) / s.e.(β̂_0)          t_test = (β̂_1 − β_1) / s.e.(β̂_1)

To do a t-test, we also need values for t_critical. The values for t_critical are found from the t-table using α/2 (because it's a two-sided test) and d.f. = n − k, where n = sample size, and k = the number of β's in the regression equation.

As usual, if t_test > t_critical (where t_test is farther from zero than t_critical), then we Reject H0 and Accept H1.
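As an illustrative sketch (hypothetical numbers, not from the handout), the two-sided t-test of β_1 = 0 can be carried out as follows; scipy.stats.t.ppf supplies the t_critical value that we would otherwise read from the t-table.

from scipy import stats

# Hypothetical values from an estimated regression
beta1_hat = 2.05      # estimate of beta1
se_b1     = 0.31      # standard error of beta1_hat
n, k      = 25, 2     # sample size and number of betas
alpha     = 0.05

# t_test = (beta1_hat - hypothesized value) / s.e.(beta1_hat)
t_test = (beta1_hat - 0.0) / se_b1

# Two-sided test: compare |t_test| with t_critical at alpha/2, d.f. = n - k
t_critical = stats.t.ppf(1 - alpha / 2, df=n - k)

if abs(t_test) > t_critical:
    print("Reject H0: beta1 is statistically different from zero")
else:
    print("Do not reject H0")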

Interpretation: Each of these t-tests is a test of whether one of the β's in the population regression equation is equal to zero.

If we accept H0 for β̂_1, it means that β_1 is zero, and as a result β_1·X_1 falls out of the regression equation. This means that X_1 has no effect on Y.

If we reject H0 for β̂_1, it means that β_1 is not zero, and β_1·X_1 remains in the regression equation. This means that X_1 does have an effect on Y, and β̂_1 is our estimate of the effect of X_1 on Y. This means:

• A one-unit increase in X_1 causes Y to change by β̂_1 units.
• If β̂_1 is positive, then X_1 has a positive effect on Y.
• Or, if β̂_1 is negative, then X_1 has a negative effect on Y.


Confidence Intervals for β_0 and β_1

We can calculate Confidence Intervals for β_0 and β_1 using the usual Confidence Interval formulas:

Confidence Interval for β_0 = β̂_0 ± (t_critical,α/2 · s.e.(β̂_0))

Confidence Interval for β_1 = β̂_1 ± (t_critical,α/2 · s.e.(β̂_1))
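A short sketch of the confidence-interval calculation (illustrative only; the estimate, standard error, and sample size below are made up):

from scipy import stats

beta1_hat = 2.05      # hypothetical estimate of beta1
se_b1     = 0.31      # hypothetical standard error of beta1_hat
n, k      = 25, 2
alpha     = 0.05      # 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)

# Confidence Interval for beta1 = beta1_hat +/- t_critical * s.e.(beta1_hat)
lower = beta1_hat - t_crit * se_b1
upper = beta1_hat + t_crit * se_b1
print((lower, upper))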

Measuring "Goodness of Fit" for the Regression Equation

We often want to know how well the regression equation fits the sample data. There are many measures of "Goodness of Fit," but three of the most commonly used measures are the Standard Error of the Regression (SER), the Multiple Correlation Coefficient (R), and the Coefficient of Determination (R²).

The Standard Error of the Regression (SER) is the square root of the variance of the ê_i's. SER measures the variation of the Y data (on average) around the sample regression equation. SER gives the average distance of a data point from the regression line/curve. If the sample regression line fits the data well, then the data points will be close to the regression line/curve, and the SER will be small; thus, the smaller the SER, the better the sample regression equation fits the data.

SER = √(var(ê_i's)) = √(σ²_ê) = √( ∑_i (ê_i²) / (n − k) )

The Multiple Correlation Coefficient (R) is similar to a Pearson Correlation Coefficient (r), but R applies to situations where more than two variables are involved. Whereas the Pearson Correlation Coefficient measures the linear correlation between two variables, such as Y and X, or between two different X's, the Multiple Correlation Coefficient measures the linear correlation between the Y values of the data points and the Y values of the regression line/curve (which depend on all the X variables in the regression equation). Unlike the Pearson correlation coefficient (r), which can be either positive or negative, R can only be positive, and ranges from 0 to +1. Thus, whereas r measures both the strength and the direction (positive or negative) of the relationship, R measures only the strength and not the direction. The closer R is to +1, the better the regression line/curve fits the data points. The formula for R is based on a complicated combination of the Pearson Correlation Coefficients (r's) between Y and each X, as well as the Pearson Correlation Coefficients between the pairs of X variables. The formula is most easily understood using matrix algebra, which we won't pursue further here; in this course, we'll find R by simply taking the square root of the Coefficient of Determination R², which is described below.

The Coefficient of Determination (R²) measures the proportion of the total variation in the Y data that is explained by the X's in the regression equation.

• The total variation in the Y data is measured by comparing the Y_i values in the dataset to the average (mean) value of Y in the dataset, Ȳ. The difference between Y_i and Ȳ is calculated for each observation in the dataset, each difference is squared, and the squared differences are added up to obtain the Total Sum of Squares (TSS) = ∑_i (Y_i − Ȳ)².

• The amount of the total variation in the Y data that is explained by the sample regression model is given by the difference between Ŷ_i and Ȳ, where Ŷ_i is the value of Y on the regression line/curve obtained by plugging X_i into the regression equation. The difference between Ŷ_i and Ȳ is calculated for each observation in the dataset, each difference is squared, and the squared differences are added up to obtain the Regression Sum of Squares (RSS) = ∑_i (Ŷ_i − Ȳ)².



• The amount of the total variation in the Y data that is not explained by the sample regression model is given by the difference between Y_i and Ŷ_i. The difference between Y_i and Ŷ_i is calculated for each observation in the dataset, each difference is squared, and the squared differences are added up to obtain the Error Sum of Squares (ESS) = ∑_i (Y_i − Ŷ_i)².

[Figure: Components of TSS, RSS, and ESS. For a data point (X_1i, Y_i) and the fitted regression line Ŷ = β̂_0 + β̂_1·X_1, the vertical distance Y_i − Ŷ_i is part of ESS, the distance Ŷ_i − Ȳ is part of RSS, and the distance Y_i − Ȳ is part of TSS.]

Coefficient of Determination:

R² = RSS / TSS

• The value of R² always lies between zero and one.
• The larger the value of R², the better the sample regression equation fits the data.
• Example: if R² = 0.67, then 67% of the variation in the Y data can be explained by the X variables in the sample regression equation.

Adjusted R-squared or "R-bar-squared" (R̄²)

The Coefficient of Determination (R²) is defined for a sample regression equation that has a single X variable. When the sample regression equation contains more than one X variable, the Coefficient of Determination must be adjusted for the number of X variables in the equation. This adjusted Coefficient of Determination is called the "Adjusted R-squared" or "R-bar-squared" (R̄²). The formula for the adjusted R-squared is:

R̄² = 1 − (1 − R²) · ( (n − 1) / (n − k) )

where n is sample size and k is the number of β's in the regression equation.
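The goodness-of-fit measures above can be computed directly from the fitted values. The sketch below (illustrative, with made-up data) calculates TSS, RSS, ESS, SER, R², and the adjusted R² for a simple regression.

import numpy as np

# Hypothetical sample data
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y  = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n, k = len(Y), 2

# OLS estimates (see the estimator-equation sketch) and fitted values
X1_bar, Y_bar = X1.mean(), Y.mean()
beta1_hat = (np.sum(Y * X1) - n * X1_bar * Y_bar) / (np.sum(X1**2) - n * X1_bar**2)
beta0_hat = Y_bar - beta1_hat * X1_bar
Y_hat = beta0_hat + beta1_hat * X1        # values of Y on the regression line

TSS = np.sum((Y - Y_bar) ** 2)            # total sum of squares
RSS = np.sum((Y_hat - Y_bar) ** 2)        # regression (explained) sum of squares
ESS = np.sum((Y - Y_hat) ** 2)            # error (unexplained) sum of squares

SER    = np.sqrt(ESS / (n - k))           # standard error of the regression
R2     = RSS / TSS                        # coefficient of determination
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k) # adjusted R-squared
print(SER, R2, R2_adj)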

Using an F-test to Assess the Statistical Significance of the Entire Regression Equation

We can use an F-test to test a hypothesis about the statistical significance of the entire regression equation as a whole. This is a test of the hypothesis that all of the β's in the population regression model are equal to zero:

H0: all of the β's in the population regression model are zero
H1: one or more of the β's in the population regression model is not zero

(Side note: Earlier in this handout, we tested hypotheses about the individual β's in the regression model. Those were tests of the statistical significance of particular parts of the regression model rather than tests of the statistical significance of the whole regression equation.)

Equivalently, the hypotheses for this test can be expressed as:

H0: R² = 0   (the regression equation explains none of the variation in Y)
H1: R² > 0   (the regression equation explains at least some of the variation in Y)



We use an F-test to conduct this hypothesis test. In this case, the formula for the F_test number is:

F_test = [ RSS / (k − 1) ] / [ ESS / (n − k) ]     or     F_test = [ R² / (k − 1) ] / [ (1 − R²) / (n − k) ]

where n is sample size and k is the number of β's in the regression equation.

The value of F_critical is obtained from the F-table with significance level = α (this is a one-sided test), d.f. numerator = k − 1, and d.f. denominator = n − k.

As usual, if F_test > F_critical, then Reject H0 and Accept H1.

If hypothesis H0 is accepted (not rejected), then all of the β's in the regression model are zero, which means that none of the X variables in the model help to explain movements in Y. That is, the model does not help explain/predict/forecast Y. (It's "back to the drawing board" to construct a new model.)

If hypothesis H0 is rejected, then one or more of the β's in the regression model are not zero, which means that one or more of the X variables in the model helps to explain movements in Y. That is, the model does help explain/predict/forecast Y. (We can now use t-tests, as explained earlier, to determine which β's contribute to the explanation/prediction/forecast of Y.)
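Here is an illustrative sketch of the F-test (the R², n, and k values are made up); scipy.stats.f.ppf supplies the F_critical value that we would otherwise read from the F-table.

from scipy import stats

# Hypothetical quantities from an estimated regression
R2    = 0.67         # coefficient of determination
n, k  = 30, 3        # sample size and number of betas
alpha = 0.05

# F_test = [R^2/(k-1)] / [(1-R^2)/(n-k)]
F_test = (R2 / (k - 1)) / ((1 - R2) / (n - k))

# One-sided test: d.f. numerator = k - 1, d.f. denominator = n - k
F_critical = stats.f.ppf(1 - alpha, dfn=k - 1, dfd=n - k)

if F_test > F_critical:
    print("Reject H0: the regression as a whole is statistically significant")
else:
    print("Do not reject H0")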

Applying the OLS Regression Method

When we use the OLS Regression Method, in what order should we look at the various results?

1. Use the OLS Regression Estimator Equations to estimate the β̂'s for the regression model.

2. Use the F-test to determine whether the regression as a whole is statistically significant. If the F-test is not significant (that is, if we do not reject H0), then the X's (all of them together) in the model are not helping us explain Y, and we should "go back to the drawing board" and develop a new model.

3. If the F-test is significant, then calculate the "Goodness of Fit" measures: (1) calculate SER, which measures the average variation of the Y data around the sample regression line/curve, and (2) calculate R² (or R-bar-squared) to assess the proportion of the variation in Y that is explained by the X's (all of them together) in the model.

4. Check the t_test numbers for the β̂'s to determine which of the β̂'s are statistically significant. If a β̂ is statistically significant, then its associated X variable has a statistically significant effect on Y.

5. For the β̂'s that are statistically significant, interpret them for your audience/reader. Each statistically significant β̂ gives the direction and size of the effect of its associated X variable on Y. Examples: If β̂_1 = 37, then when X_1 increases by one unit, Y increases by 37 units (Y increases because β̂_1 is positive). If β̂_2 = −5.7, then when X_2 increases by one unit, Y decreases by 5.7 units (Y decreases because β̂_2 is negative). Notice that β̂_0, the intercept, does not have an associated X variable. The interpretation of β̂_0 is that it is the average value of Y when all of the X's are equal to zero.
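In practice, statistical software reports all of these quantities at once. As an illustrative sketch (not part of the original handout; the data are made up), the statsmodels package in Python produces the β̂'s, standard errors, t-tests, R², adjusted R², and the F-test in a single summary table, which can then be read in the order described in steps 1 through 5 above.

import numpy as np
import statsmodels.api as sm

# Hypothetical sample data with two X variables
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
X2 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 4.0, 8.0, 6.0])
Y  = np.array([3.1, 4.0, 6.2, 5.9, 9.1, 8.8, 12.0, 11.5])

X = sm.add_constant(np.column_stack([X1, X2]))   # adds the intercept column for beta0
model   = sm.OLS(Y, X)                           # ordinary least squares model
results = model.fit()                            # estimates the betas by OLS

# Coefficients, standard errors, t-tests, R^2, adjusted R^2, and F-test in one table
print(results.summary())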



Assumptions behind the OLS Regression Model

For the OLS Regression Model to work properly, several assumptions must be met:

1. The dependent (Y) variable is a continuous, measurement variable.

2. The regression model equation must be linear in the parameters.

3. The distribution of the population error e is normal (bell-shaped) with a mean value of zero.

4. The variance of the population error e is the same for all individuals in the population. (This is known as the "homoskedasticity" assumption.)

5. Each population error e is uncorrelated with (independent of) every other population error. (That is, if one error is large, that doesn't mean that the next error will necessarily be large; the errors do not affect one another.)

6. The population error e is uncorrelated with (independent of) the X and Y variables in the model.

7. There is no perfect linear correlation between any two X variables in the model.

8. The model is correctly "specified." (That is, all X variables that affect Y are included in the regression model equation, and all X variables that do not affect Y are excluded from the regression equation.)

!!!! The Gauss-Markov Theorem: this is the reason why we use the OLS Regression method to analyze relationships between variables rather than some other method !!!!

If the Assumptions behind the OLS Regression Model listed above are met, then the OLS Regression methodology is "B.L.U.E.":

• B (Best) = the OLS Regression methodology produces the lowest variance in the estimates of the β̂'s compared to all other linear, unbiased estimators.
• L (Linear) = the OLS Regression methodology is linear in the parameters.
• U (Unbiased) = the OLS Regression methodology produces unbiased estimates of the β's (that is, the estimates of the β's are equal to the true β's, on average across many samples).
• E (Estimator) = the OLS Regression model is an estimator (a method that produces estimates of the β's from the sample data on X and Y).

