Uploaded by blue_jennifer2000

TOPIC 6

STATISTICS FOR BUSINESS
TOPIC 6A: SIMPLE LINEAR REGRESSION ANALYSIS
1 Introduction



Main objective of statistical investigations → establish relationships which make it possible
to predict one or more variables in terms of other known variables.
The problem of predicting the average value of one variable in terms of the known value of another
variable is the problem of regression.
Making predictions is crucial for decision making.
Simple linear regression
• The common process used in all methods of prediction is:
1. To fit a model to the data which has been collected; and,
2. To use this model as the basis for any predictions made.
– It is believed that there is some relationship, so go ahead and record some
observations. Once the data is collected, use it to find some kind of
model to use as a prediction device.
• Clearly, if our predictions are to be ‘close’, it is important that:
o The fitted model gives an accurate representation of the data; and
o The mechanisms which gave rise to the collected data are also valid for the
period of the prediction.
• For this reason, avoid making predictions for the far future, or for circumstances
well outside the scope of the initial data. Estimates should be revised when either:
o More data becomes available; or
o The underlying mechanisms are known to have changed.
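The fit-then-predict process, and the warning about predicting outside the scope of the data, can be sketched roughly as follows. This is only an illustration: the data values and the function names fit_line and predict are hypothetical, and the line is fitted with the usual least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict(xs, ys, x_new):
    """Predict y at x_new and flag whether x_new lies outside the observed x range."""
    intercept, slope = fit_line(xs, ys)
    extrapolating = not (min(xs) <= x_new <= max(xs))
    return intercept + slope * x_new, extrapolating


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

inside, flag_inside = predict(xs, ys, 3.5)   # within the range of the data
outside, flag_outside = predict(xs, ys, 20)  # well outside the range of the data
print(inside, flag_inside)
print(outside, flag_outside)
```

The second prediction is still computed, but the flag signals that it relies on the fitted relationship holding far beyond the observed data.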
Scatter Diagrams
• A scatter diagram shows the relationship between two sets of data.


We are interested in estimating a straight-line relationship between variables X and Y.
o X → the independent variable.
o Y → the dependent variable (the variable to be predicted).
The 1st step in estimating a straight line of best fit to a set of data is to ensure that a straight line
really is a reasonable representation of the data. This is done by plotting the data in a
graph, a scatter diagram.
Example on the next page
2 Estimating the Simple Linear Regression Line

The equation of a straight line is
o Y = β0 + β1 X
(equivalently written as Y = a + b * X or Y = m * X + c)
β0 and β1 → unknown constants
β0 → the Y-intercept (the value of Y when X = 0)
β1 → the slope of the line (the change in Y which corresponds to a 1-unit change in X)
These definitions are important when it comes to interpreting the estimated coefficients
from a regression analysis.


Once it has been decided that a straight line is a reasonable representation of the data, the 2nd step is
to estimate the equation of the straight line.
We will use the sample data to estimate β0 and β1 to give the straight line which 'best' fits
the data. These estimates are denoted by β̂0 and β̂1.
The hat (^) denotes an estimate, and
they are called “Beta naught hat” and “Beta one hat”.
Estimating the coefficients
• Suppose we have a sample of n pairs of observations taken from a population, and the i’th
pair is (xi, yi).
• If we have a single predictor variable influencing the outcome of a response variable,
the model is yi = β0 + β1 xi + εi, where εi is the random error term.



To get the best possible fit for the straight line → minimise the differences between the
observed Y-values, yi, and the corresponding points on the line of best fit, ŷi. That is, we
wish to minimise the differences (yi − ŷi). Each such difference is called a residual.
Problem → some differences will be positive and some negative, so they cancel when added together.
Solution → use the least squares method to minimise the sum of squares of these
differences – that is, to minimise:
SSE = Σ (yi − ŷi)²
Least Squares Criterion
• The line that best fits the data is the line for which SSE is minimised; this line is called the
regression line.
• Regression line: ŷ = β̂0 + β̂1 x
• Set of equations to work out β̂0 and β̂1:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
β̂0 = ȳ − β̂1 x̄
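A minimal sketch of the least squares calculation, assuming the standard textbook formulas (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope × x̄). The data and the function names least_squares and sse are hypothetical, chosen only for illustration.

```python
def least_squares(xs, ys):
    """Return (b0, b1): the intercept and slope that minimise SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared differences between observed y and the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
best = sse(xs, ys, b0, b1)

# Moving the line away from the least squares estimates increases SSE.
nudged_intercept = sse(xs, ys, b0 + 0.1, b1)
nudged_slope = sse(xs, ys, b0, b1 + 0.1)

print(b0, b1, best)
print(nudged_intercept, nudged_slope)
```

Comparing SSE for the fitted line with SSE for slightly shifted lines is a quick way to see what "best fit" means here.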
3 Use of Excel for Simple Linear Regression

Common to use computer packages when dealing with problems of regression.
4 Residual Analysis



Use residual analysis to determine if the regression line is a good fit to the data and that
the estimates coming from the regression analysis are valid and reliable.
If the model fits the data well, the residuals which represent the “error” term in the
regression model should be small and not exhibit any pattern.
Patterns in the residuals usually indicate a predictable component, whereas the residuals
should be random. A pattern usually indicates that:
o The linear model is not appropriate OR
o The data needs to be transformed (often a log or other transformation).
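As a rough illustration of residual analysis, the sketch below fits a straight line to two hypothetical data sets and lists the residuals. The second data set is deliberately curved (y = x²), so its residuals show a pattern. All names and data values are assumptions made for this example.

```python
def residuals(xs, ys):
    """Fit the least-squares line and return the residuals y - fitted y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def signs(values):
    """Show each residual as '+' or '-' to make runs of the same sign visible."""
    return ["+" if v > 0 else "-" for v in values]


xs = [1, 2, 3, 4, 5]

# Hypothetical roughly linear data: residual signs change back and forth.
res_linear = residuals(xs, [2, 4, 5, 4, 5])

# Hypothetical curved data (y = x squared): residuals form a U shape.
res_curved = residuals(xs, [1, 4, 9, 16, 25])

print(res_linear, signs(res_linear))
print(res_curved, signs(res_curved))
```

In both cases the residuals add up to zero, so it is the pattern (here high, low, low, low, high for the curved data), not the total, that suggests a straight line is not appropriate.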
Standard Error of the Estimate
• In estimating the standard error of any parameters or predictions using the model, we
require an estimate of the unknown population standard deviation, σ. This estimate, s, is related to
the Sum of Squares for Error as:
s = √( SSE / (n − 2) )
The divisor here is (n − 2) as we have had to estimate 2 parameters, β0 and β1.
The standard error is also displayed in the Excel printout.
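A short sketch of the standard error of the estimate, assuming the usual formula s = √(SSE / (n − 2)). The data and the function name standard_error_of_estimate are hypothetical.

```python
import math


def standard_error_of_estimate(xs, ys):
    """Return s = sqrt(SSE / (n - 2)) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Divide by n - 2 because two parameters (intercept and slope) were estimated.
    return math.sqrt(sse / (n - 2))


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s = standard_error_of_estimate(xs, ys)
print(round(s, 4))
```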
5 Correlation Coefficient




Determine the strength of the association between variables X and Y to determine how
well the regression model fits the data.
Two measures of the strength of the association between variables X and Y are:
o The sample correlation coefficient (r).
o The coefficient of determination (r²).
r² → measures the proportion of variation in the response variable (y) explained by the
predictor variable (x).
r → measures the LINEAR association between X and Y.
Coefficient of Correlation
• The formula for determining the sample coefficient of correlation is
r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² × Σ (yi − ȳ)² )

Correlation provides a measure of association between the predictor and the response
variable, and lies between -1 and +1.
o r = +1 → X and Y are perfectly positively correlated (or associated) with one another.
o r = -1 → X and Y are perfectly negatively correlated (or associated) with one another.
o r = 0 → X and Y are uncorrelated; there is no linear association between them.
This does not mean that they are statistically independent (dependence
may still exist in a quadratic, cubic or higher-order form).
o r is also displayed in the Excel printout.
o The population coefficient of correlation is ρ, the Greek letter rho.
Note: A correlation coefficient close to 0 → doesn’t mean there is no relationship between the
variables being considered, only that there is no LINEAR relationship.
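The point that r = 0 does not rule out a relationship can be checked with a small sketch. It assumes the standard formula for the sample correlation coefficient; the data sets and the function name correlation are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]

# Hypothetical exact straight line y = 2x + 1: perfect positive correlation.
r_line = correlation(xs, [2 * x + 1 for x in xs])

# Hypothetical exact quadratic y = x squared: y depends entirely on x,
# yet there is no LINEAR association.
r_quadratic = correlation(xs, [x ** 2 for x in xs])

print(r_line, r_quadratic)
```

Here y is completely determined by x in both cases, but only the straight-line data gives r = +1; the quadratic data gives r = 0.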
Coefficient of Determination
• The formula for determining the sample coefficient of determination is:
r² = 1 − SSE / SST, where SST = Σ (yi − ȳ)² is the total variation in the y values

This describes what proportion of the variation in the observed y values can be explained by
the regression line (i.e. by the variation in the predicted y values).
o r² falls in the range 0 ≤ r² ≤ 1 and is usually written as a percentage.
o An r² value close to 1 implies that most of the variability in the y values is explained
by the regression model.
o r² is also displayed in the Excel printout.
o The population coefficient of determination is ρ², the Greek letter rho
squared.
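A minimal sketch of the coefficient of determination, assuming the common form r² = 1 − SSE / SST, where SST = Σ(y − ȳ)² is the total variation in y. It also checks that, for simple linear regression, this equals the square of the correlation coefficient r. Data and function names are hypothetical.

```python
import math


def r_and_r_squared(xs, ys):
    """Return (r, r2) where r2 is computed as 1 - SSE / SST."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    r = sxy / math.sqrt(sxx * sst)
    return r, 1 - sse / sst


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, r2 = r_and_r_squared(xs, ys)
print(round(r, 4), round(r2, 4), f"{r2:.0%} of the variation in y is explained")
```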
6 Related Statistical Inferences
Confidence Interval for the population slope – β1
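• The standard error can be used for confidence intervals and tests on the slope parameter β1.
The confidence interval is β̂1 ± t × SE(β̂1), where t is the critical value of the t-distribution with n − 2 degrees of freedom.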
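A sketch of a confidence interval for the slope, under stated assumptions: the standard error of the slope is taken as s / √Σ(x − x̄)² (the usual textbook form), and the critical value is read from a t-table rather than computed. The data and the function name slope_confidence_interval are hypothetical.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (b1, se_b1, lower, upper) for the slope of the least-squares line.

    t_crit is the t-table value for the chosen confidence level with n - 2 d.f.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    se_b1 = s / math.sqrt(sxx)     # standard error of the slope
    margin = t_crit * se_b1
    return b1, se_b1, b1 - margin, b1 + margin


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# 95% confidence with n - 2 = 3 degrees of freedom: t-table value 3.182.
b1, se_b1, lower, upper = slope_confidence_interval(xs, ys, t_crit=3.182)
print(round(b1, 3), round(se_b1, 4), round(lower, 3), round(upper, 3))
```

With so few observations the interval is wide and includes 0, which is consistent with not being able to conclude that the slope differs from zero.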

Significance of the Linear Model
• Hence, when testing the significance of the linear relationship, test the null hypothesis
H0: β1 = 0
HA (as always) will depend on what the question is.
o Test for positive slope → HA: β1 > 0
o Test for negative slope → HA: β1 < 0
o Test if the linear relationship is significant (i.e. does it exist) → HA: β1 ≠ 0
Use the six-step procedure. Since the population standard deviation is unknown, use a t-test.
For simple linear regression, as we are estimating two parameters, β0 and β1, the degrees
of freedom are d.f. = n – 2.
1. Depends on question.
2. Find suitable test statistic
3. Depends on question, specify the level of significance.
The value of t is given in the Excel print out. In fact, the p-value is also given. This makes our
rejection region easy to determine.
4. Reject H0 if p-value < α.
5. Calculations.
6. Conclusion.
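The calculation step can be sketched as below, assuming the usual test statistic t = (estimated slope) / (standard error of the slope) with n − 2 degrees of freedom. The decision here uses a critical value from a t-table instead of a p-value. Data and function names are hypothetical.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (t, df) for testing H0: slope = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b1 / se_b1, n - 2


# Hypothetical sample data (not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

t_stat, df = slope_t_statistic(xs, ys)

# Two-tailed test at the 5% level with 3 d.f.: t-table critical value 3.182.
t_crit = 3.182
reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 4), df, reject_h0)
```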
Note:
• The p-value given in Excel is for a 2-tailed test. So, if we are performing a 1-tailed test, then
we must halve the p-value from the printout first (provided the sign of the estimated slope agrees with HA).
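A small sketch of converting the 2-tailed p-value from a printout into a 1-tailed p-value. Halving applies when the sign of the t statistic agrees with the direction of HA; when it points the other way, the 1-tailed p-value is 1 minus half the 2-tailed value. The function name one_tailed_p and the example numbers are hypothetical.

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a two-tailed p-value to a one-tailed p-value.

    alternative: "greater" for HA: slope > 0, "less" for HA: slope < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2
    if alternative == "greater":
        agrees = t_stat > 0
    else:
        agrees = t_stat < 0
    # If the sample slope points the opposite way to HA, the evidence for HA is weak.
    return half if agrees else 1 - half


# Hypothetical printout values: t = 2.1 with a two-tailed p-value of 0.08.
p_positive = one_tailed_p(0.08, 2.1, "greater")  # slope sign agrees with HA
p_negative = one_tailed_p(0.08, 2.1, "less")     # slope sign contradicts HA

alpha = 0.05
print(p_positive, p_positive < alpha)
print(p_negative, p_negative < alpha)
```

In this hypothetical case the 2-tailed test would not reject H0 at the 5% level (0.08 > 0.05), but the 1-tailed test for a positive slope would (0.04 < 0.05).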
Confidence and Prediction Intervals