26134 Business Statistics

Mahrita.Harahap@uts.edu.au
Week 4 Tutorial
Simple Linear Regression
Key concepts in this tutorial are listed below
1. Detecting associations using scatterplots
2. Dependent and independent variables
3. Bivariate regression
4. Difference between cause and effect between two
variables vs. a relationship between two variables
5. Regression and correlation
In statistics we usually want to analyse a population, but collecting data for the
whole population is usually impractical, expensive or impossible. That is why we
collect samples from the population (sampling) and make inferences about the
population parameters using the statistics of the sample (inference) with some
level of confidence (confidence level).
A population is a collection of all possible individuals, objects, or measurements of
interest. A sample is a subset of the population of interest.
Regression
The linear regression line characterises the relationship between two numerical
variables. Regression analysis helps us draw insights from data by showing how a
change in one variable is associated with a change in the other.
Simple linear regression examines the relationship between one independent
variable (predictor/explanatory) and one dependent variable (response/outcome).
The linear regression line equation is based on the equation of a line in mathematics.
Y = β0 + β1X + ε, where ε ~ N(0, σ)
Y: Outcome variable (also called the response variable or dependent variable).
The outcome to be measured/predicted.
X: Predictor variable (also called the explanatory variable or independent variable).
Variable one can control.
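As a rough illustration of how the intercept and slope of the regression line can be estimated from a sample, here is a minimal sketch using ordinary least squares. The data values and the function name fit_line are made up for this example; b0 and b1 stand for the sample estimates of β0 and β1.

```python
def fit_line(x, y):
    """Return (b0, b1): least-squares estimates of the intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of x around their means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical sample: x = independent variable, y = dependent variable
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(f"Y = {b0:.2f} + {b1:.2f} X")
```

The fitted line always passes through the point (mean of X, mean of Y), which is why the intercept is computed from the two means once the slope is known.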
Correlation
Correlation measures the association between two numerical variables, with the
strength of the relationship measured by the correlation coefficient r.
The correlation coefficient:
• Is a statistic that quantifies the linear relation between two variables.
• Falls between -1.00 and 1.00.
• Has a sign that indicates the direction of the relationship.
• Has a magnitude (absolute value) that indicates the strength of the relationship.
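The properties of r listed above can be checked with a short calculation. This is a minimal sketch of the Pearson correlation coefficient; the data and the function name pearson_r are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two numerical variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_pos = pearson_r(x, y)                    # positive association
r_neg = pearson_r(x, [-v for v in y])      # same strength, opposite direction
r_perfect = pearson_r(x, [2 * v for v in x])  # points lie exactly on a line

print(round(r_pos, 4), round(r_neg, 4), round(r_perfect, 4))
```

Flipping the sign of one variable flips the sign of r but leaves its magnitude unchanged, which is the sense in which sign gives direction and magnitude gives strength.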
NOTE:
Regression examines the relationship between one independent variable and one
dependent variable; the slope of the regression line measures that relationship.
Correlation indicates the association between two metric variables, with the
strength and direction of the association measured by the correlation coefficient.
Strength & Direction of Correlation
STRENGTH: PERFECT, STRONG, MODERATE, WEAK
DIRECTION: POSITIVE, NEGATIVE
Difference between cause and effect between two variables
vs. a relationship between two variables
• Cause and effect implies that one variable directly causes change in the other.
A relationship implies variables move in the same or opposite direction together,
which may be caused by another variable not currently used in the model.
• If two variables are associated with each other it does not mean one variable
directly affects or causes the other.
• https://www.youtube.com/watch?v=taA0DWqi_jM
Regression on Excel
1. To enable the “Data Analysis” pack in Excel, click File>Options>Add-Ins>Manage: Excel
Add-ins>Go>tick Analysis ToolPak>OK. Then run the regression via Data>Data
Analysis>Regression>OK.
2. For the scatterplot graph, click Insert>Scatter>Select Data>select the data range (make sure x
is horizontal and y is vertical). Right-click on the data points and click “Add Trendline”, then tick
“Display Equation on chart” to show the regression equation.
3. d) For each additional employee, the average profit per dollar of sales increases by 2.14
cents
Q1.1. In this bivariate analysis which is the dependent variable and which is the
independent variable?
Independent variable: advertisements in sports magazine
Dependent variable: level of sales
Q1.2. Which statistical technique should be used to establish the
strength of association between these two variables?
Correlation
Q1.3. Draw a diagram representing the expected direction of the
relationships described. Be sure to label axes.
Q1.4. Which statistical technique would be used to understand the impact of one of the
variables on the other?
Regression analysis.
Q1.5. What are some of the statistical assumptions being used in applying this statistical
technique and how can these be verified?
Assumptions are:
a) the relationship between the two variables is linear (verified by using a scatter
plot),
b) the model evaluates the magnitude of the relationship and is used for prediction,
but no cause and effect can be attributed (any causal claim must be justified by theory),
c) the variables are numeric (verified by checking that they are measured on interval
or ratio scales),
d) the error terms are independent and normally distributed (i.e. a normal
bell-shaped curve): ε ~ N(0, σ)
Q1.6. What is the benefit of using the statistical technique for understanding the
relationship between two variables compared to understanding an association?
Correlation shows the direction and strength of association with a value between -1
(perfectly negative association) and +1 (perfectly positive association), whereas
regression allows the impact of one variable on the other to be estimated and a
predictive model to be created. Regression estimates the slope of the linear
equation.
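The two techniques are closely linked: in simple linear regression the slope equals r multiplied by the ratio of the standard deviations of Y and X. A minimal sketch on made-up data (the variable names are illustrative only) shows that both routes give the same slope.

```python
import math

# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # correlation: direction and strength only
slope = sxy / sxx                # regression: change in Y per unit change in X

# s_y / s_x equals sqrt(syy / sxx) because the (n - 1) divisors cancel
slope_from_r = r * math.sqrt(syy / sxx)

print(round(r, 4), round(slope, 4), round(slope_from_r, 4))
```

The correlation is unit-free, while the slope carries the units of Y per unit of X, which is what makes the regression equation usable for prediction.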
Q2.1. Which value would you use to determine the relationship between the two
variables and does the direction of this relationship make sense?
Beta coefficient = .459. Yes, the positive direction makes sense: we would expect
higher food quality to lead to customers returning.
Hypothesis Testing
• We use hypothesis testing to infer conclusions about the population
parameters based on analysing the statistics of the sample.
• In statistics, a hypothesis is a statement about a population parameter.
1. The null hypothesis, denoted H0, is a statement or claim about a
population parameter that is initially assumed to be true. It is always
stated as an equality. (E.g. H0: β1=0)
2. The alternative hypothesis, denoted H1, is the competing claim:
what we are trying to find evidence for. (E.g. H1: β1≠0)
3. Test Statistic: a measure of compatibility between the statement in
the null hypothesis and the data obtained.
4. Decision Criteria: The p-value is the probability of obtaining a test
statistic as extreme as or more extreme than the observed sample value,
given H0 is true.
If p-value≤0.05, reject H0
If p-value>0.05, do not reject H0
5. Conclusion: State your conclusion in the context of the problem.
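The decision criteria can be written as a simple rule. This is only a sketch: the function name decide is hypothetical, and the 0.05 default is the significance level used in this tutorial.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision criterion: reject H0 when the p-value is <= alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "do not reject H0"


print(decide(0.009))  # p-value below the significance level
print(decide(0.20))   # p-value above the significance level
```

Note that "do not reject H0" is not the same as accepting H0; it only means the sample did not provide enough evidence against it.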
Q2.2. How do we use the t statistic and what does the significance tell us about these
variables?
This hypothesis test tells us whether there is enough evidence in our sample data to
conclude that there is a significant linear relationship.
H0: β1=0. There is no association between the dependent variable and the independent
variable. (There is no significant linear relationship), i.e. y = β0 + 0*x
H1: β1≠0. The independent variable is linearly related to the dependent variable. (There
is a significant linear relationship), i.e. y = β0 + β1*x
Test Statistic: The t-test tells us whether the INDIVIDUAL regression coefficient is
different enough from zero to be statistically significant.
t = b1 / s_b1 = .5911
(b1 is the sample estimate of β1 and s_b1 is its standard error.)
P-value = 0.009
Since p-value = 0.009 < 0.05 (level of significance), we reject the null
hypothesis and conclude that there is enough statistical evidence of a
significant linear relationship between the two variables.
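The t statistic for a slope is the estimated slope divided by its standard error. The sketch below computes it on made-up data (not the restaurant data from this question); the function name slope_t_statistic is hypothetical. The p-value is then read from a t distribution with n - 2 degrees of freedom, which regression software such as Excel reports directly, so it is not computed here.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for the slope in simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Sum of squared residuals around the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2                        # two parameters estimated: b0 and b1
    se_b1 = math.sqrt((sse / df) / sxx)  # standard error of the slope
    return b1, se_b1, b1 / se_b1, df


# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t, df = slope_t_statistic(x, y)
print(f"b1 = {b1:.3f}, s_b1 = {se_b1:.4f}, t = {t:.4f}, df = {df}")
```

A slope that is large relative to its standard error gives a large |t| and therefore a small p-value.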
R² Coefficient of Determination
• Tells us the proportion of variation in
the dependent variable that is accounted for
by the independent variable.
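As a sketch of where this number comes from, R² can be computed as 1 minus the ratio of unexplained variation (SSE) to total variation (SST). The data below are made up for illustration; in simple linear regression the result also equals the square of the correlation coefficient r.

```python
# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Total variation in Y, and the part the regression line leaves unexplained
sst = sum((yi - mean_y) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / sst        # proportion of variation explained
unexplained = 1 - r_squared      # proportion left unexplained

print(f"R^2 = {r_squared:.3f}, unexplained = {unexplained:.3f}")
```

The same subtraction applies to any reported R²: whatever proportion is not explained by the independent variable is 1 minus R².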
Q2.3. What does the R² tell us about the relationship between food quality and
customers returning?
An R² of .263 means food quality explains 26.3% of the variation in whether a
customer will return.
Q2.4. How much does perception of food quality not explain whether a customer
would return to a restaurant?
Food quality does not explain (1 - .263=)73.7% of the variation in customers
returning.
Q2.5. List three other variables that may explain whether a customer would return
to Joe’s restaurant.
Other variables include price, location, food cuisine, quality of staff, ambience etc.
Including more than one explanatory variable requires multiple (multivariate)
regression analysis (next week’s topic).
Son’s Height = 33.73 + 0.516 × Father’s Height
Interpret the Coefficients:
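One way to read the coefficients is to plug values into the equation. In this sketch the function name and the father's heights of 70 and 71 are made up for illustration, and heights are assumed to be in whatever unit the original data used (the slide does not state it). The intercept 33.73 is the predicted son's height when the father's height is 0, which lies far outside any realistic data and so has no practical meaning on its own. The slope 0.516 is the predicted increase in the son's height for each additional unit of the father's height.

```python
INTERCEPT = 33.73  # b0 from the fitted equation on the slide
SLOPE = 0.516      # b1 from the fitted equation on the slide


def predict_son_height(father_height):
    """Predicted son's height from the fitted regression equation."""
    return INTERCEPT + SLOPE * father_height


# Hypothetical fathers' heights, one unit apart
h70 = predict_son_height(70)
h71 = predict_son_height(71)

print(round(h70, 2))        # prediction for a father's height of 70
print(round(h71 - h70, 3))  # change in prediction per extra unit: the slope
```

Because the slope is less than 1, a one-unit difference between two fathers corresponds to a predicted difference of only about half a unit between their sons.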