Inference in regression - Previous Experiences as an Educator

1st
Business and Economic Statistics
Tutorial 1: Describing Categorical Data (Ch 4)
Tutor: Sam Capurso
E-mail: ...
1. Why Statistics?
Initiates policy / decisions
Statistics
Evaluates and informs policy / decisions
Accountants work in an economy (in fact, everyone does)
i, P, E.R., Confidence (Business, Consumer), More...
2. Prac set up
Task
Minutes
Attendance, hand back work
5
Summary for this week
Individual written work (4 in the semester)
Individual MCQ test
Group MCQ scratchy test
Group WAQ
Worked Example
5 - 10
10
10 - 15
10 (or until finished)
Approx 1 hour
10 - 15
3. First prac (only)
* Introduction
* “House keeping”
* Arrange groups
* Work out team names and take attendance
* Prac work
4. Things to note
* Need to attend lectures and read text BEFORE PRAC
* Assessment for pracs =
Indiv MCQ (5%)^ + Team MCQ (5%)^ + Team WAQ (10%)^^
^ Hand in prac
^^ Hand in by due date: … in hand-in box: names, ID numbers,
time, day, tutor.
5. Add previous prac’s results
Building a House
Group activity
Roles
Architect – design, framework, ideas
Tradesperson – technical, 'expert' in field
Superintendent – leader, knowledge of different areas
Decorator – finer details, user-friendliness
Real estate agent – communication, 'sells the product'
General contractor – follows direction, able to learn how to perform different roles
Task
Questions:
1. Why did you choose this role?
2. What types of skills / experiences are related to this role?
3. What are the ways in which someone in your role can work with someone from (choose a different role)?
4. How can you relate this activity to working in your BES team?
2nd
Note
Stratified and clustered sampling
Clip:
http://www.youtube.com/watch?v=CvPPM2stuPg&feature=c4overview&list=UUZFQ2rSVMR2ahKAzBto5P7w
Sampling:
Population → Sampling frame (list) → Target sample → Actual sample (respondents)
Sources of bias along the way: undercoverage, non-response bias, voluntary response bias, convenience sampling, response bias
Note:
n↑ ≠ ↓bias
n↑ ⇒ ↓sampling error (error due to randomness)
Need to improve survey design to ↓ bias
If ↑ n, just asking more people the wrong question!
2nd
E.g.
Simpson’s Paradox
School     Girls   Boys   Total
School A   273     77     350
School B   289     61     350
Total      562     138    700
Which school had the higher proportion of girls?
School     % girls
School A   78%
School B   83%
School B has the higher proportion of girls.
School A   Girls   Boys   Total
Yr 11      81      6      87
Yr 12      192     71     263
Total      273     77     350

School B   Girls   Boys   Total
Yr 11      234     36     270
Yr 12      55      25     80
Total      289     61     350
Percentage of girls by school broken into year levels
School     Yr 11   Yr 12
School A   93%     73%
School B   87%     69%
So, something must be going on with the year levels when we add them together to get the earlier result: School A has a higher percentage of girls in each year level, yet School B has the higher percentage overall.
School A   Yr 11   Yr 12   Total
Girls      81      192     273
Boys       6       71      77
Total      87      263     350

School B   Yr 11   Yr 12   Total
Girls      234     55      289
Boys       36      25      61
Total      270     80      350
Percentage of girls in each year level (both schools combined)
Year level   Yr 11   Yr 12
% girls      88%     72%
% Yr 11 in each school
School     % Yr 11
School A   25%
School B   77%
So, proportion of girls exaggerated in School B, because...
* Year 11 students are more likely to be girls, and
* School B has higher proportion of Year 11 students
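The reversal can be checked directly from the counts in the tables above. The following is a minimal Python sketch, not part of the original slides; the variable names (`counts`, `by_year`, `overall`, `share_yr11`) are illustrative only.

```python
# Counts of (girls, boys) by school and year level, copied from the tables above.
counts = {
    "School A": {"Yr 11": (81, 6), "Yr 12": (192, 71)},
    "School B": {"Yr 11": (234, 36), "Yr 12": (55, 25)},
}


def pct_girls(girls, boys):
    """Percentage of girls in a group, rounded to the nearest whole percent."""
    return round(100 * girls / (girls + boys))


# Within each year level, School A has the higher percentage of girls.
by_year = {
    school: {year: pct_girls(*gb) for year, gb in years.items()}
    for school, years in counts.items()
}

# Adding the year levels together reverses the comparison.
overall = {
    school: pct_girls(
        sum(g for g, _ in years.values()),
        sum(b for _, b in years.values()),
    )
    for school, years in counts.items()
}

# Percentage of each school's students who are in Year 11 (the lurking variable).
share_yr11 = {
    school: round(100 * sum(years["Yr 11"]) / sum(g + b for g, b in years.values()))
    for school, years in counts.items()
}

print(by_year)
print(overall)
print(share_yr11)
```

Because most of School B's students are in Year 11, where girls are more common in both schools, its overall percentage is pulled up even though it is lower within each year level.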
3rd
Note
Displaying and Describing Quantitative Data
3rd
E.g.
Displaying and Describing Quantitative Data
• Construct a box-and-whisker plot for the following data: 3, 8, 1, 5, 3, -2, 3
Solution:
Ordered: -2, 1, 3, 3, 3, 5, 8
Median: 3
Q1: 2
Q3: 4
IQR: 4 – 2 = 2
1.5 * IQR = 3
LF = Q1 – 3 = -1
UF= Q3 + 3 = 7
So, whiskers at 1 and 5, outliers are -2 and 8
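The same steps can be sketched in Python. This is an illustration only; it assumes quartiles are found as medians of the lower and upper halves with the overall median counted in both halves, which is the convention that reproduces Q1 = 2 and Q3 = 4 here. Other quartile conventions (and most software defaults) can give different values.

```python
from statistics import median

data = [3, 8, 1, 5, 3, -2, 3]
ordered = sorted(data)
n = len(ordered)

# Quartiles as medians of the lower and upper halves. With an odd number of
# values the middle value is included in both halves (assumed convention).
half = (n + 1) // 2
q1 = median(ordered[:half])
q3 = median(ordered[n - half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme values that are still inside the fences.
inside = [x for x in ordered if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(median(ordered), q1, q3, iqr, lower_fence, upper_fence, whiskers, outliers)
```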
4th
Note
Interpretation of slope coefficient
Clip:
http://www.youtube.com/watch?v=BgCoGYXwD4w&list=UUZFQ2rSVMR2ahKAzBto5P7w
4th
E.g.
Correlation and Linear Regression
• The difference between r (correlation coefficient) and R² (the coefficient of determination)…
• The difference between interpreting r and
commenting on a scatter plot…
• Question – True or false? Two variables which are
strongly related will always have a high
correlation coefficient. Explain…
• Is this point unusual? What to do…
5th
E.g.
Probability and Expected Values
Be aware of the following:
* V[X + c] ≠ V[X] + c; instead V[X + c] = V[X]
* SD[X + Y] ≠ SD[X] + SD[Y]; instead SD[X + Y] = √(Var[X] + Var[Y]) when X and Y are independent
* where X, Y are random variables, c is a constant.
* Note the two tests for independence…
* Interpretation of expected value: we expect ….(include units)… in the
long run, on average.
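The two warnings about variance and standard deviation can be verified exactly on a small example. The sketch below is illustrative only: the choice of a fair die for X, a fair 0/1 coin for Y, and c = 10 is hypothetical and not from the slides.

```python
from fractions import Fraction
from itertools import product
from math import sqrt


def mean(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def var(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())


# Hypothetical example: X is a fair six-sided die, Y is a fair coin scored 0 or 1.
X = {x: Fraction(1, 6) for x in range(1, 7)}
Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
c = 10

# Distribution of X + c: every value shifts by c, the probabilities are unchanged.
X_plus_c = {x + c: p for x, p in X.items()}

# Distribution of X + Y for independent X and Y: joint probability is the product.
X_plus_Y = {}
for (x, px), (y, py) in product(X.items(), Y.items()):
    X_plus_Y[x + y] = X_plus_Y.get(x + y, 0) + px * py

print(var(X), var(X_plus_c))                       # adding a constant leaves the variance alone
print(var(X_plus_Y), var(X) + var(Y))              # variances add for independent variables
print(sqrt(var(X_plus_Y)), sqrt(var(X)) + sqrt(var(Y)))  # standard deviations do not add
```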
5th
E.g.
Probability and Expected Values
Questions:
1. Find the formula for P(A or B) if A and B are: independent; not
independent.
2. Find the formula for P(A and B) if A and B are: disjoint; not
disjoint.
3. Consider disjoint events A and B, which both have non-zero
probabilities. Can A and B ever be independent? Explain in
words or using formulae.
4. Complete the following: E[aX + bY + c]; Var[aX + bY + c],
where a, b are constants, and X, Y are independent random
variables
5th
E.g.
Probability and Expected Values
Consider a single trial with two outcomes, success (which we will represent by a
1) or failure (0).
Let the probability of success be p.
y       0    1
Pr(y)   ?    p
a) What is the probability of failure? Hint: you need to make sure the probability model is valid.
b) Write down the formula for calculating the expected value.
c) Use this to work out E(y) in terms of p.
d) Write down the formula for calculating variance.
e) Use this to show Var(y) = p(1-p).
Solutions
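One way to check an answer to this exercise is to apply the definitions E(y) = Σ y·Pr(y) and Var(y) = Σ (y − E(y))²·Pr(y) to the two-row probability table. The sketch below is an illustration, not the official solution; the function name `bernoulli_summary` and the trial value p = 3/10 are hypothetical.

```python
from fractions import Fraction


def bernoulli_summary(p):
    """Return (P(failure), E(y), Var(y)) for one trial with P(success) = p."""
    q = 1 - p                 # probabilities in a valid model must sum to 1
    model = {0: q, 1: p}      # the probability table: y -> Pr(y)
    expected = sum(y * pr for y, pr in model.items())
    variance = sum((y - expected) ** 2 * pr for y, pr in model.items())
    return q, expected, variance


p = Fraction(3, 10)  # illustrative value only
q, expected, variance = bernoulli_summary(p)
print(q, expected, variance, p * (1 - p))
```

With exact fractions the computed variance matches p(1 − p) for any p tried, which is consistent with the algebraic result the exercise asks for.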
Normal and sampling distributions
Note
6th
• The four types of normal probability questions:
1. P(X < A)
2. P(A < X < B) = P(X < B) – P(X < A)
3. P(X > B) = 1 – P(X < B); for the standard normal Z, also P(Z > b) = P(Z < -b) by symmetry
4. Given the probability, what are the boundaries?
(Types 2 and 3 are rewritten in terms of "<" because Z tables only have < probs.)
Model         Proportions   Means
Shape         Normal        Normal
Centre        Mean          Mean
Spread        Variance      Variance
Assumptions   1. 2.         1. 2.
Conditions    1. 2. 3.      1. 2. 3.
Clip:
http://www.youtube.com/watch?v=ddBdqqtXiao&feature=c4overview&list=UUZFQ2rSVMR2ahKAzBto5P7w
6th
E.g.
Normal distribution
The length, X cm, of members of a certain species
of fish is normally distributed with mean 40 and
standard deviation 5.
a. Find the probability that a fish is longer than 45
cm.
b. Find the probability that a fish is between 35 cm
and 50 cm long.
c. Describe the longest 10% of this species of fish.
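Answers read from Z tables can be cross-checked with Python's standard library. This is a sketch rather than the worked solution; it uses `statistics.NormalDist` (Python 3.8+), and the variable names are illustrative.

```python
from statistics import NormalDist

fish = NormalDist(mu=40, sigma=5)  # length X in cm

# a. P(X > 45) = 1 - P(X < 45)
p_longer_45 = 1 - fish.cdf(45)

# b. P(35 < X < 50) = P(X < 50) - P(X < 35)
p_between = fish.cdf(50) - fish.cdf(35)

# c. The longest 10% are the fish above the 90th percentile of length.
cutoff_top10 = fish.inv_cdf(0.90)

print(round(p_longer_45, 4), round(p_between, 4), round(cutoff_top10, 2))
```

Table-based answers may differ slightly in the last decimal place because Z values are rounded to two decimals before lookup.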
Solutions
7th
Confidence intervals and hypothesis tests
Note
Proportions
• Confidence intervals for proportions: p̂ ± z · √(p̂q̂ / n), where q̂ = 1 – p̂
• Remember to check conditions
CI   90%     95%    99%
z    1.645   1.96   2.576
• Interpretation: we are 95% confident the population proportion lies between [lower bound] and [upper bound]
• n = z² · p̂q̂ / ME², where ME is the margin of error
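The interval and sample-size formulas for proportions can be sketched as below. The function names and the sample figures (120 successes out of 400, a target margin of error of 0.03) are hypothetical and chosen only for illustration; the conditions still need to be checked separately.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_star(confidence):
    """Critical z value for a two-sided interval, e.g. about 1.96 for 95%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def proportion_ci(successes, n, confidence=0.95):
    """Interval p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    me = z_star(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me, p_hat + me


def sample_size(me, confidence=0.95, p_hat=0.5):
    """Smallest n giving margin of error `me`; p_hat = 0.5 is the cautious default."""
    return ceil(z_star(confidence) ** 2 * p_hat * (1 - p_hat) / me ** 2)


lower, upper = proportion_ci(120, 400)  # hypothetical sample
print(round(lower, 4), round(upper, 4))
print(sample_size(0.03))
```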
7th
Confidence intervals and hypothesis tests
Note
Means
• CI: ȳ ± t · s / √n
where s = sample standard deviation
and where t has df = n – 1
• Remember to check conditions
Demo – finding t from tables
• Similar interpretation…
7th
Confidence intervals and hypothesis tests
Note
Hypothesis tests of one proportion
• Hypothesis test: one-tailed (< >) or two-tailed
• Conditions
• State model (using z or t)
• Standardised statistic
• P-value (or… learn other way this week, ‘critical value’ approach)
• Conclusion
7th
Hypothesis test: 1 proportion
E.g.
Historically, 53% of the population supported the ruling
political party. A recent survey, in which the 150
respondents were selected randomly, showed that 93 of
them supported the party. A two-tailed z-test at the 0.05
level of significance is to be used to determine whether
or not the population proportion has significantly
changed.
a. State the null hypothesis and the alternative hypothesis.
b. Check the conditions that justify inference in this
context.
c. Determine whether or not the null hypothesis should
be rejected, and make a conclusion based on your
finding.
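The arithmetic for this test can be sketched in Python as a cross-check on a handwritten answer. It is not the handwritten solution itself; the threshold of at least 10 expected successes and failures is an assumed rule of thumb, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.53      # historical proportion under H0: p = 0.53 (HA: p != 0.53)
n = 150        # respondents, selected randomly
x = 93         # respondents supporting the party
alpha = 0.05

p_hat = x / n

# Success/failure condition under H0 (assumed rule of thumb: both at least 10).
expected_successes = n * p0
expected_failures = n * (1 - p0)
conditions_ok = expected_successes >= 10 and expected_failures >= 10

# Standard error uses p0 because the test assumes H0 is true.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-tailed p-value.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = p_value < alpha

print(round(p_hat, 2), round(z, 2), round(p_value, 4), reject_h0)
```

Under these assumptions the p-value falls below 0.05, so the sketch rejects H0 and concludes the population proportion has changed.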
Handwritten solution
8th
Inference so far… reviewing the p-value
Note
Inference so far…
8th
Note
Inference so far…
hypothesis tests for counts
8th
Note
8th
E.g.
Hypothesis test: 1 mean
• Previous research has shown that the
average IQ of Australians was 110. In
2012, a random sample of 40 Australians
revealed an average IQ of 100 with
standard deviation 15. The researcher
wants to test, at a 1% level of
significance, whether the average IQ of
Australians has indeed decreased.
• (Fictional data)
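A sketch of the calculation for this one-tailed t-test follows. The Python standard library has no t distribution, so the critical value is an assumption: approximately -2.426, as read from t tables for df = 39 at the 1% level (tables that list only df = 40 give -2.423). The variable names are illustrative.

```python
from math import sqrt

mu0 = 110     # H0: mu = 110, HA: mu < 110
n = 40
y_bar = 100   # sample mean
s = 15        # sample standard deviation

se = s / sqrt(n)          # standard error of the sample mean
t = (y_bar - mu0) / se    # standardised test statistic
df = n - 1

# Assumed critical value from t tables (df = 39, one tail, alpha = 0.01).
t_crit = -2.426
reject_h0 = t < t_crit

print(round(se, 3), round(t, 2), df, reject_h0)
```

The statistic lies well below the assumed critical value, so under this sketch H0 is rejected at the 1% level.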
Handwritten solution
9th
Note
Excel Output
9th
Note
Inference in regression
9th
Inference in regression
E.g.
We are estimating the relationship between bwght (birth weight of newborn
baby in pounds) and cigs (packets of cigarettes smoked per week by mother
prior to birth).
Consider the Excel output below and answer the following questions.
Regression Statistics
Multiple R          -0.1507
R Square            0.0227
Adjusted R square   0.022
Standard Error      1.258
Observations        1388

ANOVA
             df     SS           MS           F       Significance F
Regression   1      51.0172632   51.0172632   32.24   0
Residual     1386   2193.55977   1.58265495
Total        1387   2244.57703   1.61829634

             Coefficients   S. Error    tstat    P-value   Lower 95%    Upper 95%
Intercept    7.485744       0.0357713   209.27   0         7.415572     7.555916
cigs         -0.0321108     0.0056557   -5.68    0         -0.0432054   -0.0210161
a. Which do you think is the explanatory variable and which is the response variable?
b. Write down and interpret the correlation coefficient.
c. Write down and interpret R² (the coefficient of determination).
d. Interpret the slope and the intercept.
e. Are the signs and sizes of the slope and intercepts reasonable? Explain.
f. Write down and interpret the 95% confidence interval for the slope.
g. Do the same for the 90% confidence interval. Explain how this differs from
the 95% confidence interval.
h. Formulate a null and alternative hypothesis for the slope, using economic or
general theory.
i. Conduct this hypothesis test using a 5% level of significance and make a
conclusion.
j. Test whether the slope is significantly different from -0.05 at a 1% level of
significance.
k. Suppose a hypothesis test for the slope had hypotheses H0: β1 = 0, and HA:
β1≠0. Explain the purpose of conducting this test in terms of assessing
whether the current regression model should be used.
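Several of the numerical parts (f, g, i, j) can be reproduced from the slope estimate and its standard error in the Excel output. The sketch below is illustrative only: it approximates the t distribution with df = 1386 by the standard normal, which is an assumption that is reasonable only because the sample is large, so the bounds differ from Excel's in the fifth decimal place.

```python
from statistics import NormalDist

b1 = -0.0321108       # slope coefficient on cigs
se_b1 = 0.0056557     # its standard error
n = 1388
df = n - 2            # 1386, large enough for the normal approximation below


def crit(confidence):
    """Two-sided critical value; normal approximation to t with df = 1386."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def slope_ci(confidence):
    """Confidence interval b1 +/- critical value * standard error."""
    me = crit(confidence) * se_b1
    return b1 - me, b1 + me


ci95 = slope_ci(0.95)
ci90 = slope_ci(0.90)   # narrower: less confidence, smaller margin of error

# Part i: H0: beta1 = 0. Part j: H0: beta1 = -0.05 (two-tailed).
t_zero = (b1 - 0) / se_b1
t_other = (b1 - (-0.05)) / se_b1
reject_zero_at_5pct = abs(t_zero) > crit(0.95)
reject_other_at_1pct = abs(t_other) > crit(0.99)

# R squared recomputed from the ANOVA table: SS Regression / SS Total.
r_squared = 51.0172632 / 2244.57703

print(ci95, ci90)
print(round(t_zero, 2), round(t_other, 2), reject_zero_at_5pct, reject_other_at_1pct)
print(round(r_squared, 4))
```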
10th
Note
Notation - recap:
• μ: Population mean
• ȳ: Sample mean
• σ: Population standard deviation (variability of individual observations)
• s: Sample standard deviation
• σ_ȳ = σ / √n (or s / √n for estimate): Standard deviation of sample means
• n: Sample size
• N: Population size
• P: Population proportion
• p̂: Sample proportion
• p-value: See definition…
• b0, b1: Sample coefficient on intercept/slope in regression
• β0, β1: Population coefficient on intercept/slope in regression
10th
Note
Multiple Linear Regression; Dummy Variables; Time Series – some things to note
Multiple linear regression
• Interpretation of slope coefficient: we estimate that for every [one unit] increase in [explanatory variable], the [response variable] [increases/decreases] by [… units], on average, holding all other explanatory variables fixed.
• Inference on the whole equation
• H0: β1 = β2 = … = 0
⇒ no linear relationship between Y and X1, X2, …
• HA: β1 ≠ 0 and/or β2 ≠ 0 …
⇒ at least one of the slopes is significant; there is a significant relationship between the response variable and the explanatory variables as a group.
• Use p-value from Excel ⇒ “Significance-F”
10th
Note
Multiple Linear Regression; Dummy Variables; Time Series – some things to note
Dummy variables
• Interpretation of dummy variables… see example.
• The dummy variable trap…
• Testing the significance of a dummy variable is the same as
testing whether there is a significant difference between the
means of the two categories.
Time Series
Components of a classical time series model: Trend, Cyclical, Seasonal, Irregular
• Interpretation of trend line, trend = a + bt
• Trend is [a units] at [origin] and [increases / decreases] by [b units] each [time period, t].
10th
E.g.
Dummy Variables
1. Consider the following equation:
• Income = β0 + β1experience + β2gender + ε
• where gender = 1 if male, 0 if female.
a. State what you expect the sign of β1 and β2 to be. Explain why.
b. Interpret the following:
i. The slope coefficient on gender.
ii. The slope coefficient on experience.
c. Redefine gender to be 1 if female, 0 if male. What happens to β2?
2. Suppose that we want to examine the level of crime in different regions of Adelaide: north, south, east and west. In other words, in our regression model, crime level is the response variable, and region is the explanatory variable. Create a dummy variable for the region.
Solutions – for 2
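One possible coding for question 2 uses three 0/1 dummies for the four regions, leaving one region as the base category so that the dummy variable trap is avoided. The sketch below is an illustration, not the official solution; the choice of west as the base and the names `D_north`, `D_south`, `D_east` are assumptions.

```python
REGIONS = ["north", "south", "east", "west"]
BASE = "west"  # assumed base category; any one region could play this role


def region_dummies(region):
    """Code a region as three 0/1 dummies; the base region is all zeros."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {f"D_{r}": int(region == r) for r in REGIONS if r != BASE}


for region in REGIONS:
    print(region, region_dummies(region))
```

With this coding the model would be crime = β0 + β1 D_north + β2 D_south + β3 D_east + ε, where β0 is the mean crime level in the base region and each slope is the difference between that region's mean and the base region's mean. Including a fourth dummy for west as well would make the dummies sum to 1 for every observation, which is the dummy variable trap.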
11th
Note
Time Series and Price Indices
• Price relative = 100 * Pt / P0
• Be careful about the difference between a percentage increase and a percentage point increase.

Year          Base year   A   B
Price index   100         a   b
Assume a, b > 100

• Interpretation: a price index of a in Year A means prices are (a – 100)% higher in Year A than in the base year / there has been an (a – 100)% increase
• The increase in the index number from Year A to Year B is (b – a) percentage points or ((b – a) / a) * 100 %
• Note: you could do the same using prices, instead of price indices.
• Interpretation of average price relatives: on average, the price of the … goods increased by …% between … and … (*)
• Could do the same for expenditure … (Pt Qt) / (P0 Q0) * 100 … but of little use.
• Same interpretation, but instead of “price” use “cost”.
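The difference between a percentage point change and a percentage change in an index can be seen with a small calculation. The sketch below is illustrative; the prices and index values (base year 100, Year A 120, Year B 150) are hypothetical.

```python
def price_relative(p_t, p_0):
    """Price relative = 100 * Pt / P0."""
    return 100 * p_t / p_0


def index_change(index_a, index_b):
    """Change from Year A to Year B in percentage points and in percent."""
    points = index_b - index_a
    percent = 100 * (index_b - index_a) / index_a
    return points, percent


# Hypothetical item: price rises from 2.00 in the base year to 2.50 in year t.
relative = price_relative(2.50, 2.00)

# Hypothetical index values: Year A = 120, Year B = 150 (base year = 100).
points, percent = index_change(120, 150)

print(relative)          # price is 25% above the base-year price
print(points, percent)   # a rise of 30 percentage points, which is 25 percent
```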
11th
Note
Time Series and Price Indices
• Laspeyres Price Index = (ΣPt Q0 / ΣP0 Q0) * 100. This measures the change in the cost of the time 0 basket of goods at time t relative to what it cost at time 0.
• Paasche Price Index = (ΣPt Qt / ΣP0 Qt) * 100. This measures the change in the cost of the time t basket of goods at time t relative to what it would have cost at time 0.
• Same interpretation as (*)
• Note:
• Why the Laspeyres and Paasche Indices differ.
• How to shift the base, and chain series.
• Nominal = in current prices. Real = in constant (base year) prices
• Real prices = (nominal value / price index) * 100 (if price index base = 100)
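The two indices and the conversion to real values can be sketched as follows. The two-good basket, its prices and quantities, and the function names are all hypothetical, chosen only to show how the weighting differs.

```python
def laspeyres(p0, pt, q0):
    """Laspeyres price index: base-period (time 0) quantities as weights."""
    return 100 * sum(pt[g] * q0[g] for g in q0) / sum(p0[g] * q0[g] for g in q0)


def paasche(p0, pt, qt):
    """Paasche price index: current-period (time t) quantities as weights."""
    return 100 * sum(pt[g] * qt[g] for g in qt) / sum(p0[g] * qt[g] for g in qt)


def real_value(nominal, price_index):
    """Deflate a nominal value with a price index whose base is 100."""
    return 100 * nominal / price_index


# Hypothetical basket: bread gets dearer, so people buy less bread and more milk.
p0 = {"bread": 2.0, "milk": 1.0}
pt = {"bread": 3.0, "milk": 1.0}
q0 = {"bread": 10, "milk": 20}
qt = {"bread": 8, "milk": 25}

L = laspeyres(p0, pt, q0)
P = paasche(p0, pt, qt)
print(L, round(P, 2), real_value(250, L))
```

In this made-up basket the Laspeyres index is higher than the Paasche index because it keeps the old quantities of the good whose price rose, which is one reason the two indices differ.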
11th
Note
Time Series and Price Indices
Discussion question – what are the limitations of the CPI?
• Overestimates price increases because it is a type of Laspeyres index
• What items are included in the goods basket? (Can’t include
all of them!)
• Only surveys metropolitan households
• Data taken from survey – potential sources of sampling bias
• Does not account for change in quality in goods with same /
lower price (e.g. computers)
• How do you include new technology that didn’t exist in the
previous period?
• What prices do you take? CPI doesn’t take into account sales /
specials