1st Business and Economic Statistics
Tutorial 1: Describing Categorical Data (Ch 4)
Tutor: Sam Capurso
E-mail: ...

1. Why Statistics?
Statistics:
* Initiates policy / decisions
* Evaluates and informs policy / decisions
Accountants work in an economy (in fact, everyone does):
* i
* P
* E.R.
* Confidence: Business, Consumer
* More...

2. Prac set up
Task                                          Minutes
Attendance, hand back work                    5
Summary for this week                         5 - 10
Individual written work (4 in the semester)   10
Individual MCQ test                           10 - 15
Group MCQ scratchy test                       10 (or until finished)
Group WAQ                                     Approx 1 hour
Worked Example                                10 - 15

3. First prac (only)
* Introduction
* “House keeping”
* Arrange groups
* Work out team names and take attendance
* Prac work

4. Things to note
* Need to attend lectures and read text BEFORE PRAC
* Assessment for pracs = Indiv MCQ (5%)^ + Team MCQ (5%)^ + Team WAQ (10%)^^
  ^ Hand in prac
  ^^ Hand in by due date: … in hand-in box: names, ID numbers, time, day, tutor.

5. Add previous prac’s results

Building a House
Group activity
Roles
* Architect – design, framework, ideas
* Tradesperson – technical, 'expert' in field
* Superintendent – leader, knowledge of different areas
* Decorator – finer details, user-friendliness
* Real estate agent – communication, 'sells the product'
* General contractor – follows direction, able to learn how to perform different roles
Task
Questions:
1. Why did you choose this role?
2. What types of skills / experiences are related to this role?
3. What are the ways in which someone in your role can work with someone from (choose a different role)?
4. How can you relate this activity to working in your BES team?
2nd Note Stratified and clustered sampling
Clip: http://www.youtube.com/watch?v=CvPPM2stuPg&feature=c4overview&list=UUZFQ2rSVMR2ahKAzBto5P7w

Sampling (stages, with the bias that can enter at each):
Population
  Undercoverage
Sampling frame (list)
Target sample
  Non-response bias
  Voluntary response bias
  Convenience sampling
Actual sample (respondents)
  Response bias

Note:
* n↑ ≠ ↓bias
* n↑ → ↓sampling error (error due to randomness)
* Need to improve survey design to reduce bias
* If ↑ n, just asking more people the wrong question!

2nd E.g. Simpson’s Paradox

School     Girls  Boys  Total
School A   273    77    350
School B   289    61    350
Total      562    138   700

Which school had the higher proportion of girls?

School     % girls
School A   78%
School B   83%

Overall, School B has the higher proportion of girls.

Now break each school down by year level:

School A   Girls  Boys  Total
Yr 11      81     6     87
Yr 12      192    71    263
Total      273    77    350

School B   Girls  Boys  Total
Yr 11      234    36    270
Yr 12      55     25    80
Total      289    61    350

Percentage of girls by school broken into year levels

School     Yr 11  Yr 12
School A   93%    73%
School B   87%    69%

School A has the higher proportion of girls in each year level. So something must be going on with the year levels when we add them up to get the overall result above.

Percentage of girls in each year level (both schools combined)

Year level   Yr 11  Yr 12
% girls      88%    72%

% Yr 11 in each school

School     School A  School B
% Yr 11    25%       77%

So, the proportion of girls is exaggerated in School B, because...
* Year 11 students are more likely to be girls, and
* School B has a higher proportion of Year 11 students

3rd Note Displaying and Describing Quantitative Data

3rd E.g.
Displaying and Describing Quantitative Data
• Construct a box-and-whisker plot for the following data: 3, 8, 1, 5, 3, -2, 3
Solution:
• Ordered: -2, 1, 3, 3, 3, 5, 8
• Median: 3
• Q1: 2
• Q3: 4
• IQR: 4 – 2 = 2
• 1.5 * IQR = 3
• LF (lower fence) = Q1 – 3 = -1
• UF (upper fence) = Q3 + 3 = 7
• So, whiskers at 1 and 5, outliers are -2 and 8

4th Note Interpretation of slope coefficient
Clip: http://www.youtube.com/watch?v=BgCoGYXwD4w&list=UUZFQ2rSVMR2ahKAzBto5P7w

4th E.g. Correlation and Linear Regression
• The difference between r (correlation coefficient) and R² (the coefficient of determination)…
• The difference between interpreting r and commenting on a scatter plot…
• Question – True or false? Two variables which are strongly related will always have a high correlation coefficient. Explain…
• Is this point unusual? What to do…

5th E.g. Probability and Expected Values
Be aware of the following:
* V[X + c] ≠ V[X] + c
* SD[X + Y] ≠ SD[X] + SD[Y]; instead, SD[X + Y] = $\sqrt{Var[X] + Var[Y]}$ when X and Y are independent
* where X, Y are random variables, c is a constant.
* Note the two tests for independence…
* Interpretation of expected value: we expect ….(include units)… in the long run, on average.

Questions:
1. Find the formula for P(A or B) if A and B are: independent; not independent.
2. Find the formula for P(A and B) if A and B are: disjoint; not disjoint.
3. Consider disjoint events A and B, which both have non-zero probabilities. Can A and B ever be independent? Explain in words or using formulae.
4. Complete the following: E[aX + bY + c]; Var[aX + bY + c], where a, b are constants, and X, Y are independent random variables.

Consider a single trial with two outcomes, success (which we will represent by a 1) or failure (0). Let the probability of success be p.

y       0   1
Pr(y)   ?   p

a) What is the probability of failure? Hint: you need to make sure the probability model is valid.
b) Write down the formula for calculating the expected value.
c) Use this to work out E(y) in terms of p.
d) Write down the formula for calculating variance.
e) Use this to show Var(y) = p(1-p).
Solutions

6th Note Normal and sampling distributions
• The four types of normal probability questions:
  P(X < A)
  P(A < X < B) = P(X < B) – P(X < A)
  P(X > B) = 1 – P(X < B) (for a standard normal Z, symmetry also gives P(Z > b) = P(Z < -b))
  Given the probability, what are the boundaries?
  The second and third types are rewritten this way because Z tables only have < probabilities.

Sampling distributions (summary table to complete):

              Proportions     Means
Shape         Normal model    Normal model
Centre        Mean            Mean
Spread        Variance        Variance
Assumptions   1. 2.           1. 2.
Conditions    1. 2. 3.        1. 2. 3.

Clip: http://www.youtube.com/watch?v=ddBdqqtXiao&feature=c4overview&list=UUZFQ2rSVMR2ahKAzBto5P7w

6th E.g. Normal distribution
The length, X cm, of members of a certain species of fish is normally distributed with mean 40 and standard deviation 5.
a. Find the probability that a fish is longer than 45 cm.
b. Find the probability that a fish is between 35 cm and 50 cm long.
c. Describe the longest 10% of this species of fish.
Solutions

7th Note Confidence intervals and hypothesis tests
Proportions
• Confidence intervals for proportions: $\hat{p} \pm z\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
• Remember to check conditions

  CI   90%     95%    99%
  z    1.645   1.96   2.576

• Interpretation: we are 95% confident the population proportion lies between [lower bound] and [upper bound]
• Sample size for a given margin of error ME: $n = \dfrac{z^{2}\hat{p}\hat{q}}{ME^{2}}$

Means
• CI: $\bar{y} \pm t\,\dfrac{s}{\sqrt{n}}$, where s = sample standard deviation and t has df = n – 1
• Remember to check conditions
• Demo – finding t from tables
• Similar interpretation…

Hypothesis tests of one proportion
• Hypothesis test: one-tailed (< >) or two-tailed
• Conditions
• State model using (z or t)
• Standardised statistic
• P-value (or… learn other way this week, ‘critical value’ approach)
• Conclusion

7th E.g. Hypothesis test: 1 proportion
Historically, 53% of the population supported the ruling political party.
A recent survey, in which the 150 respondents were selected randomly, showed that 93 of them supported the party. A two-tailed z-test at the 0.05 level of significance is to be used to determine whether or not the population proportion has significantly changed.
a. State the null hypothesis and the alternative hypothesis.
b. Check the conditions that justify inference in this context.
c. Determine whether or not the null hypothesis should be rejected, and make a conclusion based on your finding.
Handwritten solution

8th Note Inference so far… reviewing the p-value
8th Note Inference so far…
8th Note Inference so far… hypothesis tests for counts

8th E.g. Hypothesis test: 1 mean
• Previous research has shown that the average IQ of Australians was 110. In 2012, a random sample of 40 Australians revealed an average IQ of 100 with standard deviation 15. The researcher wants to test, at a 1% level of significance, whether the average IQ of Australians has indeed decreased.
• (Fictional data)
Handwritten solution

9th Note Excel Output
9th Note Inference in regression

9th E.g. Inference in regression
We are estimating the relationship between bwght (birth weight of newborn baby in pounds) and cigs (packets of cigarettes smoked per week by mother prior to birth). Consider the Excel output below and answer the following questions.

Regression Statistics
Multiple R          -0.1507
R Square            0.0227
Adjusted R square   0.022
Standard Error      1.258
Observations        1388

ANOVA
             df     SS           MS           F       Significance F
Regression   1      51.0172632   51.0172632   32.24   0
Residual     1386   2193.55977   1.58265495
Total        1387   2244.57703   1.61829634

            Coefficients   S. Error    tstat    P-value   Lower 95%    Upper 95%
Intercept   7.485744       0.0357713   209.27   0         7.415572     7.555916
cigs        -0.0321108     0.0056557   -5.68    0         -0.0432054   -0.0210162

a. Which do you think is the explanatory variable and which is the response variable?
b. Write down and interpret the correlation coefficient.
c.
Write down and interpret R² (the coefficient of determination).
d. Interpret the slope and the intercept.
e. Are the signs and sizes of the slope and intercept reasonable? Explain.
f. Write down and interpret the 95% confidence interval for the slope.
g. Do the same for the 90% confidence interval. Explain how this differs from the 95% confidence interval.
h. Formulate a null and alternative hypothesis for the slope, using economic or general theory.
i. Conduct this hypothesis test using a 5% level of significance and make a conclusion.
j. Test whether the slope is significantly different from -0.05 at a 1% level of significance.
k. Suppose a hypothesis test for the slope had hypotheses H0: β1 = 0, and HA: β1 ≠ 0. Explain the purpose of conducting this test in terms of assessing whether the current regression model should be used.

10th Note Notation - recap:
• μ: Population mean
• $\bar{y}$: Sample mean
• σ: Population standard deviation (variability of individual observations)
• s: Sample standard deviation
• $\sigma_{\bar{y}} = \dfrac{\sigma}{\sqrt{n}}$ (or $\dfrac{s}{\sqrt{n}}$ for estimate): Standard deviation of sample means
• n: Sample size
• N: Population size
• p: Population proportion
• $\hat{p}$: Sample proportion
• p-value: See definition…
• b0, b1: Sample coefficient on intercept/slope in regression
• β0, β1: Population coefficient on intercept/slope in regression

10th Note Multiple Linear Regression; Dummy Variables; Time Series – some things to note
Multiple linear regression
• Interpretation of slope coefficient: we estimate that for every [one unit] increase in [explanatory variable], the [response variable] [increases/decreases] by [… units], on average, holding all other explanatory variables fixed.
• Inference on the whole equation
• H0: β1 = β2 = … = 0: no linear relationship between Y and X1, X2, …
• HA: β1 ≠ 0 and/or β2 ≠ 0: at least one of the slopes is significant; there is a significant relationship between the response variable and the explanatory variables as a group.
• Use p-value from Excel “Significance-F”

Dummy variables
• Interpretation of dummy variables… see example.
• The dummy variable trap…
• Testing the significance of a dummy variable is the same as testing whether there is a significant difference between the means of the two categories.

Time Series
• Components of a classical time series model: Trend, Cyclical, Seasonal, Irregular
• Interpretation of trend line, trend = a + bt: trend is [a units] at [origin] and [increases / decreases] by [b units] each [time period, t].

10th E.g. Dummy Variables
1. Consider the following equation:
• Income = β0 + β1experience + β2gender + ε
• where gender = 1 if male, 0 if female.
a. State what you expect the sign of β1 and β2 to be. Explain why.
b. Interpret the following:
   i. The slope coefficient on gender.
   ii. The slope coefficient on experience.
c. Redefine gender to be 1 if female, 0 if male. What happens to β2?
2. Suppose that we want to examine the level of crime in different regions of Adelaide: north, south, east and west. In other words, in our regression model, crime level is the response variable, and region is the explanatory variable. Create a dummy variable for the region.
Solutions – for 2

11th Note Time Series and Price Indices
• Price relative = $100 \times \dfrac{P_t}{P_0}$
• Be careful about the difference between a percentage increase and percentage point increase.

  Year          Base year   A   B
  Price index   100         a   b
  Assume a, b > 100

• Interpretation: a price index of a in Year A means prices are (a – 100)% higher in Year A than in the base year / there has been an (a – 100)% increase
• The increase in the index number from Year A to Year B is (b – a) percentage points or $\dfrac{b - a}{a} \times 100$ %
• Note: you could do the same using prices, instead of price indices.
• Interpretation of average price relatives: on average, the price of the … goods increased by …% between … and … (*)
• Could do the same for expenditure … $\dfrac{P_t Q_t}{P_0 Q_0} \times 100$ … but of little use.
• Same interpretation, but instead of “price” use “cost”.

11th Note Time Series and Price Indices
• Laspeyres Price Index = $\dfrac{\sum P_t Q_0}{\sum P_0 Q_0} \times 100$. This measures the change in the cost of the time 0 basket of goods at time t relative to what it cost at time 0.
• Paasche Price Index = $\dfrac{\sum P_t Q_t}{\sum P_0 Q_t} \times 100$. This measures the change in the cost of the time t basket of goods at time t (e.g. 2010) relative to what it would have cost at time 0 (e.g. 2008).
• Same interpretation as (*)
• Note:
  • Why the Laspeyres and Paasche Indices differ.
  • How to shift the base, and chain series.
  • Nominal = in current prices. Real = in constant (base year) prices.
  • Real prices = $\dfrac{\text{nominal value}}{\text{price index}} \times 100$ (if price index base = 100)

11th Note Time Series and Price Indices
Discussion question – what are the limitations of the CPI?
• Overestimates the price index because it is a type of Laspeyres index
• What items are included in the goods basket? (Can’t include all of them!)
• Only surveys metropolitan households
• Data taken from survey – potential sources of sampling bias
• Does not account for change in quality in goods with same / lower price (e.g. computers)
• How do you include new technology that didn’t exist in the previous period?
• What prices do you take? CPI doesn’t take into account sales / specials
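
As a rough numerical check of the Laspeyres, Paasche and real-value formulas in the price index notes, the following Python sketch works through a two-good basket. All prices and quantities are made up for illustration (they are not from the tutorial), and the function names are hypothetical helpers rather than part of any library.

```python
# Hypothetical two-good basket; every number here is made up for illustration.
p0 = [2.0, 5.0]   # prices at time 0 (base period)
q0 = [10, 4]      # quantities bought at time 0
pt = [3.0, 6.0]   # prices at time t
qt = [8, 5]       # quantities bought at time t


def basket_cost(prices, quantities):
    """Total cost of buying the given quantities at the given prices."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(p0, q0, pt):
    """Time-0 basket costed at time-t prices, relative to time-0 prices."""
    return 100 * basket_cost(pt, q0) / basket_cost(p0, q0)


def paasche(p0, pt, qt):
    """Time-t basket costed at time-t prices, relative to time-0 prices."""
    return 100 * basket_cost(pt, qt) / basket_cost(p0, qt)


def real_value(nominal, price_index):
    """Deflate a nominal value using a price index whose base is 100."""
    return 100 * nominal / price_index


L = laspeyres(p0, q0, pt)
P = paasche(p0, pt, qt)
print(f"Laspeyres: {L:.1f}")   # 135.0
print(f"Paasche:   {P:.1f}")   # 131.7
print(f"Real value of a nominal 270: {real_value(270, L):.1f}")   # 200.0
```

In this made-up basket the first good's price rises by 50% and less of it is bought, while the second good's price rises by 20% and more of it is bought. The Laspeyres index keeps the old quantities and so comes out higher than the Paasche index, which is one way to see why a Laspeyres-type index such as the CPI can overstate the price increase.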