Lecture 1: Course Introduction – Descriptive Statistics
B6014 Managerial Stats. – Prof. Jing Dong

Managerial Statistics
Professor Jing Dong

Today:
• Course Outline
• Recap: descriptive statistics material
  – mean, median, standard deviation
• Normal Approximation

Course Introduction – Descriptive Statistics Lecture 1 / #2

Introductions
Professor: Jing Dong
Office: Uris 413
Tel: (212) 854-9154
Email: jing.dong@gsb.columbia.edu
Office hours: Thur 5:00pm-7:00pm or by appointment
TAs: Gowtham Tangirala, Zhe Liu, Sharon Huang, Yue Hu, Pengyu Qian, Aishwarya Sharma, Raghav Seth (see Canvas for names/emails/office hrs)
• Course Website: Canvas

Why are we here?
• All: Become an informed consumer of statistical information
• Most: Learn basic statistical analysis for other courses and your work
  – Core: Strategy Formulation; Corporate Finance; Marketing; Managerial Economics; Operations Management; Business Analytics
  – Elective: Capital Markets and Investments; Marketing Research; Applied Regression Analysis; Business Analytics II ...
  – Examples: estimate product reliability, evaluate the risk and reward of a portfolio, test for bias in analysts’ recommendations, predict sales on the basis of a product’s characteristics, . . .
• Many: Develop a foundation for learning further statistical methods

Course outline
• Regression
• Confidence intervals
• Sampling and sampling errors
• Normal distribution
• Random variables, Probability, Exp. Value
• Descriptive statistics and Summary measures

Course Organization
• Lectures Mon/Wed/Fri (only first 4 Fri: 9/7, 9/14, 9/21, 9/28)
  – Cluster A: 4:00pm - 5:30pm in Uris 326
  – Cluster B: 2:15pm - 3:45pm in Uris 326
  – Cluster E: 9:00am - 10:30am in Uris 326
  – Cluster F: 10:45am - 12:15pm in Uris 326
• Review sessions: Wednesdays 5:45-7:15pm, Uris 142
  – not mandatory (shared across all 8 clusters)
  – review material already covered in class; go over a predetermined set of practice problems; review Excel concepts as needed
• Midterm on Friday Sep 21 (in class); midterm exam review session (shared across all 8 clusters): Sep 18, 5:45 - 7:15pm in Uris 332
• Final exam on Friday Oct 19; final exam review session (shared across all 8 clusters): Oct 15, 9:00-10:00am in Uris 142

Class Contribution
• Professionalism in class
  – Present, on time, prepared and engaged
  – No laptops / phones / tablets
• For many sessions, you will have a few questions to prepare or answer on Canvas.
• PollEverywhere will be used during lectures, and your responses will be used for taking attendance.
• Questions and comments are strongly encouraged! It is OK to say “I’m lost”.

Reference Material
• Lecture slides are the primary source of information. They will be provided in class and posted on Canvas.
• Course Reader
  – Course Notes: secondary source for the material we will cover
  – Practice problems & Answers
  – Cases
• Supplementary textbooks (not required):
  – Levine, Stephan, Krehbiel, & Berenson, Statistics for Managers, 6th Ed., Prentice-Hall, 2010
• Resources
  – Weekly review sessions
  – Me: feel free to stop by my office hours or make an appointment.
  – Teaching Assistants: office hours and contact information on Canvas
  – Tutor: available through Student Affairs
  – Learning team

Assignments/Grading
• Midterm & Final exam: closed book (three sheets of notes, double-sided allowed)
• Four hand-in assignments:
  – 4 cases with learning team
  – First one due on Monday 9/10
• Weekly homework: problems in the course reader
  – Not graded; answers also in the course reader
• Grading: based on the maximum of the following two weighting schemes.

  Scheme   Class participation   Hand-in assignments   Midterm   Final
  1        5%                    20%                   30%       45%
  2        5%                    20%                   0%        75%

Course outline
Main goal: introduce and study how basic statistical tools are used in managerial decision making
We will cover the following key concepts of statistical analysis and inference:
• Descriptive statistics: summarize data, observe patterns, extract vital information
• Probability: systematic framework for dealing with uncertainty, basis for statistical inference
• Sampling: sample data as a guide for statistical inference
• Estimation: point & interval estimators, construction & interpretation of confidence intervals
• Regression: construction of predictive models based on statistical data

Remainder of lecture
• Quick recap of Descriptive Statistics material from pre-term videos
  – mean, median
  – standard deviation
  – Normal Approximation

• Who is the best baseball player of all time?

• What is happening to the economic health of America’s middle class?
  Graphic from CNN Money
  The per capita income in the United States climbed from $7,787 in 1980 to $26,487 in 2010.

Which printer has better quality?
• Data: for each printer sold last year, the file documents the number of quality problems that were reported during the warranty period. The total number of printers sold by each company is
  – Brand X: 57,334
  – Brand Y: 994,773
• Mean number of quality problems per printer
  – Brand X: 3.49
  – Brand Y: 2.64
• Median number of quality problems per printer
  – Brand X: 1
  – Brand Y: 2
• Standard deviation of the number of quality problems per printer
  – Brand X: 4.07
  – Brand Y: 2.15

Recap: measures of central tendency
For a dataset $X_1, X_2, \ldots, X_n$, we defined several notions of an average:
• mean or arithmetic average: $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$
• median: “midpoint” or “middle value” of the dataset
• mode: most frequently occurring value (mostly for categorical data)
• weighted average: $w_1 X_1 + \cdots + w_n X_n$, where we often require $w_1 + w_2 + \cdots + w_n = 1$.
Remarks:
• The appropriate notion of average depends on the context
• The mean is more sensitive to outliers than the median

• Who is the best baseball player of all time?
According to Steve Moyer, president of Baseball Info Solutions, the three most valuable statistics (other than age) for evaluating any player who is not a pitcher would be
  – On-base percentage (OBP): Measures the proportion of the time that a player reaches base successfully, including walks (which are not counted in the batting average).
  – Slugging percentage (SLG): Measures power hitting by calculating the total bases reached per at bat. A single counts as 1, a double is 2, a triple is 3, and a home run is 4. Thus, a batter who hit a single and a triple in five at bats would have a slugging percentage of (1 + 3)/5, or .800.
  – At bats (AB)
• Derek Jeter: OBP 0.377; SLG 0.440; batting average 0.310
• Babe Ruth: OBP 0.474; SLG 0.690; batting average 0.342

• How is the economic health of the American middle class?
According to Alan Krueger (Professor of Economics and Public Affairs at Princeton), we should examine
  – changes in the median wage (adjusted for inflation)
  – changes to wages at the 25th and 75th percentiles (adjusted for inflation)
Source: Congressional Budget Office

Recap: measure of data dispersion
• Standard deviation is the most basic and widely used measure of variability in statistical analysis
The variance $\sigma^2$ is the average squared distance to the mean:
$$\sigma^2 = \frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
and the standard deviation $\sigma$ is
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
• Variance is expressed in the original units squared; the square root recovers the original units
• Excel: STDEVP; “P” is mnemonic for “population”

Variance & Standard Deviation – version 2
When we estimate the variance and standard deviation using a data sample, we tweak the formulae as follows:
The sample variance $s^2$ is the average squared distance to the mean, with $n - 1$ in place of $n$:
$$s^2 = \frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
and the sample standard deviation $s$ is
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
• Excel: STDEV computes the sample standard deviation, i.e., divides by $n - 1$

Which one to use?
We will not worry about the distinction between the two in this course. You can use either one.
• The population standard deviation $\sigma$:
  – To calculate $\sigma$, we first find the variance and then take the square root.
  – The variance is the average squared distance to the mean
  – Excel: STDEVP; “P” stands for “population”
• The sample standard deviation $s$:
  – Divides by $n - 1$ instead of $n$.
  – This is often used when estimating the stdev of a population from a random sample
  – Excel: STDEV

Recap of mean and standard deviation calculation
Data: $X_1, \ldots, X_n$
Sample size: $n$
Sample mean (Excel function AVERAGE):
$$\bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Sample variance:
$$s^2 = \frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$
Sample standard deviation (Excel function STDEV):
$$s = \sqrt{\frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}}$$
Calculation table columns: $X_i$, $X_i - \bar{X}$, $(X_i - \bar{X})^2$

Should I worry about my health?
My fictitious blood test result:
  – My score: 134
  – The average for females in my age group: 122
  – The standard deviation for females in my age group: 18
• Chebyshev’s rule: true for any dataset
  – At least 75% of the data lies within ± 2 standard deviations of the mean
  – At least 88.9% of the data lies within ± 3 standard deviations of the mean

Normal approximation
• In practice, data often have a bell-shaped histogram.
• If data is approximately normal,
  – 68.3% of observations lie within ± 1 standard deviation of the mean
  – 95.4% of observations lie within ± 2 standard deviations of the mean
  – 99.7% of observations lie within ± 3 standard deviations of the mean

How do the two events compare?
• The S&P 500 index dropped by 4.1% on Feb 5, 2018
• The average December temperature in Central Park in 2015 was 50.8 F

Daily return of S&P 500
• Mean daily return (=AVERAGE) = 0.04%
• Standard deviation of daily returns (=STDEV) = 1.10%
• Out of 7,716 days, 94.9% are within ± 2 standard deviations of the mean

Central Park temperature
• Mean = 36 F
• Standard deviation = 4.3 F
• 95.1% of the observations are within ± 2 standard deviations of the mean

How do the two events compare?
• The S&P 500 index dropped by 4.1% on Feb 5, 2018. This is an event 3.76 standard deviations below the mean. (0.0001)
• The average December temperature in Central Park in 2015 was 50.8 F. This is an event 3.44 standard deviations above the mean. (0.0003)

How do the two events compare?
• The monthly return of the Shanghai Stock Exchange in March 2016 was 11.75%.
• The imprisonment rate in New Jersey dropped by 6.5% in 2015.

Swiss central bank lets franc rise – 01/15/15
• The move on Jan 15 was 32 standard deviations above the mean
• . . . it is silly to think like that – the distribution changed
• Later on, we will be able to verify that the difference between the new mean and the old one is statistically significant

Swiss central bank lets franc rise – 01/15/15 (2)

Summary
Quick recap of Descriptive Statistics material from pre-term videos
• mean, median
• standard deviation
• Normal approximation
If data is approximately normal, we can characterize

  Share of data             68.3%   86.6%   95.4%   99.7%
  lies within (stdev’s)     1       1.5     2       3

from the mean
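
As a rough check of the numbers quoted in this lecture, the following Python sketch recomputes the standardized distances (value minus mean, divided by standard deviation), the Chebyshev bounds, and the normal-approximation percentages. The helper names (`z_score`, `normal_within`, `normal_upper_tail`, `chebyshev_lower_bound`) and the small `data` list are hypothetical and chosen only for illustration; they do not come from the slides. Reading the figures in parentheses on the comparison slide (0.0001 and 0.0003) as one-sided normal tail probabilities is also an assumption, although the computed values are consistent with it.

```python
import math
from statistics import pstdev, stdev


def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def normal_within(k):
    """Share of a normal distribution within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z (one-sided tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def chebyshev_lower_bound(k):
    """Chebyshev's rule: at least this share of ANY dataset is within +/- k sd."""
    return 1 - 1 / k**2


# Lecture examples: (value, mean, standard deviation) as quoted on the slides.
z_blood = z_score(134, 122, 18)      # blood test score
z_sp500 = z_score(-4.1, 0.04, 1.10)  # S&P 500 daily return on Feb 5, 2018, in %
z_temp = z_score(50.8, 36, 4.3)      # Central Park December 2015 average, in F

print(f"blood test: {z_blood:+.2f} sd")
print(f"S&P 500:    {z_sp500:+.2f} sd, lower-tail prob {normal_upper_tail(-z_sp500):.4f}")
print(f"temperature:{z_temp:+.2f} sd, upper-tail prob {normal_upper_tail(z_temp):.4f}")

# Normal approximation versus the distribution-free Chebyshev bound.
for k in (1, 1.5, 2, 3):
    line = f"within {k} sd: normal {100 * normal_within(k):.1f}%"
    if k > 1:
        line += f", Chebyshev at least {100 * chebyshev_lower_bound(k):.1f}%"
    print(line)

# Population (divide by n, Excel STDEVP) versus sample (divide by n - 1, Excel STDEV)
# standard deviation on a small hypothetical dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sd_population = pstdev(data)
sd_sample = stdev(data)
print(f"population sd {sd_population:.3f}, sample sd {sd_sample:.3f}")
```

On this reading, the blood test score sits well under one standard deviation above the mean, while the two "events" are both more than three standard deviations away, which is why they are rare under the normal approximation. The sample standard deviation always comes out slightly larger than the population version because it divides by n − 1 rather than n.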