Economics 105: Statistics
• http://www.davidson.edu/academic/economics/foley/105/index.html
• Powerpoint slides
  – meant to help you listen in class
  – print out BEFORE you do the day’s reading!
  – things will seem fast if you do the reading AFTER lecture
  – I expect you to do the reading prior to class. Everyone does the first few weeks, hard part is continuing to do so (before class). If you don’t, things will seem to go fast.
  – Stats is one of those classes where you can teach yourself quite a bit via the reading. I expect you to do so.
  – yes, these are high expectations.
  – Not everyone likes Powerpoint ...

Economics 105: Statistics
• Today
  – What is Statistics?
  – Presenting data
• For next time: Read Chapters 1 – 3.5

Organizing and Presenting Data Graphically
• Data in raw form are usually not easy to use for decision making
  – Some type of organization is needed
    • Table
    • Graph
• Techniques reviewed in Chapter 2:
  – Bar charts and pie charts
  – Pareto diagram
  – Ordered array
  – Stem-and-leaf display
  – Frequency distributions, histograms and polygons
  – Cumulative distributions and ogives
  – Contingency tables
  – Scatter diagrams

Raw Form of Data
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Tabulating Numerical Data: Frequency Distributions
• What is a Frequency Distribution?
  – A frequency distribution is a list or a table …
  – containing class groupings (ranges within which the data fall) ...
  – and the corresponding frequencies with which data fall within each grouping or category

Why Use a Frequency Distribution?
• It is a way to summarize numerical data
• It condenses the raw data into a more useful form
• It allows for a quick visual interpretation of the data

Class Intervals and Class Boundaries
• Each class grouping has the same width
• Determine the width of each interval by
  $\text{Width of interval} \approx \frac{\text{range}}{\text{number of desired class groupings}}$
• Usually at least 5 but no more than 15 groupings
• Class boundaries never overlap
• Round up the interval width to get desirable endpoints

Frequency Distribution Example
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Frequency Distribution Example (continued)
• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5 then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes

Frequency Distribution Example (continued)
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20        3              .15               15
20 but less than 30        6              .30               30
30 but less than 40        5              .25               25
40 but less than 50        4              .20               20
50 but less than 60        2              .10               10
Total                     20             1.00              100

Tabulating Numerical Data: Cumulative Frequency
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20        3           15                3                      15
20 but less than 30        6           30                9                      45
30 but less than 40        5           25               14                      70
40 but less than 50        4           20               18                      90
50 but less than 60        2           10               20                     100
Total                     20          100

Graphing Numerical Data: The Histogram
• A graph of the data in a frequency distribution is called a histogram
• The class boundaries (or class midpoints) are shown on the horizontal axis
• The vertical axis is either frequency, relative frequency, or percentage
• Bars of the appropriate heights are used to represent the number of observations within each class

Histogram
Class                  Class Midpoint   Frequency
10 but less than 20         15              3
20 but less than 30         25              6
30 but less than 40         35              5
40 but less than 50         45              4
50 but less than 60         55              2
(No gaps between bars)
[Figure: Histogram: Daily High Temperature. Vertical axis: Frequency. Horizontal axis: Class Midpoints.]

Graphing Numerical Data: The Frequency Polygon
Class                  Class Midpoint   Frequency
10 but less than 20         15              3
20 but less than 30         25              6
30 but less than 40         35              5
40 but less than 50         45              4
50 but less than 60         55              2
(In a percentage polygon the vertical axis would be defined to show the percentage of observations per class)
[Figure: Frequency Polygon: Daily High Temperature. Vertical axis: Frequency. Horizontal axis: Class Midpoints.]

Graphing Cumulative Frequencies: The Ogive (Cumulative % Polygon)
Class                  Lower class boundary   Cumulative Percentage
Less than 10                    0                      0
10 but less than 20            10                     15
20 but less than 30            20                     45
30 but less than 40            30                     70
40 but less than 50            40                     90
50 but less than 60            50                    100
[Figure: Ogive: Daily High Temperature. Vertical axis: Cumulative Percentage. Horizontal axis: Class Boundaries (Not Midpoints).]

Summary Measures
Describing Data Numerically
• Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
• Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
• Shape: Skewness

Measures of Central Tendency
Overview
• Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: Midpoint of ranked values
• Mode: Most frequently observed value
• Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$

Arithmetic Mean
• The arithmetic mean (mean) is the most common measure of central tendency
  – For a sample of size n:
    $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
    where n is the sample size and the X_i are the observed values

Arithmetic Mean (continued)
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
  – Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
  – Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4

Median
• In an ordered array, the median is the “middle” number (50% above, 50% below)
  [Figure: two plots on a 0 to 10 number line, each with Median = 3]
• Not affected by extreme values

Finding the Median
• The location of the median:
  $\text{Median position} = \frac{n+1}{2}$ position in the ordered data
  – If the number of values is odd, the median is the middle number
  – If the number of values is even, the median is the average of the two middle numbers
• Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data

Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical (nominal) data
• There may be no mode
• There may be several modes
  [Figure: two plots, one on a 0 to 14 number line with Mode = 9, one on a 0 to 6 number line with No Mode]

Review Example
• Five houses on a hill by the beach
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000

Review Example: Summary Statistics
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)
• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000

Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values.
  – Example: Median home prices may be reported for a region (less sensitive to outliers)

Geometric Mean
• Geometric mean
  – Used to measure the rate of change of a variable over time
    $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
• Geometric mean rate of return
  – Measures the status of an investment over time
    $\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
  – Where R_i is the rate of return in time period i

Example
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at end of year two:
X1 = $100,000
X2 = $50,000 (50% decrease)
X3 = $100,000 (100% increase)
The overall two-year return is zero, since it started and ended at the same level.

Example (continued)
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return (misleading result):
  $\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$
Geometric mean rate of return (more accurate result):
  $\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
  $= [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1$
  $= [(.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$

Quartiles
• Quartiles split the ranked data into 4 segments with an equal number of values per segment
  25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile

Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
  First quartile position: Q1 = (n+1)/4
  Second quartile position: Q2 = (n+1)/2 (the median position)
  Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values

Quartiles
• Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value half way between the 2nd and 3rd values, so Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency

Quartiles
• Example: (continued)
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = 19.5
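
Checking the worked examples
The worked examples in these slides (the temperature frequency distribution, the geometric mean rate of return, and the quartile positions) can be checked with a few lines of Python. This is only a sketch: Python is not part of the slides, and the function names below are hypothetical. For positions that are not whole numbers, the quartile function interpolates linearly between the two neighboring values. That agrees with the "half way between" rule used for the 2.5 and 7.5 positions here, but for other fractions it is an assumption, not something the slides specify.

```python
from itertools import accumulate
from math import prod


def frequency_counts(data, lower, width, n_classes):
    """Count observations in classes 'lower + k*width but less than lower + (k+1)*width'."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)  # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1+R1) x ... x (1+Rn)]^(1/n) - 1."""
    growth = prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


def quartile(data, q):
    """Value at position q*(n+1)/4 (1-based) in the ranked data, for q = 1, 2, 3.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    ranked = sorted(data)
    n = len(ranked)
    if n < 3:
        raise ValueError("need at least 3 values for the (n+1) position rule")
    pos = q * (n + 1) / 4
    lo = int(pos)        # whole part of the position
    frac = pos - lo      # fractional part, e.g. 0.5 for position 2.5
    if frac == 0:
        return ranked[lo - 1]
    return ranked[lo - 1] + frac * (ranked[lo] - ranked[lo - 1])


temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
counts = frequency_counts(temps, lower=10, width=10, n_classes=5)
print(counts)                      # [3, 6, 5, 4, 2]
print(list(accumulate(counts)))    # [3, 9, 14, 18, 20]

one_year_returns = [-0.50, 1.00]   # 50% decrease, then 100% increase
print(sum(one_year_returns) / len(one_year_returns))  # 0.25
print(geometric_mean_return(one_year_returns))        # 0.0

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print([quartile(sample, q) for q in (1, 2, 3)])       # [12.5, 16, 19.5]
```

The printed counts match the frequency and cumulative frequency columns above, the two return figures match the 25% arithmetic mean and 0% geometric mean, and the quartiles match Q1 = 12.5, Q2 = 16 and Q3 = 19.5.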