Chapter 1 An Introduction to Econometrics and Statistical Inference
Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

Learning Objectives
• Understand the steps involved in conducting an empirical research project
• Understand the meaning of the term econometrics
• Understand the relationship between populations, samples, and statistical inference
• Understand the important role that sampling distributions play in statistical inference

What is an Empirical Research Project?
An empirical research project is a project that applies empirical analysis to observed data to provide insight into questions of theoretical interest.

The 5 Steps in Conducting an Empirical Research Project
(1) Determining the question of interest
(2) Developing the appropriate theory to address the question
(3) Collecting data that are appropriate for empirically investigating the answer
(4) Implementing appropriate empirical techniques, correctly interpreting results, and drawing appropriate conclusions based on the estimated results
(5) Effectively writing up a summary of the first four steps

What is Econometrics?
Econometrics is the application of statistical techniques to economic data.

Populations, Samples, and Statistical Inference
A population is the entire group of entities that we are interested in learning about.
A sample is a subset of the population; it is what is used to perform statistical inference.
Statistical inference is the process of drawing conclusions from data that are subject to random variation.

Populations, Samples, and Statistical Inference Continued

Some Important Definitions
A parameter is a function that exists within the population (for example, the population mean).
A statistic is a function that is computed from the sample data (for example, the sample mean).
A point estimate is a single-valued statistic that is the best guess of a population parameter.
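
As a small illustration of these definitions, the sketch below treats five made-up numbers as a complete population, computes the population mean (a parameter), and then computes the sample mean (a statistic, used as a point estimate) for every possible sample of size 2. The population values and the sample size are assumptions chosen only for illustration; they do not come from the slides.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population (assumed for illustration): N = 5 values.
population = [2, 4, 6, 8, 10]
population_mean = mean(population)  # the parameter

# Every possible sample of size n = 2 drawn without replacement.
sample_size = 2
samples = list(combinations(population, sample_size))

# The sample mean of each sample is a statistic (a point estimate of the mean).
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, m)

print("population mean:", population_mean)
print("smallest and largest point estimate:", min(sample_means), max(sample_means))
print("average of all sample means:", mean(sample_means))
```

Each sample gives a different point estimate (from 3 to 9 here), even though the parameter itself is fixed at 6. That variation from sample to sample is what statistical inference has to account for.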

Sampling Distributions
A sampling distribution is the distribution of a sample statistic such as the sample mean.
A sampling distribution is constructed by
(1) collecting all possible samples of size n that could be drawn from the unobserved population of size N
(2) calculating the value of a given statistic (say, the sample mean) for each of those samples
(3) placing those values in order on the number line to create a distribution known as a sampling distribution

A Visual Example

Chapter 2 Collection and Management of Data

Learning Objectives
• Consider potential sources of data
• Work through an example of the first three steps in conducting an empirical research project
• Develop data management skills
• Understand some useful Excel commands

Goals of the Chapter

Types of Data
• Cross-sectional data are data collected for many different individuals, countries, firms, etc. in a given time period.
• Time-series data are data collected for a given individual, country, firm, etc. over many different time periods.
• Panel data are data collected for a number of individuals, countries, firms, etc. over many different time periods.

Primary Data Sources
• private-use data – government surveys or internal firm-level data – obtained through a formal request and/or having the appropriate connections
• publicly-available data – obtained through the internet or through a formal Freedom of Information Act (FOIA) request
• personal survey data – obtained by personally conducting a survey asking people for information and recording their responses

An Example of the First Three Steps
Suppose you are trying to convince your significant other to go camping but he or she is afraid of bears.
How can you use your empirical research skills to convince him or her that bear attacks are not a realistic concern?
Step 1: Identify a question of interest
What factors affect the number of fatal bear attacks in the US?

An Example of the First Three Steps
Step 2: Develop appropriate theory
The number of fatal bear attacks in the US should depend on:
• The number of bears
• The number of campers
• Square feet of national parkland

An Example of the First Three Steps
Step 3: Collect appropriate data
Start with an internet search for the data you seek.

An Example of the First Three Steps
Download the data to Excel and then repeat the process for the independent variables you seek.

Data Management Skills
Two important points:
(1) When working with data, it is common to make mistakes that alter the initial data
(2) When working on a larger project, it is common to take time off before returning to the project

Data Management Skills
Our goals with data management are to be able to:
(1) Recreate our initial data as easily as possible
(2) Recall what we had previously done as easily as possible

Data Management Skills
When working with data, we recommend:
(1) Creating a “Master” file with the initial data and performing calculations in a different “working” file
(2) Exhaustively documenting all initial data sources
(3) Making file and variable names as intuitive as possible
(4) Documenting all commands used when performing estimation

Chapter 3 Summary Statistics

Learning Objectives
• Construct relative frequency histograms
• Calculate measures of central tendency
• Calculate measures of dispersion
• Use measures of central tendency and dispersion
• Detect whether outliers are present
• Construct scatter diagrams for the relationship between two variables
• Calculate the covariance and the correlation coefficient between two variables

Construct a Relative Frequency Histogram
• A bar chart that shows how often observations lie within specified classes
• Allows a visual inspection of the data
• Based on a Relative Frequency Table
• The example dataset for constructing a histogram is states.xls, a survey of econometrics students that asked how many states they have visited.

[Figure: relative frequency histogram, "Number of States Visited"; x-axis: Number of States Visited (classes 0-5.99, 6-11.99, 12-17.99, 18-24); y-axis: Relative Frequency (0 to 0.45)]

To create a frequency distribution we must:
1. Select the number of classes
2. Choose the class interval or width of the classes
3. Select the class boundaries or the values that form the interval for each class
4. Count the number of values in the dataset that fall in each class

Step 1: Select the number of classes
The rule for determining the approximate number of classes is:
$$\text{Approximate number of classes} = [(2)(\text{Number of observations})]^{.3333}$$
The actual number of classes is the integer value that just exceeds the computed value. If the formula gives us 4.66, we use 5.

Step 1: Example
We have 43 data points so the rule is:
$$\text{Approximate number of classes} = [(2)(20)]^{.3333} = 3.503$$
Round this up to the next integer value, which is 4. The number of classes is 4.
**Always round up!!

Step 2: Choose the width of the interval
The rule for determining interval width is:
$$\text{Approximate interval width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$$
The actual interval width is the integer value that just exceeds the computed value.
If the formula gives us 6.17, we use 7.
**Always round up!!

Step 2: Example
Approximate interval width = (24-1)/4 = 5.75
Round up to 6. Therefore the class width is 6.

Step 3: Select the class boundaries
• Class boundaries must be chosen such that each data item belongs to one and only one class.
• Start just below the lowest value in the dataset to get the lower boundary of the first class. The lower boundary for the second class is then found by adding the class width. The upper boundary for the first class is found by subtracting .01 from the lower boundary of the second class.
• Keep adding the class width and subtracting .01 to get the remaining boundaries.

Step 3: Example
The lowest data point is 1. We will start our classes at 0. The lower boundaries are:
Class 1 = 0
Class 2 = 6 (=0+6)
Class 3 = 12 (=6+6)
Class 4 = 18 (=12+6)

Step 3: Example Continued
Class boundaries are then:
Class 1: 0-5.99
Class 2: 6-11.99
Class 3: 12-17.99
Class 4: 18-24

Step 4: Count the number of values in the dataset that fall into each class
• Doing this by hand is tedious, so we want to rely on Excel to do this for us.
• Enter the class boundaries into Excel next to the data set.
• Enter the upper boundaries of each of the classes.
• Use the Frequency command.

How to use the Frequency command in Excel
1. Select the cells next to the class intervals where the frequencies should go (say E2:E6).
2. Type, but do not enter, the formula =Frequency(A2:A44,D2:D6), where A2:A44 contains the data and D2:D6 contains the upper class boundaries.
3. Press CTRL+SHIFT+ENTER and the array formula will be entered into each of the cells E2:E6.

Our Excel Results
Class Boundaries   Upper Limit   Frequency
0-5.99             5.99          18
6-11.99            11.99         18
12-17.99           17.99         4
18-24              24.00         3

Creating relative frequency and percent frequency distributions
Recall that the relative frequency is the proportion of the observations belonging to a class.
With n observations,
$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n}$$
The percent frequency is the relative frequency multiplied by 100.

Relative Frequency Table

Using Excel’s Chart Wizard to Construct a Histogram
1. Use the frequency distribution we just constructed and highlight the frequencies.
2. Click the Chart Wizard and choose Column as the chart type.
3. Click on the Category (X) axis labels box and enter the class boundaries.
4. To get the bars to touch, right-click on any rectangle in the column chart and choose Format Data Series. Select the Options tab and enter 0 in the Gap Width box.

[Figure: relative frequency histogram, "Number of States Visited"; x-axis: Number of States Visited (classes 0-5.99, 6-11.99, 12-17.99, 18-24); y-axis: Relative Frequency (0 to 0.45)]

Soda Consumption Data
Your mission is to pair up with a classmate and draw what you think the histogram for soda consumption looks like.

Calculate Measures of Central Tendency
• Central tendency is the middle value of a dataset.
• The measure of central tendency is typically thought of as the number that best describes the data.
• Measures of central tendency are:
(1) Mean
(2) Median

Measure of Central Tendency - Mean
The mean is the arithmetic average of the data. To calculate the mean, sum all the observations and divide by the number of observations. It is represented by the symbol $\bar{x}$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n)$$
For the following small data set: 95 85 99 92 80
Mean = (95+85+99+92+80)/5 = 451/5 = 90.2
In Excel =average(highlight data)

Measure of Central Tendency - Median
Median: the middle observation when the data are arranged from smallest to largest, sometimes called the 50th percentile. Half the observations lie below the median and half the observations lie above the median. The median is the middle observation for an odd number of ordered observations and the average of the middle two ordered observations for an even number of observations.
The median is an order statistic, so to calculate it the data must first be ordered from smallest to largest.

Measure of Central Tendency - Median
Median: the central observation for an odd number of observations and the average of the two middle data points for an even number of observations.
For the following small data set: 95 85 99 92 80 (ordered data 80 85 92 95 99)
Median = 92 (the 3rd data point)
If we had 75 80 85 92 95 99, then
Median = (.5*85)+(.5*92) = (85+92)/2 = 42.5+46 = 88.5
In Excel =median(highlight data)

Calculate Measures of Dispersion
Dispersion is a measure of how the data vary. Measures of dispersion are:
(1) Variance
(2) Standard Deviation
(3) Percentiles
(4) Five Number Summary

Measure of Dispersion – Variance and Standard Deviation
Standard deviation: the average deviation away from the mean. It is the square root of the variance.
The variance is calculated by subtracting the mean from each observation, squaring that value, adding up all n squared values, and then dividing the sum by the number of observations less one.
Sample variance formula:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
Standard deviation:
$$s = \sqrt{s^2}$$
In Excel =var(highlight data) and =stdev(highlight data)

Measure of Dispersion – Variance and Standard Deviation
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
For the following small data set: 95 85 99 92 80
$$s^2 = \frac{(95-90.2)^2 + (85-90.2)^2 + (99-90.2)^2 + (92-90.2)^2 + (80-90.2)^2}{4} = \frac{234.8}{4} = 58.7$$
Sample standard deviation:
$$s = \sqrt{s^2} = \sqrt{58.7} = 7.6616$$

Measure of Dispersion – Percentile
A percentile is a number such that p% of the ordered observations lie below the percentile and (100-p)% of the observations lie above the percentile.
The median is the 50th percentile: 50% of the ordered data lie below that level and 50% of the ordered data lie above that level.
A percentile is an order statistic.
There are many different ways to calculate percentiles. The next slide shows one of the easiest ways to calculate percentiles.

Steps to Calculate a Percentile, p
(1) Sort the data from low to high
(2) Count the number of observations, n
(3) Select the p(n+1) observation
(4) If the value p(n+1) is not a whole number, then select the closest whole number
(5) If p(n+1) is less than 1, then select the smallest number
(6) If p(n+1) is greater than n, then select the largest number
In Excel =percentile(highlight data, p)
Note that calculating a percentile by hand with these steps and calculating it in Excel will likely not result in the same value.

Measure of Dispersion - Percentile
Calculate the 10th and the 70th percentile for the following small data set: 95 85 99 92 80 (ordered data 80 85 92 95 99)
10th percentile: select the .1(n+1) = .1(6) = .6 number in the data set. The closest whole number is 1, so the 10th percentile is the first observation, or 80.
70th percentile: select the .7(n+1) = .7(6) = 4.2 number in the data set. The closest whole number is 4, so the 70th percentile is the fourth observation, or 95.
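
The by-hand percentile steps above can be sketched in a few lines of Python. This is a minimal sketch, not a standard library routine: the function name `percentile_by_hand` is hypothetical, ties at exactly .5 are rounded up (the slides do not say how to break ties), and positions below 1 or above n are mapped to the smallest or largest observation.

```python
import math

def percentile_by_hand(data, p):
    """Return the p-th percentile (p given as a proportion, e.g. 0.70)
    using the p(n+1) nearest-observation rule."""
    ordered = sorted(data)                 # step 1: sort low to high
    n = len(ordered)                       # step 2: count observations
    position = p * (n + 1)                 # step 3: p(n+1)
    rank = math.floor(position + 0.5)      # step 4: closest whole number (assumed: .5 rounds up)
    rank = max(1, min(n, rank))            # steps 5-6: clamp to the smallest/largest observation
    return ordered[rank - 1]               # ranks are 1-based, Python lists are 0-based

scores = [95, 85, 99, 92, 80]
print(percentile_by_hand(scores, 0.10))    # 10th percentile
print(percentile_by_hand(scores, 0.70))    # 70th percentile
```

For the small data set on the slide this reproduces the by-hand answers of 80 and 95. As noted above, Excel's percentile command interpolates between observations, so it will usually return different values.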

Measure of Dispersion – Five Number Summary
The Five Number Summary is
(1) Minimum
(2) Q1 or 25th Percentile
(3) Q2 or Median (50th Percentile)
(4) Q3 or 75th Percentile
(5) Maximum

How to Calculate the Five Number Summary in Excel
Minimum: =Min(data)
Q1 or 25th Percentile: =percentile(data,.25) or =quartile(data,1)
Q2 or Median: =median(data)
Q3 or 75th Percentile: =percentile(data,.75) or =quartile(data,3)
Maximum: =Max(data)

Shapes of Histograms
• Symmetric
• Skewed to the right or Positively Skewed
• Skewed to the left or Negatively Skewed
• Bimodal

Symmetric Histogram
[Figure: Histogram for Diameter of 400 Elevator Rails; x-axis: Category (classes from <=0.455 to >0.545 in steps of .01); y-axis: frequency (0 to 90)]

Positively Skewed Distribution
[Figure: Histogram for Time Between Bank Customer Arrivals; x-axis: Category (classes from <=2.5 to >27.5 in steps of 2.5); y-axis: frequency (0 to 160)]

Negatively Skewed Distribution
[Figure: Histogram for Scores on a Midterm; x-axis: Category (classes from <=45 to >95 in steps of 5); y-axis: frequency (0 to 20)]

Bimodal Distribution

Positively Skewed Distribution
Median = 2.77
Mean = 4.16
[Figure: Histogram for Time Between Bank Customer Arrivals, with the median and mean marked]

Why is the shape of the histogram important?
• The shape of the empirical distribution dictates which summary statistics should be used
Symmetric – Use mean and standard deviation
Skewed – Use median and five number summary

How to determine if your data is skewed or symmetric
Pearson’s coefficient of skewness:
sk = 3*(mean-median)/(standard dev.)
Rule of Thumb: If sk<-.5 or sk>.5 then the distribution is skewed. Otherwise the distribution is symmetric.
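
To make the rule of thumb concrete, here is a minimal Python sketch that computes Pearson's coefficient of skewness and applies the ±.5 cutoff. The function names are hypothetical, and the data are the five-value example set used earlier in the chapter (mean 90.2, median 92, standard deviation 7.6616); with so few observations the classification is only illustrative.

```python
import statistics

def pearson_skewness(data):
    """sk = 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    s = statistics.stdev(data)             # sample standard deviation (divides by n - 1)
    return 3 * (mean - median) / s

def classify_shape(sk, cutoff=0.5):
    """Apply the rule of thumb: skewed if sk < -cutoff or sk > cutoff."""
    if sk < -cutoff:
        return "negatively skewed"
    if sk > cutoff:
        return "positively skewed"
    return "symmetric"

scores = [95, 85, 99, 92, 80]
sk = pearson_skewness(scores)
print(round(sk, 4), classify_shape(sk))
```

Here sk = 3(90.2 - 92)/7.6616, which is about -0.70, so the rule of thumb labels this small data set negatively skewed and points toward the median and five number summary rather than the mean and standard deviation.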
[Figure: number line for sk; negatively skewed below -.5, symmetric between -.5 and .5, positively skewed above .5]

Symmetric Histogram
Mean = .5013
Standard Deviation = .019
[Figure: Histogram for Diameter of 400 Elevator Rails; x-axis: Category (classes from <=0.455 to >0.545 in steps of .01); y-axis: frequency (0 to 90)]

Positively Skewed Distribution
[Figure: Histogram for Time Between Bank Customer Arrivals; x-axis: Category (classes from <=2.5 to >27.5 in steps of 2.5); y-axis: frequency (0 to 160)]
Five Number Summary
Minimum   0.008
Q1        1.1578
Median    2.779
Q3        5.643
Maximum   29.001

How to Detect Outliers with Symmetric Data
Use the Empirical Rule:
About 68% of the data should be within one standard deviation of the mean, $\bar{x} \pm s$
About 95% of the data should be within two standard deviations of the mean, $\bar{x} \pm 2s$
Nearly 100% (about 99.7%) of the data should be within three standard deviations of the mean, $\bar{x} \pm 3s$
Therefore, an observation is an outlier if it lies more than three standard deviations from the mean, that is, outside the interval $(\bar{x} - 3s,\ \bar{x} + 3s)$.

How to Detect an Outlier with Skewed Data
• Calculate the interquartile range, IQR = Q3 – Q1.
• If a value is greater than Q3 plus 1.5*IQR or less than Q1 minus 1.5*IQR, then it is a moderate outlier
• If a value is greater than Q3 plus 3*IQR or less than Q1 minus 3*IQR, then it is an extreme outlier

Construct Scatter Diagrams for the Relationship between Two Random Variables
• A scatter diagram (or scatter plot) is used to show the relationship between two variables
• It contains one variable on the x-axis and the other variable on the y-axis
• A scatter diagram shows how the two variables are related to each other, both the strength and direction of the relationship

Scatter Diagram Examples
[Figure: four scatter diagrams of y against x; panels: Positive Linear relationship, Negative Linear relationship, and two Curvilinear relationships]

Scatter Diagram Examples
[Figure: four scatter diagrams of y against x; panels: two Strong relationships and two Weak relationships]

Scatter Diagram Examples
[Figure: two scatter diagrams of y against x; panels: No relationship]

Salary vs. Years of Education

How to Create a Scatter Diagram in Excel
• Highlight the data, making sure that the variable you want on the y-axis is in the right-hand column
• Select “Insert” and then “Scatter” and click on the first option
• Make sure to change the chart title and add axis titles
• Possibly delete the legend and change the start values for the axes

[Figure: scatter diagram, "Salary vs. Experience"; x-axis: Experience (years), 10 to 22; y-axis: Salary (dollars), 0 to 160,000]

What does the Scatter Diagram on the previous slide tell us?
• The relationship between education and salary is positive (in general, as education increases, salary increases)
• The relationship is fairly strong because the data points are closely clustered
• This scatter diagram indicates that while the variable education is helpful for predicting salaries, it will not yield perfect predictions.

Covariance and the Correlation Coefficient for the Linear Relationship between Two Variables
• The covariance and the correlation coefficient supply a numeric value for the strength and direction of the linear relationship between two variables
– Only concerned with the strength of the relationship
– No causal effect is implied

Covariance
• Covariance is a measure of the linear relationship between two random variables
• A positive covariance indicates a positive linear relationship between x and y (if x is below its mean then y tends to be below its mean, and if x is above its mean then y tends to be above its mean)
• A negative covariance indicates a negative linear relationship between x and y (if x is below its mean then y tends to be above its mean, and if x is above its mean then y tends to be below its mean)

Covariance
• A covariance near 0 indicates no linear relationship between x and y
• A problem with covariance is that it depends on the units of measurement for x and y. If we change from measuring in feet to measuring in inches, the covariance will go up even though the overall relationship hasn’t changed.

Covariance – a Measure of Linear Association Between Two Variables
• Remember the formula for variance is
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n-1}$$
or how x varies with itself. The formula for covariance is
$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
and it measures how x varies with y in a linear fashion.

Applying the Covariance Formula
Cov(x,y) = Sum/(n-1) = 743000/9 = 82,555.5556

Calculating Covariance in Excel
• In some versions of Excel, the covar command divides by n rather than n-1, so it does not return the sample covariance.
• The Excel command is =Covar(highlight x values, highlight y values)
• You should perform this command in Excel for the data set above and see if it matches the value 82,555.5556.
• If you obtain 74,300 using the covar command (which is likely), you must multiply the value you obtain in Excel by n/(n-1) to obtain the correct value for the sample covariance.
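
The n/(n-1) correction can be checked numerically. The sketch below is illustrative only: the education and salary values are hypothetical (the slide's data set is not reproduced here), and the function names are made up. It computes the covariance both ways, dividing by n-1 (the sample formula above) and by n (what the covar command appears to do), and confirms that the two differ by the factor n/(n-1).

```python
import math

def sample_covariance(x, y):
    """Sum of cross-deviations divided by n - 1 (the formula on the slides)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (n - 1)

def population_covariance(x, y):
    """Same sum of cross-deviations, but divided by n."""
    n = len(x)
    return sample_covariance(x, y) * (n - 1) / n

# Hypothetical data, assumed for illustration only.
education = [12, 14, 16, 18, 20]                 # years
salary = [30000, 42000, 50000, 61000, 75000]     # dollars

n = len(education)
s_xy = sample_covariance(education, salary)
pop_xy = population_covariance(education, salary)
print(s_xy, pop_xy, pop_xy * n / (n - 1))

# The slide's own numbers: 743000/10 = 74,300 and 74,300 * 10/9 = 82,555.5556
print(round(74300 * 10 / 9, 4))
```

With the slide's figures, the sum of cross-deviations is 743,000 and n = 10, so dividing by n gives 74,300 and rescaling by 10/9 recovers 82,555.5556.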

Correlation Coefficient
• The sample correlation coefficient, $r_{xy}$, is an estimate of the population correlation coefficient and is used to measure the strength and direction of the linear relationship between two random variables.
• The correlation is a unit-free measure (unlike the covariance) and falls between -1 and 1.

What Does the Correlation Coefficient Mean?
• If all the points in a data set fall on a positively sloped line, $r_{xy} = 1$.
• If all the points in a data set fall on a negatively sloped line, $r_{xy} = -1$.
• If there is no linear relationship between x and y, then $r_{xy} = 0$.
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship

Examples of Approximate $r_{xy}$ Values
[Figure: five scatter diagrams of y against x with r = -1, r = -.6, r = 0, r = +.3, and r = +1]

Calculating the Correlation Coefficient
Sample correlation coefficient:
$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\mathrm{st.dev.}(x)\,\mathrm{st.dev.}(y)} = \frac{s_{xy}}{s_x s_y}$$
From above, the standard deviation of x is 2.708 and the standard deviation of y is 38,189.037.
$$r_{xy} = \frac{82{,}555.5556}{(2.708)(38{,}189.0037)} = 0.7983$$
A correlation of 0.7983 means that education and salary are positively related and the relationship is strong (because this value lies near 1).
In Excel =correl(highlight x values, highlight y values)

What Does Correlation Mean?
• Correlation provides a measure of linear association between two variables. A correlation coefficient near 0 only means that there is a weak linear association between the two variables, not that there isn’t any relationship between the two variables.
• A high correlation between two variables does not mean that changes in one variable will cause changes in the other variable.
• We might find that the quality rating and the typical mean price of restaurants are positively correlated. However, simply increasing the mean price at a restaurant will not cause the quality rating to increase.
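
As a closing check on the correlation formula, the following Python sketch does two things. First, it recomputes the worked example from the covariance and standard deviations reported on the slides. Second, using hypothetical height and weight values (assumed purely for illustration, with made-up function names), it shows that converting x from feet to inches multiplies the covariance by 12 but leaves the correlation unchanged, which is what "unit-free" means.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """r_xy = s_xy / (s_x * s_y); the variance is the covariance of a variable with itself."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# 1) The slide's worked example, using the reported summary values.
r_slide = 82555.5556 / (2.708 * 38189.037)
print(round(r_slide, 4))

# 2) Hypothetical data: height in feet and weight in pounds.
height_feet = [5.0, 5.5, 5.75, 6.0, 6.25]
weight = [120, 150, 155, 180, 190]
height_inches = [12 * h for h in height_feet]

print(sample_covariance(height_feet, weight), sample_covariance(height_inches, weight))
print(correlation(height_feet, weight), correlation(height_inches, weight))
```

The first calculation rounds to 0.7983, matching the slide. In the second, the covariance in inches is 12 times the covariance in feet, while the two correlations agree up to floating-point rounding.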