Lecture 1: Introduction to Econometrics
1. What is Econometrics?
- "Econo" refers to economics, and "metrics" refers to measurement and analysis. Econometrics focuses on quantifying economic phenomena using mathematics and statistics.
- Econometrics is widely used in economics, finance, marketing, and management.
2. Two purposes of Econometrics:
- Prediction: for example, predicting stock returns using historical data.
- Policy prescription: understanding causal relationships, for example, analyzing the effect of education on income.
3. Types of data in Econometrics:
- Cross-sectional data: data collected at a single point in time.
- Time-series data: data collected over successive time periods.
- Panel data: cross-sectional units observed over multiple time periods.
4. Modeling process:
- Start with an economic theory.
- Translate it into a mathematical model.
- Use data to estimate the model and make predictions or recommendations.
Ex1: Understanding Econometrics
a. Define Econometrics and explain its two primary purposes. Provide an example for each purpose.
- Econometrics is a branch of economics that applies statistical and mathematical methods to analyze economic data. It seeks to quantify economic relationships, test hypotheses, and predict future trends using real-world data.
- The two primary purposes of econometrics are:
o Prediction: Econometrics can predict the future values of economic variables based on existing relationships and data. Example: Predicting future GDP growth based on past investment trends, inflation, and employment levels.
o Policy prescription: Econometrics helps identify causal relationships between variables, which can guide policymakers to implement appropriate policies. Example: Using econometrics to analyze how a tax cut could influence household consumption, helping the government decide whether to implement it.
b.
Differentiate between cross-sectional, time-series, and panel data. Provide a real-world example of each data type.
- Cross-sectional data: Data collected at a single point in time, typically across multiple subjects. Example: Survey data on individuals' incomes and education levels in a specific year.
- Time-series data: Data collected over time on the same subject or entity. Example: Monthly inflation rates of a country from 2010 to 2020.
- Panel data (also known as longitudinal data): Combines cross-sectional and time-series data, where the same subjects are observed over time. Example: Annual income and employment status of a group of individuals tracked from 2005 to 2015.
Ex2: Application of Econometrics
1. Prediction vs. Policy Prescription: Identify whether the following scenarios are examples of prediction or policy prescription:
a. Estimating the future unemployment rate based on historical data.
-> Prediction: The goal is to use past data to forecast a future value.
b. Determining the effect of minimum wage increases on employment levels.
-> Policy prescription: This involves understanding a causal relationship (the effect of wage increases on employment) to guide policy decisions.
2. Modeling Process: Outline the steps involved in building an econometric model starting from economic theory to policy recommendation.
1. Understanding the Problem: Begin by identifying the economic issue or hypothesis you want to address. For example, you might be interested in how inflation affects consumer spending.
2. Formulating a Conceptual Model: Use economic theory to establish a relationship between variables. For instance, economic theory suggests that as inflation rises, consumer spending may decrease due to reduced purchasing power.
3. Collecting Appropriate Data: Gather relevant data on the variables in your model. In this case, data on inflation rates and consumer spending over time are needed.
4.
Looking at Data (Descriptive Analytics): Before estimating the model, visually inspect the data to understand the distribution and relationships. For example, you might create scatter plots or calculate summary statistics.
5. Estimating the Model: Use econometric techniques, such as Ordinary Least Squares (OLS), to estimate the parameters of your model. This step involves running the regression analysis to quantify the relationship between inflation and consumer spending.
6. Making Inference and Prediction: Interpret the regression results. Based on the estimated coefficients, determine how much consumer spending is expected to change with a 1% increase in inflation.
7. Policy Prescription: Use the model's results to provide policy recommendations. For example, if the analysis shows that inflation significantly reduces consumer spending, policymakers might focus on stabilizing prices to encourage economic growth.
8. Iterate and Refine the Model: Finally, evaluate the model by testing for robustness, checking assumptions, and adjusting as necessary. This iterative process improves the accuracy and reliability of the analysis.
Lecture 2: Probability and Statistics Refresher
1. Random Variables:
- A random variable takes different values depending on the outcome of an uncertain event.
- It can be discrete (e.g., rolling a die) or continuous (e.g., rainfall).
2. Measures of Central Tendency and Dispersion:
- Mean (Expected Value): the average outcome.
- Median: the middle value.
- Variance and Standard Deviation: measure how dispersed or spread out a dataset is.
3. Covariance and Correlation:
- Covariance: measures how two variables move together.
- Correlation: a scaled version of covariance that indicates the strength and direction of the relationship (ranges from -1 to 1).
4.
Joint and Conditional Distributions:
- Joint distributions describe the probability that two random variables take particular values together.
- Conditional expectation models the expected value of one variable given the value of another.
Ex3: Random Variables and Distributions
1. Discrete vs. Continuous Random Variables: Provide two examples of discrete random variables and two examples of continuous random variables in an economic context.
- Discrete Random Variables:
o Example 1: The number of cars sold by a dealership in a day. This variable can only take specific integer values (0, 1, 2, etc.).
o Example 2: The number of defective products in a shipment. It can only be a whole number like 0, 1, 2, etc.
- Continuous Random Variables:
o Example 1: The temperature recorded in a city at noon. Temperature can take any value within a range, like 25.6°C or 32.1°C.
o Example 2: The time it takes for a customer to be served at a fast-food restaurant. Time can take any positive real value, such as 3.25 minutes or 7.8 minutes.
2. Probability Distribution: Suppose the number of cars sold by a dealership in a day is a discrete random variable X with the following probability distribution:
x          0     1     2     3
P(X = x)   0.1   0.3   0.4   0.2
a. Verify that this is a valid probability distribution.
To verify that it is a valid probability distribution, we check that every probability lies between 0 and 1 and that the probabilities sum to 1:
$P(X=0) + P(X=1) + P(X=2) + P(X=3) = 0.1 + 0.3 + 0.4 + 0.2 = 1$
Since all probabilities are non-negative and their sum is 1, this is a valid probability distribution.
b. Calculate the expected number of cars sold in a day.
The expected value (mean) of X, denoted by E(X), is calculated as follows:
$E(X) = \sum_x x \cdot P(X = x)$
Substitute the values:
$E(X) = (0 \cdot 0.1) + (1 \cdot 0.3) + (2 \cdot 0.4) + (3 \cdot 0.2) = 1.7$
The expected number of cars sold in a day is 1.7.
c. Calculate the variance and standard deviation of cars sold.
The variance of X, denoted by Var(X), is calculated as follows:
$Var(X) = E(X^2) - [E(X)]^2$
First, we calculate $E(X^2)$ (the expected value of $X^2$):
$E(X^2) = (0^2 \cdot 0.1) + (1^2 \cdot 0.3) + (2^2 \cdot 0.4) + (3^2 \cdot 0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7$
Now, calculate the variance:
$Var(X) = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81$
The standard deviation is the square root of the variance:
$Std(X) = \sqrt{0.81} = 0.9$
Ex4: Measures of Central Tendency and Dispersion
1. Calculations: Given the following dataset representing the annual incomes (in $1000s) of 5 individuals: [40, 50, 60, 70, 80]
a. Calculate the mean, median, and mode.
- Mean (average) is calculated by summing all values and dividing by the number of values:
$Mean = \frac{40 + 50 + 60 + 70 + 80}{5} = \frac{300}{5} = 60$
So, the mean income is $60,000.
- Median is the middle value when the data is ordered. Since there are 5 values (an odd number), the median is the middle one: $60,000.
- Mode is the value that occurs most frequently. Since each value appears only once, there is no mode in this dataset.
b. Calculate the variance and standard deviation.
- Variance measures the spread of the data points around the mean. The formula is:
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$
where $X_i$ represents each value in the dataset, $\bar{X} = 60$ is the mean, and n = 5 (the number of data points). Dividing by n treats the five values as the whole population; the sample variance would divide by n - 1 instead.
First, calculate the differences from the mean and square them:
$(40 - 60)^2 = 400$
$(50 - 60)^2 = 100$
$(60 - 60)^2 = 0$
$(70 - 60)^2 = 100$
$(80 - 60)^2 = 400$
Now, sum the squared differences:
Sum of squared differences = 400 + 100 + 0 + 100 + 400 = 1000
Finally, calculate the variance:
$Variance = \frac{1000}{5} = 200$
The variance is 200.
- Standard Deviation is the square root of the variance:
$Standard\ Deviation = \sqrt{200} \approx 14.14$
2. Interpreting Covariance and Correlation: Explain the difference between covariance and correlation. Why is correlation often preferred over covariance when assessing the relationship between two variables?
- The difference between covariance and correlation:
o Covariance measures how two random variables move together.
If both variables increase or decrease together, the covariance will be positive; if one increases while the other decreases, the covariance will be negative. However, the magnitude of covariance depends on the units of the variables, making it hard to compare across datasets.
o Correlation is a normalized version of covariance that ranges from -1 to 1. It standardizes the relationship, making it easier to interpret. A value of 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation.
- Why is correlation often preferred? Correlation is unit-free, allowing for easier comparison across different variables and datasets. Covariance's magnitude changes with the scale of the data, making it less interpretable.
Ex5: Joint and Conditional Distributions
1. Joint Distribution: Consider two discrete random variables X and Y with the following joint probability distribution:
          Y = 1   Y = 2
X = 1     0.2     0.1
X = 2     0.3     0.4
a. Calculate the marginal distributions of X and Y.
- Marginal distribution of X: Sum the joint probabilities over all values of Y to get the marginal distribution of X.
$P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) = 0.2 + 0.1 = 0.3$
$P(X = 2) = P(X = 2, Y = 1) + P(X = 2, Y = 2) = 0.3 + 0.4 = 0.7$
So, the marginal distribution of X is: P(X = 1) = 0.3; P(X = 2) = 0.7
- Marginal distribution of Y: Sum the joint probabilities over all values of X to get the marginal distribution of Y.
$P(Y = 1) = P(X = 1, Y = 1) + P(X = 2, Y = 1) = 0.2 + 0.3 = 0.5$
$P(Y = 2) = P(X = 1, Y = 2) + P(X = 2, Y = 2) = 0.1 + 0.4 = 0.5$
So, the marginal distribution of Y is: P(Y = 1) = 0.5; P(Y = 2) = 0.5
b. Determine the conditional distribution P(Y|X = 1).
The conditional probability P(Y|X = 1) is calculated as:
$P(Y = y \mid X = 1) = \frac{P(X = 1, Y = y)}{P(X = 1)}$
For X = 1, use the joint probabilities from the table:
$P(Y = 1 \mid X = 1) = \frac{0.2}{0.3} \approx 0.667$
$P(Y = 2 \mid X = 1) = \frac{0.1}{0.3} \approx 0.333$
So, the conditional distribution P(Y|X = 1) is: P(Y = 1|X = 1) ≈ 0.667; P(Y = 2|X = 1) ≈ 0.333
2. Conditional Expectation: Explain what conditional expectation E(Y|X) represents in the context of econometric modeling.
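As a numeric companion to this exercise, the sketch below recomputes the marginal and conditional distributions from part 1 and then takes one further step to the conditional expectation E(Y|X = 1) that part 2 asks about. The joint probabilities are the ones used in the calculations above; the helper names `marginal` and `conditional_y_given_x` are illustrative and not part of the original notes.

```python
# Joint probabilities P(X = x, Y = y), taken from the Ex5 calculations.
joint = {(1, 1): 0.2, (1, 2): 0.1, (2, 1): 0.3, (2, 2): 0.4}


def marginal(joint, axis):
    """Sum the joint probabilities over the other variable (axis 0 = X, 1 = Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def conditional_y_given_x(joint, x):
    """Return P(Y = y | X = x) by dividing each joint probability by P(X = x)."""
    p_x = marginal(joint, 0)[x]
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


p_x = marginal(joint, 0)               # marginal distribution of X
p_y = marginal(joint, 1)               # marginal distribution of Y
cond = conditional_y_given_x(joint, 1)  # P(Y | X = 1)

# Conditional expectation: weight each value of Y by its conditional probability.
e_y_given_x1 = sum(y * p for y, p in cond.items())

print("P(X):", p_x)
print("P(Y):", p_y)
print("P(Y | X = 1):", cond)
print("E(Y | X = 1):", round(e_y_given_x1, 3))
```

Analytically, E(Y|X = 1) = 1 · (2/3) + 2 · (1/3) = 4/3, which is the mean of Y among the outcomes where X = 1.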
The conditional expectation E(Y|X) represents the expected value of the dependent variable Y given that the independent variable X has taken on a specific value. In econometrics, it is the foundation of regression analysis, where we are interested in predicting the mean of Y based on the values of X.
For example, in a wage regression model where Y is the wage and X is the years of education, E(Y|X) would represent the expected wage for someone with X years of education.
The conditional expectation is often modeled as a linear function of X in simple regression, such as:
$E(Y \mid X) = \beta_0 + \beta_1 X$
This allows economists to estimate the relationship between X and Y, and make predictions about Y given values of X.
Lecture 3: Simple and Multiple Regression Analysis
1. Simple Linear Regression
- A statistical technique for modeling the relationship between a dependent variable (Y) and one independent variable (X).
- The goal is to find the best-fitting line $Y = \beta_0 + \beta_1 X + u$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.
- OLS (Ordinary Least Squares) is used to estimate the parameters $\beta_0$ and $\beta_1$.
2. Multiple Regression:
- Extends simple regression by including several independent variables: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$
- Helps isolate the effect of each variable while controlling for the others.
3. Key Outputs of Regression:
- Estimated coefficients: tell us the effect of each independent variable on the dependent variable.
- Goodness of fit (R-squared): measures how well the model explains the variation in the dependent variable.
Ex6: Simple Linear Regression
1. Conceptual Questions:
a. What is the purpose of Ordinary Least Squares (OLS) in simple linear regression?
The purpose of Ordinary Least Squares (OLS) is to estimate the parameters of the linear regression model in a way that minimizes the sum of the squared differences between the observed dependent variable ($Y$) and the predicted values ($\hat{Y}$) from the model.
In other words, OLS finds the best-fitting line through the data points by minimizing the sum of squared residuals (the vertical distances between the observed data points and the regression line).
b. Explain the difference between the population regression function and the sample regression function.
- The population regression function (PRF) represents the true, but unobservable, relationship between the dependent variable (Y) and the independent variable (X) in the entire population:
$Y = \beta_0 + \beta_1 X + u$
Here, $\beta_0$ and $\beta_1$ are the true, unknown parameters, and u is the error term.
- The sample regression function (SRF) is an estimate of the PRF based on a finite sample of data. It takes the form:
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
Here, $\hat{\beta}_0$ and $\hat{\beta}_1$ are the OLS estimates of the population parameters, derived from the sample data.
2. Calculation Problem: Suppose you have the following data on years of education (X) and wages (Y) for 5 individuals:
Individual   Education X (years)   Wage Y ($)
1            10                    500
2            12                    600
3            14                    700
4            16                    800
5            18                    900
a. Calculate the OLS estimates for the intercept ($\hat{\beta}_0$) and slope ($\hat{\beta}_1$).
The formula for the OLS slope ($\hat{\beta}_1$) is:
$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
Where:
$\bar{X} = \frac{10 + 12 + 14 + 16 + 18}{5} = 14$
$\bar{Y} = \frac{500 + 600 + 700 + 800 + 900}{5} = 700$
Now calculate the terms for the numerator and denominator:
X_i   Y_i   X_i - X̄   Y_i - Ȳ   (X_i - X̄)(Y_i - Ȳ)   (X_i - X̄)^2
10    500   -4        -200      800                  16
12    600   -2        -100      200                  4
14    700   0         0         0                    0
16    800   2         100       200                  4
18    900   4         200       800                  16
Now sum the relevant columns:
$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 800 + 200 + 0 + 200 + 800 = 2000$
$\sum (X_i - \bar{X})^2 = 16 + 4 + 0 + 4 + 16 = 40$
Now, calculate the slope:
$\hat{\beta}_1 = \frac{2000}{40} = 50$
Next, calculate the intercept ($\hat{\beta}_0$) using the formula:
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 700 - 50 \cdot 14 = 0$
Thus, the estimated regression equation is:
$\hat{Y} = 0 + 50X = 50X$
b. Interpret the slope coefficient.
The slope coefficient ($\hat{\beta}_1 = 50$) means that each additional year of education is associated with a $50 increase in the predicted wage.
c. Calculate the coefficient of determination ($R^2$).
The coefficient of determination ($R^2$) measures how well the regression line fits the data.
It is calculated as:
$R^2 = 1 - \frac{RSS}{TSS}$
where TSS is the total sum of squares and RSS is the residual sum of squares.
- TSS: From the table above:
$TSS = \sum (Y_i - \bar{Y})^2 = (-200)^2 + (-100)^2 + 0^2 + 100^2 + 200^2 = 100000$
- RSS: The predicted values ($\hat{Y}_i = 50 X_i$) from the regression are:
$\hat{Y}_1 = 50 \cdot 10 = 500$
$\hat{Y}_2 = 50 \cdot 12 = 600$
$\hat{Y}_3 = 50 \cdot 14 = 700$
$\hat{Y}_4 = 50 \cdot 16 = 800$
$\hat{Y}_5 = 50 \cdot 18 = 900$
Since the predicted values are exactly the same as the observed values, every residual is 0, so:
RSS = 0
Thus, the coefficient of determination is:
$R^2 = 1 - \frac{0}{100000} = 1$
This means the model perfectly explains the variation in wages based on education. In practice, a perfect $R^2$ like this is rare and usually a sign of constructed or over-simplified data.
Ex7: Multiple Regression
1. Conceptual Questions:
a. How does multiple regression help in isolating the effect of one independent variable on the dependent variable?
In multiple regression, the effect of each independent variable on the dependent variable is estimated while holding all other independent variables constant. This allows for a more accurate estimation of the relationship between one specific independent variable and the dependent variable because the model controls for the influence of other variables.
For example, in a model where wage depends on both education and experience, multiple regression helps isolate the effect of education on wage by controlling for experience, so the estimated impact of education is not confounded by the effects of experience.
b. What assumptions must hold for the OLS estimators to be unbiased in multiple regression?
The key assumptions are listed below. Assumptions 1 to 3 are what unbiasedness requires; assumptions 4 and 5 are not needed for unbiasedness, but they are needed for OLS to be efficient and for the usual standard errors to be valid:
1. Linearity: The relationship between the dependent and independent variables must be linear in the parameters.
2. No perfect multicollinearity: The independent variables should not be perfectly correlated with one another. If there is perfect multicollinearity, OLS cannot separate the individual effects of the independent variables.
3. Exogeneity: The error term must not be correlated with the independent variables. If any of the independent variables are correlated with the error term, the OLS estimators will be biased. This often happens when there is an omitted variable or measurement error in the independent variables.
4. Homoscedasticity: The variance of the error term must be constant across all values of the independent variables. If this assumption is violated, the OLS estimators remain unbiased but are no longer efficient.
5. No autocorrelation: For time-series data, the errors must not be correlated with each other. This means that past errors should not influence future errors.
2. Interpretation Problem: Consider the multiple regression model:
$Wage = \beta_0 + \beta_1 Education + \beta_2 Experience + u$
If $\hat{\beta}_1 = 2$ and $\hat{\beta}_2 = 1.5$, interpret these coefficients.
- Interpretation of $\hat{\beta}_1 = 2$ (education coefficient): This means that, holding experience constant, for each additional year of education, the expected wage increases by $2. In other words, the marginal effect of one more year of education is an increase of $2 in wages.
- Interpretation of $\hat{\beta}_2 = 1.5$ (experience coefficient): This means that, holding education constant, for each additional year of work experience, the expected wage increases by $1.50. The marginal effect of one more year of experience is a $1.50 increase in wages.
Ex8: OLS Estimator Properties
1. Geometry of OLS: Explain why the OLS residuals are orthogonal to the explanatory variables in the regression model.
In OLS, the residuals ($\hat{u}$) are the differences between the observed values and the predicted values of the dependent variable (Y). One of the fundamental properties of the OLS estimator is that the residuals are orthogonal (perpendicular) to the explanatory variables (X) in the regression model. This can be understood as follows:
o When OLS estimates the parameters $\beta_0$ and $\beta_1$ (or more in the case of multiple regression), it minimizes the sum of the squared residuals. The first-order condition for minimization requires that the derivative of the sum of squared residuals with respect to each coefficient is zero.
o Mathematically, this translates to the sum of the products of each explanatory variable and the residuals being zero:
$\sum_{i=1}^{n} X_i \hat{u}_i = 0$
This implies that, in the sample, the residuals $\hat{u}_i$ are uncorrelated with the explanatory variables $X_i$, meaning they are orthogonal (or perpendicular) in a geometric sense. In vector terms, the residual vector is orthogonal to the column space of the matrix of explanatory variables: $X'\hat{u} = 0$.
This orthogonality is an algebraic consequence of least squares and holds in every sample. The separate result that OLS is the best linear unbiased estimator (BLUE) requires the Gauss-Markov assumptions.
2. Estimator Calculation: Given a matrix X and vector Y, write down the formula for the OLS estimator in matrix form. Explain each component of the formula.
The OLS estimator in matrix form is given by:
$\hat{\beta} = (X'X)^{-1} X'Y$
Where:
o X is the matrix of explanatory variables (also called the design matrix). Each row of X corresponds to an observation, and each column corresponds to an independent variable.
The first column typically contains ones to account for the intercept.
o Y is the vector of observed values of the dependent variable.
o X′ is the transpose of the matrix X, which turns rows into columns.
o $(X'X)^{-1}$ is the inverse of the matrix product X′X, which exists if X′X is invertible (i.e., if there is no perfect multicollinearity).
o $\hat{\beta}$ is the vector of estimated coefficients, which includes the intercept and the coefficients for the independent variables.
Explanation of Each Component:
- X′X: This is the product of the transpose of the matrix X and the matrix X itself. It results in a square matrix in which each element is the sum of cross-products of a pair of columns of X. When the variables are measured as deviations from their means, these entries are proportional to the sample variances and covariances of the independent variables.
- $(X'X)^{-1}$: This is the inverse of the X′X matrix. It rescales the cross-product information so that each coefficient reflects the contribution of its own variable after accounting for the others. The inverse exists only if the columns of X are linearly independent.
- X′Y: This is the product of the transpose of X and the vector of the dependent variable Y.
It combines the information from the explanatory variables and the observed outcomes.
- $\hat{\beta}$: This is the vector of estimated coefficients. It contains the intercept and the slope coefficients for the independent variables, providing the best linear fit to the data by minimizing the sum of squared residuals.
Ex9: Goodness of Fit
1. Coefficient of Determination: Define the coefficient of determination ($R^2$) and explain what it indicates about a regression model.
The coefficient of determination, denoted $R^2$, is a measure of how well the independent variables explain the variation in the dependent variable. It is defined as:
$R^2 = 1 - \frac{RSS}{TSS}$
Where:
- RSS is the Residual Sum of Squares, which measures the variation in the dependent variable that is not explained by the regression model (i.e., the variation that remains in the residuals).
- TSS is the Total Sum of Squares, which measures the total variation in the dependent variable around its mean.
$R^2$ represents the proportion of the variance in the dependent variable that is explained by the independent variables in the regression model. Its value ranges between 0 and 1:
o An $R^2$ of 1 indicates that the model explains 100% of the variation in the dependent variable, meaning the fit is perfect.
o An $R^2$ of 0 indicates that the model explains none of the variation, meaning the independent variables have no explanatory power.
2. Calculation Problem: Using the data from Ex6, calculate $R^2$ and interpret its value.
From Ex6, we know that:
- The regression equation is $\hat{Y} = 50X$.
- The predicted values $\hat{Y}_i = 50 X_i$ are: 500, 600, 700, 800, 900.
The predicted values match the observed values, meaning there are no residuals. Let's calculate $R^2$ step by step.
1. TSS:
$TSS = \sum (Y_i - \bar{Y})^2$
where $\bar{Y}$ is the mean of Y:
$\bar{Y} = \frac{500 + 600 + 700 + 800 + 900}{5} = 700$
Now, calculate $(Y_i - \bar{Y})^2$ for each observation:
$(500 - 700)^2 = 40000$
$(600 - 700)^2 = 10000$
$(700 - 700)^2 = 0$
$(800 - 700)^2 = 10000$
$(900 - 700)^2 = 40000$
Sum these squared differences:
TSS = 40000 + 10000 + 0 + 10000 + 40000 = 100000
2. RSS: Since the predicted values are exactly equal to the observed values, the residuals are 0 for all observations:
RSS = 0
3. Calculate $R^2$:
$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{0}{100000} = 1$
An $R^2$ of 1 means that education explains all of the variation in wages in this sample.
Ex10: Interpretation of Regression Output
Given the following regression output, interpret each component:
Variable     Coefficient   t-Statistic   p-Value
Intercept    30.5          5.88          0.0000
Education    2.06          4.12          0.001
Experience   1.5           5.00          0.0000
The regression equation from this output is:
$\widehat{Wage} = 30.5 + 2.06 \cdot Education + 1.5 \cdot Experience$
where $\widehat{Wage}$ represents the wage (in $1000s), Education is the number of years of education, and Experience is the number of years of work experience.
a. What is the estimated wage for an individual with 12 years of education and 5 years of experience?
To estimate the wage for this individual, substitute Education = 12 and Experience = 5 into the regression equation:
$\widehat{Wage} = 30.5 + 2.06 \cdot 12 + 1.5 \cdot 5 = 30.5 + 24.72 + 7.5 = 62.72$
So, the estimated wage is $62,720.
b. Discuss the statistical significance of each coefficient.
1. Intercept ($\hat{\beta}_0 = 30.5$):
- t-Statistic: 5.88
- p-Value: 0.0000
- The intercept is statistically significant because the p-value is very low (less than 0.05).
This suggests that the intercept, representing the base wage when education and experience are both zero, is different from zero in a statistically significant way.
2. Education ($\hat{\beta}_1 = 2.06$):
- t-Statistic: 4.12
- p-Value: 0.001
- The coefficient for education is also statistically significant because the p-value is much smaller than 0.05. This indicates that education has a significant impact on wage. Specifically, holding experience constant, each additional year of education increases the expected wage by $2,060.
3. Experience ($\hat{\beta}_2 = 1.5$):
- t-Statistic: 5.00
- p-Value: 0.0000
- The coefficient for experience is statistically significant with a p-value less than 0.05. This shows that experience significantly affects wage. Holding education constant, each additional year of work experience increases the expected wage by $1,500.
c. What does the intercept represent in this context?
The intercept of 30.5 represents the estimated wage (here $30,500) for an individual with zero years of education and zero years of experience. In real-world terms, this is the base wage someone might earn even without any formal education or work experience. While the intercept is statistically significant in this case, it may not always have a practical meaning, especially when it is unrealistic for people in a particular dataset to have zero education or zero experience.
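The Ex10 arithmetic can be checked with a few lines of code. This is a minimal sketch that assumes the coefficients and p-values quoted in the exercise (the experience coefficient of 1.5 is the value implied by the 7.5 term and the $1,500 effect); the function name `predict_wage` and the constant names are illustrative only.

```python
# Coefficients quoted in Ex10; wage is measured in $1000s.
INTERCEPT = 30.5
B_EDUCATION = 2.06
B_EXPERIENCE = 1.5

# p-values quoted in Ex10 for each coefficient.
P_VALUES = {"intercept": 0.0000, "education": 0.001, "experience": 0.0000}
ALPHA = 0.05  # 5% significance level used in the exercise


def predict_wage(education_years, experience_years):
    """Return the fitted wage (in $1000s) from the Ex10 regression equation."""
    return (INTERCEPT
            + B_EDUCATION * education_years
            + B_EXPERIENCE * experience_years)


# Part a: 12 years of education and 5 years of experience.
wage = predict_wage(12, 5)
print(f"Predicted wage: {wage:.2f} thousand dollars")

# Marginal effect of one more year of education, holding experience fixed.
effect = predict_wage(13, 5) - predict_wage(12, 5)
print(f"Effect of one extra year of education: {effect:.2f} thousand dollars")

# Part b: a coefficient is significant at the 5% level when its p-value < 0.05.
significant = {name: p < ALPHA for name, p in P_VALUES.items()}
print(significant)
```

The difference between the two predictions reproduces the education coefficient, which is what "holding experience constant" means in the interpretation above.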
Ex11: Applying Regression Analysis
1. Data Interpretation: You are provided with a dataset containing variables: Income, Education, Experience, and Age. Describe how you would set up a multiple regression model to study the effect of education and experience on income. What would you include as your dependent and independent variables?
1st. Set up the regression model:
- The goal is to understand how Education and Experience influence Income.
- The general form of the multiple regression model is:
$Income_i = \beta_0 + \beta_1 Education_i + \beta_2 Experience_i + u_i$
Where:
o Income is the dependent variable (what we are trying to explain).
o Education and Experience are the independent variables (the factors that are believed to influence income).
o $\beta_0$ is the intercept (baseline income when education and experience are zero).
o $\beta_1$ and $\beta_2$ are the coefficients that measure how much income changes with each additional year of education and experience, respectively.
o $u_i$ is the error term, representing other factors affecting income that are not captured by education and experience.
2nd. Explanation of variables:
- Dependent Variable (Income): This is the outcome we are trying to predict or explain. In this case, income (likely measured in dollars or another currency).
- Independent Variables:
o Education: This represents the number of years of schooling. The model assumes that more years of education lead to higher income.
o Experience: This represents the number of years someone has been working.
The assumption is that more experience generally leads to higher income.
2. Hypothesis Testing: Suppose you estimate a regression equation for income in which the coefficient on Education is $\hat{\beta}_1 = 5$. Test the hypothesis that education has no effect on income at the 5% significance level.
1. State the null and alternative hypotheses:
- Null hypothesis ($H_0$): Education has no effect on income, i.e., $\beta_1 = 0$.
- Alternative hypothesis ($H_1$): Education has an effect on income, i.e., $\beta_1 \neq 0$.
2. Extract information from the regression output: From the regression equation, the coefficient of Education is $\hat{\beta}_1 = 5$. To test whether education has a significant effect on income, we need the standard error of the coefficient for education and the t-statistic (which might be given or calculated).
3. Calculate the t-statistic: The t-statistic is calculated as:
$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$
Suppose the standard error of the coefficient for Education is 2. Then:
$t = \frac{5}{2} = 2.5$
4. Compare with the critical value: At the 5% significance level, the critical value for a two-tailed t-test with a reasonably large sample (e.g., n ≥ 30) is approximately 1.96.
5. Decision:
- If |t| > 1.96, reject the null hypothesis.
- In this case, |t| = 2.5, which is greater than 1.96. Therefore, we reject the null hypothesis and conclude that education has a significant effect on income at the 5% significance level.
Ex12: Model Specification
1. Selecting Variables: Discuss the importance of including relevant variables in a regression model. What could be the consequences of omitting a key variable that is correlated with both the dependent and independent variables?
Including relevant variables in a regression model is crucial for obtaining unbiased and consistent estimates of the relationship between the dependent and independent variables. When a key variable is omitted from the model, it can result in omitted variable bias. Here's how omitting a relevant variable can cause problems:
1.
Omitted Variable Bias:
- If a key variable that influences both the dependent variable and one or more independent variables is omitted from the model, the estimated coefficients of the remaining variables will be biased. This is because the omitted variable's effect is falsely attributed to the included variables.
- For example, if you are estimating the effect of education on income but omit the variable "ability" (which affects both education and income), the coefficient on education will be biased, as some of the effect of ability is mistakenly attributed to education.
2. Consequences:
- Biased estimates: The estimated coefficients will not reflect the true relationship between the independent and dependent variables, leading to incorrect conclusions.
- Invalid inferences: Hypothesis tests and confidence intervals can be misleading because they are built around biased coefficient estimates.
- Policy implications: If a model is used for policy decisions, biased estimates can lead to incorrect recommendations. For example, if ability is omitted from the education-income regression, policymakers might overestimate the benefits of additional education.
3. Mitigation:
- Include control variables: Carefully consider which variables should be included based on theory and prior research.
- Check for omitted variable bias: Use diagnostic tests and examine the residuals for systematic patterns that suggest important variables have been left out.
2. Functional Form: Explain how you would determine whether a linear model is appropriate for your data. What diagnostics or tests might you perform?
To determine whether a linear model is appropriate for your data, you need to check whether the relationship between the dependent and independent variables is in fact linear. Here are the steps and diagnostics you can use:
1. Visual Inspection (Scatter Plots):
- Plot the dependent variable against each independent variable to visually inspect whether the relationship appears linear.
- If the scatter plots suggest curvature or other non-linear patterns, a linear model might not be appropriate, and a non-linear transformation of variables (e.g., log or quadratic terms) may be needed.
2. Residual Analysis:
- After estimating a linear regression model, plot the residuals (the differences between observed and predicted values) against the fitted values.
- If the residuals display a clear pattern (e.g., a U-shape or an increasing/decreasing trend), this suggests that the relationship between the variables is non-linear.
- Residuals should be randomly scattered around zero if the linear model is appropriate.
3. Testing for Non-Linearity:
- Ramsey RESET test: This test helps detect omitted non-linear relationships. It works by adding squared and cubic terms of the predicted values to the model and testing whether these terms are jointly significant. If they are significant, this indicates that a linear model might not be appropriate.
4. Include Polynomial or Interaction Terms:
- If the linear model is inadequate, you can include polynomial terms (e.g., $X^2$) to account for curvature, or interaction terms between independent variables to capture non-linear effects.
- For example, in a wage model, you could include $Education^2$ if you believe the effect of education on wages diminishes at higher education levels.
5. Log-Linear or Log-Log Models:
- Another option is to transform the data using logarithms. A log-linear model (where the dependent variable is logged) or a log-log model (where both the dependent and independent variables are logged) can help linearize relationships that are multiplicative or exponential in nature.
- For example, a log-linear model might be appropriate if income grows exponentially with experience.

Ex set 1: OLS and Multiple Regression
1.
Understanding OLS Unbiasedness
Problem: Given the wage regression model:
  $Wage_i = \beta_0 + \beta_1 Edu_i + \beta_2 Experience_i + u_i$
where $Edu_i$ is the years of education, $Experience_i$ is work experience, and $u_i$ is the error term:
- What are the assumptions needed for OLS to be an unbiased estimator?
- Derive the OLS estimator for $\beta_1$ and explain why it is unbiased under the assumptions of the Classical Linear Model (CLM).

- The assumptions for OLS to be an unbiased estimator are:
  o MLR.1: Linearity in parameters.
  o MLR.2: Random sampling of observations.
  o MLR.3: No perfect collinearity among independent variables.
  o MLR.4: Zero conditional mean, meaning $E(u|X) = 0$.
- OLS Estimator for $\beta_1$: The OLS estimator can be derived by minimizing the sum of squared residuals:
  $\min_{b_0, b_1, b_2} \sum_{i=1}^{n} (Wage_i - b_0 - b_1 Edu_i - b_2 Experience_i)^2$
  Solving the first-order conditions gives
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} \hat{r}_{i1} Y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2}$
  where $X_i = Edu_i$, $Y_i = Wage_i$, and $\hat{r}_{i1}$ is the residual from regressing $X_i$ on a constant and $Experience_i$. Without Experience in the model, $\hat{r}_{i1} = X_i - \bar{X}$ and the expression reduces to the simple-regression formula $\hat{\beta}_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$.
  Under the CLM assumptions (MLR.1 to MLR.4 are the ones needed here), the expected value of the OLS estimator is equal to the true population parameter:
  $E(\hat{\beta}_1) = \beta_1$
  This means OLS is unbiased.
2. Coefficient Interpretation
Using the same wage equation, interpret the coefficient $\hat{\beta}_1$ if the OLS estimate is 0.05. What does this tell us about the relationship between education and wages, holding experience constant?
If $\hat{\beta}_1 = 0.05$, it means that, holding experience constant, each additional year of education is associated with an increase in monthly wages of $0.05 (5 cents).
Interpretation: For two individuals with the same experience, the one with one more year of education is predicted to earn $0.05 more per month.
3. Effect of Rescaling
If the dependent variable ($Wage_i$) is measured in dollars, and you convert it to cents, how will the OLS estimates change? Explain why this rescaling does not affect the relationship between variables.
- Converting wages from dollars to cents means multiplying the dependent variable by 100. Since every OLS coefficient scales linearly with the dependent variable, all OLS coefficients (including the intercept) will be multiplied by 100. For example, if $\hat{\beta}_1 = 0.05$ when wage is in dollars, then in cents, $\hat{\beta}_1 = 5$.
- Why the relationship remains unchanged: The rescaling affects only the units in which the dependent variable is measured, not the underlying relationship between the variables. The interpretation (i.e., the change in wages per additional year of education) remains the same but is now expressed in cents rather than dollars.

Ex set 2: Hypothesis Testing and Confidence Intervals
1. t-Tests
Problem: In the following regression model:
  $FinalGrade_i = \beta_0 + \beta_1 StudyHour_i + u_i$
the estimated coefficient on $StudyHour_i$ is 3.5 with a standard error of 1.2.
a. Test the null hypothesis H0: $\beta_1 = 0$ at the 5% significance level.
b. Calculate the t-statistic and interpret whether study hours have a statistically significant effect on final grades.
1. Null Hypothesis: H0: $\beta_1 = 0$ (study hours have no effect on final grades)
2. Alternative Hypothesis: H1: $\beta_1 \neq 0$ (study hours have an effect on final grades)
3. t-Statistic Calculation: The t-statistic is calculated as:
  $t = \hat{\beta}_1 / SE(\hat{\beta}_1) = 3.5 / 1.2 \approx 2.917$
4. Critical Value and Decision:
- Degrees of freedom (df) = sample size minus the number of estimated parameters.
- For a two-tailed test at the 5% significance level and a large sample, the critical t-value from the t-distribution is approximately 1.96.
- Since $t = 2.917$ exceeds 1.96, we reject the null hypothesis.
5. Interpretation: Study hours have a statistically significant effect on final grades at the 5% significance level.
2. F-Test for Multiple Restrictions
Consider the model:
  $Price_i = \beta_0 + \beta_1 Size_i + \beta_2 Bedroom_i + \beta_3 Bathroom_i + u_i$
You want to test whether the number of bedrooms and bathrooms jointly have no effect on house price, i.e., H0: $\beta_2 = 0$ and $\beta_3 = 0$.
- Formulate the null and alternative hypotheses.
- Explain how you would perform an F-test to test joint restrictions.
1. Null Hypothesis: H0: $\beta_2 = 0$ and $\beta_3 = 0$ (bedrooms and bathrooms have no effect on house price)
2. Alternative Hypothesis: H1: $\beta_2 \neq 0$ or $\beta_3 \neq 0$ (at least one of the variables, bedrooms or bathrooms, has an effect on house price)
3.
F-Test Procedure
- Step 1: Estimate the unrestricted model (including $Bedroom_i$ and $Bathroom_i$) and calculate the Sum of Squared Residuals ($SSR_{UR}$).
- Step 2: Estimate the restricted model (excluding $Bedroom_i$ and $Bathroom_i$, i.e., the model with only $Size_i$) and calculate the Sum of Squared Residuals ($SSR_R$).
- Step 3: Compute the F-statistic:
  $F = \frac{(SSR_R - SSR_{UR}) / q}{SSR_{UR} / (n - k)}$
  Where:
  o q = 2 (number of restrictions, i.e., the two coefficients set to zero under H0)
  o n = number of observations
  o k = number of parameters in the unrestricted model (including the intercept)
4. Decision
- Compare the calculated F-statistic with the critical F-value from the F-distribution table (based on $q$ and $n - k$ degrees of freedom) at the 5% significance level.
- If the F-statistic exceeds the critical value, reject the null hypothesis.
- This would mean that bedrooms and bathrooms jointly have a significant effect on house price.
3. Confidence Intervals
Using the model:
  $BirthWeight_i = \beta_0 + \beta_1 Cig_i + u_i$
where $Cig_i$ is the number of cigarettes smoked per day by the mother during pregnancy and birth weight is measured in kilograms, the estimated coefficient for $\beta_1$ is -0.004 with a standard error of 0.001.
- Construct a 95% confidence interval for $\beta_1$.
- Interpret the confidence interval in the context of the effect of smoking on birth weight.
1. Formula for Confidence Interval:
  $\hat{\beta}_1 \pm t_{\alpha/2} \times SE(\hat{\beta}_1)$
- $\hat{\beta}_1 = -0.004$, $SE(\hat{\beta}_1) = 0.001$
- The critical t-value for a 95% confidence interval with a large sample size is approximately 1.96.
2. Confidence Interval Calculation
  $-0.004 \pm 1.96 \times 0.001 = -0.004 \pm 0.00196 = (-0.00596, -0.00204)$
3. Interpretation
- The 95% confidence interval for $\beta_1$ is (-0.00596, -0.00204).
- This means we are 95% confident that the true effect of smoking on birth weight is between a decrease of 0.00204 kg and a decrease of 0.00596 kg per cigarette smoked per day by the mother during pregnancy.
- Since the confidence interval does not include zero, we conclude that the effect of smoking is statistically significant and negative, meaning that smoking during pregnancy is likely to reduce birth weight.

Ex set 3: Functional Forms and Model Selection
1.
Logarithmic Models
Problem: Consider the following log-level model for predicting wages:
  $\log(Wage_i) = \beta_0 + \beta_1 Education_i + u_i$
If $\beta_1 = 0.08$, interpret the result. How does an additional year of education affect the predicted wage?
The model is in log-level form, meaning the dependent variable (Wage) is logged, but the independent variable (Education) is in its original level form.
- Interpretation of $\beta_1 = 0.08$: In a log-level model, $100 \times \beta_1$ is interpreted as the approximate percentage change in the dependent variable for a one-unit change in the independent variable. Therefore, $\beta_1 = 0.08$ implies that an additional year of education increases the predicted wage by approximately 8%, holding other factors constant.
2. Quadratic Models
You are given the following quadratic model for predicting the number of hours slept by an individual:
  $Sleep_i = \beta_0 + \beta_1 Age_i + \beta_2 Age_i^2 + u_i$
If $\beta_1 = -0.05$ and $\beta_2 = 0.01$, find the age at which the predicted amount of sleep is minimized.
- This is a quadratic model, where $Sleep_i$ is predicted based on age and age squared.
- To find the age at which sleep is minimized, we need to find the turning point (vertex) of the quadratic function. The turning point occurs where the derivative with respect to age equals zero.
- Step 1: Take the derivative of $Sleep_i$ with respect to $Age_i$:
  $\partial Sleep_i / \partial Age_i = \beta_1 + 2 \beta_2 Age_i$
- Step 2: Set the derivative equal to zero and solve for $Age_i$:
  $Age^* = -\beta_1 / (2 \beta_2) = 0.05 / (2 \times 0.01) = 2.5$
- Interpretation: With these coefficients, the predicted amount of sleep is minimized at age 2.5. Because $\beta_2 > 0$, the parabola opens upward, so this turning point is a minimum: predicted sleep falls with age up to this point and rises afterward.
3. Model Selection Criteria
Given two regression models for predicting house prices, which model would you choose based on the adjusted $R^2$? Why?
- Adjusted $R^2$ is a better metric than $R^2$ for comparing models with different numbers of predictors, as it penalizes models for adding irrelevant variables. Unlike $R^2$, which never decreases when a variable is added, adjusted $R^2$ increases only if the added variable improves the fit by more than the degrees-of-freedom penalty (for a single added variable, when its t-statistic exceeds 1 in absolute value).
- Comparison:
  o Model 1: $R^2$ = 0.85, Adjusted $R^2$ = 0.84
  o Model 2: $R^2$ = 0.87, Adjusted $R^2$ = 0.84
- Decision: Since both models have the same adjusted $R^2$ (0.84), there is no strong evidence that one model is better than the other based on adjusted $R^2$. However, Model 2 reaches a higher $R^2$ at the same adjusted $R^2$, which implies that it uses more predictors. Model 1 is therefore more parsimonious and simpler to interpret, and it would be preferable: it achieves similar performance with fewer variables, following the principle of simplicity (parsimony).

Ex set 4: Prediction and Confidence Intervals
1. Prediction Interval
Problem: Using a regression model of test scores on study hours, the predicted test score for a student who studied 10 hours is 85, and the standard error of the prediction is 2.5. Calculate a 95% prediction interval for this student's score.
1. Formula for Prediction Interval: The formula for the prediction interval is:
  $\hat{y} \pm t_{\alpha/2} \times SE_{prediction}$
  Where:
- $\hat{y} = 85$ (the predicted test score),
- $SE_{prediction} = 2.5$ (the standard error of the prediction),
- $t_{\alpha/2} \approx 1.96$ (for a 95% confidence level with a large sample size).
2. Calculation
  $85 \pm 1.96 \times 2.5 = 85 \pm 4.9 = (80.1, 89.9)$
3. Interpretation
- The 95% prediction interval for the student's test score is (80.1, 89.9).
- This means we are 95% confident that the actual test score for the student who studied 10 hours will fall within this range.
2. Model Selection via AIC/BIC
Problem: You have three models for predicting monthly income:
- Model 1: AIC = 200.5, BIC = 215.3
- Model 2: AIC = 198.2, BIC = 210.0
- Model 3: AIC = 202.1, BIC = 216.7
Which model should you select based on AIC and BIC? Explain your reasoning.
Solution:
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are used to compare models by balancing model fit and complexity.
  o AIC: Lower values are preferred.
  o BIC: Lower values are also preferred, but BIC penalizes more complex models (models with more parameters) more heavily than AIC.
- Comparison of AIC values:
  o Model 1: AIC = 200.5
  o Model 2: AIC = 198.2
  o Model 3: AIC = 202.1
  Based on AIC, Model 2 has the lowest value (198.2), so it would be the preferred model.
- Comparison of BIC values:
  o Model 1: BIC = 215.3
  o Model 2: BIC = 210.0
  o Model 3: BIC = 216.7
  Based on BIC, Model 2 also has the lowest value (210.0), making it the best choice according to BIC as well.
- Final Decision: Since Model 2 has the lowest AIC and BIC, it is the best model based on both criteria. This model strikes the best balance between goodness of fit and model simplicity.
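
The AIC/BIC comparison can also be written as a few lines of Python. This is only a minimal sketch: the AIC and BIC values are the ones given in the problem, while the dictionary layout and the helper name `best_by` are illustrative assumptions rather than part of any library.

```python
# AIC and BIC values copied from the three models in the problem statement.
# The dictionary layout is an illustrative choice, not a library format.
models = {
    "Model 1": {"AIC": 200.5, "BIC": 215.3},
    "Model 2": {"AIC": 198.2, "BIC": 210.0},
    "Model 3": {"AIC": 202.1, "BIC": 216.7},
}


def best_by(criterion, candidates):
    """Return the name of the model with the lowest value of `criterion`.

    `best_by` is a hypothetical helper written for this example only.
    """
    return min(candidates, key=lambda name: candidates[name][criterion])


best_aic = best_by("AIC", models)
best_bic = best_by("BIC", models)

print(f"Lowest AIC: {best_aic} ({models[best_aic]['AIC']})")
print(f"Lowest BIC: {best_bic} ({models[best_bic]['BIC']})")

if best_aic == best_bic:
    print(f"Both criteria select {best_aic}.")
else:
    # The two criteria can disagree because BIC penalizes extra parameters more heavily.
    print("The criteria disagree; BIC tends to favor the simpler model.")
```

Here both criteria point to Model 2, matching the decision above. When they disagree, the choice depends on whether predictive fit (AIC) or parsimony (BIC) matters more for the application.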