Session 10: be able to download and put data together; go over the steps from last time with the European Social Survey: understand the variables and structures, which round, which countries, and how to create dummy variables from the data set.
DUMMY VARIABLES
- Example male or female: we take whether an employee is male or female into account when thinking about the relationship between experience (x) and income (y). If females do have lower income than males of equal experience, by ignoring (OMITTING) sex we would inflate the size of the errors.
- Sex is not an ordinal variable, it is a categorical variable: it cannot be included in the regression directly, it has to be coded as a dummy.
- If we estimate that kind of equation, b2 tells us how much on average the wage differs between the two groups.
- On average men earn more than women, so b2 will be negative, because it takes the intercept and shifts the line downwards. If b2 is close to zero or not significant: with this data we cannot conclude that women earn less than men. If it is close to zero but significant: women do earn less, but not by very much.
Interpreting coefficients:
- A dummy covariate identifies two groups of respondents: those on which we observe 1 and those on which we observe 0.
- Its coefficient is the difference between the average (mean) values of the response in the two groups when all other covariates are kept constant and fixed (controlling for all other covariates).
- In the wage example, b2 is the difference we expect to observe between the salary of a man and the salary of a woman, controlling for experience.
Simpson's paradox (illustrated with and without the female dummy):
- A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data.
- On average, when controlling for Education and Tenure, female employees earn less than male employees: the average difference between the hourly pay of a woman and the hourly pay of a man is estimated at -1.788 (women earn 1.788 $/h less).
- The model with Female as additional covariate has a higher Adjusted R Square and a smaller Standard Error of the Estimate.
Slope dummy (interaction effect):
- Up to now we have assumed that the impact of a covariate on the response is the same regardless of the values taken by the other covariates. Two covariates may interact, i.e. the effect may differ across their combinations.
- E.g. does experience play the same role for males and for females?
- A slope dummy variable is given by the interaction between a dummy and a numeric variable. It takes the value zero in some rows (when the dummy=0) and the value of the numeric independent variable elsewhere (when the dummy=1).
- Slope dummies are used when the data is not well modeled by two lines with the same slope: we instead fit two lines with different slopes (one per dummy category).
- Interactions are modeled by adding as an additional covariate the product of the two interacting variables (the slides show the fitted equations separately for males and for females).
- In the wage example the interaction term is significant: an additional year of working experience with the current employer has a different impact for males and females. It increases the males' wage by 0.2019 $/h and the females' wage by 0.0656 $/h (=0.2019-0.1363), controlling for other covariates.
- Slope dummy in STATA: generate the interaction variable as the product of the dummy and the numeric variable.
Dummy variable trap:
- Multicollinearity frequently arises with dummy variables. The easiest solution is simply to omit one category when including dummy variables; this avoids the so-called "dummy variable trap". E.g. we did not introduce one dummy variable for female and one for male.
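A minimal STATA sketch of the slope-dummy step described above (wage, educ, tenure and female are the wage-example variables; the interaction name fem_tenure is just an illustration):
command: gen fem_tenure = female*tenure
command: reg wage educ tenure female fem_tenure
The coefficient on fem_tenure is the estimated difference in the tenure slope between women and men.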
ANOVA and multiple regression:
- The ANOVA model reduces to a multivariate regression model having the factor as its unique covariate.
Multinomial categorical covariate (k levels): k-1 dummy variables
- A categorical variable taking k values can be included as a covariate by representing it by means of a set of (k-1) dummy variables, each being the indicator of a given level of the variable.
- The category left out is referred to as the reference category or control group.
- The coefficient of each dummy is interpreted as the difference we would expect to observe between the value of the response for an individual in the category labelled by the dummy and the value of the response for an individual in the reference category.
Wage example:
- wage: salary (in dollars per hour); educ: years of education; tenure: number of years in the firm; female: dummy variable for gender (1=female, 0=male); geo: multinomial categorical covariate, 1=North and Central US, 2=South, 3=West.
- 'North and Central US' is taken as the reference category (the 'zero'). Two dummies are created: South (=1 if the employee lives in the South) and West (=1 if the employee lives in the West).
Interpretation of coefficients: the coefficient of West is the difference in salary we expect to observe between an employee living in the West and an employee living in North-Central US.
Recall: heteroscedasticity – remedies
- Sometimes we need to take the log transformation of the dependent variable to regain efficiency: the log transformation smooths out the excess variability of the observations.
- STATA: gen ln_wage=ln(wage)
- When the response variable in a multiple linear regression model is the logarithm of a variable (as ln_wage), each covariate coefficient estimate is the average change in the logarithm of the response variable associated with a unit change in the covariate, controlling for all other covariates. Therefore it quantifies the average percentage change in the original response variable associated with a unit increase in the covariate when all other covariates are kept constant.
Logarithms in regression analysis: interpretation of the coefficient(s)
- Standard simple linear regression model: Y = β0 + β1X + ε. As X increases by 1 unit, Y changes on average by β1 units.
- Semi-log model: ln(Y) = β0 + β1X + ε. As X increases by 1 unit, Y changes by (100·β1)%.
- Log-log model: ln(Y) = β0 + β1 ln(X) + ε. As X increases by 1%, Y changes by β1%.
- Remark: when we suspect that heteroscedasticity is due to one (or more) covariates, we can also log-transform the covariate.
- Regression of ln(wage) on Education and Tenure: the Education coefficient estimate is 0.087, so one additional year of education on average leads to an 8.7% increase (100·b1 %) in wage, controlling for Tenure. The Tenure coefficient estimate is 0.026, so one additional year of experience with the current employer on average leads to a 2.6% increase (100·b2 %) in wage, controlling for Education.
- Regression of ln(wage) on Education, Tenure, Gender and Geo: the wage of employees living in the West is on average 13.9% higher than the wage of employees living in North and Central US (a short STATA sketch appears at the end of this block).
Research paper: it needs to look like a research paper, with data, background and conclusion. Professional writing: forget that the reader knows the ESS, start from the beginning (still say that the ESS is a biennial survey, …).
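A short sketch of the semi-log wage regression interpreted above (assuming the South and West dummies have already been generated from geo; the lowercase names are placeholders):
command: gen ln_wage = ln(wage)
command: reg ln_wage educ tenure female south west
Each coefficient is then read as an approximate percentage change (100·b %) in wage per unit change in the covariate.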
NEW HELP!
Open an example dataset: click File > Example datasets > Example datasets installed with Stata > use auto.dta, or use the command:
command: sysuse datasetname
command: sysuse datasetname, clear
Save a log of the output
command: log using "Log", replace
Browse the data
You can browse the data by entering the command browse or clicking the Data Editor (Browse) icon in the toolbar.
Describe data - describes the variables, does not show min, max, etc.
command: describe
or command: codebook variable variable variable
Replace values of a variable
command: replace variable=10 if variable>2
Find missing data - missing cases are shown as dots
- command: tab variable, missing
- command: tab variable, missing nolab (no labels of categorical variables, numbers instead)
How to deal with missing data: should a variable with a high proportion of missing values (higher than 5.8%) be dropped from the analyses?
- Omit the case from all analyses (listwise deletion).
- Impute (estimate) missing values. Techniques include mean substitution or estimation by regression. This provides a full set of data, but results may be invalid if our estimates are wrong. Results of analyses with and without imputation should be compared (see the mean-substitution sketch at the end of this block).
command: drop if dependent_variable==.
Data screening (univariate): impossible or improbable values - recover the correct value if possible. Extreme values (outliers): inspect using a boxplot, roughly +/- 3.5 standard deviations from the sample mean.
command: graph box dependent_variable
Normality of the distribution of metric variables: as the sample size increases, the distribution will usually approach normal; skewness should be between -1 and 1.
Summarize (only for continuous variables)
command: sum variable
or command: sum variable, detail
Frequency bar graph
command: graph bar (count), over(variable to be counted)
Histogram (continuous variables)
command: histogram variable, norm
Two-way frequency table for 2 categorical variables - depends on which frequencies you want to show
command: tabulate variable1 variable2, cell
command: tabulate variable1 variable2, row
command: tabulate variable1 variable2, column
Two-way table for CATEGORICAL-CONTINUOUS
command: tabulate categorical_variable, sum(continuous_variable)
command: bys categorical_variable: summarize continuous_variable
This calculates summary statistics for the continuous variable separately for each category of the categorical variable.
If clause: can be used to set conditions for many things
summary table: summarize price if length==200, detail
scatter plot: scatter price length if foreign==0
missing values: count if foreign>0 versus count if foreign>0 & foreign!=.
Create a new variable in Stata
command: generate variable
ex: generate price2 = price^2 creates a new variable called "price2", where each value is the square of the corresponding value of "price".
ex: generate young=(agea<25) if agea!=. creates a dummy equal to 1 when age is lower than 25 and 0 otherwise, and sets it to missing when agea is missing.
Delete variables or observations from the dataset
command: drop variable (drops a variable)
ex: drop if cntry != "United Kingdom" & cntry != "Italy" & cntry != "Denmark" (drops observations)
ex: drop if Distance==.
Keep only some variables, or only some observations (with the "or" operator |)
command: keep variable variable variable ...
ex: keep if cntry == "United Kingdom" | cntry=="Italy" | cntry=="Denmark"
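A minimal sketch of mean substitution for a missing covariate, as referenced in the missing-data notes above (the variable name educ is only an illustration; in practice, compare results with and without the imputation):
command: sum educ
command: replace educ = r(mean) if educ==.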
ex: keep if cntry == "United Kingdom" | cntry=="Italy" | cntry=="Denmark” Rescaling variables command: replace variable = ( variable - r(min) ) / ( r(max) - r(min) ) * 10 check if rescaled by looking at min and max command: summarize variable Summary table of variables: mean, standard deviation, min, max, skweness and kurtosis negative kurtosis means curve of sub group by level of categorical variable flatter than normal distribution its negative and higher than normal distribution its positive/more concentrated around mean command: tabstat dependent variable, by(independent variable)s(mean sd min max skewness kurtosis) ANOVA: If there is only one categorical variable, the analysis is referred to as one-way Anova If there are n factors the analysis is called n-ways Anova one-way anova if p value (prob > F) is smaller than alpha, then reject null hypothesis null hypothesis means that all means are equal by rejecting it it means at least one mean differs so the categorical variable has an effect on the dependent variable when you want to understand the influence of ONE categorical variable (ONE factor) on a numerical variable. command: oneway dependent variable independent variable SCHEFFE TEST: check which mean from which sub sample is different if p value less than alpha reject null hypothesis meaning that mean is different from the rest in this case package 4 mean different from all the rest while the rest of the means r equal always look at values below in each row command: oneway dependent variable independent variable, scheffe REGRESSION: Regression model coefficients and significance of coefficient (simple linear regression) if p value smaller than alpha we reject null hypothesis which states that coefficient or constant is equal to 0 —> coefficient / constant is significant if confidence interval not too wide and if it does not include 0 —> coefficient/constant is significant p value and confidence interval do not always match up - but it often happens that it is r squared shows proportion of variability described my model the higher the better the more variables the higher command: reg dependent variable independent variable reg is used to understand the dependency of two variables on the left the confidence interval is slightly larger, a few points and not under the line – outliers that increase variability SCATTER PLOT - only for one dependent and one independent variable command: scatter dependent variable independent variable || lfitci dependent variable independent variable lftici function shows fitline, without it no fitline or command: scatter dependent variable independent variable if dummy variable ==0 this will create scatterplot between dependent variable and the independent variable only for independent variables associated to the opposite of the dummy variable ex: a scatterplot of price against length for foreign cars, and then for domestic cars. 
CREATE A SCATTERPLOT GRAPHING ONE VARIABLE AGAINST ANOTHER IN TWO SUBGROUPS
command: ssc install sepscatter
command: sepscatter variable1 variable2, sep(subgroup)
ex: sepscatter price length, sep(foreign)
MULTIPLE REGRESSION MODEL
- Same logic as the simple linear regression model.
- Every time you add a variable that makes sense, the whole model improves.
command: reg dependent_variable independent_variable1 independent_variable2 ...
COMPARING THE BETAS: which independent variable contributes more to the model —> to understand the strength of each relationship
- The higher the standardized beta, the higher the contribution of that independent variable.
- Beta coefficient of Xk: bk * (std. deviation of Xk / std. deviation of Y).
- A standardized coefficient represents the change, in standard deviations, of the dependent variable that results from an increase of one standard deviation in an independent variable (so we can compare!).
command: reg dependent_variable independent_variable1 independent_variable2 ..., beta   (see the sketch at the end of this block)
How to find the regression model between the dependent variable and only one of the groups defined by a dummy variable:
command: regress dependent_variable independent_variable if dummy_variable==1
command: regress dependent_variable independent_variable if dummy_variable==0
WEIGHTS IN STATA
- Weights are not crucial: they usually do not change the model itself, perhaps slightly the significance of the coefficients, but the model remains good.
- Stata offers 4 weighting options:
  frequency weights (fweight): for samples with many duplicate observations
  analytic weights (aweight): for samples where each observation is a group mean
  probability weights (pweight): for samples where observations had a different probability of being sampled
  importance weights (iweight): weights that indicate the "importance" of the observation in some vague sense; iweights have no formal statistical definition, and any command that supports them defines exactly how they are treated
- To add weights, we add [w=] to the command, before the options (that is, before the comma).
ex: summarize variable [w=pspwght]   (pspwght is a specific ESS weight)
ex: regress dependent_variable independent_variable [w=pspwght]
Creating a new weight:
command: gen overallwght = pspwght * pweight
BEST LINEAR FIT LINE FOR A REGRESSION INCLUDING WEIGHTS
command: twoway lfit dependent_variable independent_variable [w=pspwght]
EVALUATION OF A MODEL
Coefficient of determination, R squared
- R2 measures the proportion of total variation in y explained by all X variables taken together; 0 ≤ R2 ≤ 1. The closer R2 is to 1, the better the goodness of fit of the model.
- When comparing models, it is better to use adjusted R2 (adjusted by the degrees of freedom, which depend on the sample size and the number of covariates).
- R2 will always increase as you add a new variable, whereas adjusted R2 will not necessarily do so.
- Keep adding variables as long as adjusted R2 does not decrease; if a new variable makes only a small impact, you need to decide.
command: reg dependent_variable independent_variable ...
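A small sketch of a model comparison and of the beta option, referring back to the sections above (wage-example variable names assumed):
command: reg wage educ tenure
command: reg wage educ tenure female
(compare the Adj R-squared and the Root MSE of the two runs)
command: reg wage educ tenure female, beta
(the beta option reports standardized coefficients, so the covariates' contributions can be compared)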
MODEL ERROR VARIANCE
- The model with a lower standard error of the estimate fits the data better.
- If its value is higher, then the standard errors of the coefficients are also higher and, ceteris paribus, we will be more likely to find non-significant results.
- In STATA it is called "Root MSE".
command: reg dependent_variable independent_variable ...
F-STATISTIC
- It must be higher than 1; if it is bigger than 4 we reject the null hypothesis.
- If you do not reject the null hypothesis, the estimated model is NO GOOD (the regressors you introduced have no explanatory power for the dependent variable, y).
- The "variability" of the estimated model is then purely random, not in any way different from the "variability" of the residuals.
- F is the ratio of the measures of variability of the two components; the worse the model is, the closer F will be to 1.
- The p-value must be smaller than alpha to reject the null hypothesis, which states that all coefficients are equal to 0.
command: reg dependent_variable independent_variable ...
DIFFERENCE BETWEEN F-TEST AND T-TEST
- The F-test for overall significance compares the two models (do Education and Tenure taken together linearly affect wages?) —> it shows whether there is a linear relationship between all of the X variables considered together and Y. The general rule of thumb is: if F>4, we reject the null hypothesis (—> at least one independent variable affects Y).
- The t-test (e.g. on Education) compares these two models: is there a significant marginal effect of Education on wages after Tenure has been controlled for?
- If at least one t-test is significant, then the F-test will be significant.
- It can also happen that no t-test is significant and yet the F-test is significant. This happens when the independent variables are highly correlated (multicollinearity).
MULTICOLLINEARITY
- It can affect the reliability of the regression results: standard errors will be large and t-statistics small (unstable coefficients), and coefficient signs may not match prior expectations.
- The Variance Inflation Factor (VIF) is an indicator of a multicollinearity problem.
- The VIF measures how much the standard errors of our estimated regression coefficients increase compared to the case in which our X had no relation with the other variables. We can see it as an indication of how much 'precision' we lose in our estimates.
Interpreting VIF:
- VIF values range from 1 to positive infinity. A VIF of 1 indicates no multicollinearity (perfect independence).
- Common thresholds for concern: VIF > 5 or 10 (serious multicollinearity problem).
- E.g.: VIF=2 and SE=6 for Xk mean that the standard error of the coefficient βk is 2 times larger than if Xk had no correlation with the other Xs (it would be 6/2=3 in the absence of correlation).
command: vif   (you have to run the command after the regression model)
CORRELATION
- Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. Use a correlation matrix to check for multicollinearity — ONLY FOR NUMERICAL VARIABLES.
command: pwcorr independent_variable1 independent_variable2, sig
- pwcorr drops the observations that have at least 1 missing value within each pair — better than corr.
- With the sig option, you ask Stata to display not only the correlation coefficients but also the significance levels (p-values) associated with each coefficient. These p-values indicate whether the correlations are statistically significant.
or command: corr independent_variable1 independent_variable2
- corr drops observations that have at least 1 missing value in any of the variables in the matrix.
or command: pwcorr independent_variable1 independent_variable2, obs
- obs adds the number of non-missing observations used for each entry of the correlation matrix.
or command: pwcorr independent_variable1 independent_variable2, obs casewise
- The casewise option tells Stata to use casewise (listwise) deletion: observations with a missing value in any of the variables are dropped from all pairs, so the result matches corr.
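A minimal sketch of the collinearity checks described above (wage-example variable names assumed):
command: pwcorr educ tenure, sig
command: reg wage educ tenure
command: vif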
F-TEST ON A SUBSET OF THE MODEL
- Used to understand whether two (or more) variables taken together are significant. This is used when you detect multicollinearity between some variables (a short sketch appears at the end of this block, after the assumptions recap).
- The test helps to decide whether you should take all the correlated variables out of the model, keep some in, or replace them with an average.
- If the p-value < alpha, reject the null hypothesis, so the variables are jointly statistically significant.
command: test independent_variable1 independent_variable2 ...
- Only put in the subset of independent variables you think are correlated, and run the command after estimating the regression model.
REMEDIES FOR MULTICOLLINEARITY
- Drop one (or more) collinear variables.
- Transform the collinear variables into a single new indicator (e.g., take the mean of the collinear variables).
- Retain the collinear variable but interpret the p-values with care.
- The best option depends on the aims of the analysis and on the research questions.
- If, after removing a variable, the other variables are all significant, this means that because of the collinearity we found non-significant results in the previous model.
HOW TO CALCULATE THE CHANGE IN AN INDEPENDENT VARIABLE from the regression model (slide example with code and a table: how much should 30280 charge for its hot dog to maintain its current market share?)
EVALUATING ASSUMPTIONS
Gauss-Markov theorem: the OLS estimators bj are Best Linear Unbiased Estimators (BLUE) for βj.
- Best: smallest variance (in the class of linear unbiased estimators); that is, the OLS estimators are the most efficient. For efficiency, assumption 4 must also be valid (as well as 1, 2, 6), and it is also necessary that the error terms are random variables with mean 0 (one of the conditions stated in assumption 3).
- Linear: they are a linear function of yi.
- Unbiased: E(bj) = βj (on average, the estimator hits the true parameter value). For linearity and unbiasedness only assumptions 1, 2 and 6 are required.
- If the error term not only has mean zero but is also normally distributed (assumption 3), then the OLS estimators are normally distributed. By the central limit theorem, even if the errors are not perfectly normal but the sample size is large, the OLS estimators are approximately normal. If the errors are not normal and the sample size is small, the OLS estimators are not normal and the standard errors can be biased. Inference in this case may be invalid (usually, standard errors are biased downward, so results appear more significant than they really are).
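Referring back to the F-test on a subset of the model at the top of this block, a minimal sketch (assuming educ and tenure are the covariates suspected of collinearity):
command: reg wage educ tenure female
command: test educ tenure
If the reported p-value is below alpha, educ and tenure are jointly significant.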
Assumption 1: linearity
- Check the shape of the relationship using a scatter plot.
command: scatter dependent_variable independent_variable || lfitci dependent_variable independent_variable   (linear fit line)
compare to
command: scatter dependent_variable independent_variable || qfit dependent_variable independent_variable   (quadratic fit line)
- If the relationship is not linear but quadratic, change the regression model:
command: gen educ_squared=educ^2
command: reg wage educ educ_squared
- Age and tenure are two common variables where you might need to add a squared term; for example, to square tenure you would generate tenure_squared in the same way.
Assumption 2: no serial correlation of errors
- We are not worried about serial correlation as long as we work with cross-sectional data.
- Generally there is no serial correlation when the residuals show no pattern around the x axis.
Assumption 3: normality of the error term
- The error term is normally distributed with mean 0.
command: predict new_variable_name
ex: predict y_hat
This command generates the predicted (fitted) values of the dependent variable (y) based on a regression model estimated earlier.
command: predict residuals, r
This command generates the residuals of the regression model: the differences between the actual values of the dependent variable (y) and the predicted values (y_hat) obtained from the regression model.
command: sum residuals, detail
- Check that skewness is between -1 and 1, kurtosis between 2 and 4, and that the mean is approximately 0 (a variance of 1 applies to standardized residuals).
command: kdensity residuals, normal
Assumption 4: homoskedasticity
- Errors have equal variance, σ2 (homoscedasticity); a residual plot with no fanning pattern shows homoskedasticity.
command: estat hettest   (run this after the regression command)
Remedies:
- Sometimes it is sufficient to take the log transformation of the dependent variable to regain efficiency: the log transformation smooths out the "excess" variability of the observations.
- Change the estimation method (e.g., Weighted Least Squares).
- Apply formulas to get correct standard errors (White heteroskedasticity-consistent standard errors).
- Solve the cause of the heteroscedasticity (non-linearity, omitted variable...).
command: generate ln_dependent_variable=ln(dependent_variable)
Assumption 6: no endogeneity
- The error term ε is not correlated with any of the explanatory variables x1, x2, ..., xk. In other words, there is no endogeneity.
- Often referred to as omitted variable bias: omitted relevant variables are variables correlated both with the dependent variable and with the independent variables included in the regression.
- This is very hard to detect; you can notice it if, after adding a variable, the other variables change sign.
Remedies:
- Add a correlated regressor that is not in the model.
- Identify a problem in the sample selection: if the sample selection is correlated with the variable of interest, narrow the generalizability of the interpretation to the chunk of the population that satisfies the sample restriction.
How to save statistics in Stata so they can be used later:
command: sum independent_variable
command: display r(min)
HOW TO INTERPRET DUMMY VARIABLES
- If the coefficient of the dummy variable female is 5, this means that for females the dependent variable is on average 5 units higher.
- A dummy variable "female" with a coefficient of -500 would mean that, on average, females earn $500 less than males, assuming all other variables in the model are held constant.
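A minimal sketch matching this interpretation (wage-example variable names assumed):
command: reg wage educ tenure female
The coefficient on female is then the average difference in hourly wage between women and men, holding educ and tenure fixed.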
How to introduce a categorical variable as dummy variables in a regression model
command: tabulate categorical_variable, gen(categorical_variable)
- This creates a dummy for each country: they will all be called cntry_* where * is the alphabetical rank of the country in the variable cntry (cntry_1 has value 1 for Denmark and 0 otherwise, cntry_2 has value 1 for Italy and 0 otherwise, etc.).
command: describe categorical_variable*   (to check it worked)
command: regress dependent_variable independent_continuous_variables categorical_variable1 categorical_variable2 [w = overallwght]
- This includes a weight; in this case level 3 of the categorical variable is the reference category.
HOW TO CREATE A DUMMY VARIABLE
command: gen variable_name = 0
command: replace variable_name = 1 if variable=="x"
- Replace the new variable with one if another variable equals a certain value. Example:
gen cee = 0
replace cee = 1 if cntry=="Poland"
replace cee = 1 if cntry=="Hungary"
replace cee = 1 if cntry=="Czech Republic"
- cee is now a dummy variable taking value 1 if the country is in Central and Eastern Europe and 0 otherwise.
- Example: generate high_educ=(educ>12). This dummy is equal to 1 for values of education above 12 and equal to 0 for values of education of 12 or below. The coefficient for high_educ is equal to 1.942451: in other words, people with more than 12 years of education have, on average, an hourly wage 1.94 dollars higher than people with at most 12 years of education.
HOW TO ADD A DATASET TO AN EXISTING DATASET AND COMBINE THEM
command: append using "filepath"
example: append using "/Users/angelici/Dropbox/3. Corsi Bocconi/30280 - Applications For Management/2023-24/STATA/ESS6_Lab_East.dta"
DUMMY VARIABLES
- A dummy variable is a variable that takes only a value of 0 or 1 to indicate the absence or presence of some categorical effect that can shift the outcome.
Data analyses section - types of variables for dependency models
- Dependent variable (denoted Y): a variable that is to be predicted or that changes in response to an intervention; the value of this variable depends on the values of other variables. The outcome to be changed (e.g., skills, social conditions) is a dependent variable (Y).
- Independent variable (denoted X): a variable that is used to predict values of another variable, or that records experimental categories such as participant/non-participant. The intervention is an independent variable (X).
- Confounding variable: a variable that might interfere with or 'confound' the relationship between the independent and the dependent variable.
Background section - common elements of a problem definition
- Research question
- Selection of a theory or other basis for the definition of the specific study: an in-depth literature search has to be run on "recognized sources"
- Clear identification of the unit of analysis
- Clear identification of the fundamental concepts to be studied
- Expected or proposed relationships between the concepts
Immigration example:
- Respondents from Poland are more favourable (because the sign is positive) to immigrants from different race/ethnic groups compared to respondents from Great Britain, by 0.015 on a scale from 0 to 1. The hypothesis test performed here is on the PL coefficient: it is not statistically significant, because the p-value of 0.152 is larger than the threshold level of 0.05. The answer is thus no, attitudes towards immigrants do not differ in Great Britain compared to Poland.
- To answer the next question we regress the immigration dummy on lrscale and interpret the coefficient.
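A possible sketch of that regression (assuming the Poland dummy is called PL, as in the coefficient discussed above, and using the overall weight defined earlier):
command: regress immigration_dummy lrscale PL [w=overallwght]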
We keep the country dummy in the regression to test this relationship while taking into account differences across countries. A one-unit increase in political orientation, i.e. moving from left to right by one unit on a scale from 0 to 10, reduces positive attitudes towards different immigrants by 0.003. The answer is no, because this reduction is not statistically different from zero (p-value>0.05).
STATA allows you to speed up this process by using a special notation: you specify whether the variables involved are categorical (as in the case of a dummy variable, indicated with the prefix i.) or continuous (as in the case of the lrscale variable, indicated with the prefix c.) and add the symbol ## between the variables you want to interact.
command: reg immigration_dummy i.cntry_num##c.lrscale [w=overallwght]
Factor analysis III: factor extraction and rotation
- Extraction of factors (equivalent to fitting the model)
- Interpretation of the factors, including rotation of the factors to assist interpretation
The two main analytical stages of FA are:
1. Extraction stage - select the factor extraction method, produce an "initial solution", decide how many factors to extract
2. Rotation stage - we give an interpretation to the factors
Kaiser's rule: retain factors with an eigenvalue greater than one; if a factor's eigenvalue is less than one it does not pay off to keep it, because it explains less variance than a single original variable would.
Step 2: Extraction - common variance or communality
- The total variance of a variable is split into the sum of:
o Common variance: the variance of a variable that is shared with all the other variables in the analysis. A variable's communality is the estimate of its shared or common variance among the variables as represented by the derived factors.
o Unique variance (specific plus error variance): the part of the variance that can be attributed to the single variable.
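A minimal STATA sketch of the two FA stages (the variable names item1, item2, item3 are placeholders; the mineigen(1) option applies Kaiser's rule):
command: factor item1 item2 item3, mineigen(1)
command: screeplot
command: rotate
rotate applies an orthogonal (varimax) rotation by default, to help interpret the factors.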