A Model for Happiness
ECON 7590
Carlos Gámez
10/21/2009

Using the GSS 2008, we develop a statistical model for happiness. Variables are selected and tested in order to obtain an appropriate model that can predict our variable of interest. Stata is used for the statistical computations and some graphics. The SDA interface on the GSS website is also used.

Based on the GSS 2008 data, we attempt to build a suitable model for the variable happy. We first keep only the variables used in the project, which are presented and described below. Next we delete the observations with missing values, so that we are left with complete observations only.

Filename used: gss2008.dta (Stata file)

PART I: BRIEF DESCRIPTION OF THE VARIABLES

Happy: General happiness. We create a new variable d_happy that takes the value 1 if the respondent claims to be very happy or pretty happy, and the value 0 if the answer was not too happy.

After browsing the codebook of the GSS 2008 data, a small set of focus and doubtful variables is gathered:

Variables that we are (subjectively) confident about

(a) degree: Highest degree earned. We split this variable into college educated or not. The variable college is generated, taking the value 1 if the respondent has a college degree and 0 otherwise.

(b) marital: Marital status. We dichotomize the variable into a new one, married, which takes the value 1 if the respondent is married and 0 otherwise.

(c) satfin: Satisfaction with financial situation. The variable satfin2 is created; it takes the value 1 if the respondent claims to be satisfied with his/her financial situation and 0 if the respondent is not at all satisfied.

(f) satjob: Job satisfaction. The variable satjob2 is created, taking the value 1 if the respondent reports moderate or strong job satisfaction.
satjob2 takes the value 0 if the respondent says that he/she is a little or very dissatisfied with the job or housework.

(g) health: Respondent's assessment of health. A variable health2 is created, taking the value 1 if the respondent's health is good or excellent and 0 otherwise.

Variables that might be effective (doubtful variables)

(a) sex: Respondent's sex. The variable gender takes the value 1 if male and 0 if female.

(b) wrkstat: Working status (full time, part time, temporary, laid off, etc.). The variable wrkstat2 is 1 if the respondent works full time and 0 otherwise.

(c) age: Age of the respondent (ranges from 18 to 89 or more).

(d) class: Subjective social class identification (lower class, working class, middle class, upper class). The new variable class2 is 1 if class is middle or upper class and 0 otherwise.

(e) rincome: Respondent's income. The highest category corresponds to a respondent income of $25,000 or higher. The variable rincome2 takes the value 1 if income is $25,000 or higher and 0 otherwise.

PART II: MANIPULATION OF THE DATA

After deleting observations with missing values, we are left with 788 out of the more than 1200 original observations. The quality of the data is a priority for our project. Keeping only the variables to be used also saves a lot of RAM.

Another remark is that only logistic regression is used, since our response variable is binary; a linear regression would give doubtful results because conditions such as homoscedasticity are not satisfied. Whenever logistic regression is referenced, we should think of the model

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \cdots + \beta_k x_k$$

fitted to the equation

$$f(z) = \frac{1}{1 + e^{-z}}$$

PART III: MODEL USING FOCUS VARIABLES

A logistic regression is run on the focus variables to check whether they can indeed predict happiness.
The following model is applied:

$$z = \beta_0 + \beta_1\,\mathit{college} + \beta_2\,\mathit{married} + \beta_3\,\mathit{satfin2} + \beta_4\,\mathit{satjob2} + \beta_5\,\mathit{health2}$$

Results from Stata:

From the above results we can see that convergence was rather fast, taking only 4 iterations. The most important result is that all our variables are positively associated with happiness and statistically significant, as we expected. So the variables college, married, satfin2, satjob2 and health2 are all statistically significant explanatory variables for happiness. Because of the way these variables have been constructed, the interpretation is rather straightforward. For example, the 95% confidence interval for the satjob2 coefficient lies entirely on the positive real line, which means that job satisfaction is indeed a positive indicator of happiness.

Our models are judged by how well they predict happiness, using the estat classification command in Stata and trying several cutoff values to find the one that maximizes the percentage of correctly classified responses. This model correctly predicted 86.29% of the responses with a cutoff value of 0.57.

PART IV: MODEL USING DOUBTFUL VARIABLES

Now we turn to the doubtful variables and study whether these variables predict happiness with statistical significance. Our variables are: gender, wrkstat2, age, class2 and rincome2. So the model has the following form:

$$z = \beta_0 + \beta_1\,\mathit{gender} + \beta_2\,\mathit{wrkstat2} + \beta_3\,\mathit{age} + \beta_4\,\mathit{class2} + \beta_5\,\mathit{rincome2}$$

Results from Stata using logistic regression:

Based on the point estimates, one would say that being male makes people less happy, while having a full-time job and being older make people happier. Looking closely at the confidence intervals, however, we see that these three variables are not statistically significant for predicting happiness. The variables class2 and rincome2 are statistically significant and positively associated with happiness. With a cutoff of 0.5, we find that this model correctly predicts 85.15% of the responses.
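The percentage of correctly predicted responses quoted in Parts III and IV can be illustrated outside Stata. The following is a minimal Python sketch, not the Stata code used for this project: the coefficients, regressors and outcomes are made-up values chosen only for illustration, and the function names are hypothetical. It assumes that an observation is classified as happy when its predicted probability is at least the cutoff.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(beta0, betas, xs):
    """Probability implied by z = beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return logistic(z)

def percent_correct(probs, outcomes, cutoff):
    """Percentage of observations classified correctly.

    An observation is classified as 1 when its probability is >= cutoff
    (assumed convention for this sketch).
    """
    hits = sum((p >= cutoff) == (y == 1) for p, y in zip(probs, outcomes))
    return 100.0 * hits / len(outcomes)

# Hypothetical coefficients and data, NOT estimates from the GSS 2008.
beta0 = -1.0
betas = [0.5, 1.0]                       # two made-up dummy regressors
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]  # values of the two dummies
outcomes = [0, 1, 1, 1]                  # made-up values of d_happy

probs = [predicted_probability(beta0, betas, r) for r in rows]
print(percent_correct(probs, outcomes, cutoff=0.5))  # prints 75.0
print(percent_correct(probs, outcomes, cutoff=0.3))  # prints 100.0
```

Changing the cutoff changes which observations are classified as happy, which is why the percentage of correct predictions is reported together with its cutoff value.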
PART V: MERGE STATISTICALLY SIGNIFICANT VARIABLES

We have already found that, in addition to the focus variables, we should add the variables class2 and rincome2. So let us run another logistic regression over these variables and analyze the results. The model now becomes:

Model A

$$z = \beta_0 + \beta_1\,\mathit{college} + \beta_2\,\mathit{married} + \beta_3\,\mathit{satfin2} + \beta_4\,\mathit{satjob2} + \beta_5\,\mathit{health2} + \beta_6\,\mathit{class2} + \beta_7\,\mathit{rincome2}$$

Results from Stata:

The results show that in our merged model the variables college, satjob2 and rincome2 are no longer statistically significant. All other variables remained statistically significant. With a cutoff value of 0.54, the model correctly predicted 86.93% of the responses, which is a small improvement over all previous attempts.

Finally, to make sure that the variables married, satfin2, health2 and class2 all have a statistically significant impact on happiness, we re-run our logistic regression with these variables and analyze the results:

Model B

$$z = \beta_0 + \beta_1\,\mathit{class2} + \beta_2\,\mathit{married} + \beta_3\,\mathit{satfin2} + \beta_4\,\mathit{health2}$$

With a cutoff value of 0.5, we find that 86.42% of the values are correctly classified. The percentage decreases only slightly. Considering that we removed 3 variables, it is worth noting that the model is much simpler and still predicts happiness almost as well as our previous model.

PART VI: USING THE RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE TO ANALYZE MODELS A AND B

In a Receiver Operating Characteristic (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (100 - specificity) for different cutoff points. Each point on the ROC plot represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC plot that passes through the upper left corner (100% sensitivity, 100% specificity).
Therefore, the closer the ROC plot is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).

[Figure: ROC curves for Model A and Model B]

PART VII: CONCLUSIONS

We ended with two good models, each with advantages and disadvantages. Model A has a higher percentage of correct classification and a larger area under the ROC curve. Model B has only 4 explanatory variables, making it a simpler and more explicit model while still having a high level of prediction and a high area under the ROC curve. Both models have an area under the ROC curve above 0.80, which means that both models are considered "good".

As a final decision, I choose Model B because of its simplicity and high rate of prediction. Having 3 fewer variables while keeping the characteristics of a good model gives Model B an edge over Model A.

I conclude that a person who is married is more likely to be happy than a person with any other marital status. Financial satisfaction also has a clear impact on happiness, as does having good or excellent health. Finally, social class is a strong indicator of an individual's happiness. People in the middle or upper class are more likely to be happy than those in the working or lower class.
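As a supplement to the ROC comparison of Models A and B, the area under a ROC curve can be sketched in a few lines of Python. This is an illustrative sketch only, not the Stata procedure used for this project: the predicted probabilities and outcomes are made-up values, the function names are hypothetical, and the area is computed with the trapezoidal rule over one ROC point per distinct cutoff.

```python
def roc_points(probs, outcomes):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct cutoff, from the highest cutoff
    to the lowest, starting from the origin (0, 0).
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up predicted probabilities and outcomes, NOT GSS 2008 results.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]

auc = area_under_curve(roc_points(probs, outcomes))
print(round(auc, 3))  # prints 0.889
```

An area of 1 corresponds to a curve passing through the upper left corner, while an area of 0.5 corresponds to the diagonal, that is, a model that discriminates no better than chance.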