A model for Happiness

ECON 7590
Carlos Gámez
10/21/2009
Using the GSS 2008 data, we develop a statistical model for happiness. Variables are selected and tested in order
to obtain an appropriate model that can predict our variable of interest. Stata is used for the statistical
computations and some graphics. The SDA tool on the GSS website is also used.
We try to build a suitable model for the variable happy. We start by keeping only the variables used in the
project, which are presented and described below. Next we delete the observations with missing values on any
of those variables, so that we are left with complete observations only.
Filename used: gss2008.dta (Stata file)
PART I: BRIEF DESCRIPTION OF THE VARIABLES
Happy: General Happiness. We create a new variable d_happy that takes the value 1 if the respondent claims to
be very happy or pretty happy, and the value 0 if the answer was not too happy.
After browsing the GSS 2008 codebook, we gather a small set of focus variables and doubtful variables:
Variables that we are (subjectively) confident about
(a) degree: Highest degree earned. We split this variable by whether the respondent is college educated. The
variable college takes the value 1 if the respondent has a college degree and 0 otherwise.
(b) marital: Marital status. We dichotomize it into a new variable, married, which takes the value 1 if the
respondent is married and 0 otherwise.
(c) satfin: Satisfaction with financial situation. The new variable satfin2 takes the value 1 if the respondent
claims to be satisfied with his/her financial situation and 0 if the respondent is not at all satisfied.
(d) satjob: Job satisfaction. The variable satjob2 takes the value 1 if the respondent reports moderate or
strong job satisfaction, and 0 if the respondent says that he/she is a little or very dissatisfied with the job or
housework.
(e) health: Respondent's assessment of own health. The variable health2 takes the value 1 if the respondent's
health is good or excellent and 0 otherwise.
Variables that might be effective (Doubtful variables)
(a) sex: Respondent's sex. The variable gender takes the value 1 if male and 0 if female.
(b) wrkstat: Working status (full time, part time, temporary, laid off, etc.). The variable wrkstat2 is 1 if
working full time and 0 otherwise.
(c) age: Age of the respondent (ranges from 18 to 89 or more).
(d) class: Subjective social class identification (lower, working, middle, upper). The new variable class2 is 1
if class is middle or upper and 0 otherwise.
(e) rincome: Respondent's income. The highest bracket corresponds to a respondent's income of $25000 or higher.
The variable rincome2 takes the value 1 if income is in that bracket and 0 otherwise.
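The recoding described above can be sketched as follows. The report does its recoding in Stata; this Python version is only an illustration. The response labels for happy come from the description above, while the labels used for marital and health, the helper name dichotomize, and the three sample respondents are assumptions made for the example and may not match the GSS codebook.

```python
# Illustrative sketch of the 0/1 recoding; labels and data are assumed, not GSS values.

def dichotomize(value, positive_values):
    """Return 1 if value is one of positive_values, 0 if not, None if missing."""
    if value is None:
        return None  # keep missing values missing so they can be dropped later
    return 1 if value in positive_values else 0

# Hypothetical respondents (not real GSS records).
respondents = [
    {"happy": "very happy", "marital": "married", "health": "good"},
    {"happy": "not too happy", "marital": "divorced", "health": "fair"},
    {"happy": "pretty happy", "marital": "never married", "health": None},
]

for r in respondents:
    r["d_happy"] = dichotomize(r["happy"], {"very happy", "pretty happy"})
    r["married"] = dichotomize(r["marital"], {"married"})
    r["health2"] = dichotomize(r["health"], {"good", "excellent"})

recoded = [(r["d_happy"], r["married"], r["health2"]) for r in respondents]
print(recoded)
```

Passing the set of "positive" answers as an argument keeps a single helper usable for every dichotomized variable in the list.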
PART II : Manipulation of the data
After deleting observations with missing values, we are left with 788 of the more than 1200 original
observations. The quality of the data is a priority for our project. Keeping only the variables to be used also
saves a lot of memory.
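Dropping incomplete observations (listwise deletion) can be sketched as below. This is a minimal Python illustration, not the Stata code used for the report; the function name complete_cases and the three sample rows are hypothetical.

```python
# Illustrative listwise deletion: keep a row only if every model variable is present.

def complete_cases(rows, variables):
    """Return the rows that have a non-missing value for every listed variable."""
    return [row for row in rows
            if all(row.get(v) is not None for v in variables)]

# Hypothetical rows; None or an absent key stands for a missing answer.
rows = [
    {"d_happy": 1, "married": 1, "health2": 1},
    {"d_happy": 0, "married": None, "health2": 0},
    {"d_happy": 1, "married": 0},  # health2 was never recorded
]

kept = complete_cases(rows, ["d_happy", "married", "health2"])
print(len(kept), "of", len(rows), "observations kept")
```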
Another remark is that only logistic regression is used, since our response variable is binary. A linear
regression would give doubtful results because assumptions such as homoscedasticity are not satisfied.
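To see why homoscedasticity fails, it may help to recall a standard property of a binary response (stated here as background, it is not derived in the original analysis). Writing p(x) for the probability that the response equals 1 given the regressors x:

```latex
\[
\operatorname{Var}(y \mid x) = p(x)\,\bigl(1 - p(x)\bigr),
\qquad p(x) = \Pr(y = 1 \mid x).
\]
```

The variance depends on x through p(x), so it cannot be constant unless p(x) itself is constant. A fitted straight line can also produce predicted probabilities outside the interval [0, 1].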
Whenever logistic regression is referenced, we should think of the linear predictor
$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \cdots + \beta_n x_n$$
whose value is mapped to a probability by the logistic function
$$f(z) = \frac{1}{1 + e^{-z}}$$
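The two equations above can be evaluated directly. The following Python sketch is only an illustration: the coefficient values are made up for the example and are not estimates from the GSS data, and the function names are our own.

```python
import math

def linear_predictor(beta0, betas, xs):
    """z = beta0 + beta1*x1 + ... + betan*xn."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

def logistic(z):
    """f(z) = 1 / (1 + exp(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Made-up coefficients for two 0/1 regressors, both equal to 1 for this respondent.
z = linear_predictor(-1.0, [0.5, 1.5], [1, 1])  # -1.0 + 0.5 + 1.5 = 1.0
p = logistic(z)
print(z, round(p, 4))
```

A linear predictor of 0 corresponds to a predicted probability of exactly 0.5, and larger values of z move the probability toward 1.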
PART III : Model Using Focus Variables
A logistic regression is run on the focus variables to see whether they can indeed predict happiness. The
following model is applied:
$$z = \beta_0 + \beta_1\,\text{college} + \beta_2\,\text{married} + \beta_3\,\text{satfin2} + \beta_4\,\text{satjob2} + \beta_5\,\text{health2}$$
Results from STATA:
From the above results we can see that convergence was rather fast, taking only 4 iterations. The most
important result is that all our variables have positive coefficients and are statistically significant, as we
were expecting. So the variables college, married, satfin2, satjob2 and health2 are all statistically
significant explanatory variables for happiness. Because of the way these variables have been constructed, the
interpretation is rather straightforward. For example, the 95% confidence interval for satjob2 lies entirely on
the positive real line, which means that job satisfaction is indeed an indicator of happiness.
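The interpretation of a 0/1 regressor can be made explicit with the usual odds-ratio identity. This is a standard consequence of the logistic model, added here as a reading aid, and it assumes the reported estimates are coefficients on the log-odds scale rather than odds ratios. With p = f(z):

```latex
\[
\log\frac{p}{1-p} = z,
\qquad
\frac{\operatorname{odds}(x_k = 1)}{\operatorname{odds}(x_k = 0)} = e^{\beta_k}
\quad \text{(all other regressors held fixed)}.
\]
```

A positive coefficient therefore corresponds to an odds ratio above 1, and a confidence interval for the coefficient that lies entirely above zero corresponds to an odds-ratio interval that lies entirely above 1.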
Our models are judged by how well they predict happiness, using the estat classification command in Stata. We
try several cutoff values and keep the one that maximizes the percentage of correctly classified observations.
This model correctly predicted 86.29% of the responses with a cutoff value of 0.57.
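The cutoff search can be sketched as below. This Python illustration is not the Stata procedure itself: it only mimics the idea of classifying an observation as happy when its predicted probability is at least the cutoff and then counting the matches. The six outcomes and predicted probabilities are invented for the example.

```python
# Illustrative cutoff search on made-up data (not GSS results).

def percent_correct(y_true, p_hat, cutoff):
    """Percentage of observations whose 0/1 prediction matches the outcome."""
    hits = sum(1 for y, p in zip(y_true, p_hat)
               if (1 if p >= cutoff else 0) == y)
    return 100.0 * hits / len(y_true)

def best_cutoff(y_true, p_hat, cutoffs):
    """Return the first cutoff that gives the highest percentage correct."""
    return max(cutoffs, key=lambda c: percent_correct(y_true, p_hat, c))

y_true = [1, 1, 1, 0, 1, 0]                   # hypothetical observed d_happy
p_hat = [0.90, 0.80, 0.55, 0.52, 0.70, 0.30]  # hypothetical predicted probabilities

cutoffs = [c / 100 for c in range(30, 71)]    # 0.30, 0.31, ..., 0.70
best = best_cutoff(y_true, p_hat, cutoffs)
print(best, percent_correct(y_true, p_hat, best))
```

Because the cutoff is chosen on the same observations used to fit the model, the resulting percentage tends to be somewhat optimistic.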
PART IV : Model Using Doubtful Variables
Now we turn to the doubtful variables. We will study whether or not these variables can predict happiness
with a % confidence level. Our variables are: gender, wrkstat2, age, class2 and rincome2. So the model will have
the following form:
$$z = \beta_0 + \beta_1\,\text{gender} + \beta_2\,\text{wrkstat2} + \beta_3\,\text{age} + \beta_4\,\text{class2} + \beta_5\,\text{rincome2}$$
Results from STATA using logistic regression:
Based on the point estimates, one would say that being male makes people unhappier, while having a full-time
job and being older make people happier. Looking closely at the confidence intervals, however, we realize that
these variables are not statistically significant for predicting happiness. The variables class2 and rincome2
are statistically significant and have positive coefficients. With a cutoff of 0.5, this model correctly
predicts 85.15% of the responses.
PART V : Merge Statistically Significant Variables
We have found that, in addition to the focus variables, we should add the variables class2 and rincome2.
So let's run another logistic regression on these variables and analyze the results.
The model now becomes:
Model A
$$z = \beta_0 + \beta_1\,\text{college} + \beta_2\,\text{married} + \beta_3\,\text{satfin2} + \beta_4\,\text{satjob2} + \beta_5\,\text{health2} + \beta_6\,\text{class2} + \beta_7\,\text{rincome2}$$
Results from STATA:
The results show that in our merged model the variables college, satjob2 and rincome2 are no longer
statistically significant. All other variables remain statistically significant. However, with a cutoff value
of 0.54 we correctly predict 86.93% of the responses, which is a small improvement over all previous attempts.
Finally, to make sure that the variables married, satfin2, health2 and class2 all have a statistically
significant effect on happiness, we re-run our logistic regression with only these variables and analyze the
results:
Model B
$$z = \beta_0 + \beta_1\,\text{class2} + \beta_2\,\text{married} + \beta_3\,\text{satfin2} + \beta_4\,\text{health2}$$
With a cutoff value of 0.5, we find that 86.42% of the observations are correctly classified. Notice that the
percentage decreases only slightly. Taking into account that we removed 3 variables, it is worth noticing that
the model is much simpler and still predicts happiness nearly as well as our previous model.
PART VI : Using the Receiver Operating Characteristic (ROC) curve to analyze models A and B
In a Receiver Operating Characteristic (ROC) curve, the true positive rate (sensitivity) is plotted as a
function of the false positive rate (100 - specificity) for different cutoff points. Each point on the ROC plot
represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect
discrimination (no overlap in the two distributions) has an ROC plot that passes through the upper left corner
(100% sensitivity, 100% specificity). Therefore, the closer the ROC plot is to the upper left corner, the
higher the overall accuracy of the test (Zweig & Campbell, 1993).
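The construction of an ROC curve and the area under it can be sketched as follows. This Python illustration uses invented outcomes and predicted probabilities, not the output of Models A or B, and the function names are our own. Rates are expressed as fractions between 0 and 1 rather than percentages, and the area is computed with the trapezoidal rule.

```python
# Illustrative ROC curve and area under the curve on made-up data.

def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing classified positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [1, 1, 0, 1, 0, 0]              # hypothetical observed outcomes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical predicted probabilities

points = roc_points(y_true, scores)
print(round(auc(points), 4))
```

An area of 1 corresponds to the perfect discrimination described above, while an area of 0.5 corresponds to a curve along the diagonal, which is no better than guessing.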
[Figure: ROC curve for Model A]
[Figure: ROC curve for Model B]
PART VII : Conclusions
We ended with two good models, each with advantages and disadvantages. Model A has a higher percentage of
correct classification and a larger area under the ROC curve. Model B has only 4 explanatory variables, making
it a simpler and more explicit model while still having a high level of prediction and area under the ROC
curve. Both models have an area under the ROC curve above 0.80, which means that both models are considered
“good”.
As a final decision, I choose Model B due to its simplicity and high rate of prediction. Having 3 fewer
variables while keeping the characteristics of a good model gives Model B an edge over Model A.
I conclude that a person who is married is more likely to be happy than a person with any other marital status.
Financial satisfaction also has a clear impact on happiness, as does having good or excellent health. Finally,
social class is a strong indicator of the happiness of an individual. People in the middle or upper class are
more likely to be happy than those in the working or lower class.