Using the Harmonized CHARLS
A sample analysis for the HRS Summer Workshop

This document provides users with an example of how to perform some simple descriptive analyses with the Harmonized China Health and Retirement Longitudinal Study (Harmonized CHARLS) data using Stata.

Research question: Are there gender differences in cognition, and does this difference vary across cohorts?

Analysis parameters: Our analysis will examine cognition scores for both men and women and examine the gender difference across age categories.

Part A – Measuring cognition

CHARLS surveys cognition using several different measures. For this analysis we will consider two word recall measures. The first is a respondent's ability to immediately repeat, in any order, ten Chinese nouns just read to them (immediate word recall). The second is the respondent's ability to recall the same list of words 4 minutes later (delayed word recall). The Harmonized CHARLS includes a summary score of these two measures, which is the sum of the number of words the respondent recalled in the immediate word recall and in the delayed word recall, so it ranges from 0 to 20. This summary score is stored in the Harmonized CHARLS variable r1tr20. Information about the distribution of this variable can be found in the Harmonized CHARLS codebook or by using the sum command in Stata:

    sum r1tr20

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          r1tr20 |      14294    7.148244    3.433539          0         20

Part B – Measuring gender and age

The Harmonized CHARLS stores gender in the variable ragender. The coding of this variable can be found in the Harmonized CHARLS codebook or by using the tab command in Stata:

    tab ragender, m

    ragender:R |
        Gender |      Freq.     Percent        Cum.
    -----------+-----------------------------------
        1.male |      8,471       47.85       47.85
      2.female |      9,221       52.08       99.93
         .d:dk |          2        0.01       99.94
    .m:missing |          8        0.05       99.98
     .r:refuse |          3        0.02      100.00
    -----------+-----------------------------------
         Total |     17,705      100.00

Age is stored in the Harmonized CHARLS variable r1agey.
Let's look at a brief summary of the ages sampled in CHARLS using the sum command in Stata:

    sum r1agey, d

                      r1agey:w1 R age in years
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           43             22
     5%           46             22
    10%           47             25       Obs               17682
    25%           51             31       Sum of Wgt.       17682

    50%           58                      Mean           59.22611
                            Largest       Std. Dev.      10.16237
    75%           66            100
    90%           74            101       Variance       103.2738
    95%           78            101       Skewness       .5764451
    99%           85            102       Kurtosis       2.911414

For this analysis let's consider 5-year age groups. To build a categorical variable of age groups we can use the egen cut command:

    egen r1agecat = cut(r1agey), at(45,50,55,60,65,70,75,103) icodes
    tab r1agecat, m

       r1agecat |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      3,415       19.29       19.29
              1 |      2,568       14.50       33.79
              2 |      3,543       20.01       53.80
              3 |      2,917       16.48       70.28
              4 |      1,892       10.69       80.97
              5 |      1,385        7.82       88.79
              6 |      1,609        9.09       97.88
              . |        376        2.12      100.00
    ------------+-----------------------------------
          Total |     17,705      100.00

With the icodes option the groups are coded 0 (ages 45 to 49) through 6 (ages 75 and older). Respondents younger than 45, or with missing age, fall outside the cut points and are set to missing.

Part C – Applying CHARLS weights

Like the HRS, CHARLS provides weights to use in producing population estimates. For this analysis, we use individual-level weights which account for both household and individual non-response. This weight is recorded in the Harmonized CHARLS variable r1wtrespb. To set this weight for our analysis we will use the svyset command in Stata:

    svyset [pw=r1wtrespb]

Once we have set the survey weight, we can prefix our commands with svy to have Stata produce weighted estimates.

Part D – Analyzing cognition across genders

Using the svy: mean command, we first estimate our cognition scores for the entire Chinese population and then separately for men and women.

    svy: mean r1tr20

    Survey: Mean estimation

    Number of strata =       1        Number of obs    =      14294
    Number of PSUs   =   14294        Population size  =  439988435
                                      Design df        =      14293

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          r1tr20 |   7.320842   .0456508       7.23136    7.410323
    --------------------------------------------------------------

    svy, subpop(if ragender==1): mean r1tr20

    Survey: Mean estimation

    Number of strata =       1        Number of obs    =      16004
    Number of PSUs   =   16004        Population size  =  501824823
                                      Subpop. no. obs  =       6770
                                      Subpop. size     =  207192844
                                      Design df        =      16003

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          r1tr20 |   7.428225   .0682556      7.294436    7.562014
    --------------------------------------------------------------

    svy, subpop(if ragender==2): mean r1tr20

    Survey: Mean estimation

    Number of strata =       1        Number of obs    =      15997
    Number of PSUs   =   15997        Population size  =  500334685
                                      Subpop. no. obs  =       7513
                                      Subpop. size     =  232416974
                                      Design df        =      15996

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          r1tr20 |   7.226687   .0611051      7.106914     7.34646
    --------------------------------------------------------------

It does appear that men on average have a slightly higher cognition score than women (7.43 versus 7.23). We can also test whether the difference between cognition scores for men and women is statistically significant:

    svy: mean r1tr20, over(ragender)

    Survey: Mean estimation

    Number of strata =       1        Number of obs    =      14283
    Number of PSUs   =   14283        Population size  =  439609818
                                      Design df        =      14282

        _subpop_1: ragender = 1.male
        _subpop_2: ragender = 2.female

    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    r1tr20       |
       _subpop_1 |   7.428225   .0682559      7.294435    7.562015
       _subpop_2 |   7.226687   .0611054      7.106912    7.346461
    --------------------------------------------------------------

    test [r1tr20]_subpop_1=[r1tr20]_subpop_2

    Adjusted Wald test

     ( 1)  [r1tr20]_subpop_1 - [r1tr20]_subpop_2 = 0

           F(  1, 14282) =    4.84
                Prob > F =    0.0278

Using an adjusted Wald test of the difference between our mean estimates for men and women, we see that the difference in cognition between men and women is statistically significant at the 5% level (p = 0.0278), though not at the 1% level.

Part E – Analyzing cognition across genders and age categories

We also wanted to consider whether there might be a cohort effect, that is, whether the size of the gender difference varies across age groups.
Again let's examine this using the svy: mean command:

    svy: mean r1tr20, over(ragender r1agecat)

    Survey: Mean estimation

    Number of strata =       1        Number of obs    =      13995
    Number of PSUs   =   13995        Population size  =  430382865
                                      Design df        =      13994

             Over: ragender r1agecat
        _subpop_1: 1.male 0
        _subpop_2: 1.male 1
        _subpop_3: 1.male 2
        _subpop_4: 1.male 3
        _subpop_5: 1.male 4
        _subpop_6: 1.male 5
        _subpop_7: 1.male 6
        _subpop_8: 2.female 0
        _subpop_9: 2.female 1
       _subpop_10: 2.female 2
       _subpop_11: 2.female 3
       _subpop_12: 2.female 4
       _subpop_13: 2.female 5
       _subpop_14: 2.female 6

    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    r1tr20       |
       _subpop_1 |   8.580753   .1344866      8.317141    8.844365
       _subpop_2 |   7.669246   .1806594      7.315129    8.023362
       _subpop_3 |   7.609232   .1217161      7.370652    7.847812
       _subpop_4 |   7.215394   .0984498       7.02242    7.408369
       _subpop_5 |   7.441968   .3388299      6.777817     8.10612
       _subpop_6 |    6.23575    .155438      5.931071     6.54043
       _subpop_7 |   5.615598   .3246259      4.979288    6.251908
       _subpop_8 |   8.460031   .1239588      8.217055    8.703007
       _subpop_9 |   7.512848   .1533828      7.212198    7.813499
      _subpop_10 |   7.414409   .1705229      7.080162    7.748657
      _subpop_11 |   7.030072   .1305285      6.774218    7.285925
      _subpop_12 |   6.597254   .1740406      6.256111    6.938397
      _subpop_13 |   5.967469   .1915682       5.59197    6.342969
      _subpop_14 |   4.442941   .1512499      4.146471    4.739411
    --------------------------------------------------------------

It does appear that the gender difference in our youngest cohort is smaller than the gender difference in our oldest cohort.
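To put numbers on the two gaps being compared, one option is to run lincom right after the svy: mean command above. This is only a minimal sketch: it assumes the over() groups are named _subpop_1 through _subpop_14 exactly as in the output shown here. More recent Stata releases may label these groups differently, in which case adding the coeflegend option to the mean command should show the names to use.

```stata
* Assumes -svy: mean r1tr20, over(ragender r1agecat)- was the last estimation
* command and that the groups are labelled _subpop_1 ... _subpop_14 as above

* Male minus female gap in the youngest age group (r1agecat == 0)
lincom [r1tr20]_subpop_1 - [r1tr20]_subpop_8

* Male minus female gap in the oldest age group (r1agecat == 6)
lincom [r1tr20]_subpop_7 - [r1tr20]_subpop_14
```

From the table, the point estimates of these gaps are 8.580753 - 8.460031 = 0.120722 words for the youngest group and 5.615598 - 4.442941 = 1.172657 words for the oldest group.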
Let's now test the statistical significance of both:

    test [r1tr20]_subpop_1=[r1tr20]_subpop_8

    Adjusted Wald test

     ( 1)  [r1tr20]_subpop_1 - [r1tr20]_subpop_8 = 0

           F(  1, 13994) =    0.44
                Prob > F =    0.5092

    test [r1tr20]_subpop_7=[r1tr20]_subpop_14

    Adjusted Wald test

     ( 1)  [r1tr20]_subpop_7 - [r1tr20]_subpop_14 = 0

           F(  1, 13994) =   10.72
                Prob > F =    0.0011

Using an adjusted Wald test, we find no statistically significant difference in cognition between men and women in our youngest cohort (p = 0.5092), but we do find a statistically significant difference in cognition between men and women in our oldest cohort (p = 0.0011).
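As a rough cross-check on the two F statistics above, the Wald statistic for a difference of two means can be approximated by the squared ratio of the difference to its standard error. The sketch below assumes the male and female estimates are approximately uncorrelated, so that the variance of the difference is the sum of the two squared standard errors. That assumption is ours and is not stated in the Stata output. The sketch uses only the means and standard errors reported in the table in Part E.

```stata
* Approximate F = (difference / SE of difference)^2, where
* SE of difference = sqrt(se_male^2 + se_female^2) under the
* no-covariance assumption described above

* Youngest age group (_subpop_1 vs _subpop_8)
display ((8.580753 - 8.460031) / sqrt(.1344866^2 + .1239588^2))^2

* Oldest age group (_subpop_7 vs _subpop_14)
display ((5.615598 - 4.442941) / sqrt(.3246259^2 + .1512499^2))^2
```

These expressions evaluate to roughly 0.44 and 10.72, in line with the F statistics reported by the test command.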