Statistics Exam
Bachelor of Public Health Science
Jørgen Holm Petersen
Jolene Masters Pedersen
December 2006
Statistics Exam FSV-Statistics

Introduction

The purpose of the exam is to analyze the association between the social class of the parents during the child's childhood and subsequent unemployment of the grown-up child, taking into consideration possible confounders and effect modifiers. Of particular interest is the dependence of unemployment on continuously measured intelligence. The analysis uses two different statistical methods: Mantel Haenszel analysis and multiple logistic regression.

The Data

The data are a segment of a much larger investigation of school pupils in 1968. The data were collected by interviewing the same individuals over a period of almost 25 years. The analysis includes the dependent binary variable unemployment, the primary independent variable social class of the family, and the secondary variables sex, intelligence test score, education, self rated health, education of parents, type of residence during childhood, and number of serious illnesses.

Descriptive Statistics

The study contains 3151 people. Of the study population, 70.6% have been unemployed for less than one year and 29.4% have been unemployed for a year or more. As illustrated in Figure 1, social class I (the highest social class) contains only 4.6% (145) of the study population, and social class V (the lowest social class) contains 19.9% (628). The largest share of people belongs to social class III, 37.4% (1179).

Jolene Lee Masters Pedersen Page 2 of 18 Statistics Exam FSV-Statistics

[Bar chart: percent of participants by social class of parents, soc.gr. I to soc.gr. V]
Figure 1: Social Class of Parents

Of the participants, 50.4% (1588) were boys and 49.6% (1563) were girls, so the population is almost equally distributed between the sexes.
A majority of the population, 76.8%, never suffered a serious illness, 19.0% suffered one serious illness, and only 4.2% suffered two serious illnesses. Most of the study participants, 70.3%, reported being very content with their health; 22.4% were content, 4.3% were not content, and 3.0% were very discontent.

Table 1 shows the participants' intelligence test scores. The lowest scorers, in the category -25 (a score of 25 or below), made up 11.4% (359) of the people. The highest scorers, scoring 41+, made up 32.6% (1027) of the study population.

Intelligence    Percent (frequency)
-25             11.4% (359)
26-30           13.0% (410)
31-35           20.2% (635)
36-40           22.8% (720)
41+             32.6% (1027)

Table 1: Intelligence Test Scores

Use descriptive statistics specifically for continuous variables, e.g. mean and standard deviation, histograms or box-plots.

Both the parents and the children reported their level of education. Figure 2 shows the educational level of the participants. The largest group, 38.8% (1222), were educated as craftsmen, only 6.5% (204) had a long education, and 19.7% (526) had no education.

[Bar chart: percent of participants by education (long education, medium long education, short education, craftsman etc., no education)]
Figure 2: Education levels of the Participants

[Bar chart: percent of parents by education (long education, medium education, craftsman, short education, very short education, no education)]
Figure 3: Educational level of the Parents

Figure 3 depicts the educational levels of the parents. The largest share of the parents, 41.5% (1089), were craftsmen. Many had no education, 31.6% (996), and only 6.0% (156) reported a long education.
For most variables about 15% of the values are missing, so the number of people included in the final analysis is smaller than it would be with complete data, but a large number of participants still remains.

The purpose of the descriptive section is to get an idea of who the studied individuals are.

Association between Social Class of the Parents during Childhood and Subsequent Unemployment

Table 2 illustrates the association between the social class of the parents during the subject's childhood and subsequent unemployment. In each social class more people have been unemployed for <1 year than for ≥1 year. Social class V, the lowest social class, had the largest percentage (35.6%) of persons unemployed for ≥1 year. Social class I, the highest social class, had the second highest percentage at 29.5%, but it should be kept in mind that there were very few people in this social class (145). In general, the tendency seems to be that the lower the social class, the higher the level of unemployment. The chi-squared test gives p=0.004, which points to an association between social class of the parents and subsequent unemployment. The null hypothesis for the chi-squared test is that the effect of interest is zero, i.e. that social class and unemployment are independent; it is rejected when the p-value is significant. The gamma value is also highly significant (p=0.002), pointing to an ordinal association in the data. The null hypothesis for gamma is the same as for chi-squared, that the effect of interest is zero.

Social class of parents    Unemployed <1 yr.      Unemployed ≥1 yr.
                           Percent (frequency)    Percent (frequency)
Soc.cl. I                  70.5% (86)             29.5% (36)
Soc.cl. II                 75.7% (193)            24.3% (62)
Soc.cl. III                72.7% (748)            27.3% (281)
Soc.cl. IV                 72.3% (420)            27.7% (161)
Soc.cl. V                  64.4% (331)            35.6% (183)
Total                      71.1% (1778)           28.9% (723)

Table 2: Association between Social Class of Parents and Unemployment
χ2 = 15.538, df = 4, p = 0.004; γ = 0.103, p = 0.002

Note that a simpler (and better) version of the above table would exclude the "Unemployed <1 yr." column, as it is redundant. It is desirable for a table text (or a figure text) to contain a short description of what is seen as well. For example:

Table 2: Association between Social Class of Parents and Unemployment. A non-monotone relationship is seen, with social classes I and V having higher prevalences of unemployment than the three other classes. The association between social class and unemployment is statistically significant (χ2 = 15.538, df = 4, p = 0.004; γ = 0.103, p = 0.002).

Mantel Haenszel

Recoding

In order to conduct a Mantel Haenszel analysis it is necessary to recode social class of the parents (the primary independent variable) into a binary variable. To decide the best way to recode, the homogeneity within different groupings of the categories was tested by conducting a stratified analysis and using the χ2 test. Social class of the parents was recoded into social classes I-IV versus social class V, because social classes I-IV have around the same risk of unemployment while that risk increases sharply in social class V. The χ2 test among social classes I-IV was not significant (p=0.686), so the null hypothesis (that the effect of interest is zero) was not rejected. Table 3 illustrates the association between the recoded variable social class of the parents and subsequent unemployment.

Social class of parents    Unemployed <1 yr.      Unemployed ≥1 yr.      Total
                           Percent (frequency)    Percent (frequency)
Soc. class I-IV            72.8% (1447)           27.2% (540)            100% (1987)
Soc. class V               64.4% (331)            35.6% (183)            100% (514)
Total                      71.1% (1778)           28.9% (723)            100% (2501)

Table 3: Association between social class of the parents recoded and subsequent unemployment
χ2 = 14.109, df = 1, p < 0.001, OR = 1.481 (1.206-1.820)

There is a significant difference between the unemployment rates in social classes I-IV and social class V. As Table 3 shows, 27.2% (540) of the people in social classes I-IV were unemployed for one year or more, compared with 35.6% (183) in social class V. The p-value is <0.001, which indicates that the null hypothesis (that the effect of interest is zero) should be rejected. Therefore, there seems to be an association between the social class of the parents and subsequent unemployment. The odds ratio is 1.481 (1.206-1.820), which means that the odds of unemployment for one or more years are 48% higher for people in social class V than for people in social classes I-IV.

Using the same method of testing how to recode variables, parents' education was recoded into three categories: long-medium education (χ2=0.009, df=1, p=0.923), craft, short or very short education (χ2=1.731, df=2, p=0.42) and no education. Self rated health was also recoded into three categories from the original five. The categories are very content, content, and not content-very discontent.

Confounding

According to epidemiological principles, a variable should be treated as a confounder if it is associated with both the outcome and the exposure but does not lie on the causal pathway between them. In this particular case, in order to check for confounding we must first explore the relation between the secondary variables (sex, intelligence test, educational level, self rated health, educational level of parents, type of residence during childhood, and number of serious illnesses) and the dependent variable subsequent unemployment.
Then we must do the same for the same secondary variables and the primary independent variable, social class of the parents. We test for possible associations using the chi-squared test.

Variable               Category                     Unemployed <1 yr.   Unemployed ≥1 yr.   χ2-value   DF   P-value
                                                    Percent (freq)      Percent (freq)
Self reported health   v. content                   73.7% (1381)        26.3% (493)         44.818     2    <0.001
                       content                      66.8% (399)         33.2% (198)
                       not content-v. discontent    52.1% (101)         47.9% (93)
Intelligence           -25                          64.5% (178)         35.5% (98)          26.837     4    <0.001
                       26-30                        62.7% (215)         37.3% (128)
                       31-35                        69.3% (382)         30.7% (169)
                       36-40                        71.7% (441)         28.3% (174)
                       41+                          75.6% (667)         24.4% (215)
Residence in           Copenhagen                   75.9% (390)         24.1% (124)         9.742      3    0.021
childhood              mid. city                    68.1% (584)         31.9% (274)
                       sm. city                     72.0% (157)         28.0% (61)
                       rural                        71.6% (633)         28.4% (251)
Education              Long                         74.0% (151)         26.0% (53)          102.412    4    <0.001
                       Medium                       81.9% (289)         18.1% (64)
                       Short                        77.4% (281)         22.6% (82)
                       Craftsman                    71.9% (879)         28.1% (343)
                       None                         53.9% (283)         46.1% (242)
Parent education       Long/med.                    70.1% (169)         29.9% (72)          5.290      2    0.071
                       Craft, short, v. short       73.2% (844)         26.8% (309)
                       None                         68.6% (591)         31.4% (271)
Serious illnesses      0                            73.3% (1502)        26.7% (547)         33.913     2    <0.001
                       1                            63.0% (318)         37.0% (187)
                       2                            55.4% (62)          44.6% (50)
Sex                    Boy                          74.4% (974)         25.6% (335)         17.926     1    <0.001
                       Girl                         66.9% (909)         33.1% (449)

Table 4: Association between secondary variables and Unemployment

Table 4 explores whether the secondary variables are associated with the dependent variable unemployment. Parents' education is not significant (p=0.071), so it will not be considered as a confounder. On the other hand, self reported health, intelligence, residence in childhood, education, sex, and serious illness all have significant p-values, which means that the null hypothesis (that the effect of interest is zero) can be rejected.
These variables are all associated with unemployment. More specifically, a larger share of girls (33.1%, 449) than of boys (25.6%, 335) have been unemployed for one year or more. The more content people are with their health, the less likely they are to have been unemployed for one year or more. For example, 26.3% (493) of the persons who are very content have been unemployed for one year or more, whereas the figure is 47.9% (93) for people who are not content or very discontent. People who scored higher on the intelligence test tend to have a lower incidence of unemployment for one or more years than people who scored lower. The people who grew up in Copenhagen have the lowest rate of unemployment for one or more years (24.1%). The people with no education have the highest percentage of unemployment for a year or more (46.1%) and people with a medium length education have the lowest percentage (18.1%).

Variable               Category                     Social classes I-IV   Social class V    χ2-value   DF   P-value
                                                    Percent (freq)        Percent (freq)
Self reported health   v. content                   80.9% (1426)          19.1% (336)       18.600     2    <0.001
                       content                      78.9% (440)           21.1% (118)
                       not content-v. discontent    67.4% (122)           32.6% (59)
Intelligence           -25                          68.6% (194)           31.4% (89)        76.070     4    <0.001
                       26-30                        70.6% (271)           29.4% (113)
                       31-35                        76.0% (457)           24.0% (144)
                       36-40                        77.6% (536)           22.4% (155)
                       41+                          87.0% (853)           13.0% (127)
Residence in           Copenhagen                   85.9% (544)           14.1% (89)        46.746     3    <0.001
childhood              mid. city                    80.8% (800)           19.2% (190)
                       sm. city                     76.1% (191)           23.9% (60)
                       rural                        72.4% (716)           27.6% (273)
Education              Long                         94.4% (185)           5.6% (11)         135.957    4    <0.001
                       Medium                       91.5% (311)           8.5% (29)
                       Short                        86.3% (289)           13.7% (46)
                       Craftsman                    77.7% (896)           22.3% (257)
                       None                         64.3% (308)           35.7% (171)
Parent education       Long/med.                    99.6% (267)           0.4% (1)          398.989    2    <0.001
                       Craft, short, v. short       90.3% (1140)          9.7% (123)
                       None                         58.8% (547)           41.2% (383)
Serious illness        0                            81.2% (1559)          18.8% (360)       16.268     2    <0.001
                       1                            73.2% (350)           26.8% (128)
                       2                            75.2% (79)            24.8% (26)
Sex                    Boy                          78.6% (1167)          21.4% (317)       0.000      1    0.993
                       Girl                         78.6% (1144)          21.4% (311)

Table 5: Association between secondary variables and Social class of parents.

Table 5 explores whether the same secondary variables are associated with the primary independent variable, social class of the parents. Sex is not associated with the social class of the parents, p=0.993 (obviously), so it must be ruled out as a confounder. The other variables have highly significant p-values of <0.001. People who are not content or very discontent with their health are more likely to come from social class V (32.6%) than people who are very content (19.1%). People with a lower intelligence score are more likely to belong to social class V than people with a higher score: of the people with a score of -25 on the test, 31.4% belong to social class V, and of the people with the highest score of 41+, only 13.0% belong to social class V.
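The confounder screening for sex can be reproduced from the counts alone. The following is a minimal sketch, assuming the Pearson chi-squared test without continuity correction and using the boy/girl counts as read from Tables 4 and 5; the function names are illustrative and not part of the original analysis.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Sex vs unemployment (Table 4): rows boy, girl; columns <1 yr, >=1 yr.
chi2_outcome = chi2_2x2(974, 335, 909, 449)
# Sex vs social class of parents (Table 5): rows boy, girl; columns I-IV, V.
chi2_exposure = chi2_2x2(1167, 317, 1144, 311)

print(round(chi2_outcome, 3), p_value_df1(chi2_outcome))
print(round(chi2_exposure, 3), round(p_value_df1(chi2_exposure), 3))
```

Sex is clearly associated with the outcome but not with the exposure, which is why it fails the second condition for a confounder.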
The secondary variables self reported health, residence in childhood, intelligence, serious illness and education are all possible confounders, because they are associated with both unemployment and the social class of the parents in childhood and they do not lie on the causal pathway. Therefore, they must be included in the analysis if the relation between the dependent variable unemployment and the primary independent variable social class of the parents is to be properly understood.

Mantel Haenszel Analysis

The Mantel Haenszel analysis will now test the effect of each possible confounder. The Mantel Haenszel method is used to control for confounding. The Mantel Haenszel statistic tests the null hypothesis that exposure and outcome are conditionally independent given the stratifying variable. When the Mantel Haenszel test is not significant, social class and unemployment are conditionally independent once the stratifying variable is controlled for. The Breslow Day statistic tests whether the odds ratios are the same throughout the strata; in other words, it tests the homogeneity of the odds ratios. The null hypothesis for the Breslow Day test is that the odds ratios are homogeneous across the strata. Therefore, if the Breslow Day statistic is not significant, the differences between the strata are small enough that the Mantel Haenszel common odds ratio can be calculated. If the Breslow Day test is significant, the analysis should stop there, because the strata are most likely very different and there is effect modification.

Effect Modification

An effect modifier is a variable that modifies the size or the direction of the effect of the exposure on the outcome. When there is effect modification, the association between exposure and outcome differs between the categories of the effect modifier. Effect modifiers are tested for using the Breslow Day test and logistic regression analysis. To check for effect modification one should stratify and then examine the odds ratio within each stratum.
The stratum-specific odds ratios should be examined closely for patterns across the strata (how different they look, any trends). If there is effect modification there will be substantial differences in the association between strata.

Variable               Category                     Parents' social class   Unemployed <1 yr.   Unemployed ≥1 yr.
Self-rated health      v. content                   I-IV                    75.7%               24.3%
                                                    V                       67.6%               32.4%
                       content                      I-IV                    68.9%               31.1%
                                                    V                       63.6%               36.4%
                       not content-v. discontent    I-IV                    53.3%               46.7%
                                                    V                       47.5%               52.5%
Intelligence           -25                          I-IV                    68.4%               31.6%
                                                    V                       58.0%               42.0%
                       26-30                        I-IV                    63.6%               36.4%
                                                    V                       62.8%               37.2%
                       31-35                        I-IV                    70.7%               29.3%
                                                    V                       66.1%               33.9%
                       36-40                        I-IV                    73.8%               26.2%
                                                    V                       63.6%               36.4%
                       41+                          I-IV                    77.1%               22.9%
                                                    V                       68.8%               31.3%
Childhood residence    Copenhagen                   I-IV                    74.7%               25.3%
                                                    V                       81.8%               18.2%
                       mid. city                    I-IV                    71.3%               28.7%
                                                    V                       54.5%               45.5%
                       sm. city                     I-IV                    74.2%               25.8%
                                                    V                       62.0%               38.0%
                       rural                        I-IV                    73.4%               26.6%
                                                    V                       66.8%               33.2%
Education              Long                         I-IV                    75.7%               24.3%
                                                    V                       72.7%               27.3%
                       Medium                       I-IV                    82.0%               18.0%
                                                    V                       86.2%               13.8%
                       Short                        I-IV                    79.2%               20.8%
                                                    V                       69.6%               30.4%
                       Craftsman                    I-IV                    73.3%               26.7%
                                                    V                       68.1%               31.9%
                       None                         I-IV                    54.4%               45.6%
                                                    V                       53.2%               46.8%
Serious illness        0                            I-IV                    75.6% (1178)        24.4% (380)
                                                    V                       66.7% (240)         33.3% (120)
                       1                            I-IV                    63.3% (221)         36.7% (128)
                                                    V                       62.5% (80)          37.5% (48)
                       2                            I-IV                    59.5% (47)          40.5% (32)
                                                    V                       42.3% (11)          57.7% (15)

Variable               Breslow Day p-value (df)   Mantel Haenszel p-value   MH odds ratio (c.i.)
Self-rated health      0.755 (2)                  0.001                     1.409 (1.144-1.736)
Intelligence           0.658 (4)                  0.003                     1.379 (1.120-1.699)
Childhood residence    0.018 (3)                  <0.001                    1.487 (1.204-1.837)
Education              0.665 (4)                  0.112                     1.198 (0.966-1.485)
Serious illness        0.199 (2)                  0.001                     1.418 (1.153-1.745)

Table 6: Mantel Haenszel test of the secondary confounders self rated health, intelligence, childhood residence, education and serious illness.

Table 6 shows that childhood residence is an effect modifier, because its Breslow Day test is significant (p=0.018). Therefore, childhood residence cannot be considered a confounder. The effect of family social class differs between the types of residence (small city, medium city, etc.). The Mantel Haenszel test and common odds ratio for childhood residence cannot be interpreted, because the odds ratios are heterogeneous across the strata.
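As a check on how the common odds ratio is formed, the serious illness strata of Table 6 are the only ones reported with frequencies, so they can be recomputed by hand. This is a minimal sketch, assuming the standard Mantel Haenszel estimator (sum of a*d/n over sum of b*c/n) and a Wald interval on the log odds ratio for the crude table; the function names are illustrative.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for exposed cases a, exposed non-cases b, unexposed cases c, unexposed non-cases d."""
    return (a * d) / (b * c)


def wald_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval for the odds ratio (Wald, log scale)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def mantel_haenszel_or(strata):
    """Mantel Haenszel common odds ratio over a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# (class V >=1 yr, class V <1 yr, classes I-IV >=1 yr, classes I-IV <1 yr)
crude_table = (183, 331, 540, 1447)        # Table 3
serious_illness_strata = [                 # Table 6
    (120, 240, 380, 1178),                 # 0 serious illnesses
    (48, 80, 128, 221),                    # 1 serious illness
    (15, 11, 32, 47),                      # 2 serious illnesses
]

crude_or = odds_ratio(*crude_table)
crude_low, crude_high = wald_ci(*crude_table)
common_or = mantel_haenszel_or(serious_illness_strata)
stratum_ors = [odds_ratio(*s) for s in serious_illness_strata]

print(round(crude_or, 3), round(crude_low, 3), round(crude_high, 3))
print(round(common_or, 3), [round(x, 2) for x in stratum_ors])
```

The stratum-specific odds ratios are printed alongside the common one because they are what the Breslow Day test compares; the sketch does not implement that test itself.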
Table 6 shows that the Mantel Haenszel test is significant for self rated health, intelligence and serious illness, but not for education. This means that there is still an association between the social class of the parents and subsequent unemployment when we control for each confounder individually, except when we control for education. The Mantel Haenszel p-value for education is 0.112. This means that the association between unemployment and social class can be explained by education: when we control for education, parents' social class and unemployment become independent.

The Mantel Haenszel common odds ratio shows the odds of unemployment associated with the parents' social class during childhood when each confounder is controlled for. Take self rated health for example: the Mantel Haenszel common odds ratio tells us that people in social class V have 41% higher odds of unemployment than people in social classes I-IV when self rated health is controlled for.

To assess the size of the effect of confounding, compare the original odds ratio (from Table 3, 1.481 (1.206-1.820)) with the Mantel Haenszel common odds ratio for each confounder:

\[
\frac{OR - OR_{MH}}{OR} \times 100
\]

Equation 1

Using Equation 1, the effect of the confounder self rated health is 4.86%, which means that self rated health explains 4.86% of the association between unemployment and the parents' social class. This is a small difference. The effect of the confounder intelligence is 6.89%, which means that intelligence explains 6.89% of the association between unemployment and the parents' social class. The effect of the confounder serious illness is 4.25%, which means that serious illness explains 4.25% of the association between unemployment and the parents' social class.
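The percentages from Equation 1 can be recomputed directly from the crude odds ratio in Table 3 and the common odds ratios in Table 6. This is a minimal sketch; the function name is illustrative, and the results follow from the odds ratios as printed in the tables, so they may differ from figures that were based on other output. Childhood residence is left out because it was identified as an effect modifier.

```python
def percent_explained(crude_or, adjusted_or):
    """Equation 1: relative change from the crude to the adjusted odds ratio, in percent."""
    return (crude_or - adjusted_or) / crude_or * 100


CRUDE_OR = 1.481  # Table 3

# Mantel Haenszel common odds ratios from Table 6
mh_odds_ratios = {
    "self rated health": 1.409,
    "intelligence": 1.379,
    "education": 1.198,
    "serious illness": 1.418,
}

effects = {
    name: round(percent_explained(CRUDE_OR, value), 2)
    for name, value in mh_odds_ratios.items()
}

for name, effect in effects.items():
    print(f"{name}: {effect}%")
```

With these inputs education gives by far the largest change (about 19%), which agrees with the Mantel Haenszel test no longer being significant once education is controlled for.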
As Table 6 shows, the confidence intervals for the Mantel Haenszel common odds ratios are quite broad, and the true value of the common odds ratio could plausibly lie anywhere between the two limits of the interval. Take serious illness for example, with common OR=1.418 (1.153-1.745). The confidence interval is very broad and the true value could be anywhere between 1.153 and 1.745. A possible reason for the broadness is that there are relatively few people in some of the strata.

Logistic Regression Analysis

With logistic regression we can determine which explanatory variables influence the outcome, as well as evaluate the probability of a particular outcome. The dependent variable must be binary, and the independent variables can be of any kind. Logistic regression gives us the estimated logistic regression coefficient, the estimated odds ratio with its 95% confidence interval, and the Wald test statistic with an associated p-value. The Wald test statistic tests the null hypothesis that the relevant logistic regression coefficient (beta) is equal to zero. The higher the Wald statistic (for its degrees of freedom), the smaller the p-value, because a high Wald statistic means that the standard error is small compared with the estimated coefficient, i.e. the difference between a category and its reference category. The overall Wald test for a categorical variable tests the coefficients of all its categories at the same time. The information from the logistic regression analysis is used to determine whether each variable is related to the outcome and to quantify by how much.

Model seeking

In order to test the connection between the various explanatory variables and unemployment, backward model seeking was performed. It is important to keep in mind the hierarchical principle: a variable must not be removed while an interaction involving that variable remains in the model.
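The hierarchical principle can be made concrete with a small sketch of a single backward-elimination step. Everything here is illustrative: the function names, the 0.05 threshold and the p-values are hypothetical and are not taken from the exam data, and in a real analysis the model would be refitted after every removal to obtain new p-values.

```python
def removable_terms(terms):
    """Terms that can be dropped without violating the hierarchical principle."""
    interactions = [t for t in terms if "*" in t]
    allowed = []
    for term in terms:
        if "*" in term:
            allowed.append(term)  # an interaction can always be dropped
        elif not any(term in inter.split("*") for inter in interactions):
            allowed.append(term)  # main effect with no interaction left in the model
    return allowed


def next_to_remove(p_values, alpha=0.05):
    """Least significant removable term, or None if all removable terms are significant."""
    candidates = removable_terms(list(p_values))
    worst = max(candidates, key=lambda t: p_values[t], default=None)
    if worst is None or p_values[worst] <= alpha:
        return None
    return worst


# Hypothetical p-values for a small model with one interaction.
step_1 = {"sex": 0.001, "intelligence": 0.90, "education": 0.001, "sex*intelligence": 0.40}
# Hypothetical p-values after refitting without the interaction.
step_2 = {"sex": 0.001, "intelligence": 0.17, "education": 0.001}
# Hypothetical p-values after refitting without intelligence.
step_3 = {"sex": 0.001, "education": 0.001}

print(next_to_remove(step_1))  # the interaction goes first, although intelligence has a higher p-value
print(next_to_remove(step_2))
print(next_to_remove(step_3))
```

In the first step intelligence is protected by the remaining sex*intelligence interaction, so the interaction is removed first even though its p-value is lower.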
First the model included every variable and every pairwise interaction: [sex, social class, intelligence, education, self rated health, parents education, childhood residence, serious illness, sex*social class, sex*intelligence, sex*education, sex*self health, sex*parent education, sex*childhood residence, sex*serious illness, social class*intelligence, social class*education, social class*self health, social class*parents education, social class*childhood residence, social class*serious illness, intelligence*education, intelligence*self health, intelligence*parent education, intelligence*childhood residence, intelligence*serious illness, education*self health, education*parents education, education*childhood residence, education*serious illness, self health*parent education, self health*childhood residence, self health*serious illness, parent education*childhood residence, parent education*serious illness, childhood residence*serious illness]. The least significant term was removed and the logistic regression analysis was performed again, repeating the process until only significant interactions and variables remained. The order in which the terms were excluded and their corresponding p-values are shown in Table 7.

Step   Variable                                    p-value
1      family social class*family education        0.943
2      family education*child residence            0.852
3      family social class*sex                     0.779
4      sex*serious illness                         0.668
5      family social class*self rated health       0.653
6      serious illness*intelligence                0.590
7      self reported health*child residence        0.538
8      family education*sex                        0.480
9      self rated health*serious illness           0.432
10     family social class*intelligence            0.437
11     sex*child residence                         0.449
12     self rated health*intelligence              0.409
13     family social class*education               0.348
14     family education*intelligence               0.229
15     family education*self rated health          0.417
16     child residence*intelligence                0.304
17     education*child residence                   0.308
18     self rated health*education                 0.248
19     education*intelligence                      0.330
20     serious illness*education                   0.372
21     serious illness*child residence             0.243
22     family social class*child residence         0.220
23     family social class*serious illness         0.158
24     family social class                         0.725
25     family education*serious illness            0.154
26     family education*education                  0.197
27     family education                            0.244
28     self rated health*sex                       0.492
29     sex*intelligence                            0.395
30     sex*education                               0.189
31     intelligence                                0.172

Table 7: Order in which variables were removed from logistic regression model

Before presenting the final logistic regression model it should be explained that intelligence coded into five categories was included in the model, as opposed to the linear variable intelligence. An initial model seeking was performed with the linear variable and it was found that the effect of intelligence is not linear. This was found by including different combinations of intelligence variables (one at a time) and comparing their p-values with that of the linear term. The p-values were: intelligence linear (0.500) with intelligence squared (0.892); intelligence linear (0.547) with intelligence cubed (0.547); intelligence in 5 categories (0.053) with intelligence linear (0.598); intelligence linear only (0.057); intelligence in 5 categories only (0.008). Intelligence in 5 categories was the most significant of all the variables and combinations of variables tested. Therefore the logistic regression analysis was run again with intelligence in five categories as a variable and without the linear variable.

The section on linearity is confusing. Testing whether the effect of intelligence is linear can be done in two ways. The first is to include higher order terms as well, e.g. intelligence squared, and then test whether the higher order term is statistically significant.
The p-value for the linear term in a model that also includes intelligence squared is not used. In the second approach, the linear term is included together with a categorical version of intelligence. Again, only the p-value for one of the two, here the categorical version, is relevant.

Equation 2 shows the final logistic regression model that was chosen.

\[
P(\text{unemployment} \mid \text{sex, education, self health, child residence, serious ill.}) = \frac{e^{\eta}}{1 + e^{\eta}}
\]

where the linear predictor is

\[
\eta = -0.686 + 0.461\,x_{sexgirl} - 0.694\,x_{edulo} - 1.294\,x_{edume} - 1.070\,x_{edush} - 0.748\,x_{educr} + 0.209\,x_{shc} + 0.555\,x_{shnc} + 0.802\,x_{shvd} - 0.201\,x_{crcp} + 0.259\,x_{crmc} - 0.013\,x_{crsc} + 0.233\,x_{si}
\]

Where:
x_sexgirl takes the value 1 if the person is a girl.
x_edulo takes the value 1 if the person had a long education.
x_edume takes the value 1 if the person had a medium education.
x_edush takes the value 1 if the person had a short education.
x_educr takes the value 1 if the person was educated as a craftsman.
x_shc takes the value 1 if the person was content with self health.
x_shnc takes the value 1 if the person was not content with self health.
x_shvd takes the value 1 if the person was very discontent with self health.
x_crcp takes the value 1 if the person grew up in Copenhagen.
x_crmc takes the value 1 if the person grew up in a medium sized city.
x_crsc takes the value 1 if the person grew up in a small city.
x_si increases by 1 for each serious illness.

Equation 2: Logistic Regression Equation

Variable             Category          Beta      Wald      P-value   OR (95% c.i.)
Sex                  Boy               0                             1
                     Girl              0.461     23.957    <0.001    1.586 (1.319-1.908)
Education            (overall)                   78.004    <0.001
                     None              0                             1
                     Craftsman         -0.748    40.892    <0.001    0.473 (0.376-0.595)
                     Short             -1.070    42.115    <0.001    0.343 (0.248-0.474)
                     Medium            -1.294    56.655    <0.001    0.274 (0.196-0.384)
                     Long              -0.694    12.492    <0.001    0.500 (0.340-0.734)
Self-rated health    (overall)                   12.040    0.007
                     Very content      0                             1
                     Content           0.209     3.150     0.076     1.233 (0.978-1.553)
                     Not content       0.555     5.998     0.014     1.741 (1.117-2.714)
                     Very discontent   0.802     8.125     0.004     2.229 (1.285-3.869)
Child residence      (overall)                   13.668    0.003
                     Rural             0                             1
                     Small city        -0.013    0.005     0.943     0.988 (0.701-1.391)
                     Medium city       0.259     5.675     0.017     1.296 (1.047-1.605)
                     Copenhagen        -0.201    2.321     0.128     0.818 (0.632-1.059)
Serious illnesses    0                 0                             1
                     Per illness       0.233     5.370     0.020     1.262 (1.037-1.537)
Constant                               -0.686    -         -
N                    3151

Table 8: Final Logistic Regression Model

It is imperative that social class is included in the model; it is the whole purpose of the analysis!

Table 8 reports the final logistic regression model with the following reference group: boy, no education, very content with self rated health, grew up in rural Denmark, and 0 serious illnesses. According to the model, sex, education, self rated health, child residence and serious illness all have an effect on subsequent unemployment. Girls have 59% higher odds of subsequent unemployment than boys. People who are very content with their health tend to suffer less unemployment than people who are just content (23% higher odds), not content (74% higher odds) or very discontent (123% higher odds). People who grew up in medium sized cities have 29% higher odds of being unemployed than people who grew up in rural areas. People who grew up in Copenhagen have only 82% of the odds of unemployment of people who grew up in rural areas. Not surprisingly, with each additional serious illness that a person suffers, the odds of unemployment increase by 26%.
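To show how the final model turns covariate values into a probability, the logistic formula can be evaluated directly with the coefficients of Table 8. This is a minimal sketch with illustrative names. One value is an assumption: the coefficient for girls is taken as 0.461, which is ln(1.586) and therefore agrees with the odds ratio, confidence interval and Wald statistic reported for sex.

```python
import math

INTERCEPT = -0.686
COEFFICIENTS = {
    "sexgirl": 0.461,   # assumption: ln(1.586), the value implied by the reported odds ratio
    "edulo": -0.694,    # long education
    "edume": -1.294,    # medium education
    "edush": -1.070,    # short education
    "educr": -0.748,    # craftsman
    "shc": 0.209,       # content with health
    "shnc": 0.555,      # not content with health
    "shvd": 0.802,      # very discontent with health
    "crcp": -0.201,     # grew up in Copenhagen
    "crmc": 0.259,      # grew up in a medium sized city
    "crsc": -0.013,     # grew up in a small city
    "si": 0.233,        # per serious illness
}


def predicted_probability(**covariates):
    """Predicted probability of unemployment for >=1 year; omitted covariates are 0."""
    eta = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in covariates.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Reference person: boy, no education, very content, rural, no serious illness.
p_reference = predicted_probability()
# Girl with a medium education who grew up in Copenhagen and had one serious illness.
p_example = predicted_probability(sexgirl=1, edume=1, crcp=1, si=1)
# Each odds ratio in Table 8 is the exponential of the corresponding coefficient.
odds_ratio_girl = math.exp(COEFFICIENTS["sexgirl"])

print(round(p_reference, 3), round(p_example, 3), round(odds_ratio_girl, 3))
```

The percentages quoted in the text are changes in odds; the change in predicted probability for a given person depends on that person's other covariate values.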
[Plot: odds ratio with upper and lower 95% c.i. by education level (none, craftsman, short, medium, long)]
Figure 4: Education Level and Risk of Unemployment

Figure 4 explores the relationship between education level and the risk of subsequent unemployment. In both the Mantel Haenszel analysis and the logistic regression model, education is an important variable that has an effect on unemployment. The risk of unemployment tends to decrease as the amount of education increases, with the exception of a long education. A possible reason for this is that relatively few people (6%) have a long education. It could also be that people with a long education lack the social skills to work, or that their education was long but in a field that provides few jobs. It could also be explained by the 95% confidence intervals themselves. All four confidence intervals contain the range 0.376-0.384. This indicates that the true odds ratios could be the same, in which case the amount of education would make no difference to subsequent unemployment. It is interesting that none of the confidence intervals contain one. This indicates that some education is better than no education at preventing unemployment.

Discussion

There are advantages and disadvantages to both methods of analyzing the data. A drawback of the Mantel Haenszel method is that the number of strata increases rapidly when attempting to control for more confounding variables, which makes it impossible to estimate stratum-specific odds ratios. An increased number of strata leaves a very small number of persons in each stratum, causing statistical uncertainty. Another problem with Mantel Haenszel is that the primary variable, here social class of the family, has to be dichotomized, which causes information to be lost.
With these considerations in mind, the logistic regression gave more information. Table 6 and Table 8 show that both methods identify self rated health, education and serious illness as confounders. The differences are that Mantel Haenszel identifies childhood residence as an effect modifier, whereas logistic regression treats it as a confounder. Mantel Haenszel rules out sex as a possible confounder because it is not associated with social class of the parents, but the logistic regression model (which does not find social class of the family to be associated with subsequent unemployment) shows that sex has an effect on unemployment. Finally, Mantel Haenszel shows that intelligence is a confounder between social class of the family and unemployment, but logistic regression does not include intelligence in the final model. Both methods show that education is more important than intelligence in explaining subsequent unemployment.

Possible Errors

Possible errors in the analysis are type I and type II errors. A type I error is made when a significant result is obtained and the null hypothesis is rejected although it is in fact true. With a 5% significance level, type I errors occur in 5% of the tests in which the null hypothesis is true. A type II error occurs when a non-significant result is obtained although the null hypothesis is false. A possible type II error in this analysis could have occurred because the Wald test does not take the order of the categories into account; variables that should have been in the logistic regression model could therefore have been removed from it.

Conclusion

This analysis used Mantel Haenszel analysis and multiple logistic regression to explore the association between the social class of the parents during the child's childhood and subsequent unemployment of the grown-up child, taking into consideration possible confounders and effect modifiers. The Mantel Haenszel analysis identified self rated health, intelligence, serious illness and education as confounders and childhood residence as an effect modifier.
It was also found that education explains the association between social class and unemployment. The logistic regression model identifies sex, education, self rated health, childhood residence and serious illness as possible confounders. This analysis did not show that unemployment depends on continuously measured intelligence, but rather that unemployment is influenced by the confounders and effect modifier identified above, and in particular by education.

This is a good paper with reflections that show considerable understanding of quantitative methods.

Include a sentence or a section with your research hypotheses, that is, what you think in advance the association between social class and unemployment will be. This corresponds to the situation you will be in when analyzing data for research purposes, and makes it more realistic when discussing the findings relative to the prior expectations.

There are some shortcomings. An important use of the descriptive statistics section is to be able to discuss to what extent the later results can be generalized. It would have improved the paper if it had included a table of descriptive statistics for all relevant variables (i.e. not including the response: unemployment). The primary variables of interest, here social class and intelligence, should always be included in the final model irrespective of whether they are statistically significant or not. The results of the different analyses should be compared, and it should be made clear how different conclusions can be reached depending on which variables are controlled for.

Grade: 7 (12 point scale)