J Clin Epidemiol Vol. 48, No. 5, pp. 657-666, 1995. Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0895-4356/95 $9.50 + 0.00. 0895-4356(94)00163-4

EVALUATING A TEN QUESTIONS SCREEN FOR CHILDHOOD DISABILITY: RELIABILITY AND INTERNAL STRUCTURE IN DIFFERENT CULTURES

M. S. DURKIN,1,2,3* W. WANG,4,5 P. E. SHROUT,6 S. S. ZAMAN,7 M. HASAN,8 P. DESAI9 and L. L. DAVIDSON1,2,10

1Gertrude H. Sergievsky Center, Faculty of Medicine, Columbia University, New York, NY, U.S.A., 2Division of Epidemiology, School of Public Health, Columbia University, New York, NY, U.S.A., 3New York State Psychiatric Institute, New York, NY, U.S.A., 4Division of Biostatistics, School of Public Health, Columbia University, New York, NY, U.S.A., 5Nathan Kline Institute, Rockland, NY, U.S.A., 6Department of Psychology, New York University, New York, NY, U.S.A., 7Departments of Psychology and Special Education, University of Dhaka, Dhaka, Bangladesh, 8Department of Neuropsychiatry, Jinnah Postgraduate Medical Centre, Karachi, Pakistan, 9Department of Social and Preventive Medicine, University of the West Indies, Mona, Kingston, Jamaica and 10Department of Pediatrics, Faculty of Medicine, Columbia University, New York, NY, U.S.A.

*All correspondence and reprint requests should be addressed to: Dr Maureen Durkin, Columbia University, Sergievsky Center, 630 W. 168 Street, New York, NY 10032, U.S.A.

(Received in revised form 30 August 1994)

Abstract-This paper uses five strategies to evaluate the reliability and other measurement qualities of the Ten Questions screen for childhood disability. The screen was administered for 22,125 children, aged 2-9 years, in Bangladesh, Jamaica and Pakistan. The test-retest approach involving small sub-samples was useful for assessing reliability of overall screening results, but not of individual items with low prevalence. Alternative strategies focus on the internal consistency and structure of the screen as well as item analyses. They provide evidence of similar and comparable qualities of measurement in the three culturally divergent populations, indicating that the screen is likely to produce comparable data across cultures. One of the questions, however, correlates with the other questions differently in Jamaica, where it appears to "over-identify" children as seriously disabled. The methods and findings reported here have general applications for the design and evaluation of questionnaires for epidemiologic research, particularly when the goal is to gather comparable data in geographically and culturally diverse settings.

Child development disorders    Cross-cultural comparison    Disability    Epidemiologic methods    Questionnaires    Reliability    Reproducibility of results

INTRODUCTION

When comparing epidemiologic characteristics of a health condition across populations that differ in language and other aspects of culture, the comparability of the assessment procedures is a special concern. Cross-cultural equivalence is especially problematic when assessments depend on verbal reports of individuals sampled from the population. In such instances, researchers must not only develop a survey questionnaire that is standard and unambiguous, but must also show that population characteristics such as preferred language, level of education and cultural values do not affect the quality of the assessment. Quality of measurement is typically characterized in terms of reliability (the degree to which a measurement produces systematic variation) and
validity (the degree to which the measurement is useful for its intended purpose) [1,2]. Validity of a screening instrument is the ultimate criterion for choosing a screen. It is tested by comparing the screen to an established external criterion, such as a clinical assessment. Reliability is important because it is a necessary (but not sufficient) condition for validity. If the screen does not produce systematic variation, it cannot be valid. Moreover, reliability is a measurement characteristic that can often be improved, both by clarifying questions and measurement procedures, and by averaging replicate measurements. Such improvements will often improve validity. Furthermore, reliability can be studied prior to fielding of costly validity studies. Estimates of reliability are obtained by replicating measurements, and often the extent to which replication will be obtained can be inferred from examination of responses to similar items within a screen (see below).

In this paper we focus on reliability of a screen for an additional reason. In cross-cultural research the examination of reliability and internal structure of measures can provide some assurance that the measures do not vary according to culture or translation. The logic of the reliability analyses can be extended to individual items, as well as composite screening scores, and items that appear to lack cross-cultural robustness can be removed or revised to provide more comparable measurement. An advantage of reliability over validity analysis for assessing cross-cultural comparability is its reliance on internal properties of the screen rather than an external criterion that may itself vary across cultures (due to differences in training, clinical style and other factors).

The screening instrument evaluated here is the Ten Questions, a questionnaire designed to detect serious disabilities in 2- to 9-year-old children. It is intended as a tool for focussing scarce professional resources in heterogeneous cultures in developing nations. In previous papers we have shown this screen to be sensitive cross-culturally for detecting serious cognitive, motor and seizure disabilities, but not sensitive for identifying serious vision and hearing disabilities that have not been previously detected [3,4]. A fundamental question addressed in the present paper is: to what extent does the screen produce similarly systematic data in the three populations studied, communities in Bangladesh, Jamaica and Pakistan? To address this question, the paper uses five alternative approaches to evaluate reliability. We begin with a brief review of the classic mathematical definition of reliability [5].

Theory of reliability and its estimation

Suppose that the variable X represents the result of an assessment process. The variance of X, $\sigma^2_X$, is a population parameter that is likely to differ across populations. According to classic reliability theory, it is useful to decompose $\sigma^2_X$ into at least two components, $\sigma^2_X = \sigma^2_T + \sigma^2_E$, where $\sigma^2_E$ is variance due to non-systematic stochastic processes (random error) and $\sigma^2_T$ is variance due to systematic differences between the objects or persons being measured. The reliability coefficient is a ratio of the population parameters $\sigma^2_T$ and $\sigma^2_X$:

$$\rho^2_{XT} = \sigma^2_T/\sigma^2_X = \sigma^2_T/[\sigma^2_T + \sigma^2_E].$$

The reliability coefficient varies from zero (X is due entirely to unsystematic stochastic processes) to unity (X is due entirely to systematic individual differences).
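As an illustration of this decomposition (not drawn from the study data; the variance values are arbitrary), the short simulation below generates two parallel administrations of a measure X = T + E and confirms that their correlation approximates $\sigma^2_T/[\sigma^2_T + \sigma^2_E]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # number of persons measured
var_t, var_e = 3.0, 1.0          # arbitrary true-score and error variances

t = rng.normal(0.0, np.sqrt(var_t), n)       # systematic (true) component
x1 = t + rng.normal(0.0, np.sqrt(var_e), n)  # first administration
x2 = t + rng.normal(0.0, np.sqrt(var_e), n)  # independent replicate

print(var_t / (var_t + var_e))   # theoretical reliability: 0.75
print(np.corrcoef(x1, x2)[0, 1]) # empirical test-retest correlation, ~0.75
```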
For a complete development of $\rho^2_{XT}$ and its implications, see Lord and Novick [5].

To estimate $\rho^2_{XT}$ we need to operationalize what is meant by systematic variation of X, and then to design a study to collect data on systematic variation. The most common design calls for making the X measurement at two points in time (the test-retest design). Changes in the X values for the same respondent are used to estimate $\sigma^2_E$, and this can be used with the observed X variance to estimate $\rho^2_{XT}$. Although theoretically and intuitively appealing, this design includes systematic biological, psychological and social changes over time in its estimate of $\sigma^2_E$, and consequently may underestimate the instantaneous reliability of the first assessment. Some of these changes may be due to the interview process itself: informants may think about the questions and form new opinions about how they should have answered. The test-retest design is also subject to memory artifacts: respondents may remember at time 2 random responses made at time 1, and thereby inflate the reliability estimate. Methodologists who address these issues recommend that the second assessments be carried out after a long enough period to reduce memory artifacts, but promptly enough to reduce the probability of systematic changes. Recommendations of how long the period should be are more products of opinion than science.

Inferences about the degree of stochastic variation in X can be made on the basis of data other than those collected over time. If different questions can be asked in a questionnaire about a single underlying condition or process, then the degree to which the answers to those questions are systematic can be used to establish that the assessments are reliable in the general sense described above. These inferences are made on the basis of the internal consistency of the questionnaire responses. Three indicators of internal consistency used in this paper are: (1) Cronbach's alpha coefficient [6]; (2) factor loadings from a factor analysis [7]; and (3) the item response curve [8], an indicator of how well a given item distinguishes between respondents with high and low scores on the trait the instrument is intended to measure. Unlike test-retest designs that obtain replicate measurements over time, internal consistency designs attempt to obtain replicate measurements within a single interview session. In addition to providing information about the reliability of the questionnaire items, the internal consistency design provides the basis for designing composite measures that are more reliable than the original items. The degree to which reliability is expected to improve in the composites is described mathematically by Spearman [9] and Brown [10] [see also 11], as sketched below.
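The expected improvement has a standard closed form (the Spearman-Brown prophecy formula): if a single item has reliability $\rho_1$ and a composite combines $k$ parallel items, the expected composite reliability is

$$\rho_k = \frac{k\rho_1}{1 + (k - 1)\rho_1}.$$

For example, ten parallel items each with reliability 0.30 would be expected to yield a composite with reliability $10(0.30)/[1 + 9(0.30)] \approx 0.81$.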
Like test-retest measurements of reliability, estimates based on internal consistency have limitations. Random biological and psychological processes that are irrelevant to the interests of the investigator may affect all of the responses given in a single interview. For example, stress or illness at the time of the assessment may introduce transient variation in X that is not recognized as error variation by the internal consistency estimators of reliability. Reliability estimates may also be inflated if the related items in the questionnaire are so similar that they are affected by the same sources of confusion. These limitations are offset, though not eliminated, by the low cost and feasibility of internal consistency designs.

The fact that no single strategy for evaluating the degree to which a measure produces systematic data is wholly satisfactory has prompted us in the present analysis of the Ten Questions to employ five complementary strategies for making inferences about the quality of the data. The first two look at the screen as a single, global measure: one evaluates reliability in terms of test-retest agreement between overall screening results obtained on two occasions, the other in terms of internal consistency of the ten disability questions (items). The remaining three strategies involve analyses of the individual items. One looks at test-retest reliability, another at factor scores from a factor analysis, and the final one is an analysis of the item response process (described below).

MATERIALS AND METHODS

The Ten Questions is a brief questionnaire administered to parents as a personal interview. Five of the questions are designed to detect cognitive disability, two questions relate to movement disability, and there is one question each on seizures, vision and hearing (see Appendix) [3,4,12,13]. The target age group is 2-9 years. The Ten Questions screen is intended as a rapid and low-cost method of case-finding in communities such as those in less developed countries where many or most seriously disabled children have never received professional services.

Three features of the questionnaire design are intended to enhance its appropriateness and measurement qualities under diverse cultural and socioeconomic conditions: the questions are simple with a yes-no response format; they focus on universal abilities that children in all cultures normally acquire, rather than on culturally specific behaviors; and they ask the parent, in judging whether or not the child has a disability, to compare the child to others of the same age and cultural setting.

A two-phase design, screening followed by clinical evaluations, was implemented in community settings in Bangladesh, Jamaica and Pakistan [3,4,12,14-19]. The screening took place during house-to-house surveys and covered all 2- to 9-year-old children in selected communities. In Bangladesh and Pakistan cluster sampling was used to obtain probability samples of Bangladesh and Karachi, Pakistan, respectively. In Jamaica, all households in a contiguous area of Clarendon Parish were surveyed. The Ten Questions screen and a household questionnaire, translated into the national language of each country, were administered by community workers trained as interviewers for this study. In all, 58 interviewers screened more than 22,000 children in the three countries (Table 1). The participation rate was greater than 98% in each country. The main reason for non-participation was that no adult was present on at least three visits.

To assess test-retest reliability, repeat screening and household questionnaires were administered 2 weeks after the original survey for consecutive samples of 101 children in Bangladesh and 52 children in Pakistan. Test-retest data were not available from Jamaica.

Statistical analysis

The responses to the Ten Questions were considered negative (coded 0) if no problem was reported and positive (coded 1) if a problem was reported. To assess test-retest consistency of the sum of positive responses to the screening questions we computed Pearson correlation coefficients [20], as sketched below.
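A minimal sketch of these test-retest computations, using hypothetical 0/1 response arrays (the data and the 90% agreement rate are invented for illustration), together with the kappa coefficient for the dichotomous outcome introduced in the next paragraph:

```python
import numpy as np

# Hypothetical paired responses: rows = children, columns = the ten items,
# administered twice about two weeks apart.
rng = np.random.default_rng(1)
test = rng.integers(0, 2, size=(101, 10))
retest = np.where(rng.random((101, 10)) < 0.9, test, 1 - test)  # ~90% agreement

# Test-retest consistency of the sum of positive responses (Pearson r).
sum1, sum2 = test.sum(axis=1), retest.sum(axis=1)
r = np.corrcoef(sum1, sum2)[0, 1]

# Cohen's kappa for the dichotomous outcome (positive on any item).
pos1, pos2 = (sum1 > 0).astype(int), (sum2 > 0).astype(int)
p_obs = np.mean(pos1 == pos2)              # observed agreement
p1, p2 = pos1.mean(), pos2.mean()
p_exp = p1 * p2 + (1 - p1) * (1 - p2)      # agreement expected by chance
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(r, 2), round(kappa, 2))
```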
To evaluate test-retest reliability of dichotomous outcomes (individual item responses and overall positive vs negative screening results), we computed kappa coefficients [21]. To measure internal consistency reliability we computed Cronbach's alpha [6] (sketched below). Factor analysis [7], used to evaluate the reliability of individual items, was based on tetrachoric correlations [5,22] of the items. The factor loadings on a single factor are interpreted as the correlation between each item and the common factor measured by the ten questions as a whole. If an item evokes random or unreliable responses, its factor loading is expected to be zero.

The final method used to evaluate the reliability and cross-cultural comparability of individual items involved evaluation of item characteristic curves [8]. Each of these curves provides insight into the ability of a specific item to distinguish children with lower from those with higher scores on a disability scale (a scale comprised of items from the Ten Questions screen found in the factor analysis to measure a common dimension, excluding the item under consideration). The curve for a given item describes the probability that the item will be endorsed for children with increasing levels of disability (i.e. scores on the disability scale). Curves that are steep in the center of the graph indicate that the item effectively distinguishes between children with lower and higher levels of disability. Because only reliable items could show such relationships, the item curve analysis provides an indirect method of evaluating reliability.

In conjunction with the visual examination of the item characteristic curves, we used a Mantel-Haenszel procedure [23] to test whether the odds of endorsing an item are the same in a given population compared to a reference population, adjusting for overall severity (as measured by the remaining items). A summary odds ratio greater than one, for example, indicates that an item is more likely to be endorsed in a given population than in a reference population, holding disability level constant.

RESULTS

The three populations of children are similar in terms of age and gender distribution, but they differ in socioeconomic and cultural characteristics (Table 1). In education and most economic indicators, Jamaica is the most developed and Bangladesh is the least developed of the three populations (Table 1). Karachi is the most urbanized of the three.

The populations have different rates of response for some items and similar rates for others of the Ten Questions (Fig. 1). The most striking difference is for Question 10 (which reads: "Compared to other children, is this child in any way backward, dull or slow?"); in Jamaica it elicited a much larger proportion of positive responses than any of the other questions and than any question in the other two populations. In Pakistan, the percentages of positive responses to Questions 1, 5, 6 and 9 were considerably higher than in the other two populations. Otherwise, the patterns of responses in Fig. 1 show some similarity in the three populations. In all three populations, the questions on milestones (Question 1) and unclear speech (Question 9) elicited frequent positive responses, while the questions on comprehension (Question 4), learning (Question 7) and no speech (Question 8) elicited the fewest positive responses. The overall percentages screened positive (on any question) were much higher in Jamaica and Pakistan than in Bangladesh (15.6 and 14.7 vs 8.2%, respectively).
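The alpha computation referred to in the Methods above is compact enough to sketch directly; the response matrix here is hypothetical, and with truly random items the statistic will hover near zero rather than in the 0.6 range reported for the actual screen:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Standard alpha for an (n persons x k items) matrix of 0/1 responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
responses = rng.integers(0, 2, size=(200, 10))   # hypothetical screen data
print(cronbach_alpha(responses))
```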
Analysis of the reliability of the screen as a global measure

The test-retest reliability results for the screen as a whole (for both the sum of positive items, and for the dichotomous outcome of positive/negative screen) indicate acceptable or good reliability in both populations where test-retest data were available (Bangladesh and Pakistan, Table 2). In Bangladesh and Pakistan the test-retest reliabilities for total scores (correlation coefficients) were, respectively, 0.58 and 0.83 (Table 2). Consistent with these results, the internal consistency reliabilities indicated by alpha coefficients for these two countries were 0.60 in Bangladesh and 0.66 in Pakistan. The alpha coefficient for Jamaica, 0.60, is similar to those of the other two countries.

Also shown in Table 2 are the test-retest reliabilities for the dichotomous screening outcome (positive vs negative), which are lower than those for the continuous outcome. This is as expected because of the loss of information resulting from dichotomizing a continuous variable.

Table 1. Number and background characteristics of the children screened and clinically evaluated in the three populations

                                  Bangladesh       Jamaica            Pakistan
Number of children screened       10,299           5,461              6,365
Number of interviewers            31               8                  19
Child characteristics
  Boys (%)                        52.6             49.2               53.8
  2-5 years (%)                   48.0             52.4               53.0
  6-9 years (%)                   52.0             47.6               47.0
Household characteristics
  Language                        Bangla           English            Urdu
  Religion                        Muslim (91.4%)   Christian (89.0%)  Muslim (94.4%)

(Further rows of the printed table compare school attendance among ages 5-9, births at home, immunization coverage, agricultural occupation, rural residence, electricity, radio, water tap and maternal primary schooling. *All Jamaican children were from a semi-rural area.)

Analysis of the reliability of individual screening items

The test-retest results for individual items are generally not informative due to the limited number of persons who were positive on each item (Table 2). For four of the items in Bangladesh and five of the items in Pakistan, kappa coefficients could not be calculated due to lack of variation (no positive responses within the retest sample). For the remaining questions, the estimated kappa coefficients are generally unstable (indicated by wide confidence intervals), and not useful for assessing the comparability of the reliability of the screen in different cultures.

Fig. 1. Percentage with positive responses to the Ten Questions in the populations surveyed in Bangladesh, Jamaica and Pakistan.

Table 2. Test-retest reliability of the Ten Questions: kappa coefficients for dichotomous, Pearson correlation coefficients for continuous variables (95% confidence intervals*)

                                       Bangladesh            Pakistan
Number of children retested            101                   52
Global screening result
  Sum of problems reported on the
  ten questions                        0.58 (0.43, 0.69)     0.83 (0.73, 0.90)
  Screened positive (positive on
  any of the ten questions)            0.48 (0.24, 0.72)     0.67 (0.37, 0.97)
Individual items
  1. Milestones                        0.49 (-0.12, 1.0)     0.79 (0.38, 1.0)
  2. Vision                            1.00 (0.80, 1.0)      †
  3. Hearing                           0.52 (0.15, 0.89)     †
  4. Comprehension                     †                     †
  5. Movement                          0.49 (-0.12, 1.0)     0.79 (0.39, 1.0)
  6. Seizures                          0.32 (-0.18, 0.57)    0.66 (0.03, 1.0)
  7. Learning                          †                     1.00 (1.0, 1.0)
  8. No speech                         †                     †
  9. Unclear speech                    0.58 (0.21, 0.95)     1.00 (0.73, 1.0)
  10. Slowness                         †                     -0.02 (-0.05, 0.01)

*For dichotomous variables, confidence intervals were computed with non-null standard errors, except when the point estimate of the kappa coefficient was equal to 1.
†The kappa coefficient could not be calculated because no children in the sample had a positive response to this question.
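The factor-analytic step described in the Methods can be sketched as follows. This illustrative version substitutes a classical cosine approximation for the maximum-likelihood tetrachoric correlation and a principal-component eigendecomposition for the unweighted least squares fit, so it approximates rather than reproduces the published analysis:

```python
import numpy as np

def tetrachoric_approx(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine-formula approximation to the tetrachoric correlation of two
    0/1 items; adequate for a sketch, not for formal inference."""
    a = np.sum((x == 1) & (y == 1)) + 0.5  # 0.5 added to avoid empty cells
    b = np.sum((x == 1) & (y == 0)) + 0.5
    c = np.sum((x == 0) & (y == 1)) + 0.5
    d = np.sum((x == 0) & (y == 0)) + 0.5
    return np.cos(np.pi / (1.0 + np.sqrt((a * d) / (b * c))))

def one_factor_loadings(items: np.ndarray) -> np.ndarray:
    """Loadings on the first principal component of the item correlations."""
    k = items.shape[1]
    r = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            r[i, j] = r[j, i] = tetrachoric_approx(items[:, i], items[:, j])
    eigval, eigvec = np.linalg.eigh(r)       # eigenvalues in ascending order
    return np.sqrt(eigval[-1]) * np.abs(eigvec[:, -1])  # sign fixed for display

rng = np.random.default_rng(3)
items = rng.integers(0, 2, size=(500, 10))   # hypothetical screen responses
print(one_factor_loadings(items).round(2))
```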
The two alternative methods for evaluating the reliability and cross-cultural comparability of individual items, factor analysis and item characteristic curves, are not as constrained by sample size as the test-retest approach because they make use of data on all children screened. The factor loadings of the ten questions are notably consistent across the three populations (Fig. 2). In all three populations, the questions on motor (Questions 1 and 5) and cognitive (Questions 4, 7, 8, 9 and 10) disability have high loadings, indicating that they are correlated with a common factor and hence reliable. Also in all three countries, the loadings for the questions on vision (Question 2), hearing (Question 3) and seizures (Question 6) are relatively low, indicating either unreliability or that those items each measure something distinct from cognitive and motor disability. The pattern of eigenvalues [7] from the factor analysis unequivocally suggests a one-factor model for the ten questions in each of the three countries (data not shown).

Fig. 2. Factor loadings for each of the ten questions in the three populations, estimated by unweighted least squares, one-factor model.

Item characteristic curves were constructed for all ten items, but only the curves for three exemplary items are shown [Fig. 3(a-c)]. To construct these curves, we plotted the proportion responding positively to each item for groups of respondents defined using the sum of only the motor and cognitive items (excluding the item under consideration), because it was only these seven items that formed a scale with high factor loadings on the common factor; the construction is sketched below.

Fig. 3. Item characteristic curves for three of the ten questions, three populations. (a) Unclear speech (Question 9); (b) Slow (Question 10); (c) Vision (Question 2).

In all three populations, the questions on milestones (Question 1), movement disability (Question 5), learning (Question 7), no speech (Question 8) and unclear speech (Question 9) have curves that are steep in the center of the graph and, therefore, appear useful for distinguishing children with disability (as measured by the six motor and cognitive items other than the one being considered). The curves for Question 9 exemplify this pattern and are shown in Fig. 3(a). Question 10 (on slowness) shows similarly steep curves in Bangladesh and Pakistan [Fig. 3(b)]. In Jamaica the curve for this item rises steadily, but the probability of a positive response to this question is high even for those with few positive responses to the other questions [Fig. 3(b)]. In all three populations, the curves for Questions 2 (vision), 3 (hearing) and 6 (seizures) show only a weak, if any, positive association with the sums of the seven motor and cognitive questions.
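The curve construction just described amounts to grouping children by their rest-score on the cognitive-motor scale and computing the proportion endorsing the item at each score. A minimal sketch with hypothetical data (the mapping of question numbers to column indices is an assumption of the example):

```python
import numpy as np

def item_characteristic_curve(items: np.ndarray, item: int,
                              scale: list[int]) -> dict[int, float]:
    """Proportion endorsing `item` at each value of the rest-score, i.e. the
    sum of the other scale items (the item itself is excluded)."""
    rest = [j for j in scale if j != item]
    rest_score = items[:, rest].sum(axis=1)
    return {int(s): float(items[rest_score == s, item].mean())
            for s in np.unique(rest_score)}

rng = np.random.default_rng(4)
items = rng.integers(0, 2, size=(500, 10))  # hypothetical 0/1 responses
cognitive_motor = [0, 3, 4, 6, 7, 8, 9]     # Questions 1, 4, 5, 7, 8, 9, 10
print(item_characteristic_curve(items, item=8, scale=cognitive_motor))
```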
For example, the curves for Question 2 [Fig. 3(c)] show that among children with positive responses to most of the seven cognitive and motor items, no more than 25% had reported problems with vision. These observations are consistent with the factor loadings in suggesting that either these three questions are unreliable (contain stochastic noise or random error) or that they measure something distinct from cognitive and motor disability.

Although we have stressed similarities in their shapes, the item characteristic curves are not identical across the populations. The Mantel-Haenszel test results (Table 3) reveal that all ten items show significantly different relationships to severity of disability (measured by the sum of positive responses to other items) in at least two of the cross-cultural comparisons. Most of the items are significantly different in all three pairwise comparisons (i.e. confidence intervals exclude 1; Table 3). This Mantel-Haenszel test is used by psychometricians to detect item "bias" in the following sense: holding constant the estimated degree of overall disability, the probability of specific reported problems appears to differ across samples. Though significant due to large sample sizes, the differences are relatively modest for most items except Question 10 (slowness). Children who have few other problems are much more likely to be reported to be slow by parents in Jamaica than by parents in the other two cultures (the summary odds ratio for Question 10 is 22.8 when Jamaica is compared to Bangladesh and 10.6 when Jamaica is compared to Pakistan). Thus, if we were to rely on Question 10 to determine whether one population of children has a greater proportion of cognitively disabled children than another, we would most certainly obtain a biased impression. On the other hand, we note that the question is not without some strengths. Jamaican parents, like those in the other two cultures, are much more likely to attribute slowness to the children with apparent multiple or severe disabilities than to those without [Fig. 3(b)].

DISCUSSION

We have compared the reliability of the Ten Questions screen in three different cultures as a means of assessing the extent to which it achieves its goal of cross-cultural comparability. A secondary purpose of this paper has been to demonstrate how multiple methods of assessing reliability can provide complementary perspectives on the extent to which a questionnaire produces systematic data. Four of the five methods we used provided useful information on the reliability and cross-cultural comparability of the Ten Questions. These included the methods of:

(1) Test-retest reliability of the global screening result, which showed good consistency over time in the screening results for samples of children for whom repeated administrations of the screen were available. Similar levels of reliability (coefficients in the range of 0.6-0.8) have been reported for other instruments designed to detect disability, such as the Vineland Adaptive Behavior Scales [24] and the Mental Function Index [25].

(2) The computation of alpha coefficients, which indicated good and comparable levels of inter-item consistency in all three populations.

(3) Factor analysis, which produced factor loadings indicating high levels of reliability for seven of the ten items in all three populations.
(4) The construction of item characteristic curves, which demonstrated some consistency across the three cultures in the relationship of specific items to the common factor measured by the screen.

A fifth method of assessing reliability, test-retest reliability of individual items, was not informative about the reliability or cross-cultural comparability of individual questions because stable estimates of kappa could not be made for rare problems with retest sample sizes even as large as 101 in Bangladesh and 52 in Pakistan. Because epidemiologists commonly study conditions of low prevalence, and because replication studies involving larger samples are rarely practical, these results illustrate a common limitation of test-retest studies in epidemiology. This limitation was avoided by the internal consistency approaches, which were able to make use of data on all 22,125 children screened.

Within the test-retest strategy, an alternative approach is to carry out studies in a "fortified" (e.g. patient) sample, with a high prevalence of disorder. Thompson and Walter [26], however, point out that the reliability of a measure in a high prevalence sample cannot necessarily be generalized to a community population where prevalence is low, since some measures of reliability are affected by prevalence. The reliability of a measure is best evaluated in a sample that is representative of the population for which the measure is intended.

Table 3. Mantel-Haenszel odds ratios* (95% confidence intervals) indicating variations between countries in the odds of positive responses to each of the seven cognitive-motor questions on the Ten Questions screen, among children with matching scores on the sum of the remaining six cognitive-motor questions

Index population:        Pakistan             Jamaica              Jamaica
Reference population:    Bangladesh           Bangladesh           Pakistan
Questions
  1. Milestones          1.43 (1.09, 1.89)    0.41 (0.30, 0.57)    0.27 (0.20, 0.36)
  4. Comprehension       0.30 (0.17, 0.51)    0.71 (0.43, 1.16)    2.48 (1.41, 4.39)
  5. Movement            0.72 (0.53, 0.98)    0.44 (0.31, 0.63)    0.58 (0.42, 0.81)
  7. Learning            0.50 (0.28, 0.90)    1.00 (0.60, 1.69)    2.03 (1.22, 3.26)
  8. No speech           0.34 (0.20, 0.56)    0.61 (0.38, 0.99)    1.83 (1.11, 3.03)
  9. Unclear speech      1.53 (1.18, 1.98)    0.43 (0.32, 0.59)    0.29 (0.22, 0.37)
  10. Slowness           2.24 (1.54, 3.25)    22.8 (16.4, 31.6)    10.61 (8.34, 13.50)

*An odds ratio greater than 1 indicates that the probability of a positive response to the question is greater in the index population than in the reference population, among children with matching scores on the sum of the remaining six questions; an odds ratio less than 1 indicates that the probability of a positive response is lower in the index population than in the reference population.

The similarity across cultures in the patterns of factor loadings and item characteristic curves indicates considerable cross-cultural comparability in the ways that the ten questions correlate with each other. The high loadings and steep curves for the seven cognitive and motor items in all three populations indicate that these items are reliable in each of the populations studied.
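The summary odds ratios in Table 3 are of the standard Mantel-Haenszel form, pooling stratum-specific 2 x 2 tables across levels of the rest-score. The sketch below applies that estimator to hypothetical data (with unrelated random inputs the ratio will be near 1):

```python
import numpy as np

def mh_odds_ratio(endorse: np.ndarray, index_pop: np.ndarray,
                  strata: np.ndarray) -> float:
    """Mantel-Haenszel summary odds ratio comparing item endorsement in an
    index vs a reference population, within strata of the rest-score."""
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(m & (index_pop == 1) & (endorse == 1))  # index, positive
        b = np.sum(m & (index_pop == 1) & (endorse == 0))  # index, negative
        c = np.sum(m & (index_pop == 0) & (endorse == 1))  # reference, positive
        d = np.sum(m & (index_pop == 0) & (endorse == 0))  # reference, negative
        n = a + b + c + d
        if n:
            num += a * d / n
            den += b * c / n
    return num / den

# Hypothetical inputs: one item, a 0/1 country indicator, rest-scores 0-6.
rng = np.random.default_rng(5)
endorse = rng.integers(0, 2, 2000)
index_pop = rng.integers(0, 2, 2000)
strata = rng.integers(0, 7, 2000)
print(mh_odds_ratio(endorse, index_pop, strata))
```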
In all three populations the questions on vision, hearing and seizures have the lowest factor loadings as well as flat item curves, suggesting that these questions may consistently identify two different groups of children with vision, hearing or seizure disabilities: one group with cognitive and/or motor disability and the other group without these complications.* Thus, the lack of evidence of reliability for the questions on vision, hearing and seizures could reflect the heterogeneity of childhood disability and its causes. Both the factor analysis and item characteristic curve methods of evaluating the reliability of individual items assume those items measure a common factor or trait that is measured by the remaining items in the scale. If this assumption does not hold for the vision, hearing and seizure questions, these methods are not informative about the reliability of those items. Our inability to demonstrate reliability of the questions on vision, hearing and seizures, therefore, does not necessarily imply that these questions are not useful. A previous analysis of the validity of the Ten Questions suggested that dropping any one item would result in a loss of sensitivity [4].

The item characteristic curves for Question 10 (on slowness) reveal a striking difference across countries, suggesting that Question 10 is a biased item and should be used with caution in comparing children of different cultural backgrounds. Although the cross-cultural differences in the item characteristic curves are most striking for Question 10, the Mantel-Haenszel statistics revealed statistically significant differences across the populations for every individual item. Though it is true that the large sample sizes render even small differences statistically significant, these findings nevertheless serve to remind us that neither the individual items nor their total sum should be used uncritically to compare levels of disability across cultures. This limitation of the screen underscores our initial intent and previous recommendation [4,12,13,17,27] that the questions be used as a first-phase screen rather than an ultimate measure of disability. As long as the children who appear to have problems according to the screen are assessed in a second phase by professionals or others trained to take cultural norms and practices into account, then the potential bias, for example of over-identification of disability in Jamaica based on Question 10, is not problematic. The fact that the majority of the questions showed the expected relation to the overall level of disability [Fig. 3(a)] was reassuring concerning their utility as screening questions.

In conclusion, these results provide considerable evidence that the Ten Questions as a whole is a reliable questionnaire and that indicators of its reliability are comparable across populations that differ in culture and level of socioeconomic development. We would not be able to support this conclusion had the reliability analysis been limited to a test-retest study of a few hundred children.

*As with vision, hearing and seizure problems, motor disabilities such as cerebral palsy are by no means always complicated by cognitive disorder. Yet the factor loadings for the question on movement disability are high in all three countries. This may reflect the fact that serious mental retardation is typically associated with delayed motor milestones, even if the child has no specific movement disorder.
The use of multiple methods of assessing the reliability of the Ten Questions has shown notable consistency as well as an important item-specific difference across cultures. The approach to assessing reliability demonstrated here has broad applications in epidemiology, especially when the goal is to gain a comprehensive understanding of the reliability of one's data or to assess the comparability of data collected from diverse groups or settings.

Acknowledgements-This work was supported by the BOSTID Program of the National Academy of Sciences (U.S.A.), the Epilepsy Foundation of America, the National Institute of Neurological Diseases and Stroke (R29 NS27971-01, R29 NS27971-02, R29 NS27971-03, R29 NS27971-04), the New York State Psychiatric Institute, and the Gertrude Sergievsky Center of Columbia University. The authors would like to acknowledge the contributions of Drs Zena Stein, Lillian Belmont, Zaki Hasan, Mervyn Susser, Marigold Thorburn and others to the design and conduct of this research project.

REFERENCES

1. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. Belmont, CA: Lifetime Learning Publications; 1982.
2. Feinstein AR. Clinical Epidemiology. Philadelphia: WB Saunders Co.; 1985.
3. Zaman S, Khan N, Islam S et al. Validity of the Ten Questions for screening serious childhood disability: results from urban Bangladesh. Int J Epidemiol 1990; 19(3): 613-620.
4. Durkin MS, Davidson LL, Hasan ZM et al. Validity of the Ten Questions screen for childhood disability: results from population-based studies in Bangladesh, Jamaica and Pakistan. Epidemiology 1994; 5(3): 283-289.
5. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
6. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334.
7. Harman HH. Modern Factor Analysis, 2nd Edition. Chicago, IL: University of Chicago Press; 1967.
8. Suen HK. Principles of Test Theories. Hillsdale, NJ: Lawrence Erlbaum Associates; 1990.
9. Spearman C. Correlation calculated from faulty data. Br J Psychol 1910; 3: 271-295.
10. Brown W. Some experimental results in the correlation of mental abilities. Br J Psychol 1910; 3: 296-322.
11. Shrout PE, Yager T. Reliability and validity of screening scales: effects of reducing scale length. J Clin Epidemiol 1989; 42: 69-78.
12. Durkin MS, Zaman S, Thorburn M et al. Screening for childhood disability in less developed countries: rationale and study design. Intern J Ment Health 1991; 20: 47-60.
13. Belmont L. Screening for severe mental retardation in developing countries: the International Pilot Study of Severe Childhood Disability. In: Berg JM, Ed. Science and Technology in Mental Retardation. London: Methuen; 1986: 389-395.
14. Thorburn MJ, Desai P, Durkin MS. A comparison of the key informant and the community survey methods in the identification of childhood disability in Jamaica. Ann Epidemiol 1991; 1: 255-261.
15. Thorburn MJ, Desai P, Davidson LL. Categories, classes, and criteria in childhood disability: experience from a survey in Jamaica. Disability Rehab 1992; 14(3): 122-132.
16. Thorburn MJ, Desai P, Paul TJ et al. Identification of childhood disability in Jamaica: the ten question screen. Intern J Rehab Res 1992; 15: 115-127.
17. Durkin MS, Davidson LL, Hasan ZM et al. Screening for childhood disability in community settings. In: Thorburn MJ, Marfo K, Eds.
Practical Approaches to Childhood Disability in Developing Countries: Insights From Experience and Research. Jamaica: 3D Projects; 1990: 179-197.
18. Durkin MS, Davidson LL, Hasan ZM et al. Estimates of the prevalence of childhood seizure disorders in communities where professional resources are scarce: results from Bangladesh, Jamaica and Pakistan. Paed Perinatal Epidemiol 1992; 6: 166-180.
19. Stein ZA, Durkin MS, Davidson LL et al. Guidelines for identifying children with mental retardation in community settings. In: Assessment of People with Mental Retardation. Geneva: World Health Organization; 1992.
20. Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York: Wiley; 1981.
21. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: Wiley; 1986.
22. Mislevy RJ. Recent developments in the factor analysis of categorical variables. J Educ Stat 1986; 11: 3-31.
23. Holland PW, Thayer DT. Differential item functioning and the Mantel-Haenszel procedure. In: Wainer H, Braun HI, Eds. Test Validity. Hillsdale, NJ: Lawrence Erlbaum Assoc; 1988.
24. Sparrow SS, Balla DA, Cicchetti DV. The Vineland Adaptive Behavior Scales, Survey Manual. Circle Pines, MN: American Guidance Service; 1984.
25. Pfeffer RI, Kurosaki TT, Chance JM et al. Use of the Mental Function Index in older adults: reliability, validity, and measurement of change over time. Am J Epidemiol 1984; 120(6): 922-935.
26. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol 1988; 41(10): 949-958.
27. Susser MW. Mental retardation and handicap in the developing world: an overview of broad issues. Intern J Ment Health 1981; 10: 117-119.

APPENDIX

The Ten Questions

1. Compared with other children, did the child have any serious delay in sitting, standing or walking?
2. Compared with other children does the child have difficulty seeing, either in the daytime or at night?
3. Does the child appear to have difficulty hearing?
4. When you tell the child to do something, does he/she seem to understand what you are saying?
5. Does the child have difficulty in walking or moving his/her arms or does he/she have weakness and/or stiffness in the arms or legs?
6. Does the child sometimes have fits, become rigid, or lose consciousness?
7. Does the child learn to do things like other children his/her age?
8. Does the child speak at all (can he/she make himself/herself understood in words; can he/she say any recognizable words)?
9. For 3- to 9-year-olds ask: Is the child's speech in any way different from normal (not clear enough to be understood by people other than his/her immediate family)? For 2-year-olds ask: Can he/she name at least one object (for example, an animal, a toy, a cup, a spoon)?
10. Compared with other children of his/her age, does the child appear in any way mentally backward, dull or slow?