instrumental variable methods for the estimation of test score reliability

Journal of Educational Statistics Fall 1983, Volume 8, Number 3, pp. 223-241 INSTRUMENTALVARIABLEMETHODSFOR THE ESTIMATIONOF TEST SCORERELIABILITY RUSSELL ECOB Inner London Education Authority AND HARVEY GOLDSTEIN University of London Institute of Education EducaKey words:InstrumentalVariables,Errorsin Variables,Reliability,Longitudinal, tionalAttainment method of InstrumentalVariablesis suggestedas an alternativeto traditionalmethods for estimatingthe reliabilityof mental test scores which avoids certaindrawbacksof thesemethods.The consistencyand efficiencyof the instrumental variablemethodare examinedempiricallyusing data from the BritishNational Child DevelopmentStudyin an analysisof 16 year, 11 year and 7 year old scoreson tests of mathematicsand reading. ABSTRACT.The In the simple regression model (with the usual assumptions about the error term u) y = a + fix + u, it is well known (Goldstein, 1979) that if the observed independent variable x contains errors of measurement and we wish to estimate the regression coefficient on the "true" value of x, then the ordinary least squares (OLS) estimator is inconsistent. The simplest and most common model relating the true value to the observed value of x is x = T+ e, (1) where T is the true value, e the random error of measurement, and Cov(T, e) = 0. It is supposed, therefore, that we wish to estimate the parameters a, /Pin y = a+ pT+ e + fixx + (e - Pie). =a + It is because x is correlated with (e - fpe) that the OLS estimator, b, is an inconsistent estimator of ft. A consistent estimator is given by b/R, where R is the reliability of x and defined as R = Var(T)/Var(x), where Var(x) = Var(T) + Var(e). 223 224 Ecob and Golstein In many situations,the value of R is very close to 1, and any adjustmentto the usualestimatecan be safely ignored.In other applications,for examplein mental testing, R may be considerablyless than 1 so that an adjustment becomesnecessary.In a linear model with severalfurtherindependentvariables, the estimators of the associated regressioncoefficients will also be inconsistentif OLS is used. Consistentestimatesmay be obtainedby adjusting the observed covariance matrix of the independentvariables so that the observedvariancescorrespondingto variablescontainingmeasurementerror have estimatesof their measurementerrorvariancesubtractedprior to inversion of the matrixand calculationof the coefficients(Warren,White,& Fuller, 1974).To do this, it is importantto have accurateand consistentestimatesof the measurementerrorvariances,or alternatively,reliabilities.In this paperwe exploresomenew proceduresfor obtainingsuch estimatesbasedon instrumental variabletechniques. ReliabilityEstimation A large body of literatureexists concerningproceduresfor obtaininggood estimatesof reliabilityfor mental tests that consist of a fixed set of response items. Details can be found in Lord& Novick (1968, Chapter4). The methods that havebeen proposedcan be dividedusefullyinto "internal"and "external" methods. Internal Estimates of Reliability In the internal methods, various relationshipsbetween the test item responses are used to providereliabilityestimates,of which the best known is "coefficientalpha."Undergeneralassumptionsaboutthe conditionalindependence of item responses,and constancy of measurementerror variances,a useful lower bound on the reliabilitycan be estimated,which under certain furtherassumptionsleads to a consistentestimateof the reliability.We write, similarlyto (1), xji = Tji+ eii, wherexi is the observedvalue of itemj for individuali, Tjiis the truevalue of item j for individual i, and eji is the measurementerror. A conditional independenceassumptionis usuallymade;thatis, for givenTj,,Ti,j :j' Cov(ej,ie-,) = 0. (2) The conditionsfor a consistentestimatorof the reliabilityare ji - aj+ >, (3) whereaj and T;can be interpretedas the difficultyof itemj and the abilityof individuali. This implies that the items are not only "unidimensional"but essentially equivalent apart from a location shift. If (2) does not hold, then Instrumental VariableMethods 225 coefficientalpha is not necessarilya lower bound and may even overestimate the truereliability. We see, therefore, that there are two serious drawbacksto 'internal" estimatesof reliability.First, thereis the difficultyof satisfying(3) to obtain a consistent estimatorof the reliability,or equivalently,of the measurement errorvariance.While there is a considerableliteratureon this and the more in practiceit seemsextremelydifficultto generalconceptof unidimensionality, ensurethat item responsesare indeeddeterminedby a singlequantityfor each individualsuch as givenby (3). For the types of educationaltests we deal with in this paper,it seems even less likely that a unidimensionaltraitis operating. A moredetaileddiscussionof this topic is givenby Goldstein(1980).Secondly, assumption(2), often known as the "local independence"assumption,seems somewhat unreasonablea priori. It is difficult to imagine that if a given individualfails one item then the probabilitiesof successon lateritems are the same as when the individualsucceedson the earlieritem. Nevertheless,there seems to have been little, if any, serious study of this problem and the consequenteffect of nonzero correlationson reliabilityestimates.A further discussion of this point in this context of latent trait models is given by Goldstein (1980). Thus, there is as yet no really satisfactorymethod for obtaininga consistentestimateof reliabilityusing "internal"methods,nor of even providing a lower bound. We suggest that estimates based on these methodsshouldbe treatedwith somecaution. External Estimates of Reliability The most obvious method of estimatingreliabilityor measurementerror varianceis to carry out repeat measurements.Thus, we have (droppingthe suffix i), for two applicationsof a test, X, = T + e,; X2-= T+ e2; and Var(X, - X2)= Var(e, - e2)= 2a,2 - Cov(e , e2)}, where a,2 denotes the constantmeasurementerrorvariance.For many physicalmeasurementsit is reasonableto assumeindependenceerrors,that is, Cov(e,, e2) = 0, so that we havea,2 -= Var(Xi - X2). For mentaltests, however,this usuallywill not be a reasonableassumptiondue to the presenceof memoryeffects, learning, and so on. If morethanone test relatingto the sameattributeis available,then by assumingsuitablerelationshipsbetween the true scores on the tests, it is possible to obtain reliabilityestimates.The usual assumptionis that the tests are congenericso that for a set of p tests, X,= a + bjT+ e1, j= 1,...,p. Ecob and Golstein 226 The observedcovariancematrixof the Xi contains p( p + 1) elementsand if we assumeCovj,i(ejej,)= 0 and Cov(Tjej)= 0, the matrixis a function of the bj and errorvariancesoj, which gives 2p parameters.Hence, for threeor more tests, unique estimates,based for exampleon maximumlikelihood,are available.Details of this approachare given in Joreskog(1971).Althoughit is not quite as seriousas in the simple test-retestcase, this method also has the difficulty that the measurementerrors of the tests may be correlated,for example,because of day to day fluctuationsamong examinees,and the like. This immediatelyraises the questionof definitionof true score, but we shall postponea discussionof that until a latersection. In the next sectionwe proposea generalizationof congenerictests to include any variablehavingnonzerocorrelationwith the test whose reliabilitywe wish to measure.Such an "instrumentalvariable"does not requireany assumption about unidimensionalityor independence;also, unlikethe simpletest-retestor the congenerictest models,the possibilityof choosingany variablemeansthat we can searchfor those that arelikelyto be uncorrelatedwith the measurement error eP. The possibility of dropping both these restrictiveassumptionsis attractiveand the remainderof the paper investigatesthis problemusing an extensivelongitudinaldata set. The Data The data come from the National Child DevelopmentStudy (NCDS) that followedup a cohort of 17,000childrenborn in one week of March 1958, at the ages of 7, 11 and 16. The childrenbelonged to the first year-groupfor whom the minimumschool-leavingage was 16 years. A descriptionof the social and educationaldata (amongothers)collectedat these ages is given in Fogelman(1976). Testingand scoringin the NCDS was carriedout by the class teacherwho also noted the child'sachievements.Becausethe studywas a nationalstudy of all childrenborn in a particularweek, most childrenselectedwere testedin a differentsituationand by a differenttester. Four possiblesituationsgivingrise to responsevariationare as follows: 1. the environmentin which the test is administered; 2. the processof test administration; 3. the coding and scoring of the test (this includes the interpretationof the correctnessof the response);and 4. day-to-dayvariationin individualtest performance. Becauseonly one test of a given type was done by each child at each occasion, the sourcesof variation1-4 aboveare confounded.It is important,however,to distinguish"day-to-day"variationfrom changesin true score over time. We can regardvariationover time as contributingeitherto measurementerroror to true score variation or to both. A reasonable estimate of the true score at a Instrumental VariableMethods 227 particularmomentwould be obtainedfrom a movingaverageof scores taken at successivetimeintervalsbeforeand after.The continuouschangein truetest scoreover,say, a weekis thereforeregardedas beingsupplementedby random error to produce the observedday-to-dayvariation.The variouseducational measurementsin the NCDS were completedwithin a week for each child, so that any true scorechangesover not morethan a 1-weekperiodare effectively regardedas part of day-to-day variation. In addition to these sources of measurementerror,there will typicallyremainan unexplainedvariationthat can be conceptualizedas the variationbetweenthe responseto an item and its hypotheticalreplication. Cronbach,Gleser,Nanda, & Rajartnan(1972) arguethat test evaluationor studies,whichalso view a particulartest as a samplefrom a "generalizability" universeof tests and which use experimentaldesigns to estimateindividually the above components of variation, should be carried out prior to test administration.We use here a "test-specific"interpretationof true score that treatstruescoreas relevantonly to the particulartest. A justificationfor this is given by Goldstein (1979), althoughthe methods used in this paper can be extended to a full "generalizability"approach.Goldstein (1979), using the same NCDS data, also drew attentionto the use of instrumentalvariablesin estimatingthe relation between mathematicsand reading attainmentswhen measuredat different ages. He emphasizedthe potential usefulnessof this method when impreciseprior knowledgeabout the reliabilityof the earlier attainmentscoresis available,and pointedout that little was knownabout the degree to which the instrumentalvariablesused satisfied the conditions of consistency. In this paper, the propertiesof a variety of instrumentalvariables are examinedin the context of the regressionof 16 years attainmenton 11 and 7 years attainmentfor mathematicsand readingtest scoresseparately.Comparisons are made with the use of ordinaryleast squaresand also with the use of the internalestimatesof the reliabilitycoefficientsfor the 11-yearattainment givenin Goldstein(1979). VariablesEstimation Theoryof Instrumental GeneralTheory SupposeX1i,X2i are the observedvaluesof test score variablesmeasuredas deviationsfrom theirmeans at the first and second occasionsand let thembe the predictorand dependentvariables,respectively,in a simple linear regression model.Furthermore,denote theirtruevaluesas Tli and T2, and the errors of observationor measurementerrors for the ith subject as eUi and e2i (i = 1,...,n). Then,as before: Xi = Tli + eli; X2i T2i + e2i, (4) 228 Ecob and Golstein and a modelrelatingthe truevaluesat eachoccasionis T2i Tli+ ui. (5) If Z1is theobservedvalueof anothervariable,calledtheinstrumental variable, then n n bIv = ZiX2i (6) ZiXi is called the instrumentalvariableestimatorof the regressioncoefficient 3 (Johnston,1972,p. 279). From(4), (5) we have biv=-(fP ZTi + Ziui+ Ze2i)( Z,Xi)', and bv - p = (2 Zu, + Zie2i - P I2Zieli)(I ZiXli)-'. In terms of the sample correlations,rze, 2Ze and rze, betweenthe instrumentalvariableZ and the disturbanceand errors of measurementof X2, X,, respectively,and the reliabilityR of X1, bIv - Urzu + Ze/2 e2rze2 e zx, rze (7) whereae,, ae2, u,are, respectively, the standarddeviationsof the errorson the 1st occasion,2nd occasion,and the disturbanceterm.As the samplesize tends to infinity,the followingconsistencyconditionis obtained: + ouPZU aezPZe2- = 0. floaePZe, Thus, a generalcondition for consistencyis that the three correlationspzu,, PZe29 PZe, are all zero. Equation (6) shows that if the predictorand dependent variablesare interchanged,the instrumentalvariable estimatorbecomes its reciprocal. The efficiency of the instrumentalvariableestimatoris the squareof the multiplecorrelationof the instrumentalvariableset with the predictor.This assumesthat the estimatoris consistent. The Use of Many Instrumental Variables Whenwe havep instrumentalvariablesZ, j = 1,...,p, blv = 2 t ij jZijX2i 1 ( c i ZijXli. (8) j The combinationof Z, that gives the most efficient estimate of biv can be found by choosing the constant cj so that Corr(YIc-Z.-, Xi) is a maximum. This is the usual two-stage,least squaresestimator(Johnston, 1972, p. 380). The ci are then the sample regression coefficients, bj, of X, on Z, j = 1,...,p. Letting Xi =-- b,Z, , we obtain biv = (2 XliX2i)/E XIXi,. Instrumental VariableMethods 229 S The Use of Dummy Variablesas Instrumental Variables The previousdiscussionhas assumedthe existenceof instrumentalvariables that can be modelled as having simple linear relationshipswith the first occasion variables.Two other cases can be distinguished.First, where an intervalscaled instrumentalvariablehas a nonlinearrelationshipto the first occasionvariable,and second,when the instrumentalvariableis categoric(e.g., measuredon an ordinalor nominalscale). In the first case the nonlinear relationshipcan be modelled, say by a polynomial function, or the instrumentalvariablecan be groupedinto categories.In the lattercase, eachcategorycan be representedin the usualway by a dummy variable.This takes the value 1 for this categoryand 0 for every other category. Let X,,,, XIrTbe two observations on the first occasion variable belonging to the same instrumentalvariablecategory,r. Using the dummy instrumentalvariablesto estimatethe first occasionvariablegives the estimate r,,whichis the meanvalueof all observationsin categoryr. Substitutingin (8) gives biv = r p, X2 2r X, r prSir)-2 where X2r is the mean of the X2r in the category r, and Pr is the proportion in category r. This is essentiallythe "Method of grouping"as introducedby Wald(1940). The literatureon groupingmethods(e.g., see Madansky,1959; Neyman & Scott, 1951;Wald, 1940)using observedvalues of X, has tended to focus on conditionsfor consistentestimatesand on the relativeefficiencyof different groupingsratherthan quantifyingthe inconsistencyof variousgroupingmethods. The resultson the NCDS data, given below, go some way to remedying this situationfor a particulardata set. VariableMethodsto the NCDS Data Applicationof Instrumental Selection of Variables In all, 50 variablesare consideredas instrumentalvariables,being measured at ages 7, 11 and 16. These consist of test scores, teacher ratings and backgroundvariables.The test scoresare of readingand mathematicsat each age, and in addition,of generalability and copyingdesigns scoresat age 11. The teacherratingsareof readingand mathematicsat all ages,and in addition, of oral abilityand creativityat age 7, of oral abilityand generalknowledgeat age 11, and of practicalsubjectsat age 16. The "background"variablesare socialclass and indicesof behaviorin the home at all threeages;the numberof childrenin the household,birthorderof the child,an index of accommodation facilities and childrens' heights at ages 11 and 16, overcrowding at 11; and 230 Ecob and Golstein region, indices of school behavior,and a varietyof feelings towardschool at 16. The readeris referredto Davie, Butlerand Goldstein(1972)and Fogelman (1976)for a morecompletedescriptionof thesevariables. The variablesused as dependentvariablesin the regressions,that is, the test scores at age 11 and 16 of mathematicsand reading,are transformedto have standardnormal distributionsin the same way as in Goldstein (1979) who showedthatnearlinearrelationshipsbetweenobservedscoresresulted. As the relativeefficiencyof an instrumentalvariableestimatoris proportional to the squareof the correlationwith the predictor,variablesare only retained for furtheranalyseswhen this correlationis greaterthan 0.3. This eliminatesall the "background"variablessave socialclass,but only one of the teacherratings(that of outstandingabilityin any area at age 11) and none of the test scores,leaving25 variablesin all. All cases with missingvalueson any of these 25 variablesare excluded,leaving5,371 cases with test scoresat each age. In the appendixto Fogelman(1976), Goldsteinshows that the attritionof subjectsin the study does not affect to any markedextent the relationships found and that test scores for subjectshaving missing values on the backgroundvariablesshowno significantdifferencesfromothersubjects. In the resultsreportedhere, all instrumentalvariablesare treatedas sets of dummyvariablesfor reasonsgiven earlierin that section. For the test scores the dummyvariablecodinginto five roughlyequalsize groupsensuresthat the relativeloss in predictiveefficiencyfromdummyvariablescomparedto simple linearregressionis alwaysless than6 percent. In fact, using these instrumentalvariablesas dummyvariablesor as interval scale variablesgives very similar results, with the maximum differencein regressioncoefficientsfor any instrumentalvariablebeing2 percent. Forming Hypothesesof Error Structurein PredictionRelations As in the section on ExternalEstimatesof Reliabilitywe assume that the true scoreof a test comprisesthat componentof test scorethatis unaffectedby day-to-dayvariationby the particulartestor or by the test situation,and is specific to the particulartest used. We now examine the correlationof the variablesto be consideredas instrumentalvariableswith measurementerrors on the first occasionand secondoccasiontests and with the disturbanceterm. These variablesare teacherratingson a varietyof attainments,test scoresand social class. The teacher ratings, like the test scores, generallywill contain measurementerror, thus reflectingand being reflectedby variationsin the child'sinterestin subjectsand day-to-dayvariationsin the type of relationship to the teacher.Thus, a teacherwho has very recentlyseen a child do a good piece of work or show a keen interest may tend to rate him higher than Instrumental VariableMethods 231 otherwise.If the same contributoryfactors affect test score, then a teacher ratingmade at the same time as the test would be expectedto have a positive correlationwith the test score measurementerror.Likewise,wherea different attainementis ratedat the sametime as the test, similarcorrelationsmay exist, althoughpresumablythey wouldbe smaller. Thereare two variablesthat we hypothesizewill have a zero correlationwith test score measurementerror.These are teachers'ratingstaken at a different point in time and social class. We would expect none of the sources of measurementerrorto relate to teacherratingat a point in time 4 or 5 years away. Nor would we expect social class, which shows little variationfor an individualover short time periods,to relateto any of the sourcesof measurement error. Finally,we examinethe relationof the disturbancetermsto teacherratings and social class. Either of these variables,particularlysocial class, may be correlatedwith the disturbancesif they relate to the dependentvariableonce the predictorvariablehas been controlledfor. The hypothesesformulatedabovemay be summarizedas follows: H1. Teacherratingson a test, wherethe ratingis at the same time as the test, will be positivelycorrelatedwith test scoremeasurementerror. H2. Teacher ratings on a different attainmentfrom that tested, where the ratingis at the same time as the test, will be positivelycorrelatedwith test scoreerrorbut to a lesserextentthan for the sameattainment. H3. Teacherratingswhen the child is at a differentage from that of the test will be uncorrelatedwith test score measurementerrorwhetheror not the sameattainmentis tested. H4. Teacher ratings are not correlatedwith disturbanceterms from the regressionof second-occasionscoreon first-occasionscore. H5. Socialclassis correlatedwith disturbanceterms. H6. Socialclass if not correlatedwith test score measurementerror. Generally,teacherratingsare held to correlatewith test score measurement errorsonly when tested at the same time as the test and not to be correlated with equationdisturbances.In contrast,socialclass is hypothesizedas correlating with disturbancesbut not with test score measurementerrorwhen measuredat the sametime as the test. Examiningequation (7) we see that these six hypothesesgive rise to the followingpredictions.The hypothesesgiving rise to each predictionare given in bracketsafterthe prediction. Pl. Comparingteacherratingsat 11 years,the lowest estimateof / will occur for the teacherratingof the same attainment(from HI to H4, particularly H2 affectingrze,), and at 16 yearsthe highestvalue will occurfor the rating of the sameattainment(fromH, to H4, particularlyH2 affectingrze2). 232 Ecob and Golstein P2. For a givenattainment,teacherratingswill give higherestimatesof 3 when measuredat 16 than 11 years(fromH, to H4). P3. For a givenattainment,teacherratingswill give higherestimatesof P when measuredat 7 ratherthan 11 years(fromH, to H4). P4. There is no differencein estimatesbetween teachersratings at 7 years (fromH3, H4). P5. Social class gives higherestimatesof P than teacherratingsat 7 and 11 years(fromH, to H6). P6. Socialclass will give similarestimatesof 3 irrespectiveof the age at which measured(fromH5, H6). Note that these predictionsare not unequivocaltests of the hypotheses.For instance,even if P6 holds,one could conceiveof differentcorrelationsof social class with measurementerrorat differentages, these termsbeing counteracted by different correlationswith equation disturbances.This, however, seems unlikely. Results Tables I and II give estimated regressioncoefficients for reading and mathematics,respectively,for 16-yearattainmenton 11-yearattainmentusing a varietyof teacherratingsand socialclass as instrumentalvariablesat ages 7, 11, and 16. Using the ungrouped11-yeartest score as instrumentalvariable gives the ordinaryleast squaresestimate,and assumingthe samplesize is large enough to make use of the asymptotic properties,this enables reliability estimatesto be calculatedfor each choice of instrumentalvariableby dividing the ordinaryleast squaresestimateby the instrumentalvariableestimate.Each predictionwill be examinedin turn. P . This holds for mathematicsusing teacherratingsboth at ages 11 and 16 and for readingfor teacherratingsat 11 but not 16. P2. This holds for comparableteacher ratings at ages 11 and 16 for both readingand mathematicstest scores. P3. This only holds for one out of the six possible comparisons,namely, teacherratingof "number"for readingattainmentregression. P4. Thisholds for both attainments. P5. Thisholds in all cases. P6. Thisholds at all ages for both attainments. Note that predictionsP4 and P6 specify no differencesbetweenregression coefficients,whereasthe other predictionsare of a differencein a specified direction.In fact, for P4 and P6 the differencesbetweencoefficientsare small in relationto the standarderrors,whichhave been estimatedin the usualway, and are strictlyapplicableonly if the estimatesareconsistent. The predictionsare all seen to hold generallywith the exceptionof P3. This implies the rejection of H3 or H4 or both. Rejecting H3 implies that the ~ Instrumental VariableMethods C13 233 " , " - C) C)N 00 C) 0 C) 0 V 0000 cl- r N-M -00 F . --. C ;•,C-),C ) • ,-- C,3, cd bONO =41 x, ON u 0 I-Sr CA 00 cd CA C, ZZ • r d d d r- 4~ k C1 bD ) 'Tt C= 9 OON \o CS 'd 6666 I-~ Zt C13 c -: S0 - •, 0t-o0 %f)itCI 0?I) Cs C~sC~ C~Cs - 000 O)C o Pc o• "00 4 C13 cj c~t `Cd u 13C1 C1 C13 Ee Q, tQt 1 C 234 Ecob and Golstein 0- -- 00 >cl ~3 ~ cl Cd _ r-0) _ _ _ _ 000 qzt r- 00 - vs \oO 00 00000 0ON ~ o _ ---a l c 000 "~" 0( 6 C6~ SC, 000000a- O r) r 0 cl \9 00< -U4 w \u C~tj c-?cd clc $-4 4t - u ?5l cl dU F-4 cl F-4 cl c CIA 00 r Instrumental VariableMethods 235 differencesin coefficientsamong the differentteacherratingsat age 7 should be similarto thoseusingthe threecorrespondingteacherratingsat age 11. This is true with the one exceptionbeing the reversalin the relativemagnitudeof the teacher ratings of reading and number between ages 7 and 11 for mathematics. The smallercoefficient estimatesusing 7-year ratherthan 11-yearratings could be explained by a correlationof 11-year rating with errors in the dependent variable, which counteracts the correlationwith errors in the independentvariable,the 7-year ratings having lower correlationswith the dependentvariable. If H4 is the sole reason for the failureof P3, this suggeststhat given the 11-yeartest score,the partialcorrelationof teacherratingswith socialclass at 11, whichif H5 holds is correlatedwith the disturbances,would be higherfor 11-yearteacherratingsthan for 7-yearteacherratings.This is not the case. It seems,then, that we shoulddiscardsocial class as a suitableinstrumental variable due to its correlationwith the equation disturbances,and teacher ratingsat 16 years are positivelycorrelatedwith test score errorat 16 (form HI1,H2) and probablyalso with equationdisturbances.This leaves a choice betweenteacherratingsat 7 and 11 years.As H3 does not hold, we cannotbe completelycontentwith usingthe same-attainment, 7-yearteacherratings,and in addition, it is not known how highly correlatedthese are with the disturbances.It was suggestedearlier,however,that we shouldexpectthe 11-year ratingsto be more highly correlatedwith the disturbances.As H2 also holds, the wisestchoicewould seem to be the ratingof a differentattainment(out of mathematicsor reading) at age 7. For the reading attainmentthis gives a reliabilityof 0.81 (using teacherrating of numberat age 7) and for mathematics attainmentgives a reliabilityof 0.89 using teacherratingof readingat age 7. In fact, the choicebetween7 and 11 yearsfor the instrumentalvariables makes little differencefor mathematicsattainmentgiving a reliabilityof 0.86 using readingrating at age 11, and for readingattainmentthe differencein reliabilityestimateis only 0.005. The instrumentalvariableestimatechosenhere is subjectto possibleinconsistenciesfrom two causes operatingin opposite directions.If H3 does not hold and the errorsin the instrumentalvariablehave positivecorrelationswith the test scoreerror,whicharehigherat age 11 that at age 16, then the expected value of the estimateis higherthan the true value. On the other hand, if H4 does not hold and the errors in the instrumentalvariable have a positive correlationwith the disturbanceterm,then the expectedvalueis lowerthanthe true value. If both H3 and H4 do not hold, then these biases operatedin opposite directionsand an evaluationof the meritsof the estimatedependon furtheranalysisof the relativestrengthsof the inconsistenciesfrom the two causes. Ecob and Golstein 236 The questionnaturallyariseshere as to whetherthe use of tests of the same attainmentat different ages is necessaryto obtain reasonableestimates of reliabilitycoefficientsby this method. Estimatesof the reliabilityof li-year readingtest scoreobtainedby regressing16-yearmathematicsscoreon 11-year readingscore using as instrumentalvariablesteachers'ratingsof readingand mathematicsattainmentsseparatelyat 11 years, gave reliabilityestimatesof 0.76 and 0.61, respectively,comparedwith the value of 0.81 give above. This suggeststhat the disturbancetermsin this regressionare correlatedwith the mathematicsteachers'ratings. For the regressionof 16-yearreading test on 11-yearmathematicstest using teachers'ratingsof mathematicsand reading separatelyas instrumentalvariables,reliabilityestimatesof 0.85, 0.68, respectively,are obtained,comparedwith the value of 0.88 given above.Careshould thereforebe taken when estimatingreliabilitiesby this method to use similar attainmentsas dependentand predictorvariables. Use of GroupedFirst-Occasion Variableas Instrumental Variable Table III gives estimated regression coefficients for both reading and mathematicswhen the first-occasionvariableis grouped into 2, 3, 5, or 7 groupsof equalsize. The inconsistencyof the groupingestimator(bG)relativeto the ungrouped (OLS)estimator(boLs)is givenby k b= b - bG (9) bk - bOLS TABLEIII Estimated Regression Coefficientsand Standard Errors Using the GroupedPredictoras Instrumental Variable Ungrouped (OLS) No. of groups (of equal size) 7 5 3 Reading 0.797 (0.0082) k Mathematics k 1.00 0.748 (0.0097) 1.00 0.808 (0.0085) 0.810 (0.0086) 0.94 0.93 0.755 (0.0100) 0.763 (0.0102) 0.93 0.86 0.777(0.0109) 0.73 0.84 0.780 (0.0121) 0.70 0.818(0.0092) 2 0.827 (0.0103) Varyingposition of dichotomywith 2 groups Proportionin lower test group 0.89 0.2 0.810(0.012) 0.93 0.687(0.014) 1.58 0.4 0.6 0.8 0.823 (0.010) 0.822 (0.010) 0.789 (0.012) 0.87 0.87 1.04 0.765 (0.012) 0.802 (0.012) 0.805 (0.014) 0.84 0.49 0.46 Note. Standarderrorsin brackets;k is definedin equation9. Instrumental Variable Methods 237 whereb1v is the instrumentalvariableestimator(using an appropriateteacher rating) and is assumedto be consistent.As suggestedin the Use of Dummy Variablessection,substantialinconsistenciesare indicatedin thesedata, which are greaterwith a largernumberof groups.Thus, thereis a trade-offbetween inconsistencyand efficiency,the latterbeing greateras the numberof groupsis increased.For reading attainmentthe lowest estimate of reliability,arising from the divisioninto two groups,is 0.97. This is higherthan any of the values derivedfrom the regressionestimatesby instrumentalvariablesmethodsgiven in Table I. Furthermore,the estimatedregressioncoefficientis seen to vary with the point of dichotomy.Whereasfor readingthe lowest estimateoccurs for both extremedivisions,for mathematicsthe lowest estimateoccurs when divisionis at the lowerend of the scale and the highestestimatewhen division is at the higherend. If the assumptionis made that the correlationsof the groupedfirst-occasion variablewith the disturbancesand errorin the second-occasionvariableare both zero, then using the reliabilityestimatesfrom the previoussectionwe can substitutein (7) to obtain the correlationswith the errorin the first occasion variable,rze,. These are given in Table IV, assumingthe reliabilityestimates givenin the Resultssection(0.81 and 0.89) arecorrect. Thus, the correlations,while reasonablyconstantfor reading,are systematically decreasingfor mathematics.We have no good explanationfor this but possible causes are nonhomogeneityof errorsin the mathematicstest or a nonzerocorrelationbetweentruescoreand errorsof measurement. The Use of GroupedTest Score as an Instrumental Variable Test scores of readingand mathematicsat 7, 11, and 16 years, a score of GeneralAbility at 11 years (with Verbaland Nonverbalcomponents)and a CopyingDesign Test are consideredas instrumentalvariables,and the results are givenin TableV. Sinceshort-termfluctuationsin attainmentwill be correlatedto some extent over all attainments,we would expect the argumentsand predictions(given previously)in relationto teacherratingsto apply to test scores.P1 is satisfied trivially in the light of the results on the use of a grouped predictor as TABLEIV Correlationsof DichotomizedInstrumental Variableand First-occasionMeasurementErrorfor Different Division Points belowdivisionpoint Proportion Reading Mathematics 0.2 0.4 0.5 0.6 0.8 0.280 0.390 0.302 0.226 0.329 0.233 0.305 0.128 0.324 0.120 Ecob and Golstein 238 TABLEV RegressionEstimates of 16-yearAttainment Test on 11-yearAttainment Test in Reading and Mathematics Using GroupedTest Scores as Instrumental Variables Instrumentalvariable At 7 years: ReadingTest MathematicsTest At I1 years: ReadingTest MathematicsTest GeneralAbilityTest:Verbal :Nonverbal :Overall CopyingDesignsTest At 16years: ReadingTest MathematicsTest Reading Mathematics 0.920 (0.015) 1.004 (0.020) 0.822 (0.017) 0.850 (0.018) 0.810 (0.009) 0.995 (0.012) 0.942 (0.012) 0.982 (0.014) 0.957 (0.012) 0.989 (0.032) 0.866 (0.014) 0.763 (0.010) 0.826 (0.013) 0.901 (0.014) 0.860 (0.013) 0.933 (0.032) 1.197 (0.013) 1.053 (0.015) 0.949 (0.015) 1.315 (0.017) Note. Standarderrorsin brackets. instrumental variable. P2 is satisfied, but P3 is again contradicted in one of four cases by the behavior of the reading tests at 7 and 11 when used as instrumental variable for mathematics attainment. Generally, the behavior of test scores is similar to the teacher ratings and the standard errors are similar, giving little indication for preference of one set of variables over the other. Discussion Our resultssuggestthat differingcorrelationsof instrumentalvariableswith measurementerrors account for the observeddifferencesin regressionand reliabilityestimates, although social class has a negligible correlationwith measurementerrorbut a nonnegligiblecorrelationwith the errorof prediction. The estimatedcorrelationcoefficient between true scores on reading and mathematicstests at 11 and 16 years is respectively-0.96and 0.90. The estimatedreliabilitiesusing the selectedinstrumentalvariables(teacherratings of an unrelatedabilityat age 7 as the predictor)are 0.81 and 0.89 for reading and mathematics, respectively. These compare with the values of 0.82 and 0.94 given in Goldstein (1979) by split-halfitem analysis on a subsampleof 300 cases. While the values for reading are similar, the value obtained by item analysis for mathematicsis somewhat higher than any obtained for the instrumentalvariablesused here, althoughthe differenceis of the same order as the standarderrorof the separateestimates. Instrumental VariableMethods 239 For reading attainment the estimated standard error of the reliability estimate obtained by item analysis is 0.030, while the instrumentalvariable methodgives 0.012. A split-halfestimateusing all availabledata wouldhave a standarderrorof about 0.007. For mathematicsthe relevantstandarderrors are 0.020,0.014 and 0.005. Using a groupingof the predictorvariableitself as an instrumentalvariable gives estimatesof the reliabilitythat are higherthan any obtainedusing other variablesas instrumentalvariables,irrespectiveof the numberof groupsused. Theseestimators,we suggest,are not to be recommended. Whileinstrumentalvariableestimationhas had a long history(earlypapers on theoryand applicationin the economicfield includeDurbin, 1954;Madansky, 1959; Reiersol, 1945; Sargan, 1958; Wald 1940),it has not yet become generally accepted as an estimation method in the social and educational fields. Sargan(1958), in discussingthe (unknown)correlationof instrumental variableswith measurementerrorin an economiccontext states: It is not easy to justify the basic assumptionconcerningthese errors, namelythat they are independentof the instrumentalvariables.It seems likely that they will varywith a trendand with the tradecycle. Insofaras this is true,the methoddiscussedhere will lead to inconsistentestimates of the coefficients.Nothing can be done about this since presumablyif anythingwere known about this type of error,better estimatesof the variablescould be produced.It must be hoped that the estimatesof the variablesare sufficientlyaccurate,so that systematicerrorsof this kind are small.(p. 396) We have arguedthat comparisonsof differentinstrumentalvariables,considered separately,can throw some light on the errorstructurein the data, and thus lead to better knowledgeof the consistencyof the estimatesproduced. Furthermore,it is also our view that this approachprovidesa flexibletool for an empirical study of the various assumptions needed to produce good estimates. Finally,fourissuesseemparticularlyworthyof attention: 1. Obtaining estimates of the standard errors of the difference between different instrumentalvariable estimates (these will be lower than those obtainedusing the individualstandarderrorsand assumingindependence). This wouldenablea morecarefulanalysisof the hypothesesof the paper. 2. Obtaininggood estimatesof the standarderrorsof the reliabilityestimates producedby instrumentalvariablesmethods. 3. Examinationof the use of more than one instrumentalvariablein connection with a single predictorin terms of the efficiency and consistencyof estimates. Ecob and Golstein 240 4. The study of differing reliabilities and measurement error variances in different groups such as social classes, to incorporate these into linear model estimates. Acknowledgements We aregratefulto the NationalChildren'sBureaufor permissionto use the data and to Dougal Hutchinsonfor his commentson an earlydraftof the paper.The workwas carriedout largelyon a grant from the National Instituteof Education,Washington (NIE-G-77-0065). References Anderson,E. B. Comparinglatentdistributions.Psychometrika, 1980,45, 121-134. L. structural relation. R. 84-90. Bivariate Brown, Biometrika,1957,44, Cronbach,L. J., Gleser,G. C., Nanda, H., & Rajartnan.Thedependability of behavioural measurements:Theoryof generalisabilityfor scores and profiles. New York: Wiley, 1972. Davie, R., Butler, N. R., & Goldstein, H. From birth to seven: A reportof the National Child DevelopmentStudy. London: Longmans, 1972. Durbin, J. Errors in variables. Review of Institute of International Statistics, 1954, 22, 23-54. Fogelman,K. Britain'ssixteenyear olds.London:NationalChildren'sBureau,1976. Goldstein,H. Somemodelsfor analysinglongitudinaldata on educationalattainment. Journal of the Royal Statistical Society, 1979, A, 142, 402-442. Goldstein,H. Dimensionality,bias, independenceand measurementscale problemsin latent trait test score models. British Journal of Mathematical and Statistical Psychology, 1980, 33, 234-246. Johnston, J. Econometricmethods.New York: McGraw-Hill, 1972. 1971,36, Jireskog,K. G. Statisticalanalysisof sets of congenerictests. Psychometrika, 109-133. Kendall,M. G., & Stuart,A. Theadvancedtheoryof statistics(Vol. 2). London:Griffin, 1977. Layard,M. W. J. Robustlargesampletests for homogeneityof variances.Journalof the American Statistical Association, 1973, 68, 195-198. Lord, F. M., & Novick, M. R. Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968. McDonald, R. The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 1981, 34, 100-117. Madansky,A. The fitting of straightlines when both variablesare subject to error. Journal of the American Statistical Association, 1959, 54, 173-205. Nair, K. R., & Banerjee.A note of fittingstraightlines if both variablesare subjectto error. Sankya, 1942, 6, 331-333. Neyman, J., & Scott, E. L. On certain methods of estimatingthe linear structural relation. Ann. Math. Statist., 1951, 22, 352-355. Reiersol, O. Confluence analysis of means of instrumental sets of variables. Arkiv. for Mathematik, AstronomiOch Fysik, 1945, 32. Instrumental Variable Methods 241 Sargan,J. D. The estimationof economicrelationshipsusing instrumentalvariables. Econometrica, 1958, 26, 393-415. Wald, A. Fitting of straightlines if both variablesare subject to error.Ann. Math. Statist., 1940, 11, 284-300. Warren,R. D., White, J. K., & Fuller, W. A. An errors-in-variables analysis of managerial role performance. Journal of the American Statistical Association, 1974, 69, 886-893. Authors ECOB,RUSSELL,Researchstatisticianin the Researchand Statisticsdivisionof the Inner LondonEducationAuthority,England.Specializations:Longitudinalstudies, educationalmeasurement, effectsof schooling. GOLDSTEIN, HARVEY, Professor of statistical methods in the departmentof Mathematics,Statisticsand Computingat the Universityof London Institute of Education.Specializations:Longitudinalstudies,test score theory,educationalmeasurement.

instrumental variable methods for the estimation of test score reliability

Related documents

Products

Support

instrumental variable methods for the estimation of test score reliability

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib