Estimation of Item Response Models
Mister Ibik
Division of Psychology in Education, Arizona State University
EDP 691: Advanced Topics in Item Response Theory

Motivation and Objectives
• Why estimate?
  – A distinguishing feature of IRT modeling, as compared to classical techniques, is the presence of parameters
  – These parameters characterize and guide inference regarding the entities of interest (i.e., examinees, items)
• We will think through:
  – Different estimation situations
  – Alternative estimation techniques
  – The logic and mathematics underpinning these techniques
  – Various strengths and weaknesses
• What you will have:
  – A detailed introduction to principles and mathematics
  – A resource to be revisited…and revisited…and revisited

Outline
• Some Necessary Mathematical Background
• Maximum Likelihood and Bayesian Theory
• Estimation of Person Parameters When Item Parameters Are Known
  – ML
  – MAP
  – EAP
• Estimation of Item Parameters When Person Parameters Are Known
  – ML
• Simultaneous Estimation of Item and Person Parameters
  – JML
  – CML
  – MML
• Other Approaches

Background: Finding the Root of an Equation
• The Newton-Raphson algorithm finds the root of an equation
• Example: the function f(x) = x^2
  [Figure: plot of f(x) = x^2 over x from -2.5 to 2.5]
• It has a root (a point where f(x) = 0) at x = 0

Newton-Raphson
• Newton-Raphson takes a given point, x_0, and systematically progresses toward the root of the equation
  – It uses the slope of the function to find where the root may be
• The slope of the function is given by the derivative
  – Denoted f'(x) or \partial f(x) / \partial x
  – It gives the slope of the straight line that is tangent to f(x) at x
  – Tangent: the best linear prediction of how the function is changing
  – From x_0, the best guess for the root is the point where the tangent line at x_0 crosses 0
  – This occurs at x_0 - f(x_0) / f'(x_0)
• So the next candidate point for the root is
  x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}

Newton-Raphson Updating (1)
• Suppose x_0 = 1.5, with f(x) = x^2 and f'(x) = 2x
• f(x_0) = 2.25 and f'(x_0) = 3
• x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 1.5 - \frac{2.25}{3} = 0.75
  [Figure: tangent line at x_0 = 1.5 crossing zero at x_1 = 0.75]

Newton-Raphson Updating (2)
• Now x_1 = 0.75
• f(x_1) = 0.5625 and f'(x_1) = 1.5
• x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 0.75 - \frac{0.5625}{1.5} = 0.375
  [Figure: tangent line at x_1 = 0.75 crossing zero at x_2 = 0.375]

Newton-Raphson Updating (3)
• Now x_2 = 0.375
• f(x_2) = 0.1406 and f'(x_2) = 0.75
• x_3 = x_2 - \frac{f(x_2)}{f'(x_2)} = 0.375 - \frac{0.1406}{0.75} = 0.1875
  [Figure: tangent line at x_2 = 0.375 crossing zero at x_3 = 0.1875]

Newton-Raphson Updating (4)
• Now x_3 = 0.1875
• f(x_3) = 0.0352 and f'(x_3) = 0.375
• x_4 = x_3 - \frac{f(x_3)}{f'(x_3)} = 0.1875 - \frac{0.0352}{0.375} = 0.0938
  [Figure: tangent line at x_3 = 0.1875 crossing zero at x_4 = 0.0938]

Newton-Raphson Example

Iteration | x      | f(x)   | f'(x)  | x - f(x)/f'(x) | Change
0         | 1.5000 | 2.2500 | 3.0000 | 0.7500         | 0.7500
1         | 0.7500 | 0.5625 | 1.5000 | 0.3750         | 0.3750
2         | 0.3750 | 0.1406 | 0.7500 | 0.1875         | 0.1875
3         | 0.1875 | 0.0352 | 0.3750 | 0.0938         | 0.0938
4         | 0.0938 | 0.0088 | 0.1875 | 0.0469         | 0.0469
5         | 0.0469 | 0.0022 | 0.0938 | 0.0234         | 0.0234
6         | 0.0234 | 0.0005 | 0.0469 | 0.0117         | 0.0117
7         | 0.0117 | 0.0001 | 0.0234 | 0.0059         | 0.0059
8         | 0.0059 | 0.0000 | 0.0117 | 0.0029         | 0.0029
9         | 0.0029 | 0.0000 | 0.0059 | 0.0015         | 0.0015
10        | 0.0015 | 0.0000 | 0.0029 | 0.0007         | 0.0007
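A minimal Python sketch of these updates, assuming the same function f(x) = x^2 and start value x_0 = 1.5 as the example above; it reproduces the iteration table.

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x                  # derivative of x^2

x = 1.5                           # start value x0 from the example
for iteration in range(11):
    step = f(x) / f_prime(x)
    print(f"{iteration:2d}  x = {x:7.4f}  f(x) = {f(x):7.4f}  "
          f"f'(x) = {f_prime(x):7.4f}  next = {x - step:7.4f}")
    x = x - step                  # Newton-Raphson update: x_next = x - f(x)/f'(x)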
Newton-Raphson Summary
• An iterative algorithm for finding the root of an equation
• Takes a starting point and systematically progresses toward the root of the function
• Requires the derivative of the function, \partial f(x) / \partial x
• Each successive point is given by x - f(x) / f'(x)
• The process continues until we get arbitrarily close, as usually measured by the change in some function

Difficulties With Newton-Raphson
• Some functions have multiple roots
• Which root is found often depends on the start value
  [Figure: a function with several roots over x from -2.5 to 2.5]

Difficulties With Newton-Raphson
• Numerical complications can arise
• When the derivative is relatively small in magnitude, the algorithm shoots off into outer space
  [Figure: a nearly flat function for which the Newton-Raphson step becomes very large]

Logic of Maximum Likelihood
• A general approach to parameter estimation
• The use of a model implies that the data may be sufficiently characterized by the features of the model, including the unknown parameters
• Parameters govern the data in the sense that the data depend on the parameters
  – Given values of the parameters, we can calculate the (conditional) probability of the data
  – P(X_{ij} = 1 | \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
• Maximum likelihood (ML) estimation asks: "What are the values of the parameters that make the data most probable?"

Example: Series of Bernoulli Variables With Unknown Probability
• Bernoulli variable: P(X = 1) = p
• The probability of the data is given by p^X (1 - p)^{1 - X}
• Suppose we have two random variables, X_1 and X_2:
  P(X_1, X_2 | p) = \prod_{j=1}^{2} p^{X_j} (1 - p)^{1 - X_j}
• When taken as a function of the parameters, this is called the likelihood
• Suppose X_1 = 1 and X_2 = 0
• P(X_1 = 1, X_2 = 0 | p) = L(p | X_1 = 1, X_2 = 0) = p(1 - p)
• Choose p to maximize the conditional probability of the data
  – For p = 0.1, L = 0.1 × (1 - 0.1) = 0.09
  – For p = 0.2, L = 0.2 × (1 - 0.2) = 0.16
  – For p = 0.3, L = 0.3 × (1 - 0.3) = 0.21

Example: Likelihood Function
  [Figure: the likelihood L(p) = p(1 - p) plotted over p from 0 to 1, peaking at p = 0.5]

The Likelihood Function in IRT
• The likelihood may be thought of as the conditional probability, where the data are known and the parameters vary:
  P(X | \Theta, \Omega) = L(\Theta, \Omega | X)
• Let P_{ij} = P(X_{ij} = 1 | \theta_i, \omega_j); then
  L(\Theta, \Omega | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P(X_{ij} = x_{ij} | \theta_i, \omega_j) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
• The goal is to maximize this function – what values of the parameters yield the highest value?
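A short sketch of the Bernoulli likelihood example from the slides above, evaluating L(p | X_1 = 1, X_2 = 0) = p(1 - p) on a grid of candidate values; the grid spacing is arbitrary.

import numpy as np

x = np.array([1, 0])                        # observed data: X1 = 1, X2 = 0
p_grid = np.arange(0.05, 1.0, 0.05)         # candidate values of p
# Likelihood L(p) = prod_j p^x_j (1-p)^(1-x_j), evaluated at every grid point
likelihood = np.prod(p_grid[:, None] ** x * (1 - p_grid[:, None]) ** (1 - x), axis=1)

for p, L in zip(p_grid, likelihood):
    print(f"p = {p:.2f}   L = {L:.4f}")
print("grid maximizer (ML estimate):", p_grid[np.argmax(likelihood)])   # p = 0.50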
Log-Likelihood Functions
• It is numerically easier to maximize the natural logarithm of the likelihood:
  \ln L(\Theta, \Omega | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• The log-likelihood has the same maximum as the likelihood

Maximizing the Log-Likelihood
• Note that at the maximum of the function, the slope of the tangent line equals 0
• The slope of the tangent is given by the first derivative
• If we can find the point at which the first derivative equals 0, we will also have found the point at which the function is maximized

Overview of Numerical Techniques
• One can maximize the ln[L] function by finding a point where its derivative is 0
• A variety of methods are available for maximizing L, or ln[L]
  – Newton-Raphson
  – Fisher scoring
  – Expectation-Maximization (EM)
• The generality of ML estimation and these numerical techniques results in the same concepts and estimation routines being employed across modeling situations
  – Logistic regression, log-linear modeling, FA, SEM, LCA

ML Estimation of Person Parameters When Item Parameters Are Known
• Assume the item parameters b_j, a_j, and c_j are known
• Assume unidimensionality, local independence, and respondent independence:
  P(X | \Theta) = P(X_1, \ldots, X_N | \theta_1, \ldots, \theta_N) = \prod_{i=1}^{N} \prod_{j=1}^{J} P(X_{ij} | \theta_i)
  (the conditional probability now depends on the person parameters only)
• The likelihood function for the person parameters only:
  L(\theta_1, \ldots, \theta_N | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(\theta_1, \ldots, \theta_N | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]

ML Estimation of Person Parameters When Item Parameters Are Known
• Choose each \theta_i such that L or ln[L] is maximized
• Let's suppose we have one examinee:
  \ln L(\theta_i | X_i) = \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• Maximize this function using any of several methods
• We'll use Newton-Raphson

Newton-Raphson Estimation Recap
• Recall that NR seeks to find the root of a function (the point where the function = 0)
• NR updates follow the general structure
  updated value = current value - (function of interest) / (derivative of the function of interest)
  x_{\text{next}} = x - \frac{f(x)}{f'(x)}
• What is our function of interest?
• What is the derivative of this function?
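A hedged sketch of this one-examinee log-likelihood under the Rasch model; the item difficulties and response pattern below are hypothetical, chosen only to illustrate evaluating ln L(\theta_i | x_i) on a grid of \theta values before turning to Newton-Raphson.

import numpy as np

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])      # assumed (known) item difficulties
x = np.array([1, 1, 1, 0, 0])                  # assumed response pattern for one examinee

def log_likelihood(theta):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))     # Rasch item response probabilities
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate on a grid of theta values; the grid maximizer approximates the MLE.
grid = np.arange(-4.0, 4.01, 0.1)
ll = np.array([log_likelihood(t) for t in grid])
print("grid maximizer:", round(grid[np.argmax(ll)], 2))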
Newton-Raphson Estimation of Person Parameters
• Newton-Raphson uses the derivative of the function of interest
• Here, our function of interest is itself a derivative: the first derivative of ln[L] with respect to \theta_i,
  \frac{\partial \ln L(\theta_i | \mathbf{x}_i)}{\partial \theta_i}
• We'll therefore need the second derivative as well as the first derivative:
  \frac{\partial^2 \ln L(\theta_i | \mathbf{x}_i)}{\partial \theta_i^2}
• Updates are given by
  \theta_i^{\text{next}} = \theta_i - \frac{\partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i}{\partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2}

ML Estimation of Person Parameters When Item Parameters Are Known: The Log-Likelihood
• The log-likelihood to be maximized
  [Figure: the log-likelihood plotted as a function of \theta_i]
• Select a start value and iterate toward a solution using Newton-Raphson
• A "hill-climbing" sequence

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Start at \theta_i = -1.0
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = 3.211
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -2.920
  \theta_i^{\text{next}} = -1 - \frac{3.211}{-2.920} \approx 0.09

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Move to 0.09
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = -0.335
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -3.363
  \theta_i^{\text{next}} \approx -0.0001

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Move to -0.0001
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = 0.0003
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -3.368
• When the change in \theta_i is arbitrarily small (e.g., less than 0.001), stop estimation
• There is no meaningful change in the next step
• The key is that the slope of the tangent is 0
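A minimal sketch of these per-examinee Newton-Raphson updates for a 2-PL model with known item parameters; the discriminations, difficulties, response pattern, and start value are hypothetical, and the stopping rule uses the 0.001 change criterion from the slides. For the 2-PL, the first derivative is \sum_j a_j (x_{ij} - P_{ij}) and the second derivative is -\sum_j a_j^2 P_{ij}(1 - P_{ij}).

import numpy as np

a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])        # assumed discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # assumed difficulties
x = np.array([1, 1, 0, 1, 0])                  # assumed responses

theta = -1.0                                    # start value
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    first = np.sum(a * (x - p))                 # d lnL / d theta
    second = -np.sum(a**2 * p * (1 - p))        # d^2 lnL / d theta^2
    theta_next = theta - first / second         # Newton-Raphson update
    print(f"theta = {theta:8.4f}  first = {first:8.4f}  second = {second:8.4f}")
    if abs(theta_next - theta) < 0.001:         # stop when the change is arbitrarily small
        theta = theta_next
        break
    theta = theta_next
print("ML estimate:", round(theta, 4))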
Newton-Raphson Estimation of Multiple Person Parameters
• But we have N examinees, each with a \theta_i to be estimated:
  L(\theta_1, \ldots, \theta_N | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(\theta_1, \ldots, \theta_N | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• We need a multivariate version of the Newton-Raphson algorithm

First Order Derivatives
• The multivariate Newton-Raphson update is
  (\theta_1, \theta_2, \ldots, \theta_N)'_{\text{next}} = (\theta_1, \theta_2, \ldots, \theta_N)' - \mathbf{H}^{-1} \mathbf{g}
  where \mathbf{g} is the vector of first order derivatives of the log-likelihood,
  \mathbf{g} = ( \partial \ln L / \partial \theta_1, \; \partial \ln L / \partial \theta_2, \; \ldots, \; \partial \ln L / \partial \theta_N )'
  and \mathbf{H} is the N × N matrix of second order derivatives with elements \partial^2 \ln L / \partial \theta_i \partial \theta_k
• \partial \ln L / \partial \theta_i involves only the terms corresponding to subject i – why?

Second Order Derivatives
• \mathbf{H} is the Hessian: the matrix of second order partial derivatives of the log-likelihood
• This matrix needs to be inverted – why?
• In the current context, this matrix is diagonal: because of respondent independence, every cross-derivative \partial^2 \ln L / \partial \theta_i \partial \theta_k (i ≠ k) equals 0

Second Order Derivatives
• The inverse of the Hessian is therefore also diagonal, with elements that are the reciprocals of the diagonal of the Hessian
• Updates for each \theta_i do not depend on any other subject's \theta

Second Order Derivatives
• The updates for each \theta_i are independent of one another
• The procedure can be performed one examinee at a time

ML Estimation of Person Parameters When Item Parameters Are Known: Standard Errors
• The approximate, asymptotic standard error of the ML estimate of \theta_i is
  SE(\hat{\theta}_i) = \frac{1}{\sqrt{I(\theta_i)}} \approx \frac{1}{\sqrt{I(\hat{\theta}_i)}}
• where I(\theta_i) is the information function:
  I(\theta_i) = -E\left[ \frac{\partial^2 \ln L}{\partial \theta_i^2} \right]
• Standard errors are
  – asymptotic with respect to the number of items
  – approximate because only an estimate of \theta_i is employed
  – asymptotically approximately unbiased

ML Estimation of Person Parameters When Item Parameters Are Known: Strengths
• ML estimates have some desirable qualities
  – They are consistent
  – If a sufficient statistic exists, the MLE is a function of that statistic (Rasch models)
  – They are asymptotically normally distributed
  – They are asymptotically the most efficient (least variable) estimators among the class of normally distributed unbiased estimators
• Asymptotic with respect to what?

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• ML estimates have some undesirable qualities
  – Estimates may fly off into outer space
  – They do not exist for so-called "perfect scores" (all 1's or all 0's)
  – They can be difficult to compute or verify when the likelihood function is not single-peaked (which may occur with the 3-PLM or more complex IRT models)

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• Strategies to handle wayward solutions
  – Bound the amount of change at any one iteration
    • Atheoretical
    • No longer common
  – Use an alternative estimation framework (Fisher scoring, Bayesian)
• Strategies to handle perfect scores
  – Do not estimate \theta_i
  – Use an alternative estimation framework (Bayesian)
• Strategies to handle local maxima
  – Re-estimate the parameters using different starting points and look for agreement

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• An alternative to the Newton-Raphson technique is Fisher's method of scoring
  – Instead of the Hessian, it uses the information matrix (based on the Hessian)
  – This usually leads to quicker convergence
  – It is often more stable than Newton-Raphson
• But what about those perfect scores?
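Before turning to the Bayesian answer to that question, a short sketch of the standard-error formula from the slides above, using the 2-PL information function I(\theta) = \sum_j a_j^2 P_j (1 - P_j); the item parameters and the estimate \hat{\theta} below are hypothetical.

import numpy as np

a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])        # assumed discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # assumed difficulties
theta_hat = 0.42                               # a (hypothetical) ML estimate of theta

p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
information = np.sum(a**2 * p * (1 - p))       # test information evaluated at theta_hat
se = 1.0 / np.sqrt(information)                # SE(theta_hat) = 1 / sqrt(I(theta_hat))
print(f"I(theta_hat) = {information:.3f},  SE = {se:.3f}")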
Bayes' Theorem
• We can avoid some of the problems that occur in ML estimation by employing a Bayesian approach
• All entities are treated as random variables
• Bayes' Theorem for random variables A and B:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)}
  – P(A | B) is the posterior distribution of A given B: "the probability of A, given B"
  – P(B | A) is the conditional probability of B given A
  – P(A) is the prior probability of A
  – P(B) is the marginal probability of B

Bayes' Theorem
• If A is discrete:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{\sum_A P(B | A) P(A)}
• If A is continuous:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{\int_A P(B | A) P(A) \, dA}
• Note that P(B | A) = L(A | B)

Bayesian Estimation of Person Parameters: The Posterior
• Select a prior distribution for \theta_i, denoted P(\theta_i)
• Recall that the likelihood function takes the form P(X_i | \theta_i)
• The posterior density of \theta_i given X_i is
  P(\theta_i | X_i) = \frac{P(X_i | \theta_i) P(\theta_i)}{P(X_i)} = \frac{P(X_i | \theta_i) P(\theta_i)}{\int P(X_i | \theta_i) P(\theta_i) \, d\theta_i}
• Since P(X_i) is a constant,
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)

Bayesian Estimation of Person Parameters: The Posterior
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)
  [Figure: the likelihood, the prior, and the resulting posterior plotted over \theta_i]

Maximum A Posteriori Estimation of Person Parameters
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)
• The Maximum A Posteriori (MAP) estimate \tilde{\theta}_i is the maximum of the posterior density of \theta_i
• It is computed by maximizing the posterior density, or its log
• Find \theta_i such that
  \frac{\partial \ln P(\theta_i | X_i)}{\partial \theta_i} = 0
• Use Newton-Raphson or Fisher scoring
• The maximum of ln[P(\theta_i | X_i)] occurs at the maximum of ln[P(X_i | \theta_i)] + ln[P(\theta_i)]
• This can be thought of as augmenting the likelihood with prior information

Choice of Prior Distribution
• Choosing P(\theta_i) ~ U(-∞, ∞) yields a posterior proportional to the likelihood:
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i) \propto P(X_i | \theta_i)
• In this case, the MAP is very similar to the ML estimate
• The prior distribution P(\theta_i) is often assumed to be N(0, 1)
  – The normal distribution is commonly justified by appeal to the CLT
  – The choice of mean and variance identifies the scale of the latent continuum

MAP Estimation of Person Parameters: Features
• The approximate, asymptotic standard error of the MAP is
  SE(\tilde{\theta}_i) = \frac{1}{\sqrt{I(\theta_i)}} \approx \frac{1}{\sqrt{I(\tilde{\theta}_i)}}
  where I(\theta_i) is the information from the posterior density
• Advantages of the MAP estimator
  – It exists for every response pattern – why?
  – It generally leads to a reduced tendency for local extrema
• Disadvantages of the MAP estimator
  – A prior must be specified
  – It exhibits shrinkage in that it is biased toward the mean: many items may be needed to "swamp" the prior if the prior is misspecified
  – The calculations are iterative and may take a long time
  – It may still result in local extrema

Expected A Posteriori (EAP) Estimation of Person Parameters
• The Expected A Posteriori (EAP) estimator is the mean of the posterior distribution:
  \bar{\theta}_i = \int \theta_i P(\theta_i | X_i) \, d\theta_i
• Exact computations are often intractable
• We approximate the integral using numerical techniques
• Essentially, we take a weighted average of the values, where the weights are determined by the posterior distribution
  – Recall that the posterior distribution is itself determined by the prior and the likelihood

Numerical Integration Via Quadrature
• The posterior distribution, with quadrature points
• Evaluate the heights of the distribution at each point
• Use the relative heights as the weights
  [Figure: the posterior density evaluated at a set of quadrature points; e.g., a point whose height is 0.021 out of a total of 0.165 receives relative weight 0.021 / 0.165 ≈ 0.127]
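A minimal sketch of EAP estimation via quadrature with a N(0, 1) prior, in the spirit of the weighting scheme just described; the quadrature grid, Rasch item difficulties, and response pattern below are hypothetical.

import numpy as np

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])        # assumed Rasch difficulties
x = np.array([1, 1, 1, 0, 0])                    # assumed response pattern

Q = np.arange(-4.0, 4.01, 0.1)                   # quadrature points
prior = np.exp(-0.5 * Q**2)                      # N(0, 1) prior (up to a constant)

p = 1.0 / (1.0 + np.exp(-(Q[:, None] - b)))      # P_ij at each quadrature point
like = np.prod(p**x * (1 - p)**(1 - x), axis=1)  # likelihood at each point

posterior = like * prior
H = posterior / posterior.sum()                  # normalized weights H(Q_r)

eap = np.sum(Q * H)                              # EAP = weighted mean of the posterior
sd = np.sqrt(np.sum((Q - eap)**2 * H))           # posterior SD (the EAP's standard error)
print(f"EAP = {eap:.3f},  posterior SD = {sd:.3f}")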
EAP Estimation of θ via Quadrature
• The Expected A Posteriori (EAP) estimate is computed as a weighted average:
  \bar{\theta}_i = \int \theta_i P(\theta_i | X_i) \, d\theta_i \approx \sum_r Q_r H(Q_r)
  where H(Q_r) is the weight of quadrature point Q_r in the posterior (compare Embretson & Reise, 2000, p. 177)
• The standard error is the standard deviation of the posterior and may also be approximated via quadrature:
  \sigma^2(\bar{\theta}_i) = \int (\theta_i - \bar{\theta}_i)^2 P(\theta_i | X_i) \, d\theta_i \approx \sum_r (Q_r - \bar{\theta}_i)^2 H(Q_r)

EAP Estimation of θ via Quadrature
• Advantages
  – Exists for all possible response patterns
  – Non-iterative solution strategy
  – Not a maximum, therefore no local extrema
  – Has the smallest MSE in the population
• Disadvantages
  – A prior must be specified
  – Exhibits shrinkage toward the prior mean: if the prior is misspecified, many items may be needed to "swamp" the prior

ML Estimation of Item Parameters When Person Parameters Are Known: Assumptions
• Assume
  – the person parameters \theta_i are known
  – respondent and local independence
  L(b_1, a_1, c_1, \ldots, b_J, a_J, c_J | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(b_1, a_1, c_1, \ldots, b_J, a_J, c_J | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• Choose the values of the item parameters that maximize ln[L]

Newton-Raphson Estimation
• The full multivariate Newton-Raphson update stacks all 3J item parameters (b_1, a_1, c_1, \ldots, b_J, a_J, c_J) in one vector and subtracts the inverse of the (3J × 3J) Hessian of second order partial derivatives times the vector of first order derivatives – the same structure as the update for the person parameters
• What is the structure of this matrix?

ML Estimation of Item Parameters When Person Parameters Are Known
• Just as we could estimate subjects one at a time thanks to respondent independence, we can estimate items one at a time thanks to local independence
• Multivariate Newton-Raphson for a single item:
  \begin{pmatrix} b_j \\ a_j \\ c_j \end{pmatrix}^{\text{next}} =
  \begin{pmatrix} b_j \\ a_j \\ c_j \end{pmatrix} -
  \begin{pmatrix}
    \frac{\partial^2 \ln L}{\partial b_j^2} & \frac{\partial^2 \ln L}{\partial b_j \partial a_j} & \frac{\partial^2 \ln L}{\partial b_j \partial c_j} \\
    \frac{\partial^2 \ln L}{\partial a_j \partial b_j} & \frac{\partial^2 \ln L}{\partial a_j^2} & \frac{\partial^2 \ln L}{\partial a_j \partial c_j} \\
    \frac{\partial^2 \ln L}{\partial c_j \partial b_j} & \frac{\partial^2 \ln L}{\partial c_j \partial a_j} & \frac{\partial^2 \ln L}{\partial c_j^2}
  \end{pmatrix}^{-1}
  \begin{pmatrix} \frac{\partial \ln L}{\partial b_j} \\ \frac{\partial \ln L}{\partial a_j} \\ \frac{\partial \ln L}{\partial c_j} \end{pmatrix}

ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
• To obtain the approximate, asymptotic standard errors
  – Invert the associated information matrix, which yields the variance-covariance matrix
  – Take the square root of the elements of the diagonal:
    \sqrt{\text{Diag}\left[ I(\mathbf{b}, \mathbf{a}, \mathbf{c})^{-1} \right]}
• These are asymptotic with respect to sample size, and approximate because we only have estimates of the parameters
• This is conceptually similar to the standard errors for the estimation of \theta:
  SE(\hat{\theta}_i) = \sqrt{I(\hat{\theta}_i)^{-1}}
• But why do we need a matrix approach here?

ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
• ML estimates of item parameters have the same properties as those for person parameters: consistent, efficient, and asymptotic (with respect to the number of subjects)
• The a_j parameters can be difficult to estimate and tend to be inflated with small sample sizes
• The c_j parameters are often difficult to estimate well
  – Usually because there is not a lot of information in the data about the lower asymptote
  – This is especially true when items are easy
• Generally, larger and more heterogeneous samples are needed to estimate the 2-PL and 3-PL
• Bayesian estimation can be employed (more on this later)
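A hedged sketch of estimating a single item's b_j and a_j (a 2-PL item, so no c_j) with known person parameters. It uses Fisher scoring, i.e., the expected information matrix in place of the full Hessian, as mentioned on the Fisher-scoring slide earlier; the simulated thetas, responses, true values, and start values are all hypothetical.

import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=1000)                          # "known" person parameters
a_true, b_true = 1.2, 0.3
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-a_true * (theta - b_true))))

b, a = 0.0, 1.0                                        # start values
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    w = p * (1 - p)
    # Gradient of lnL with respect to (b, a)
    grad = np.array([-a * np.sum(x - p),
                     np.sum((x - p) * (theta - b))])
    # Expected information matrix for (b, a)
    info = np.array([[a**2 * np.sum(w),            -a * np.sum(w * (theta - b))],
                     [-a * np.sum(w * (theta - b)), np.sum(w * (theta - b)**2)]])
    step = np.linalg.solve(info, grad)                 # Fisher-scoring step
    b, a = b + step[0], a + step[1]
    if np.max(np.abs(step)) < 1e-4:                    # stop when the change is arbitrarily small
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))             # SEs from the inverted information matrix
print(f"b_hat = {b:.3f} (SE {se[0]:.3f}),  a_hat = {a:.3f} (SE {se[1]:.3f})")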