Estimation of Item Response Models
Mister Ibik
Division of Psychology in Education, Arizona State University
EDP 691: Advanced Topics in Item Response Theory

Motivation and Objectives
• Why estimate?
  – A distinguishing feature of IRT modeling, as compared to classical techniques, is the presence of parameters
  – These parameters characterize and guide inference regarding the entities of interest (i.e., examinees, items)
• We will think through:
  – Different estimation situations
  – Alternative estimation techniques
  – The logic and mathematics underpinning these techniques
  – Various strengths and weaknesses
• What you will have:
  – A detailed introduction to principles and mathematics
  – A resource to be revisited…and revisited…and revisited

Outline
• Some Necessary Mathematical Background
• Maximum Likelihood and Bayesian Theory
• Estimation of Person Parameters When Item Parameters Are Known
  – ML
  – MAP
  – EAP
• Estimation of Item Parameters When Person Parameters Are Known
  – ML
• Simultaneous Estimation of Item and Person Parameters
  – JML
  – CML
  – MML
• Other Approaches

Background: Finding the Root of an Equation
• The Newton-Raphson algorithm finds the root of an equation
• Example: the function f(x) = x^2
  [Figure: plot of f(x) = x^2 over x from -2.5 to 2.5]
• It has a root (a point where f(x) = 0) at x = 0

Newton-Raphson
• Newton-Raphson takes a given point, x_0, and systematically progresses toward the root of the equation
  – It uses the slope of the function to find where the root may be
• The slope of the function is given by the derivative
  – Denoted f'(x) or \partial f(x) / \partial x
  – It gives the slope of the straight line that is tangent to f(x) at x
  – Tangent: the best linear prediction of how the function is changing
  – From x_0, the best guess for the root is the point where the tangent line at x_0 crosses 0
  – This occurs at x_0 - f(x_0) / f'(x_0)
• So the next candidate point for the root is
  x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}

Newton-Raphson Updating (1)
• Suppose x_0 = 1.5, with f(x) = x^2 and f'(x) = 2x
• f(x_0) = 2.25 and f'(x_0) = 3
• x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 1.5 - \frac{2.25}{3} = 0.75
  [Figure: tangent line at x_0 = 1.5 crossing zero at x_1 = 0.75]

Newton-Raphson Updating (2)
• Now x_1 = 0.75
• f(x_1) = 0.5625 and f'(x_1) = 1.5
• x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 0.75 - \frac{0.5625}{1.5} = 0.375
  [Figure: tangent line at x_1 = 0.75 crossing zero at x_2 = 0.375]

Newton-Raphson Updating (3)
• Now x_2 = 0.375
• f(x_2) = 0.1406 and f'(x_2) = 0.75
• x_3 = x_2 - \frac{f(x_2)}{f'(x_2)} = 0.375 - \frac{0.1406}{0.75} = 0.1875
  [Figure: tangent line at x_2 = 0.375 crossing zero at x_3 = 0.1875]

Newton-Raphson Updating (4)
• Now x_3 = 0.1875
• f(x_3) = 0.0352 and f'(x_3) = 0.375
• x_4 = x_3 - \frac{f(x_3)}{f'(x_3)} = 0.1875 - \frac{0.0352}{0.375} = 0.0938
  [Figure: tangent line at x_3 = 0.1875 crossing zero at x_4 = 0.0938]

Newton-Raphson Example

Iteration | x      | f(x)   | f'(x)  | x - f(x)/f'(x) | Change
0         | 1.5000 | 2.2500 | 3.0000 | 0.7500         | 0.7500
1         | 0.7500 | 0.5625 | 1.5000 | 0.3750         | 0.3750
2         | 0.3750 | 0.1406 | 0.7500 | 0.1875         | 0.1875
3         | 0.1875 | 0.0352 | 0.3750 | 0.0938         | 0.0938
4         | 0.0938 | 0.0088 | 0.1875 | 0.0469         | 0.0469
5         | 0.0469 | 0.0022 | 0.0938 | 0.0234         | 0.0234
6         | 0.0234 | 0.0005 | 0.0469 | 0.0117         | 0.0117
7         | 0.0117 | 0.0001 | 0.0234 | 0.0059         | 0.0059
8         | 0.0059 | 0.0000 | 0.0117 | 0.0029         | 0.0029
9         | 0.0029 | 0.0000 | 0.0059 | 0.0015         | 0.0015
10        | 0.0015 | 0.0000 | 0.0029 | 0.0007         | 0.0007
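A minimal Python sketch of these updates, assuming the same function f(x) = x^2 and start value x_0 = 1.5 as the example above; it reproduces the iteration table.

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x                  # derivative of x^2

x = 1.5                           # start value x0 from the example
for iteration in range(11):
    step = f(x) / f_prime(x)
    print(f"{iteration:2d}  x = {x:7.4f}  f(x) = {f(x):7.4f}  "
          f"f'(x) = {f_prime(x):7.4f}  next = {x - step:7.4f}")
    x = x - step                  # Newton-Raphson update: x_next = x - f(x)/f'(x)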
Newton-Raphson Summary
• An iterative algorithm for finding the root of an equation
• Takes a starting point and systematically progresses toward the root of the function
• Requires the derivative of the function, \partial f(x) / \partial x
• Each successive point is given by x - f(x) / f'(x)
• The process continues until we get arbitrarily close, as usually measured by the change in some function

Difficulties With Newton-Raphson
• Some functions have multiple roots
• Which root is found often depends on the start value
  [Figure: a function with several roots over x from -2.5 to 2.5]

Difficulties With Newton-Raphson
• Numerical complications can arise
• When the derivative is relatively small in magnitude, the algorithm shoots off into outer space
  [Figure: a nearly flat function for which the Newton-Raphson step becomes very large]

Logic of Maximum Likelihood
• A general approach to parameter estimation
• The use of a model implies that the data may be sufficiently characterized by the features of the model, including the unknown parameters
• Parameters govern the data in the sense that the data depend on the parameters
  – Given values of the parameters, we can calculate the (conditional) probability of the data
  – P(X_{ij} = 1 | \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
• Maximum likelihood (ML) estimation asks: "What are the values of the parameters that make the data most probable?"

Example: Series of Bernoulli Variables With Unknown Probability
• Bernoulli variable: P(X = 1) = p
• The probability of the data is given by p^X (1 - p)^{1 - X}
• Suppose we have two random variables, X_1 and X_2:
  P(X_1, X_2 | p) = \prod_{j=1}^{2} p^{X_j} (1 - p)^{1 - X_j}
• When taken as a function of the parameters, this is called the likelihood
• Suppose X_1 = 1 and X_2 = 0
• P(X_1 = 1, X_2 = 0 | p) = L(p | X_1 = 1, X_2 = 0) = p(1 - p)
• Choose p to maximize the conditional probability of the data
  – For p = 0.1, L = 0.1 × (1 - 0.1) = 0.09
  – For p = 0.2, L = 0.2 × (1 - 0.2) = 0.16
  – For p = 0.3, L = 0.3 × (1 - 0.3) = 0.21

Example: Likelihood Function
  [Figure: the likelihood L(p) = p(1 - p) plotted over p from 0 to 1, peaking at p = 0.5]

The Likelihood Function in IRT
• The likelihood may be thought of as the conditional probability, where the data are known and the parameters vary:
  P(X | \Theta, \Omega) = L(\Theta, \Omega | X)
• Let P_{ij} = P(X_{ij} = 1 | \theta_i, \omega_j); then
  L(\Theta, \Omega | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P(X_{ij} = x_{ij} | \theta_i, \omega_j) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
• The goal is to maximize this function – what values of the parameters yield the highest value?
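A short sketch of the Bernoulli likelihood example from the slides above, evaluating L(p | X_1 = 1, X_2 = 0) = p(1 - p) on a grid of candidate values; the grid spacing is arbitrary.

import numpy as np

x = np.array([1, 0])                        # observed data: X1 = 1, X2 = 0
p_grid = np.arange(0.05, 1.0, 0.05)         # candidate values of p
# Likelihood L(p) = prod_j p^x_j (1-p)^(1-x_j), evaluated at every grid point
likelihood = np.prod(p_grid[:, None] ** x * (1 - p_grid[:, None]) ** (1 - x), axis=1)

for p, L in zip(p_grid, likelihood):
    print(f"p = {p:.2f}   L = {L:.4f}")
print("grid maximizer (ML estimate):", p_grid[np.argmax(likelihood)])   # p = 0.50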
Log-Likelihood Functions
• It is numerically easier to maximize the natural logarithm of the likelihood:
  \ln L(\Theta, \Omega | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• The log-likelihood has the same maximum as the likelihood

Maximizing the Log-Likelihood
• Note that at the maximum of the function, the slope of the tangent line equals 0
• The slope of the tangent is given by the first derivative
• If we can find the point at which the first derivative equals 0, we will also have found the point at which the function is maximized

Overview of Numerical Techniques
• One can maximize the ln[L] function by finding a point where its derivative is 0
• A variety of methods are available for maximizing L, or ln[L]
  – Newton-Raphson
  – Fisher scoring
  – Expectation-Maximization (EM)
• The generality of ML estimation and these numerical techniques results in the same concepts and estimation routines being employed across modeling situations
  – Logistic regression, log-linear modeling, FA, SEM, LCA

ML Estimation of Person Parameters When Item Parameters Are Known
• Assume the item parameters b_j, a_j, and c_j are known
• Assume unidimensionality, local independence, and respondent independence:
  P(X | \Theta) = P(X_1, \ldots, X_N | \theta_1, \ldots, \theta_N) = \prod_{i=1}^{N} \prod_{j=1}^{J} P(X_{ij} | \theta_i)
  (the conditional probability now depends on the person parameters only)
• The likelihood function for the person parameters only:
  L(\theta_1, \ldots, \theta_N | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(\theta_1, \ldots, \theta_N | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]

ML Estimation of Person Parameters When Item Parameters Are Known
• Choose each \theta_i such that L or ln[L] is maximized
• Let's suppose we have one examinee:
  \ln L(\theta_i | X_i) = \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• Maximize this function using any of several methods
• We'll use Newton-Raphson

Newton-Raphson Estimation Recap
• Recall that NR seeks to find the root of a function (the point where the function = 0)
• NR updates follow the general structure
  updated value = current value - (function of interest) / (derivative of the function of interest)
  x_{\text{next}} = x - \frac{f(x)}{f'(x)}
• What is our function of interest?
• What is the derivative of this function?
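A hedged sketch of this one-examinee log-likelihood under the Rasch model; the item difficulties and response pattern below are hypothetical, chosen only to illustrate evaluating ln L(\theta_i | x_i) on a grid of \theta values before turning to Newton-Raphson.

import numpy as np

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])      # assumed (known) item difficulties
x = np.array([1, 1, 1, 0, 0])                  # assumed response pattern for one examinee

def log_likelihood(theta):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))     # Rasch item response probabilities
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate on a grid of theta values; the grid maximizer approximates the MLE.
grid = np.arange(-4.0, 4.01, 0.1)
ll = np.array([log_likelihood(t) for t in grid])
print("grid maximizer:", round(grid[np.argmax(ll)], 2))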
Newton-Raphson Estimation of Person Parameters
• Newton-Raphson uses the derivative of the function of interest
• Here, our function of interest is itself a derivative: the first derivative of ln[L] with respect to \theta_i,
  \frac{\partial \ln L(\theta_i | \mathbf{x}_i)}{\partial \theta_i}
• We'll therefore need the second derivative as well as the first derivative:
  \frac{\partial^2 \ln L(\theta_i | \mathbf{x}_i)}{\partial \theta_i^2}
• Updates are given by
  \theta_i^{\text{next}} = \theta_i - \frac{\partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i}{\partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2}

ML Estimation of Person Parameters When Item Parameters Are Known: The Log-Likelihood
• The log-likelihood to be maximized
  [Figure: the log-likelihood plotted as a function of \theta_i]
• Select a start value and iterate toward a solution using Newton-Raphson
• A "hill-climbing" sequence

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Start at \theta_i = -1.0
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = 3.211
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -2.920
  \theta_i^{\text{next}} = -1 - \frac{3.211}{-2.920} \approx 0.09

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Move to 0.09
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = -0.335
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -3.363
  \theta_i^{\text{next}} \approx -0.0001

ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson
• Move to -0.0001
  \partial \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i = 0.0003
  \partial^2 \ln L(\theta_i | \mathbf{x}_i) / \partial \theta_i^2 = -3.368
• When the change in \theta_i is arbitrarily small (e.g., less than 0.001), stop estimation
• There is no meaningful change in the next step
• The key is that the slope of the tangent is 0
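A minimal sketch of these per-examinee Newton-Raphson updates for a 2-PL model with known item parameters; the discriminations, difficulties, response pattern, and start value are hypothetical, and the stopping rule uses the 0.001 change criterion from the slides. For the 2-PL, the first derivative is \sum_j a_j (x_{ij} - P_{ij}) and the second derivative is -\sum_j a_j^2 P_{ij}(1 - P_{ij}).

import numpy as np

a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])        # assumed discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # assumed difficulties
x = np.array([1, 1, 0, 1, 0])                  # assumed responses

theta = -1.0                                    # start value
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    first = np.sum(a * (x - p))                 # d lnL / d theta
    second = -np.sum(a**2 * p * (1 - p))        # d^2 lnL / d theta^2
    theta_next = theta - first / second         # Newton-Raphson update
    print(f"theta = {theta:8.4f}  first = {first:8.4f}  second = {second:8.4f}")
    if abs(theta_next - theta) < 0.001:         # stop when the change is arbitrarily small
        theta = theta_next
        break
    theta = theta_next
print("ML estimate:", round(theta, 4))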
Newton-Raphson Estimation of Multiple Person Parameters
• But we have N examinees, each with a \theta_i to be estimated:
  L(\theta_1, \ldots, \theta_N | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(\theta_1, \ldots, \theta_N | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• We need a multivariate version of the Newton-Raphson algorithm

First Order Derivatives
• The multivariate Newton-Raphson update is
  (\theta_1, \theta_2, \ldots, \theta_N)'_{\text{next}} = (\theta_1, \theta_2, \ldots, \theta_N)' - \mathbf{H}^{-1} \mathbf{g}
  where \mathbf{g} is the vector of first order derivatives of the log-likelihood,
  \mathbf{g} = ( \partial \ln L / \partial \theta_1, \; \partial \ln L / \partial \theta_2, \; \ldots, \; \partial \ln L / \partial \theta_N )'
  and \mathbf{H} is the N × N matrix of second order derivatives with elements \partial^2 \ln L / \partial \theta_i \partial \theta_k
• \partial \ln L / \partial \theta_i involves only the terms corresponding to subject i – why?

Second Order Derivatives
• \mathbf{H} is the Hessian: the matrix of second order partial derivatives of the log-likelihood
• This matrix needs to be inverted – why?
• In the current context, this matrix is diagonal: because of respondent independence, every cross-derivative \partial^2 \ln L / \partial \theta_i \partial \theta_k (i ≠ k) equals 0

Second Order Derivatives
• The inverse of the Hessian is therefore also diagonal, with elements that are the reciprocals of the diagonal of the Hessian
• Updates for each \theta_i do not depend on any other subject's \theta

Second Order Derivatives
• The updates for each \theta_i are independent of one another
• The procedure can be performed one examinee at a time

ML Estimation of Person Parameters When Item Parameters Are Known: Standard Errors
• The approximate, asymptotic standard error of the ML estimate of \theta_i is
  SE(\hat{\theta}_i) = \frac{1}{\sqrt{I(\theta_i)}} \approx \frac{1}{\sqrt{I(\hat{\theta}_i)}}
• where I(\theta_i) is the information function:
  I(\theta_i) = -E\left[ \frac{\partial^2 \ln L}{\partial \theta_i^2} \right]
• Standard errors are
  – asymptotic with respect to the number of items
  – approximate because only an estimate of \theta_i is employed
  – asymptotically approximately unbiased

ML Estimation of Person Parameters When Item Parameters Are Known: Strengths
• ML estimates have some desirable qualities
  – They are consistent
  – If a sufficient statistic exists, the MLE is a function of that statistic (Rasch models)
  – They are asymptotically normally distributed
  – They are asymptotically the most efficient (least variable) estimators among the class of normally distributed unbiased estimators
• Asymptotic with respect to what?

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• ML estimates have some undesirable qualities
  – Estimates may fly off into outer space
  – They do not exist for so-called "perfect scores" (all 1's or all 0's)
  – They can be difficult to compute or verify when the likelihood function is not single-peaked (which may occur with the 3-PLM or more complex IRT models)

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• Strategies to handle wayward solutions
  – Bound the amount of change at any one iteration
    • Atheoretical
    • No longer common
  – Use an alternative estimation framework (Fisher scoring, Bayesian)
• Strategies to handle perfect scores
  – Do not estimate \theta_i
  – Use an alternative estimation framework (Bayesian)
• Strategies to handle local maxima
  – Re-estimate the parameters using different starting points and look for agreement

ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses
• An alternative to the Newton-Raphson technique is Fisher's method of scoring
  – Instead of the Hessian, it uses the information matrix (based on the Hessian)
  – This usually leads to quicker convergence
  – It is often more stable than Newton-Raphson
• But what about those perfect scores?
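Before turning to the Bayesian answer to that question, a short sketch of the standard-error formula from the slides above, using the 2-PL information function I(\theta) = \sum_j a_j^2 P_j (1 - P_j); the item parameters and the estimate \hat{\theta} below are hypothetical.

import numpy as np

a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])        # assumed discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])      # assumed difficulties
theta_hat = 0.42                               # a (hypothetical) ML estimate of theta

p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
information = np.sum(a**2 * p * (1 - p))       # test information evaluated at theta_hat
se = 1.0 / np.sqrt(information)                # SE(theta_hat) = 1 / sqrt(I(theta_hat))
print(f"I(theta_hat) = {information:.3f},  SE = {se:.3f}")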
Bayes' Theorem
• We can avoid some of the problems that occur in ML estimation by employing a Bayesian approach
• All entities are treated as random variables
• Bayes' Theorem for random variables A and B:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)}
  – P(A | B) is the posterior distribution of A given B: "the probability of A, given B"
  – P(B | A) is the conditional probability of B given A
  – P(A) is the prior probability of A
  – P(B) is the marginal probability of B

Bayes' Theorem
• If A is discrete:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{\sum_A P(B | A) P(A)}
• If A is continuous:
  P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{\int_A P(B | A) P(A) \, dA}
• Note that P(B | A) = L(A | B)

Bayesian Estimation of Person Parameters: The Posterior
• Select a prior distribution for \theta_i, denoted P(\theta_i)
• Recall that the likelihood function takes the form P(X_i | \theta_i)
• The posterior density of \theta_i given X_i is
  P(\theta_i | X_i) = \frac{P(X_i | \theta_i) P(\theta_i)}{P(X_i)} = \frac{P(X_i | \theta_i) P(\theta_i)}{\int P(X_i | \theta_i) P(\theta_i) \, d\theta_i}
• Since P(X_i) is a constant,
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)

Bayesian Estimation of Person Parameters: The Posterior
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)
  [Figure: the likelihood, the prior, and the resulting posterior plotted over \theta_i]

Maximum A Posteriori Estimation of Person Parameters
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i)
• The Maximum A Posteriori (MAP) estimate \tilde{\theta}_i is the maximum of the posterior density of \theta_i
• It is computed by maximizing the posterior density, or its log
• Find \theta_i such that
  \frac{\partial \ln P(\theta_i | X_i)}{\partial \theta_i} = 0
• Use Newton-Raphson or Fisher scoring
• The maximum of ln[P(\theta_i | X_i)] occurs at the maximum of ln[P(X_i | \theta_i)] + ln[P(\theta_i)]
• This can be thought of as augmenting the likelihood with prior information

Choice of Prior Distribution
• Choosing P(\theta_i) ~ U(-∞, ∞) yields a posterior proportional to the likelihood:
  P(\theta_i | X_i) \propto P(X_i | \theta_i) P(\theta_i) \propto P(X_i | \theta_i)
• In this case, the MAP is very similar to the ML estimate
• The prior distribution P(\theta_i) is often assumed to be N(0, 1)
  – The normal distribution is commonly justified by appeal to the CLT
  – The choice of mean and variance identifies the scale of the latent continuum

MAP Estimation of Person Parameters: Features
• The approximate, asymptotic standard error of the MAP is
  SE(\tilde{\theta}_i) = \frac{1}{\sqrt{I(\theta_i)}} \approx \frac{1}{\sqrt{I(\tilde{\theta}_i)}}
  where I(\theta_i) is the information from the posterior density
• Advantages of the MAP estimator
  – It exists for every response pattern – why?
  – It generally leads to a reduced tendency for local extrema
• Disadvantages of the MAP estimator
  – A prior must be specified
  – It exhibits shrinkage in that it is biased toward the mean: many items may be needed to "swamp" the prior if the prior is misspecified
  – The calculations are iterative and may take a long time
  – It may still result in local extrema

Expected A Posteriori (EAP) Estimation of Person Parameters
• The Expected A Posteriori (EAP) estimator is the mean of the posterior distribution:
  \bar{\theta}_i = \int \theta_i P(\theta_i | X_i) \, d\theta_i
• Exact computations are often intractable
• We approximate the integral using numerical techniques
• Essentially, we take a weighted average of the values, where the weights are determined by the posterior distribution
  – Recall that the posterior distribution is itself determined by the prior and the likelihood

Numerical Integration Via Quadrature
• The posterior distribution, with quadrature points
• Evaluate the heights of the distribution at each point
• Use the relative heights as the weights
  [Figure: the posterior density evaluated at a set of quadrature points; e.g., a point whose height is 0.021 out of a total of 0.165 receives relative weight 0.021 / 0.165 ≈ 0.127]
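A minimal sketch of EAP estimation via quadrature with a N(0, 1) prior, in the spirit of the weighting scheme just described; the quadrature grid, Rasch item difficulties, and response pattern below are hypothetical.

import numpy as np

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])        # assumed Rasch difficulties
x = np.array([1, 1, 1, 0, 0])                    # assumed response pattern

Q = np.arange(-4.0, 4.01, 0.1)                   # quadrature points
prior = np.exp(-0.5 * Q**2)                      # N(0, 1) prior (up to a constant)

p = 1.0 / (1.0 + np.exp(-(Q[:, None] - b)))      # P_ij at each quadrature point
like = np.prod(p**x * (1 - p)**(1 - x), axis=1)  # likelihood at each point

posterior = like * prior
H = posterior / posterior.sum()                  # normalized weights H(Q_r)

eap = np.sum(Q * H)                              # EAP = weighted mean of the posterior
sd = np.sqrt(np.sum((Q - eap)**2 * H))           # posterior SD (the EAP's standard error)
print(f"EAP = {eap:.3f},  posterior SD = {sd:.3f}")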
EAP Estimation of θ via Quadrature
• The Expected A Posteriori (EAP) estimate is computed as a weighted average:
  \bar{\theta}_i = \int \theta_i P(\theta_i | X_i) \, d\theta_i \approx \sum_r Q_r H(Q_r)
  where H(Q_r) is the weight of quadrature point Q_r in the posterior (compare Embretson & Reise, 2000, p. 177)
• The standard error is the standard deviation of the posterior and may also be approximated via quadrature:
  \sigma^2(\bar{\theta}_i) = \int (\theta_i - \bar{\theta}_i)^2 P(\theta_i | X_i) \, d\theta_i \approx \sum_r (Q_r - \bar{\theta}_i)^2 H(Q_r)

EAP Estimation of θ via Quadrature
• Advantages
  – Exists for all possible response patterns
  – Non-iterative solution strategy
  – Not a maximum, therefore no local extrema
  – Has the smallest MSE in the population
• Disadvantages
  – A prior must be specified
  – Exhibits shrinkage toward the prior mean: if the prior is misspecified, many items may be needed to "swamp" the prior

ML Estimation of Item Parameters When Person Parameters Are Known: Assumptions
• Assume
  – the person parameters \theta_i are known
  – respondent and local independence
  L(b_1, a_1, c_1, \ldots, b_J, a_J, c_J | X) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}
  \ln L(b_1, a_1, c_1, \ldots, b_J, a_J, c_J | X) = \sum_{i=1}^{N} \sum_{j=1}^{J} [ x_{ij} \ln(P_{ij}) + (1 - x_{ij}) \ln(1 - P_{ij}) ]
• Choose the values of the item parameters that maximize ln[L]

Newton-Raphson Estimation
• The full multivariate Newton-Raphson update stacks all 3J item parameters (b_1, a_1, c_1, \ldots, b_J, a_J, c_J) in one vector and subtracts the inverse of the (3J × 3J) Hessian of second order partial derivatives times the vector of first order derivatives – the same structure as the update for the person parameters
• What is the structure of this matrix?

ML Estimation of Item Parameters When Person Parameters Are Known
• Just as we could estimate subjects one at a time thanks to respondent independence, we can estimate items one at a time thanks to local independence
• Multivariate Newton-Raphson for a single item:
  \begin{pmatrix} b_j \\ a_j \\ c_j \end{pmatrix}^{\text{next}} =
  \begin{pmatrix} b_j \\ a_j \\ c_j \end{pmatrix} -
  \begin{pmatrix}
    \frac{\partial^2 \ln L}{\partial b_j^2} & \frac{\partial^2 \ln L}{\partial b_j \partial a_j} & \frac{\partial^2 \ln L}{\partial b_j \partial c_j} \\
    \frac{\partial^2 \ln L}{\partial a_j \partial b_j} & \frac{\partial^2 \ln L}{\partial a_j^2} & \frac{\partial^2 \ln L}{\partial a_j \partial c_j} \\
    \frac{\partial^2 \ln L}{\partial c_j \partial b_j} & \frac{\partial^2 \ln L}{\partial c_j \partial a_j} & \frac{\partial^2 \ln L}{\partial c_j^2}
  \end{pmatrix}^{-1}
  \begin{pmatrix} \frac{\partial \ln L}{\partial b_j} \\ \frac{\partial \ln L}{\partial a_j} \\ \frac{\partial \ln L}{\partial c_j} \end{pmatrix}

ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
• To obtain the approximate, asymptotic standard errors
  – Invert the associated information matrix, which yields the variance-covariance matrix
  – Take the square root of the elements of the diagonal:
    \sqrt{\text{Diag}\left[ I(\mathbf{b}, \mathbf{a}, \mathbf{c})^{-1} \right]}
• These are asymptotic with respect to sample size, and approximate because we only have estimates of the parameters
• This is conceptually similar to the standard errors for the estimation of \theta:
  SE(\hat{\theta}_i) = \sqrt{I(\hat{\theta}_i)^{-1}}
• But why do we need a matrix approach here?

ML Estimation of Item Parameters When Person Parameters Are Known: Standard Errors
• ML estimates of item parameters have the same properties as those for person parameters: consistent, efficient, and asymptotic (with respect to the number of subjects)
• The a_j parameters can be difficult to estimate and tend to be inflated with small sample sizes
• The c_j parameters are often difficult to estimate well
  – Usually because there is not a lot of information in the data about the lower asymptote
  – This is especially true when items are easy
• Generally, larger and more heterogeneous samples are needed to estimate the 2-PL and 3-PL
• Bayesian estimation can be employed (more on this later)
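A hedged sketch of estimating a single item's b_j and a_j (a 2-PL item, so no c_j) with known person parameters. It uses Fisher scoring, i.e., the expected information matrix in place of the full Hessian, as mentioned on the Fisher-scoring slide earlier; the simulated thetas, responses, true values, and start values are all hypothetical.

import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=1000)                          # "known" person parameters
a_true, b_true = 1.2, 0.3
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-a_true * (theta - b_true))))

b, a = 0.0, 1.0                                        # start values
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    w = p * (1 - p)
    # Gradient of lnL with respect to (b, a)
    grad = np.array([-a * np.sum(x - p),
                     np.sum((x - p) * (theta - b))])
    # Expected information matrix for (b, a)
    info = np.array([[a**2 * np.sum(w),            -a * np.sum(w * (theta - b))],
                     [-a * np.sum(w * (theta - b)), np.sum(w * (theta - b)**2)]])
    step = np.linalg.solve(info, grad)                 # Fisher-scoring step
    b, a = b + step[0], a + step[1]
    if np.max(np.abs(step)) < 1e-4:                    # stop when the change is arbitrarily small
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))             # SEs from the inverted information matrix
print(f"b_hat = {b:.3f} (SE {se[0]:.3f}),  a_hat = {a:.3f} (SE {se[1]:.3f})")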