Example: modeling reaction times of schizophrenics and nonschizophrenics Bayesian Data Analysis (page 468)1 Schizophrenia2 • Schizophrenia is most likely a group of illnesses affecting the most characteristically human of abilities such as language, planning, emotion and perceptions. It is relatively common, often becomes apparent in early adulthood, and frequently results in prolonged disability. • Schizophrenia is traditionally defined by symptoms and signs according to one of several sets of diagnostic criteria. Both major classification systems are: The International Classification of Diseases produced by the World Health Organization, and the American Psychiatric Association Fourth Edition of the Diagnostic and Statistical Manual (DSM-IV). Both sets of criteria are very similar and emphasize traditional symptoms such as hallucinations of voices commenting on the person’s actions, delusions, experiences of various forms of interference with the person’s thoughts, incoherent or irrelevant speech, blunting of affect and decline in general level of functioning. The Schizophrenic, The Manic Depressant, and the Bi Polar, by Gleen Brady. • Although schizophrenia appears to affect males and females in equal Queensland Center for Schizophrenia Research collection. numbers, males tend to develop schizophrenia at a younger age than females (modal age of onset for males:females = 20-24:25-29). In general the course of the disorder is worse for males than females. Men have more and longer hospital stays, higher suicide rates, greater substance abuse, 1 poorer living conditions and a worse psychosocial outcome than women. This example first appeared in the paper The analysis of repeated-measures data on schizophrenic reaction times using mixture models, Berlin, T. and Rubin, D.; Statistics in Medicine, Vol. 14, 747-768 (1995) 1 2 Extracted from Queensland Center for Schizophrenia Research (http://www.qcsr.uq.edu.au) 2 Dataset Histogram of Y = log(response time in milliseconds) Non-schizophrenic individuals • Psychologists at Harvard University performed an experiment measuring thirty reaction times for each of seventeen male subjects: eleven non-schizophrenics and six schizophrenics. 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 5.0 5.5 6.0 6.5 7.0 7.5 • Manual reaction times to visual cues, where subjects watch a screen and move their fingers from one button to another when a signal appears, were measured. • There are 30 measurements per individual. 3 4 • Response times appear longer for schizophrenics than for non- Histogram of Y = log(response time in milliseconds) Schizophrenic individuals schizophrenics. • Response times for most of the schizophrenics appear more variable than those for non-schizophrenics. • Because of differences in both mean and variance, a simple linear model analysis seems inappropriate. The lack of fit of a standard model can be assessed by considering 5.5 6.0 6.5 7.0 7.5 5.5 6.0 6.5 7.0 7.5 – Replicate data consistent with the fitted ANOVA model can be generated by first drawing a value of the common variance from its scaled Inv-χ2 posterior distributions, then drawing values of the person means from their normal posterior distribution given the drawn value of the common variance. Finally draw 30 observations for each individual from a normal with the appropriate person mean and the 5.5 6.0 6.5 7.0 7.5 5.5 6.0 6.5 7.0 7.5 common variance. – Replicate the previous step, say, 1000 times, and each time calculate the largest within-person variance, smallest within-person variance, and average within-person variance among the six schizophrenic individuals. – Compare those values with the observed data. 5.5 6.0 6.5 7.0 7.5 5 5.5 6.0 6.5 7.0 7.5 6 • Current psychological theory suggests: (a) Schizophrenics have an attentional deficiency, implying that on some trials they have difficulty attending to the experi- Proposed models • Measurements from schizophrenics arise from a mixture of mental task, with a corresponding delay in initiating a reac(a) a component analogous to the distribution of response time tion. (b) Schizophrenia might be associated with motor reflex retardation implying slower reflexes once a reaction has commenced. for non-schizophrenics. (b) a mean-shifted component. Thus, at least some of the extra-variability arises from an attentional lapse that delays some, but not all, of each schizophrenic’s reaction times. • Although there are systematic relationships between physical measurements (e.g. reaction times and perceptual responses) and symptom variables (e.g. length of institutionalization and severity of disease), the dataset is not detailed enough or large enough to address such factors in interpreting schizophrenic reaction times. 7 8 Model 1 (a) Response times for non-schizophrenic individuals are treated as Model 1 (cont’d) Define, arising independently from a normal distribution with a distinct mean αj for individual j and common variance σy2. • Sj = I(individual j is schizophrenic), where I(·) is the indicator function. (b) Each schizophrenic also has a distinct mean αj for his response times, but more structure is needed to reflect the underlying psychological theory. Attentional deficiency is represented by modeling the distribution of response times for each schizophrenic individual as a two-component mixture: a proportion (1 − λ) of • λ, the probability that an observation will be delayed for a schizophrenic subject to attentional delays; • τ , the size of the delay when an attentional lapse occurs (on the log scale); the responses from schizophrenic i arise from a component anal- • β, the average log response time for the non-delayed observa- ogous to the distribution of non-schizophrenic response times, tions of schizophrenics minus the average log response time for with mean αi and variance nonschizophrenics; σy2, and a proportion λ of the re- sponses from schizophrenic j arise from a shifted component centered at αj + τ also with variance σy2. (c) The comparison of the typical components of α = (α1, . . . , α17) for schizophrenics versus non-schizophrenics addresses the mag- • ζij = I(response i for schizophrenic j is from the shifted component). Further define, • φ = (λ, τ, µ, σα2 , σy2) nitude of schizophrenics’ motor reflex retardation. A hierarchical parameter β measuring this motor retardation was included. 9 • yij the ith response of individual j. 10 Model 1 (cont’d) Model 1 (cont’d) • Hyperprior distribution • The data likelihood is – For non-schizophrenic individuals: yij |αj , φ ∼ N(αj , σy2) – Among schizophrenic individuals: – Non-informative uniform joint prior density on φ. • Some additional considerations – With the given hyperprior the model is not identified, because the trials unaffected by a positive attentional delay (a) for “non-shifted” reaction times yij |αj , φ ∼ N(αj , σy2) (b) for “shifted” reaction times yij |αj , φ ∼ N(αj + τ, σy2) • The model can be expressed in terms of the indicators variables yij |αj , ζij , φ ∼ N(αj + τ ζij , σy2) αj |ζ, φ ∼ N(µ + βSj , σα2 ) ζij |φ ∼ Bernoulli(λSj ) could instead be thought of as being affected by a negative attentional delay. τ was restricted to be positive to identify the model. – The variance components σy2 and σα2 are of course restricted to be positive as well. – The mixture component λ is actually taken to be uniform on [0.001,0.999] as values of zero or one would not correspond to mixture distributions. – The assumption of a common variance σy2 for shifted and unshifted schizophrenic responses is a possible weakness of this model. 11 12 Model 1 (cont’d) • Starting points: 100 points were drawn at random from a • Crude estimate of parameters α̂j = σ̂y2 = µ̂ = 1 30 1 11 1 11 30 a starting point for the ECM maximization algorithm to search 11 1 29 j=1 11 j=1 simplified distribution for φ. Each of those points was used as yij i=1 Model 1: ECM algorithm 30 for modes. (yij − ȳ.j ) 2 i=1 – The simplified distribution is obtaining by adding some randomness to the crude parameter estimates. Specifically, to α̂j obtain a sample from the simplified distribution, all parameters are set at the crude point estimates and then divided β̂ = 1 6 σ̂α2 = 17 j=12 1 16 1 α̂j − 11 17 11 j=1 each parameter by an independent χ21 random variable in α̂j an attempt to cover the modes of the parameter space with marginal Cauchy distributions (i.e. to ensure that the 100 (αj − ᾱ.)2 draws were sufficiently spread out). j=1 It is not necessary to create a preliminary estimates of the indicator variables, ζij , because they are updated as the first step in the ECM and Gibbs sampler computations. 13 14 Model 1: ECM algorithm Model 1: ECM algorithm (cont’d) • E step: we determine the expected joint log posterior density, averaging ζ over its posterior distribution given the last guessed • E step (cont’d): Thus Eold(ζij ) = zij , where value of θ . old The expected “complete-data” log posterior density is: Eold(log p(ζ, θ|y)) = const. + 17 j=1 log(N(αj |µ + βSj , σα2 ))+ 30 17 log(N(yij |αj , σy2))(1 − Eold(ζij ))+ j=1 i=1 log(N(yij |αj + τ, σy2))Eold(ζij ) + 30 17 j=12 i=1 zij = [log(1 − λ)(1 − Eold(ζij )) + log(λ)Eold(ζij )] λold N(yij |αjold +τ old ,σy2 old ) (1−λold )N(yij |αjold ,σy2 old )+λold N(yij |αjold +τ old ,σy2 old ) = old ) exp(σy−2 old((αjold − yij )τ old + .5(τ old)2)) 1 + (1−λ λold For each (i, j), the above expression is a function of (y, θ) and can be computed based on the data y and the current guess, θold. • M step: (a) Update λ λnew = We must compute Eold(ζij ) for each observation (i, j). Given θold and y, the indicators ζij are independent, with conditional posterior densities, 17 30 1 zij 6 × 30 j=12 i=1 (b) Update αj . For each j, αj new P (ζij = 0|θold, y) = 1 − zij σy2(µ + βSj ) + σα2 30 i=1 (yij − zij τ ) = 2 σy + 30σα2 P (ζij = 1|θold, y) = zij 15 −1 16 Model 1: ECM algorithm Model 1: Creating an approximate distribution • M step: (cont’d) • After 100 iterations of ECM from each of 100 starting points, (c) Update τ 17 τ new = 30 j=12 i=1 zij (yij − αj ) 17 30 j=12 i=1 zij minor modes. The minor modes are substantively uninteresting, corresponding (d) Update σy2 σy2 new three local maxima of (α, φ) were found: a major mode and two 17 30 1 = (yij − αj − zij τ )2 17 × 30 j=12 i=1 (e) Update µ 1 αj 11 j=1 11 µnew = (f) Update β to near-degenerate models with the mixture parameter λ near 0, and had little support in the data. Thus, they were ignored, and the target distribution could be considered unimodal for practical purposes. Had we included those two minor modes, any draws from them would have had essentially zero importance weights and would almost certainly have not appeared in the importance- β new 17 1 11 1 = αj − αj 6 j=12 11 j=1 (g) Update σα2 1 (αj − µ − βSj )2 = 17 j=1 17 σα2 new weighted samples. • Once the mode has been found, a multivariate t4 approximation for θ was constructed. The multivariate t4 distribution was centered at the mode with scale determined by the second derivative matrix at the mode, which was computed by numerical differentiation. 17 18 Model 1: Creating an approximate distribution (cont’d) Model 1: Why all this trouble? Why do we have to draw the starting points of the Gibbs sampler by • 2,000 independent random samples of θ were drawn from the importance-weighted resampling. multivariate t approximation. An histogram of the relative values of the 1,000 largest log importance weights shows little vari- • It is possible for the Gibbs sampler to exhibit slow convergence. ation – an indication of the adequacy of the overdispersed ap- To illustrate this point, 10 sequences of 200 steps were drawn proximation as a basis for taking draws to be resampled to create using Gibbs sampling. However, the starting points were drawn the starting distribution. directly from the initial approximate distribution (i.e. the one whose marginals are Cauchy distributions). Inference for some 80 100 scalar estimads were based on the last halves of the sequences. 0 20 40 60 mean β 0.27 λ 0.16 τ 0.84 -2log(density) 757.18 Posterior interval 2.5% 97.5% -0.17 0.70 0.01 0.74 0.70 0.98 681.63 832.74 -8 -6 -4 -2 0 Logarithms of the largest importance ratios from the multivariate t approximation 19 20 Potential scale reduction Est. 97.5% 2.43 3.63 5.15 7.42 1.19 1.35 3.53 5.08 Model 1: Why all this trouble? (cont’d) Model 1: Why all this trouble? (cont’d) • The high potential scale reductions clearly show that the simu- • The single sequence that stands alone started and remains in the lations are far from convergence. To understand better what is neighborhood of one of the minor modes found earlier by max- happening, the last halves of the 10 sequences of log posterior imization. Since the minor mode is of no scientific interest and densities were plotted. has a negligible support in the data, we discard this sequence, as almost certainly would have occurred with importance resam- 850 800 -2log(density) if well chosen, can lead to conservative yet relatively efficient inferences is all-important. 700 • Thus, the use of an overdispersed starting distribution, which, 750 900 pling. 100 120 140 160 180 200 iteration Log posterior densities for ten simulated sequences 21 22 Model 1: Gibbs sampler Model 1: One cycle of the Gibbs sampler • Let θ = (α, φ). Then, the marginal posterior distribution of the model parameters θ is a product of mixture forms: p(θ|y) ∝ 17 j=1 Step 1 - Update ζij for (i, j) ∈ {1, . . . , 30} × {12, . . . , 17} N(αj |µ + βSj , σα2 )× 30 17 (1 − λ)N(yij |αj , σy2) + λSj N(yij |αj + τ, σy2) j=1 i=1 • The Gibbs sampler is easy to apply for our model because the full conditional posterior distributions – p(φ|α, ζ, y), p(α|φ, ζ, y), and p(ζ|φ, α, y) – have standard forms and can be easily sampled from. ζij |θ, y ∼ Bernoulli(zij ) The indicators ζij are fixed at 0 for the non-schizophrenic subjects (j < 12). Step 2 - Update αj for j ∈ {1, . . . , 17} σy2(µ + βSj ) + σα2 30 30 i=1 (yij − ζij τ ) 1 αj |φ, ζ, y ∼ N , 2+ 2 σy2 + 30σα2 σα σy Step 3 - Update λ • Recall that a set of ten starting points was drawn by importance resampling from the t4 approximation centered at the ma- λ|τ, µ, σα2 , σy2, α, ζ ∼ Beta(h + 1, 180 − h + 1) jor mode. This distribution is intended to approximate our ideal where h = starting conditions: for each scalar stimand of interest, the mean Step 4 - Update σy2 is close to the target mean and the variance is greater than the target variance. 23 17 j=12 30 i=1 ζij ⎛ ⎞ 30 17 1 σy2|α, λ, ζ ∼ Inv-χ2 ⎝508, (yij − αj − ζij τ )2⎠ 508 j=1 i=1 24 Model 1: One cycle of the Gibbs sampler (cont’d) Model 1: Gibbs sampler (cont’d) • Possible difficulties at a degenerate point Step 5 - Update σα2 ⎛ σα2 |α, λ, ζ ∼ Inv-χ2 ⎝15, Step 6 - Update τ 17 τ |α, λ, ζ, σα2 , σy2 Step 7 - Update µ ∼N j=12 17 i=1 ζij (yij − 30 j=12 i=1 ζij 17 µ|α, λ, ζ, σα2 , σy2 ∼ N ⎝ Step 8 - Update β 1 (αj − µ − βSj )2⎠ 15 j=1 30 ⎛ If all ζij ’s are zero, then the mean and variance of the conditional ⎞ αj ) , 17 j=12 ⎞ σ2 1 (αj − βSj ), α ⎠ 17 j=1 17 17 distribution of β are undefined, because τ has an improper prior distribution, and, conditional on ij ζij = 0, there are no de- σy2 30 i=1 ζij layed reactions and thus no information about τ . Strictly speaking, this means that our posterior distribution is improper. For the data at hand, however, this degenerate point has extremely low posterior probability and is not reached by any of our sim ulations. If the data were such that ij ζij = 0 were a realistic possibility, it would be necessary to assign an informative prior distribution for τ . ⎛ ⎞ 17 2 1 σ (αj − µ), α ⎠ β|α, λ, ζ, σα2 , σy2 ∼ N ⎝ 6 j=12 6 Step 9 - Return to Step 1 and begin a new iteration • Gibbs sampler implementation Ten sequences of 200 iterations each were simulated. The first half of each sequence was discarded. Hence, we obtain posterior intervals for all quantities of interest from the quantiles of the 1000 simulations from the second halves of the sequences. 25 26 Model 1: Gibbs sampling results Model 1: Gibbs sampler Inference from the iterative simulations Parameter mean 2.5% α1 5.73 5.66 α2 5.89 5.82 α3 5.71 5.64 α4 5.71 5.64 5.58 5.51 α5 5.80 5.73 α6 5.86 5.79 α7 5.59 5.52 α8 5.55 5.48 α9 5.77 5.71 α10 5.72 5.65 α11 α12 5.73 5.66 α13 6.03 5.97 α14 6.01 5.93 6.19 6.08 α15 α16 6.19 6.11 α17 6.07 6.00 σα 0.14 0.09 β 0.32 0.17 λ 0.12 0.07 τ 0.85 0.74 0.19 0.18 σy σα /σy 0.74 0.50 -2log(density) 747.33 727.81 Posterior quantiles 25% 50% 75% 97.5% 5.71 5.73 5.76 5.80 5.86 5.89 5.91 5.95 5.69 5.71 5.73 5.78 5.68 5.71 5.73 5.77 5.56 5.58 5.60 5.65 5.77 5.80 5.82 5.86 5.83 5.86 5.88 5.92 5.56 5.59 5.61 5.65 5.53 5.55 5.57 5.62 5.75 5.77 5.80 5.84 5.69 5.72 5.74 5.78 5.71 5.73 5.75 5.80 6.01 6.03 6.05 6.10 5.98 6.01 6.04 6.09 6.15 6.19 6.22 6.29 6.16 6.19 6.22 6.26 6.04 6.07 6.09 6.14 0.12 0.14 0.16 0.21 0.27 0.32 0.37 0.48 0.10 0.12 0.14 0.18 0.81 0.85 0.88 0.96 0.18 0.19 0.19 0.20 0.64 0.73 0.85 1.11 739.92 746.88 753.84 768.35 27 Potential scale reduction Est. 97.5% 1.00 1.00 1.00 1.00 1.00 1.01 1.00 1.02 1.00 1.01 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.01 1.00 1.01 1.00 1.00 1.00 1.00 1.00 1.01 1.03 1.07 1.01 1.03 1.01 1.02 1.00 1.00 1.01 1.02 1.02 1.04 1.02 1.05 1.01 1.02 1.00 1.00 1.01 1.01 • Estimated potential scale reductions The estimated potential scale reductions shown in the previous table are close to 1, they suggest that further simulation will not markedly improve our estimates of the scalar stimands shown. More precise estimation of the means and variances of the target distribution, as would be achieved by further simulation, would not narrow the estimated posterior intervals much – nearly all the width of the intervals is due to the posterior variances themselves, not uncertainty due to simulation variability. 28 Model 1: Gibbs sampler Model 1: Checking the model Inference from the iterative simulations Inference from the iterative simulations (cont’d) • Results for this model suggest that approximately 7-18 per cent • Before accepting the results from model 1 as scientifically mean- of each schizophrenic’s responses suffer from an attentional lapse ingful, we should be confident that the model is not contradicted (λ), with a 110% (= e.74 −1) to 161 per cent (= e.96 −1) increase by the data. in mean response time due to this lapse (τ ). The interpretation here is based on the fact that τ measures increases on the scale of 100×log(response time). Note that the 95% interval for τ also • Bayesian model monitoring using posterior predictive checks, can be used to see whether important observed features of the data could plausibly have arisen under the assumed model. excludes the null value of zero, which is on the boundary of the parameter space; the real reason for rejecting the model with λ = 0 is not this interval, but rather the evidence that a simple linear model does not come close to explaining the observed • Operationally, we examine whether the posterior predictive distribution could lead to future replicate data that look like our data set. variability in the data. This group of schizophrenics also shows • This framework views data analysis as a process that involves evidence of motor retardation resulting in (e.17 − 1)=19% to model development, model fitting, and model monitoring, re- (e.48 − 1)= 62% slower mean reaction times (ᾱschiz − ᾱnon−schiz ). turning to the model development stage when inadequacies in These results suggest a rather dramatic attentional delay when model fit are found. it occurs, though it appears to be infrequent, combined with a more modest delay due to motor retardation. 29 30 Model 1: Checking the model Model 1: Checking the model Inference from the iterative simulations (cont’d) Inference from the iterative simulations (cont’d) • A general strategy for the model-monitoring step is to identify • We simulate predictive datasets from the normal–mixture model a statistic that is scientifically relevant and that would not au- for each of the 1000 draws of the parameters from the posterior tomatically be well fit by the assumed model, then to generate distribution. For each of those 1000 simulated datasets, y rep, we a number of replicate data sets from the posterior distribution compute the two test quantities, Tmin(y rep) and Tmax(y rep) of parameters from the fitted model, and finally to ascertain whether the value of the statistic in the observed data is ex- • We drew a scatterplot of the 1000 simulated values of the test quantities, with the observed values (indicated by ×). treme relative to the distribution of the value of the statistic calculated from the generated data sets. • The histograms of schizophrenics’ reaction times indicate that there is substantial variation in the within-person response time • With regard to these test quantities, the observed data y are atypical of the posterior predictive distribution – Tmin is too low and Tmax is too high with estimated p-values of 0.000 and 1.000 (to three decimal places). variance. To investigate whether the model can explain this feature of the data, we compute sj , the standard deviation of the 30 log reaction times yij , for each schizophrenic individual • A reformulation of the model is needed to fit the data more accurately. (j = 12, . . . , 17). Then, two test quantities were defined: Tmin and Tmax, the smallest and largest of the six values sj . 31 32 Expanding the model The following models are possible extensions of model 1. Expanding the model (cont’d) • Model 3 Model 3 postulates that there are two types of schizophrenic indi- • Model 2 viduals: those who are not susceptible to attentional deficiency, A first extension of model 1, model 2 allows the variance of and those who have a proportion λ of shifted response times due shifted responses to be different from the variance of non-shifted to attentional deficiency. Consequently, a missing indicator Wj responses for schizophrenics. for individual j which equals 1 if some responses can be shifted and 0 if they are never shifted is introduced. The separate vari- 2 2 yij |αj , ζij , φ ∼ N(αj + τ ζij , (1 − ζij )σy1 + ζij σy2 ) αj |ζ, φ ∼ N(µ + βSj , σα2 ) ζij |φ ∼ Bernoulli(λSj ) ances assumption of model 2 is dropped in favor of the common variance of model 1 to see if the indicator Wj ’s alone can account for the spread in variances among schizophrenics. yij |αj , ζij , φ ∼ N(αj + τ ζij , σy2) αj |ζ, φ ∼ N(µ + βSj , σα2 ) ζij |φ ∼ Bernoulli(λSj Wj ) Wj |S, θ ∼ Bernoulli(ωSj ) 33 34 Expanding the model (cont’d) Expanding the model (cont’d) • Possible alternatives to model 4 • Model 4 Model 4 includes both the possibility that not all schizophrenic individuals exhibit attentional deficiency as well as the possibility that response times shifted due to attentional deficiency have different variances from non-shifted response times. Possible extensions of model 4 include – letting τ vary across individual schizophrenics and estimating a mean and a variance for the individual values; e.g. assuming τj ∼ N(µτ , στ2) 2 2 yij |αj , ζij , φ ∼ N(αj + τ ζij , (1 − ζij )σy1 + ζij σy2 ) – letting λ vary across individual schizophrenics; e.g. assuming either λj ∼ Beta(λ1, λ2) αj |ζ, φ ∼ N(µ + βSj , σα2 ) or ζij |φ ∼ Bernoulli(λSj Wj ) Wj |S, θ ∼ Bernoulli(ωSj ) logit(λj ) ∼ N(µλ, σλ2 ) • Any of these alternatives would naturally be accompanied by treating person means for non-schizophrenics and the means of the non-shifted component of schizophrenic reaction times as random effects, i.e. αj ∼ N(µα , σα2 ). 35 36