Quantifying Opinion about a Logistic Regression using Interactive Graphics

Paul Garthwaite
The Open University
Joint work with Shafeeqah Al-Awadhi

Introduction/Plan

• This work arose from a practical problem in logistic regression.
• The theory extends easily to elicit opinion about the link function of any glm.
• I will outline the method for glms in general.
• The motivating problem has some additional (commonly occurring) structure that the elicitation method exploits.
• Interactive computing is used to elicit opinion.
• Prior models can be formed that aim to allow a small amount of data to correct some potential systematic biases in assessments.
• Results for the practical problem will be given.

Motivating Example

The task is to model the habitat distribution of fauna in south-east Queensland: bats, birds, mammals, etc.

Available information:
• Environmental attributes on a GIS database.
• Sample information of presence/absence at 300 to 400 sites.
• Background knowledge of ecologists.

The ecologists have seen the bat (say) in various locations, but this information is difficult to use in a traditional statistical analysis because it has not been obtained from any sampling scheme.

Prob(presence) = f(environmental attributes)

Continuous variables: elevation; quarterly rainfall and temperatures; canopy cover; slope; aspect.

Factors: land type; vegetation; forest structure; logging; grazing; etc.

A workshop with 15 ecologists indicated
• unimodal or monotonic relationships
• independence between attributes in their effect on the probability of presence.

[Figure: prob(presence), on a scale from 0 to 0.5, plotted against attribute, on a scale from 0 to 12.]

Generalised Linear Model (glm)

The model has the form $Y = g[\mu(r)]$, where $g[\cdot]$ is the link function.
For logistic regression, $g[\mu] = \ln(\mu/(1-\mu))$ and $\mu$ is the probability of presence.
$r$ is the vector of predictor variables.
From the $i$th predictor variable, $r_i$, a vector of explanatory variables $X_i = (X_{i,1}, \ldots, X_{i,\delta(i)})'$ is constructed such that we have the linear equation

$$Y = \alpha + \beta_1' X_1 + \cdots + \beta_{m+n}' X_{m+n}$$

Define:

$$X_{i,j} = \begin{cases} 0 & \text{if } R_i \le r_{i,j-1} \\ R_i - r_{i,j-1} & \text{if } r_{i,j-1} < R_i \le r_{i,j} \\ r_{i,j} - r_{i,j-1} & \text{if } r_{i,j} < R_i \end{cases}$$

and then $Y$ is a linear function of $X_i = (X_{i,1}, \ldots, X_{i,\delta(i)})'$.

Factors: One factor level (the best one, say) is chosen as the reference level. Each other level is given a dummy 0/1 variable $X_{i,j}$ that equals 1 for that level and 0 for all other levels:

$$X_{i,j} = \begin{cases} 1 & \text{if } R_i = r_{i,j} \\ 0 & \text{otherwise} \end{cases}$$

The sampling model is

$$Y = \alpha + \beta_1' X_1 + \cdots + \beta_{m+n}' X_{m+n}$$

Let $\beta = (\beta_1', \ldots, \beta_{m+n}')'$. For the prior distribution we put

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} b_0 \\ b \end{pmatrix}, \begin{pmatrix} \sigma_{00} & \sigma_1' \\ \sigma_1 & \Sigma \end{pmatrix} \right)$$

The values of the hyperparameters $b_0$, $b$, $\sigma_{00}$, $\sigma_1$ and $\Sigma$ (shown in red on the slide) must be chosen by the expert to represent his or her opinions.

Assessing medians and quartiles.

These are fundamental assessment tasks the expert performs.

How far is it from Aberdeen to Southampton?

    25%    |    25%    |    25%    |    25%
          470         525         600    miles

The median (blue) is assessed first and then the lower and upper quartiles (red).
Ecologists were given practice at performing these tasks in preparatory training and explanation.

Eliciting $b_0$ and $\sigma_{00}$

$b_0 = E(\alpha)$ and $\sigma_{00} = \mathrm{Var}(\alpha)$. Also, $Y = \alpha$ at the reference point.

The expert assesses $m_{0.50}$, the median of $\mu$ at this point. (For logistic regression, $m_{0.50}$ is the expert's median for the probability of presence.) We put

$$b_0 = g[m_{0.50}]$$

The expert also assesses the lower and upper quartiles, $m_{0.25}$ and $m_{0.75}$. We put

$$\sigma_{00} = \left\{ \frac{g(m_{0.75}) - g(m_{0.25})}{1.349} \right\}^2$$

where 1.349 is the interquartile range of a standard normal distribution.

Eliciting $b$ and $\sigma_1$

• $b$ is determined from the unconditional assessments.
• $\sigma_1$ is determined from assessments conditional on $\mu$ equalling $m_{0.75}$.

Eliciting $b$ and $\sigma_1$ for factors.

Put $y_{0.75} = g[m_{0.75}]$. Then

$$E[\beta \mid \alpha = y_{0.75}] = b + \sigma_1 \sigma_{00}^{-1} (y_{0.75} - b_0)$$

enabling $\sigma_1$ to be estimated.

Assessments to obtain $\Sigma$

Conditional on the first three line segments being correct, the dashed lines are quartiles of where the line might continue.

[Go to program]

Conditional Assessments for Factors

• The circles indicate conditions.
• Dotted horizontal bars are previous assessments.
• Solid bars are current assessments and must be within the dotted bars if $\Sigma$ is to be positive-definite.

[Go to program]

Calculating $\Sigma$

Iterative calculations determine $\Sigma$. Start by estimating the lower-right scalar element of $\Sigma$, and call it $A_p$. Then estimate the lower-right $2 \times 2$ submatrix of $\Sigma$ and call it $A_{p-1}$, etc.

If

$$A_i = \begin{pmatrix} a_{ii} & a_i' \\ a_i & A_{i+1} \end{pmatrix}$$

and $A_{i+1}$ is positive-definite, then so is $A_i$ provided

$$a_{ii} > a_i' A_{i+1}^{-1} a_i$$

Alternative Prior Models

Individuals can show systematic bias in their subjective assessments. The aim is to form prior models that allow a small amount of data to largely correct some potential biases.

Prior 2

The marginal distribution of $\alpha$ is diffuse, rather than $N(b_0, \sigma_{00})$. The conditional distribution of $\beta$ is assumed to be unchanged:

$$\beta \mid \alpha \sim \mathrm{MVN}(b, \Sigma)$$

This allows for error in specifying the origin of the Y-axis.

Prior 3

Prior 3 replaces the scale for $Y$ with some other linear scale. $\alpha$ is again given a diffuse distribution and the conditional distribution of $\beta \mid \alpha$ is taken to be

$$\beta \mid \alpha \sim \mathrm{MVN}(\gamma b, \gamma^2 \Sigma)$$

$\gamma$ is also given a diffuse distribution.

Prior 4

This is the same as Prior 3, except it allows for systematic bias in quartile assessments by putting

$$\beta \mid \alpha \sim \mathrm{MVN}(\gamma b, \lambda \Sigma)$$

$\alpha$, $\gamma$ and $\lambda$ are given diffuse distributions.

Cross-validation and scoring

• The usefulness of a prior distribution can be objectively examined by using cross-validation and a scoring rule.
• For the cross-validation, the data for a species were divided into four sets. Each set in turn was omitted and the remaining sets were used to form prediction equations.
• Prediction equations were applied to the omitted set and the squared error loss determined:

$$\text{Squared error loss} = \sum_k (\hat{\mu}_k - w_k)^2$$

  where the summation is over all sites in the omitted (validation) set, $\hat{\mu}_k$ is the probability of presence given by the prediction equation, and $w_k$ is a 0/1 dummy variable indicating absence/presence.
• This defines a proper scoring rule.
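
The cross-validation and scoring steps above can be sketched in Python as follows. This is a minimal illustration, not the code used for the results in this talk: the function names, the rule that assigns site i to set i mod 4, and the stand-in `fit_base_rate` "prediction equation" (which simply predicts the training presence rate) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def squared_error_loss(predicted: Sequence[float], observed: Sequence[int]) -> float:
    """Sum over validation sites of (predicted probability - 0/1 outcome)^2."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - w) ** 2 for p, w in zip(predicted, observed))


def fit_base_rate(train_sites: Sequence, train_outcomes: Sequence[int]) -> Callable:
    """Hypothetical stand-in for a prediction equation.

    It ignores the site attributes and predicts the training presence rate.
    """
    rate = sum(train_outcomes) / len(train_outcomes)
    return lambda site: rate


def four_fold_loss(sites: Sequence, outcomes: Sequence[int],
                   fit: Callable, n_sets: int = 4) -> List[float]:
    """Omit each set in turn, fit on the remaining sets, score the omitted set."""
    losses = []
    for k in range(n_sets):
        # Assumed split for illustration: site i belongs to set i mod n_sets.
        train = [i for i in range(len(sites)) if i % n_sets != k]
        valid = [i for i in range(len(sites)) if i % n_sets == k]
        predict = fit([sites[i] for i in train], [outcomes[i] for i in train])
        preds = [predict(sites[i]) for i in valid]
        losses.append(squared_error_loss(preds, [outcomes[i] for i in valid]))
    return losses


# Toy data: 8 sites, presence recorded at the first and last site only.
toy_sites = list(range(8))
toy_outcomes = [1, 0, 0, 0, 0, 0, 0, 1]
toy_losses = four_fold_loss(toy_sites, toy_outcomes, fit_base_rate)
print(toy_losses, sum(toy_losses))
```

Replacing `fit_base_rate` with a function that fits a prior-plus-data model on the training sets would give one loss per omitted set, and their sum corresponds to a "Total" for that method; lower totals indicate better predictions.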
Results for little bent-wing bat

Method           Set 1    Set 2    Set 3    Set 4    Total
__________________________________________________________
Prior 1           9.57     8.93     8.94     9.30    36.74
Prior 2           9.62     9.03     8.98     9.24    36.87
Prior 3           9.52     8.86     8.92     8.81    36.11
Prior 4           9.73     8.87     8.90     8.62    36.13
Frequentist      11.03     9.72     9.55    10.78    41.07
No data          10.83     9.81     9.92    10.56    41.12
__________________________________________________________
Sample results   11/94    10/94    10/93    11/94    42 in 375

[Figure: four panels, one for each of Prior 1 to Prior 4, each plotting the posterior value under that prior against the prior value, with both axes running from -5.0 to 2.0. The panels for Prior 1 and Prior 2 are labelled $\mathrm{MVN}(b, \Sigma)$, the panel for Prior 3 $\mathrm{MVN}(\gamma b, \gamma^2 \Sigma)$ and the panel for Prior 4 $\mathrm{MVN}(\gamma b, \lambda \Sigma)$.]

                 Little       Common
                 bent-wing    bent-wing                 Powerful    Greater
Method           bat          bat          Frogmouth    owl         glider
___________________________________________________________________________
Prior 1          36.74        12.75        28.76        13.61       43.90
Prior 2          36.87        12.73        28.91        13.60       43.94
Prior 3          36.11        12.41        25.99        13.17       42.35
Prior 4          36.13        12.75        28.61        13.61       43.90
Frequentist      41.07        13.70        30.91        14.38       44.15
No data          41.12        13.66        29.54        15.07       48.81
___________________________________________________________________________
Sample results   42 in 375    13 in 375    31 in 324    14 in 324   53 in 343

Concluding Comments

• The elicitation method described here is able to handle large problems by:
  (a) using interactive graphics
  (b) suggesting values to the expert that might represent his or her opinions.
• It is believed that the use of graphs can improve the quality of the assessed distributions.
• Cross-validation can demonstrate clearly the gain from using prior knowledge, when there is such gain.
• Additional parameters in the prior model can allow limited data to be used more effectively.
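
As a recap of the elicitation step for the intercept, the following minimal Python sketch turns an expert's three assessed quartiles of the probability of presence at the reference point into $b_0$ and $\sigma_{00}$ under the logit link: $b_0$ is the link-transformed median, and $\sigma_{00}$ is the squared ratio of the link-scale interquartile range to that of a standard normal. The function names and the example quartiles are hypothetical, chosen only for illustration; this is not the interactive program used in the talk.

```python
import math
from statistics import NormalDist

# Interquartile range of a standard normal distribution (about 1.349).
NORMAL_IQR = 2.0 * NormalDist().inv_cdf(0.75)


def logit(p: float) -> float:
    """Logit link g[p] = ln(p / (1 - p)) for a probability strictly in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def intercept_prior(m25: float, m50: float, m75: float):
    """Return (b0, sigma00) from assessed lower quartile, median, upper quartile."""
    if not m25 < m50 < m75:
        raise ValueError("quartiles must satisfy m25 < m50 < m75")
    b0 = logit(m50)                                       # prior mean on the link scale
    sigma00 = ((logit(m75) - logit(m25)) / NORMAL_IQR) ** 2  # prior variance
    return b0, sigma00


# Hypothetical assessments: median 0.30, quartiles 0.20 and 0.45.
b0, sigma00 = intercept_prior(0.20, 0.30, 0.45)
print(b0, sigma00)
```

Working on the link scale means that asymmetric quartiles on the probability scale are mapped to an approximately symmetric normal prior for the intercept.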