COMPARING ITERATIVE CHOICE-BASED METHODS IN
THE ELICITATION OF UTILITIES FOR HEALTH STATES
Jose Luis Pinto Prades
Glasgow Caledonian University
Joseluis.pinto@gcu.ac.uk
Ildefonso Mendez
Universidad de Murcia
Graham Loomes
University of Warwick
23-05-2014
ABSTRACT
It is usual practice to elicit preferences for health states with iterative choice-based
methods. The Time Trade-Off and the Standard Gamble use different search
strategies, such as Bisection, Titration (bottom-up, top-down) or "Ping-pong".
There is evidence that different methods produce different results, and this has been
attributed to the presence of psychological biases. This paper shows that different
iterative procedures will produce different results even if there are no
psychological biases; we only have to assume that preferences are stochastic. We assume that
individual preferences are characterized by a certain probability distribution and that
the response of the subject to each question of the iteration process
comes from his/her "true" distribution (no biases), and we compare (using simulation)
the mean of this "true" distribution with the mean elicited by the iterative
procedure. We find that the difference between the "true" and the elicited mean
is higher for Titration than for Bisection. Two strategies seem to
perform better than the rest. One is to randomize the starting point and use
Bisection. The other is to adapt the starting point to the severity of the
health state, proceeding in stages: first, estimate the
range where most of the probability mass is located; next, adapt the starting
point to that range. In this case, Titration performs better than
Bisection.
1. Introduction
Several preference elicitation methods belong to the category of “matching”.
Matching methods ask subjects to establish indifference between two options. One
of the options has all attributes predetermined at some fixed level. In the other
option all levels of attributes except one are predetermined. Subjects are asked to
state the level that is missing such that the two options have the same utility level.
However, the usual procedure for establishing indifference is to ask the subject several
iterative choice questions, where each question depends on the response to the
previous question. The idea is that those questions make it possible to estimate an
interval where the indifference point is located. The higher the number of
questions, the narrower the interval. Finally, once the interval is narrow enough,
subjects are asked an open question about where in this interval the indifference
point is located, or the researcher assumes that the middle point of the interval is the
indifference point. We will call these methods "Iterative Choice-Based" (ICB). In
health economics they are widely used to elicit preferences for health states with
methods such as Time Trade-Off (TTO) or Standard Gamble (SG).
These methods are very efficient since they get a lot of information about the
structure of preferences at the individual level with a reduced number of
questions. Unfortunately, ICB methods are not without problems. For example,
different iteration mechanisms produce different results. One example is Lenert et
al (1998). They compared several iteration mechanisms and found that
utilities changed depending on the iteration procedure. More specifically, they
compared "top-down titration" vs "ping-pong". The first method asks subjects if
they would accept some risk of death (in the Standard Gamble) in order to avoid a
certain illness. If they accepted, risks were increased (1%, 2%, 3%...) until
indifference was reached. In the “ping-pong” the subject was offered the following
sequence of risks for SG: 1%, 99%, 2%, 98%, 10%, 90%, 80%, 20%, 70%, 30%,
60%, 40%, 50%. They found that titration consistently produced higher utilities.
Brazier and Dolan (2005) also compared Ping-pong and Titration with similar
results. Hammerschmidt et al (2004) also provide evidence that different iterative
procedures lead to different utilities. In summary, there is evidence that different
ways of iterating produce different values.
Researchers have reacted to these findings in different ways.
1. In some cases, they use a unique iteration method. This may not be a problem
if the objective of the paper is to test hypotheses that require holding
everything constant (including the iteration method). However, it is more
worrying if the objective is to compare utilities produced by different
methods.
2. In other cases (Bleichrodt et al, 2005), they use procedures that try to
increase the probability that each choice in the iteration process is treated
as independently as possible from the rest. In some respects, they try to elicit
preferences as if each choice were the first choice in the sequence, assuming
this is the most unbiased choice. This is based on the idea that the problem
of iterative methods comes from anchoring effects. They assume that the
subject will reveal her "true" preferences if each step in the iterative choice
procedure is independent from the previous one.
3. Another potential strategy is to randomize the starting point. Apparently,
the theory behind this is that randomization averages out errors and biases.
This implicitly assumes that anchoring effects generate a distribution of
errors with zero mean.
4. Finally, some researchers (Stein et al, 2009) claim that some search
procedures are less biased than others, and they use the methods they
believe to be less biased.
All these approaches have one thing in common. They assume that different search
methods produce different results because there is a psychological bias that
generates this phenomenon. This bias usually stems from the fact that people do not
treat each choice independently from previous choices in the iteration process.
This leads some authors to conclude that "utility values are heavily influenced by,
if not created during, the process of elicitation" (Lenert et al 1998). That is, given
those results, they question the very existence of the concept of preference.
The objective of this paper is to show that the fact that different ICB methods produce
different results can be explained in a less radical way. We can keep the
assumption that preferences do exist but are not deterministic; they are
stochastic. Different ICB methods will produce different results if preferences are stochastic.
We will show that even in the absence of any psychological bias, iterative methods
might produce different results depending on the iterative procedure used. This
does not conflict with the interpretation of these effects as biases. The two effects
can go together. What we will show is that we do not need to resort to psychology
in order to explain why utilities can be different depending on the iterative
procedure used. The only assumption we need is that preferences are stochastic.
This is hardly an assumption, since there is plenty of evidence that there is an
element of variability in preferences for health states. We are not aware of any
test-retest study that found a correlation coefficient of 1.0. For example, Feeny et al
(2004) report Intraclass Correlation Coefficients (ICC) ranging from 0.49 to 0.62.
Although they report that mean aggregate utilities are stable over time, the ICC
reveals that there is variability at the individual level. Very often this result
(variability at the individual level and stability at the aggregate level) does not
worry researchers too much, since they are interested in the preferences of
groups of subjects. They interpret the variability at the individual level as the
consequence of random error. We will show that this element of variability may be
relevant in order to explain why different ICB methods may produce different
utilities.
The structure of the paper is as follows. First, we show, with the help of some
examples, the intuition behind our hypotheses. Next, we present the
methodology that we use to derive our predictions. We then show our results, and a
discussion closes the paper.
2. ITERATIVE CHOICES: SOME EXAMPLES.
We will start with some examples to show the intuition behind our argument.
Assume a subject who has to evaluate a health state (better than death) with the
TTO. Assume that the duration in bad health is 10 years and we want to know the
duration in full health that the subject considers equivalent to 10 years in bad
health. Researchers have to make several decisions: a) What level of
precision will be used? b) What is the starting point? c) How many iterations
(i.e., iterative choices) are going to be used? d) Which search procedure is going
to be used? Assume that the researcher adopts the following strategy:
1. Decides to use the year as the maximum level of precision in the iteration
process; that is, the researcher wants to know the values of X and Y such that (10
years, bad health) < (X years, full health) and (10 years, bad health) > (Y years, full
health), with X − Y = 1.
2. Decides to treat the middle point between X and Y as the true utility,
so utilities will be 0.95, 0.85 and so on.
3. Decides to randomise the starting point. That is, 10% of the sample will
start with the choice (10 years, bad health; death) vs (1 year, full health; death),
another 10% will start with the choice (10 years, bad health; death) vs (2 years, full
health; death) and so on.
4. Decides to use Bisection as the "search procedure". That is, if subjects
prefer (10 years, bad health; death) to (1 year, full health; death), the next question
will be (10 years, bad health; death) vs (6 years, full health; death) and so on.
Assume that the subject has to evaluate a health state that is not too severe.
Since her preferences are stochastic, we will assume that her responses are
generated by some probability distribution (see column "True distribution" in
Table 1). This distribution implies that, if the subject were asked repeatedly to
choose between (9 years, FH) and (10 years, bad health), in 60% of the cases she
would choose (10 years, bad health); in this case, the utility would be 0.95.
However, in 40% of the cases, she would choose (9 years, FH). Assume that the
first question, generated randomly, is the choice between (8 years, FH) and (10
years, bad health). In 80% of the cases the subject will choose (10 years, bad
health), implying that U > 0.8, and in 20% of the cases (8 years, FH). This is the
essence of the concept of "stochastic" preferences. The subject does not always
give the same response, but this does not imply that she is irrational or that her
preferences do not exist.
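To make this concrete, here is a minimal Python sketch of one such stochastic
choice; the 60/40 split is the one from the example, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

def chooses_bad_health(p_bad=0.60):
    """One independent stochastic choice from the example: offered (9 years,
    FH) vs (10 years, bad health), the subject picks bad health 60% of the
    time."""
    return rng.random() < p_bad

# Repeating the question many times recovers the underlying probability.
draws = [chooses_bad_health() for _ in range(10_000)]
print(sum(draws) / len(draws))  # close to 0.60
```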
Why is it that subjects give different responses in different moments of time? We
do not really know. We can simply assume that subjects can be in different “states
of mind" on different occasions. In some cases, she perceives the health state as
being very mild and she is not willing to give up 1 year of life to improve quality of
life. In other cases, she perceives that health state as a little more severe
and she is willing to give up more than one year. This is what we observe in surveys:
variability at the individual level. However, most probably the subject can give a
rational explanation of both decisions. Maybe in some cases she focuses on one
aspect of her life and in other cases she focuses on some other aspect. One
implication of this perspective on preferences is that giving different
responses to the same question does not imply that she has made a mistake.
Stochastic preferences imply that all "states of mind" are equally valid. In
principle, there are no mistakes. The concept of error only applies to cases such
as the subject pressing the key for option A while she prefers B. Apart from
that, we can talk about preferences as a distribution, with some mean, median or
other moments, but not exactly about wrong and right responses. Not even
about "contradictions". That is, if health state A is better than B, the fact that the
subject may give an answer such that U(B) > U(A) does not have to be a mistake. If
distributions overlap, this will happen with some probability as long as
subjects do not remember (or link) the two responses. However, in this paper we
will assume that the objective of the elicitation procedure is to estimate the mean.
Once we have explained what we understand by stochastic preferences, we will
show how they affect ICB methods. Let us assume that the first choice is (10Y, bad
health) vs (9 years full health). We will show that, in this case, the expected mean
is 0.893. If we start with (10Y, bad health) vs (1 year full health) the expected
mean is 0.847. The reason is that the probability distributions generated by
starting from 9 and by starting from 1 are different, even if every single answer
comes from the "true" probability distribution, with mean 0.872. Let us see one
example. We will show that the probability of identifying the indifference interval
between 8-9 years in full health is 28.51% and not 20% (the true probability) if
we start from 9. Why is that? If she starts with (10Y, bad health) vs (9 years full
health) she will end in the interval 8-9 years in full health if:
1. In the choice between (10Y, bad health) vs (9 years full health) she prefers (9
years full health). The probability of this response is 40%. The iterative process
(assuming Bisection) will present the choice between (10Y, bad health) vs (5
years full health).
2. In the choice between (10Y, bad health) vs (5 years full health) she prefers
(10Y, bad health). The probability of this response is 99%. The iterative process
will present the choice between (10Y, bad health) vs (7 years full health).
3. In the choice (10Y, bad health) vs (7 years full health) she prefers (10Y, bad
health). The probability of this response is 90%. The iterative process will then
proceed to show her the choice between (10Y, bad health) vs (8 years full
health).
4. In the choice (10Y, bad health) vs (8 years full health) she prefers (10Y, bad
health). The probability of this response is 80%.
In summary, if she starts with the choice (10Y, bad health) vs (9 years full
health) she will end up in the interval 8-9 years in full health if all of the above
happens, and the probability is 0.4 × 0.99 × 0.9 × 0.8 = 0.285. However, this is not the
true probability (which is 20%). Each time we start from (10Y, bad health) vs (9
years full health) there is some chance that the subject will end up in a
different interval, given the stochastic nature of her preferences. If we were able
to repeat these questions a large number of times (and the subject suffered
some kind of instantaneous amnesia) she would end up in the different intervals
of Table 1 with the frequencies that we present in the last two columns of Table 1.
This would produce a mean of 0.893 if we had started from (9Y, FH) and of 0.847
if we had started from (1Y, FH).
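The same logic can be run over the whole bisection tree. In the sketch below,
q[x] is the probability that the subject prefers (10 years, bad health) over (x
years, full health). Only q[5], q[7], q[8] and q[9] are given in the text; the
remaining entries are illustrative assumptions added so that the full
enumeration runs (they do not affect the 28.51% figure, whose path only uses the
four given values):

```python
# q[x]: probability that the subject prefers (10 years, bad health) over
# (x years, full health), i.e. that her momentary indifference point
# exceeds x years.
q = {9: 0.60, 8: 0.80, 7: 0.90, 5: 0.99,          # given in the text
     6: 0.95, 4: 0.99, 3: 0.99, 2: 0.99, 1: 0.99}  # assumed

def interval_probs(lo=0, hi=10, x=9, mass=1.0, out=None):
    """Enumerate every Bisection path from the first question at x and
    accumulate the probability of ending in each unit interval (lo, lo+1)."""
    if out is None:
        out = {}
    for prefers_bad, pr in ((True, q[x]), (False, 1 - q[x])):
        nlo, nhi = (x, hi) if prefers_bad else (lo, x)
        if nhi - nlo == 1:
            out[nlo] = out.get(nlo, 0.0) + mass * pr
        else:
            interval_probs(nlo, nhi, (nlo + nhi + 1) // 2, mass * pr, out)
    return out

probs = interval_probs()   # start from (9 years full health)
print(round(probs[8], 4))  # 0.2851, the 28.51% derived in the text
```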
What is important to observe is that utilities change with the starting point
even if we assume that there are no psychological biases. Each response in the
iteration process comes from the "true" distribution. Moreover, we observe a
certain tendency to obtain higher utilities if we start from above (9 years full
health) than if we start from below (1 year full health). This could be
interpreted as evidence of anchoring effects, while we see that this is not the
case. Of course, this does not exclude that anchoring has an effect. What we say
is that we do not need any sort of psychological bias (e.g., anchoring) to
observe utilities changing when the starting point changes.
This example highlights several issues:
1. The “solution” to the problem generated by iterative methods is not to use
methods that encourage subjects to treat each choice as independently as
possible from the rest of choices. In our example, the subject has treated
each choice independently from the rest.
2. Randomization does not have to be the right solution. Randomization is the
right approach only when the "mean of the means" estimated from each
starting point coincides with the "true mean". This would only happen in
very specific cases. Moreover, randomizing has a potential problem: it
increases sampling error since it reduces the number of subjects starting
from each point. The potential benefit of having observations all along
the different starting points can be offset by the increase in error
generated by the smaller sample size at each starting point.
3. In order to find “solutions” it is important to have observations of
distributions at the individual level. This is an important point that we
would like to stress. At the aggregate level we observe that different
subjects give different responses. This variability may come from two
sources. One is that subjects have heterogeneous but deterministic
preferences. Another is that preferences are stochastic at the individual
level. If subjects have heterogeneous but deterministic preferences
iterative methods will not be problematic. For example, assume that we
observe that in our sample 50% of subjects state that the utility of a health
state is in the interval 0.6-0.7 and the other 50% that the utility is in the
interval 0.9-1.00. This could be because our sample is split in two different
groups with very different views about health. However, one single group
with homogeneous but stochastic preferences could generate this
response pattern as well. In the first case, we would always obtain the
same mean irrespective of the starting point. This would not happen in
the second case.
In this paper we will show that the best strategy with iterative methods
depends on the characteristics of the underlying
(stochastic) utility functions. However, since there is no empirical evidence
about what those functions may look like (at the individual level), we will make
some assumptions that we think are reasonable in order to compare iterative
methods. The method that we will use to approach this problem is simulation.
3. METHODS.
Variability in the preferences of respondent i is measured by means of a
probability function that assigns a value pij to each of the j = 1, …, J intervals into
which the range of variation of X is partitioned, where pij ≥ 0 and Σj pij = 1 for all i.
Those intervals can be life years in full health in the case of TTO or probabilities in
the case of SG; the distinction is not relevant in our case. The average utility of respondent i,
μi, is a weighted average of his utility evaluated at the midpoint of each interval.
The parameter of interest is μ, the average utility across all the respondents. The
researcher observes the individual choices and his goal is to elicit μ. We assume
the researcher deals with this identification problem using ICB methods.
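In code, the respondent-level mean is a dot product; a minimal sketch, where the
mapping from intervals to utilities (u_mid) depends on the instrument and is an
assumption left to the caller:

```python
import numpy as np

def mu_i(p, u_mid):
    """Mean utility of respondent i: the probability-weighted utility
    evaluated at the midpoint of each of the J intervals, as defined above.
    u_mid[j] is the utility at interval j's midpoint."""
    p, u_mid = np.asarray(p), np.asarray(u_mid)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
    return float(p @ u_mid)
```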
We characterize the identification problem as follows. We assume that X ranges
from 0 to 10, J equals 10 intervals of equal amplitude, and we consider nine starting
points corresponding to values 1 to 9 of X. The objective is to estimate in which of
the 10 intervals generated the indifference point is located. We assume that
indifference is in the middle of the interval and we want to reach this interval
through a series of iterative choices. The researcher has to make some decisions,
such as:
1. What is the starting point?
2. Which search procedure are we going to use?
3. Are we going to set a limit on the number of iterations?
In this paper we will assume that there is no limit on the number of iterations;
that is, we will ask as many questions as necessary to place the subject in one of
the 10 intervals that we have decided to use. For this reason we will concentrate
on the first two issues. More specifically, we assume that the researcher has the
following options for the starting point and the search procedure:
1. Starting point. Options:
a. Randomize. This implies that we use each of the 9 potential
starting points equally.
b. Use a fixed point. Three cases have been used in the literature:
i. The upper end, e.g. (10 years bad health) vs (9 years full
health)
ii. The lower end, e.g. (10 years bad health) vs (1 year full
health)
iii. The middle point, e.g. (10 years bad health) vs (5 years full
health)
2. Which search procedure? In this paper we will simulate the effect of
three search methods:
a. Bisection: this method is characterized by halving the
indifference interval in two parts of equal size. For example, if one subject
prefers (10 years bad health) to (4 years in full health), the next choice would
be between (10 years bad health) and (7 years in full health) since (7 years in
full health) is the middle point between 4 and 10.
b. Titration: this method usually starts from one end or another of
the distribution and moves up or down in units until the subject changes her
preferences from one option to the other. The literature refers to top-down
titration when the subject starts with high values (e.g. 9 years full health) and
moves down (8, 7…). When the subject starts with low values (e.g. 1 year in full
health) and moves up (2, 3…) the literature refers to bottom-up titration. In
this paper we will also use two other options, namely, randomizing and starting
from the middle.
c. Ping-pong: in this method subjects start from one of the two
ends and move to the other end, narrowing down the interval where
indifference is reached. For example, the researcher could start from one end,
(10 years bad health) vs (9 years full health) [the Ping], or from the other end,
(1 year full health) [the Pong].
While this would theoretically lead to 12 (4 starting points × 3 search procedures)
different iteration procedures, only 10 are feasible since, by definition, Ping-pong
can only start from one of the two ends. One interesting case is Titration+Middle since this is
basically the method used by the EuroQol group.
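The three search procedures can be transcribed compactly. The sketch below
assumes, as in the text, ten unit intervals on a 0-10 scale and one independent
draw from the respondent's distribution p per question; the exact midpoint
rounding in Bisection and the inward-alternating probes in Ping-pong are our
implementation assumptions:

```python
import numpy as np

def prefers_bad_health(x, p, rng):
    """One independent stochastic choice: (10 years bad health) vs (x years
    full health). p is a NumPy array where p[j] is the probability that the
    momentary indifference point falls in interval (j, j+1); the subject
    sticks with bad health whenever that point exceeds x."""
    return rng.random() < p[x:].sum()

def bisection(start, p, rng):
    """Halve the candidate interval until it has unit width."""
    lo, hi, x = 0, 10, start
    while True:
        if prefers_bad_health(x, p, rng):
            lo = x                      # indifference point is above x
        else:
            hi = x                      # indifference point is below x
        if hi - lo == 1:
            return lo                   # elicited interval is (lo, lo + 1)
        x = (lo + hi + 1) // 2          # midpoint, rounded up to the grid

def titration(start, p, rng):
    """Move one unit at a time; the first answer fixes the direction."""
    x = start
    if prefers_bad_health(x, p, rng):   # indifference above x: move up
        while x < 9 and prefers_bad_health(x + 1, p, rng):
            x += 1
        return x                        # elicited interval is (x, x + 1)
    while x > 1 and not prefers_bad_health(x - 1, p, rng):
        x -= 1                          # indifference below x: move down
    return x - 1                        # elicited interval is (x - 1, x)

def ping_pong(start_high, p, rng):
    """Alternate probes just inside the current upper and lower bounds."""
    lo, hi, from_top = 0, 10, start_high
    while hi - lo > 1:
        x = hi - 1 if from_top else lo + 1
        if prefers_bad_health(x, p, rng):
            lo = x
        else:
            hi = x
        from_top = not from_top
    return lo

# Usage: interval probabilities consistent with the Section 2 example
# (the split below 7 years in full health is assumed).
rng = np.random.default_rng(0)
p = np.array([0.01, 0, 0, 0, 0, 0.04, 0.05, 0.10, 0.20, 0.60])
j = bisection(start=9, p=p, rng=rng)    # lower bound of elicited interval
utility = (j + 0.5) / 10                # utility at the interval midpoint
```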
To compare all these iterative procedures we randomly generate 1000
distributions of a given type and we extract a sample of size 100 from each
distribution. This number (n=100) was arbitrary, but we think it is a generous one
since it refers to the number of subjects who evaluate one health state at a time.
There are not many examples of surveys where more than 100 subjects have
evaluated more than one health state. We also analyzed the effect of increasing the
number of observations to 500 and 1000, but we do not present the results here.
The main assumption that we make in our simulations is about the shape of the
stochastic utility functions. We make the following assumptions:
1. There are seven types of distributions according to their degree of
skewness, namely severely left skewed (LL), left skewed (L), slightly left
skewed (SL), normal (N), slightly right skewed (SR), right skewed (R)
and severely right skewed (RR).
2. These distributions are centered in the following intervals of x: 0-1
[LL], 1-2 [L], 2-3 [SL], 4-6 [N], 7-8 [SR], 8-9 [R] and 9-10 [RR].
In health economics each interval can be related to the severity of the health
state. In the TTO we can think of X as the number of life years people are willing to
give up, and in the SG as the risk of death they are willing to accept. In this case, the
LL distribution would characterize very mild health states and the RR distribution
the severe health states. So LL distributions would imply high utility values or,
alternatively, low values of X. The opposite holds for RR distributions.
The justification of these assumptions is related to the MacCrimmon and
Smith (1986) and Butler and Loomes (2007) view of imprecise preferences. They
assumed that variability is proportional to the range of potential responses. The
"model" of preferences that supports these functions is as follows. We assume that
each subject has some sort of core preferences, that is, a band of responses (or
even a unique value) on which she would settle if asked to think about the
question repeatedly, with opportunities for deliberation and reflection. So if we
were to represent the probability distribution of her responses, we might expect
the modal response to lie somewhere in this core, while allowing for the
distribution to stretch away from this area, within the bounds of the response
interval. Depending on the location of the core relative to the upper and lower
bounds of the range, we might expect such a distribution to be more or less
skewed. Specifically, where the core lies near the bound, we would expect that the
skewness would be more pronounced, because of end-of-scale effects. We would
expect positive skew when the core is close to the lower bound of the response
interval and negative skew when it is close to the upper bound. In practice,
assume that the health state to be evaluated is mild and the median utility is about
0.9. This implies that the distance from the median to the upper part of the
distribution (0.999) is much smaller than the distance to the lower part of the scale
(0.000). The probability that the subject gives an answer lower than 0.9 is
higher than the probability that the subject gives an answer between 0.9 and 1.
That is, subjects may know, for example, that the health state is not too severe, so
they should not give up too many life years, say 1 or 2 at the most. That is where
the core could be located. However, from time to time they can be in a different
“state of mind” and give a value higher than 2. This will produce a skewed
distribution with the probability mass located between 1 and 2 but with some
responses higher than 2.
We generated 1000 distributions of each type centered in the intervals
already mentioned. Each distribution was generated by randomly extracting, for
each interval of X, a number from a uniform distribution defined on different
intervals depending on the type of distribution that we aimed to generate. The sum
of the probabilities was normalized to one. This way of proceeding ensures that the
validity of our results does not depend on the particular distributions considered.
The severely skewed distributions are characterized as lognormal distributions
whose means are located in the first or last intervals. The remaining types of
distributions are not necessarily well-shaped distributions.
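A sketch of this generation step. The exact uniform ranges per distribution type
are not reported in the text, so the decay profile around the core interval used
here is an illustrative assumption:

```python
import numpy as np

def random_distribution(core, rng, decay=2.0):
    """Generate one stochastic-preference distribution over the 10 intervals:
    draw, for each interval, a number from a uniform distribution whose
    range depends on the distance to the 'core' interval, then normalize so
    the probabilities sum to one."""
    j = np.arange(10)
    upper = np.exp(-np.abs(j - core) / decay)  # upper bound of each draw
    p = rng.uniform(0.0, upper)
    return p / p.sum()

rng = np.random.default_rng(0)
# e.g., 1000 draws of an 'LL'-type distribution centered on interval 0-1
LL = [random_distribution(core=0, rng=rng) for _ in range(1000)]
```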
In order to compare the effect of the different iterative procedures on the
estimated utilities, we assume that: a) the main moment of the distribution is the mean; b)
the researcher only has one opportunity to observe the mean, that is, she can run
the elicitation only once with each subject and observes only one aggregate distribution.
Although we have estimated 1000 means for each iterative procedure, the researcher
will see only one of those results in practice. For this reason, we will
compare iterative procedures according to:
a) The probability of making a mistake, that is, of not estimating the true
mean.
b) The size of the error, that is, estimated mean – true mean.
Since the simulation will produce 1000 means, it is highly improbable that we
produce exactly the same mean as the true one. However, in many cases the
differences can be negligible. We will consider that there is an error threshold
below which errors are irrelevant. We have set this threshold at 0.03, the size of
error to be considered "significant", since it has been suggested (Wyrwich et al
2005) that such differences have clinical significance. For this reason, our results
will only focus on differences between the true and the estimated mean that are at
least 0.03. Iterative methods will be compared according to the probability of
making a mistake equal to or larger than 0.03 and the size of the error. One option is
to estimate the mean of all errors (equal to or bigger than 0.03); this is what we will
call the Mean Absolute Error (MAE). However, minimizing the MAE may not be the
main objective of a researcher since the MAE does not reflect the variability of errors.
For this reason, we will present the error corresponding to three percentiles: 5%,
50% and 95%. In this way, the range of errors can also be observed.
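A small helper summarizing these criteria; the 0.03 threshold and the 5/50/95
percentiles are the ones described above, while the signed-vs-absolute convention
for the percentiles is our assumption:

```python
import numpy as np

def error_summary(estimated_means, true_mean, threshold=0.03):
    """Probability of an error of at least the 0.03 threshold, the Mean
    Absolute Error over those errors, and the 5%, 50% and 95% percentiles
    of the (signed) errors."""
    err = np.asarray(estimated_means) - true_mean
    big = np.abs(err) >= threshold
    return {
        "P(|error| >= 0.03)": big.mean(),
        "MAE": np.abs(err[big]).mean() if big.any() else 0.0,
        "percentiles (5/50/95)": np.percentile(err, [5, 50, 95]),
    }
```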
4. RESULTS
Table 1 shows the probability that the estimated mean is above or below the true
mean by more than 0.03 points (on a utility scale from 0 to 1). We can see that:
1. Bisection tends to perform better than Titration. The combination that clearly
outperforms all the rest is Bisection combined with Randomization.
2. Titration only produces good results when the starting point is close to the
interval where the probability mass is located (the core). That is, the top-down
strategy only performs well for mild health states (Severely Left
Skewed). The bottom-up strategy only performs well for severe health states
(Severely Right Skewed). Starting from the middle of the distribution
only performs well for intermediate health states.
Randomization hardly improves Titration.
3. The pattern with Ping-pong is similar to Titration. It only performs well when
the starting point coincides with the core; that is, it works well for mild health
states when we start from above and for severe health states when we start
from below.
4. The method used by the EuroQol group generates values that are too low for
mild health states and too high for severe health states.
The implication of these results is that, in order to obtain good results, it is important:
a) to use Bisection and b) to choose the starting point close to the core. It seems that all
methods work well when the starting point is close to the true mean.
It is interesting to observe that Randomization in itself does not generate unbiased
estimates. The main assumption behind the use of
Randomization is that errors cancel each other out. However, Randomization
includes several starting points that are far from the core, and these
produce some means that are far from the true mean. For example, assume we use
Randomization coupled with Titration and we want to estimate the utility of a mild
health state. In that case, Randomization will include a lot of questions that are far
from the place where most of the probability mass is located, and errors will not
cancel each other out. Errors only cancel each other out when Randomization is used
with Titration for intermediate health states; this is the only such case.
Bisection performs better than Titration because it does not ask too many
questions on the "wrong" side of the distribution. Titration, if it starts from the
wrong place, has to ask a lot of questions until it gets to the core. This "inflates"
the part of the distribution where the questions are asked. That is, assume that a
health state is mild. Bisection will quickly chop off the tail to the right of the
distribution, while Titration will keep asking questions in that part of the
distribution. This increases the chances that subjects respond affirmatively
to a question where they are asked to trade off a large number of life years in
order to avoid a mild health state. Although the chances that the subject will accept
giving up a large number of life years are small, if the iteration method asks a lot
of questions in this range it maximizes the chances of getting affirmative
responses to "extreme" questions. Something similar seems to happen with Ping-pong.
If something like this lies behind our results, one consequence could be that a
process that adjusts the starting point to the severity of the health state could work
quite well. What happens if we adapt the starting point to the severity of the health
state? Can we improve Titration or Bisection using this strategy? We will call this
strategy Adaptive Starting Point (ASP). Under this strategy we start from (8
years full health) vs (10 years bad health) in the Severely Left Skewed distribution
and then proceed by reducing the years in full health by 1 for the rest of the
distributions. That is, we start from (7 years, full health) in the case of the Left Skewed
distribution, from (6 years, full health) in the case of the Slightly Left Skewed, and
so on until we reach the Severely Right Skewed, where the first choice would be (2
years in full health) vs (10 years bad health). The results for ASP can be seen in
Table 2. We also include in this table Randomization+Bisection since it is the
strategy that performs best in Table 1.
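The ASP rule is just a fixed map from the presumed distribution type to the first
question; a one-line transcription (titration here refers to the sketch in
Section 3):

```python
# Adaptive Starting Point: the first question (years in full health) by
# presumed distribution type, exactly as listed above.
ASP_START = {"LL": 8, "L": 7, "SL": 6, "N": 5, "SR": 4, "R": 3, "RR": 2}

# e.g., Titration+ASP for a health state believed to be mild ('LL'),
# using the titration() sketch from Section 3:
# j = titration(ASP_START["LL"], p, rng)
```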
Surprisingly (or maybe not), the best method involves Titration. However, this fits
perfectly well with the explanation we have provided above. Titration performs
better than Bisection under the ASP strategy because all questions are asked in the
range where most of the probability mass is located. Bisection is less efficient
since, in order to find the place where the core is located, it needs to ask questions
that are far from it. Another way of illustrating this point is to analyze the amount
of errors after a certain number of iterations. Table 3 presents the percentage of
errors larger than 0.03 for Bisection+Randomization and Titration+ASP. It can be
seen that Titration+ASP reduces the number of errors very quickly. In fact, the
percentages in Table 4 after three iterations are very similar to the percentages in
Table 3 for Titration+ASP. There is very little to be gained after three iterations.
This does not happen with Bisection+Randomization, which needs more iterations to
find the interval where indifference is located. This is a logical implication of
asking more questions outside the main range.
We now move to the second characteristic we will use to evaluate iteration
methods, that is, how big the mistake is. This information is provided in Table 4.
We present the errors that correspond to the 5%, 50% and 95% percentiles. In this
way, the whole range of errors can be easily perceived. Methods based on Titration
produce the largest errors. The Titration method that produces the best results is
Titration+ASP. Randomization also gives rise to large errors (>0.1) when it is
combined with Titration. We can also see the size of errors using the Mean
Absolute Error, which is the mean of all errors in absolute value. This is shown in
Table 5. The MAE cannot be interpreted as the "most common error" since, as we
know, the same mean can arise from different distributions. We suggest that for
the researcher it is even more important to be sure that the chances of making a
big error are small. For this reason, Table 5 should be read in conjunction with
Table 4.
5. DISCUSSION
We have shown that different iterative procedures produce different results even
if the respondent always responds truthfully (from a stochastic point of view).
Means tend to be correlated with the starting point. However, this effect is not
produced by what has traditionally been considered the "starting point bias". There
seems to be a logic to our results, namely, that the estimated means will be closer to the
true mean if most of the questions of the iterative procedure are asked in the area
where most of the probability mass is located. This is the reason why Titration
performs so badly. However, we have also seen that if the starting point is far from
the mean, results are not going to be good even with Bisection. In any case,
Bisection usually performs better than Titration.
There are two ways of establishing the starting point that perform fairly well, namely,
randomization and "adaptive starting". Randomization is the simplest
to implement since it does not require any previous information. If we want to use
"adaptive starting" we need some hypothesis about the severity of the health state;
in order to have this information some piloting is necessary.
Two caveats apply here. One is that
randomization alone does not guarantee the best results. The second is that
randomization may not perform so well if we change the shape of the probability
distributions. That is, one of the reasons that randomization works well, coupled
with Bisection, is that we did not place a limit on the number of questions, so the
search procedure tends to find the main probability mass sooner or later.
However, it is clear that Randomization works through a process of "error
compensation", and it involves asking questions that we know are going to give rise
to biased estimates. Bisection does not work well when we start from the wrong
place; we simply expect that errors will cancel each other out. This is not the philosophy
behind "adaptive starting". In this case we are always asking questions in the right
part of the distribution. That is why it performs so well coupled with Titration: it
starts from the right place and Titration asks questions in the right part of the
distribution. In summary, it seems that the two main options are
Bisection+Randomization or ASP+Titration/Bisection. In one case
(Randomization), we hope that the biases introduced by the questions
asked from a "wrong starting point" will be compensated by the biases
introduced by a "wrong starting point" at the other end of the scale. In the other
case (ASP), we hope that the starting point is the right one. If it is, the chances of
making an important mistake are small with either Bisection or Titration.
We understand that some researchers may be uneasy about using the ASP
strategy. It has traditionally been assumed that using different starting points
introduces biases into the elicitation procedure. This is because these effects have
been attributed to anchoring. We show this does not have to be the case. We have
shown that using the same starting point may also produce biased estimates. Of
course, one cannot use ASP without any previous information about where the
core is. It requires piloting. In some respects, the ASP strategy is similar to the
strategy used to establish the bid values in Dichotomous Choice Contingent
Valuation (DCCV) (Kanninen, 1993; Hanemann, Loomis, & Kanninen,
1991). In order to ask the right questions (the right bids) one needs to do some
piloting using open-ended WTP. The problem with those methods is that
sometimes the researcher may feel that, once she has enough information to set up
the bids correctly, she almost already knows the final answer. However, the
message of ASP is that it can be a good idea to proceed in stages. First, do some
piloting to see where utilities are placed. Second, adjust the starting point to the
previous information. It is a kind of Bayesian way of proceeding.
Finally, one could think that, if iterative procedures have all these problems, it is
better to use non-iterative methods. To some extent, this is true. However, we
should make several caveats in relation to non-iterative methods. First, they still
need piloting to know where to put the "bids". Asking many questions in areas of
the response scale where there is little movement is very inefficient. Second, it is
complicated to elicit individual preferences with non-iterative methods for two
reasons: a) more questions are needed per subject; b) we have to deal with errors
(contradictions) in the process. Third, we can use non-iterative methods to obtain
preferences at the aggregate level as in Double Bound Dichotomous Choice
(DBDC). However, the evidence from DBDC is that those methods may
have problems of anchoring (Kanninen, 1995) when more than one
dichotomous question is used. In health economics it would be unfeasible to elicit
the utility of a large set of health states using DBDC. Fourth, the evidence
from DBDC is that, when we elicit preferences only at the aggregate level,
the shape of the distribution function that we assume generates these responses is
extremely important, and small changes in the assumptions about this function can
produce important changes in the results (Alberini, 2005). Fifth, these problems
are enhanced by the fact that preferences are heterogeneous while most of these
techniques assume a common distribution function for all subjects. Sixth, they are
not useful for individual decision-making. It is true that there have been advances
in the literature that have mitigated many of these problems of non-iterative
procedures. In spite of that, abandoning iterative methods may have important
problems. Gathering information in stages, using Bayesian methods, can be a good
alternative to non-iterative methods.
REFERENCES
Alberini, A. (2005). What is a life worth? Robustness of VSL values from contingent
valuation surveys. Risk Analysis : An Official Publication of the Society for Risk
Analysis, 25(4), 783-800.
Bleichrodt, H., Doctor, J., & Stolk, E. (2005). A nonparametric elicitation of the
equity-efficiency trade-off in cost-utility analysis. Journal of Health Economics,
24(4), 655-78.
Brazier, J., & Dolan, P. (2005). Evidence of preference construction in a comparison
of variants of the standard gamble method. HEDS Discussion Paper 05/04.
Butler, D. J., & Loomes, G. C. (2007). Imprecision as an account of the preference
reversal phenomenon. American Economic Review, 97(1), 277-297.
Feeny, D., Blanchard, C. M., Mahon, J. L., Bourne, R., Rorabeck, C., Stitt, L., &
Webster-Bogaert, S. (2004). The stability of utility scores: Test-retest reliability
and the interpretation of utility scores in elective total hip arthroplasty. Quality of
Life Research : An International Journal of Quality of Life Aspects of Treatment, Care
and Rehabilitation, 13(1), 15-22.
Hanemann, M., Loomis, J., & Kanninen, B. (1991). Statistical efficiency of
double-bounded dichotomous choice contingent valuation. American Journal of
Agricultural Economics, 73(4), 1255-1263.
Kanninen, B. J. (1993). Optimal experimental design for double-bounded
dichotomous choice contingent valuation. Land Economics, 138-146.
Kanninen, B. J. (1995). Bias in discrete response contingent valuation. Journal
of Environmental Economics and Management, 28(1), 114 – 125.
Lenert, L. A., Cher, D. J., Goldstein, M. K., Bergen, M. R., & Garber, A. (1998). The
effect of search procedures on utility elicitations. Medical Decision Making, 18(1),
76-83.
MacCrimmon, K., & Smith, M. (1986). Imprecise equivalences: Preference
reversals in money and probability. University of British Columbia Working
Paper 1211.
Stein, K., Dyer, M., Milne, R., Round, A., Ratcliffe, J., & Brazier, J. (2009). The
precision of health state valuation by members of the general public using the
standard gamble. Quality of Life Research : An International Journal of Quality of Life
Aspects of Treatment, Care and Rehabilitation, 18(4), 509-1.
Wyrwich, K. W., Bullinger, M., Aaronson, N., Hays, R. D., Patrick, D. L., Symonds,
T., & The Clinical Significance Consensus Meeting Group. (2005). Estimating
clinically significant differences in quality of life outcomes. Quality of Life
Research, 14(2), 285-295.