Unknown Rewards in Finite-Horizon Domains
Colin McMillen and Manuela Veloso
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213 U.S.A.
{mcmillen,veloso}@cs.cmu.edu
Abstract
“Human computation” is a recent approach that extracts information from large numbers of Web users. reCAPTCHA
is a human computation project that improves the process of
digitizing books by getting humans to read words that are difficult for OCR algorithms to read (von Ahn et al., 2008). In
this paper, we address an interesting strategic control problem inspired by the reCAPTCHA project: given a large set
of words to transcribe within a time deadline, how can we
choose the difficulty level such that we maximize the probability of successfully transcribing a document on time? Our
approach is inspired by previous work on timed, zero-sum
games, as we face an analogous timed policy decision on the
choice of words to present to users. However, our Web-based
word transcribing domain is particularly challenging as the
reward of the actions is not known; i.e., we do not know whether the spelling provided by a human is actually correct.
We contribute an approach to solve this problem by checking
a small fraction of the answers at execution time, obtaining an
estimate of the cumulative reward. We present experimental
results showing how the number of samples and time between
samples affect the probability of success. We also investigate
the choice of aggressive or conservative actions with regard to
the bounds produced by sampling. We successfully apply our
algorithm to real data gathered by the reCAPTCHA project.
Introduction

A CAPTCHA is a challenge-response test that can be used to tell humans and computers apart (von Ahn et al. 2003). CAPTCHAs are typically used to prevent automated registrations on Web-based email services and other Web sites. The most common form of CAPTCHA consists of an image that contains distorted words or letters. The user is prompted to type in the letters that appear in the image. In order to correctly distinguish between humans and computers, a CAPTCHA must present the user with a challenge that is relatively easy for humans to solve, but difficult or impossible for computers to solve. Therefore, every time a human solves a CAPTCHA, he/she is performing a task that is known to be difficult for state-of-the-art AI algorithms.

The reCAPTCHA project, http://recaptcha.net, aims to make positive use of this precious “human computation” power by using human responses to CAPTCHAs to aid in the digitization of old books, newspapers, and other texts. Typically, such texts are digitized by scanning the entire source document, then running optical character recognition (OCR) software on the resulting images in order to recover the original text in digital form. However, state-of-the-art OCR software cannot achieve perfect transcription accuracy, especially on old books in which the words are often faded and distorted. In these cases, the OCR software will output its best guess for the word, but with a low confidence score. reCAPTCHA takes these low-confidence OCR words and displays them to users in the form of a CAPTCHA challenge. These human answers are then used to improve the accuracy of the digitized text. Figure 1 shows an example of this process. In this example, the circled word is read as “announcacl” by the OCR software, but with low confidence. This unknown word is extracted from the source document and shown to users, who type in the correct spelling: “announced.” To check whether the user is a human, the CAPTCHA also displays a known word, for which the correct spelling is already known—“Eugene” in the example above. To ensure that automated programs cannot read the
words, both words are distorted, and a black line is drawn through them.

Figure 1: reCAPTCHA digitization example. OCR software reads the circled word as “announcacl,” but with low confidence. This word is then extracted and presented as a CAPTCHA challenge to users, who type in the correct spelling: “announced.”

Figure 2: Outline of the reCAPTCHA system. The challenge generator is responsible for choosing words from the pools of known and unknown words and showing them to users such that the probability of digitizing the entire document by the time deadline h is maximized.
Figure 2 shows a simplified outline of the complete
reCAPTCHA system. reCAPTCHA maintains pools of
known words and unknown words. When a new document is
added to the system, any words that OCR cannot confidently
read are added to the pool of unknown words. For some
documents, there may also be an associated time deadline;
if so, the system only has a limited amount of time available to digitize the document. For the purpose of this work,
the most important aspect of the reCAPTCHA system is the
“Challenge Generator,” which is responsible for choosing
words from the word pools and displaying them to users
in the form of CAPTCHA challenges. If we have a hard
time deadline, the challenge generator is faced with a control problem: how should we choose challenges such that we
maximize the probability of digitizing the document before
the deadline? This turns out to be a particularly challenging
problem because the rewards achieved are unknown; i.e., the
system does not know whether a user’s response to an unknown word is actually the correct spelling of that word.
During normal reCAPTCHA operation, the challenge
generator sends one known word and one unknown word
to each user. If the user answers the known word correctly,
their response for the unknown word is counted as a “vote”
for the correct spelling of that word. If enough users agree
on the spelling of the unknown word, it is moved into the
pool of known words.1
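As a rough illustration of the bookkeeping described above, the following Python sketch counts votes for an unknown word and promotes it once enough users agree. The agreement threshold VOTES_NEEDED and the data structures are assumptions made here for exposition; they are not taken from the production reCAPTCHA system.

    from collections import Counter, defaultdict

    VOTES_NEEDED = 3  # assumed agreement threshold; the actual value is not specified here
    votes = defaultdict(Counter)  # unknown word id -> counts of submitted spellings

    def record_answer(word_id, known_word_correct, spelling, unknown_pool, known_pool):
        # Only count the unknown-word answer as a vote if the known word was answered correctly.
        if not known_word_correct:
            return
        votes[word_id][spelling] += 1
        best_spelling, count = votes[word_id].most_common(1)[0]
        if count >= VOTES_NEEDED:
            # Enough users agree: move the word from the unknown pool (a set)
            # to the known pool (a dict of word id -> accepted spelling).
            unknown_pool.discard(word_id)
            known_pool[word_id] = best_spelling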
While most users make a good-faith effort to correctly
transcribe the words, some users maliciously submit incorrect answers a high fraction of the time, which could potentially result in incorrect transcriptions. If a relatively large
proportion of users are malicious at a given time, the challenge generator can limit the damage done by temporarily serving two known words to all users. This means that no new words are transcribed, but the quality of the transcription is not negatively affected by the malicious users. Conversely, if we know that all users are not malicious, the challenge generator could theoretically send two unknown words to each user, doubling the transcription rate but increasing the chance that any malicious user would be able to seed incorrect transcriptions into the system.2

In this work, we focus on a control problem inspired by reCAPTCHA: given a document containing w unknown words and a hard time deadline h, how can the challenge generator choose challenges in order to maximize the probability of successfully transcribing the entire document on time? This poses a challenging control problem, because the reCAPTCHA system does not know whether an answer to an unknown word is correct. We are therefore posed with the difficult problem of trying to obtain some given amount of reward (number of words digitized) without actually knowing the amount of reward achieved during execution.

In the following section, we briefly present a background of the thresholded-rewards framework that has inspired our work. We then present our formalization of a reCAPTCHA-inspired domain as a thresholded-rewards MDP. We present an algorithm that aims to address the problem of unknown rewards by taking periodic samples of reward values. We experimentally evaluate the performance of this algorithm against the optimal thresholded-rewards policy and the policy that maximizes expected rewards (without considering time or cumulative reward). We also compare the effects of two different reward estimation functions.

Background
In this paper, we build upon our work on MDPs with thresholded rewards (McMillen and Veloso 2007). A thresholded-rewards MDP (TRMDP) is a tuple (M, f, h), where M
is an MDP, f is a threshold function, and h is the time
horizon. The goal of a TRMDP is to maximize the value
of f (r), where r is the cumulative reward obtained by
the end of h time steps. TRMDPs are also similar to
the concept of risk-sensitive MDPs (Liu and Koenig 2005;
2006). In previous work, we focused on the problem of
timed, zero-sum games such as robot soccer, where the goal
is to be ahead at the end of the time horizon. In this work,
we address a domain that also has a hard time deadline,
but is not zero-sum. Instead, our “score” is the number of
words successfully digitized so far. We can model the reCAPTCHA domain as an MDP, and the threshold function
f is simple: 1 if we successfully digitize w words before the
time deadline, 0 otherwise. This makes TRMDPs a natural
choice for finding the optimal policy in this domain.
1 More details on the production reCAPTCHA system, which has been online since May 2007, can be found in (von Ahn et al. 2008).

2 The production reCAPTCHA system would never send two unknown words to a user, because this would severely compromise the ability to distinguish humans from computers. For the purposes of this paper, we only address the problem of digitizing documents, ignoring the security implications of such a choice.
Figure 3: MDP model of the reCAPTCHA domain.

Figure 4: Reward distribution R(s, a) (from reCAPTCHA data).

  R(s, a)   | standard                    | two-unknown                                | two-known
  accurate  | 1 (p=0.9522), -2 (p=0.0478) | 2 (p=0.7067), -1 (p=0.1910), -4 (p=0.1023) | 0 (p=1.0)
  mixed     | 1 (p=0.8105), -2 (p=0.1895) | 2 (p=0.5569), -1 (p=0.3572), -4 (p=0.0859) | 0 (p=1.0)
  attack    | 1 (p=0.4783), -2 (p=0.5217) | 2 (p=0.1288), -1 (p=0.5490), -4 (p=0.3222) | 0 (p=1.0)

Figure 5: Transition probabilities T(s, *, s') (reCAPTCHA domain).

  T(s, *, s')  | s'=accurate | s'=mixed | s'=attack
  s=accurate   | 0.9         | 0.09     | 0.01
  s=mixed      | 0.09        | 0.9      | 0.01
  s=attack     | 0.09        | 0.01     | 0.9
However, the optimal policy for a TRMDP depends on the state, the time remaining, and the cumulative reward actually obtained during execution time. In the reCAPTCHA
domain, this presents a problem—since the system does not
know the correct spelling of each word, the system does
not know how much cumulative reward has been obtained
during execution time. In this paper, we extend the TRMDP approach by presenting a sampling-based algorithm
that chooses actions based on an estimate of the cumulative
reward achieved so far.
In addition to the optimal solution algorithm, we have
also presented heuristic solution methods for TRMDPs
(McMillen and Veloso 2007). One of these heuristic approaches is uniform-k, in which the agent only considers
changing its policy every k steps to save computation time.
In this paper, we show how the uniform-k heuristic is also
useful for the development of a sampling-based approach.
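As a small illustration, and under the assumption that uniform-k simply spaces the agent's decision points k steps apart (keeping the previous choice in between), the decision points can be computed as follows, with t counting down from h to 1 as in the algorithm presented later:

    def is_decision_point(t, h, k):
        # The agent reconsiders its strategy only when (h - t) is a multiple of k,
        # i.e., every k steps; between decision points it keeps its previous choice.
        return (h - t) % k == 0

    # With h = 1000 and k = 100 this yields only 10 decision points over the horizon.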
It is important to note that the problem of unknown rewards in a thresholded-rewards domain is distinct from the
problem of an unknown or unmodeled domain, such as is addressed by reinforcement learning (Kaelbling, Littman, and
Moore 1996). In reinforcement learning, the agent does not
know the transition or reward functions a priori, but must
learn about the domain using the transitions and rewards experienced during execution. Rather, we assume that the transition and reward functions are known, but the actual reward
values received during execution are unknown.
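To make this setting concrete, here is a minimal Python sketch, written for this presentation rather than taken from the authors' implementation: the model is known to the agent, the realized rewards are tracked by the environment but hidden, and performance is ultimately judged by the threshold function applied to the true cumulative reward at the horizon.

    class HiddenRewardEnvironment:
        """Known MDP dynamics, but realized per-step rewards are hidden from the agent."""
        def __init__(self, step_fn, initial_state):
            self.step_fn = step_fn            # known model: (state, action) -> (next_state, reward)
            self.state = initial_state
            self.true_cumulative_reward = 0   # tracked internally, never shown to the agent

        def act(self, action, reveal_reward=False):
            self.state, reward = self.step_fn(self.state, action)
            self.true_cumulative_reward += reward
            # The agent normally sees only the next state; an occasional trusted check
            # may reveal the reward as a sample.
            return self.state, (reward if reveal_reward else None)

    def threshold_f(cumulative_reward, w):
        """Threshold function for the reCAPTCHA domain: 1 if at least w words were digitized, else 0."""
        return 1 if cumulative_reward >= w else 0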
Figure 6: Histogram of per-user solution accuracy for
31,163 of the most active reCAPTCHA users.
TRMDP Domain Model

Given a source document containing w unknown words and a time deadline h steps in the future, we model the reCAPTCHA domain as a TRMDP (M, f, h), where M is an MDP containing three states (accurate, mixed, and attack) as shown in Figure 3. f is the threshold function:

f(r) = 1 if r ≥ w, and 0 otherwise.    (1)

Each step in the MDP corresponds to a single user requesting a CAPTCHA challenge. The challenge generator then needs to choose an action—what type of CAPTCHA challenge to send to the user. We consider three types of challenges: standard, in which the user is shown one known word and one unknown word; two-known, in which the user is shown two known words; and two-unknown, in which the user is shown two unknown words.

The reward distribution for the reCAPTCHA domain was derived by analyzing the answers provided by 31,163 of the most active reCAPTCHA users. Over a six-month period, these users submitted over 29 million answers to reCAPTCHA. We measured each user’s solution accuracy; Figure 6 shows a histogram of per-user solution accuracies. Based on this histogram, we have partitioned the users into three classes:

1. Accurate users. This group consists of the 27,286 users that attain a solution accuracy of 85% or higher. These users submitted a total of 28,921,000 answers; 95.22% of these answers were correct. Accurate users constitute the majority of reCAPTCHA users.

2. Inaccurate users. This group consists of the 3,807 users with solution accuracies in the range [10%-85%). These users submitted a total of 873,000 answers; 66.88% of these answers were correct. Inaccurate users are not necessarily trying to seed incorrect answers into the system; many of them are from countries where English is not a commonly-spoken language. This makes it more difficult to pass the reCAPTCHA challenge, since most of the reCAPTCHA challenges consist of English words.
3. Malicious users. This group consists of 70 users with solution accuracies lower than 10%. These users submitted a total of 108,000 answers; 0.44% of these answers were correct. Though these users provide a small proportion of total answers, it is important to consider them in our analysis because their traffic can be “bursty”; i.e. multiple malicious users may simultaneously submit many bad answers to reCAPTCHA as part of an organized attack. After time passes, the malicious users generally disappear.
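The three-way partition above can be restated compactly; the boundaries (85% and 10%) are exactly those used in the text, and the function is just a convenience for exposition.

    def classify_user(solution_accuracy):
        # Partition users by per-user solution accuracy, matching the three classes above.
        if solution_accuracy >= 0.85:
            return "accurate"
        elif solution_accuracy >= 0.10:
            return "inaccurate"
        else:
            return "malicious"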
Our MDP model has three states: accurate, mixed, and
attack. The accurate state corresponds to the most common
case, when nearly all the users are “accurate”: answering
reCAPTCHA challenges with high accuracy. In the mixed
state, we assume that half the users are “accurate” and half
the users are “inaccurate.” We therefore expect that a standard reCAPTCHA challenge will get answered correctly approximately 81% of the time. In the attack state, we similarly assume that half the users are “accurate” and half the
users are “malicious,” leading to an overall accuracy rate of
approximately 48%.
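These per-state accuracy figures follow directly from the per-class accuracies reported earlier, taking the simple average of the two classes assumed to be present in each state:

    (0.9522 + 0.6688) / 2 = 0.8105    (mixed state, standard challenge)
    (0.9522 + 0.0044) / 2 = 0.4783    (attack state, standard challenge)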
Figure 4 shows the reward distribution for our model,
which is derived from actual reCAPTCHA answer data. We
assume that we get 1 point of reward for every unknown
word answered correctly by a user, and −2 points of reward
for every word answered incorrectly, as it generally takes
two correct answers to override each incorrect answer when
it comes time to produce the final output. For a standard
CAPTCHA challenge, the agent receives reward of either 1
or −2. For a two-unknown challenge, the agent receives reward of 2 (if both words are answered correctly), −1 (if one
word is answered correctly and one word is answered incorrectly), or −4 (if both words are answered incorrectly). For
a two-known challenge, no rewards are obtained because no
unknown words are presented to the user.
Figure 5 shows the transition probabilities for our domain,
which model the fact that the most common state is accurate, followed by mixed, with the attack state occurring only a
small proportion of the time. We assume that our choice of
action does not affect the transition probabilities; therefore
the transition probabilities only depend on the current state.
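The model in Figures 4 and 5 is small enough to write out directly. The sketch below encodes both tables as Python dictionaries and shows how a single step of the domain could be simulated; the helper names are ours, but the numbers are exactly those in the figures.

    import random

    STATES = ["accurate", "mixed", "attack"]

    # Figure 4: reward distribution R(s, a), as (reward, probability) pairs.
    REWARDS = {
        ("accurate", "standard"):    [(1, 0.9522), (-2, 0.0478)],
        ("mixed",    "standard"):    [(1, 0.8105), (-2, 0.1895)],
        ("attack",   "standard"):    [(1, 0.4783), (-2, 0.5217)],
        ("accurate", "two-unknown"): [(2, 0.7067), (-1, 0.1910), (-4, 0.1023)],
        ("mixed",    "two-unknown"): [(2, 0.5569), (-1, 0.3572), (-4, 0.0859)],
        ("attack",   "two-unknown"): [(2, 0.1288), (-1, 0.5490), (-4, 0.3222)],
        ("accurate", "two-known"):   [(0, 1.0)],
        ("mixed",    "two-known"):   [(0, 1.0)],
        ("attack",   "two-known"):   [(0, 1.0)],
    }

    # Figure 5: transition probabilities T(s, *, s'); the action does not affect transitions.
    TRANSITIONS = {
        "accurate": {"accurate": 0.9,  "mixed": 0.09, "attack": 0.01},
        "mixed":    {"accurate": 0.09, "mixed": 0.9,  "attack": 0.01},
        "attack":   {"accurate": 0.09, "mixed": 0.01, "attack": 0.9},
    }

    def step(state, action, rng=random):
        """Sample one step of the reCAPTCHA MDP: a successor state and a reward."""
        rewards, probs = zip(*REWARDS[(state, action)])
        reward = rng.choices(rewards, weights=probs, k=1)[0]
        next_state = rng.choices(STATES, weights=[TRANSITIONS[state][s] for s in STATES], k=1)[0]
        return next_state, reward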
Sampling-Based Control Policy

The optimal policy in a thresholded-rewards domain depends on the time remaining and the cumulative reward obtained by the agent. In this section, we present our sampling-based control policy which aims to address the challenge of having unknown rewards at execution time. We assume that our agent occasionally observes the reward received when taking an action; for the reCAPTCHA domain, we assume there is a trusted human who checks the results of every kth CAPTCHA response and verifies whether the solutions to the unknown words (if any) were correct. The agent will only consider changing its policy when it receives a reward sample, which means that the agent’s policy will be equivalent to the uniform-k policy except that we use an estimate of the cumulative reward rather than the true cumulative reward.

Algorithm 1 Sampling-based control policy.
 1: Given: MDP M, threshold function f, time horizon h, sampling interval k, reward estimation function E
 2: π ← UNIFORM(M, f, h, k)
 3: S ← {} // set of reward samples
 4: s ← s0 // current state
 5: for t ← h to 1 do
 6:     if |S| = 0 then
 7:         r̂ ← 0
 8:     else
 9:         r̂ ← nearest integer(E(S))
10:     a ← π(s, t, r̂)
11:     (s, r) ← ACT(M, a)
12:     if r is observed then // get reward sample (every k steps)
13:         S ← S ∪ {r}

Algorithm 1 shows the sampling-based control policy we have developed for thresholded-rewards domains with unknown rewards. This algorithm takes the same inputs as the TRMDP solution algorithm (McMillen and Veloso 2007): an MDP M, threshold function f, and time horizon h. For the reCAPTCHA domain presented above, f is the threshold function shown in Equation 1: we receive reward 1 if the cumulative reward is greater than or equal to the number of unknown words w in the document; 0 otherwise. Algorithm 1 also takes as input an integer k, the number of time steps elapsed between each reward sample; and a reward estimation function E, which takes in a set of reward samples S and outputs an estimate of how much cumulative reward the agent has received so far. In this paper, we consider two reward estimation functions. The first, MEAN, computes the mean of the samples, then multiplies this value by the total number of steps taken so far:

MEAN(S) = ( Σ_{r∈S} r / |S| ) × (h − t)    (2)

The other reward estimation function we consider is LOW, which estimates the per-step reward as the lower bound of the 95% confidence interval of the samples seen so far:

LOW(S) = CI-LOWER-BOUND(S) × (h − t)    (3)
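Both estimators are straightforward to implement. The sketch below follows Equations 2 and 3; since the text does not specify how the 95% confidence interval is computed, the version here uses a normal approximation (sample mean minus 1.96 standard errors), which should be read as one reasonable choice rather than as the authors' exact implementation. The steps_so_far argument corresponds to (h − t).

    import math

    def mean_estimate(samples, steps_so_far):
        # Equation 2: mean per-step reward of the samples, scaled by the steps taken so far.
        return (sum(samples) / len(samples)) * steps_so_far

    def low_estimate(samples, steps_so_far, z=1.96):
        # Equation 3, with an assumed normal-approximation 95% confidence interval:
        # per-step estimate = sample mean - z * (sample standard deviation / sqrt(n)).
        n = len(samples)
        mean = sum(samples) / n
        if n < 2:
            lower = mean  # too few samples to estimate spread; fall back to the mean
        else:
            variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
            lower = mean - z * math.sqrt(variance / n)
        return lower * steps_so_far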
Line 2 of Algorithm 1 first calls the uniform-k TRMDP
solution algorithm. This returns the best policy π that only
considers changing strategies every k time steps. On lines
3–4, we initialize the set of reward samples to the empty set
and the current state of the system to be the initial state of
MDP M . Line 5 begins the main control loop of the agent:
in each iteration through this loop, the agent executes a single action. Lines 6–9 determine the reward estimate r̂ that
will be used to determine the action. If the agent has not
yet received any reward samples, it assumes that the cumulative reward is 0. If the agent does have samples, it
calls the reward estimation function to estimate the cumulative reward. The result of the reward estimation function
is rounded to the nearest integer, because the policy π returned by the uniform-k solution algorithm requires integer
reward values. Line 10 determines the optimal action to take
given the current state, the time remaining, and our estimate of the cumulative reward. On line 11, the agent acts in the world. As a result of this action, the agent receives the new state of the system s, and may also receive knowledge of the reward r it received for taking that action. If a reward sample is received, the agent adds it to the set of reward samples seen. Regardless, the loop then continues at line 5, until h time steps have elapsed and the process completes.
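Putting the pieces together, a compact Python rendering of the control loop in Algorithm 1 might look as follows. The uniform_solver and act arguments stand in for the uniform-k TRMDP solver and the environment step; both interfaces are assumed here for illustration, and estimate would be mean_estimate or low_estimate from the sketch above.

    def sampling_based_control(mdp, f, h, k, estimate, uniform_solver, act, initial_state):
        # Sketch of Algorithm 1: act for h steps, estimating cumulative reward from periodic samples.
        pi = uniform_solver(mdp, f, h, k)   # line 2: best policy that re-plans only every k steps
        samples = []                        # line 3: reward samples observed so far
        s = initial_state                   # line 4: current state
        for t in range(h, 0, -1):           # line 5: t counts down from h to 1
            if not samples:                 # lines 6-7: no samples yet, so assume zero reward
                r_hat = 0
            else:                           # lines 8-9: estimate, rounded to the nearest integer
                r_hat = round(estimate(samples, h - t))
            a = pi(s, t, r_hat)             # line 10: action from the uniform-k policy
            s, r = act(mdp, a)              # line 11: act in the world
            if r is not None:               # lines 12-13: a reward sample arrives every k steps
                samples.append(r)
        return samples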
Figure 7: Value of the optimal policy and uniform-k for the reCAPTCHA domain, with the time horizon h = 1000, values of k in {10, 20, 50, 100}, and thresholds from 500–1500.

Figure 8: Results when sampling using the MEAN reward estimation function on the reCAPTCHA domain, with time horizon h = 1000, values of k in {1, 10, 20, 50, 100}, and thresholds from 500–1500. The y-axis shows the proportion of 3,600 trials which were successes. The success rate of the maximize-expected-rewards (“MER”) policy is also shown.
Results
In this section, we present results showing the effectiveness of our algorithm on the example reCAPTCHA domain.
First, we consider the question: how effective would an
agent be if the agent actually knew the reward it received
at execution time? By running the optimal TRMDP solution
algorithm on this domain, we can find the value of the optimal policy, which serves as an upper bound on the value of
any solution algorithm for this domain. Further, the uniform-k algorithm provides the optimal policy that only considers changing policy every k steps. This gives an upper bound on the value of any policy that samples every k steps: if the reward
estimation function E always returns the actual cumulative
reward, the sampling policy will always choose the optimal
uniform-k action; however, if E estimates incorrectly, the
sampling policy might choose a suboptimal action.
Figure 7 shows the results of running the optimal solution
algorithm and uniform-k heuristic on the reCAPTCHA domain, with the time horizon set to h = 1000 steps, threshold
values ranging from 500 to 1500, and k set to 10, 20, 50, and
100. The value of each policy is equal to the probability that
an agent following that policy will achieve the desired reward threshold. For a threshold of 500 to 800 words, the figure shows that the agent can achieve the threshold with near
certainty (> 95%). As the threshold increases, the probability of achieving the reward threshold decreases, but the optimal algorithm can still succeed over 50% of the time when
the threshold is set to 1200. The value drops off sharply after that point; with the reward threshold set to 1500, the target is achieved less than 5% of the time. The uniform-k policies perform worse as k increases; this is unsurprising since higher values of k correspond to rougher approximations to the optimal policy. However, even uniform-100 has performance that is reasonably close to optimal.

Figure 8 shows the results of our sampling algorithm when the MEAN function is used to estimate the reward. Again, we have set the time horizon to h = 1000 steps and thresholds from 500–1500. For the sampling algorithm, k determines how often the agent receives a reward sample, and is set to 1, 10, 20, 50, or 100. Sampling-1, which receives a reward sample at every time step, is equivalent to the optimal policy. For each combination of k and threshold, we have run 3,600 trials; the graph shows the fraction of these trials which were successes (i.e. in which the reward threshold was met). For comparison with the non-thresholded-rewards solution, we also show the success rate of the policy that simply maximizes expected rewards (“MER”). The MER policy always chooses the standard CAPTCHA challenge when the MDP is in the accurate or mixed states, and the two-known challenge when the MDP is in the attack state. Since the MER policy does not depend on the reward received so far, it chooses an action at every time step (as does sampling-1).

Figure 9: Results when sampling using the LOW reward estimation function on the reCAPTCHA domain, with time horizon h = 1000, values of k in {1, 10, 20, 50, 100}, and thresholds from 500–1500. The y-axis shows the proportion of 3,600 trials which were successes.

Figure 8 clearly shows that the sampling-based thresholded-rewards policy outperforms the policy which maximizes expected rewards. However, there is a significant gap between the sampling-based policies and the theoretical upper bounds shown in Figure 7. With the MEAN reward estimator, agents’ reward estimates are effectively “too optimistic” in a significant number of trials. An agent
which believes it has enough reward to meet the threshold
will begin to act conservatively, selecting the two-known
CAPTCHA because the two-known CAPTCHA cannot lead
to negative reward. If the reward estimate is correct, this is
indeed the best strategy, but if the reward estimate is too
high, the agent’s “complacent” choice of the two-known
CAPTCHA prevents it from achieving the remaining reward
that is needed. When the desired reward is relatively
easy to achieve, the chance of incorrectly falling into this
“complacent” strategy is higher, explaining the relatively
large gap between the sampling-based policies and the
theoretical upper bounds when the threshold value is low.
Figure 9 shows the results of our sampling algorithm
when the L OW function is used to estimate the reward. The
L OW function is more pessimistic with regard to the estimated reward; it assumes that the average per-step reward
is the lower bound of the 95% confidence interval of the reward samples seen so far. Since the L OW function underestimates the reward (compared to M EAN), an agent using the
L OW function is unlikely to erroneously fall into the “complacent” strategy. The results indicate that the performance
of the L OW function is better than the performance of M EAN
for almost every setting of k and the threshold value. The
performance of sampling-10 nearly matches the theoretical
upper-bound of uniform-10. As expected, the overall performance degrades as we take samples less often. However,
even sampling-100 significantly outperforms the maximize-expected-rewards policy, which is quite impressive since the
sampling-100 agent only receives 10 reward samples over
the entire time horizon.
These results show that there is a significant benefit to
reasoning about reward and time in thresholded-rewards domains, even if our agent only obtains a small sample of the
reward values received during execution.

Conclusion

In this paper, we have presented a sampling-based algorithm that enables agents to reason about cumulative rewards and time deadlines in domains where the exact rewards achieved by the agent are not known to the agent at execution time. Our previous work was directed towards zero-sum games; in this work we address a problem that has a hard time deadline and a notion of “score”, but that is not a game and does not have a specific opponent. To this end, we have used over 29 million solutions from over 31,000 reCAPTCHA users to come up with a plausible model of a domain inspired by the reCAPTCHA project. Using this domain, we have shown that reasoning about time and “score” can lead to a significant benefit, even if we only obtain a small sample of reward values during execution time. We have tested the effectiveness of two possible reward estimation functions: MEAN, which estimates cumulative reward by using the mean of all samples seen so far; and LOW, which estimates cumulative reward by using the lower bound of the 95% confidence interval. If the MEAN function happens to overestimate the reward achieved by the agent, this can potentially cause the agent to adopt an overly conservative strategy, which detracts from overall performance. In comparison, the LOW function provides a somewhat pessimistic estimate of the reward obtained by the agent, which prevents the agent from erroneously adopting an overly conservative strategy.
Acknowledgements
We would like to thank Luis von Ahn for providing the first
author with the opportunity to work on the reCAPTCHA
project and also for providing helpful feedback. This research was sponsored in part by the United States Department
of the Interior under Grant No. NBCH-1040007, and in part
by the Boeing Corporation. The views and conclusions contained in this document are those of the authors only.
References
Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996.
Reinforcement learning: A survey. Journal of Artificial
Intelligence Research.
Liu, Y., and Koenig, S. 2005. Existence and finiteness conditions for risk-sensitive planning: Results and conjectures.
In Proceedings of Uncertainty in Artificial Intelligence.
Liu, Y., and Koenig, S. 2006. Functional value iteration for
decision-theoretic planning with general utility functions.
In Proceedings of the Twenty-First National Conference on
Artificial Intelligence (AAAI-06).
McMillen, C., and Veloso, M. 2007. Thresholded rewards:
Acting optimally in timed, zero-sum games. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07).
von Ahn, L.; Blum, M.; Hopper, N. J.; and Langford, J.
2003. CAPTCHA: Using hard AI problems for security. In
Eurocrypt 2003.
von Ahn, L. et al. 2008. Manual character recognition using online security measures: An example of crowd computing. (Under review).