Proceedings of the Sixth Symposium on Educational Advances in Artificial Intelligence (EAAI-16)
The Turing Test in the Classroom
Lisa Torrey, Karen Johnson,
Sid Sondergard, Pedro Ponce, Laura Desmond
St. Lawrence University
Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
This paper discusses the Turing Test as an educational activity for undergraduate students. It describes in detail an experiment that we conducted in a first-year non-CS course. We also suggest other pedagogical purposes that the Turing Test could serve.
Introduction
The Turing Test occasionally appears as a hands-on activity in K-12 classrooms (Lehman 2009) and museum exhibits (Hollister et al. 2013). However, the 2013 ACM curriculum suggests only that computer science majors should be able to describe it (The Joint Task Force on Computing Curricula 2013).
We propose a few more active roles the Turing Test could play in the undergraduate classroom, both within and beyond the CS major. For example, it could serve as:
• A broad-interest introduction to computer science in a first-year course.
• An exercise in experimental design and analysis in a psychology or statistics course.
• A hands-on social activity in a course on philosophy or artificial intelligence.
• A programming project for advanced CS students.
We begin by motivating each of these examples. Then we elaborate on the first one by presenting a Turing Test that we conducted with a large group of first-year undergraduates. We describe how we organized the experiment, the software we designed for it, and the data it generated. Finally, we share some observations and advice that may help other instructors plan successful Turing Tests.
History of the Turing Test
In 1950, Alan Turing asked his famous question: “Can machines think?” To provide a way to answer it, he introduced the idea of the Imitation Game (Turing 1950). In this game, a human interrogator converses electronically with another human and with a machine, and must decide which is which. If the interrogator cannot reliably tell the difference, Turing suggested, then we ought to conclude that the machine thinks.
The Imitation Game, known better today as the Turing
Test, has been a matter of debate ever since. Turing’s suggestion had a bold implicit claim: that conversation is a complex enough behavior that conversational ability is sufficient
evidence of intelligence. Objections (Searle 1980), as well
as defenses (Levesque 2009), are numerous.
In 1991, Hugh Loebner began sponsoring the Loebner
Prize, an annual contest modeled after the Turing Test. Computer programs entered into the contest attempt to convince
human judges that they are also human. The program that
most often succeeds is awarded the annual prize.
In the 1991 contest, 10 judges had 7-minute conversations
with each of 6 programs and 2 human “confederates” and
ranked them from least to most human. The judges and confederates were not computer scientists, and presumably they
did not know each other. To make the program developers’
task easier, conversations were restricted to specific topics.
Participants were instructed to converse naturally, without
“trickery or guile” (Shieber 1994).
Although the two humans were not ranked first by every judge, their average rankings were higher than those of any program. The programs may have been even less successful
if the judges had been more knowledgeable in computer
science, or had been permitted to conduct an interrogation
rather than a polite conversation (Shieber 1994). The format of the Loebner contest has changed from year to year in
response to such observations.
By 2000, there were no longer any restrictions placed on
conversation topics or interrogation styles. After 5 minutes
of conversation, judges classified each contestant as human
or computer. Then the conversation continued for 10 more
minutes, and the judges decided again at the end. Their classifications were 91% correct after 5 minutes and 93% correct
after 15 minutes (Moor 2001).
In 2008, the judges conducted 5-minute conversations
with a human and a program simultaneously on a split
screen, and had to decide between them. The conversations
were not synchronized in a question-answer format; any participant could send a message at any time. The winning program that year convinced 3 of the 12 judges to decide the
wrong way (Shah and Warwick 2009).
In 2013, the judges again had simultaneous unsynchronized conversations. However, this time they were all experts in artificial intelligence, and they made no classification mistakes at all.
One interesting aspect of Turing Tests in all their forms
is that humans sometimes get labeled as computers. There
are many potential reasons for these errors: cultural differences, language barriers, the ambiguity of natural language,
and the limitations of electronic communication, to name a
few. Variation in the eagerness and cooperativeness of the
confederates may also play a role (Christian 2011).
Turing Test as Active Philosophy
The Turing Test is sometimes discussed in courses on artificial intelligence (Russell and Norvig 2009). It also sometimes appears in philosophy courses as part of a discussion
on mind and consciousness (Kim 2010). In courses like
these, there may be value added by going beyond discussing
the Turing Test and actually conducting one.
Arguably we need only point to the extensive literature
on active learning to justify this suggestion. Active participation in classroom activities is known to promote engagement and understanding (Prince 2004). However, we can
also identify some specific goals that are better achieved in
a live Turing Test than in a discussion.
It is difficult for students to judge whether or not the Turing Test is an effective device for the evaluation of intelligence. This is particularly true for students who lack programming experience. After testing modern chatbots themselves, students can begin to develop more informed opinions. The experience can expose assumptions they may have
about the abilities and limitations of computers - and also of
humans!
Turing Test as Advertisement
Attracting students to study computer science is an ongoing
endeavor in CS education. There are many challenges involved, but one of the major factors is unfamiliarity with the
field. Students often get little exposure to computer science
before college and may have negative preconceived notions
about it (Carter 2006).
Introductory CS courses can address this knowledge gap,
but first we must get students to take them. Exposure to the
Turing Test could be one way to pique students’ interest. It
is a broadly accessible concept, requiring no particular background knowledge to understand and discuss, and participation requires no technical skill. It tends to be an interesting
topic for students who have encountered aspects of artificial
intelligence in both fiction and reality throughout their lives.
Although nothing beats active participation, merely discussing the Turing Test can be an interesting exercise. Turing’s original paper (Turing 1950) is quite readable, and it
can provoke lively discussions on what intelligence is and
how to detect it. Introducing some controversy, via the famous Chinese Room argument (Searle 1980) and its rebuttals (Levesque 2009), can further enhance the debate.
Although these discussions are largely philosophical, they
inevitably elicit curiosity about more technical aspects of
computer science. How would a conversation program
work? Can a program make a choice? Is it capable of behaving randomly? These kinds of questions make good reasons
to point students towards programming courses.
Turing Test as Programming Exercise
The Turing Test has a software prerequisite: conversations
need to take place over an electronic interface, so that the
identities of the participants are concealed. The Loebner
prize at first addressed this need by commissioning software
from the contestants, but in recent years, the organizers have
hired a professional developer.
For experienced computer science students, creating this
type of software could be a worthwhile programming
project. The application we implemented for our Turing
Test, described in the next section, could serve as one assignment model. It touches upon several important aspects
of the CS curriculum: object-oriented programming, user interface design, concurrency, and client-server architecture.
We note that this software requirement is probably the
main reason that the Turing Test is not a common activity
in the classroom. Although any CS faculty member would
be able to implement it, the time investment may discourage
most from doing so. For this reason, our software is available (via email) to any instructor who wishes to use it or
build from it.
Turing Test as Science Experiment
The Turing Test translates a philosophical question into a
scientific experiment with human participants. For students
of disciplines like psychology and statistics, it could serve
as a practical exercise in designing and carrying out such
an experiment. Students would need to grapple with several
experimental parameters and types of data.
Some of the parameters are quantitative. How many
judges and confederates should there be? How many conversations should each participant have, and how long should
they have for each one? Should judges make binary classifications of the contestants, or rank them, or assign them
scores? What should be the criteria for a program to pass the
test?
Other important decisions are qualitative. For example,
what limitations should be placed on conversation in the
test? Unlike the Loebner Prize, a classroom test will probably not have state-of-the-art chatbots that are tailored to
the event. Some restrictions may be desirable to make the
judges’ task interesting.
The results of a Turing Test lend themselves to several types of analysis. Judges generate data that students can collect and examine statistically. A more qualitative analysis of the conversation transcripts is also interesting.
Our Turing Test
We conducted a Turing Test with 78 students at our small
liberal-arts college. The students came from two courses in
the First-Year Program, whose goal is to develop college-level skills in writing, speaking, and research. FYP courses
are team-taught and have interdisciplinary themes that provide context for skill development.
In section 1, taught by Torrey and Johnson, the theme was
Reason and Debate in Scientific Controversy. Students in
this course, most of whom intended to pursue science majors, discussed controversies from several fields of science.
Since Torrey is a computer scientist and AI researcher, one
of the controversies we chose was the Turing-Searle debate.
The Turing Test event served as a culminating activity for
that unit.
In section 2, taught by Sondergard, Ponce, and Desmond,
the theme was Paranormal Phenomena and Science: Experiencing and Analyzing the Inexplicable. Students in this
course examined scientific methodologies that have been
applied to the study of anomalous phenomena, from the
placebo effect to dream telepathy. Sondergard and Desmond
study the early modern scientific history of Europe and Asia,
and saw in the Turing Test a modern variant on traditional attempts to verify the presence of human (or other) agency in
inexplicable phenomena.
Software
The software we designed for our Turing Test fulfilled several needs. Of course, it allowed students to have anonymous electronic conversations. It also coordinated multiple
test groups simultaneously so that judges spoke once with
each contestant in their group, and it filed away all the conversations and judgments for later analysis.
The application is built on a client-server model. Each
participant in a session runs a copy of the client, and all the
clients connect to a single server. Both the client and server
are implemented in Java and can therefore run on any networked machine.
The server program has a bare-bones command-line interface. It asks for a session ID and the sizes of all the test
groups in the session, then starts accepting client connections. Once all the test groups are filled, conversations can
be started and stopped through keyboard prompts.
When conversations are started, the server becomes multithreaded, with one thread controlling each concurrent conversation. Messages are accepted alternately from the judge
and contestant. When a conversation ends, the server
prompts the judge’s client for a decision. It saves all conversation transcripts and judge decisions to local HTML files,
ready to view in a web browser.
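As a rough sketch of how such a conversation thread can be structured, the Java fragment below relays messages alternately between the two participants and writes an HTML transcript when the conversation ends. It is an illustration only, not our exact implementation: the judge-first turn order, the DECIDE prompt string, and the fixed number of exchanges are simplifying assumptions.
    import java.io.*;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    // Sketch of one server-side conversation thread: messages are accepted
    // alternately from the judge and the contestant, and the transcript is
    // saved as a simple HTML file when the conversation ends.
    public class ConversationThread extends Thread {
        private final BufferedReader judgeIn, contestantIn;
        private final PrintWriter judgeOut, contestantOut;
        private final StringBuilder transcript = new StringBuilder();
        private final Path htmlFile;
        private final int exchanges;

        public ConversationThread(Socket judge, Socket contestant,
                                  Path htmlFile, int exchanges) throws IOException {
            judgeIn = new BufferedReader(new InputStreamReader(
                    judge.getInputStream(), StandardCharsets.UTF_8));
            contestantIn = new BufferedReader(new InputStreamReader(
                    contestant.getInputStream(), StandardCharsets.UTF_8));
            judgeOut = new PrintWriter(new OutputStreamWriter(
                    judge.getOutputStream(), StandardCharsets.UTF_8), true);
            contestantOut = new PrintWriter(new OutputStreamWriter(
                    contestant.getOutputStream(), StandardCharsets.UTF_8), true);
            this.htmlFile = htmlFile;
            this.exchanges = exchanges;
        }

        @Override
        public void run() {
            try {
                for (int i = 0; i < exchanges; i++) {
                    relay("Judge", judgeIn, contestantOut);      // judge speaks first
                    relay("Contestant", contestantIn, judgeOut); // then the contestant replies
                }
                judgeOut.println("DECIDE");  // ask the judge's client for a decision
                transcript.append("<p><b>Decision:</b> ")
                          .append(judgeIn.readLine()).append("</p>\n");
            } catch (IOException e) {
                transcript.append("<p>Conversation ended early.</p>\n");
            } finally {
                try {
                    Files.write(htmlFile, ("<html><body>\n" + transcript
                            + "</body></html>\n").getBytes(StandardCharsets.UTF_8));
                } catch (IOException ignored) { }
            }
        }

        // Read one message from the speaker, forward it, and log it.
        private void relay(String role, BufferedReader from, PrintWriter to)
                throws IOException {
            String message = from.readLine();
            if (message == null) throw new IOException(role + " disconnected");
            to.println(message);
            transcript.append("<p><b>").append(role).append(":</b> ")
                      .append(message).append("</p>\n");
        }
    }
In our application, conversations are started and stopped by the operator at the keyboard rather than after a fixed number of exchanges, but the relay-and-log structure is the same in spirit.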
The client program has a graphical interface. It begins
with a registration screen at which participants enter their
assigned IDs. During conversations, it displays the simple
chat screen in Figure 1. After each conversation, it displays
one last screen to judges only, requiring them to select a
radio button to make their decision.
Although it is simpler than the server with respect to concurrency, the client program is also multi-threaded. One
thread handles communication with the server, and the Event
Dispatch Thread handles changes to the user interface.
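The fragment below sketches that division of labor with Java Swing; the class name, window layout, and port are placeholders rather than our actual code. A background thread blocks on the socket, and every change to the interface is queued onto the Event Dispatch Thread.
    import javax.swing.*;
    import java.io.*;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Sketch of a chat client: one thread reads from the server, while all
    // user-interface updates run on the Swing Event Dispatch Thread.
    public class ChatClientSketch {
        private final JTextArea chatArea = new JTextArea(20, 40);

        public void start(String host, int port) throws IOException {
            SwingUtilities.invokeLater(() -> {
                JFrame frame = new JFrame("Turing Test");
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.add(new JScrollPane(chatArea));
                frame.pack();
                frame.setVisible(true);
            });

            Socket socket = new Socket(host, port);
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.UTF_8));

            // Network thread: blocks on the socket and never touches the GUI directly.
            new Thread(() -> {
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String message = line;
                        SwingUtilities.invokeLater(() -> chatArea.append(message + "\n"));
                    }
                } catch (IOException e) {
                    SwingUtilities.invokeLater(() -> chatArea.append("[Connection closed]\n"));
                }
            }).start();
        }

        public static void main(String[] args) throws IOException {
            new ChatClientSketch().start("localhost", 9000);  // placeholder host and port
        }
    }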
Implementing this software would pose several kinds of challenges for advanced CS majors. In terms of programming skills, it would be an exercise in client-server architecture, object-oriented design, user interface development, and concurrency. In terms of software engineering, it could be an exercise in team coordination, requirements analysis, design, and testing. Since we are making our version of the software available to instructors, it could also be used for an assignment on modifying an existing codebase.
Procedure
We structured our Turing Test much like the early Loebner contests. Judges and contestants conversed one-on-one
through a custom-built user interface (see Figure 1). After 5
minutes, the judge had to label the contestant as human or
computer. The computer role was performed by Cleverbot, a
web chatbot that learns words and phrases from people who
interact with it online.
Figure 1: Our Turing Test chat interface.
Each of our students was assigned to be a judge, a confederate, or a transcriber. The judge and confederate roles
are familiar. The transcriber’s task was to submit the judge’s
questions to the Cleverbot website and then copy Cleverbot’s
replies back into the Turing Test interface.
We divided our 78 participants into 8 test groups. Most
groups consisted of 5 judges, 3 confederates, and 2 transcribers; the smaller group had one fewer judge and one fewer transcriber. Within a group, each judge spoke to each contestant,
so most students participated in 5 conversations.
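For instructors planning a similar event, the bookkeeping behind this arrangement is easy to check. The short Java fragment below is an illustration we include here for convenience (it is not part of the event software); it counts the conversations our grouping produces.
    // Counts the conversations implied by our grouping: seven full test groups
    // of 5 judges, 3 confederates, and 2 transcribers, plus one smaller group.
    public class GroupCounts {
        public static void main(String[] args) {
            int[][] groups = { {5, 3, 2},   // judges, confederates, transcribers
                               {4, 3, 1} }; // the smaller group
            int[] copies = {7, 1};

            int withConfederates = 0, withCleverbot = 0;
            for (int g = 0; g < groups.length; g++) {
                // Each judge speaks once with each contestant in the group.
                withConfederates += copies[g] * groups[g][0] * groups[g][1];
                withCleverbot    += copies[g] * groups[g][0] * groups[g][2];
            }
            System.out.println("Conversations with confederates: " + withConfederates); // 117
            System.out.println("Conversations with Cleverbot:    " + withCleverbot);    // 74
            System.out.println("Total conversations:             "
                    + (withConfederates + withCleverbot));                              // 191
        }
    }
These totals match the counts reported in the statistical analysis below.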
The entire event took place within two 45-minute sessions, with 4 of the 8 test groups present for each session.
We arranged for the judges to be in one computer lab, the
confederates in another, and the transcribers in a third. This
required 3 labs with 20, 12, and 8 workstations respectively.
Naturally, as first-year undergraduates, our students had
no particular expertise in computer science or artificial intelligence. They knew how the test would proceed, but we did not introduce them to Cleverbot before the event. We assigned them IDs that indicated where and when they should arrive, but they did not know which role they would play until they arrived. We encouraged them all to think about potential questioning strategies in advance, in case they ended up as judges.
The conversation topics were mostly unrestricted, but we did establish two rules:
1. Don’t be rude.
2. Don’t ask for personal or identifying information, and avoid giving such information if asked.
Rule 1 was just a precaution against the decivilizing effects of anonymity, but Rule 2 was important for the validity of the test. Its goal was to offset the inherent advantage that our students had over Cleverbot because many of them are friends or acquaintances. For the same reason, we arranged for most conversations to take place between students from different course sections.
Test Group   Confederate 1   Confederate 2   Confederate 3   Transcriber 1   Transcriber 2
A                2/5             5/5             3/5             3/5             0/5
B                3/5             4/5             5/5             2/5             2/5
C                3/5             4/5             3/5             2/5             2/5
D                2/5             4/5             4/5             3/5             1/5
E                5/5             5/5             5/5             2/5             1/5
F                5/5             5/5             3/5             2/5             2/5
G                2/5             5/5             1/5             3/5             2/5
H                3/4             3/4             4/4             0/4              -
Table 1: The fraction of judges within each test group who labeled each contestant as human.
Statistical analysis
Table 1 displays the judgment data from each of our test
groups. Collectively, our judges had 117 conversations with
confederates and 74 conversations with Cleverbot. They correctly identified 88 of the confederates (75% correct) and 47
of the chatbots (64% correct). Their combined accuracy was
71% overall.
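These percentages follow directly from pooling the counts in Table 1; the small fragment below (again an illustrative calculation, not part of our software) reproduces them.
    // Reproduces the aggregate accuracy figures from the Table 1 totals.
    public class Accuracy {
        public static void main(String[] args) {
            int confederateConversations = 117, confederatesLabeledHuman = 88;
            int chatbotConversations = 74, chatbotsLabeledComputer = 47;

            double onConfederates = 100.0 * confederatesLabeledHuman / confederateConversations;
            double onChatbots = 100.0 * chatbotsLabeledComputer / chatbotConversations;
            double overall = 100.0 * (confederatesLabeledHuman + chatbotsLabeledComputer)
                    / (confederateConversations + chatbotConversations);

            // Prints: confederates 75%, Cleverbot 64%, overall 71%
            System.out.printf("confederates %.0f%%, Cleverbot %.0f%%, overall %.0f%%%n",
                    onConfederates, onChatbots, overall);
        }
    }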
To look at it another way: on average, our students recognized each other as human 75% of the time, and thought
Cleverbot was human 36% of the time. They had a relatively
high error rate, but overall there was still a clear distinction
between human and machine.
This distinction was not so clear in each individual test
group. The least accurate test group was G, which correctly
classified only 53% of the confederates and misclassified
50% of the chatbots. Leaving out test group H because of
its unusual size and structure, the most accurate test group
was E, which correctly classified 100% of the confederates
and misclassified 30% of the chatbots.
In test groups A-D, the judges were from section 2 and the
confederates were from section 1. In groups E-H, it was the
other way around. If we examine them separately, some statistical differences emerge. The judges from section 2 were
70% correct on confederates, 61% correct on Cleverbot, and
66% correct overall. The judges from section 1 were 79%
correct on confederates, 65% correct on Cleverbot, and 74%
correct overall.
The most likely explanation for this discrepancy is that
the Turing Test played a larger role in the course for section
1 than for section 2. Before this event, students in section
1 read several papers, conducted in-class discussions, and
participated in an experimental design exercise. This additional exposure may have allowed them to be more discerning judges and/or more neutral confederates.
Transcript analysis
Although we have no direct testimony from the judges about their interrogation strategies or decision rationales, we can make some inferences by analyzing their conversations. The transcripts can also provide some insight into student assumptions about the abilities and limitations of computers. All of the excerpts in this section have been reproduced without any editing.
Judges initiated conversations in a variety of ways: by giving a conventional greeting, making a remark, or asking a question. Typically, they abandoned their opening topic after the contestant’s first response. Occasionally, the judge would pursue the same topic for another exchange, but only 4% of the opening topics lasted for 3 exchanges and only 1% lasted for 4 or more. This tendency to switch gears no doubt worked to Cleverbot’s advantage, since it appears to base its response only on the most recent prompt.
Many judges asked about preferences and opinions:
• So what do you like to do?
• What are your thoughts on Obama?
• What’s your favorite Tom Hanks movie?
Others checked for recognition of popular trends:
• What does #tbt mean?
• Do you like to twerk?
A few demanded objective knowledge:
• What is the capital of Canada?
• What is synaptic pruning?
Judges were likely to mistake the chatbot for a human if their conversation did not venture beyond small talk. This type of dialogue is so formulaic that Cleverbot has no trouble giving appropriate responses:
Judge: Hey there
Cleverbot: How are you?
Judge: I am good thanks. how are you?
Cleverbot: I am good.
Only a few conversations went beyond pleasantries and
identification of interests. A few judges and confederates
did establish more of an emotional or humorous connection:
Judge: What if you found out your dog died?
Confederate: my dog is 13, and she could die any day,
so thanks for that.
Judge: well... you love your dog dont you? whats the
breed of your dog? and whats yoru favorite thing about
it?
Confederate: She’s a black lab, and I love that she
doesn’t bark. She was trained not to.
Judge: That’s no fun how do you know she’s happy?
Confederate: She has her own ways of telling us, like
wagging her tail and jumping on her back legs.
Judge: Meow.
Confederate: Are you a cat?
Judge: Yes.
Confederate: Wow. That’s interesting. I didn’t know
cats could type. You must be a very smart cat.
Judge: Thank you, most people don’t know about how
cats can type. You’re one of the lucky ones.
Confederate: I feel so honored.
In a few interesting cases, Cleverbot seemed to fool
judges by making them feel a connection. These judges
were even willing to look beyond the occasional stumble:
Judge: i want to hear your secret first
Cleverbot: I just got broken up with my girlfriend
of three years, i’m hurting. What should I do to stop
dreaming about her
Judge: i think you need to find a new girl
Cleverbot: Mine are correct, yours are not
Judge: what does that even mean?
Cleverbot: It means that you should create motto for
us
Judge: YOLO! that’s the motto. deal?
Judge: Dobby has no master. Dobby is a free elf.
Cleverbot: But Dobby got a sock, Dobby is free!
Judge: How dare you stand where he stood!?
Cleverbot: Sir?
Judge: A man who trusted you!
Cleverbot: You’re not a man.
Judge: How do you know?
Cleverbot: I don’t know.
Judge: Exactly, you don’t. but we are both Harry
Potter fans and that is okay
Other judges mistook Cleverbot for a human when it made replies that seemed witty, knowing, or creative. This type of “cleverness” is just what Cleverbot is named for:
Judge: What’s up?
Cleverbot: The ceiling.
Judge: What’s down?
Cleverbot: The floor.
Judge: knock knock
Cleverbot: No, no jokes. This is a serious matter.
Judge: What is life?
Cleverbot: Life is but a second of an eternity that we will exist in.
Judges misclassified their fellow humans as computers surprisingly often. Typically, these errors could be attributed to a social gap of some kind between judge and confederate. For example, some confederates failed to recognize cultural references:
Judge: what’s gucci my ****, What’s drugs my dealer...
Confederate: Why would you talk like that?
Judge: Whay would YOU talk like that? Is there something wrong with the way I speak? I am not judging you’re “whats up”...
Confederate: I am sorry it was a simple question, I was just curious why you were talking like that
Judge: It’s song lyrics... I apologize for blacking on you.
Others responded soberly to a playful judge, or vice versa:
Judge: hey girl hey ;)
Confederate: Greetings. That’s presumptious to assume I’m a girl, isn’t it?
Judge: Where are you from?
Confederate: A galaxy far, far away.
In general, judges were suspicious of contestants who declined to pursue a proposed topic or indulged in non sequiturs. This seems to reflect an assumption about chatbot weaknesses that is entirely correct. On the other hand, confederates who responded with blandness or formality were also more likely to be called computers. This bias probably comes from media depictions of artificial intelligence, which do not necessarily represent modern chatbots well.
Conclusions
We will conclude by observing which aspects of our Turing
Test worked well, and which could be improved. This experience could benefit other instructors who wish to conduct a
Turing Test, regardless of its pedagogical role.
First, we noticed that the conversations were rather brief.
They contained only 6 exchanges on average (ranging from
3 to 12). Our students agreed that the allotted time for each
conversation felt short. They may have been more accurate, and felt more satisfied, with slightly longer conversations. We would advise allowing more than 5 minutes, but
10 would probably be excessive.
It seems likely that the turn-taking style of our conversation interface slowed the students down. They were often
stuck waiting for their conversation partners to reply. However, since Cleverbot never speaks out of turn, we had to
enforce this style to avoid obvious distinctions between the
chatbot and the humans.
The students did not let their waiting time go entirely to
waste; they had lively exchanges with others in the same
computer lab. They shared quotes from their conversations
and indulged in speculation about their partners’ identities.
Overall, we thought that this enhanced their engagement in
the activity, although it did highlight the importance of placing students who are playing different roles into different
rooms.
A related question we considered is whether the software should have facilitated simultaneous split-screen conversations instead of a sequence of one-on-one conversations. Hugh Loebner favors the simultaneous style (Loebner 2009), and the more recent Loebner contests have been
structured that way. However, given the wide range of success our students had at recognizing each other as human,
we prefer the one-on-one style. Results with the simultaneous style would seem to confound multiple factors.
The fact that our test groups had such a wide range of
statistical results made us glad that we had 8 of them to analyze collectively. If we had only been able to conduct one
test, and it happened to produce results far from the median
in either direction, our students might have come away from
the activity with misconceptions about the status of modern
AI. Instructors with smaller groups of students could have
each student participate in more than one test, perhaps with
a different role each time.
One practical problem that we encountered was that our
students belonged to a shared community. There were a few
exchanges like this:
Judge: Describe your FYP male professor?
Confederate: One is tall, wears converse every other
day. He has a ponytail and a handlebar mustache. The
other is shorter with glasses and always wears a shirt
and tie to class with a vest.
Judge: are you excited for this upcoming break?
Confederate: very excited cant wait to head home on
friday and finally eat a home cooked meal.
Ideally, Turing Test participants would be a diverse set
of strangers, so that judges cannot exploit common knowledge to recognize confederates so easily. In the classroom
context, of course, we must settle for less ideal circumstances. We do think it helped that most conversations in
our tests were between students from different course sections. However, it would probably be best to have an explicit rule about concealing community membership as well
as personal identity.
We were pleased at the extent to which our students followed the rules we did have. It most likely helped that we
went over these rules multiple times and took care to explain
why they were necessary. Preparing students effectively
seems crucial for a successful Turing Test, and not only for
the purposes of rule enforcement: judges need to think about
their strategies in advance, or they may find themselves with
minds blank at the first empty chat screen.
We think it would be possible to prepare students too well
for a Turing Test. The more they know about chatbots and
their weaknesses in advance, the fewer classification errors
they are likely to make, and a test with very high accuracy
would not be as interesting or educational. Encouraging students to think about likely weaknesses and having them test
their speculations, without confirming or contradicting them
in advance, seems to us to strike the right balance.
Contact
Please contact ltorrey@stlawu.edu to acquire the Turing Test
software or for further information.
References
Carter, L. 2006. Why students with an apparent aptitude for computer science don’t choose to major in computer science. In Proceedings of the 37th Technical Symposium on Computer Science Education, 27–31.
Christian, B. 2011. The Most Human Human. Doubleday, New York, NY.
Hollister, J.; Parker, S.; Gonzalez, A.; and DeMara, R. 2013. An extended Turing test. Lecture Notes in Computer Science 8175:213–221.
Kim, J. 2010. Philosophy of Mind. Westview Press.
Lehman, J. 2009. Computer science unplugged: K-12 special session. Journal of Computing Sciences in Colleges 25(1):110–110.
Levesque, H. 2009. Is it enough to get the behavior right? In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 1439–1444.
Loebner, H. 2009. How to hold a Turing test contest. In Epstein, R.; Roberts, G.; and Beber, G., eds., Parsing the Turing Test. Springer. 173–179.
Moor, J. 2001. The status and future of the Turing test. Minds and Machines 11(1):77–93.
Prince, M. 2004. Does active learning work? A review of the research. Journal of Engineering Education 93(3):223–231.
Russell, S., and Norvig, P. 2009. Artificial Intelligence: A Modern Approach. Prentice Hall.
Searle, J. 1980. Minds, brains, and programs. Behavioral and Brain Sciences 3:417–457.
Shah, H., and Warwick, K. 2009. Testing Turing’s five minutes parallel-paired imitation game. Kybernetes Turing Test Special Issue 39(3):449–465.
Shieber, S. 1994. Lessons from a restricted Turing test. Communications of the ACM 37(6):70–78.
The Joint Task Force on Computing Curricula. 2013. Computer Science Curricula 2013. Association for Computing Machinery.
Turing, A. 1950. Computing machinery and intelligence. Mind 59(236):433–460.