Late-Breaking Developments in the Field of Artificial Intelligence
Papers Presented at the Twenty-Seventh AAAI Conference on Artificial Intelligence
Verbal IQ of a Four-Year Old Achieved by an AI System
Stellan Ohlsson and Robert H. Sloan
University of Illinois at Chicago
stellan | sloan @uic.edu
György Turán
University of Illinois at Chicago and University of Szeged
gyt@uic.edu
Aaron Urasky
University of Illinois at Chicago
aaron.urasky@gmail.com
Abstract
Verbal tasks that have traditionally been difficult for
computer systems but are easy for young children are
among AI’s “grand challenges”. We present the results of
testing the ConceptNet 4 system on the verbal part of the
standard WPPSI-III IQ test, using simple test-answering
algorithms. We find that the system has the Verbal IQ of
an average four-year-old child.
Overview and Background
The question of how machines might exhibit common sense
is implicit in (Turing 1950) and explicit in (McCarthy
1959). Common sense is “common knowledge about the
world that is possessed by every schoolchild and the
methods for making obvious inferences from this
knowledge” (Davis 1990). Nilsson (2009) quotes
Brooks’ challenges (Brooks 2008): the object-recognition
capabilities of a two-year-old, the language capabilities of a
four-year-old, the manual dexterity of a six-year-old, and
the social understanding of an eight-year-old.
Capturing language capabilities and common sense in a
computer system has turned out to be even harder than
capturing technical expertise. Researchers have come to
realize that a large knowledge base of facts is probably a
prerequisite for successful common-sense reasoning.
Attempts to build such a knowledge base include
ConceptNet/AnalogySpace (Speer, Havasi, and Lieberman
2008; Havasi, Speer, and Alonso 2007; Havasi et al. 2009),
Cyc (Lenat 1995), and Scone (Fahlman 2006).
We report on the Verbal IQ (VIQ) of ConceptNet 4 and
AnalogySpace, a hybrid of symbolic and statistical
systems that uses both a semantic net and Principal
Component Analysis. More information about it is
available at http://csc.media.mit.edu/docs/conceptnet/. (We
made a preliminary attempt to test Cyc as well but could
not obtain any positive results; we wish we could measure
the performance of Watson of Jeopardy! fame.)
We measure the system using a tool developed by
psychometricians: the IQ test, specifically the Wechsler
Preschool and Primary Scale of Intelligence, Third Edition
(WPPSI-III), which is designed to assess the intelligence of
young children (Wechsler 2002). Our general plan and
some preliminary results on made-up questions are given
in (Ohlsson et al. 2012). We report here on ConceptNet’s
performance on the WPPSI-III VIQ. As required, grading
was done by a psychologist (author Ohlsson). The results
depend on both ConceptNet and our algorithms that use it
to answer the questions.
We selected the five subtests for which the items could be
input into a computer system with a relatively direct
translation (ruling out, e.g., subtests where the child has
to work with blocks). These turn out to be the five subtests
from which the Verbal IQ is defined.
A WPPSI-III full-scale IQ is determined by a
Performance IQ (drawing, puzzle, and memory tasks) and
a VIQ; the IQs have mean 100 and standard deviation 15.
VIQ is determined by three of the five verbal subtests:
Information, Vocabulary, Word Reasoning,
Comprehension, and Similarities. The first three are
“core,” the last two are “supplemental.” The examiner may
choose any three subtests such that at least two are core.
Vocabulary items ask for word meanings, and
Information items ask for simple retrieval of facts.
Similarities items ask what two things (e.g., apples and
oranges) have in common. Word Reasoning items ask for a
word based on clues. In a Comprehension item, the task is
to give an explanation in response to a why-question such
as “Why do we keep ice cream in the freezer?”, going
beyond retrieval. (ConceptNet answered: “UsedFor keep
cold”.) The Comprehension subtest is often described as a
test of “common sense” as opposed to some other
cognitive ability.
Copyright © 2013, Association for the Advancement of Artificial
Intelligence.
intelligence of a young child. One area for improvement is
answering why-questions, a known hard problem in
question answering (Maybury 2004). We speculate that
better algorithms could raise the results to the average for a
five- or six-year-old, but something new would be needed to
answer Comprehension questions with the skill of a child
of eight.
Methods and Results
                  Subtests included in VIQ    Scoring regimen
Subtest           Standard   Best   Worst     Strict   Relaxed
Information          x        x       x         20       21
Word Reasoning       x                x          3        3
Vocabulary           x        x                 20       21
Similarities                  x                 24       37
Comp.                                 x          2        2
Table 1. Raw WPPSI-III verbal subscores obtained with the
ConceptNet system, using two different scoring regimens (see
text). Also shows which subscores are used in which
computations of VIQ reported in the text and in Figure 2.
Our objective was to explore the capabilities available in
ConceptNet to a non-expert user without a significant
knowledge-engineering effort. Thus, we wrote short
Python programs that feed items into ConceptNet using its
fairly rudimentary NLP tools with some added processing.
The system returns answers, each with a score that,
intuitively, reflects its degree of belief. The strict score
uses the top-scored answer, and the relaxed score uses the
best answer from among the top five.
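
The two scoring regimens can be illustrated with a minimal Python sketch. It is not the authors' code: the candidate representation, the function names, and the `grade` callback (a stand-in for the psychologist's grading) are hypothetical, and the item and answer key below are made up rather than taken from the WPPSI-III.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: each candidate is (answer_text, belief_score).
Candidate = Tuple[str, float]


def strict_score(candidates: List[Candidate], grade: Callable[[str], int]) -> int:
    """Points earned by the single top-scored answer (0 if there are no answers)."""
    if not candidates:
        return 0
    top_answer, _ = max(candidates, key=lambda c: c[1])
    return grade(top_answer)


def relaxed_score(candidates: List[Candidate], grade: Callable[[str], int], k: int = 5) -> int:
    """Points earned by the best-graded answer among the k top-scored answers."""
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max((grade(answer) for answer, _ in top_k), default=0)


# Made-up answer key and candidate list, for illustration only.
key = {"fruit": 2, "food": 1}
grade = lambda answer: key.get(answer, 0)
candidates = [("round", 0.9), ("food", 0.7), ("fruit", 0.6), ("red", 0.2)]

print(strict_score(candidates, grade))   # 0: the top-scored answer "round" earns nothing
print(relaxed_score(candidates, grade))  # 2: "fruit" is among the top five
```

By construction the relaxed score can never be lower than the strict score, since the top-scored answer is always among the top five.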
Raw scores, strict and relaxed, are given for each of the
five subtests in Table 1. The VIQ of ConceptNet, using the
standard choice of three subtests (Information, Vocabulary,
and Word Reasoning), is shown as a function of age in years
in Figure 1. Relaxed scoring leads to only slightly higher
scores; the only large difference is for Similarities.
If the testee is treated as 4 years old, the VIQ score is
average (VIQ = 100, based on scaled subscores
Information 10, Vocabulary 13, Word Reasoning 7), but at
an assumed age of 5 years the system scores somewhat
below average, with a VIQ of 88. At an assumed age of 7
years, it scores very far below average, with a VIQ of 72.
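
To relate these VIQ values to percentiles, one can use the IQ scale stated earlier (mean 100, standard deviation 15). The sketch below assumes the scores are normally distributed on that scale. This is an approximation: actual WPPSI-III percentiles are read from the norm tables in the test manual, and the function name is illustrative.

```python
from statistics import NormalDist

# IQ scale as stated in the text: mean 100, standard deviation 15.
# Treating the scale as exactly normal is an assumption of this sketch.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def viq_to_percentile(viq: float) -> float:
    """Percentage of the assumed normal IQ distribution lying below the given VIQ."""
    return 100 * IQ_SCALE.cdf(viq)


# VIQ values reported in the text for assumed ages 4, 5, and 7.
for age, viq in [(4, 100), (5, 88), (7, 72)]:
    print(f"assumed age {age}: VIQ {viq} -> percentile {viq_to_percentile(viq):.0f}")
```

Under this approximation a VIQ of 100 sits at the 50th percentile, 88 at about the 21st, and 72 at about the 3rd.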
The tester may choose, with constraints, which three
subtests to use to compute the VIQ. In Figure 2 we show
percentile results as a function of age for the standard
three subtests, the lowest-scoring permissible set of
subtests, and the highest-scoring permissible set, all with
strict scoring. These vary greatly, because ConceptNet’s
subscores have a wider range than a typical child’s.
Considered as a 4-year-old, the system is in the 21st, 50th,
and 79th percentile (worst, standard, best), while,
considered as a 7-year-old, the system falls below the 10th
percentile under all three choices of subtests.
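
The selection rule described earlier (any three of the five verbal subtests, at least two of them core) admits only a few legal sets. The following sketch enumerates them. The constant and function names are illustrative, and the code does not compute VIQ, which requires the scaled-score tables of the test manual.

```python
from itertools import combinations

# Core and supplemental verbal subtests, as named in the text.
CORE = ("Information", "Vocabulary", "Word Reasoning")
SUPPLEMENTAL = ("Comprehension", "Similarities")


def legal_subtest_sets():
    """All sets of three verbal subtests that contain at least two core subtests."""
    return [
        combo
        for combo in combinations(CORE + SUPPLEMENTAL, 3)
        if sum(name in CORE for name in combo) >= 2
    ]


for combo in legal_subtest_sets():
    print(", ".join(combo))
print(len(legal_subtest_sets()))  # 7: one all-core set plus 3 * 2 sets with one supplemental
```

The "standard", "best", and "worst" results correspond to three of these seven sets.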
Figure 1. WPPSI-III VIQ of ConceptNet as a function of
assumed age in years, computed using the three standard
subtests, using both strict and relaxed scoring.
(Axes: age of testee, 4 to 7; Verbal IQ score, 0 to 120.)
Figure 2. WPPSI-III percentile for VIQ as a function of age,
computed using the best possible, standard, and worst possible
legal choice of three subtests. Strict scoring was used in all cases.
(Axes: age of testee, 4 to 7; percentile, 0 to 90. Series:
Standard, Best three, Worst three.)
Acknowledgements
Ohlsson was supported, in part, by Award # N00014-09-10125 from the Office of Naval Research (ONR), US Navy.
Other authors were supported by NSF Grant CCF0916708.
Concluding Remarks
ConceptNet obtained an average VIQ for a 4-year-old. By
some definition of human intelligence, it has the
References
Brooks, R. 2008. I, Rodney Brooks, am a robot. IEEE Spectrum
45(6).
Davis, E. 1990. Representations of Commonsense Knowledge.
Morgan Kaufmann Pub.
Fahlman, S. 2006. Marker-passing inference in the Scone
knowledge-base system. Knowledge Science, Engineering and
Management:114–126.
Havasi, C.; Pustejovsky, J.; Speer, R.; and Lieberman, H. 2009.
Digital intuition: Applying common sense using dimensionality
reduction. Intelligent Systems, IEEE 24(4):24–35.
Havasi, C.; Speer, R.; and Alonso, J. 2007. ConceptNet 3: a
flexible, multilingual semantic network for common sense
knowledge. In Recent Advances in Natural Language Processing,
27–29. http://www.media.mit.edu/~jalonso/cnet3.pdf.
Lenat, D. B. 1995. Cyc: a large-scale investment in knowledge
infrastructure. Communications of the ACM 38(11):33–38.
Maybury, M. T. 2004. Question answering: an introduction. In
New Directions in Question Answering, 3–18.
McCarthy, J. 1959. Programs with common sense. In Proc.
Teddington Conf. on the Mechanization of Thought Processes.
http://www-formal.stanford.edu/jmc/mcc59.html.
Nilsson, N. J. 2009. The Quest for Artificial Intelligence.
Cambridge University Press.
Ohlsson, S.; Sloan, R. H.; Turán, G.; Uber, D.; and Urasky, A.
2012. An approach to evaluate AI commonsense reasoning
systems. In Twenty-Fifth International FLAIRS Conference.
http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS12/paper/viewPDFInterstitial/4399/4828.
Speer, R.; Havasi, C.; and Lieberman, H. 2008. AnalogySpace:
Reducing the dimensionality of common sense knowledge. In
Proceedings of AAAI.
Turing, A. M. 1950. Computing machinery and intelligence.
Mind 59(236):433–460.
Wechsler, D. 2002. WPPSI-III: Technical and Interpretative
Manual. The Psychological Corporation.