Addressing Uncertainty in Performance Measurement of Intelligent Systems

Raj Madhavan (1,2), Elena Messina (1), Hui-Min Huang (1), Craig Schlenoff (1)
(1) Intelligent Systems Division, National Institute of Standards and Technology (NIST)
(2) Institute for Systems Research (ISR), University of Maryland, College Park

Commercial equipment and materials are identified in this presentation in order to adequately specify certain procedures. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose. The views and opinions expressed are those of the presenter and do not necessarily reflect those of the organizations with which he is affiliated.
Measuring Performance of Intelligent Systems

- Performance evaluation, benchmarking, and standardization are critical enablers for wider acceptance and proliferation of existing and emerging technologies
- Crucial for fostering technology transfer and driving industry innovation
- Currently, no consensus or standards exist on:
  - key metrics for determining the performance of a system
  - objective evaluation procedures to quantitatively measure the performance of robotic systems against user-defined requirements
- The lack of ways to quantify and characterize the performance of technologies and systems has precluded researchers working towards a common goal from:
  - exchanging and communicating results,
  - inter-comparing robot performance, and
  - leveraging previous work, which could otherwise avoid duplication and expedite technology transfer.
Measuring Performance of Intelligent Systems

- The lack of ways to quantify and characterize technologies and systems also hinders the adoption of new systems
  - Users may be reluctant to try a new technology for fear of expensive failure:
    - Users don't trust claims by developers
    - There is a lack of knowledge about how to match a solution to a problem
  - Think of the "graveyards" of unused equipment in some places
Challenges in Measuring Performance of IS

- Diversity of applications and deployment scenarios for the IS
- Complexity of the intelligent system itself
  - software components
  - hardware components
  - interactions between components (system of systems)
- Lack of a well-defined mathematical foundation for dealing with uncertainty in a complex system
  - methods for computing performance measures and related uncertainties
  - techniques for combining uncertainties and making inferences based on those uncertainties
  - approaches for estimating uncertainties for predicted performance
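The slides do not name a specific method for the "combining uncertainties" item. One standard foundation is the GUM (Guide to the Expression of Uncertainty in Measurement) root-sum-of-squares propagation for independent inputs; the sketch below is only illustrative of that approach, and the range-measurement example numbers are invented:

```python
import math

def combined_standard_uncertainty(sensitivities, uncertainties):
    """GUM-style combination for independent input quantities:
    u_c = sqrt(sum((c_i * u_i)**2)), where c_i are sensitivity
    coefficients and u_i are standard uncertainties."""
    return math.sqrt(sum((c * u) ** 2 for c, u in zip(sensitivities, uncertainties)))

def expanded_uncertainty(u_c, k=2.0):
    """Expanded uncertainty U = k * u_c; k = 2 gives roughly 95%
    coverage for approximately normal distributions."""
    return k * u_c

# Hypothetical example: range r = v * t with v = 1.0 m/s (u_v = 0.05 m/s)
# over t = 30 s (u_t = 0.5 s); sensitivities are dr/dv = t and dr/dt = v.
u_c = combined_standard_uncertainty([30.0, 1.0], [0.05, 0.5])
print(round(u_c, 3))  # → 1.581 (metres)
```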
Uncertainty and Complexity

- Uncertainty and complexity are often closely related
- The ability to handle uncertainty and complexity is directly related to the levels of autonomy and performance
Autonomy Levels for Unmanned Systems (ALFUS) Framework

- Standard terms and definitions for characterizing the levels of autonomy of unmanned systems
- Metrics, methods, and processes for measuring the autonomy of unmanned systems
- Contextual Autonomous Capability
- http://www.nist.gov/el/isd/ks/autonomy_levels.cfm/ (Hui-Min Huang)
Addressing Uncertainty in Performance Measurement via Complexity

- In this context, the performance we are trying to measure is taken to mean successful completion of the mission
- Being able to handle higher levels of mission and environmental complexity results in higher system performance
- We can determine whether program-specific performance requirements are achievable

[Figure: Mobility example plotted on the three ALFUS axes. Mission Complexity/Uncertainty: UGV metrics (max speed/acceleration, endurance distance/duration, min turn/bank radius) and UGV team metrics (coordinate within team, coordinate with bystanders). Systems Complexity/Uncertainty: mobility subsystem → UMS → team organization. Environment Complexity/Uncertainty: flat, paved surface → unpaved surfaces → unknown terrain.]
Test Methods (1): Hurdle Test Method

The purpose of this test method is to quantitatively evaluate the vertical-step surmounting capabilities of a robot, including variable chassis configurations and coordinated behaviors, while it is remotely teleoperated in confined areas under lighted and dark conditions.

Metrics:
- Maximum elevation (cm) surmounted over 10 repetitions
- Average time per repetition

Robot size classes:

| Size class | Weight (kg) | Length (cm) |
|------------|-------------|-------------|
| A          | <20         | <50         |
| B          | 20-40       | 50-90       |
| C          | 40-70       | 90-130      |
| D          | 70-100      | 130-170     |

Successful attempts in 10 repetitions, by locomotion type and obstacle height:

| Locomotion type                | 10 cm | 20 cm | 30 cm | 40 cm | 50 cm |
|--------------------------------|-------|-------|-------|-------|-------|
| Skid-steer wheels, 1 actuator  | 10    | 10    | 0     | 0     | 0     |
| Skid-steer tracks, 2 actuators | 10    | 10    | 0     | 0     | 0     |
| Skid-steer tracks, 0 actuators | 10    | 10    | 0     | 0     | 0     |
| Skid-steer wheels, 4 actuators | 10    | 10    | 10    | 10    | 0     |

- Hurdle Test Method results: numbers indicate successful repetitions. A score of 10 corresponds to a reliability of 80% (probability of success) that the robot can successfully perform the task at the associated apparatus setting.
- Measurement uncertainty (in measuring obstacle-traverse capability): one half of the obstacle-size increment (5 cm) and of the elapsed-time unit (30 s).
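The stated link between 10/10 successful repetitions and 80% reliability is consistent with an exact (Clopper-Pearson) lower confidence bound on the success probability. The slides do not spell out the statistical procedure, so the following is a hedged sketch of that interpretation:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def reliability_lower_bound(successes, trials, confidence=0.90):
    """Exact (Clopper-Pearson) lower confidence bound on the success
    probability: the p solving P(X >= successes | p) = 1 - confidence,
    found by bisection."""
    if successes == 0:
        return 0.0
    alpha, lo, hi = 1.0 - confidence, 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_sf(successes, trials, mid) < alpha:
            lo = mid  # p too small: this many successes would be unlikely
        else:
            hi = mid
    return lo

# 10 successes in 10 repetitions, at 90% confidence:
print(round(reliability_lower_bound(10, 10, 0.90), 2))  # → 0.79
```

At 90% confidence, 10/10 successes demonstrates roughly the 80% reliability the slide quotes.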
Comms Example

[Figure: Communications example plotted on the three ALFUS axes. Mission Complexity/Uncertainty (requirements): traverse and perform the comms task → project remote situational awareness from down range → communicate effectively throughout the mission. Systems Complexity/Uncertainty: comms subsystem → UMS → UMS team (with comms plan). Environment Complexity/Uncertainty: flat, paved surface without objects in the surroundings → UMS operating area → EMI presence.]
Test Methods (2): Radio Comms (LOS) Test Method

The purpose of this test method is to quantitatively evaluate the line-of-sight (LOS) radio communications range for a remotely teleoperated robot.

Metric:
- Maximum distance (m) downrange at which the robot completes tasks that verify the functionality of control, video, and audio transmissions.

Line-of-Sight Radio Comms Test Method: stations are placed every 100 m for testing two-way communications. Multiple test tasks at each station provide repeatability.
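The metric reduces to finding the farthest contiguous station at which every verification task passed. A minimal sketch, with hypothetical pass/fail data (not from the slides):

```python
def max_validated_range(station_results, station_spacing_m=100):
    """station_results[i] holds pass/fail outcomes for the verification
    tasks (control, video, audio) at the station (i + 1) * spacing metres
    downrange. The validated LOS range is the farthest contiguous
    station at which every task passed."""
    max_range = 0
    for i, tasks in enumerate(station_results):
        if all(tasks.values()):
            max_range = (i + 1) * station_spacing_m
        else:
            break
    return max_range

# Hypothetical run: all tasks pass at 100 m and 200 m; video drops at 300 m.
results = [
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": True, "audio": True},
    {"control": True, "video": False, "audio": True},
]
print(max_validated_range(results))  # → 200
```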
SCORE

SCORE (System, Component and Operationally Relevant Evaluations):
- A unified set of criteria and software tools for defining a performance-evaluation approach for complex intelligent systems
- Provides a comprehensive evaluation blueprint that assesses the technical performance of a system, its components, and its capabilities by isolating and changing variables, as well as by capturing end-user utility of the system in realistic use-case environments

Definitions:
- System: a set of interacting or interdependent components forming an integrated whole intended to accomplish a specific goal
- Component: a constituent part or feature of a system that contributes to its ability to accomplish a goal
- Capability: a specific purpose or functionality that the system is designed to accomplish
- Technical performance: metrics related to quantitative factors (such as accuracy, precision, time, and distance) required to meet end-user expectations
- Utility assessment: metrics related to qualitative factors that gauge the quality or condition of being useful to the end-user
How SCORE Handles Complexity

- The complexity of the "system under test" grows as more components are introduced into the evaluation
- Components evaluated in the elemental tests are less complex than sub-systems (which contain multiple components), which in turn are less complex than the whole system
- SCORE tests at each of these levels of complexity
- Data in the following slides indicate that the results of the elemental tests can accurately predict the performance of the (more complex) subsystem test, and so on.
TRANSTAC

- GOAL: Demonstrate capabilities to rapidly develop and field free-form, two-way speech-to-speech translation systems enabling English and foreign-language speakers to communicate with one another in real-world tactical situations.
- NIST was funded over the past three years to serve as the Independent Evaluation Team for this effort.
- METRICS (as specified by DARPA):
  - System usability testing: providing overall scores for the capabilities of the whole system
  - Software component testing: evaluating components of a system to see how well they perform in isolation
TRANSTAC: A Quick Tutorial on Speech Translation

[Figure: the speech-translation pipeline. Spoken English input ("Please open the car door.") → Automatic Speech Recognition (ASR) produces the English text → Machine Translation (MT) produces the Arabic text "يرجى فتح باب السيارة" ("Please open the car door") → Text-To-Speech (TTS) produces the spoken output.]
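The pipeline is a straightforward composition of three stages. A minimal sketch of the data flow, using toy stand-ins rather than any team's actual API:

```python
# Toy stand-ins for the three stages; real TRANSTAC systems use
# statistical ASR/MT/TTS models, so these only illustrate the data flow.
def asr(utterance_audio):
    """Speech recognition: audio -> English text (stubbed via a transcript)."""
    return utterance_audio["transcript"]

def mt(english_text, lexicon):
    """Machine translation: English -> target language, stubbed as
    word-by-word lookup in a tiny hypothetical lexicon."""
    return " ".join(lexicon.get(word, word) for word in english_text.split())

def tts(target_text):
    """Text-to-speech: target text -> audio (stubbed as a dict)."""
    return {"synthesized_from": target_text}

def translate_speech(utterance_audio, lexicon):
    """ASR -> MT -> TTS, exactly the composition in the diagram."""
    return tts(mt(asr(utterance_audio), lexicon))

lexicon = {"open": "iftah", "door": "bab"}  # hypothetical romanized entries
out = translate_speech({"transcript": "open the door"}, lexicon)
print(out["synthesized_from"])  # → iftah the bab
```

The point of the composition is that each stage's errors propagate downstream, which is why the metrics on the next slide evaluate the stages both in isolation and end-to-end.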
TRANSTAC Metrics

- Automated metrics: For speech recognition, we calculated Word Error Rate (WER). For machine translation, we calculated BLEU and METEOR.
- TTS evaluation: Human judges listened to the audio outputs of the TTS evaluation and compared them to the text string that was fed into the TTS engine. They then gave a Likert score to indicate how understandable the audio file was. WER was also used to judge the TTS output.
- Low-level concept transfer: A directly quantitative measure of the transfer of the low-level elements of meaning. In this context, a low-level concept is a specific content word (or words) in an utterance. For example, the phrase "The house is down the street from the mosque." is one high-level concept, but is made up of three low-level concepts (house, down the street, mosque).
- Likert judgment: A panel of bilingual judges rated the semantic adequacy of the translations, one utterance at a time, choosing from a seven-point scale.
- High-level concept transfer: The number of utterances judged to have been successfully transferred. The high-level concept metric is an efficiency metric that shows the number of successful utterances per unit of time, as well as accuracy.
- Surveys/semi-structured interviews: After each live scenario, the Soldiers/Marines and the foreign-language speakers filled out a detailed survey about their experiences with the TRANSTAC systems. In addition, semi-structured interviews were conducted with all participants, exploring questions such as "What did you like?", "What didn't you like?", and "What would you change?".
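For reference, WER is the word-level edit distance between the reference transcript and the ASR hypothesis, normalized by reference length. A minimal sketch of the standard dynamic-programming computation (not NIST's actual scoring tool):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level edit (Levenshtein) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one insertion ("front") against 5 reference words:
print(word_error_rate("please open the car door",
                      "please open car front door"))  # → 0.4
```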
TRANSTAC Complexity

| SCORE Level          | Metric                      | Team 1 | Team 2 | Team 3 |
|----------------------|-----------------------------|--------|--------|--------|
| Elemental            | BLEU                        | 1      | 2      | 2      |
| Elemental            | METEOR                      | 1      | 2      | 2      |
| Elemental            | TTS                         | 1      | 1      | 2      |
| Sub-System           | Low-level Concept Transfer  | 1      | 2      | 2      |
| System               | Likert Judgment             | 1      | 2      | 2      |
| System               | High-level Concept Transfer | 1      | 2      | 3      |
| System (Qualitative) | User Surveys                | 1      | 2      | 3      |
From these data, it appears that:
- the quantitative performance of the elements of the systems has a direct correlation with the quantitative performance of the subsystems;
- the quantitative performance of the sub-systems has a direct correlation with the quantitative performance of the overall system;
- the quantitative performance of the overall system has a direct correlation with the qualitative perception of the soldiers using the systems.
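"Direct correlation" between levels can be quantified with a rank correlation over the team orderings. With only three teams this is purely illustrative, and the sketch below (the standard Spearman statistic with tie-averaged ranks) is not part of the original analysis:

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Team rankings from the table: elemental BLEU vs. system-level User Surveys.
print(round(spearman_rho([1, 2, 2], [1, 2, 3]), 3))  # → 0.866
```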
In Conclusion …
Thank you!