Stefanie Tomko
Dialogs on Dialogs meeting
10 February 2006
1
A system of shaping and adaptivity can be used to induce more efficient user interactions with spoken dialog systems.
This strategy can increase efficiency by increasing the amount of user input that is actually understood by the system, leading to increased task completion rates and higher user satisfaction.
This strategy can also reduce upfront training time, thus accelerating the process of reaching optimally efficient interaction.
2
[Flowchart: user input → Speech Graffiti (target)? yes → result; no → shapeable (expanded)? yes → shaping prompt, then result; no → {confsig}]
3
[Flowchart: user input → Speech Graffiti (target)? yes → result; no → shapeable (expanded)? yes → shaping prompt, then result; no → intelligent shaping help]
4
Standardized framework of syntax, keywords, and principles
Domain-specific vocabulary
Theater is Showcase North Theater
Showcase Cinemas Pittsburgh North
Genre is drama
Drama
What movies are playing?
{confsig} [an error beep, since previous utterance is not in grammar]
WHERE WAS I?
Theater is Showcase Cinemas Pittsburgh North, genre is drama
OPTIONS
You can specify or ask about title, show time, rating, {ellsig} [a 3-beep list continuation signal]
What is title?
2 matches: Dark Water, War of the Worlds
START OVER
Starting over
Theater is Northway Mall Cinemas Eight
Northway Mall Cinemas 8
What is address?
1 match: 8000 McKnight Road in Pittsburgh
5
Exploit the fact that users who know they are speaking to a limited-language system restrict their input
Create a grammar that accepts more natural-language input than Speech Graffiti (SG)
This grammar is opaque for users
Why have two grammars?
Lower-perplexity LMs → lower error rates
Some applications may be SG-only
Restriction: linear mapping from EXP input to TGT equivalent
6
Handle user input accepted by expanded grammar but not target
Balance current task success with future interaction efficiency
Baseline strategy – this study:
Confirm expanded grammar input with full, explicit slot+value confirmation
Give result if appropriate for query
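A minimal sketch of the two-grammar flow and the baseline strategy described above. The regex "grammars", the slot list, and the function names are all hypothetical stand-ins, not the actual Speech Graffiti grammars. Each expanded (EXP) rule maps directly to one target (TGT) form, which is one way to read the "linear mapping" restriction.

```python
import re

# Hypothetical toy slot inventory; the real grammars are far larger.
SLOTS = ("theater", "genre", "title", "address")

# Target (TGT) forms: "<slot> is <value>" and "what is <slot>".
TGT_SPEC = re.compile(r"^(?P<slot>\w+) is (?P<value>.+)$")
TGT_QUERY = re.compile(r"^what is (?P<slot>\w+)$")

# Expanded (EXP) forms, each paired with a template for its TGT equivalent.
# These two rules are invented for illustration only.
EXP_RULES = [
    (re.compile(r"^what(?:'s| is) the (?P<slot>\w+)$"), "what is {slot}"),
    (re.compile(r"^i(?: would like|'d like| want) (?:a |an )?(?P<value>\w+) movie$"),
     "genre is {value}"),
]


def in_target(utt):
    """Return True if the utterance is a TGT query or specification."""
    m = TGT_QUERY.match(utt)
    if m and m["slot"] in SLOTS:
        return True
    m = TGT_SPEC.match(utt)
    return bool(m and m["slot"] in SLOTS)


def to_target(utt):
    """Map an EXP utterance to its TGT equivalent, or None if no rule applies."""
    for pattern, template in EXP_RULES:
        m = pattern.match(utt)
        if m:
            return template.format(**m.groupdict())
    return None


def respond(utt):
    """Route one utterance: TGT gives a result, EXP gives an explicit
    confirmation in TGT form plus the result, anything else gets {confsig}."""
    utt = utt.lower().strip(" ?.!")
    if in_target(utt):
        return ("result", utt)
    tgt = to_target(utt)
    if tgt is not None and in_target(tgt):
        return ("confirm+result", tgt)
    return ("confsig", None)
```

With these toy rules, "What's the address?" is confirmed as "what is address", while "What movies are playing?" falls outside both grammars and gets the error signal, as in the sample dialog.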
7
“Normal” adults, i.e. not CMU students
15 males, 14 females, aged 23-54
Native speakers of American English
Little/no computer programming experience
New to Speech Graffiti
8
Between-subjects
3 conditions
non-shaping+tutorial (BT)
shaping+tutorial (ST)
shaping+no_tutorial (SN)
Tutorial
9-slide .ppt presentation
5 minutes
9
15 tasks
4 difficulty levels
# of slots to be specified/queried
40 minutes or when all tasks completed
Only one user did not get to attempt all 15 tasks in 40 minutes
Afterwards: SASSI questionnaire
10
In short, the baseline shaping strategy didn’t have an effect
[Figure: completed tasks, non-shaping vs. shaping]
Efficiency
[Figure: turns to completion, non-shaping vs. shaping]
[Figure: time to completion, in seconds, non-shaping vs. shaping]
Mean results from shaping subjects are only slightly better – non-significant
11
Again, no significant differences
[Figure: user satisfaction (mean of means), 1-7 scale, non-shaping vs. shaping]
No differences on individual SASSI factors
No efficiency/satisfaction differences between tutorial/non-tutorial, either
12
How often did users speak within the target SG grammar?
From Q1 to Q4, both groups showed significant increases in TGT gram
[Figure: rate of utterances within the TGT grammar, Q1 and Q4, non-shaping vs. shaping; axis 0-80]
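A rough sketch of how a per-quartile in-grammar rate could be computed. It assumes Q1 to Q4 are the four quarters of each user's utterances in chronological order, which the slides do not state. The function name and the input format are hypothetical.

```python
def in_grammar_rate_by_quartile(flags):
    """Split a chronological list of booleans (True = utterance was in the
    TGT grammar) into four quarters and return the in-grammar rate of each."""
    n = len(flags)
    if n < 4:
        raise ValueError("need at least four utterances")
    rates = []
    for q in range(4):
        # Integer arithmetic keeps the quarters contiguous and non-empty.
        chunk = flags[q * n // 4:(q + 1) * n // 4]
        rates.append(sum(chunk) / len(chunk))
    return rates
```

Comparing the first and last values for each user gives the Q1-to-Q4 change that the slide reports as increasing for both groups.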
13
For non-shaping: 39.9%
30.3% for grammatical utts
38.3% utt-level concept error
For shaping: a bit harder to figure, because of 2-pass ASR
Each shaping input generated a TGT hypothesis and an EXP hypothesis
Selection based on AM/LM score and a few simple heuristics
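A sketch of what score-based selection between the two hypotheses might look like. The slides do not say how the AM and LM scores were combined or what the heuristics were, so the weighted sum and the margin in favor of TGT are purely illustrative assumptions. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    grammar: str     # "TGT" or "EXP"
    text: str
    am_score: float  # acoustic model log score (higher is better)
    lm_score: float  # language model log score (higher is better)


def select(tgt, exp, lm_weight=1.0, tgt_margin=0.0):
    """Pick one hypothesis from the two decoding passes.

    Assumed rule: compare AM + weighted LM scores, and keep the TGT
    hypothesis unless EXP wins by more than tgt_margin.
    """
    tgt_total = tgt.am_score + lm_weight * tgt.lm_score
    exp_total = exp.am_score + lm_weight * exp.lm_score
    return exp if exp_total > tgt_total + tgt_margin else tgt
```

The margin is one simple way to express a preference for target-grammar output. The actual system may have used entirely different heuristics.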
14
Shaping:
For selected hypothesis: 37.3%
All TGT: 40.9%
All EXP: 64.2%
25.6% utt-level concept error
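The percentages on this slide and the previous one appear to be word error rates (later slides refer to them as WER). As a reminder of how that metric is usually computed, here is a minimal sketch using word-level edit distance. It is not the scoring tool used in the study, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.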
15
Shaping users had success with NLish input, and shaping prompts were not strong enough to change behavior.
16
Using NL or slot-only query formats
My theory: the <slot> is <value> specification format is very structured.
what is <slot> sounds structured to me, but to users it sounds like "just ask a question!"
In new versions, query format will be list <slot>
Users don’t seem to have too much trouble adapting to a structure – but the structure needs to be clear.
Will also shape more explicitly by confirming with “I think you meant, ‘list movies’”
Also for more explicit shaping of specifications
17
Not using start over to clear context
Confusion about semantics of location
Long utterances
Using next instead of more
Pacing
These will be addressed via targeted help messages
18
Can we improve WER?
LM improvements?
COTS recognizer?
Dragon:
Using
Results
Issues
19
Dragon NaturallySpeaking 8
Distribution from Jahanzeb
Set up for dictation – i.e. mic input
So, no telephone models
To compare with Sphinx
Test set of utterances from this study
Rerecorded with head mic (so, read) at 16kHz
Downsampled to 8kHz for Sphinx
20
Two groups
TGT
Sphinx mean 56.4% (worse than 8k telephone model?)
Dragon mean 35.9%
Mean diff: Dragon 18.8 pts lower (not significant)
EXP
Sphinx mean 68.5%
Dragon mean 45.4%
Mean diff: Dragon 22.3 pts lower (significant)
21
But: Dragon rates are not that different from the original Sphinx WERs
Sphinx WER in this test might be fishy
Setup seems tricky – can I still do 2-pass decoding?
Would need to change to mic setup
Black-box LM stuff
Mysterious adaptation? – not good for user studies!
So, sticking with Sphinx.
22