Shaping in Speech Graffiti: results from the initial user study

Stefanie Tomko

Dialogs on Dialogs meeting

10 February 2006


Big picture (i.e. thesis statement)







A system of shaping and adaptivity can be used to induce more efficient user interactions with spoken dialog systems.

This strategy can increase efficiency by increasing the amount of user input that is actually understood by the system, leading to increased task completion rates and higher user satisfaction.

This strategy can also reduce upfront training time, thus accelerating the process of reaching optimally efficient interaction.


This study

[Flow diagram: user input → in Speech Graffiti (target) grammar? yes → result; no → shapeable (expanded grammar)? yes → shaping prompt → result; no → {confsig}]


My approach, graphically

[Flow diagram: user input → in Speech Graffiti (target) grammar? yes → result; no → shapeable (expanded grammar)? yes → shaping prompt → result; no → intelligent shaping help]
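The decision flow on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `route`, the two callables, and the toy grammars are assumptions made for the example, not part of the actual system.

```python
from typing import Callable, Optional, Tuple


def route(
    utterance: str,
    in_target: Callable[[str], bool],
    to_target: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return (action, text) for one recognized utterance.

    in_target: True if the utterance is in the target (Speech Graffiti) grammar.
    to_target: the target-grammar equivalent if the expanded grammar accepts
               the utterance, otherwise None.
    """
    if in_target(utterance):
        return ("result", utterance)
    equivalent = to_target(utterance)
    if equivalent is not None:
        # Shapeable: show the user the target form of what they said.
        return ("shaping_prompt", equivalent)
    # Not shapeable: fall back to help (the "This study" diagram
    # labels this branch {confsig} instead).
    return ("help", "")


# Toy stand-ins for the two grammars (hypothetical, for illustration only).
TARGET = {"genre is drama", "what is title"}
EXPANDED = {"what movies are playing": "what is title"}


def toy_in_target(u: str) -> bool:
    return u.lower() in TARGET


def toy_to_target(u: str) -> Optional[str]:
    return EXPANDED.get(u.lower())
```

With these toy grammars, `route("What movies are playing", toy_in_target, toy_to_target)` takes the shaping branch and returns the target equivalent `what is title`.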


Speech Graffiti





Standardized framework of syntax, keywords, and principles

Domain-specific vocabulary

User: Theater is Showcase North Theater

System: Showcase Cinemas Pittsburgh North

User: Genre is drama

System: Drama

User: What movies are playing?

System: {confsig} [an error beep, since previous utterance is not in grammar]

User: WHERE WAS I?

System: Theater is Showcase Cinemas Pittsburgh North, genre is drama

User: OPTIONS

System: You can specify or ask about title, show time, rating, {ellsig} [a 3-beep list continuation signal]

User: What is title?

System: 2 matches: Dark Water, War of the Worlds

User: START OVER

System: Starting over

User: Theater is Northway Mall Cinemas Eight

System: Northway Mall Cinemas 8

User: What is address?

System: 1 match: 8000 McKnight Road in Pittsburgh
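To make the structure of the example dialog concrete, here is a minimal sketch of a target-grammar handler. It is not the real Speech Graffiti implementation: the `Session` class, the regular expressions, and the slot list (taken from the slots mentioned in the dialog) are assumptions, and the sketch echoes values as spoken rather than normalizing them against a database.

```python
import re
from typing import Dict

# Slots mentioned in the example dialog; the real domain likely has more.
SLOTS = {"theater", "genre", "title", "show time", "rating", "address"}

QUERY = re.compile(r"^what is (?P<slot>[a-z ]+)$")
SPEC = re.compile(r"^(?P<slot>[a-z ]+?) is (?P<value>.+)$", re.IGNORECASE)


class Session:
    """Keeps the slot/value context built up over a dialog."""

    def __init__(self) -> None:
        self.context: Dict[str, str] = {}

    def handle(self, utterance: str) -> str:
        text = utterance.strip().rstrip("?").strip()
        lowered = text.lower()
        if lowered == "start over":
            self.context.clear()
            return "Starting over"
        if lowered == "where was i":
            return ", ".join(f"{s} is {v}" for s, v in self.context.items())
        query = QUERY.match(lowered)
        if query and query.group("slot") in SLOTS:
            return f"query: {query.group('slot')}"
        spec = SPEC.match(text)
        if spec and spec.group("slot").lower() in SLOTS:
            self.context[spec.group("slot").lower()] = spec.group("value")
            return spec.group("value")  # terse value-only confirmation
        return "{confsig}"  # out-of-grammar input gets the error signal
```

Feeding the first few user turns of the dialog through `Session.handle` reproduces its shape: specifications are confirmed tersely, "What movies are playing?" is rejected with `{confsig}`, and "WHERE WAS I?" reads back the accumulated context.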


Expanded grammar











Exploit the fact that users who know they are speaking to a limited-language system restrict their own input

Create a grammar that accepts more natural-language input than SG does

This grammar is opaque for users

Why have two grammars?



Lower-perplexity LMs → lower error rates



Some applications may be SG-only

Restriction: linear mapping from EXP input to TGT equivalent
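One way to picture the mapping restriction is as a list of rewrite rules, each turning one expanded-grammar phrasing directly into its target-grammar equivalent. The sketch below reads "linear" in that sense, which is an assumption; the rule patterns and the function name `to_target` are hypothetical examples, not the grammar used in the study.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical rewrite rules: (expanded-grammar pattern, target-grammar template).
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^what movies are playing$"), "what is title"),
    (re.compile(r"^where is (?:the )?theater$"), "what is address"),
    (re.compile(r"^i(?: would|'d) like (?:a |an )?(?P<v>\w+)(?: movie)?$"),
     "genre is {v}"),
]


def to_target(utterance: str) -> Optional[str]:
    """Return the target-grammar equivalent, or None if no rule applies."""
    text = utterance.strip().lower().rstrip("?.!").strip()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Named groups carry values over into the target form.
            return template.format(**match.groupdict())
    return None
```

Because every rule produces exactly one target string, the system can always tell the user what the Speech Graffiti form of their input would have been.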


Shaping strategy







Handle user input accepted by expanded grammar but not target

Balance current task success with future interaction efficiency

Baseline strategy – this study:





Confirm expanded grammar input with full, explicit slot+value confirmation

Give result if appropriate for query


Study participants











“Normal” adults, i.e. not CMU students

15 males, 14 females, aged 23-54

Native speakers of American Eng.

Little/no computer programming exp

New to Speech Graffiti


Study design







Between-subjects

3 conditions





non-shaping+tutorial (BT)

shaping+tutorial (ST)

shaping+no_tutorial (SN)

Tutorial



9-slide .ppt presentation



5 minutes


Study tasks









15 tasks

4 difficulty levels



# of slots to be specified/queried

40 minutes or when all tasks completed



Only one user did not get to attempt all 15 tasks in 40 minutes

Afterwards: SASSI questionnaire


Results

In short, the baseline shaping strategy didn’t have an effect.

[Chart: completed tasks, non-shaping vs. shaping]

Efficiency

[Charts: turns to completion and time to completion (in seconds), non-shaping vs. shaping]

Mean results from shaping subjects are only slightly better, a non-significant difference.


User satisfaction



Again, no significant differences

[Chart: user satisfaction (mean of means), axis 1 to 7, non-shaping vs. shaping]





No differences on individual SASSI factors

No efficiency/satisfaction differences between tutorial/non-tutorial, either


Grammaticality



How often did users speak within the target SG grammar?

From Q1 to Q4, both groups showed significant increases in TGT grammaticality

[Chart: TGT grammaticality in Q1 and Q4, non-shaping vs. shaping; axis 0 to 80]


Error rates - WER





For non-shaping: 39.9%



30.3% for grammatical utts



38.3% utt-level concept error

For shaping: a bit harder to figure, because of 2-pass ASR



Each shaping input generated a TGT hyp & an EXP hyp



Selection based on AM/LM score and a few simple heuristics
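The two-pass selection step might look roughly like the sketch below. The slide does not say which heuristics were used, so everything here is hypothetical: the `Hypothesis` fields, the "higher score is better" convention, the empty-hypothesis rule, and the `tgt_bonus` tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    grammar: str  # "TGT" or "EXP"
    text: str     # recognized word string; empty if the pass found nothing
    score: float  # combined AM/LM score; higher is better (assumed convention)


def select(tgt: Hypothesis, exp: Hypothesis, tgt_bonus: float = 0.0) -> Hypothesis:
    """Pick one hypothesis from the target-grammar and expanded-grammar passes."""
    # Hypothetical heuristic 1: an empty hypothesis never wins.
    if not tgt.text:
        return exp
    if not exp.text:
        return tgt
    # Hypothetical heuristic 2: optionally bias toward the target grammar.
    return tgt if tgt.score + tgt_bonus >= exp.score else exp
```

A positive `tgt_bonus` would favor target-grammar hypotheses when scores are close; whether the real system biased the choice in either direction is not stated in the source.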


Error rates – WER





Shaping:



For selected hypothesis: 37.3%





All TGT: 40.9%

All EXP: 64.2%

25.6% utt-level concept error
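For reference, word error rate is conventionally computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below pools errors over the whole corpus; whether the figures on these slides were pooled this way or averaged per utterance or per user is not stated, so treat that choice as an assumption.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum substitutions + deletions + insertions (Levenshtein over words)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.lower().split(), hyp.lower().split()
        errors += word_errors(ref_words, hyp_words)
        words += len(ref_words)
    return errors / words
```

For example, recognizing "theater is northway mall cinemas eight" as "theater is north mall cinemas" counts one substitution and one deletion against six reference words.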


So – what happened?



Shaping users had success with NLish input, and shaping prompts were not strong enough to change behavior.


Biggest problem





Using NL or slot-only query formats





My theory: the <slot> is <value> specification format is very structured.

what is <slot> sounds structured to me, but to users it sounds like <just ask a question!>

In new versions, query format will be list <slot>





Users don’t seem to have too much trouble adapting to a structure – but the structure needs to be clear.

Will also shape more explicitly by confirming with “I think you meant, ‘list movies’”



Also for more explicit shaping of specifications


Other problems





Not using start over to clear context

Confusion about semantics of location





Long utterances

Using next instead of more



Pacing



These will be addressed via targeted help messages


Current hang-up





Can we improve WER?



LM improvements?



COTS recognizer?

Dragon:







Using

Results

Issues


A little bit about trying DNS







Dragon NaturallySpeaking 8



Distribution from Jahanzeb

Set up for dictation – i.e. mic input



So, no telephone models

To compare with Sphinx







Test set of utterances from this study

Rerecorded with head mic (so, read) at 16kHz

Downsampled to 8kHz for Sphinx


More Dragon stuff



Two groups



TGT



Sphinx mean 56.4%

Worse than 8k telephone model (?)







Dragon mean 35.9%

Mean diff: Dragon 18.8pts less (ns)

EXP







Sphinx mean 68.5%

Dragon mean 45.4%

Mean diff: Dragon 22.3pts less (s)


More Dragon stuff











But – Dragon rates are not that different from original Sphinx WER rates



Sphinx WER in this test might be fishy

Setup seems tricky – can I still do 2-pass decoding?

Would need to change to mic setup

Black-box LM stuff



Mysterious adaptation? – not good for user studies!

So, sticking with Sphinx.
