Getting a Piece of the Action (Items)

Detecting Action Items in Multi-Party Meetings: Annotation and Initial Experiments
Matthew Purver, Patrick Ehlen, John Niekrasz
Computational Semantics Laboratory
Center for the Study of Language and Information
Stanford University
The CALO Project

Multi-institution, multi-disciplinary project
Working towards an intelligent personal assistant that learns
Three major areas:
– managing personal data (clustering email, documents, managing contacts)
– assisting with task execution (learning to carry out computer-based tasks)
– observing interaction in meetings
The CALO Meeting Assistant

Observe human-human meetings
– Audio recording & speech recognition (ICSI/CMU)
– Video recording & processing (MIT/CMU)
– Written notes, via digital ink (NIS) or typed (CMU)
– Whiteboard sketch recognition (NIS)
Produce a useful record of the interaction
– answer questions about what happened
– can be used by attendees or non-attendees
Learn to do this better over time (LITW)
The CALO Meeting Assistant

Primary focus on the end-user
Develop something that can really help people deal with all of the meetings we have to deal with
What do people want to know from meetings?

Banerjee et al. (2005) survey of 12 academics:
– Missed meeting - what do you want to know?
– Topics: which were discussed, what was said?
– Decisions: what decisions were made?
– Action items/tasks: was I assigned something?
Lisowska et al. (2004) survey of 28 people:
– What would you ask a meeting reporter system?
– Similar responses about topics, decisions
– Who attended, who asked/decided what?
– Did they talk about me?

Purpose


Helpful system: not only records and transcribes a meeting, but extracts (from streams of potentially messy human-human speech):
– topics discussed
– decisions made
– tasks assigned (“action items”)
The system should highlight this information over meeting “noise”
Example

Impromptu meeting you might have after your team has boarded a rebel spacecraft in search of stolen plans, and you’re trying to figure out what to do next
“Commander, tear this ship apart until you’ve found those plans!”
A section of discourse in a meeting where someone is made responsible to take care of something
Action Items



Concrete decisions; public commitments to be responsible for a particular task
Want to know:
– Can we find them?
– Can we produce useful descriptions of them?
Not aware of previous discourse-based work
Action Item Detection in Email

Corston-Oliver et al. (2004)
Marked a corpus of email with “dialogue acts”
Task act:
– “items appropriate to add to an ongoing to-do list”
Good inter-annotator agreement (kappa > 0.8)
Per-sentence classification using SVMs (see the sketch below)
– lexical features, e.g. n-grams; punctuation; message features
– f-scores around 0.6
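
As a concrete illustration of this setup, here is a minimal sketch (not the authors' code) of per-sentence binary classification with lexical n-gram features and a linear SVM, in the spirit of Corston-Oliver et al. (2004). The toy sentences and labels are invented purely for illustration.

```python
# Minimal sketch: per-sentence binary classification with n-gram features
# and a linear SVM. Toy data only; not the original email corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "could you send me the updated slides by friday",
    "i think the weather was nice last week",
    "let's schedule a follow-up meeting next tuesday",
    "that talk was quite interesting",
]
labels = [1, 0, 1, 0]  # 1 = task/action-item sentence, 0 = other

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram lexical features
    LinearSVC(),
)
clf.fit(sentences, labels)
print(clf.predict(["please review the draft by monday"]))
```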
A First Try: Flat Annotation

Gruenstein et al. (2005) analyzed 65 meetings annotated from:
– ICSI Meeting Corpus (Janin et al., 2003)
– ISL Meeting Corpus (Burger et al., 2002)
Two human annotators
“Mark utterances relating to action items”
– create groups of utterances for each AI
– made no distinction between utterance type/role
A First Try: Flat Annotation (cont’d)




The two annotators identified 921 and 1,267 action item-related utterances respectively
Human agreement poor (κ < 0.4; a kappa sketch follows below)
Tried binary classification using SVMs (like Corston-Oliver)
Precision, recall, f-score: all below 0.25
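
For reference, a minimal sketch of the kind of agreement measure quoted above: Cohen's kappa computed over per-utterance binary judgements from two annotators. The label sequences are invented, not the actual annotations.

```python
# Minimal sketch: inter-annotator agreement via Cohen's kappa over
# per-utterance binary labels (1 = action-item-related). Toy labels only.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 0, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 0, 0, 0, 0]
print(cohen_kappa_score(annotator_a, annotator_b))
```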
Try a more restricted dataset?

Sequence of 5 (related) CALO meetings
– similar amount of ICSI/ISL data for training
Same annotation schema
SVMs with words & n-grams as features
– also tried other discriminative classifiers, and 2- & 3-grams, with no improvements
Similar performance
– improved f-scores (0.30 - 0.38), but still poor
– recall up to 0.67, precision still low (< 0.36)
Should we be surprised?


Our human annotator agreement was poor
DAMSL schema has dialogue acts Commit and Action-directive (Core & Allen, 1997)
– annotator agreement poor (κ ~ 0.15)
ICSI MRDA has a commit dialogue act
– most DA tagging work concentrates on 5 broad DA classes
Perhaps “action items” comprise a more heterogeneous set of utterances
Rethinking Action Item Acts





Maybe action items are not aptly described as singular “dialogue acts”
Rather: multiple people making multiple contributions of several types
Action item-related utterances represent a form of group action, or social action
That social action has several components, giving rise to a heterogeneous set of utterances
What are those components?

“Commander, tear this ship apart until you’ve found those plans!”
• A person commits or is committed to “own” the action item
• A description of the task itself is given
• A timeframe is specified

“Yes, Lord Vader!”
• Some form of agreement
Exploiting discourse structure

Action items have distinctive properties
– task description, owner, timeframe, agreement
Action item utterances can simultaneously play different roles
– assigning properties
– agreeing/committing
These classes may be more homogeneous & distinct than just looking for “action item” utterances
– could improve classification performance
New annotation schema


Annotated and classified again using the new schema
Classify utterances by their role in the action item discourse
– an utterance can play more than one role
Define action items by grouping subclass utterances together into an action-item discussion (a representation sketch follows below)
– a subclass can be missing
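
A minimal sketch of one way the schema could be represented in code; the class and field names are illustrative assumptions, not the NOMOS format. It captures the two points above: an utterance may carry several roles, and a discussion may lack some subclasses.

```python
# Minimal sketch: utterances carry a set of roles; an action-item
# discussion groups such utterances and may be missing some subclasses.
from dataclasses import dataclass, field
from typing import List, Set

ROLES = {"owner", "description", "timeframe", "agreement"}

@dataclass
class Utterance:
    text: str
    roles: Set[str] = field(default_factory=set)  # an utterance can play several roles

@dataclass
class ActionItemDiscussion:
    utterances: List[Utterance] = field(default_factory=list)

    def roles_present(self) -> Set[str]:
        present = set()
        for u in self.utterances:
            present |= u.roles
        return present

ai = ActionItemDiscussion([
    Utterance("jack i'd like you to come back to me", {"owner"}),
    Utterance("with the details on the printer and server", {"description"}),
    Utterance("okay", {"agreement"}),
])
print(ROLES - ai.roles_present())  # timeframe is missing here, which the schema allows
```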
Action Item discourse: an example
New Experiment


Annotated the same set of CALO/ICSI/ISL data using the new schema
Trained classifiers to identify utterances that contain each of the 4 subclasses
Encouraging signs


Between-class distinction (cosine distances; see the sketch below)
– Agreement vs. any other is good: 0.05 to 0.12
– Timeframe vs. description is OK: 0.25
– Owner/timeframe/description: 0.36 to 0.47
Improved inter-annotator agreement?
– Timeframe: κ = 0.86
– Owner: κ = 0.77; agreement & description: κ = 0.73
– Warning: this is only on one meeting, although it’s the most difficult one we could find
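
To make the between-class comparison concrete, a small sketch of computing cosine distances (1 minus cosine similarity) between bag-of-words vectors built from the utterances of each subclass. The toy utterance strings are invented and will not reproduce the figures above.

```python
# Minimal sketch: pairwise cosine distances between per-subclass
# bag-of-words vectors. Toy text only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_texts = {
    "agreement":   "okay yeah sure sounds good okay",
    "owner":       "jack i'd like you to take care of that",
    "timeframe":   "by the start of week three in a couple of days",
    "description": "the details on the printer and the server",
}
vec = CountVectorizer()
X = vec.fit_transform(list(class_texts.values()))
sim = cosine_similarity(X)
names = list(class_texts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], round(1 - sim[i, j], 2))  # cosine distance
```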
Combined classification

Still don’t have enough data for proper combined classification
– recall 0.3 to 0.5, precision 0.1 to 0.5
– agreement subclass is best, with f-score = 0.40
Overall decision based on sub-classifier outputs
Ad-hoc heuristic (sketched below):
– prior context window of 5 utterances
– agreement plus one other class
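
A minimal sketch of the ad-hoc heuristic just described: fire a detection when an agreement utterance co-occurs with at least one other subclass within a prior window of 5 utterances. Function and variable names are illustrative, not from the CALO implementation.

```python
# Minimal sketch: agreement plus at least one other subclass within a
# prior context window of 5 utterances triggers an action-item detection.
WINDOW = 5

def detect_action_items(utterances):
    """utterances: ordered list of (text, predicted_subclass_set) pairs."""
    hits = []
    for i, (_, classes) in enumerate(utterances):
        if "agreement" not in classes:
            continue
        window = utterances[max(0, i - WINDOW):i + 1]
        others = set()
        for _, c in window:
            others |= c
        others -= {"agreement"}
        if others:  # agreement plus one other class nearby
            hits.append((i, others))
    return hits

example = [
    ("jack i'd like you to", {"owner"}),
    ("come back with the details", {"description"}),
    ("by the start of week three", {"timeframe"}),
    ("okay", {"agreement"}),
]
print(detect_action_items(example))
```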
Questions we can ask

Does overall classification look useful?
– Whole-AI-based f-score 0.40 to 1.0 (one meeting perfectly correlated with human annotation)
Does overall output improve sub-classifiers?
– Agreement: f-score 0.40 → 0.43
– Timescale: f-score 0.26 → 0.07
– Owner: f-score 0.12 → 0.24
– Description: f-score 0.33 → 0.24
Example output

From a CALO meeting (t = timeframe, o = owner, d = description, a = agreement):
t = [the, start, of, week, three, just, to]
o = [reconfirm, everything, and, at, that, time, jack, i'd, like, you, to, come, back, to, me, with, the]
d = [the, details, on, the, printer, and, server]
a = [okay]

Another (less nice?) example:
o = [/h#/, so, jack, /uh/, for, i'd, like, you, to]
d = [have, one, more, meeting, on, /um/, /h#/, /uh/]
t = [in, in, a, couple, days, about, /uh/]
a = [/ls/, okay]
Where next for action items?

More data annotation
– using NOMOS, our annotation tool
Meeting browser to get user feedback
Improved individual classifiers
Improved combined classifier
– maximum entropy model (see the sketch below)
– not enough data yet
Moving from words to symbolic output
– Gemini (Dowding et al., 1990) bottom-up parser
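
As a rough sketch of what the planned maximum entropy combined classifier might look like: logistic regression (a maximum entropy model) over features derived from the four sub-classifier outputs. The feature vectors and labels below are invented confidence scores, purely for illustration.

```python
# Minimal sketch: maximum entropy (logistic regression) combination of
# the four sub-classifier outputs. Toy confidence scores only.
from sklearn.linear_model import LogisticRegression

# each row: [owner, description, timeframe, agreement] sub-classifier scores
# for a context window; label 1 = window contains an action item
X = [
    [0.8, 0.7, 0.2, 0.9],
    [0.1, 0.2, 0.1, 0.1],
    [0.6, 0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3, 0.2],
]
y = [1, 0, 1, 0]

maxent = LogisticRegression()  # logistic regression == a maximum entropy model
maxent.fit(X, y)
print(maxent.predict([[0.7, 0.8, 0.1, 0.9]]))
```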