Recording Meetings with the CMU Meeting Recorder Architecture
Satanjeev Banerjee, et al.
School of Computer Science
Carnegie Mellon University
Goals

- End goal: Build conversational agents
  - That “understand” meetings
  - Make contributions to meetings
    - E.g.: Identify action items
    - E.g.: Confirm details of action items
- Part of Project CALO: Cognitive Agent that Learns and Organizes
- First goal: Create a corpus of human meetings
  - Capture the data that we expect agents to use
    - E.g.: Speech, video, whiteboard markings, etc.
Desirable Properties of the Recorder

- Need to record meetings anywhere
  - Emphasis on instrumenting the user, not the room
  - Assume low network bandwidth
  - Should still be able to record in the extreme situation where there is no network access!
- Should be easy to add new data streams
  - “Easy” = low time to incorporate a new stream
- Should be able to support the major OSes
The Recorder Architecture

- The information stream is discretized into events
  - Either a sequence of events, e.g. utterances
  - Or one long event, e.g. video data
- Each event is given start/end time stamps
  - These coincide for instantaneous events, e.g. a keystroke
- Events are stored on local disks
  - Laptops, shuttle PCs, etc.
- Events are (slowly) uploaded to a central server when there is network access (see the sketch below)
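As a concrete illustration, here is a minimal sketch of such an event record in Python; the field names mirror the DATA_BLOCK example on a later slide, but the class itself is illustrative rather than the actual MRCP data structure.

from dataclasses import dataclass

@dataclass
class Event:
    session: str    # meeting name, e.g. "OTTER"
    user: str       # participant who produced the event
    datatype: str   # modality: SPEECH, NOTES, STROKE, SLIDE, VIDEO
    file: str       # path to the event data on the local disk
    start: float    # server-synchronized start time, epoch seconds
    end: float      # equals start for instantaneous events

    def is_instantaneous(self) -> bool:
        return self.start == self.end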
Event Identification and Logging

- Each recorded event has the following identifying information associated with it:
  - Start and stop time stamps
  - Name of the meeting and the user
  - Modality (speech, video, hand-writing, etc.)
- After recording an event, its identification information is sent to a logging server (see the sketch below)
  - The server creates a list of all the events in a meeting
  - Good for book-keeping (but not essential)
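A minimal sketch of the logging step, assuming a plain-text wire format modeled on the DATA_BLOCK record shown on the next slide; the host, port, and exact protocol are assumptions, not the actual MRCP implementation.

import socket

def log_event(session, user, datatype, path, start, end,
              host="logserver.example.edu", port=9999):
    # Hypothetical wire format modeled on the DATA_BLOCK record.
    record = (
        "{\n"
        "DATA_BLOCK\n"
        f"session: {session}\n"
        f"user: {user}\n"
        f"datatype: {datatype}\n"
        f"file: {path}\n"
        f"Start: {start}\n"
        f"End: {end}\n"
        "}\n"
    )
    with socket.create_connection((host, port)) as conn:
        conn.sendall(record.encode("ascii"))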
Architecture of Meeting Recorder
[Diagram: participants P1, P2, and P3 record events locally; their event records flow to the logging server ([master]); a time server keeps all clocks synchronized; a “Browse Meeting” client reads the resulting log. An example event record:]

{
DATA_BLOCK
session: OTTER
user: arudnicky
datatype: SPEECH
file: \\spot\data\u1.raw
Start: 20030917::18:27.600
End: 20030917::18:35.357
}
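Parsing such a record is straightforward; the following sketch assumes one "key: value" pair per line between the braces, which matches the example above.

def parse_data_block(text: str) -> dict:
    """Parse a DATA_BLOCK record like the one above into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line in ("{", "}", "DATA_BLOCK", ""):
            continue
        key, _, value = line.partition(":")  # split at the first colon only
        fields[key.strip().lower()] = value.strip()
    return fields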
Synchronizing the Time Stamps

- All event time stamps must be synchronized
- We use the Simple Network Time Protocol (SNTP)
  - Query a central NTP server for the time
  - Use the reply and the round-trip time to estimate the time difference between the local machine and the server
  - Use this difference to create server-time time stamps (see the sketch below)
- Rough experiments reveal a variance of about 10 ms
  - Caveat: these experiments were done on a high-speed network
- What if there is *no* network access?
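A sketch of the offset estimate the slides describe: the server’s reply plus half the measured round trip approximates the server clock “now”, and the difference from the local clock is the correction applied to every time stamp. The query_server_time callback is hypothetical.

import time

def estimate_offset(query_server_time) -> float:
    t_send = time.time()
    server_time = query_server_time()  # server's clock when it replied
    t_recv = time.time()
    round_trip = t_recv - t_send
    # Assume the reply spent half the round trip in transit.
    return (server_time + round_trip / 2.0) - t_recv

def server_timestamp(offset: float) -> float:
    # Convert a local clock reading into a server-time time stamp.
    return time.time() + offset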
Aggregating the Data

- When network access is available, data is transferred from all sites to a central location
  - Current recording sites: CMU and Stanford
- Implemented a cross-platform version of the MS Background Intelligent Transfer Service (sketched below)
  - Uploads files in a transparent background process
  - Throttles bandwidth use as the user’s activity goes up
  - Pauses if the network connection is lost
  - Resumes once network access is restored
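A minimal sketch of such a chunked, resumable, throttled uploader. The send_chunk and network_up hooks are hypothetical; the actual CMU service is a cross-platform background process, not this loop.

import os
import time

CHUNK = 64 * 1024  # bytes per transfer step

def upload(path, send_chunk, network_up, resume_offset=0, delay=0.1):
    offset = resume_offset            # persisted externally so a restart can resume
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(offset)
        while offset < size:
            if not network_up():
                time.sleep(5.0)       # pause until the network returns
                continue
            data = f.read(CHUNK)
            if not data:
                break
            send_chunk(path, offset, data)
            offset += len(data)
            time.sleep(delay)         # throttle; grow delay as user activity rises
    return offset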
Data Collection Process (proposed)
[Flowchart of the proposed pipeline: preparation; independent cross-site collection with background data transmission into the CALO MEETING DATABASE; transcription and annotation; integration; then learning, analysis, and research on the integrated data.]
Capturing Close-Talking Speech

- Implemented Meeting Recorder Cross Platform (MRCP) to record speech and notes
- Speech is recorded using head-mounted mics
- An 11.025 kHz sampling rate is used for portability
- Endpointing is done using the CMU Sphinx 3 ASR
  - Each endpointed utterance is an event
  - The utterance is recorded to local disk in wav format (see the sketch below)
  - Time stamps are generated using Simple NTP
  - The utterance’s identifying information is sent to the logging server, and the utterance is queued for upload
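A sketch of the save step for one endpointed utterance. The 11.025 kHz, mono format follows the slides; the 16-bit sample width, the path, and the samples buffer handed over by the endpointer are illustrative assumptions.

import wave

RATE = 11025  # Hz; chosen for portability across sound hardware

def save_utterance(samples: bytes, path: str) -> None:
    with wave.open(path, "wb") as w:
        w.setnchannels(1)       # close-talking mic, one channel
        w.setsampwidth(2)       # 16-bit PCM (an assumption)
        w.setframerate(RATE)
        w.writeframes(samples)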
Capturing Typed Notes

- Users type notes in the client’s note-taking area
- “Snapshots” of the notes are taken at each carriage return
  - Each snapshot is an event
  - Each snapshot is saved to disk, time-stamped, logged, and queued for upload (see the sketch below)
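A minimal sketch of the snapshot trigger; the callback name and buffer handling are illustrative, not MRCP’s actual wxWidgets event handler.

def on_key(char, note_buffer, snapshots):
    note_buffer.append(char)
    if char in ("\r", "\n"):              # carriage return: take a snapshot
        snapshots.append("".join(note_buffer))
        # The snapshot would now be saved, time-stamped, logged,
        # and queued for upload like any other event.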
[Demonstration of MRCP]
More Details about MRCP

- Implemented using cross-platform libraries:
  - wxWidgets for the GUI, file access, and networking
  - PortAudio for audio I/O
- Currently compiles on the Windows, Mac OS X, and Linux operating systems
- The Windows version has been distributed to other Project CALO sites
- Macintosh and Linux versions are in beta-testing
- A WinCE version is in development
Capturing Whiteboard Pen Strokes

- We use Mimio to capture whiteboard pen strokes
- A “stroke” consists of all the x-y coordinates between pen-down and pen-up
- Each stroke is an event: it is recorded, time-stamped, logged, and queued for upload (see the sketch below)
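A sketch of stroke accumulation under the definition above. The pen_events source yielding (kind, x, y) tuples is hypothetical; Mimio’s actual driver interface is not shown in the talk.

def collect_strokes(pen_events):
    """Group pen samples into strokes; each completed stroke is one event."""
    stroke = None
    for kind, x, y in pen_events:
        if kind == "down":
            stroke = [(x, y)]
        elif kind == "move" and stroke is not None:
            stroke.append((x, y))
        elif kind == "up" and stroke is not None:
            stroke.append((x, y))
            yield stroke        # one complete stroke = one event
            stroke = None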
Capturing PowerPoint Slide Information

- We use MS’s PowerPoint API to capture slide change timing information and slide contents
- Events = slide changes
- Event data = the content of the new slide
  - The content is all the text and all the “shapes” on the slide
- Events are instantaneous
  - Start and stop time stamps coincide
- Events are processed as before (see the sketch below)
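A sketch of a slide change as an instantaneous event whose stamps coincide. The dictionary layout is illustrative; the actual capture goes through the PowerPoint API, which is not reproduced here.

def slide_change_event(session, user, slide_text, slide_shapes, now):
    return {
        "session": session,
        "user": user,
        "datatype": "SLIDE",
        "data": {"text": slide_text, "shapes": slide_shapes},
        "start": now,
        "end": now,    # instantaneous: start and stop coincide
    }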
Capturing Panoramic Video

- We capture panoramic video using a 4-camera CAMEO device
  - Developed by the Physical Awareness group at CMU
- Video recording is done in MPEG-4 format
- One long event is produced and uploaded
Current Status of Data Collection

- Recorded meetings vary widely in size…
  - From 2 to 10 person meetings
- …in meeting type
  - Scheduling meetings, presentations, brainstorms
- …in content
  - Speech group meetings, dialog group meetings, physical awareness group meetings
- We currently have a total of more than 11,000 utterances (including cross talk)
Using the Data: Some Initial Research

- Question: Can we detect the state of a meeting, and the roles of its participants, from simple speech data?
- Introduced a taxonomy of meeting states and participant roles:

Meeting State    Participant Roles
Presentation     Presenter, Observer
Briefing         Information producer/consumer
Discussion       Participator, Observer
Detection Methods and Initial Results

- Used Anvil to hand-annotate 45 minutes of meeting video with states and roles
- Trained a decision tree classifier on 30 minutes of data
- Input features:
  - Number of speakers, lengths of utterances, pauses, and interruptions within a short history of the meeting (see the sketch below)
- Initial results: about 50% detection accuracy on a separate 15 minutes of test data
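A sketch of this setup: windowed speech-activity features feeding a decision tree. The specific feature computations, the 30-second window, and the use of scikit-learn are assumptions; the slides name only the feature types and the classifier family.

from sklearn.tree import DecisionTreeClassifier

def window_features(utterances, t, history=30.0):
    """utterances: (speaker, start, end) tuples; t: current meeting time."""
    recent = [u for u in utterances if t - history <= u[2] <= t]
    n_speakers = len({spk for spk, _, _ in recent})
    lengths = [end - start for _, start, end in recent]
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    return [n_speakers, mean_len, len(recent)]

clf = DecisionTreeClassifier()
# clf.fit(train_features, train_state_labels)   # 30 minutes of annotated data
# predictions = clf.predict(test_features)      # held-out 15 minutes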
Questions?
Thanks to DARPA grant NBCH-D-02-0010