An Architecture for Real-time Lecture Transcription
and Collaborative Annotation using Mobile Devices
Josh White
Dept. of Industrial Design
Auburn University
Alabama, USA
jpw0009@auburn.edu
Jonathan W. Lartigue
Dept. of Computer Science and
Software Engineering
Auburn University
Alabama, USA
lartijw@auburn.edu
Brent Dutton
Dept. of Industrial Design
Auburn University
Alabama, USA
bjd0007@auburn.edu
Abstract: Available real-time voice transcription tools lack the capability to actively engage and assist the user in understanding what is being said without delay. Similarly, conventional closed-captioning systems are passive and often do little to augment spoken content.
The eScribe project proposes an assisted note-taking system augmented with multimedia
content and designed to work in real time with collaborative input and annotation by users utilizing
mobile devices such as laptops, tablets, and smartphones. At its most basic, the eScribe project will
provide live transcription of a lecture utilizing speech-to-text technology. The basic application
will enable users with disabilities, such as hearing impairment, to participate more effectively during classroom lectures. This has clear benefits for students with disabilities and also assists universities and other organizations in complying with requirements set forth in the Americans with Disabilities Act and related legislation. [1, 2]
In addition, eScribe intends to integrate all aspects of the lecture environment, including notes,
multimedia, websites, computer screenshots, etc., into an indexed, time-coded record of the lecture
that is suitable for archiving. We feel eScribe has the potential to provide both real-time assistance
specifically for users with disabilities and also a more universal benefit to general students through
an augmented lecture experience. We envision the final system being applicable both in educational settings and, more generally, in commercial and public environments.
Introduction
Current transcription tools do little to actively engage the user or to help the user understand what is being said without delay. It is widely accepted that live captioning and a permanent transcript of lecture content add value to the educational experience. Specifically, synchronized lecture notes can reduce the cognitive demand on students, allowing them to concentrate more fully on lesson content and increasing comprehension. Additionally, a permanent record benefits students during review and study of lecture material and also benefits the lecturer by providing a written record with potential value for improving pedagogy. [3, 4]
But what is lacking in current technologies is the capability for users to interact with an in-progress transcription. Augmenting a transcription in real time with grammatical corrections, user annotations, and even multimedia can transform the lecture experience from passive participation to active engagement on the part of the user. Enabling the user to actively engage with what is being said and to personally interact with that information can lead to increased retention and comprehension of lecture material. [4]
Active closed-captioning has seldom been done in this way; adding such capabilities is the goal of the eScribe project.
Existing Research
Current technologies generally send audio to servers where it is processed and then sent back to the mobile
device over data streams.
ViaScribe
One substantial research study by the Liberating Learning Consortium centered on synchronized captioning using speech recognition software such as IBM’s ViaScribe. [3] The consortium leveraged ViaScribe for speech-to-text transcription in educational settings for the purpose of investigating whether speech recognition technology could successfully be used to transcribe lectures in university classrooms. Ideally, the consortium wished to show that such technology could be used as an alternative to traditional classroom note taking, especially for persons with disabilities. [3]

Figure 1: The eScribe architecture, employing a local server to perform speech-to-text processing and client devices that receive the recorded audio stream and real-time transcription. Client devices then archive the recording and transcription for future recall and review.
The ViaScribe-enabled system creates a written transcript of existing audio or video content or provides a
near-instant transcript of live audio. Designed to automatically caption an audio stream as it is occurring, the
program requires a robust workstation with an intuitive interface for speakers to learn and use. The system also
requires each speaker to spend approximately one hour creating a “voice profile” that helps the ViaScribe system
recognize that speaker’s particular speech patterns.
Once the system has been configured for a particular speaker, audio from the speaker can be interpreted by
software on the workstation to generate captions in real time that are presented on a screen or saved for later use.
ViaScribe’s transcription software allows the speaker to talk naturally, without having to speak punctuation marks such as “period” or “comma” aloud. When the speaker pauses, ViaScribe skips to a new line to make
the transcript more readable. In addition to real-time captioning, which benefits those audience members with
disabilities, the system also produces two artifacts: a digitized recording and a written transcript of that recording.
Finally, the system provides a simplified transcript editing system to allow the speaker, after the lecture has ended,
to correct any errors, provide additional information, and then make that content available over the Web or in other
accessible formats.
These written transcripts allow the speaker to more easily index lectures, enabling more effective and efficient user searches. Recorded audio is also automatically time-synchronized to captions and visual
information.
Automatic Speech Transcription
Transcription of lecture material, which is often characterized by noisy environments, sub-optimal audio quality, complex vocabulary, and lengthy continuous speech, is notoriously error prone. [4] However, Munteanu et al. have shown that even error-prone transcriptions can increase audience comprehension, as measured through post-lecture assessments.
Therefore, we conclude that the benefits of real-time transcription are manifest and that such transcription adds value to the lecture experience, and we infer that the reduced cognitive load on the listener allows greater focus on lecture concepts and content.
The eScribe System
The eScribe application will be capable of operation as a standalone client or as one of many clients in a
collaborative, real-time environment.
Standalone Operation. The eScribe client, when deployed on a mobile device such as a tablet or smartphone, will have the necessary capabilities to record, annotate, and archive a lecture without the need for an
external server or cloud-based computing system. This will permit eScribe users to employ the basic application in
any traditional lecture environment.
Collaborative Operation. The real power of the eScribe system will be realized when multiple clients are employed in an appropriately equipped classroom or auditorium. The full eScribe system will consist of a central computer that is tasked with performing an immediate transcription of the spoken lecture and providing that transcription to multiple participating users (see Figure 1). Primarily, the in-progress transcription is streamed live to participating clients, such as laptop computers as well as tablet and mobile phone devices, for the immediate benefit of users. This environment also enables several advanced capabilities of the eScribe system, including supervised or user-directed transcription correction; the addition of multimedia, web links, or other material to the presentation transcription; and time-coded user annotations.
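To make this client-server relationship concrete, the following sketch simulates the fan-out of an in-progress transcription in plain Python. It is a minimal, in-process illustration under our own assumptions: the names (EscribeServer, EscribeClient, TranscriptSegment) are hypothetical, direct method calls stand in for the classroom network stream, and already-transcribed text stands in for the speech-to-text step.

```python
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    """One unit of transcribed speech, stamped with its offset into the lecture."""
    offset_s: float  # seconds since the lecture began
    text: str

class EscribeClient:
    """A participating device: receives pushed segments and archives them locally."""
    def __init__(self, name: str):
        self.name = name
        self.transcript: list[TranscriptSegment] = []

    def receive(self, segment: TranscriptSegment) -> None:
        self.transcript.append(segment)  # local archive for later recall and review
        print(f"[{self.name}] {segment.offset_s:5.1f}s  {segment.text}")

class EscribeServer:
    """The central computer: transcribes speech and fans segments out to clients."""
    def __init__(self):
        self.clients: list[EscribeClient] = []

    def join(self, client: EscribeClient) -> None:
        self.clients.append(client)

    def publish(self, offset_s: float, text: str) -> None:
        segment = TranscriptSegment(offset_s, text)
        for client in self.clients:  # stands in for a push over the wireless network
            client.receive(segment)

server = EscribeServer()
for name in ("laptop", "tablet", "phone"):
    server.join(EscribeClient(name))
server.publish(0.0, "Today we introduce binary search trees.")
server.publish(4.2, "Every left child is smaller than its parent.")
```

In a deployed system, each receive call would correspond to a push over the classroom network, with every client retaining its own time-coded copy for review.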
Benefits of Real-Time Transcription
The primary benefit of the full eScribe system will be real-time transcription of spoken lecture content. In a conventional lecture environment, students listen, take written notes, and then later utilize these notes when studying. The eScribe project relieves the student of the burden of taking notes, allowing him or her to more fully concentrate on the content of the lecture.

We hope that in a classroom setting the program will make listening, editing, and learning an interactive process that holds the student’s attention and actively engages him or her through the duration of a lecture.
Supervised editing and attachments
Although the quality of speech-to-text software has improved greatly, the occurrence of transcription errors is inevitable. This is usually resolved by manual correction of errors following a lecture. A corrected transcript is then available to the audience, usually long after the conclusion of the lecture.

Figure 2: Simulated eScribe client screens, depicting real-time lecture transcription and a timeline annotated with multimedia, screenshots, and user notes.
The eScribe architecture will permit the immediate correction of transcription errors by a privileged user, such as a graduate teaching assistant, so that users receive the corrected transcription in real time and, at the conclusion of the lecture, have an accurate transcript to take with them.
Additionally, the privileged user will be able to attach additional resources, such as images, multimedia,
hyperlinks, etc., to the transcription to supplement the lesson content. Attachments can be time-coded to a specific
point in the lecture.
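As one way such corrections could propagate, the sketch below assumes the server stamps each transcript segment with a sequence number; a privileged user’s fix references that number, so every client overwrites the same erroneous segment in place. All names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    seq: int             # sequence number assigned by the server
    text: str
    corrected: bool = False

class ClientTranscript:
    """Client-side view of the transcript, indexed by segment sequence number."""
    def __init__(self):
        self.segments: dict[int, Segment] = {}

    def on_segment(self, seg: Segment) -> None:
        self.segments[seg.seq] = seg

    def on_correction(self, seq: int, new_text: str) -> None:
        # Every client receives the same correction message, so all devices
        # converge on the corrected transcript as the lecture proceeds.
        seg = self.segments[seq]
        seg.text, seg.corrected = new_text, True

t = ClientTranscript()
t.on_segment(Segment(0, "binary search freeze"))  # recognizer mishears "trees"
t.on_correction(0, "binary search trees")         # teaching assistant pushes a fix
print(t.segments[0])
```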
Collaborative annotation
Because real-time transcription relieves students of the burden of constant note taking, students potentially
have the freedom to capture supplemental material about the lecture themselves. This material, which may consist
of personal annotations, screenshots, supporting sketches or diagrams, or even photos and video, can be time-coded
and attached to the transcription record at the appropriate point.
Should collaborative annotation be desired, the annotations of individual students can potentially be shared among
all participants, thus augmenting the lecture for all with an increasing wealth of supplemental information.
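One possible shape for that sharing is a time-ordered pool of annotations merged from all participants, as in this sketch; the Annotation fields and the SharedTimeline name are our assumptions, not a fixed design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    offset_s: float  # position on the lecture timeline, in seconds
    author: str
    kind: str        # e.g. "note", "sketch", "photo", "link"
    payload: str

class SharedTimeline:
    """Pools annotations contributed by every participant, in lecture order."""
    def __init__(self):
        self.items: list[Annotation] = []

    def share(self, ann: Annotation) -> None:
        self.items.append(ann)
        self.items.sort(key=lambda a: a.offset_s)  # keep the timeline ordered

timeline = SharedTimeline()
timeline.share(Annotation(312.0, "alice", "note", "Rotation restores balance here."))
timeline.share(Annotation(298.5, "bob", "link", "https://en.wikipedia.org/wiki/AVL_tree"))
for ann in timeline.items:
    print(f"{ann.offset_s:6.1f}s  {ann.author:5s}  {ann.kind:6s}  {ann.payload}")
```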
Application Features
An advantage of real-time captioning and editing is that it enables the user to interact with the transcribed text while the topic is “fresh” in the user’s mind. The user might add notes, web links, or even multimedia that is time-coded to the relevant part of the lecture as it is occurring. These annotations are then represented as icons or buttons in-line with the transcribed text, or alternately along a running timeline of the lecture (see Figure 2). Actively creating a hierarchy of annotations during, or even after, a lecture has the potential to promote active learning, make the learning experience more engaging for the user, and give the user a sense of “ownership” or a personal stake in the lecture and subsequent notes.
Among the features of the eScribe application are:
Lecture Recording. When the “record” icon is pressed, the application records the lecture audio into a time-coded file and provides a “quick annotation” menu or full-keyboard access for annotation of the lecture.
Real-time Transcription. Real-time transcription is supported in the presence of an eScribe-equipped
server as described above. The in-progress transcript is streamed by the server to participating clients.
Real-time Correction. Although supervised transcription corrections are supported, as described above,
the user has the ability to perform corrections him- or herself, should the speech-to-text program make a mistake.
An on-the-fly “quick fix” menu is available to the user to make personal corrections to the transcript with a
minimum of distraction and user interaction. This feature is designed specifically to enable users of any skill level to edit a transcription quickly and easily without undue distraction from the content of the in-progress lecture.
Time Coding. When a lecture is in progress, or when a recorded lecture is being reviewed, the user will
have access to a timeline of the lecture, with time-coded annotations and attachments connected to the timeline at
the appropriate point. Selecting any time-coded item permits the user to jump to that point in the lecture, allowing easy recall of critical topics or material. Such time-coded thumbnails break down a long lecture into manageable segments and also allow each user to personalize a lecture transcript to fit that user’s learning style. Any edits to the transcription timeline are saved locally
and can be accessed later for playback and additional editing, if needed.
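A sketch of the seek behavior this implies, assuming each transcript segment’s start time is kept sorted so a tapped item can be mapped back to both a playback position and the matching transcript segment (all names hypothetical):

```python
import bisect

class LecturePlayer:
    """Maps a tapped time-coded item to a playback position and transcript segment."""
    def __init__(self, segment_starts: list[float]):
        self.starts = sorted(segment_starts)  # start time of each transcript segment
        self.position_s = 0.0

    def jump_to(self, offset_s: float) -> int:
        # Seek the recording, then locate the segment containing the offset so
        # the matching transcript text can be highlighted on screen.
        self.position_s = offset_s
        return max(bisect.bisect_right(self.starts, offset_s) - 1, 0)

player = LecturePlayer([0.0, 12.4, 31.0, 55.8])
segment = player.jump_to(33.0)     # user taps a photo attached at the 33-second mark
print(player.position_s, segment)  # playback moves to 33.0 s, inside segment index 2
```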
Collaborative Note Taking. If the user wishes to take notes, a “note” icon provides access to a full
keyboard. Notes are time-coded to the lecture at the point the note was created and become part of the user’s
timeline. During note taking, the lecture transcript continues to scroll in a portion of the window, allowing the user
to monitor the continuing transcription. Similarly, a “sketch” icon provides access to a free-form sketch screen for
similarly capturing non-textual diagrams or figures.
Hyperlink Archiving. Should the lecturer reference a website, eScribe provides an “internet” icon that opens an interface for quickly entering a hyperlink address and attaching it to the timeline.
Image and Multimedia Capture. The device’s camera can be utilized to capture screenshots, photos, or video relevant to the lecture. As before, these are time-coded and attached to the lecture transcript.
Review and Recall. After a lecture is recorded, it can later be viewed through the “lecture recall” interface. This
provides a menu of recorded lectures and, when selected, an individual lecture’s recording, transcript, and annotated
timeline. When reviewing a recorded lecture, the user has the capability to annotate and augment the recorded
lecture in each of the same scenarios as above.
Long-term Goals
Designing for universality requires consideration of future changes in technology; ideas and concepts must evolve with the tools used to create them if they are to remain useful far into the future.
Transcribing audio currently requires a large amount of local computing power, data storage, and transmission
bandwidth. These limitations are most prevalent in mobile devices, which are the intended platform for the eScribe
client application.
Such mobile device limitations can be observed in Apple’s implementation of its Siri service, which transmits a tightly compressed audio stream to cloud-based servers for processing into text. Further proprietary software interprets the natural-language meaning of the text commands, and the calculated response is then transmitted back to the device to be performed locally.
Currently, personal mobile devices lack both the computational power and storage capacity to fully
eliminate the need for server-based support. The eScribe application, therefore, initially lacks the capability to
perform transcription without the presence of an appropriately equipped server in each lecture room or auditorium.
This limitation of the eScribe system can potentially be addressed in two ways:
Cloud-based Processing. Leveraging cloud-based language processing systems, as Siri does, is a logical next step for this project. Such a capability could not only eliminate the need for an eScribe-equipped computer in each lecture hall or auditorium but could also potentially enable real-time transcription on a standalone client in any lecture environment.
Distributed Processing. Another potential extension of this architecture is to leverage the collective, distributed processing capability of participating devices to meet the relatively high computing requirements of real-time speech-to-text transcription. Participating devices could work concurrently to process information, using local processing and storage resources for the benefit of all participants. This extension of the basic architecture becomes more feasible as increasingly advanced and powerful mobile devices become available to the average consumer.
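As a rough illustration of the idea, the sketch below deals fixed-length audio chunks out to participating devices round-robin; each device would transcribe its share locally, and the results could be re-spliced in chunk order. The function and device names are hypothetical, and a real scheduler would also need to weigh device capability and battery life.

```python
from itertools import cycle

def assign_chunks(chunk_ids: list[int], devices: list[str]) -> dict[str, list[int]]:
    """Deal audio chunks out to devices round-robin for local transcription."""
    plan: dict[str, list[int]] = {device: [] for device in devices}
    for chunk, device in zip(chunk_ids, cycle(devices)):
        plan[device].append(chunk)
    return plan

# Eight 10-second chunks shared among three participating devices:
print(assign_chunks(list(range(8)), ["phone-a", "tablet-b", "laptop-c"]))
# {'phone-a': [0, 3, 6], 'tablet-b': [1, 4, 7], 'laptop-c': [2, 5]}
```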
Should the eScribe system be developed to its full potential, and for as many mobile device operating systems (iOS, Android, Blackberry, etc.) as possible, we envision that it could be employed in a wide range of non-educational environments, including business meetings, personal discussions, religious services, government proceedings, and more.
Acknowledgements
This project was conceived in collaboration with the departments of Industrial Design, Special Education,
Rehabilitation and Counseling, and Computer Science and Software Engineering at Auburn University.
About the authors
Josh White, BID, is a graduate of the Department of Industrial Design at Auburn University. He earned his Bachelor of Industrial Design degree in 2012 with a minor in sustainability. His academic work has emphasized a practical approach to design, utilizing research and analysis of ideas and concepts to better fit user needs. Josh believes user-centered design is key to understanding universal design, or “design for all,” which requires multi-disciplinary knowledge and an appreciation that everything is connected and has the potential to affect any part of a system.

Jonathan Lartigue, MSwE, is a PhD candidate in the Dept. of Computer Science & Software Engineering at Auburn University. He has more than 12 years of experience as a software engineer and has worked in the defense industry, in project management, and in mobile device development. He was among the first published iPhone developers and has developed more than a dozen apps commercially and in academia. Two such apps have achieved worldwide No. 1 rankings. His research interests include mobile development, design patterns, and the dynamics of software engineering teams.

Brent Dutton, BID, is a graduate of the Department of Industrial Design at Auburn University and earned his Bachelor of Industrial Design degree in 2012. He received a Best in Studio award in 2011 for his work with aquaponics in indoor garden design, which promoted a synergistic environment in which fish supply necessary nutrients for plant life, which in turn purifies the water for the fish. Brent is interested in sustainable manufacturing and has designed a line of outdoor furniture composed of reclaimed and repurposed materials for GroovyStuff.com.
References
[1] Bain, K., Basson, S. H. and Wald, M. Speech recognition in university classrooms: Liberated Learning Project. In Proceedings of the Fifth International ACM Conference on Assistive Technologies. ACM, Edinburgh, Scotland, 2002.
[2] Kheir, R. and Way, T. Inclusion of deaf students in computer science classes using real-time speech transcription. In Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education. ACM, Dundee, Scotland, 2007.
[3] Wald, M. Using Automatic Speech Recognition to Enhance Education for All Students: Turning a Vision into Reality. 2005.
[4] Munteanu, C., Penn, G., Baecker, R. and Zhang, Y. Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't. In Proceedings of the 8th International Conference on Multimodal Interfaces. ACM, Banff, Alberta, Canada, 2006.