An Architecture for Real-time Lecture Transcription and Collaborative Annotation using Mobile Devices

Josh White, Dept. of Industrial Design, Auburn University, Alabama, USA, jpw0009@auburn.edu
Jonathan W. Lartigue, Dept. of Computer Science and Software Engineering, Auburn University, Alabama, USA, lartijw@auburn.edu
Brent Dutton, Dept. of Industrial Design, Auburn University, Alabama, USA, bjd0007@auburn.edu

Abstract: Available real-time voice transcription tools lack the capability to actively engage and assist the user in understanding what is being said without delay. Similarly, conventional closed-captioning systems are passive and often do little to augment spoken content. The eScribe project proposes an assisted note-taking system augmented with multimedia content and designed to work in real time with collaborative input and annotation by users on mobile devices such as laptops, tablets, and smartphones. At its most basic, the eScribe project will provide live transcription of a lecture using speech-to-text technology. The basic application will enable users with disabilities, such as hearing impairment, to participate more effectively during classroom lectures. This has clear benefits for students with disabilities and assists universities and other organizations in complying with the requirements of the Americans with Disabilities Act and related legislation. [1, 2] In addition, eScribe intends to integrate all aspects of the lecture environment, including notes, multimedia, websites, computer screenshots, etc., into an indexed, time-coded record of the lecture that is suitable for archiving. We believe eScribe has the potential to provide both real-time assistance specifically for users with disabilities and a more universal benefit to students in general through an augmented lecture experience. We envision that the final system will be applicable both in educational settings and, more generally, in commercial and public environments.

Introduction

The current market of transcription tools is lacking in terms of actively engaging and helping the user understand what is being said without delay. It is accepted that live captioning and a permanent transcript of lecture content add value to the educational experience. Specifically, synchronized lecture notes can reduce cognitive demand on students, allowing them to concentrate more fully on lesson content and increasing comprehension. A permanent record also benefits students during review and study of lecture material, and benefits the lecturer by providing a written record with potential value for pedagogical improvement. [3, 4] What is lacking in current technologies is the capability for users to interact with an in-progress transcription. Augmenting a transcription in real time with grammatical corrections, user annotations, and even multimedia can transform the lecture experience from passive participation to active engagement on the part of the user. Enabling the user to actively engage with what is being said and to personally interact with that information can lead to increased retention and comprehension of lecture material. [4] Active closed-captioning has seldom been done in this way; adding such capabilities is the goal of the eScribe project.

Existing Research

Current technologies generally send audio to servers where it is processed and then sent back to the mobile device over data streams.
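As a rough illustration of this pattern, and not a description of any particular product, the sketch below shows a client that captures short audio chunks and posts them to a transcription server, receiving text back for each chunk. The endpoint URL, chunk length, audio format, and JSON response shape are all assumptions made for the example; audio capture is stubbed out so the sketch stays self-contained.

```python
# Sketch: stream microphone audio to a transcription server in short chunks.
# The endpoint URL and the JSON response shape are assumptions for illustration;
# audio capture is stubbed out so the example stays self-contained.

import json
import urllib.request

TRANSCRIBE_URL = "http://escribe.example.edu/transcribe"  # hypothetical endpoint
CHUNK_SECONDS = 5

def capture_audio_chunk(seconds: int) -> bytes:
    """Placeholder for platform-specific microphone capture (e.g., raw PCM bytes)."""
    return b"\x00" * 16000 * 2 * seconds  # silent 16 kHz, 16-bit mono audio

def transcribe_chunk(audio: bytes) -> str:
    """POST one audio chunk and return the text the server sends back."""
    request = urllib.request.Request(
        TRANSCRIBE_URL,
        data=audio,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload.get("text", "")

if __name__ == "__main__":
    # Send a few chunks and print the partial transcript as it arrives.
    for _ in range(3):
        chunk = capture_audio_chunk(CHUNK_SECONDS)
        print(transcribe_chunk(chunk))
```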
ViaScribe

One substantial research study by the Liberating Learning Consortium centered on synchronized captioning using speech recognition software such as IBM's ViaScribe. [3] The consortium used ViaScribe for speech-to-text transcription in educational settings to investigate whether speech recognition technology could successfully be used to transcribe lectures in university classrooms. Ideally, the consortium wished to show that such technology could be used as an alternative to traditional classroom note taking, especially for persons with disabilities. [3]

Figure 1: The eScribe architecture, employing a local server to perform speech-to-text processing and client devices that receive the recorded audio stream and real-time transcription. Client devices then archive the recording and transcription for future recall and review.

The ViaScribe-enabled system creates a written transcript of existing audio or video content or provides a near-instant transcript of live audio. Designed to automatically caption an audio stream as it occurs, the program requires a robust workstation with an interface that is intuitive for speakers to learn and use. The system also requires each speaker to spend approximately one hour creating a "voice profile" that helps the ViaScribe system recognize that speaker's particular speech patterns. Once the system has been configured for a particular speaker, audio from the speaker can be interpreted by software on the workstation to generate captions in real time that are presented on a screen or saved for later use. ViaScribe's transcription software allows the speaker to talk naturally, without the need to speak verbal punctuation marks such as "period" or "comma." When the speaker pauses, ViaScribe skips to a new line to make the transcript more readable. In addition to real-time captioning, which benefits audience members with disabilities, the system produces two artifacts: a digitized recording and a written transcript of that recording. Finally, the system provides a simplified transcript editing system that allows the speaker, after the lecture has ended, to correct any errors, provide additional information, and then make that content available over the Web or in other accessible formats. These written transcripts allow the speaker to more easily index lectures, which allows for more effective and efficient user searches. Recorded audio is also automatically time-synchronized to captions and visual information.

Automatic Speech Transcription

Transcription of lecture material – which is often characterized by noisy environments, sub-optimal audio quality, complex vocabulary, and lengthy continuous speech – is notoriously error prone. [4] However, Munteanu et al. have shown that even error-prone transcriptions can increase audience comprehension as measured through post-lecture assessments. We therefore conclude that the benefits of real-time transcription are manifest and that such transcription adds value to the lecture experience, and we infer that the reduced cognitive load on the listener allows greater focus on lecture concepts and content.
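As a side note, not drawn from the studies above, transcription error of this kind is commonly quantified as word error rate (WER): the word-level edit distance between the recognized text and a reference transcript, divided by the length of the reference. A minimal sketch of the computation follows; the example sentences are invented.

```python
# Sketch: word error rate (WER) between a reference transcript and an ASR hypothesis.
# WER = (substitutions + insertions + deletions) / number of reference words,
# computed here with a standard word-level edit-distance (Levenshtein) table.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words and first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution or match
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the lecture begins at noon", "the lecture begin at new"))  # 0.4
```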
The eScribe System

The eScribe application will be capable of operating as a standalone client or as one of many clients in a collaborative, real-time environment.

Standalone Operation. The eScribe client, when deployed on a mobile device such as a tablet or smartphone, will have the capabilities necessary to record, annotate, and archive a lecture without the need for an external server or cloud-based computing system. This will permit eScribe users to employ the basic application in any traditional lecture environment.

Collaborative Operation. The real power of the eScribe system will be realized when multiple clients are employed in an appropriately equipped classroom or auditorium. The full eScribe system will consist of a central computer that is tasked with performing an immediate transcription of the spoken lecture and providing that transcription to multiple participating users (see Figure 1). Primarily, the in-progress transcription is streamed live to participating clients, such as laptop computers, tablets, and mobile phones, for the immediate benefit of users. This environment also enables several advanced capabilities of the eScribe system, including supervised or user-directed transcription correction; the addition of multimedia, web links, or other material to the presentation transcription; and time-coded user annotations.

Benefits of Real-Time Transcription

The primary benefit of the full eScribe system will be real-time transcription of spoken lecture content. In a conventional lecture environment, students listen, take written notes, and then later use these notes when studying. The eScribe project relieves the student of the burden of taking notes, allowing him or her to concentrate more fully on the content of the lecture. We hope that in a classroom setting the application will make listening, editing, and learning an interactive process that holds the student's attention and actively engages him or her for the duration of a lecture.

Supervised editing and attachments

Although the quality of speech-to-text software has improved greatly, transcription errors are inevitable. These are usually resolved by manual correction of errors following a lecture, and a corrected transcript is then made available to the audience, usually long after the conclusion of the lecture. The eScribe architecture will permit the immediate correction of transcription errors by a privileged user, such as a graduate teaching assistant, so that users receive the corrected transcription in real time and, at the conclusion of the lecture, have an accurate transcript to take with them. Additionally, the privileged user will be able to attach additional resources, such as images, multimedia, and hyperlinks, to the transcription to supplement the lesson content. Attachments can be time-coded to a specific point in the lecture.

Figure 2: Simulated eScribe client screens, depicting real-time lecture transcription and a timeline annotated with multimedia, screenshots, and user notes.

Collaborative annotation

Because real-time transcription relieves students of the burden of constant note taking, students potentially have the freedom to capture supplemental material about the lecture themselves. This material, which may consist of personal annotations, screenshots, supporting sketches or diagrams, or even photos and video, can be time-coded and attached to the transcription record at the appropriate point. Should collaborative annotation be desired, the annotations of individual students can be shared among all participants, augmenting the lecture for everyone with a growing wealth of supplemental information.
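To make the data involved concrete, the sketch below models a time-coded annotation as a client might represent it and serialize it for sharing with other participants. The field names, annotation kinds, and the serialization step are illustrative assumptions, not a specification of the eScribe implementation.

```python
# Sketch: a time-coded annotation as a client might represent and share it.
# Field names, annotation kinds, and the sharing step are illustrative assumptions.

from dataclasses import dataclass, field, asdict
from typing import List
import json
import time

@dataclass
class Annotation:
    offset_seconds: float   # position in the lecture timeline
    author: str             # participant who created the annotation
    kind: str               # "note", "sketch", "hyperlink", "photo", ...
    payload: str            # text of the note, URL, or path to captured media

@dataclass
class LectureRecord:
    title: str
    started_at: float
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, author: str, kind: str, payload: str) -> Annotation:
        """Attach an annotation time-coded to 'now', relative to the lecture start."""
        note = Annotation(time.time() - self.started_at, author, kind, payload)
        self.annotations.append(note)
        return note

def share(annotation: Annotation) -> str:
    """Serialize an annotation so a server could forward it to other clients."""
    return json.dumps(asdict(annotation))

lecture = LectureRecord("Data Structures, Week 3", started_at=time.time())
shared = share(lecture.annotate("jpw0009", "hyperlink", "https://example.edu/slides"))
print(shared)
```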
Application Features

An advantage of real-time captioning and editing is that it enables the user to interact with the transcribed text while the topic is "fresh" in the user's mind. The user might add notes, web links, or even multimedia, time-coded to the relevant part of the lecture as it is occurring. These annotations are then represented as icons or buttons in line with the transcribed text, or alternatively along a running timeline of the lecture (see Figure 2). Actively creating a hierarchy of annotations during, or even after, a lecture has the potential to promote active learning, make the learning experience more engaging for the user, and give the user a sense of "ownership" of, or a personal stake in, the lecture and the resulting notes. Among the features of the eScribe application are the following.

Lecture Recording. When the "record" icon is pressed, the application records the lecture audio into a time-coded file and provides a "quick annotation" menu or full-keyboard access for annotating the lecture.

Real-time Transcription. Real-time transcription is supported in the presence of an eScribe-equipped server as described above. The in-progress transcript is streamed by the server to participating clients.

Real-time Correction. Although supervised transcription corrections are supported, as described above, the user can also perform corrections him- or herself should the speech-to-text software make a mistake. An on-the-fly "quick fix" menu lets the user make personal corrections to the transcript with a minimum of distraction and interaction. This feature is designed specifically to enable users of any skill level to edit a transcription quickly and easily without undue distraction from the content of the in-progress lecture.

Time Coding. When a lecture is in progress, or when a recorded lecture is being reviewed, the user has access to a timeline of the lecture, with time-coded annotations and attachments connected to the timeline at the appropriate points. Selecting any time-coded item lets the user jump to that point in the lecture, allowing easy recall of critical topics or material (see the sketch following this list of features). Such time-coded thumbnails break a long lecture down into manageable segments and allow each user to personalize a lecture transcript to fit that user's learning style. Any edits to the transcription timeline are saved locally and can be accessed later for playback and additional editing, if needed.

Collaborative Note Taking. If the user wishes to take notes, a "note" icon provides access to a full keyboard. Notes are time-coded to the lecture at the point at which they were created and become part of the user's timeline. During note taking, the lecture transcript continues to scroll in a portion of the window, allowing the user to monitor the continuing transcription. Similarly, a "sketch" icon provides access to a free-form sketch screen for capturing non-textual diagrams or figures.

Hyperlink Archiving. Should the lecturer reference a website, eScribe provides an "internet" icon with an interface for quickly entering a hyperlink address and attaching it to the timeline.

Image and Multimedia Capture. The device's camera can be used to record screenshots, photos, or video relevant to the lecture. As before, these are time-coded and attached to the lecture transcript.

Review and Recall. After a lecture is recorded, it can later be viewed through the "lecture recall" interface. This provides a menu of recorded lectures and, when one is selected, that lecture's recording, transcript, and annotated timeline. When reviewing a recorded lecture, the user can annotate and augment it in each of the same ways described above.
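The sketch below illustrates one plausible way a client could index the timeline described above, so that selecting a time-coded item seeks the recording to the corresponding transcript segment. The segment structure, the bisect-based lookup, and the example sentences are assumptions made for illustration, not the eScribe implementation.

```python
# Sketch: a timeline index that maps time codes to transcript segments,
# so that selecting a time-coded item "jumps" playback to the right point.
# The segment structure and lookup are illustrative assumptions.

from bisect import bisect_right
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start_seconds: float   # where this transcript segment begins in the recording
    text: str              # transcribed text for the segment

class Timeline:
    def __init__(self, segments: List[Segment]):
        self.segments = sorted(segments, key=lambda s: s.start_seconds)
        self._starts = [s.start_seconds for s in self.segments]

    def segment_at(self, offset_seconds: float) -> Segment:
        """Return the transcript segment that covers the given time code."""
        index = max(bisect_right(self._starts, offset_seconds) - 1, 0)
        return self.segments[index]

    def jump_to(self, offset_seconds: float) -> None:
        """Stand-in for seeking the audio player and scrolling the transcript view."""
        segment = self.segment_at(offset_seconds)
        print(f"Seeking to {offset_seconds:.1f}s: \"{segment.text}\"")

timeline = Timeline([
    Segment(0.0, "Today we cover balanced search trees."),
    Segment(42.5, "A red-black tree keeps its height logarithmic."),
    Segment(95.0, "Insertion may require rotations to restore balance."),
])
timeline.jump_to(60.0)  # e.g., a time-coded note attached at 60 seconds
```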
Long-term Goals

Designing for universality requires consideration of future changes in technology; ideas and concepts must evolve with the tools used to create them if they are to remain usable far into the future. Transcribing audio currently requires a large amount of local computing power, data storage, and transmission bandwidth. These limitations are most pronounced on mobile devices, which are the intended platform for the eScribe client application. They can be observed in Apple's implementation of its Siri service, which transmits a tightly compressed audio stream to cloud-based servers for processing into text. Further proprietary software interprets the natural-language meaning of the text commands, and the calculated response is then transmitted back to the device to be performed locally. Currently, personal mobile devices lack both the computational power and the storage capacity to fully eliminate the need for server-based support. The eScribe application, therefore, initially lacks the capability to perform transcription without the presence of an appropriately equipped server in each lecture room or auditorium. This limitation of the eScribe system can potentially be addressed in two ways.

Cloud-based Processing. Leveraging cloud-based language processing systems, as Siri does, is a logical next step for this project. Such a capability could not only eliminate the need for an eScribe-equipped computer in each lecture hall or auditorium but could also enable real-time transcription on a standalone client in any lecture environment.

Distributed Processing. Another potential extension of this architecture is to leverage the collective, distributed processing capability of participating devices to meet the relatively high computing requirements of real-time speech-to-text transcription. Participating devices could work concurrently to process information using local processing and storage resources for the benefit of all participants. This extension of the basic architecture becomes more feasible as increasingly advanced and powerful mobile devices become available to the average consumer.
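As a purely illustrative sketch of the distributed approach, and not a proposed implementation, the example below splits a stream of audio chunks round-robin across participating peer devices, stands in a fake per-chunk transcription step, and merges the results back into time order. The Peer interface and the round-robin assignment are assumptions made for the example.

```python
# Sketch: round-robin distribution of audio chunks across participating devices,
# followed by merging the per-chunk transcripts back into timeline order.
# The Peer interface and the fake transcription step are illustrative assumptions.

from dataclasses import dataclass
from itertools import cycle
from typing import List, Tuple

@dataclass
class AudioChunk:
    start_seconds: float
    samples: bytes

class Peer:
    """Stands in for a participating device that can transcribe small chunks."""
    def __init__(self, name: str):
        self.name = name

    def transcribe(self, chunk: AudioChunk) -> str:
        # A real device would run (or relay) speech-to-text here.
        return f"[text for {chunk.start_seconds:.0f}s, transcribed by {self.name}]"

def distribute(chunks: List[AudioChunk], peers: List[Peer]) -> List[Tuple[float, str]]:
    """Assign chunks to peers round-robin and collect (start time, text) results."""
    results: List[Tuple[float, str]] = []
    assignment = cycle(peers)
    for chunk in chunks:
        peer = next(assignment)
        results.append((chunk.start_seconds, peer.transcribe(chunk)))
    return sorted(results)  # merge back into timeline order

chunks = [AudioChunk(t, b"") for t in (0.0, 5.0, 10.0, 15.0)]
peers = [Peer("tablet-1"), Peer("phone-2"), Peer("laptop-3")]
for start, text in distribute(chunks, peers):
    print(start, text)
```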
Should the eScribe system be developed to its full potential, and for as many mobile operating systems (iOS, Android, BlackBerry, etc.) as possible, we envision that it could be employed in a wide range of non-educational environments, including business meetings, personal discussions, religious services, government proceedings, and more.

Acknowledgements

This project was conceived in collaboration with the departments of Industrial Design, Special Education, Rehabilitation and Counseling, and Computer Science and Software Engineering at Auburn University.

About the authors

Josh White, BID, is a graduate of the Department of Industrial Design at Auburn University. He earned his Bachelor of Industrial Design degree in 2012 with a minor in sustainability. His academic work has emphasized a practical approach to design, using research and analysis of ideas and concepts to better fit user needs. Josh believes user-centered design is key to understanding universal design, or "design for all," which requires multi-disciplinary knowledge and an appreciation that everything is connected and has the potential to affect any part of a system.

Jonathan Lartigue, MSwE, is a PhD candidate in the Dept. of Computer Science and Software Engineering at Auburn University. He has more than 12 years' experience as a software engineer and has worked in the defense industry, in project management, and in mobile device development. He was among the first published iPhone developers and has developed more than a dozen apps commercially and in academia; two such apps have achieved worldwide No. 1 rankings. His research interests include mobile development, design patterns, and the dynamics of software engineering teams.

Brent Dutton, BID, is a graduate of the Department of Industrial Design at Auburn University and earned his Bachelor of Industrial Design degree in 2012. He received a Best in Studio award in 2011 for his work with aquaponics in indoor garden design, promoting a synergistic environment in which fish supply necessary nutrients for plant life, which in turn purifies the water for the fish. Brent is interested in sustainable manufacturing and has designed a line of outdoor furniture made from reclaimed and repurposed materials for GroovyStuff.com.

References

[1] Bain, K., Basson, S. H., and Wald, M. Speech recognition in university classrooms: Liberated Learning Project. In Proceedings of the Fifth International ACM Conference on Assistive Technologies. ACM, Edinburgh, Scotland, 2002.
[2] Kheir, R. and Way, T. Inclusion of deaf students in computer science classes using real-time speech transcription. In Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education. ACM, Dundee, Scotland, 2007.
[3] Wald, M. Using Automatic Speech Recognition to Enhance Education for All Students: Turning a Vision into Reality. 2005.
[4] Munteanu, C., Penn, G., Baecker, R., and Zhang, Y. Automatic speech recognition for webcasts: How good is good enough and what to do when it isn't. In Proceedings of the 8th International Conference on Multimodal Interfaces. ACM, Banff, Alberta, Canada, 2006.