AudioSense: A Simulation Progress Report
EECS 578
Allan Spale

Background of Concept
• Taking the train home and listening to the sounds around me
• How would deaf people be able to perceive the environment?
• What assistance would be useful in helping people adapt to the environment?

Project Goals
• Develop a CAVE application that will simulate aspects of audio perception
• Display the text of "speaking" objects in space
• Display the description text of "nonspeaking" objects in space
• Display visual cues of multiple sound sources
• Allow the user to selectively listen to different sound sources

Topics in the Project
• Augmented reality
– Illustrated by objects in a virtual environment
• 3D sound
– Simulated by an object's interaction property
• Speech recognition
– Simulated by text near the object
– Will remain static during the simulation
• Virtual reality / CAVE
– Method for presenting the project
– Not discussed in this presentation

Augmented Reality
• Definition
– "…provides means of intuitive information presentation for enhancing situational awareness and perception by exploiting the natural and familiar human interaction modalities with the environment." -- Behringer et al. 1999

Augmented Reality: Device Diagnostics
• Architecture components aid in performing diagnostic tests
– Computer vision used to track the object in space
– Speech recognition (command-style) used for the user interface
– 3D graphics (wireframe and shaded objects) used to illustrate an object's internal structure
– 3D audio emitted from an item lets the user find its location within the object

Augmented Reality: Device Diagnostics
• [Figure slides: screenshots of the device diagnostics system]

Augmented Reality: Device Diagnostics
• Summary
– Providing 3D graphics and sound helps the user better diagnose items
– Might also want text information on the display
– Tracking methodology still needs improvement
– Speech recognition of commands could be expanded to include annotation
– Utilize an IP connection to distribute computing power away from the wearable computer

Augmented Reality: Multimedia Presentations in the Real World
• Mobile Augmented Reality System (MARS)
– Tracking performed by the Global Positioning System (GPS) and another device
– Display is see-through and head-mounted
– Interaction based on location and gaze
– Additional interaction provided by a hand-held device

Augmented Reality: Multimedia Presentations in the Real World
• System overview
– Selection occurs through proximity or gaze direction, followed by a menu system
• Information presentation
– Video (on the hand-held device) or images accompanied by narration (on the head-mounted display)
– Virtual reality (for places that cannot be visited)
– Augmented reality (to illustrate where items were)

Augmented Reality: Multimedia Presentations in the Real World
• [Figure slides: screenshots of MARS in use]

Augmented Reality: Multimedia Presentations in the Real World
• Conclusions
– Current system is too heavy and visually undesirable
– Might want to make the hand-held display a palmtop computer
– Permit authoring of content
– Create collaboration between indoor and outdoor system users

3D Sound: Audio-only Web Browsing
• Must overcome difficulties with utilizing 3D sound
– Sounds on the X axis are identifiable; sounds on the Y and Z axes are not
• A need exists to create structure in audio-rendered web pages
– Document reading appears spatially from left to right in an adequate amount of time (see the panning sketch below)
– Utilize earcons and selective listening
– Provide meta-content for quick document overview
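Since only the X axis proved reliably identifiable, left-to-right document reading can be conveyed with simple stereo panning. Below is a minimal C++ sketch of that idea; it is not the implementation from Goose and Moller's browser, and the function name and the 0-to-1 position range are illustrative assumptions.

```cpp
#include <cmath>

// Constant-power stereo panning: map a document element's horizontal
// position to left/right gains, since the X axis is the only axis
// listeners identified reliably. 'x' runs from 0.0 (start of the line,
// hard left) to 1.0 (end of the line, hard right).
struct StereoGain { float left; float right; };

StereoGain panForDocumentPosition(float x) {
    const float kHalfPi = 1.57079632f;
    float angle = x * kHalfPi;          // 0 -> hard left, pi/2 -> hard right
    return { std::cos(angle),           // left-channel gain
             std::sin(angle) };         // right-channel gain
}
```

Constant-power panning (the two gains always satisfy left² + right² = 1) keeps the perceived loudness steady as the reading position sweeps across the virtual page.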
3D Sound: Audio-only Web Browsing
• [Figure slide: the audio-only Web browser]

3D Sound: Audio-only Web Browsing
• Future work
– Improve link information so that it extends beyond web page title and time duration
• Benefits of auditory browsing aids
– Improved comprehension
– Better browsing experience for visually impaired and sighted users

3D Sound: Interactive 3D Sound Hyperstories
• Hyperstories
– A story occurring in a hypermedia context
– Forms a "nested context model"
– World objects can be passive, active, static, or dynamic

3D Sound: Interactive 3D Sound Hyperstories
• AudioDoom
– Like the computer game Doom, but different
– All world objects represented with sound
– Sound represented in a "volume" almost parallel to the user's eyes
– User interacts with world objects using an ultrasonic joystick with haptic functionality
– Organized by partitioned spaces

3D Sound: Interactive 3D Sound Hyperstories
• [Figure slides: AudioDoom]

3D Sound: Interactive 3D Sound Hyperstories
• Despite elapsed time between sessions, users remembered the world structure well
• The authors illustrate the possibility of "render[ing] a spatial navigable structure by using only spatialized sound."
• Opens possibilities for educational software for the blind within the hyperstory context

Speech Recognition: Media Retrieval and Indexing
• Problems with media retrieval and indexing
– Lots of media being generated; too costly and time-consuming to index manually
• Ideal system design
– Speaker independence
– Capability to handle noisy recording environments
– Open vocabulary

Speech Recognition: Media Retrieval and Indexing
• Using hidden Markov models, the system achieved the results in Table 1
• To improve results, "using string matching techniques" will help overcome errors in the recognition stream

Speech Recognition: Media Retrieval and Indexing
• String matching strategy (sketched in code below)
– Develop the search term
– Divide the recognition stream into a set of sub-strings
– Implement an initial filter process
– "Identify edit operations for remaining substrings in [the] recognition stream"
– Calculate the similarity measure for the search term and matched strings
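The strategy above amounts to approximate substring search over an error-prone recognition stream. The following is a minimal C++ sketch under assumptions of mine: a sliding window of the term's length, a cheap character-overlap filter, and similarity defined as 1 − editDistance/termLength. Robertson et al.'s exact filter and similarity measure are not given in the slides, so treat every name and threshold here as illustrative.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein edit distance: counts insert/delete/substitute operations
// needed to turn string 'a' into string 'b'.
static int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({ prev[j] + 1, cur[j - 1] + 1, sub });
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct Match { size_t position; double similarity; };

// Slide a window of the search term's length over the recognition stream,
// discard hopeless windows cheaply, then score the survivors.
std::vector<Match> findTerm(const std::string& stream,
                            const std::string& term,
                            double threshold = 0.7) {
    std::vector<Match> matches;
    if (term.empty() || stream.size() < term.size()) return matches;
    for (size_t pos = 0; pos + term.size() <= stream.size(); ++pos) {
        std::string window = stream.substr(pos, term.size());
        // Initial filter: skip windows sharing no characters with the term.
        if (window.find_first_of(term) == std::string::npos) continue;
        int dist = editDistance(term, window);
        double sim = 1.0 - static_cast<double>(dist) / term.size();
        if (sim >= threshold) matches.push_back({ pos, sim });
    }
    return matches;
}
```

Permitting more edit operations (i.e., lowering the threshold) raises recall at the cost of precision, which matches the trade-off reported on the next slide.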
Speech Recognition: Media Retrieval and Indexing
• [Figure slide: string matching results]

Speech Recognition: Media Retrieval and Indexing
• Results of implementing the string matching strategy
– Permitting more edit operations improved recall performance but degraded precision performance
– Despite low performance rates, a system performing these tasks would be commercially viable

Speech Recognition: Continuous Speech Recognition
• Problems with continuous speech recognition
– Produces unpredictable errors, unlike other "predictable" user input errors
– The absence of contextual aids makes recognition difficult for the computer
– Speech user interfaces are still at a developmental stage and will improve over time

Speech Recognition: Continuous Speech Recognition
• Two modes
– Keyboard-mouse and speech
• Two tasks
– Composition and transcription
• Results
– Keyboard-mouse tasks were faster and more efficient than speech tasks

Speech Recognition: Continuous Speech Recognition
• Correction methods
• Two general correction methods
– Inline correction, separate proofreading
• Speech inline correction methods
– Select text and reenter; delete text and reenter; use a correction box; correct problems during correction

Speech Recognition: Continuous Speech Recognition
• [Figure slides: entry and correction patterns]

Speech Recognition: Continuous Speech Recognition
• Discussion of errors
– Inline correction is preferred by users regardless of modality
– Proofreading saw increased usage with speech because of unpredictable system errors
– Keyboard-mouse correction involved deleting and reentering the word
– Despite the ability to correct inline with speech, errors typically occurred during correction
– Dialog boxes were used as a last resort

Speech Recognition: Continuous Speech Recognition
• Discussion of results
– Users still do not feel that they can be productive using a speech interface for continuous recognition
– More studies must be conducted to improve the speech interface for users

Project Implementation
• Write a CAVE application using YG
– 3D objects simulate sound-producing objects
– No speech recognition will occur, since predefined text will be attached to each object
– Objects will move in space
– Objects will not always produce sound
– Objects may not be in the line of sight

Project Implementation
• Write a CAVE application using YG
• Sound location (see the sketch after these slides)
– Show directional vectors for each object that emits a sound
– The longer the vector, the farther away the object is from the user
– X, Y will use arrowheads; Z will use a dot / "X" symbol
– Dot is for an object behind the user; "X" symbol is for an object in front of the user
– Only visible if the sound can be "heard" by the user

Project Implementation
• Write a CAVE application using YG
• Sound properties
– Represented using a square
– Size represents volume/amplitude (probably will not consider distance, which affects volume)
– Color represents pitch/frequency
– Only visible if the sound can be "heard" by the user

Project Implementation
• Write a CAVE application using YG
• Simulate the "cocktail party effect"
– Allow the user to enlarge text from an object that is far away
– Provide a configuration section to ignore certain sound properties
– Volume/amplitude
– Pitch/frequency
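To make the planned mappings concrete, here is a minimal C++ sketch of the visual-cue design from the three slides above. It uses plain structs rather than YG scene objects, and the display scale factor, the frequency range, and the OpenGL-style convention that negative Z lies in front of the user are assumptions, not settled design decisions.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

enum class DepthGlyph { Dot, XSymbol };  // dot = behind user, "X" = in front

struct SoundCue {
    Vec3 direction;        // arrowhead vector toward the source
    float vectorLength;    // longer vector = farther from the user
    DepthGlyph depthGlyph; // Z axis rendered as dot or "X" symbol
    float squareSize;      // square size encodes volume/amplitude
    float squareHue;       // square color encodes pitch/frequency
    bool visible;          // cue shown only if the sound can be "heard"
};

SoundCue makeCue(Vec3 userPos, Vec3 objPos,
                 float amplitude, float frequency, bool audible) {
    Vec3 d = { objPos.x - userPos.x,
               objPos.y - userPos.y,
               objPos.z - userPos.z };
    float dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);

    SoundCue cue;
    cue.direction    = d;
    cue.vectorLength = dist * 0.1f;        // assumed display scale factor
    cue.depthGlyph   = (d.z < 0.0f)        // assumes -Z is in front (OpenGL-style)
                         ? DepthGlyph::XSymbol
                         : DepthGlyph::Dot;
    cue.squareSize   = amplitude;          // size encodes volume
    cue.squareHue    = frequency / 20000.0f; // assumed audible-range normalization
    cue.visible      = audible;
    return cue;
}
```

Keeping the cue computation separate from the YG scene objects should also make it easier to toggle individual properties from the planned sound-filtering menus.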
Project Tasks Completed
• Basic project design
• Have read some documentation about YG
• Tested functionality of YG in my account
• Established contacts with people who have programmed CAVE applications using YG
– They will provide 3D models and code that demonstrates some YG features upon request
– They will help with answering questions and with demonstrating and explaining YG features

Project Timeline
• Week of March 25
– Practice modifying existing YG programs
– Collect needed 3D models for the program
• Week of April 1
– Code objects and their accompanying text
– Implement movement patterns for objects

Project Timeline
• Week of April 8
– Attempt to "turn on and off" the sound of objects
– Work with the interaction properties of objects that will determine how sound properties are visualized
• Week of April 15
– Continue working on visualizing sound properties
– Work on "enlarging/reducing" the text of an object

Project Timeline
• Week of April 22
– Create simple sound-filtering menus
– Test the program in the CAVE
• EXAM WEEK: Week of April 29
– Practice presentation
– Present project

Bibliography

Behringer, R., Chen, S., Sundareswaran, V., Wang, K., and Vassiliou, M. (1999). A Novel Interface for Device Diagnostics Using Speech Recognition, Augmented Reality Visualization, and 3D Audio Auralization. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Vol. I. IEEE, 427-432.

Goose, S. and Moller, C. (1999). A 3D Audio Only Interactive Web Browser: Using Spatialization to Convey Hypermedia Document Structure. In Proceedings of the Seventh ACM International Conference on Multimedia (Orlando, FL, October 1999). ACM Press, 363-371.

Hollerer, T., Feiner, S., and Pavlik, J. (1999). Situated Documentaries: Embedding Multimedia Presentations in the Real World. In Proceedings of the 3rd International Symposium on Wearable Computers (San Francisco, CA, October 1999). IEEE, 1-8.

Karat, C.-M., Halverson, C., Horn, D., and Karat, J. (1999). Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems. In Proceedings of the CHI 99 Conference on Human Factors in Computing Systems (Pittsburgh, PA, May 1999). ACM Press, 568-575.

Lumbreras, M. and Sanchez, J. (1999). Interactive 3D Sound Hyperstories for Blind Children. In Proceedings of the CHI 99 Conference on Human Factors in Computing Systems (Pittsburgh, PA, May 1999). ACM Press, 318-325.

Robertson, J., Wong, W. Y., Chung, C., and Kim, D. K. (1998). Automatic Speech Recognition for Generalised Time Based Media Retrieval and Indexing. In Proceedings of the Sixth ACM International Conference on Multimedia (Bristol, UK, September 1998). ACM Press, 241-246.