FocalSpace: Enhancing Users' Focus on Foreground through Diminishing the Background

By Lining Yao

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2012.

© Massachusetts Institute of Technology 2012. All rights reserved.

Author
Lining Yao
Program in Media Arts and Sciences
May 1, 2012

Certified by
Hiroshi Ishii
Jerome B. Wiesner Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Supervisor

Accepted by
Mitchel Resnick
LEGO Papert Professor of Learning Research
Academic Head, Program in Media Arts and Sciences

FocalSpace: Enhancing Users' Focus on Foreground through Diminishing the Background

By Lining Yao

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on May 11, 2012, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences.

ABSTRACT

In this document we introduce FocalSpace, a video conferencing system that helps users focus on the foreground by diminishing the background through synthetic blur effects. The system dynamically recognizes relevant and important activities and objects through depth sensors and various cues such as voices, gestures, proximity to drawing surfaces, and physical markers. FocalSpace can help direct remote participants' focus, save transmission bandwidth, remove irrelevant pixels, protect privacy, and conserve display space for augmented content. The core of this research lies in augmenting the user experience by diminishing irrelevant information, based on the philosophy of "less is more." In short, we use DR (Diminished Reality) to improve communication.

Based on our philosophy of "less is more", we describe some design concepts and applications beyond the video conferencing system. We explain how the approach of "AR through DR" can be utilized in driving and sport-watching experiences. In this document, we detail the system design of FocalSpace and the 3-D tracking and localization technology it uses. We also discuss some initial user observations and suggest further directions.

Thesis Supervisor: Hiroshi Ishii
Title: Jerome B. Wiesner Professor of Media Arts and Sciences, Program in Media Arts and Sciences

DR AR

"Imagine a technology with which you could see more than others see, hear more than others hear, and perhaps even touch, smell and taste things that others can not."
By D.W.F. van Krevelen and R. Poelman [11]

"Then, imagine a technology with which you could see less than others see, hear less than others hear, so that you can concentrate on and care about only the things you want to care about."
By the author of this thesis

FocalSpace: Enhancing Users' Focus on Foreground through Diminishing the Background

By Lining Yao

The following people served as readers for this thesis:

Thesis Reader
Pattie Maes
Professor of Media Arts and Sciences
Program in Media Arts and Sciences

Thesis Reader
Ramesh Raskar
Assistant Professor of Media Arts and Sciences
Program in Media Arts and Sciences

ACKNOWLEDGEMENTS

First, I thank my first collaborator, Anthony DeVencenzi. This work would not have been possible if I had conducted it alone. Together with Tony, I developed the very first application, "Kinected Conference", one week after Microsoft Kinect came onto the market. I enjoyed the moment when we used open source code to hack into the device and blew people's minds with our final presentation for MAS 531, Computational Camera and Photography. I fondly recall the moments when we sketched in the 5th floor cafe before meeting professors, coded during holidays, shot video on the ground floor at midnight, wrote papers on the plane, and sat in front of cameras to describe our story and concept for the media. I appreciate Tony's support, encouragement, and contributions during each step of the project.

I thank my advisor, Hiroshi Ishii, for inspiring me with his vision of the "Tangible Interface" back when I was in China. I have always been encouraged by his energy, his passion, and his critical perspective. When we explained ideas in one sentence during a group meeting, it was Hiroshi who picked up the original idea of Kinected Conference and encouraged us to go ahead with it. When we tried to frame our work to tell a better story, it was Hiroshi who inspired us with the vision of "Diminishing Reality" + "Augmenting Reality". FocalSpace is a result we cultivated together over two years. I would like to express my gratitude to him for his enormous support and encouragement.

It was a great pleasure to take MAS 531, Computational Camera and Photography, under Ramesh Raskar, the director of the Camera Culture group at the Media Lab. Without this class, FocalSpace would not exist. Ramesh showed me how to "play" and "design" with technology, and how a real MIT genius invents something new. The night before the final presentation for the class, as Tony and I were shooting project videos and preparing presentations, Ramesh came to me and listened to my idea. After brainstorming with him, I extended the single "voice activated refocusing" feature to five different visual features. As a result, I got the most votes for the final presentation and won the first-place award as a designer in one of the most technical classes at the Media Lab. Ramesh is a mentor who has not only driven me to work harder, but also inspired me with his unique perspective, great talent, and creativity. I am grateful to him for telling me, "you should not be constrained by the research direction of your group; you should be the one who creates the direction." His advice has consistently pushed me to move forward and outside the box.

I have been fortunate to have the pleasure of communicating with another of my thesis readers, Professor Pattie Maes. I appreciate the support I received anytime I sought her help. She inspired me to think seriously about the effectiveness of the system.

I would like to thank all of the MIT undergraduate students who helped with this project. They were always there and willing to help at any moment I needed them. Erika Lee, who began working with us as a junior, will complete a bachelor's and master's thesis on this topic as well. She did a great deal to build up the infrastructure of FocalSpace after we switched to Windows and the official Microsoft Kinect SDK. Many times, she was the person I looked for when I had no clue how to solve a technical problem.
Kaleb Ayalew started collaborating with me a year ago on another project and switched to FocalSpace at a later stage. He implemented the entire record-and-review interface from scratch and made it work nicely. He is very kind, codes quickly, and was always careful in debugging the code, even when I felt impatient. I would like to thank Shawn Xu as well for his hard work day and night on the private mobile interface, and for his care and friendship throughout the semester he spent here at MIT. I still remember the night when we had to catch the last train at 1 o'clock in heavy rain. These collaborators really made my time at MIT meaningful.

It has been an absolute pleasure spending time with Pranav Mistry, one of my best friends here at MIT. When I talk to Pranav, I see a perfect mind combining technology, design, art, philosophy, and even pop culture. I was moved by the way he told a story with technology during his TED talk on "SixthSense" back in China. I never imagined that I would one day hang out with this amazing creator and hear firsthand how passionate and obsessed he was while working on SixthSense. At MIT, we have access to the smartest people all the time. Many people are good at sharing their own thoughts, but only some of them are willing to connect their vision more closely to daily life, and Pranav is one of them. Pranav and I share the belief that invention, or research, should be simple, unique, and useful. He keeps reminding me that I shouldn't go in the direction of being too geeky; instead, I should care about the real-life impact of my work. He believes that we should learn to think independently and never follow in others' footsteps. Moreover, Pranav introduced me to Abhijit Bendale, who became another of my lifelong friends. I will never forget the days when Abhijit visited and chatted with me in the 5th floor cafe, brainstorming and dreaming of the future.

I would like to thank the researchers from the Cisco Research Center: Steve Fraser from California, Stephen Quatrano from Boston, and Lars Aurdal from Norway. I appreciate their insight, honest comments, and suggestions. With their help, I received the Cisco Fellowship for PhD study. The appreciation our work has received within industry means a lot to us. They are very fun people to talk to as well, and I am looking forward to close collaboration with them in the coming years.

It has been my pleasure to get to know all of the people in the Tangible Media group. They have offered me great help and friendship. I have been working on FocalSpace for almost two years. When I coded and set up the hardware systems, there were always group members around, helping or giving suggestions from time to time. I would like to thank Sean Follmer and Jinha Lee for the encouragement they extended the very first time they heard about the project. Daniel Leithinger has always been there to listen, help, and take care of any mess I created. David Lakatos spent much time in the same office for the whole summer to test out audio-video features. And Austin Lee has always given kind suggestions and encouragement. I thank Leo Bonanni for his consistently short, sharp, but effective suggestions and support. I miss Keywon Chung's kindness. I enjoy being lab mates with Xiao Xiao and Samuel Luescher. Finally, I have had the pleasure of working with Natalia Villegas, Mary Tran Niskala, Jonathan Williams, Paula Aguilera, and many others who made the Media Lab a better place to stay.
I thank my former advisor in China, Professor Fangtian Ying. He is the one who told me, "be imaginative about yourself". Without him, I would never have mustered the courage to leave my country and friends and apply to MIT. He is a professor who blew my mind every time I listened to him talk. He told us that being a student means maintaining passion and curiosity. I learned how to be a better designer and human being because of him. Although he couldn't speak English, his vision of design and of creating a better life transcends countries and time.

In conclusion, I want to say thank you to my great family. No matter how far I go, my lovely parents and little sister are always standing where I can see them the moment I turn around. I want to thank Wei for his constant patience whether I am happy or angry, considerate or demanding. He has taught me that loving a person means caring for and appreciating him or her. My family and Wei make me feel safe and warm.

TABLE OF CONTENTS

Abstract .......... 2
Acknowledgements .......... 5
List of Figures .......... 11
Introduction .......... 14
Motivation for Diminishing Reality .......... 15
Current Challenges in the Video Conferencing Room .......... 16
Technical constraint .......... 16
Perceptive constraint .......... 16
Inspirations .......... 17
Hypothesis and Goal .......... 18
Contribution .......... 19
Vision .......... 19
Usefulness .......... 19
Technical contribution .......... 20
Outline .......... 20
Related Work .......... 22
Computer Mediated Reality .......... 23
Visual perception of the reality .......... 23
Augmented Reality .......... 23
Time Augmented Reality .......... 25
Improved Reality .......... 25
Altered Reality .......... 26
Diminished Reality .......... 28
Diminished "Digital Reality" .......... 30
Abstracted Reality .......... 33
Focus and Attention .......... 34
Awareness of focus for video conferencing .......... 34
Focus and attention in other application domains .......... 35
Improving Focus by DR .......... 39
The "Layered Space Model" of FocalSpace .......... 41
Foreground layer .......... 41
Diminished background layer .......... 41
Add-on layer .......... 41
Interaction Cues of "Layered Space Model" .......... 41
Audio cue .......... 43
Gestural cue .......... 43
Proximity cue .......... 44
Physical marker as an interactive cue .......... 45
Remote user defined/discretionary selection .......... 46
Other potential cues .......... 46
User Scenarios .......... 47
Filtering Visual Detritus and Directing Focus .......... 48
Saving display space for augmented information .......... 49
Saving Bandwidth for Transmission .......... 52
Keeping Privacy .......... 52
Implementation .......... 53
System .......... 54
Setup .......... 54
Tracking Techniques .......... 56
Image Processing Pipeline .......... 57
User Interface .......... 58
User Perspectives .......... 60
Feedback from the Showcase .......... 61
User test conducted at the lab .......... 62
Goal .......... 62
Setup .......... 62
Method .......... 63
Steps .......... 64
Findings .......... 65
Extended Application .......... 68
Extended Applications .......... 68
Record and Review .......... 69
Mining gestures for navigating video archives .......... 69
Voice index .......... 71
Portable Private View .......... 71
Extended Domain .......... 73
FocalCockpit .......... 74
FocalStadium .......... 76
Conclusion .......... 78
Camera Man .......... 79
Future Work .......... 80
Specific tasks .......... 80
Cloud-based solution for FocalSpace .......... 80
Conclusion .......... 80
Bibliography .......... 82

LIST OF FIGURES

Figure 1: Times Square with Information Overload .......... 15
Figure 2: Depth of Field Techniques used in Photography .......... 17
Figure 3: FocalSpace System .......... 18
Figure 4: FocalSpace central configuration .......... 19
Figure 5: An "evolutional" approach to outline the thesis .......... 21
Figure 6: EyeTap .......... 23
Figure 7: "Office of the Future" .......... 24
Figure 8: Timescope (Left) and Jurascopes (Right) from ART+COM .......... 25
Figure 9: Artvertiser .......... 26
Figure 10: Umkehrbrille Upside Down Goggles by Carsten Höller .......... 27
Figure 11: "Animal Superpowers" .......... 28
Figure 12: Diminished and augmented planar surface .......... 29
Figure 13: Artwork of Steve Mann .......... 29
Figure 14: "Remove" from Scalado .......... 30
Figure 15: Diminished Reality in Data Visualization and Software Applications .......... 31
Figure 16: The company webpage of Tribal DDB .......... 32
Figure 17: "Turning on and off the light" on YouKu .......... 32
Figure 18: Non-Photorealistic Camera .......... 33
Figure 19: The abstracted rendering is generated by transforming a photograph based on viewers' eye-gaze fixation .......... 34
Figure 20: "Obsessed by Sound" .......... 36
Figure 21: Aggregated fixations from 131 subjects viewing Paolo Veronese's Christ addressing a Kneeling Woman .......... 37
Figure 22: Gaze Contingent Display .......... 38
Figure 23: Dynamic Layered Space Model .......... 40
Figure 24: Categories of the semantic cues .......... 42
Figure 25: Voice Activated Refocusing .......... 43
Figure 26: Gesture Activated Refocusing .......... 44
Figure 27: Proximity Activated Refocusing .......... 45
Figure 28: Physical marker as an interactive cue .......... 46
Figure 29: 3 Steps to develop the application on top of the "Layered Space Model" .......... 48
Figure 30: Voice activated refocusing .......... 48
Figure 31: Augmentation .......... 49
Figure 32: Contextual augmentation of local planar surfaces .......... 50
Figure 33: Sketching on the planar surface can be detected and augmented for the remote users .......... 51
Figure 34: Contextual augmentation of shared digital drawing surfaces .......... 51
Figure 35: 3 degrees of blur, with dramatically changed data size .......... 52
Figure 36: FocalSpace central configuration .......... 55
Figure 37: Satellite cameras .......... 55
Figure 38: Categories of the segmenting and tracking targets .......... 57
Figure 39: Video Conferencing Pipeline of "voice activated refocusing" .......... 58
Figure 40: The front end user interface. The slider bar can be auto-hidden .......... 59
Figure 41: FocalSpace during the showcases .......... 62
Figure 42: FocalSpace setup for user test .......... 63
Figure 43: Switching between Live Mode and Review Mode .......... 69
Figure 44: "Good idea" gesture index .......... 70
Figure 45: "Talking head" mark .......... 71
Figure 46: Gesture cue .......... 72
Figure 47: To focus on a certain active area, in this case, the flipchart .......... 72
Figure 48: We envision drivers could get a fog-diminished view on a rainy day .......... 74
Figure 49: The real-time updating of the "bird view" map .......... 75
Figure 50: Chatting on the road .......... 75
Figure 51: Finding the parking slots .......... 75
Figure 52: Focus on the most active speaker .......... 76
Figure 53: The metaphor of "Camera man" for FocalSpace .......... 79

INTRODUCTION

"LESS IS MORE"

This chapter serves as an introduction. We first describe the motivation for "Diminished Reality", and then take a step further to discuss the challenges of the video conferencing room as a specific real-life scenario. Section 2 explains the inspiration for FocalSpace, introduces the hypothesis, and highlights the novel interactions and research contributions.

Motivation for Diminishing Reality

We are in an information age. We face information when we are at our computers, on our mobile phones, or even walking down the street. Inventors whose job is to imagine the future are trying hard to bring information everywhere in the physical environment. This work has taken various research directions, such as "Ubiquitous Computing", "Physical Computing", "Augmented Reality", and "Tangible Interaction". Soon, we will be able to, or we will be forced to, receive information everywhere, both consciously and subconsciously. The importance of information accessibility should never be denied. However, people are facing information overload.
This situation brings with it an obligation to research how to organize and filter information. In particular, we are interested in helping people filter out unwanted or unimportant information and focus on the most relevant information. Diminishing unwanted information while rendering the important information more accessible and visible is the goal of this research.

Figure 1: Times Square with Information Overload. The street is overwhelmed with activities, lighting, and displays; it is information overload in the real world.

Current Challenges in the Video Conferencing Room

In this paper, we take video conferencing as an example to explain our belief in and approach to Diminishing Reality. In recent years, video conferencing for remote meetings has been widely adopted in the work environment. However, a number of challenges remain associated with the use of video conferencing tools.

Technical constraint

First, for large-group video conferencing, it is hard to render the entire captured scene in high detail. One obvious reason is the limitation of transmission bandwidth. Most video conferencing tools on the market, such as Skype (Skype, 2012), sample video frames at low resolution to mitigate this problem. Another possible reason is that there is not enough real estate available on the screen to render each participant in sufficient detail (Tracy Jenkin, 2005). Some existing systems try to solve this problem by allocating screen real estate according to interest. For example, Cisco's WebEx (Cisco, 2012) can detect volume thresholds and simply render the active speakers on the screen. eyeView (Tracy Jenkin, 2005) uses visual cues of looking behavior to dynamically change the size of different remote participants and render the current object of interest in a high-resolution focus window.

Perceptive constraint

The second problem we are trying to address is the loss of semantic cues, such as visual-spatial and auditory-spatial cues, in remote video conferencing, which makes it difficult for remote participants to focus and become engaged compared to a co-located meeting. In face-to-face conversation we commonly adjust the focus of our gaze towards those who are speaking. Without conscious thought, our eyes dart in and out of different focal depths, expanding and contracting the aperture of our inner eye. The subtle depth of field created when focusing on the person you are speaking with, or looking at, is a natural tool that affords literal and cognitive focus while fading out irrelevant background information. Moreover, based on auditory cues including the direction of the sound source, people can easily attend to a selected source of interest; this is known as the "cocktail party effect". In video conferencing, these natural behaviors are reduced by the inherent constraints of a "flat screen" (Tracy Jenkin, 2005) (Vertegaal, 1999). Eye gaze and sound have been commonly explored as potential cues (Vertegaal, 1999) (Okada, 1994) by utilizing omnidirectional cameras, multiple web cameras, and speaker arrays. We believe the conveyance of audio and visual cues plays a critical role in directing and managing attention and focus.

Inspirations

In photography and cinematography, one of the basic techniques artists use is Depth of Field (DOF): blurring out the background to draw the audience's attention to the character or speaker in focus (Figure 2).
In many movies, when the main speaker changes, the foreground focus switches accordingly. The DOF effect provides better focus while keeping viewers aware of the context.

Figure 2: Depth of Field Techniques used in Photography. Photo courtesy of Joshua Hudson.

Another inspiration comes from the real-life movie-watching experience. When a movie starts, the lights in the theater are turned off. By diminishing the bright light in the surrounding environment, people concentrate better on the central movie projection. People might take this process for granted, yet they naturally mimic it when watching movies at home in front of their own computers or TVs. Perhaps "mimic" is not quite the right word: people unconsciously want to create a dark environment whenever they want to concentrate on a smaller screen, even without borrowing the idea from movie theaters.

Hypothesis and Goal

To achieve better focus through Diminished Reality, we propose the idea of emphasizing the foreground by de-emphasizing the background and simplifying the visual content. Based on the philosophy of "less is more", we try to create a visually simplified reality that helps people focus and concentrate. The purpose of the display is not to reproduce the remote space in uniform high-resolution video. Instead, the system diminishes non-essential background elements such as clutter while highlighting focal persons and objects (such as whiteboards). In the end, what remote people see is a "fake" or "synthesized" reality, yet it is a more useful reality. We address the above hypothesis in our design of FocalSpace (Figures 3, 4), a video conferencing system that dynamically tracks the foreground and diminishes the background through synthetic blur effects.

Figure 3: FocalSpace System. It has 3 depth cameras and 3 microphone arrays to track the video and audio of an entire video conferencing room.

Figure 4: FocalSpace central configuration. Three depth cameras are arranged so that 180 degrees of the scene can be captured. One microphone array is integrated in each depth camera.

Contribution

Vision

• DR as the Filter: We propose dynamic DR (Diminished Reality) as a basic approach for filtering information and organizing people's focus and attention in visual communication.
• Natural Blur: We apply synthetic blur as the main visual effect to diminish the background. Synthetic blur has been shown to function as a natural means of helping people gain a central focus while remaining aware of the context.

Usefulness

• To provide sufficient detail for rendering items of interest in the foreground, even with very limited transmission bandwidth and screen real estate.
• To diminish peripheral visual and audio information, and to give cognitively natural, computationally effective cues for selective attention.
• To remove unwanted background noise, or to protect the privacy of information in the background.

Technical contribution

We developed an interactive video conferencing system based on a customized technical solution. Enabled by a depth map of the meeting room captured by 3 depth cameras and 3 microphone arrays, the FocalSpace system detects interaction cues (such as audio cues, gesture cues, proximity cues, and physical markers) and builds a dynamically changing "layered space model" on top of them. The tracked foreground participants and objects are taken in and out of focus computationally, with no need for additional lenses or mechanical parts. Because we can infer many different layers of depth in a single scene, multiple areas of focus and soft transitions in focal range can be simulated, as sketched below.
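To make the idea of computational refocusing concrete, the following is a minimal illustrative sketch, not the thesis implementation, of how a per-pixel depth map can drive synthetic blur with a soft focal transition. The function name, parameters, and the use of OpenCV and NumPy are our own assumptions for illustration.

import cv2
import numpy as np

def synthetic_focus(rgb, depth_m, focus_center, focus_halfwidth=0.4, max_kernel=31, levels=4):
    """Blur each pixel according to how far its depth lies from a focal band.

    rgb:      HxWx3 uint8 color frame.
    depth_m:  HxW float depth map in meters (e.g. from a Kinect-style sensor).
    """
    # 0 = inside the focal band (kept sharp), 1 = far outside it (maximally blurred).
    dist = np.clip((np.abs(depth_m - focus_center) - focus_halfwidth) / focus_halfwidth, 0.0, 1.0)

    # Precompute a small stack of progressively blurred copies of the frame.
    stack = [rgb.astype(np.float32)]
    for i in range(1, levels):
        k = int(max_kernel * i / (levels - 1)) | 1   # Gaussian kernel sizes must be odd
        stack.append(cv2.GaussianBlur(rgb, (k, k), 0).astype(np.float32))

    # Pick a blur level per pixel; interpolating between adjacent levels would give an
    # even softer transition, omitted here for brevity.
    idx = np.rint(dist * (levels - 1)).astype(np.int32)
    out = np.zeros_like(stack[0])
    for i, blurred in enumerate(stack):
        out = np.where((idx == i)[..., None], blurred, out)
    return out.astype(np.uint8)

Multiple simultaneous areas of focus can be supported by taking the minimum of this distance over several focal bands. Since heavily blurred regions contain little high-frequency detail, frames processed this way also tend to compress better under standard video codecs, which is consistent with the bandwidth savings claimed above.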
Outline

An "evolutional" approach has been adopted for the structure of this thesis (Figure 5). The thesis starts with inspiration from photography and real-life visual experience, and then moves to an overview of computer-mediated reality, surveying work by researchers, artists, designers, and industry practitioners who have thought about altering human perception of the real world. Chapter 3 presents a design framework of "Diminished Reality + Augmented Reality"; it explains the idea of dividing visual content into a background layer and a foreground layer in order to visually simplify and filter information. Following the design framework, the definition of "foreground" in the video conferencing room is expanded from "talking heads" to all the physical artifacts involved in foreground activities, and a broader approach to emphasizing or augmenting the foreground is explored as well. Various use cases are explained in Chapters 4 through 6, with detailed descriptions of the design concepts, system setup, technical implementation, and user feedback. The chapter that follows describes the design vision beyond the video conferencing room: the conceptual framework of "DR + AR" is discussed in the contexts of driving and of watching sports games.

Figure 5: An "evolutional" approach to outlining the thesis: Real world inspiration (Chapter 1), Design Framework (Chapter 3), Use Cases (first half of Chapter 4), Expanded FocalSpace (second half of Chapter 4), Expanded Domain (Chapter 7).

RELATED WORK

"WHAT YOU SEE IS NOT REAL"

This chapter introduces related work in two categories. We first introduce efforts by researchers to create a computer-mediated visual perception of reality, followed by an introduction to research on human focus and attention on screens.

Computer Mediated Reality

Visual perception of the reality

It is a common belief that computation no longer sits inside the computer screen; computation is everywhere. Mobile phones and tablets make computation portable, giving people the feeling that digital information is always around, regardless of time or place. But currently, people can still tell where the physical world ends and the digital world begins, as when they switch their eyes from their mobile phones to the real road in front of them. What if, in the future, you cannot trust your eyes anymore? What if you cannot tell what is reality and what is reality modified by the computer? This situation might be scary in some contexts, but it is becoming an inevitable part of the future. Steve Mann and his group developed EyeTap, a camera-and-display "eyeglass" that can show digital information at a proper focal distance (Steve Mann, 2002). This technology can process and alter what the user sees of reality. Computer Mediated Reality is a concept introduced with EyeTap (Figure 6). It is "a way to give humans a new perspective on reality, by adding, subtracting or other ways to manipulate the reality": a visual perception of the real world mediated by computational technology.

Figure 6: EyeTap. It is a device that can be worn as an eyeglass.
It gives humans a new perspective on reality through computer mediation.

Augmented Reality

As the most common type of Computer Mediated Reality, Augmented Reality (AR) has been widely explored and developed for different use cases. As described on Wikipedia (AugmentedReality, 2012), Augmented Reality is "a live, direct or indirect, view of a physical, real-world environment whose elements are augmented by sensory input such as sound, video, graphics or GPS data". AR applications, especially those based on mobile platforms, are starting to be widely used in various domains, such as consumer applications, entertainment, medicine, and design. AR can be used both indoors and outdoors with various technical solutions. This paper will not address AR in depth, as AR itself is an open-ended topic; Ronald Azuma gives an overview of the current state of AR (Azuma, 2004). It should be noted that researchers from the University of North Carolina and the University of Kentucky envisioned a future office with every planar surface augmented with digital information for remote communication and collaboration (Figure 7). This is relevant to the topic of this thesis, as the "Office of the Future" also utilizes per-pixel depth tracking to learn about visible surfaces including walls, furniture, objects, and people, and then to "either project images on the surfaces, render images of the surfaces, or interpret changes in the surfaces" (Ramesh Raskar G. W., 1998).

Figure 7: The "Office of the Future" depicts a vision of a future office where all visible surfaces, including walls, furniture, and objects, can be detected and utilized to hold rendered images.

Time Augmented Reality

The term "Computer Mediated Reality" alludes to a broader goal beyond simply augmenting the surrounding environment with GUI or GPS-related data. ART+COM (ART+COM, 2012) has two projects, Timescope and Jurascopes, which augment the perception of reality along the time axis; they are innovative examples of "Computer Mediated Reality". Timescope (Figure 8) was installed at the site of the former Berlin Wall. It enables viewers to travel back in time at their present location: through a media scope, people can take a trip back to see the history of a structure that defined the city for over 30 years. Jurascopes (Figure 8) were installed in the Berlin Museum of Natural History. Through these media telescopes, viewers can see, one after another, inner organs, muscles, and skin placed on top of the skeleton of a dinosaur. Eventually, the dinosaur appears to come to life and run around. Sounds from the environment and from the animal itself contribute to the experience.

Figure 8: Timescope (Left) and Jurascopes (Right) from ART+COM. Timescope enables viewers to travel back in time at their present location. Jurascopes turn dinosaur skeletons into live dinosaurs through an augmenting lens.

Improved Reality

"Improved Reality" is an informal term describing the approach of enhancing or enriching the visual perception of reality. Artvertiser is an example of this phenomenon. The project enables users to view a virtual canvas on top of an existing advertisement through a mobile device (Artvertiser, 2012). The detected advertisement can be on a building, in a magazine, or on the side of a car. Artists can create their own visual art on top of the virtual canvas via mobile devices (Figure 9).
Figure 9: Artvertiser, a project that replaces advertisements on billboards with digital artwork.

One may notice that the technique of "Diminishing Reality" is normally used in combination with other approaches to "Computer Mediated Reality". In most cases, augmented content is added on top of the diminished reality.

Altered Reality

There is an old Chinese saying: "Although it is the same mountain, it looks like a ridge seen from the front and turns into a peak seen from the side; it appears different depending on how far and how high the spectator stands." In the same manner, if we change the perspective from which people see reality, an altered world can exist in human perception. The artist Carsten Höller made goggles out of optical glass prisms that he called "Umkehrbrille Upside Down Goggles" (Figure 10). Wearing these goggles, people perceive the real world upside down. In the 1890s, George Stratton conducted an experiment to see what would happen if he viewed the world upside down through an inverting prism. He found that after 4 days his brain started to form a perceptual adaptation, and he could see the world the right way up again (Stratton, 1896).

Figure 10: Umkehrbrille Upside Down Goggles by Carsten Höller. By wearing these goggles, people can perceive the real world upside down.

Rather than altering perspectives based on human vision, Chris Woebken and Kenichi Okada developed "Animal Superpowers" to alter the real world with animal senses and perceptions (Chris Woebken, 2012). "Animal Superpowers" comprises three wearable devices that mimic animal senses at the human perceptive level (Figure 11). "Ant" makes people feel like ants by magnifying human vision 50x through a microscope in the hand; the device lets users see through their hands and explore tiny cracks and details of a surface. The "Bird" device borrows birds' capability of recognizing direction: it uses GPS to vibrate when people move in a certain direction, such as toward home or an ice cream store. Finally, the "Giraffe" device extends users' necks and gives them the ability to see tall things; it can act as a child-to-adult converter by raising a child's perspective by 30 centimeters.

Jeff Lieberman, who created "Moore Pattern", a kinetic optical-illusion sculpture (Lieberman, 2012), subscribes to the notion of "seeing is believing". Within the context of this document, "seeing" can be mediated by computation, which generates a unique perception, or "believing", for people. That perception might extend human capability, offer a unique experience, or serve as entertainment.

Figure 11: "Animal Superpowers" includes three wearable physical devices that mimic animal senses at the human perceptive level. "Ant" makes people feel like ants by magnifying human vision 50x through a microscope in the hand. "Bird" borrows birds' capability of recognizing directions. "Giraffe" can act as a child-to-adult converter by raising children's perspective by 30 centimeters.

Diminished Reality

Compared to AR, much less attention has been devoted to other types of Computer Mediated Reality, including Diminishing Reality, and a clear definition of Diminishing Reality has yet to be offered. Steve Mann and his colleagues were among the first to bring the term "Diminishing Reality" into the HCI field.
The Reality Mediator allows the wearer's visual perception of the real world to be altered so that users can freely diminish or modify visual detritus. It also ensures that augmented information can be added on top of the diminished reality without causing information overload. For example, road directions or personal text messages, as a form of digital augmentation, can be displayed on top of a board used for advertising in reality (Figure 12).

Figure 12: Diminished and augmented planar surface. Through wearable goggles, road directions or personal text messages, as a form of digital augmentation, can be seen on top of a board used for advertisements in reality.

Steve Mann also created art pieces along the same conceptual lines. On a monitor screen facing the real environment, viewers could see a different state of the reality the monitor was pointing at: for example, part of the scene with the fog removed on a foggy day, or a brightly lit view in the dark (Figure 13). Such art pieces expand people's perspectives on reality; observers can see reality beyond a specific point in time.

Figure 13: Artwork by Steve Mann. Diminishing fog and darkness.

Scalado is a company focusing on creative video capturing and viewing tools (Scalado, 2012). By embedding various real-time image-processing technologies, it augments captured images with angles, time, perspectives, digital content, etc. One feature related to Diminishing Reality is called "Remove". This feature highlights background elements and enables users to select and delete them. For example, users can easily delete passers-by while keeping a focal figure in the foreground of a street picture (Figure 14). The camera works by capturing several images once the shutter is triggered and computing the differences between the captured images. Similar background-subtraction technology is used for real-time image processing.

Figure 14: "Remove" from Scalado: mobile software for capturing and editing photos that diminishes background detritus.

Work related to Diminishing Reality can be categorized according to different criteria. It can be categorized by triggering cue: the DR effect can be triggered by voice, eye gaze, manual selection through a GUI, or manual selection through pointing devices. It can also be categorized by domain: DR can be used for art pieces, educational applications, entertainment, daily life, work, and so forth.

Diminished "Digital Reality"

There are two points in the concept of "Diminishing Reality" as used in this paper: diminishing, or de-emphasizing, part of the scene perceived by human eyes; and leaving the viewer with a new perception of the reality. We will describe some graphical interface designs and data visualizations that seek to diminish or filter information on a visual display. Technically speaking, the issue here is not diminishing "reality", as everything is virtually displayed. However, these designs share the same concept of simplifying the background visual content, or context, and helping viewers focus on the current foreground. Some researchers call this an "attentive display". It is helpful to review some of these projects and learn how a visual contrast between foreground and background information can be created. In their paper "Semantic Depth of Field", Kosara et al.
divided the design approaches of attentive visualizations into three types: spatial methods, dimension methods, and cue methods (Robert Kosara, 1997). In spatial or distortion-oriented methods, the geometry of the display is distorted to allow magnification of the central focus. Various such methods have been put into practice, including fisheye views (Furnas, 1986) (M. Sarkar, 1994), stretchable rubber sheets (M. Sarkar, 1993), distorted perspectives, and hyperbolic trees (Munzner, 1998). For objects that carry a large amount of data, only a small part of the data may be shown as an overview, with another dimension of the data displayed on selection; examples are magic lenses (M. C. Stone, 1994) and tool glasses (E. A. Bier, 1993). Finally, the cue method, which is the method most relevant to "Diminishing Reality", makes the foreground data noticeable not by changing locational relationships, but by assigning certain visual cues that emphasize its features. One example is the Geographic Information System (GIS): when different types of data are shown on top of the same layout, the central information is emphasized with higher color saturation and opacity. Semantic depth of field (DOF) (Robert Kosara, 1997) is another cue method. Adding "semantic" in front of DOF means that the blur is applied uniformly, without depth information, to the parts that are out of focus (Figure 15).

Figure 15: Diminishing Reality in Data Visualization and Software Applications. (Left) A file browser showing today's files sharply, with older ones blurred; (Right) A chess tutorial showing the chessmen that threaten the knight, with other inactive pieces blurred.

Adjusting the transparency or blurring the background is a common method in web user interfaces to highlight the foreground and filter out visual detritus for viewers. The project website of Tribal DDB (DDB, 2012) is one example (Figure 16). This visual effect helps viewers concentrate better on the current information, whether it is a webpage or a video.

Figure 16: The company webpage of Tribal DDB. By adjusting the transparency or blurring the background, we can highlight the foreground and filter out visual detritus for the viewers.

Some video streaming websites have a button labeled "turn on/off the light" as a selection of video modes (youKu, 2012). When the "light" is off, as in the right image in Figure 17, the background of the video turns black, removing all other visual elements such as reviews, ads, menus, and titles. The name of the feature, "Turn on/off the light", reminds people of turning the lights off in a movie theater; indeed, that is a real-life scenario where "Diminishing Reality" is used naturally.

Figure 17: "Turning on and off the light" on YouKu. (Left) When the "Turn on the light" button is on; (Right) When the "Turn off the light" button is on [29].

We have offered only a glimpse of related work in this category. There was, and is, a strong research community focused on how to visualize data effectively while keeping a balance between focus and context.
A bigger question we want to ask ourselves is this: in the past, when we had too much virtual data, researchers devised creative ways to manipulate visualizations for easy human perception. Now, as we start to encounter a large amount of visual and artificial information in the physical world, shouldn't we take action and think about ways to organize the reality humans perceive as well?

Abstracted Reality

The non-photorealistic camera takes a non-photorealistic rendering approach to "capture and convey shape features of the real world scene" (Ramesh Raskar K.-H. T., 2004). By augmenting a camera with a multi-flash unit that casts shadows from different directions while the picture is captured, it is possible to highlight outlines, simplify textural information, and suppress unwanted details of the image (Figure 18). In the use case of video conferencing, by abstracting the face of the talking head, we could transmit the shape and facial expression accurately without showing detailed texture; this would be useful for maintaining privacy.

Figure 18: Non-Photorealistic Camera. By augmenting a camera with a multi-flash unit that casts shadows from different directions while capturing the picture, it is possible to highlight outlines, simplify textural information, and suppress unwanted details of the image.

In the next example of computer-generated artwork (Doug DeCarlo, 1991), researchers argue that abstract rendering of a photorealistic image is one way to clarify the meaningful structures of an image in information design. The abstract rendering (Figure 19) was generated by transforming a photograph based on viewers' eye-gaze fixations. It has been shown in the art and design field that abstract renderings can, in some cases, communicate more effectively than realistic photos. The goal of this work is to computationally abstract a picture and create a non-photorealistic art piece. One question is to what extent the system should simplify the visual components of the real picture. Based on eye-gaze data collected while different viewers looked at the same picture, the parts of the picture attracting the most attention were left with the most detail, and the other parts were abstracted much more heavily. This is a good example of using "Diminishing Reality", but in an abstract way.

Figure 19: The abstracted rendering is generated by transforming a photograph based on viewers' eye-gaze fixation. (Original photo courtesy of philip.greenspun.com.)

Focus and Attention

"What's the information people actually care about?"

Awareness of focus for video conferencing

In FocalSpace, we care about focus and attention. How can we take users' focus and attention into consideration in the display? For remote video conferencing, researchers have explored different methods of tracking and estimating attendees' attention and focus. Eye gaze and sound have been commonly explored as potential cues by utilizing omnidirectional cameras, multiple web cameras, and speaker arrays. The idea of identifying focus and attention in a video conference dates back to a "voice voting" system that automatically switched between multiple camera views based on the active speakers (ROBERT C EDSON, 1971). Research in video conferencing has long been concerned with remote listeners' focus and attention. On one hand, some systems actively communicate the proper focus to remote users.
On the other hand, some systems try to give remote users the flexibility to access the part of the remote scene they want to focus on. In "Reflection", researchers overlay the reflections of all participants from different remote locations onto the same display (Stefan Agamanolis, 1997). Auditory cues are used to track and emphasize the foreground: the active speakers are rendered opaque and in the foreground to emphasize their visual prominence, and other participants are rendered slightly faded in the background in a manner that maintains their presence without drawing too much attention. The Clearboard system explores how to seamlessly combine the working space with interpersonal space (H. Ishii, 1994); this system saves participants the effort of switching attention between the two spaces. Layered video signal processing for conferencing has also been explored, with a stark and clear differentiation of foreground from background. A number of real-time algorithms (Zhang, 2006) have been implemented to blur background content in order to protect attendees' privacy, while only parts of the face are kept clear. Moreover, HyperMirror used blue-screen technology to layer one participant into a scene containing another (Maesako, 1998).

Focus and attention in other application domains

We would also like to discuss focus and attention beyond video conferencing systems. In collaboration with Tribal DDB Amsterdam and the Grammy award-winning orchestra Dutch Metropolis, Philips Sound developed an interactive orchestra system, "Obsessed with Sound". Through the interface, users can interactively select and single out any one of 51 musicians and hear every detail of that musician's contribution to the orchestral piece. The visual cue that indicates the selection blacks out the other musicians and leaves the selected musician in sharp focus and full color saturation (sound., 2012). When people listen to an orchestra, it is very hard to focus on a single player. The designers believe that every detail in the music is important and should be heard. To capture this, they created this unique campaign to celebrate each individual artist behind every musical moment. A special orchestral piece was recorded in 55 separate music tracks and combined in a single music system. As they listen to each musician, users can also discover what is behind the sound, from the total hours the musician has played in their lifetime, to the number of notes they played in the piece, to their Twitter feeds, Facebook accounts, and personal webpages (Figure 20).

Figure 20: "Obsessed by Sound," an interactive music system that enables users to single out a particular musician and hear every detail of his or her contribution.

In Fixation Maps, a study (Wooding, 2002) has shown that by controlling luminance, color contrasts, and depth cues, artists can guide the viewer's gaze towards the part that expresses the main theme. By aggregating fixation data from 131 subjects viewing Paolo Veronese's Christ addressing a Kneeling Woman, an artist noticed that subjects' average eye gaze is drawn to the two figures, Jesus and the woman looking at him. Based on this finding, the artist blacked out the rest of the painting to create a more focused art piece (Figure 21). The modified work has a spotlight effect on the two characters that attracted the most attention, and thus directs viewers' attention to the most attractive portion of the painting.
Figure 21: Aggregated fixations from 131 subjects viewing Paolo Veronese's Christ Addressing a Kneeling Woman. Subjects' gaze is drawn to the two main figures. (Original image © National Gallery, London; annotations © IBS, University of Derby, UK; courtesy of David Wooding.)

Other artwork has also addressed focus and attention. The ECS Display was used to track a user's point of gaze from a distance and to display artwork with the attended areas visually highlighted (Reingold, 2002). The approach of dynamically filtering information based on users' interest has been applied in a range of applications; most of this work uses eye-gaze tracking to obtain people's points of attention on the display. In short, different visual effects can convey different meanings, and different foci on the same display can generate different interpretations. Through the manipulation of visual effects, we can direct people's attention to certain portions of the screen.

Moreover, work on gaze-contingent displays (GCDs) has produced multi-resolution and, ultimately, highlighting visual effects (Figure 22). Almost all of this work, however, started from a functional perspective: it aimed either to save rendering power for large displays (Reingold, 2002) or to save transmission bandwidth. As Loschky and McConkie concluded, researchers wanted to create a display that, although distinguishable from a full-resolution image, does not deteriorate visual task performance (Loschky, 2005). In other words, the underlying hypothesis is that the highlighting or partial-blur effects should, at most, not harm performance. We instead use these highlighting effects in the opposite direction, as a positive visual factor for efficient communication.

Figure 22: Gaze-contingent display. The system tracks the viewer's gaze and renders the attended portion in full resolution, while rendering the other parts in low resolution to save rendering power on a large display.

The motivation for including synthetic focal points derives from the constrained size of screen displays; a system may want to call attention to specific details while keeping viewers aware of the global context. Moreover, additional methods may be employed to bring the user's attention to the most relevant, central information on the display, as shown with the spotlight technique (AzamKhan, 2005). As discussed, most attention-based display systems have been designed to address extremely specific problems (Rainer Stiefelhagen, 2001). FocalSpace collectively utilizes techniques of focal points and background subtraction to bring attention to dynamic points of interest.

IMPROVING FOCUS BY DR

This chapter introduces the "layered space model" enabled by the FocalSpace system, and the various interaction cues that can be used to build the dynamic spatial layers.

The "Layered Space Model" of FocalSpace

We divide the remote space into three discrete layers, as shown in Figure 23. These layers are (1) the background, (2) the active and inactive foreground, which contains participants and relevant objects, and (3) the augmented layer, which includes content such as digital information contextually registered to the foreground and graphical user interface elements. These layers exist in parallel for the user, but our system's understanding of both depth and object annotation allows us to keep a rich dataset of the distant space and its inhabitants.
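As a rough illustration of this model (not the actual FocalSpace implementation; all names here are hypothetical), each tracked element of the remote scene can carry a layer tag that the renderer uses to decide whether to keep it sharp, blur it, or draw augmented content on top of it:

```cpp
#include <string>

// Hypothetical sketch of the layered space model: each tracked element in the
// remote scene carries a layer tag that the renderer uses to decide whether to
// keep it sharp, blur it, or draw augmented content on top of it.
enum class SpaceLayer {
    DiminishedBackground,   // irrelevant pixels, rendered blurred
    InactiveForeground,     // relevant but currently inactive people and objects
    ActiveForeground,       // current speaker, active surfaces, selected objects
    Augmentation            // nametags, shared drawings, contextual dialogs
};

struct SceneElement {
    std::string label;        // e.g. "participant:Tony" or "surface:flipchart"
    float       depthMeters;  // distance from the depth camera
    SpaceLayer  layer = SpaceLayer::DiminishedBackground;
};

// An interaction cue (voice, gesture, proximity, physical marker) promotes an
// element to the active foreground; when it goes idle, it is demoted again.
void onCueDetected(SceneElement& e) { e.layer = SpaceLayer::ActiveForeground; }
void onCueReleased(SceneElement& e) { e.layer = SpaceLayer::InactiveForeground; }
```

Promoting and demoting elements between layers is then driven entirely by the interaction cues described later in this chapter.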
The diminished reality effect and the other FocalSpace use cases are implemented on top of these space layers. For example, the system can diminish or subtract the background layer by blurring it out, and augment the foreground layer, depending on the interaction scenario. This technique of augmenting diminished reality balances cognitive load by adding information only once unnecessary information has been removed.

Figure 23: Dynamic Layered Space Model. FocalSpace divides the space into a foreground layer (active and inactive), a diminished background layer, an augmented layer, and a graphic layer.

Foreground layer

Conference participants and objects of interest exist within the foreground layer. Using depth, audio, skeletal, and physical-marker recognition, we are able to automatically identify elements within the foreground. These elements are generally the conference participants themselves, along with contextual objects such as planar surfaces and arbitrary objects. The methods of detection are further discussed in the features and techniques section. We divide the foreground into an active foreground layer and an inactive foreground layer. The active foreground layer contains the tracked active elements, while the inactive foreground layer contains elements that are relevant and important but not currently active. Elements on the active foreground layer are always rendered sharply in focus on the remote display.

Diminished background layer

The background layer refers to the irrelevant visual elements, sometimes the distracting noise, that are transmitted during normal video conferencing. Visual and auditory clutter can add unwanted distraction, especially when participants are in public or heavily trafficked areas. In order to reduce visual noise and ultimately create what will become the active foreground, we utilize the technique of diminished reality, where the first layer of contextually inactive physical space is removed through a number of different methods. Inactive elements are blurred to simulate focus: objects within the immediate focus of the active foreground stay sharp, while objects in the background become blurry or out of focus.

Add-on layer

Remote participants are able to observe and interact with the augmented foreground, which presents a number of interactive menus and dialogs that are spatially and temporally registered with objects found in the active foreground. These augmentations can include user nametags, contextual dialogs, and shared drawings and documents, among other features otherwise invisible to a remote viewer. Finally, the GUI layer presents the user with a suite of controls to manage the FocalSpace system. The GUI layer allows fine control over the diminished background layer, enabling the user to specify effects and their respective intensity.

Interaction Cues of the "Layered Space Model"

Interaction cues influence people's focus and attention directly. We categorize the cues that might indicate the important and active foreground activities in videoconferencing, using the human body as a starting point to explore the relevant behaviors. Physical artifacts are taken into consideration as well (Figure 24, Table 1).
Figure 24: To categorize the interaction cues that can be used to build the Layered Space Model, we use the human body as an entry point to explore the relevant behaviors in videoconferencing. The environment and physical artifacts are taken into consideration as well.

Human behavior    Interaction cue in FocalSpace
Listen            Audio cue
Talk              Audio cue
See               Gaze cue
Gesture           Gesture cue
Touch             Proximity cue; physical marker as an interactive cue
Move              Proximity cue; physical marker as an interactive cue

Table 1: Interaction cues in FocalSpace

For most remote meetings, the talking heads are the important focus, and the voice cue can be used to track the active talking head. Beyond the voice cue, participants' iconic gestures, movements into certain locations, and even eye contact should be taken into consideration as well. The Layered Space Model uses depth sensors and various cues such as voice, gestures, and physical markers to identify the foreground of the video and subsequently diminish the background. The cues and the related interactions are discussed in this section.

Audio cue

In group meetings, the most interesting foreground is typically the current speaker (Vertegaal, 1999). The primary method of determining the foreground is therefore to detect whether or not participants are actively speaking. Audio conversation is a central component of remote collaboration, so we can use it to infer activity (Figure 25). Remote participants who begin to speak are moved from the inactive to the active foreground layer and automatically focused. In order to detect the absolute location of an active participant, we use a combination of the audio direction derived from a microphone array and computer vision for skeletal tracking. Once a substantial skeletal form is detected within a reasonable angle of the detected audio, we can deterministically assign the corresponding RGB and depth pixels to a participant.

Figure 25: Voice Activated Refocusing. (Left) User 1 is talking and in focus; (right) User 2 is talking and in focus. In FocalSpace, remote participants who begin to speak are placed into the active foreground layer and automatically focused.

Gestural cue

FocalSpace enables the detection of pre-defined gestures. For trained users, gestures can be used to tag content, trigger certain interactions, and invoke certain features. For new users, FocalSpace can track their natural gestures. Take one use case as an example: in a videoconference with many participants, it can be hard to get attention. In classical meetings, a meeting leader typically keeps track of the people who raise their hands. FocalSpace can track the "right hand up" gesture automatically and put the waiting person into focus together with the currently active participant (Figure 26); a minimal sketch of such a detector follows this section.

Figure 26: Gesture Activated Refocusing. Natural human gestures can be detected in FocalSpace. In one use case, the system puts a person who raises a hand onto a waiting list, and people on the waiting list are brought into focus as well.
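The following is a minimal, hypothetical sketch of how a "hand up" cue might be checked against tracked skeleton joints. The joint structures, names, and threshold are illustrative assumptions, not the Kinect SDK's actual types:

```cpp
// Illustrative joint positions in camera space (meters); in practice these
// would come from the skeletal tracker on every frame.
struct Joint { float x, y, z; };

struct Skeleton {
    Joint head;
    Joint rightHand;
};

// Hypothetical "hand up" test: the hand counts as raised when it sits a few
// centimeters above the head in the current frame. A real detector would also
// require the pose to persist over several frames to reject noise.
bool isHandRaised(const Skeleton& s, float marginMeters = 0.05f) {
    return s.rightHand.y > s.head.y + marginMeters;
}
```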
Proximity cue

Not all forms of communication during remote collaboration are purely auditory; many forms of expression take the shape of the written word or illustration. We explore the automatic detection of activity around planar drawing surfaces, such as flip charts and whiteboards. When an active participant moves within a certain distance of a planar surface, the surface is moved from the inactive to the active foreground, which triggers automatic refocusing (Figure 27). The dynamic detection of the distance between an active human body and a planar surface is what we call the proximity cue.

Figure 27: Proximity Activated Refocusing. (Top) The user is talking and in focus; (bottom) when the user moves to the flip chart and starts to sketch on it, the flip chart takes focus. In a calibrated space, the system knows which region of the depth map belongs to the flip chart, so the proximity relationship between the flip chart and the user can be tracked through the depth map.

Physical marker as an interactive cue

To detect arbitrary physical objects, such as a prototype model from a design meeting, we introduce physical markers (Figure 28). Defining the role of remote objects as tools of expression and communication is an important component of the FocalSpace system. Participants can invoke selection by moving a physical token with identifiable markers next to the object; all arbitrary objects sharing the same 3D locational range are then focused as foreground. Object-activated selection contains additional features that augment the remote participants' ability to distinguish the object's state through spatial, contextual augmentation; we discuss this case further in the user scenarios section. Moreover, a physical marker can be used as a command during the videoconference. For example, by tagging a flip chart with a physical "email" marker, the system can receive the command, capture the image, and share it with the group mailing list.

Figure 28: Physical marker as an interactive cue to trigger tracking, refocusing, and augmentations.

Remote User-Defined / Discretionary Selection

For participants who wish to define an area, person, or object of interest at their discretion, a user-defined selection mode is provided. Participants are given the ability to override both the automatic voice- and object-activated modes of selection by manually choosing the area of interest. Interactive objects within a remote space are presented as hyperlink objects with common user interface behaviors such as hover and click states. Participants are able to select people and objects directly within the environment using a cursor, or from an enumerated list of active objects found within the accompanying graphical user interface.

Other Potential Cues

Beyond the semantic cues mentioned above, there are more potential cues we could look into. For example, the remote user might want to look at the person whom the speaker on the screen is looking at. In this case, the cue is the third person's eye gaze.

USER SCENARIOS

This chapter explains how FocalSpace can be useful in different use cases, which include filtering visual detritus and directing focus, saving display space for augmentation, saving bandwidth, and keeping privacy.

We develop the applications on top of the "Layered Space Model" in three steps: attention, augmentation, and transmission. First, FocalSpace cleans up the visual detritus in the background and directs remote participants' attention to the central focus. Second, to augment or enhance the foreground activities, we utilize the diminished space to augment the focus with related digital content. Finally, with a blurred and compressed background, we save bandwidth for transmission (Figure 29).
Figure 29: The applications are developed on top of the "Layered Space Model" in three steps: attention, augmentation, and transmission.

Filtering Visual Detritus and Directing Focus

In some cases, the background layer consists of unwanted noise that is transmitted during a videoconference. Visual and auditory clutter adds unwanted distraction, especially when participants are in a dynamic working environment. By removing the unwanted background noise, we are able to increase communication bandwidth and direct participants' focus toward the foreground activity in a cognitively natural and effective way (Figure 30). This scenario demonstrates that FocalSpace allows effective video conferencing in flexible environments.

Figure 30: FocalSpace can effectively diminish the background detritus and direct remote participants' focus to the foreground information.

Saving display space for augmented information

VideoOrbits (Steve Mann, 2002) demonstrates ways to alter the original content of an advertising board and augment digital information on its surface. The project takes Diminishing Reality as an approach to allow additional information to be inserted without causing the user to experience information overload. On top of the FocalSpace system, we have implemented several applications by augmenting the diminished reality. Through these applications, we show how the diminished space can be effectively utilized with augmented content that is highly relevant to the foreground people or objects.

By augmenting participants, we create context to assist in conversation and collaboration. Participants within FocalSpace can be augmented with rich metadata about their personae. These data can include the participant's name, title, email, Twitter account, calendar availability, shared documents, and total speaking time, among other information. Participants are augmented with virtual nametags that are spatially associated with them (Figure 31).

Figure 31: (Left) The name-and-time metaphor from a real-life scenario. (Right) The corresponding feature in FocalSpace: a participant with an augmented nametag displaying the user's name and total talk time.

Beyond talking, more dynamic activities in the foreground can be augmented. To make information sharing flexible regardless of where it happens, the system can track the behavior of the active participants and display the best perspective of the knowledge-sharing surfaces to the other side. Participant augmentation can be activated through the voice cue, the gesture cue, or manual selection. In manual mode, participants are presented as clickable objects that can be opened and closed, bringing them in and out of the active foreground at the remote viewer's discretion.

In addition to augmenting the participants, we present a technique for augmenting objects whereby the relevant portions of an object or surface become more visibly clear to the remote participant. Motivated by the general difficulty of expressing visual concepts remotely, FocalSpace provides spatially registered, contextual augmentations of pre-defined objects within the system.
One such application is sketching on surfaces. By capturing a video stream of a planar drawing surface with an additional camera, we are able to display a virtual representation of the surface to the remote user, shown as if viewed perpendicular to the paper itself. We demonstrate sketching on paper flip charts and a whiteboard. This technique allows remote users to observe, in real time, planar surfaces that would otherwise be invisible or illegible. For example, if the system detects that a person moves toward the flip chart and starts drawing, a high-resolution front view of the flip chart video stream is displayed to the remote user automatically (Figure 32); sketching on the table can be displayed to remote users in a similar way (Figure 33).

Figure 32: Contextual augmentation of local planar surfaces. The augmented perspectives are captured through satellite web cameras. A proximity cue is the trigger that opens up the augmented flip chart.

Figure 33: Sketching on the planar surface can be detected and augmented for the remote users.

Going one step further, a digital sketching surface on a tablet or computer can also be augmented and shared without losing context or reference to the conversation. Both the local and remote participants are able to explore visual concepts on a shared whiteboard, which exists both in the physical space and within the virtual space. At any point, the remote participant can click the spatially registered display window, bringing it to the GUI layer, and begin to contribute (Figure 34).

Figure 34: Contextual augmentation of shared digital drawing surfaces. The surface is spatially registered with its respective user. The augmented view of the canvas can be enlarged for local participants.

Saving Bandwidth for Transmission

Since the diminished background may not be informative for the remote user, the system can reduce the transmission bandwidth while keeping the foreground information. Figure 35 shows the bandwidth evaluation results comparing FocalSpace and regular videoconferencing. The video sequences are compressed with the same JPEG 2000 standard at a rate of 0.05 bit/pixel. A general videoconferencing system allocates the bandwidth uniformly, so the quality of its foreground layer is worse than that of FocalSpace. This feature can be implemented easily in FocalSpace by using the region-of-interest (ROI) rate-allocation technology of the JPEG and MPEG standards, e.g., JPEG 2000 and the scalable video coding extension of H.264/AVC.

Figure 35: Three degrees of blur, with dramatically different data sizes. The original JPEG 2000 image is 309 KB; it becomes 126 KB, 109 KB, and 7 KB at compression rates of 1.387828, 1.202384, and 0.079389, respectively.

Keeping Privacy

As FocalSpace allows flexible control of the layered background and foreground, it can keep information in the background safe and allows a more flexible video conferencing environment. This advantage can be utilized in certain situations, such as isolating irrelevant activities or background noise from the scene displayed to the remote user.

IMPLEMENTATION: "SEEING DEPTH"

This chapter explains the system setup and software development of FocalSpace.

System

Our interaction design focuses heavily on building this conferencing experience around existing meeting environments. For a common meeting-room setup, that often means a long or round table with a TV or projected display in front.
By adding depth cameras in front of the display, and extra webcams pointing at selected planar surfaces in the meeting room, we can easily adapt our system to existing setups.

Setup

We divide the system into a central setup and peripheral devices. The central setup includes a large display and three depth cameras. The peripheral setup includes other add-on devices, such as high-resolution satellite webcams pointing at specific locations for augmented-reality scenarios.

The central setup includes a large display and three depth cameras (Microsoft Kinect for Windows), each with an integrated microphone array. The cameras are placed shoulder to shoulder on top of the display to ensure good coverage of the meeting room, and participants sit around the table facing the display and cameras (Figure 36). If the three cameras are aligned with 120 degrees between them, we obtain a seamless view of the space they cover. To make the capture more flexible, a rotatable frame was built to hold the three cameras; by adjusting the angle of the frame, different parts of the conference room can be captured.

For the peripheral setup, a number of standard web cameras are placed in the local environment to capture drawing areas or objects of interest, as best fits the topic of the conference (Figure 37). Some of these cameras are mounted near a whiteboard in order to capture a higher-resolution image of sketching, as well as over table surfaces, where a lamp webcam captures paper sketching. Augmented views from the cameras can be triggered by either automatic detection or manual selection.

Figure 36: FocalSpace central configuration. Three depth cameras are arranged so that 180 degrees of the scene can be captured. One microphone array is integrated into each depth camera. Depth sensing and human skeleton tracking are enabled by the three depth cameras.

Figure 37: Satellite cameras point at pre-defined areas for potential augmentations. For example, the camera pointing at the flip chart can give a high-resolution augmented view of the flip chart's contents when someone moves close to it.

Additionally, the local space may incorporate tablet devices, in our case an Apple iPad 2, for use during collaborative drawing.

Tracking Techniques

We have discussed the different interaction cues that trigger the detection of space layers. The detection of these cues is based on depth sensing in combination with human skeletal tracking and physical marker recognition. We chose human skeletal tracking and physical marker tracking, over other methods such as face recognition or background subtraction, as the means of detecting the active foreground and creating special visual effects on top of it, because of their flexibility in continuously tracking moving bodies and objects with very little calibration effort. Because the system is designed to be used on a daily basis in a flexible meeting environment, our tracking method needs to work with a quick setup when someone walks into the meeting room. We therefore implemented a simple foreground tracking system that uses depth cameras to help segment and track foreground elements. The depth camera gives an easy and robust way to differentiate the foreground from the background: by reading out the depth of each captured pixel, it is easy to recognize and track a human body and physical markers within a certain 3D range.
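As a rough illustration of this depth-range segmentation (a minimal sketch under assumed conventions, not the actual FocalSpace code; the frame layout and band width are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Build a binary foreground mask from a depth frame: pixels whose depth (in
// millimeters) falls inside a band around a tracked element (for example the
// head joint of an active skeleton) are kept as foreground; everything else
// stays in the background layer and will be blurred. The row-major frame
// layout and the band width are assumptions made for illustration.
std::vector<uint8_t> foregroundMask(const std::vector<uint16_t>& depthMm,
                                    int width, int height,
                                    int centerDepthMm, int bandMm = 400) {
    std::vector<uint8_t> mask(static_cast<size_t>(width) * height, 0);
    for (size_t i = 0; i < mask.size() && i < depthMm.size(); ++i) {
        int d = depthMm[i];                 // 0 means "no depth reading"
        if (d != 0 && d > centerDepthMm - bandMm && d < centerDepthMm + bandMm) {
            mask[i] = 255;                  // keep this pixel sharp
        }
    }
    return mask;
}
```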
Based on our analysis of foreground activities in the meeting room, the foreground to be tracked is divided into three categories: human bodies, planar surfaces, and arbitrary objects. For human body tracking, our implementation adapts the skeletal tracking algorithm supplied in the Microsoft Kinect for Windows SDK (Kinect, 2012), which can automatically detect a human skeleton within a certain range based on the grayscale depth image. For planar surface and arbitrary object tracking, we use physical markers (ARtoolKit, 2012). In combination with the depth data, we can obtain the 3D location where each physical marker is placed, and further segment the surface or artifact to which the marker is attached through depth differences and edge detection.

Figure 38: The segmenting and tracking targets are divided into three categories: human bodies, planar surfaces, and arbitrary objects. Skeletal tracking is used to detect the human body, and AR tags (within the same depth range as their target) are used to detect the other foreground elements.

Image Processing Pipeline

The interactive display is built upon both per-pixel depth sensing and the tracking of interaction cues. Each detected interaction cue corresponds to a certain foreground element. For example, when the "hand up" gesture is detected, it corresponds to the human body that performed it; when a proximity cue is detected, it corresponds to a pre-defined planar surface. We take the audio cue as an example to explain the image processing pipeline (Figure 39).

Figure 39: Video conferencing pipeline of "voice activated refocusing". The depth-camera-segmented depth and RGB video frames are combined and sent to the on-screen graphics. In combination with spatial audio tracking, the skeleton along the direction of the detected voice is segmented and put into central focus through an OpenGL blur texture.

The image processing is activated by two factors: the physical location of the human skeleton, and the horizontal angle of the voice source. The system continuously tracks the angle of the voice source through the sound sensor array and tries to find a skeleton along that angle. If a matching skeleton is detected, the pixels within the same depth range as the skeleton's head joint are copied out into a texture with a transparent background and superimposed on a computationally blurred layer of the entire scene.
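The matching step just described can be sketched as follows (a hypothetical illustration only; the angle convention, tolerance, and data structures are assumptions, not the actual FocalSpace code):

```cpp
#include <cmath>
#include <optional>
#include <vector>

// Pick the tracked skeleton whose head lies closest to the horizontal bearing
// reported by the microphone array. Angles are in degrees, with 0 meaning
// straight ahead of the camera; the tolerance is an illustrative assumption.
struct TrackedSkeleton {
    int   id;
    float headX, headZ;   // head joint position in camera space (meters)
};

std::optional<int> matchSpeaker(const std::vector<TrackedSkeleton>& skeletons,
                                float soundAngleDeg, float toleranceDeg = 15.0f) {
    std::optional<int> best;
    float bestDiff = toleranceDeg;
    for (const auto& s : skeletons) {
        // Bearing of the head joint relative to the camera's forward axis.
        float bearingDeg = std::atan2(s.headX, s.headZ) * 180.0f / 3.14159265f;
        float diff = std::fabs(bearingDeg - soundAngleDeg);
        if (diff < bestDiff) { bestDiff = diff; best = s.id; }
    }
    return best;   // empty when no skeleton lies near the detected sound angle
}
```

The matched skeleton's head depth would then drive a foreground mask like the one sketched in the tracking section, and the masked pixels are composited over a blurred copy of the full frame.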
User Interface

The front-end user interface was implemented with openFrameworks (Frameworks, 2012) (Figure 40). The active viewport is the main window for the teleconference and communication. This window shows the focus effect for the active foreground and replaces the classic video chat window with the enhanced FocalSpace view. Further, a suite of system controls allows the remote user to toggle and operate a number of highly granular effect filters that affect the active viewport rendering. While the default Diminished Reality effect is to blur out the background, a "semi-transparent mask" or "black out" can be selected as the background visual effect as well. The controls include three effect sliders and a switch for automatic/manual foreground focus. Finally, the augmented high-resolution views of the planar surfaces can be toggled through this interface.

Figure 40: The front-end user interface. The slider bar can be hidden automatically.

USER PERSPECTIVES

This chapter introduces the feedback received and the conclusions drawn from the user evaluation.

We showcased two generations of FocalSpace at two four-day demo events. In the first event, we used one depth camera and one microphone array; in the second, we extended the equipment to three depth cameras and three microphone arrays to capture a larger area. Beyond the showcases, we selected two features, "directing remote participants' attention" and "collaborative sketching", and conducted a user test comparing FocalSpace with Skype. We discuss the user feedback in the following sections.

Feedback from the Showcase

We showcased the FocalSpace prototype at the two four-day demo events to an audience of about 200 people. After testing the responsiveness of the system and the friendliness of the user interface, we gained valuable feedback. Occasionally, the gesture cue failed when another user was standing or moving between the cameras. This motivated us to optimize our gesture detection algorithm and to consider alternative cues for achieving the same effect. For example, in order to get attentive focus, a participant could show a marked physical token instead of raising a hand, similar to the real-world scenario of bidding.

We found that the effectiveness of one use case, "directing focus", depends highly on the surrounding environment. During the four-day demo period, in most cases when at least six people were surrounding the system, the users showed greater interest in the synthetic blur effect for the background noise than in the later lab test. We also noted that when FocalSpace is employed, user behavior can be both designed and predicted by the state of the application (Figure 41). For example, when a user is placed in the active foreground, his or her attentiveness increases and the likelihood of becoming distracted or disinterested decreases. We noted that collocated users are more likely to respect those currently speaking, as interrupting or "talking over" their colleagues is now accompanied by a dramatic visual impact. We are interested in studying the long-term effects and potential behavioral implications of a system in which users are more aware and cognizant of their contributions.

Figure 41: FocalSpace during the showcases.

User test conducted at the lab

Goal

We conducted an experiment to clarify the effectiveness of Diminishing Reality in helping listeners focus on foreground information during a videoconference.

Hypothesis 1: The visual effect of "Diminishing Reality" improves effectiveness and accuracy in the absorption of information from remote speakers. The advantage increases if there are more people in the remote location, if the meeting lasts longer, or if the communication is in a narrative mode.
Hypothesis 2: If positive area selection (the system following the user's attention) for diminishing reality is combined with voice-activated refocusing (the system directing the user's attention), listeners will have more flexibility in controlling the environment and will take more initiative in capturing information.

Conditions

To examine Hypothesis 1, we compared two conditions: a clear display and a voice-activated diminishing display. To examine Hypothesis 2, we compared two conditions: voice-activated diminishing, and voice activation plus eye-gaze-controlled diminishing.

Setup

The basic system setup included a large display and three Microsoft Kinect for Windows cameras placed shoulder to shoulder in front of the display. Participants sat facing the display and cameras. We placed the depth cameras so that 145 degrees of the continuous conference table in front of the display could be captured on screen. The angle of the sound source could be detected within 145 degrees as well.

Figure 42: FocalSpace setup for the user test: a desktop computer playing pre-recorded video, a 26-inch display, Kinect cameras, an eye-tracking device, and the subject.

Method

We recruited 25 volunteer participants for the experiment. Based on our setup, we needed a 26-inch screen, an eye-tracking analysis device, a desktop computer, and three Kinect cameras; Figure 42 shows the basic configuration of the devices. Following the setup of the FocalSpace system, the participants were required to sit in front of the 26-inch screen for three different video sessions. The first involved a display without any special visual effect; the second involved a display with voice-activated refocusing effects to direct listeners' attention; and the third involved a display with eye-gaze-directed refocusing to follow listeners' attention. All displays were driven by pre-recorded videos. In the pre-recorded video, six remote speakers describe the scenario of a marketing report; they also use graphics and posters for in-depth illustration. We made the pre-recorded video before starting the experiment in the lab. Each time, the following conditions applied: (a) the same ambient environment, (b) an equal amount of information communicated over a distance, (c) the same number of remote participants, and (d) the same display and voice quality.

Throughout the user test, we used both quantitative and qualitative measurement. Quantitative measurement was used to compare effectiveness by (a) counting the portion of the total information captured by the subjects and (b) measuring their accuracy in attributing information back to the correct speaker. Qualitative measurement concerned the problems subjects normally experience during remote communication and asked them to compare and comment on both interfaces.

Steps

According to the hypotheses above and the system setup, we divided the experiment into two parts, each intended to test one of the hypotheses. 1) To test Hypothesis 1, that a system with the diminishing setup is more effective for information delivery than a normal one, we recruited 30 participants for the first part. All 30 subjects watched two session videos. After finishing each video, the subject was asked to complete a questionnaire (with two open questions). We compared the accuracy and correctness of the two answer sheets. Participants could complete this part at the same time in the same conference room.
2) To test Hypothesis 2, which adds positive area selection, we recruited another 10 participants. Each subject watched the two session videos individually. The videos were the same as before; however, this group had the opportunity to use a mouse for positive area selection (Table 2).

                        Length   Refocus effect   Subjects   Interactivity   Questionnaire                Interview & feedback
Hypothesis 1
  Session 1             15 min   no               30         no              yes                          no
  Session 2 (Plan A)    15 min   yes              30         no              yes, with open questions     no
Hypothesis 2
  Session 1             15 min   no               10         no              yes                          no
  Session 2 (Plan B)    15 min   yes              10         yes             yes                          yes

Table 2: Session 1: 15 minutes, playing the video of a scene with six people videoconferencing, followed by a questionnaire. 20 subjects in total were invited to this session. Session 2: 15 minutes, playing the video of the same scene with the visual effect of voice-activated refocusing, followed by a questionnaire. 20 subjects in total were invited to this session.

Findings

Focus and Attention

All the participants agreed that, compared to a videoconferencing display without additional visual effects, our system provided an enhanced sensation of focus. One participant mentioned: "When we are physically together I know where the sound is coming from and I can turn my head to find the sound source; but without the blur effect, sometimes it's hard for me to understand who is speaking, especially when there are a lot of people or the internet quality is bad."

(1) Objects, participants, and space as part of the foreground: We found that most users' concept of foreground information includes not only the active participants, but also tools, artifacts, and the space surrounding the activity. 22 out of 30 participants expressed a strong interest in the manual selection feature. For example, one participant mentioned that if remote participants were talking about a physical object, she would find manual selection very useful when she wanted to "get closer" and see what people were discussing, such as a physical model or prototype. She explained that not only the people, but also the physical objects related to the discussion, are of interest to her. Another participant mentioned that if the remote group began to draw on the whiteboard, he would like to focus his attention on the whiteboard instead of the participants themselves; in that case, the whiteboard could be manually selected to display the augmented viewport.

(2) Peripheral Awareness: During the normal video conferencing test, 5 out of 6 non-native English speakers found it difficult to remember the remote collaborators' names, while 2 out of 6 native speakers had the same issue. While using FocalSpace, people found the augmented nametag displaying each participant's name very useful for assisting communication and name retention. While testing manual focus, one participant explained that it was nice to have a tool to select different people and review their names.

Manual Selection of Positive Area

Comparing FocalSpace with and without manual selection of the positive area, we found that, in terms of efficiency and effectiveness, our system with positive area selection had certain advantages (Table 3).
Table 3: User-reported results for FocalSpace compared to commonly used videoconferencing software, rated on a five-point scale (1 = much worse; 5 = much better) along the dimensions of ease of focus, speaker awareness, collaborative efficiency, collaborative effectiveness, and distraction prevention.

The augmented perspective of the planar surfaces on the remote display gave users a real-time rendering of, and access to, remote information beyond the talking heads. We noticed that participants gained a faster and more accurate understanding of the overall topic in this case. Participants found the augmentation of shared drawing surfaces useful, noting that location awareness was important.

User Perception

In general, users indicated that, depending on their need for teleconferencing, be it work related or personal, the system should adapt the interface accordingly. One user noted, "For talking with friends or family, there is too much user interface and too many features. I would only be interested in focus." Understanding this, we believe that future implementations of FocalSpace could utilize adaptive rendering and graceful degradation depending on the intended use.

EXTENDED APPLICATIONS: EXTENSION TO POST-CONFERENCE AND PRIVATE VIEW

In this chapter, we discuss how the tracking capability of FocalSpace can be utilized to embed gesture recognition and smart tags in live and recorded video conferences.

Record and Review

Currently, FocalSpace captures and displays in real time but discards the data once a session has concluded. We believe that by tracking and storing the rich semantic information collected from both the sensors and user activity, we can begin to build a new type of search-and-review interface. With such an interface, hundreds of hours of conversation could easily be categorized and filtered by participant, topic, object, or specific types of interaction. Recalling previous teleconference sessions would allow users to adjust their focal points by manually selecting different active foregrounds, which would enable them to review different perspectives on past events.

Most video conferencing systems have recording capability, but without smart indexing the large amount of video data is hard to navigate and review. As our system can dynamically detect gestures, the human body, voice, and the environment, we propose a series of ideas for smart indexing (Figure 43); a minimal sketch of such an index appears after the gesture examples below.

Figure 43: Switching between Live Mode and Review Mode.

Mining Gestures for Navigating Video Archives

We addressed gesture detection and tracking in real-time video conferencing in an earlier chapter; here, the idea is to mine gestures to help users navigate archives. The system provides passive sensing for new users and also gives long-term users ways to tag the archive while moving in a natural way. We explore and build up a gestural library. The design of the library is based on the social and behavioral norms of workplace meetings; cultural factors are considered as well. Some examples of potential gestures in videoconferencing are listed below (Table 4). In front of the depth cameras, all of these gestures can be defined and detected easily.

Table 4: Gesture examples with different intentions.
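The following is a minimal, hypothetical sketch of how such cue-based tags could be stored as a timeline index for later review; the structures and cue names are illustrative assumptions rather than the actual FocalSpace implementation:

```cpp
#include <string>
#include <vector>

// Whenever a cue fires during a recorded session, the timestamp and cue type
// are appended to an index. The review interface can then draw icons along the
// timeline and jump back to the matching moments.
struct IndexEntry {
    double      seconds;   // offset into the recorded session
    std::string cue;       // e.g. "thumbs_up", "speaker:Erika", "flipchart"
};

class SessionIndex {
public:
    void tag(double seconds, const std::string& cue) {
        entries_.push_back({seconds, cue});
    }
    // All timestamps at which a given cue type fired, for timeline markers.
    std::vector<double> momentsFor(const std::string& cue) const {
        std::vector<double> moments;
        for (const auto& e : entries_)
            if (e.cue == cue) moments.push_back(e.seconds);
        return moments;
    }
private:
    std::vector<IndexEntry> entries_;
};
```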
To prove this concept, we implemented the "Thumbs Up" gesture as a symbol of a good idea (Figure 44). Whenever a participant hears a good idea or comment that he wants to tag in the video, he can simply lift his right arm and give a thumbs up. The system gives the participant a graphic hint once the gesture is detected. Later, when the recorded session is reviewed, every moment at which the gesture was detected is marked with a "thumb" icon along the timeline.

Figure 44: "Good idea" gesture index. (Left) The gesture is detected in live mode. (Right) In replay mode, thumb icons appear along the timeline at the moments when the gesture was detected. Viewers can click a thumb icon to revisit the moment when the corresponding gesture occurred.

Voice Index

Utilizing the system's capability to identify the active foreground, we can also use an active person, a physical object, or the environment as a smart index for later review. For example, the system can detect and remember when a certain person started talking and embed different "talking head" marks along the timeline (Figure 45). This feature enables reviewers to revisit the moments when a selected speaker started to talk, either by clicking the "talking head" icon along the timeline or by clicking the talking head directly in the main window.

Figure 45: "Talking head" mark. By clicking the "talking head" icon along the timeline, or by clicking the talking head directly in the main window, reviewers can revisit the moment when a selected speaker started to talk.

Portable Private View

By adding a private view on a mobile phone, tablet, or personal laptop, we aim to give attendees more flexible access to the remote information. Through the private view, each participant can have a unique focus and explore the relevant digital augmentations associated with the focus they select:

* Paying attention to the elements that interest the particular user, rather than following the automatic focus of the global view (Figure 46).
* Accessing augmented information (Figure 47).
* Portability across devices. For example, if a participant wants to step away from the table temporarily during a videoconference on a desktop, he can easily copy the same view from the computer to his mobile device.

Figure 46: Focus on a certain person even if he is not the current speaker, access his related digital information, and open a private message channel.

Figure 47: Focus on a certain active area, in this case the flip chart, and access the flip chart's historical image archive through a clickable interface.

EXTENDED DOMAIN: "YOU CAN DO MORE!"

In this chapter, we discuss how the design guideline of "improving focus through diminished reality" can be utilized in other application domains.

We believe that "DR+AR" as a design approach can be used in different fields: it is an effective way to filter out information and computationally mediate humans' perception of the real world. We explore two conceptual ideas, FocalCockpit and FocalStadium. Both projects were conducted with commercial companies, which indicates that our design framework can be put into practice.

FocalCockpit

The purpose of FocalCockpit is to augment the driving experience while taking safety and efficiency into consideration. This proposal is based on the increasing amount of accessible information on the road, including real-time traffic conditions, facilities along the road, inter-communication channels between vehicles, and so on. To date, we have built mock-ups to explore the interaction concepts without real-time accessible data. We describe some of the scenarios being discussed below.
For example, we envision drivers having a fog-diminished view on a rainy day, or a sky-view map that augments their current location (similar to the distorted perspective of the road in the movie Inception). Please refer to the following figures for some of the scenarios we have proposed (Figures 48-51).

Figure 48: We envision that drivers could get a fog-diminished view on a rainy day.

Figure 49: A real-time updating "bird's-eye view" map. The designer was inspired by Inception.

Figure 50: Chatting on the road, a scenario of car-to-car communication. During heavy traffic, drivers could communicate with friends detected on the same road.

Figure 51: Finding parking spots: the display on the front window could highlight potentially available parking spots in real time.

FocalStadium

The FocalStadium concept was developed based on the observation that activities in a stadium are normally tracked through multiple channels: cameras from different view angles, handheld cameras carried by judges and players, microphones on judges and players, coded markers on the players' clothes, and so forth. This diverse video, audio, sensory, and locational information makes it possible to apply our DR+AR approach.

When people watch a sports game, they usually see it from a fixed angle. On one hand, it is easy for them to get an overview of the game; on the other hand, they cannot watch it from other perspectives, focus on and get a clearer view of a specific player, or go back in time to review an exciting moment. To enrich audiences' experience, we mocked up a mobile application following the guideline of "improving focus through diminished reality". In the example shown below, the starting point of active action is chosen as the trigger of focus (Figure 52). For example, in a 100-meter run, the system focuses on the individual who is leading the run, while in the case of cricket, the system tends to focus on the thrower and picker.

Figure 52: Focus on the currently most active player; the focus can be manually selected as well. Users can access the relevant digital information on a foreground player, revisit the moment when he made an excellent hit, or review the same action from a different perspective.

Performance Activated Refocusing

What the focus should be is always the first question we ask when we try to put the "DR+AR" concept into practice. In the earlier application for video conferencing, the current speaker should get the most attention, so voice activation was chosen as the trigger for the focal point. Due to the nature of competition, audience members care about the players with the best or most active performance. In this case, the starting point of active action is chosen as the trigger of focus: in a 100-meter run, the system focuses on the one who is leading the run, while in the case of cricket, the system tends to focus on the thrower and other foreground players. The same design concept can be extended to many sports categories. One approach to the technical solution is to add bright patterns to the players' clothing. A conceptual sketch of such a focus trigger follows this section.

We envision the same design approach being used both for the real-time stadium experience and for a live TV mode. When audience members are sitting in front of the TV screen, the same effect can be achieved when they point their mobile devices toward the screen. Potentially, this is another means of communication through mobile devices and a way to interact with television.
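As a purely conceptual sketch of the "performance activated" focus trigger for a race (the tracking source, structures, and progress metric are assumptions made only for illustration, since FocalStadium exists only as a mock-up):

```cpp
#include <vector>

// Among the tracked runners, the one with the greatest progress along the
// track becomes the focal subject; the rest of the scene would be diminished.
struct TrackedPlayer {
    int   id;
    float progressMeters;   // distance covered along the track
};

int selectFocusPlayer(const std::vector<TrackedPlayer>& players) {
    int bestId = -1;
    float bestProgress = -1.0f;
    for (const auto& p : players) {
        if (p.progressMeters > bestProgress) {
            bestProgress = p.progressMeters;
            bestId = p.id;
        }
    }
    return bestId;   // -1 when nobody is currently tracked
}
```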
CONCLUSION

This chapter outlines the future work and the conclusions of the thesis project.

Professor Ramesh Raskar, the director of the Camera Culture group at the MIT Media Lab, described his vision of "super-human vision" in an interview. He said that his interest lies in creating "super-human abilities to visually interact with the world - with the cameras that can see the unseen, and displays that can sense the altar of reality". His inspiring aim of "creating a unique imaging platform that have an understanding of the world that far exceeds human ability, that we can meaningfully abstract and synthesize something that's well within human comprehensibility" is, on an abstract level, what our system FocalSpace is trying to achieve.

Camera Man

"Camera Man" is our metaphor for FocalSpace (Figure 53). We have built the system as a computational cameraman, who can set up the spotlight, act as a shooting assistant, choose perspectives, and auto-focus in real time.

Figure 53: The "cameraman" metaphor for FocalSpace. Just like a cameraman who can set up the spotlight, assist the shot, choose perspectives, and auto-focus in real time, FocalSpace is a system that can automatically display information from the right perspective, assist communication, and give the spotlight to the central focus.

Future Work

Specific Tasks

As one of the next steps, we want to find out how FocalSpace can be utilized in more specific tasks. FocalSpace has been designed as a flexible system for remote communication. As such, we are also interested in investigating the implications of a support system for remote tasks with more specific purposes, such as long-form presentation and education. We believe that the study of embodied gesture and movement for presentation, and the augmentation of context- and state-aware storytelling, may lead to interesting new discoveries in remote communication.

Cloud-based solution for FocalSpace

Even though FocalSpace utilizes the layered space framework, every system is currently connected directly to every other system. However, FocalSpace is compatible with a cloud-based solution, because the layered space framework enables the platform to reconstruct individual content easily from a single transmitted source. To produce this individual content, the central server in the cloud needs only simple "copy-and-paste" processing, so the total distribution complexity and bandwidth would be reduced dramatically. Furthermore, moving some of the processing to the cloud server is also beneficial for mobile device users. As additional future work, we plan to store the data on a cloud server and allow users to log on to access and analyze these data.

Conclusion

We have adopted an extension of the well-known depth-of-field effect that allows elements of the videoconference to be blurred depending on relevant interaction cues rather than on their distance from the camera. By using depth sensors and various cues such as voice, gestures, and physical markers, FocalSpace constructs the "Layered Space Model", identifies the foreground of the video, and subsequently blurs out the background. The technique makes it possible to diminish the background noise and point remote users to relevant foreground elements in high quality, without consuming large bandwidth. Because of the similarity to the familiar depth-of-field effect, and the fact that depth of field is intrinsic to the human eye, we believe our approach of Diminished Reality is a quite natural metaphor for transmitting remote video streams and can be used quite effortlessly by most users.
Our preliminary user study also supports this assumption. FocalSpace can be used not only for cleaning up unwanted noise but also for directing remote participants' focus, saving transmission bandwidth, keeping privacy, and saving display space for additional information such as augmented content.

We believe that FocalSpace contributes to video conferencing by observing the space we inhabit as a richly layered, semantic object through depth sensing. While much remains to be done, our work offers a glimpse at how emerging depth cameras can enable a flexible video conferencing setup that brings semantic cues and an interpretation of the 3D space to remote communication.

Throughout this thesis, we have conveyed our belief that "less is more". We carefully choose the ways we alter humans' perception of reality in order to enhance, improve, or entertain real-life experiences. Technology enables a variety of tracking and computational reconstruction; however, it is the designers who define how those accessible technologies serve real life. To achieve a good design, we have to look deeply into humans' conscious and subconscious behaviors. In all the use cases described in this thesis, such as conferencing, driving, and watching sports games, FocalSpace is very sensitive to participants' attention and focus, as all the interactive features are built upon tracking and understanding users' intention and attention. For example, listeners care not only about the current speaker, but also about the people whom the current speaker is watching or referring to. Paul Dourish and Sara Bly noted that awareness in the workspace "involves knowing who is 'around', what activities are occurring, who is talking with whom; it provides a view of one another in the daily work environments" (Paul Dourish, 1992). Better understanding target users' attention and focus is what we will pursue in the long term.

BIBLIOGRAPHY

Zhang, C. R. (2006). Lightweight background blurring for video conferencing applications. ICIP 2006 (pp. 481-484). ACM.
Wooding, D. (2002). Fixation maps: Quantifying eye-movement traces. ETRA '02. New York: ACM.
youKu. (2012). youKu. Retrieved April 2, 2012, from http://youku.com
Vertegaal, R. (1999). The GAZE groupware system: Mediating joint attention in multiparty communication and collaboration. Proc. of the CHI '99 Conference on Human Factors in Computing Systems. ACM.
AugmentedReality. (2012). Augmented reality. Retrieved 2012, from http://en.wikipedia.org/wiki/Augmentedreality
Azuma, R. (2004). Overview of augmented reality. SIGGRAPH '04. New York: ACM.
AzamKhan, J. K. (2005). Spotlight: Directing users' attention on large displays. CHI '05 (pp. 791-798). New York: ACM.
ART+COM. (2012). ART+COM. Retrieved 2012, from http://artcom.de
Artvertiser. (2012). Artvertiser. Retrieved April 3, 2012, from http://theartvertiser.com/
ARtoolKit. (2012). ARtoolKit. Retrieved April 8, 2012, from http://handheldar.icg.tugraz.at/artoolkitplus.php
Cisco. (2012). WebEx. Retrieved from http://www.webex.com/
Chris Woebken, K. O. (2012). Animal Superpower. Retrieved 2012, from http://chriswoebken.com/animalsuperpowers.html
E. A. Bier, M. C. (1993). Toolglass and magic lenses: The see-through interface. SIGGRAPH '93 (pp. 73-80). ACM.
David Holman, R. V. (2004). Attentive display: Paintings as attentive user interfaces. CHI EA '04 (pp. 1127-1130). New York: ACM.
DDB, T. (2012). Tribal DDB. Retrieved May 1, 2012, from www.tribalddb.nl
Doug DeCarlo, A. S. (1991). Stylization and abstraction of photographs. Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '02) (pp. 769-776). New York: ACM.
Furnas, G. W. (1986). Generalized fisheye views. ACM Conference on Human Factors in Computing Systems, SIGCHI (pp. 16-23). New York: ACM.
Frameworks, O. (2012). openFrameworks. Retrieved January 2, 2012, from http://openframeworks.cc
H. Ishii, M. K. (1994, August). Iterative design of seamless collaboration media. Communications of the ACM, 83-97.
Kinect. (2012). Kinect for Windows SDK. Retrieved March 1, 2012, from http://www.microsoft.com/en-us/kinectforwindows/
Lieberman, J. (2012). Moore Pattern. Retrieved 2012, from http://bea.st/sight/moorePattern/
Loschky, L. M. (2005). How late can you update? Detecting blur and transients in gaze-contingent multi-resolutional displays. Human Factors and Ergonomics Society 49th Annual Meeting (pp. 1527-1530). Santa Monica.
Munzner, T. (1998). Drawing large graphs with H3Viewer and Site Manager. Graph Drawing '98 (pp. 384-393). Springer.
M. C. Stone, K. F. (1994). The movable filter as a user interface tool. ACM CHI '94 (pp. 306-312). ACM.
M. Sarkar, S. S. (1993). Stretching the rubber sheet: A metaphor for visualizing large layouts on small screens. ACM Symposium on User Interface Software and Technology (pp. 81-91). ACM.
M. Sarkar, M. (1994). Graphical fisheye views. Communications of the ACM, 37(12), 73-83.
Maesako, O. M. (1998). HyperMirror: Toward a pleasant-to-use video mediated communication system. CSCW (pp. 149-158). ACM.
Okada, K. M. (1994). Multiparty videoconferencing at virtual social distance: MAJIC design. CSCW '94 (pp. 385-393). ACM.
Paul Dourish, S. B. (1992). Portholes: Supporting awareness in a distributed work group. SIGCHI Conference on Human Factors in Computing Systems (CHI '92) (pp. 541-547). New York: ACM.
Scalado. (2012). Scalado. Retrieved 2012, from http://www.scalado.com/display/en/Hom
Skype. (2012). Skype. Retrieved from www.skype.com
sound, o. w. (2012). Obsessed with Sound. Retrieved April 20, 2012, from http://ows.clients.vellance.net/ows/
Steve Mann, J. F. (2002). EyeTap devices for augmented, deliberately diminished, or otherwise altered visual perception of rigid planar patches of real-world scenes. Virtual Environments, 11(2), 158-175.
Stefan Agamanolis, A. W. (1997). Reflection of Presence: Toward more natural and responsive telecollaboration. SPIE Multimedia Networks, 3228A.
Stratton, G. M. (1896). Some preliminary experiments on vision without inversion of the retinal image. Psychological Review, 3(6), 611-617.
Rainer Stiefelhagen, J. Y. (2001). Estimating focus of attention based on gaze and sound. PUI '01 (pp. 1-9). New York: ACM.
Ramesh Raskar, G. W. (1998). The Office of the Future: A unified approach to image-based modeling and spatially immersive displays. SIGGRAPH 1998. Orlando: ACM.
Ramesh Raskar, K.-H. T. (2004). Non-photorealistic camera: Depth edge detection and stylized rendering using multi-flash imaging. ACM SIGGRAPH 2004 Papers (SIGGRAPH '04) (pp. 679-688). New York: ACM.
Reingold, E. L. (2002). Gaze-contingent multi-resolutional displays: An integrative review. Human Factors.
Robert C. Edson, D. M. (1971). Patent No. 3601530. US.
Robert Kosara, S. M. (1997). Semantic depth of field. IEEE Symposium on Information Visualization 2001 (INFOVIS '01). Washington: IEEE Computer Society.
Tracy Jenkin, J. M. (2005). eyeView: Focus+context views for large group video conferences. CHI '05 Extended Abstracts on Human Factors in Computing Systems (CHI EA '05) (pp. 1497-1500). New York: ACM.