FocalSpace: Enhancing Users' Focus on
Foreground through Diminishing the Background
By
Lining Yao
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2012
© Massachusetts Institute of Technology 2012. All rights reserved.
Author ..........................................................................................................
Lining Yao
Program in Media Arts and Sciences
May 1st, 2012
Certified by ....................................................................................................
Hiroshi Ishii
Jerome B. Wiesner Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Supervisor
Accepted by .............................................................................................................
Mitchel Resnick
LEGO Papert Professor of Learning Research
Academic Head
Program in Media Arts and Sciences
FocalSpace: Enhancing Users' Focus on
Foreground through Diminishing the Background
By
Lining Yao
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
on May 11, 2012, in partial fulfillment of the
Requirements for the degree of
Master of Science in Media Arts and Sciences
ABSTRACT
In this document we introduce FocalSpace, a video conferencing system that helps users focus
on the foreground by diminishing the background through synthetic blur effects. The system can
dynamically recognize the relevant and important activities and objects through depth sensors
and various cues such as voices, gestures, proximity to drawing surfaces, and physical markers.
FocalSpace can help direct remote participants' focus, save transmission bandwidth, remove
irrelevant pixels, protect privacy, and conserve display space for augmented content.
The key to this research lies in augmenting the user experience by diminishing irrelevant information, based on the philosophy of "less is more." In short, we use DR (Diminished Reality) to improve communication.
Based on our philosophy of "less is more", we describe some design concepts and applications beyond the video conferencing system. We explain how the approach of "AR through DR" can be utilized in driving and sports-watching experiences.
In this document, we detail the system design of FocalSpace and the 3-D tracking and localization technology used. We also discuss some initial user observations and suggest further directions.
Thesis Supervisor: Hiroshi Ishii
Title: Jerome B. Wiesner Professor of Media Arts and Sciences, Program in Media Arts and
Sciences
"Imagine a technology with which you could see
more than others see, hear more than others
hear, and perhaps even touch, smell and taste
things that others cannot."
By D.W.F. van Krevelen and R. Poelman [11]
"Then, imagine a technology with which you
could see less than others see, hear less than
others hear, so that you can concentrate and
care only about the stuff you want to care about."
By the author of this thesis
FocalSpace: Enhancing Users' Focus on
Foreground through Diminishing the Background
By
Lining Yao
The following people served as readers for this thesis:
Thesis Reader ....................................................................................................
Pattie Maes
Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Reader ....................................................................................................
Ramesh Raskar
Assistant Professor of Media Arts and Sciences
Program in Media Arts and Sciences
ACKNOWLEDGEMENTS
First, I thank my first collaborator, Anthony DeVincenzi. This work would not have been possible
if I had conducted it alone. Together with Tony, I developed the very first application, "Kinected
Conference", one week after Microsoft Kinect came to the market. I enjoyed the moment when
we used open source code to hack into the device and blew people's minds with our final
presentation for MAS 531, Computational Camera and Photography. I fondly recall the moments
when we sketched on the 5th floor cafe before meeting professors, coded during holidays, shot
video on the ground floor at midnight, wrote papers on the plane, and sat in front of cameras to
describe our story and concept for the media. I appreciate Tony's support, encouragement, and
contributions during each step of the project.
I thank my advisor, Hiroshi Ishii, for inspiring me with his vision of "Tangible Interface" back
when I was in China. I have always been encouraged by his energy, his passion, and his critical
perspective. When we explained ideas in one sentence during a group meeting, it was Hiroshi
who picked up the original idea of Kinected Conference and encouraged us to go ahead with it.
When we tried to frame our work to tell a better story, it was Hiroshi who inspired us with the
vision of "Diminishing Reality" + "Augmenting Reality". FocalSpace is a result we cultivated
together over two years. I would like to express my gratitude to him for his enormous support
and encouragement.
It was a great pleasure to take MAS 531, Computational Camera and Photography, under
Ramesh Raskar, the director of the Camera Culture group at the Media Lab. Without this class,
FocalSpace would not exist. Ramesh showed me how to "play" and "design" with technology,
and how a real MIT genius invents something new. The night before the final presentation for
the class, as Tony and I were shooting project videos and preparing presentations, Ramesh
came to me and listened to my idea. After brainstorming with him, I extended the single "voice
activated refocusing" feature to five different visual features. As a result, I got the most votes for
the final presentation and won the first-place award as a designer in one of the most technical
classes at the Media Lab. Ramesh is a tutor who has not only driven me to work harder, but also
inspired me with a unique perspective, great talent, and creativity. I am grateful to him for telling
me, "you should not be constrained by the research direction of your group; you should be the
one who creates the direction." His advice has consistently pushed me to move forward and
outside the box.
I have been fortunate to have the pleasure of communicating with another of my thesis readers,
Professor Pattie Maes. I appreciate the support I received anytime I sought her help. She really
inspired me to think about the effectiveness of the system seriously.
I would like to thank all of the MIT undergraduate students who helped with this project. They
were always there and willing to help at any moment I needed them. Erika Lee, who began
working with us as a junior, will complete a bachelor's and master's thesis on this topic as well.
She did a great deal to build up the infrastructure of FocalSpace after we switched to Windows
and Microsoft official SDK for Kinect. Many times, she was the person I looked for when I had no
clue how to solve a technical problem. Kaleb Ayalew started collaborating with me a year ago on
another project; he switched to FocalSpace at the later stage. He started to implement the entire
record-and-review interface from scratch, making it work nicely. He is very kind, codes quickly,
and was always careful in debugging the code, even when I felt impatient. I would like to thank
Shawn Xu as well for his hard work day and night on the private mobile interface, as well as his
caring and friendship throughout the semester when he was here at MIT. I still remember the
night when we had to catch the last train at 1 o'clock in heavy rain. These collaborators really
made my time at MIT meaningful.
It has been an absolute pleasure spending time with Pranav Mistry, one of my best friends here
at MIT. When I talk to Pranav, I see a perfect mind combining technology, design, art, philosophy,
and even pop culture. I was moved by the way he told a story with technology during his TED talk
entitled "SixthSense" back in China. I never imagined that I would one day hang out with this
amazing creator and hear firsthand how passionate and obsessed he was when he was working
on SixthSense. At MIT, we have access to the smartest people all the time. Many people are good
at sharing their own thoughts, but only some of them are willing to connect their vision more
closely to daily life, and Pranav is one of them. Pranav and I share the belief that invention, or
research, should be simple, unique, and useful. He keeps reminding me that I shouldn't go in
the direction of being too geeky. Instead, I should care about the real-life impact of my work. He
believes that we should learn to think independently and never follow in others' footsteps.
Moreover, Pranav introduced me to Abhijit Bendale, who became another of my lifelong friends. I
will never forget the days when Abhijit visited and chatted with me on the 5th floor cafe,
brainstorming and dreaming of the future.
I would like to thank the researchers from the Cisco Research Center: Steve Fraser from
California, Stephen Quatrano from Boston, and Lars Aurdal from Norway. I appreciate their
insight, honest comments, and suggestions. With their help, I received the Cisco Fellowship for
PhD study. The appreciation our work has received within industry means a lot to us. They are
very fun people to talk to as well. I am looking forward to close collaboration with them in the
following years.
It has been my pleasure to get to know all of the people in the Tangible Media group. They have
offered me great help and friendship. I've been working on FocalSpace for almost two years.
When I coded and set up the hardware systems, there were always group members around,
helping or giving suggestions from time to time. I would like to thank Sean Follmer and Jinha Lee
for the encouragement they extended the very first time they heard about the project. Daniel
Leithinger has always been there to listen, help, and take care of any mess I created. David
Lakatos spent much time in the same office for the whole summer to test out audio-video
features. And Austin Lee has always given kind suggestions and encouragement. I thank Leo
Bonanni for his consistently short, sharp, but effective suggestions and support. I miss Keywon
Chung's kindness. I enjoy being lab mates with Xiao Xiao and Samuel Luescher. Finally, I have
had the pleasure of working with Natalia Villegas, Mary Tran Niskala, Jonathan Williams, Paula
Aguilera, and many others who made the Media Lab a better place to stay.
I thank my former advisor in China, Professor Fangtian Ying. He is the one who told me, "be
imaginative about yourself". Without him, I would never have mustered the courage to leave my
country and friends and apply to MIT. He is a professor who blew my mind every time I listened
to him talk. He told us that being a student means maintaining passion and curiosity. I learned
how to be a better designer and human being because of him. Although he couldn't speak
English, his vision of design and creating a better life transcends countries and time.
In conclusion, I want to say thank you to my great family. No matter how far I go, my lovely
parents and little sister are always standing where I can see them immediately if I turn back. I
want to thank Wei for his constant patience whether I am happy or angry, considerate or
demanding. He has taught me that loving a person means caring for and appreciating him or
her. My family and Wei make me feel safe and warm.
TABLE OF CONTENTS
Abstract
Acknowledgements
List of Figures
Introduction
    Motivation for Diminishing Reality
    Current Challenges in the Video Conferencing Room
        Technical constraint
        Perceptive constraint
    Inspirations
    Hypothesis and Goal
    Contribution
        Vision
        Usefulness
        Technical contribution
    Outline
Related Work
    Computer Mediated Reality
        Visual perception of the reality
        Augmented Reality
        Time Augmented Reality
        Improved Reality
        Altered Reality
        Diminished Reality
        Diminished "Digital Reality"
        Abstracted Reality
    Focus and Attention
        Awareness of focus for video conferencing
        Focus and attention in other application domains
Improving Focus by DR
    The "Layered Space Model" of FocalSpace
        Foreground layer
        Diminished background layer
        Add-on layer
    Interaction Cues of the "Layered Space Model"
        Audio cue
        Gestural cue
        Proximity cue
        Physical marker as an interactive cue
        Remote user defined/discretionary selection
        Other potential cues
User Scenarios
    Filtering Visual Detritus and Directing Focus
    Saving display space for augmented information
    Saving Bandwidth for Transmission
    Keeping Privacy
Implementation
    System
    Setup
    Tracking Techniques
    Image Processing Pipeline
    User Interface
User Perspectives
    Feedback from the Showcase
    User test conducted at the lab
        Goal
        Setup
        Method
        Steps
        Findings
Extended Applications
    Record and Review
        Mining gestures for navigating video archives
        Voice index
    Portable Private View
Extended Domain
    FocalCockpit
    FocalStadium
Conclusion
    Camera Man
    Future Work
        Specific tasks
        Cloud-based solution for FocalSpace
    Conclusion
Bibliography
LIST OF FIGURES
Figure 1: Times Square with Information Overload
Figure 2: Depth of Field Techniques used in Photography
Figure 3: FocalSpace System
Figure 4: FocalSpace central configuration
Figure 5: An "evolutionary" approach to outline the thesis
Figure 6: EyeTap
Figure 7: "Office of the Future"
Figure 8: Timescope (Left) and Jurascopes (Right) from ART+COM
Figure 9: Artvertiser
Figure 10: Umkehrbrille Upside Down Goggles by Carsten Höller
Figure 11: "Animal Superpowers"
Figure 12: Diminished and augmented planar surface
Figure 13: Artwork by Steve Mann
Figure 14: "Remove" from Scalado
Figure 15: Diminished Reality in Data Visualization and Software Applications
Figure 16: The company webpage of Tribal DDB
Figure 17: "Turning on and off the light" on YouKu
Figure 18: Non-Photorealistic Camera
Figure 19: The abstracted rendering is generated by transforming a photograph based on viewers' eye-gaze fixation
Figure 20: "Obsessed by Sound"
Figure 21: Aggregated fixations from 131 subjects viewing Paolo Veronese's Christ addressing a Kneeling Woman
Figure 22: Gaze Contingent Display
Figure 23: Dynamic Layered Space Model
Figure 24: Categories of the semantic cues
Figure 25: Voice Activated Refocusing
Figure 26: Gesture Activated Refocusing
Figure 27: Proximity Activated Refocusing
Figure 28: Physical marker as an interactive cue
Figure 29: 3 steps to develop the application on top of the "Layered Space Model"
Figure 30: Voice activated refocusing
Figure 31: Augmentation
Figure 32: Contextual augmentation of local planar surfaces
Figure 33: Sketching on the planar surface can be detected and augmented for the remote users
Figure 34: Contextual augmentation of shared digital drawing surfaces
Figure 35: 3 degrees of blur, with dramatically changed data size
Figure 36: FocalSpace central configuration
Figure 37: Satellite cameras
Figure 38: Categories of the segmenting and tracking targets
Figure 39: Video conferencing pipeline of "voice activated refocusing"
Figure 40: The front-end user interface. The slider bar can be auto-hidden
Figure 41: FocalSpace during the showcases
Figure 42: FocalSpace setup for user test
Figure 43: Switching between Live Mode and Review Mode
Figure 44: "Good idea" gesture index
Figure 45: "Talking head" mark
Figure 46: Gesture cue
Figure 47: To focus on a certain active area, in this case, the flipchart
Figure 48: We envision drivers could get a fog-diminished view on a rainy day
Figure 49: The real-time updating of the "bird view" map
Figure 50: Chatting on the road
Figure 51: Finding the parking slots
Figure 52: Focus on the most active speaker
Figure 53: The metaphor of "Camera man" for FocalSpace
INTRODUCTION
"LESS IS MORE"
This chapter serves as an introduction.
We first describe the motivation for "Diminished Reality", and then move a
step further to discuss the challenges of the video conferencing room as a
specific real-life scenario.
Section 2 explains the inspiration for FocalSpace, introduces the hypothesis,
and highlights the novel interaction and research contributions.
Motivation for Diminishing Reality
We are in an information age. We are facing information when we are at our computers, on our
mobile phones, or even walking on the street. Inventors whose job is to imagine the future are
trying hard to bring information everywhere in the physical environment. This work has taken
various research directions, such as "Ubiquitous Computing", "Physical Computing",
"Augmented Reality", and "Tangible Interaction." Soon, we will be able to, or we will be forced
to, receive information everywhere, both consciously and subconsciously. The importance of
information accessibility should never be denied. However, people are facing information
overload. This situation brings us an obligation to research how to organize and filter
information. In particular, we are interested in helping people filter out unwanted or unimportant
information and focus on the most relevant information. Diminishing unwanted information
while rendering the important information more accessible/visible is the goal of this research.
Figure 1: Times Square with Information Overload. The street is overwhelmed with activities, lighting and displays; it's
information overload in the real world.
Current Challenges in the Video Conferencing Room
In this thesis, we take video conferencing as an example to explain our belief in and approach to
Diminishing Reality.
In recent years, the use of video conferencing for remote meetings in the work environment has
been widely adopted. However, a number of challenges remain associated with the use of video
conferencing tools.
Technical constraint
Firstly, for large group video conferencing, it is hard to render the entire captured scene in high detail. One obvious reason is the limitation of transmission bandwidth. Most of the video conferencing tools on the market, such as Skype (Skype, 2012), sample video frames at low resolution to mitigate this problem. In addition, another possible reason could be that there is not enough real estate available on the screen to render each participant in sufficient detail (Tracy Jenkin, 2005). Some existing systems try to solve this problem by allocating screen real estate according to interest. For example, Cisco's WebEx (Cisco, 2012) system detects volume thresholds and renders only the active speakers on the screen. eyeView (Tracy Jenkin, 2005) uses visual cues of looking behavior to dynamically change the size of different remote participants and render the current object of interest in a high-resolution focus window.
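As a rough illustration of this kind of interest-based allocation (a minimal sketch under assumed inputs, not WebEx's or eyeView's actual implementation; the Participant class, threshold value, and window count are illustrative), one could select the participants to render in detail with a simple volume threshold:

```python
# Sketch: choose which participants get a high-resolution window based on
# a per-participant audio level. All names and numbers here are assumptions.
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    audio_level: float  # e.g., RMS microphone level normalized to 0.0-1.0

def select_active_speakers(participants, threshold=0.2, max_windows=4):
    """Return participants whose audio level exceeds the threshold,
    loudest first, limited by the available high-resolution windows."""
    active = [p for p in participants if p.audio_level >= threshold]
    active.sort(key=lambda p: p.audio_level, reverse=True)
    return active[:max_windows]

if __name__ == "__main__":
    room = [Participant("A", 0.05), Participant("B", 0.45), Participant("C", 0.30)]
    for p in select_active_speakers(room):
        print(p.name, "rendered at high resolution")
```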
Perceptive constraint
The second problem we are trying to address is the loss of semantic cues, such as visual-spatial and auditory-spatial cues, in the remote video conferencing system, which makes it more difficult for remote participants to focus and stay engaged than in a co-located meeting. In face-to-face conversation we commonly adjust the focus of our gaze towards those who are speaking. Without conscious thought, our eyes dart in and out of different focal depths, expanding and contracting the aperture of our inner eye. The subtle depth of field created when focusing on the person you are speaking with or looking at is a natural tool that affords literal and cognitive focus while fading out the irrelevant background information. Moreover, based on auditory cues, including the direction of the sound source, people can easily focus on a selected source of interest; this is also known as the "cocktail party effect". In the case of video conferencing, these natural behaviors are reduced by the inherent constraints of a "flat screen" (Tracy Jenkin, 2005) (Vertegaal, 1999). Eye gaze and sound have been commonly explored as potential cues (Vertegaal, 1999) (Okada, 1994) by utilizing omnidirectional cameras, multiple web cameras, and speaker arrays. We believe the conveyance of audio and visual cues serves a critical role in directing and managing attention and focus.
Inspirations
In photography and cinematography, one of the basic techniques artists use is depth of field (DOF): blurring out the background to draw the audience's attention to the character or speaker in focus (Figure 2). In many movies, when the main speaker changes, the foreground focus switches accordingly. The DOF effect provides better focus while at the same time keeping viewers aware of the context.
Figure 2: Depth of Field Techniques used in Photography. Photo courtesy of Joshua Hudson.
Another inspiration comes from the real-life experience of watching movies. When a movie starts, the lights in the theater are turned off. By diminishing the bright light in the surrounding environment, people can concentrate better on the central projection. People might take this process for granted, but they naturally mimic it when they watch movies at home in front of their own computers or TVs. Perhaps "mimicking" is not the right word: people unconsciously want to create a dark environment whenever they want to concentrate on a smaller screen, even without borrowing the concept from movie theaters.
Hypothesis and Goal
To achieve better focus through Diminished Reality.
We propose the idea of emphasizing the foreground through de-emphasizing the background
and simplifying the visual contents. Based on the philosophy of "less is more", we try to create a
visually simplified reality that can help people focus and concentrate. The purpose of the display is not to faithfully reproduce the remote space with uniform high-resolution video.
Instead, the system diminishes non-essential background elements such as clutter while
highlighting focal persons and objects (such as whiteboards). In the end, what remote people
see is a "fake" or "synthesized" reality, yet it is a more useful reality.
We address the above hypothesis in our design of FocalSpace (Figure 3,4), a video conferencing
system that can dynamically track the foreground, and diminish the background through
synthetic blur effects.
Figure 3: FocalSpace System. It has 3 depth cameras and 3 microphone arrays to track the video and audio in an entire
video conferencing room.
Figure 4: FocalSpace central configuration. Three depth cameras are arranged in such a way that 180 degrees
of the scene can be captured. One microphone array is integrated in each depth camera.
Contribution
Vision
- DR as the Filter: We propose dynamic DR (Diminished Reality) as a basic approach for filtering information, and organizing people's focus and attention through visual communication.
- Natural Blur: We applied synthetic blur as a main visual effect to diminish the background. Synthetic blur has been proven to function as a natural means of helping people gain a central focus while remaining aware of the context.
Usefulness
- To give sufficient detail for the rendering of items of interest in the foreground, even with very limited transmission bandwidth and screen real estate resources.
- To diminish the peripheral visual and audio information, and give cognitively natural, computationally effective cues for selective attention.
- To remove unwanted background noise, or protect information privacy in the background.
Technical contribution
We invented an interactive video conferencing system based on a customized technical solution. Enabled by a depth map of the meeting room captured by three depth cameras, together with three microphone arrays, the FocalSpace system detects interactive cues (such as audio cues, gesture cues, proximity cues, and physical markers) and builds a dynamically changing "layered space model" on top of them. The tracked foreground participants and subjects are taken in and out of focus computationally, with no need for additional lenses or mechanical parts. Because we can infer many different layers of depth in a single scene, multiple areas of focus and soft transitions in focal range can be simulated.
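To make the depth-layered blur concrete, here is a minimal sketch (not the actual FocalSpace implementation; the function names, focal-range thresholds, and OpenCV-based pipeline are assumptions) that keeps a chosen depth range sharp and composites it over a blurred copy of the same frame:

```python
# Sketch: diminish the background of a color frame using an aligned depth map,
# as a Kinect-style sensor would provide. Thresholds and kernel sizes are illustrative.
import cv2
import numpy as np

def diminish_background(color, depth_mm, focus_near=800, focus_far=1800, blur_ksize=31):
    """Keep pixels whose depth falls inside [focus_near, focus_far] sharp and
    composite them over a blurred version of the rest of the frame."""
    blurred = cv2.GaussianBlur(color, (blur_ksize, blur_ksize), 0)
    in_focus = ((depth_mm >= focus_near) & (depth_mm <= focus_far)).astype(np.float32)
    # Soften the mask edge so the focal transition resembles optical depth of field.
    mask = cv2.GaussianBlur(in_focus, (15, 15), 0)[..., None]
    return (color * mask + blurred * (1.0 - mask)).astype(np.uint8)

if __name__ == "__main__":
    color = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    depth = np.full((480, 640), 3000, dtype=np.uint16)
    depth[100:400, 200:500] = 1200  # a "participant" inside the focal range
    out = diminish_background(color, depth)
```

In the real system, the focal range would be chosen dynamically from the interaction cues (for example, the depth of the participant who is currently speaking) rather than hard-coded as in this sketch.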
Outline
An "evolutional" approach has been adapted for the thesis writing (Figure 5). The thesis starts
with the inspiration from photograph and real life visual experience, and then moves to an
overview of computer mediated reality, which includes great work by researchers, artists,
designers and industry practitioners who have through about altering human perception of the
real world. Moving forward, a design framework of "Diminished Reality + Augmented Reality" is
demonstrated in chapter 3, it explains the idea of dividing visual contents into background layer
and foreground layer in order to visually simplify and filter information. Following the design
framework, the definition of "foreground" in video conferencing room is expended from "talking
heads" to all the physical artifacts that are involved in the foreground activities; a wider
approach to emphasize or augment the foreground is explored as well. Various use cases are
explained in chapter 4 through chapter 6, with a detailed description of the design concept,
system setup, technical implementation and user feedbacks. The chapter that follows describes
about the design vision beyond video conferencing room. The conceptual framework of "DR+AR"
is explained under the context of driving and watching sports games separately.
Expanded Domain
(Chapter 7)
Expanded FocalSpace
(Second half of Chapter 4)
Use Cases
(First half of Chapter 4)
Design Framework
(Chapter 3)
Real world inspiration
(Chapter 1)
Figure 5: An "evolutionary" approach to outlining the thesis
RELATED WORK
"WHAT YOU SEE IS NOT REAL"
This chapter introduces related work in two different categories. We first
introduce the efforts of researchers trying to create a visual perception of
computer mediated reality, which is followed by an introduction to research
related to humans' focus and attention on screen.
Computer Mediated Reality
Visual perception of the reality
It's a common belief that computation doesn't sit inside the computer screen anymore.
Computation is everywhere. Mobile phones and tablets make computation portable, giving
people the feeling that digital information is always around, regardless of time or place. But
currently, people can still tell where the physical world ends and the digital world begins, as
when they switch their eyes from their mobile phones to the real road in front of them. What if, in
the future, you cannot trust your eyes anymore? What if you cannot tell what is reality and what
is reality modified by the computer? This situation might be scary in some contexts, but it is
becoming an inevitable part of the future.
Steve Mann and his group developed EyeTap, a camera-and-display "eyeglass" that can show digital information at a proper focal distance (Steve Mann, 2002). This technology can process and alter what the user sees of reality. Computer Mediated Reality is the concept introduced with EyeTap (Figure 6). It is "a way to give humans a new perspective on reality, by adding, subtracting or other ways to manipulate the reality": a visual perception of the real world mediated by computational technology.
Figure 6: EyeTap. It's a device that can be worn as an eyeglass. It gives humans a new perspective on reality through
computer mediation.
Augmented Reality
As the most common type of Computer Mediated Reality, Augmented Reality (AR) has been widely explored and adopted in different use cases. As described on Wikipedia (AugmentedReality, 2012), Augmented Reality is "a live, direct or indirect, view of a physical, real-world environment whose elements are augmented by sensory input such as sound, video, graphics or GPS data". AR applications, especially those based on mobile platforms, are starting to be widely used in various domains, such as consumer applications, entertainment, medicine, and design. AR can be used both indoors and outdoors with various technical solutions. This thesis will not address AR in depth, as AR itself is an open-ended topic. Ronald Azuma gives an overview of the current state of AR (Azuma, 2004). It should be noted that researchers
from University of North Carolina and University of Kentucky envisioned the future office with
every planar surface augmented with digital information for remote communication and
collaboration (Figure 7). This is relevant to the topic of this thesis, as "the future office" also
utilizes per-pixel depth tracking to learn about visible surfaces including walls, furniture, objects,
and people, and then to "either project images on the surfaces, render images of the surfaces,
or interpret changes in the surfaces" (Ramesh Raskar G. W., 1998).
Figure 7: The "Office of the Future" depicts a vision of a future office where all visible surfaces, including walls, furniture,
and objects, can be detected and utilized to hold rendered images.
Time Augmented Reality
The term "Computer Mediated Reality" alludes to a broader goal beyond simply augmenting the
surrounding environment with GUI or GPS related data. ART+COM (ART+COM, 2012) has two
projects, Timescope and Jurascopes, which augment the perception of reality along the time
axis. This is one innovative example of "Computer Mediated Reality". Timescope (Figure 8) was
installed in front of the location of the former Berlin Wall. It enables viewers to travel back in
time at their present location. Through a media scope, people can take a trip back to see the
history of a structure that defined the city for over 30 years. Jurascopes (Figure 8) were installed
in the Berlin Museum of Natural History. Through these media telescopes, viewers can see, one
after the other, inner organs, muscles, and skin placed on top of the skeleton of a dinosaur.
Eventually, the dinosaur appears to come to life and run around. Sounds from the environment
and the animal itself contribute to the experience.
Figure 8: Timescope (Left) and Jurascopes (Right) from ART+COM. Timescope enables viewers to travel back in time at
their present location. Jurascopes turn dinosaur skeletons into live dinosaurs through an augmenting lens.
Improved Reality
"Improved Reality" is an informal term used to describe the approach of enhancing or enriching
the visual perception of reality. Artvertiser is an example of this phenomenon. This project
enables users to view a virtual canvas on top of an existing advertisement through a mobile
device (Artvertiser, 2012). The detected advertisement can be on a building, in a magazine, or
on the side of a car. Artists can create their own visual art on top of the virtual canvas via mobile
devices (Figure 9).
Figure 9: Artvertiser is a project that replaces advertisements on billboards with digital artwork.
One may notice that the technique of "Diminishing Reality" is normally used in combination with
other approaches to "Computer Mediated Reality". In most cases, augmented contents are
added on top of the diminished reality.
Altered Reality
There is an old saying in Chinese: "横看成岭侧成峰，远近高低各不同". It means "although it is the same mountain, it looks like a ridge if people see it from the front, but it will turn into a peak if people see it from the side; it seems so different depending on how far and how high the spectator is from the mountain". In the same manner, if we change the perspective from which people see reality, an altered world could exist in human perception.
The artist Carsten Höller made goggles out of optical glass prisms that he called "Umkehrbrille
Upside Down Goggles" (Figure 10). By wearing these goggles, people could perceive the real
world upside down. In the 1890s, George Stratton conducted an experiment to see what would
happen if he viewed the world upside down through an inverting prism. He found that after 4
days, his brain started to form a perceptual adaptation, and he could see the world the right way
up again (Stratton, 1896).
Figure 10: Umkehrbrille Upside Down Goggles by Carsten Höller. By wearing these goggles, people could perceive the real
world upside down.
Rather than altering perspectives based on human vision, Chris Woebken and Kenichi Okada
developed "Animal Superpowers" to alter the real world with animal senses and perceptions
(Chris Woebken, 2012). "Animal Superpowers" includes three physical wearable devices to
mimic animal senses on the human perceptive level (Figure 11). "Ant" makes people feel like
ants by magnifying human vision 50x through a microscope in the hand. The device enables
users to see through their hands and explore tiny cracks and details of a surface. The "bird"
device borrows birds' capability of recognizing directions. It uses a GPS system to vibrate when
people move in a certain direction, such as home or an ice cream store. Finally, the "giraffe"
device extends users' necks and gives them the capability of seeing tall things. It can act as a
child-to-adult converter by raising children's perspective by 30 centimeters.
Jeff Lieberman, who created "Moore Pattern", a kinetic optical-illusion sculpture (Lieberman, 2012), subscribes to the notion of "seeing is believing". Within the context of this document, "seeing" can be mediated by computation, which generates a unique perception, or "belief", for people. That perception might extend human capability, offer a unique experience, or simply entertain.
Figure 11: "Animal Superpowers" includes three wearable physical devices that mimic animal senses on the human
perceptive level. "Ant" makes people feel like ants by magnifying human vision 50x through a microscope in the hand.
"Bird" borrows birds' capability of recognizing directions. "Giraffe" can act as a child-to-adult converter by raising children's
perspective by 30 centimeters.
Diminished Reality
Compared to AR, much less attention has been devoted to other types of Computer Mediated
Reality, including Diminishing Reality. A clear definition for Diminishing Reality has yet to be
offered.
Steve Mann and his colleagues were among the first to import the term "Diminishing Reality"
into the HCI field. The Reality Mediator allows wearers' visual perception of the real world to be
altered in such a way that users can diminish or modify visual detritus freely. It also ensures that
the augmented information can be added on top of the diminished reality without causing
information overload. For example, road directional information, or personal text messages, as a
type of digital augmentation, can be seen on top of a board used for advertisement in reality
[Figure 12].
Figure 12: Diminished and augmented planar surface. Through wearable goggles, road directional information, or
personal text messages, as a type of digital augmentation, can be seen on top of a board used for advertisements in reality.
Steve Mann also created some art pieces along the same conceptual lines. On a monitor screen facing the real environment, viewers could see a different state of the reality at which the monitor was pointing. For example, people could see part of the reality without fog even on a foggy day, or a brightly lit view in the dark (Figure 13). Such art pieces expand people's perspectives on reality: observers can see reality beyond a specific point in time.
Figure 13: Artwork by Steve Mann. Diminishing fog and darkness.
Scalado is a company focusing on creative video capturing and viewing tools (Scalado, 2012). By embedding various image-processing technologies that run in real time, it augments the captured image with angles, time, perspectives, digital content, etc. One feature related to Diminishing Reality is called "Remove". This feature highlights the background elements and enables users to select and delete them. For example, users can easily delete walkers passing by while keeping a focal figure in the foreground of a street picture (Figure 14). The camera works by capturing several images once the shutter is pressed and computing the differences between the captured images. Similar background-subtraction technology is used for real-time image processing.
Figure 14: "Remove" from Scalado: mobile software for capturing and editing photos to diminish background detritus.
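A minimal sketch of how such burst-based removal can work (an illustrative assumption, not Scalado's actual algorithm; the function names and the difference threshold are made up) is to take the per-pixel temporal median of the burst as a clean background plate and flag pixels that deviate from it:

```python
# Sketch: recover a "clean plate" from a burst of aligned photos. Pixels covered
# by a mover in one frame are usually background in the others, so the median
# recovers the static scene; differences against it mark removable content.
import numpy as np

def clean_plate(burst_frames):
    """burst_frames: list of aligned HxWx3 uint8 images captured in quick succession."""
    stack = np.stack(burst_frames, axis=0).astype(np.float32)
    return np.median(stack, axis=0).astype(np.uint8)

def moving_pixels(frame, plate, threshold=30):
    """Flag pixels that differ from the clean plate; these are candidates the
    user could select and remove (or keep, for a focal foreground figure)."""
    diff = np.abs(frame.astype(np.int16) - plate.astype(np.int16)).sum(axis=2)
    return diff > threshold
```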
Depending on the criteria chosen, work related to Diminishing Reality can be categorized in different ways. It can be categorized by triggering cue: the DR effect can be triggered by voice, eye gaze, or manual selection through a GUI or a pointing device. It can also be categorized by domain: DR can be used for art pieces, educational applications, entertainment, daily life, work, and so forth.
Diminished "Digital Reality"
There are two points in the concept of "Diminishing Reality" as used in this thesis: diminishing, or de-emphasizing, part of the scene perceived by human eyes; and leaving the viewers with a new perception of the reality.
We will describe some graphic interface designs and data visualizations that seek to diminish or filter information on a visual display. Technically speaking, the issue is not diminishing "reality", as everything is virtually displayed. However, it follows the same design concept of "simplifying the background visual contents, or context, and helping the viewers to focus on the current foreground". Some researchers call this an "attentive display". It is helpful to get an overview of some of these projects and learn how a visual contrast between the foreground and background information can be created.
In their paper "Semantic Depth of Field", Kosara et al. divided the design approaches of
attentive visualizations into 3 types: spatial method, dimension method, and cue method
(Robert Kosara, 1997). Through spatial or distortion-oriented methods, the geometry of the
display is distorted to allow magnification of the central focus. Various methods have been put
into practice, including fisheye views (Furnas, 1986) (M. Sarkar, 1994), stretchable rubber sheets (M. Sarkar, 1993), distorted perspectives, and hyperbolic trees (Munzner, 1998), etc. For some
objects that have a large amount of data related to them, only a small part of the data can be
shown as an overview, and another dimension of data can be displayed based on the selection.
Examples are magic lenses (M. C. Stone, 1994) and tool glasses (E. A. Bier, 1993).
Finally, the cue method, which is the method most relevant to "Diminishing Reality", makes the
foreground data noticeable not by changing the locational relationships, but by assigning them
certain visual cues to emphasize their features. One example is the Geographic Information
System (GIS). When different types of data are shown on top of the same layout, the central
information is emphasized with higher color saturation and opacity. Semantic depth of field
(DOF) (Robert Kosara, 1997) is one type of cue method as well. Adding "semantic" in front of DOF means that the blur is applied uniformly, without depth information, to the parts that are out of focus (Figure 15).
Figure 15: Diminishing Reality in Data Visualization and Software Applications. (Left) A file browser showing today's files
sharply, with the older ones blurred; (Right) A chess tutorial showing the chessmen that threaten the knight, with other
inactive roles blurred.
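To make the cue method concrete, here is a minimal sketch (illustrative only, not taken from the cited systems; the saturation and opacity factors are arbitrary assumptions) that de-emphasizes everything outside a focus mask by washing out its color and fading it toward white, while leaving the focused item untouched:

```python
# Sketch of a "cue method": emphasize the foreground by visual cues (saturation,
# opacity) rather than by moving or rescaling anything on the display.
import numpy as np
import cv2

def deemphasize(image_bgr, focus_mask, saturation_scale=0.25, opacity=0.5):
    """focus_mask: HxW boolean array, True where the item of interest lies."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= saturation_scale          # wash out color everywhere
    washed = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    faded = cv2.addWeighted(washed, opacity, np.full_like(washed, 255), 1 - opacity, 0)
    out = faded.copy()
    out[focus_mask] = image_bgr[focus_mask]  # restore the focused item untouched
    return out
```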
Adjusting the transparency or blurring the background is a common method in web user interfaces to highlight the foreground and filter out visual detritus for viewers. The project website of Tribal DDB (DDB, 2012) is one example (Figure 16). The reason is that this visual effect helps viewers concentrate better on the current information, whether it is a webpage or a video.
Figure 16: The company webpage of Tribal DDB. By adjusting the transparency or blurring the background, we could
highlight the foreground, and filter out visual detritus for the viewers.
Some video streaming websites have a button labeled "turn on/off the light" as a video-mode selection (YouKu, 2012). When the "light" is off, as in the right image in Figure 17, the background of the video turns black and all the other visual elements, such as reviews, ads, menus, and titles, are removed. The name of the feature, "Turn on/off the light", reminds people of turning the light off in the movie theater. Indeed, that is also a real-life scenario where "Diminishing Reality" is used naturally.
Figure 17: "Turning on and off the light" on YouKu. (Left) When "Turn on the light" button is on; (Right) When "Turn off the
light" button is on [29]. "Turn on/off the light" also reminds people of turning the light off in the movie theater. Indeed,
that's a real life scenario where 'Diminishing Reality" is used naturally.
We just offered a glimpse of related work in this category. There was and is a strong research
community focusing on how to visualize data in an effective way while keeping the balance
between focus and context.
A bigger question we want to ask ourselves is this: in the past, when we had too much virtual data, researchers thought of creative ways to manipulate visualizations for easier human perception. Now, as we start to encounter a large amount of visual and artificial information in the physical world, shouldn't we take action and think about ways to organize humans' perceived reality as well?
Abstracted Reality
The non-photorealistic camera offers a non-photorealistic rendering approach to "capture and convey shape features of the real world scene" (Ramesh Raskar K.-H. T., 2004). By augmenting a camera with multiple flashes that cast shadows from different directions while capturing the picture, it is possible to highlight outlines, simplify textural information, and suppress unwanted details of the image (Figure 18). In the use case of video conferencing, by abstracting the face of the talking head, we could transmit the shape and facial expression accurately without showing detailed texture. This would be useful for the purpose of maintaining privacy.
Figure 18: Non-Photorealistic Camera. By augmenting a camera with a multi-flashlight that can cast shadows from
different perspectives while capturing the picture, it could be possible to highlight the outline, simplify the textural
information, and suppress unwanted details of the image.
In the next example of computer-generated artwork (Doug DeCarlo, 1991), researchers believe
that the abstract rendering of a photorealistic image is one way to clarify the meaningful
structures of an image in information design. The abstract rendering (Figure 19) was generated
by transforming a photograph based on the viewers' eye-gaze fixation. It has been proven in the
art and design field that abstract renderings could achieve more effective communication than
realistic photos in some cases. The goal of this work is to computationally abstract a picture and
create a non-photorealistic art piece. One of the questions is to what extent the system should
simplify the visual components of the real picture. Based on eye-gaze data obtained when
different viewers perceived the same picture, the part of the picture attracting most of the
attention was left with the most details, and the other part was much more abstracted. This is a
good example of using "Diminishing Reality" but in an abstract way.
Figure 19: The abstracted rendering is generated by transforming a photograph based on viewers' eye-gaze
fixation. (Original photo courtesy of philip.greenspun.com.)
Focus and Attention
"What's the information people actually care about?"
Awareness of focus for video conferencing
In FocalSpace, we care about focus and attention. How can we take users' focus and attention
into consideration in the display? For remote video conferencing, researchers have explored
different methods of tracking and estimating attendees' attention and focus. Eye gaze and
sound have been commonly explored as potential cues by utilizing omnidirectional cameras,
multiple web cameras, and speaker arrays. The idea of identifying focus and attention in a video
conference dates back to a "voice voting" system that automatically switched between multiple
camera views based on the active speakers (Robert C. Edson, 1971).
Research in video conferencing has been concerned with remote listeners' focus and attention. On one hand, some systems actively communicate the proper focus to remote users. On the other hand, some systems try to give remote users the flexibility to access the part of the remote scene on which they want to focus. In "Reflection", researchers add the reflections of all the participants from different remote locations onto the same display (Stefan Agamanolis, 1997). Auditory cues are used to track and emphasize the foreground. The active speakers are rendered opaque and in the foreground to emphasize their visual prominence, and other participants are rendered slightly faded in the background in a manner that maintains their presence without drawing too much attention. The ClearBoard system explores how to seamlessly combine the working space with the interpersonal space (H. Ishii, 1994). This system saves participants the effort of switching attention between the two spaces.
Layered video signal processing for conferencing has also been explored, where there is a stark and clear differentiation of foreground from background. A number of real-time algorithms (Zhang, 2006) have been implemented to blur background content in order to protect attendees' privacy, while only parts of the face are kept clear. Moreover, HyperMirror adopted blue-screen technology to layer one participant into a scene containing another (Maesako, 1998).
Focus and attention in other application domains
We would also like to discuss focus and attention beyond video conferencing systems.
In collaboration with Tribal DDB Amsterdam and the Grammy award-winning orchestra Dutch
Metropolis, Philips Sound developed an interactive orchestra system, "Obsessed with Sound".
Through the interface, users can interactively single out any one of 51 musicians and
hear every detail of that musician's contribution to the orchestral piece. The visual cue that
indicates the selection blacks out the other musicians and leaves the selected musician in
sharp focus and full color saturation (Obsessed with Sound, 2012).
When people listen to an orchestra, it is very hard to focus on a single player. The designers
believe that every single detail in the music is important and should be heard. To capture this,
they created this unique campaign to celebrate each individual artist behind every musical
moment. A special orchestral piece was recorded as 55 separate music tracks and combined into a
single music system. As they listen to each musician, users can also discover what is behind the
sound, from the total hours the musician has played in their lifetime, to the number of notes played in
the piece, to their Twitter feed, Facebook account, and personal webpage (Figure
20).
35
Figure 20: "Obsessed with Sound," an interactive music system that enables users to single out a particular musician and
hear every detail of his or her contribution.
In Fixation Maps, a study (Wooding, 2002) has shown that by controlling luminance, color
contrasts, and depth cues, artists can guide the viewer's gaze toward the part that expresses
the main theme. By aggregating fixation data from 131 subjects viewing Paolo Veronese's
Christ Addressing a Kneeling Woman, an artist observed that subjects' gaze is, on average, drawn
to the two figures, Jesus and the woman looking at him. Based on this finding, the artist blacked
out the rest of the painting to create a more focused art piece (Figure 21). The modified work
places spotlight effects on the two characters that attracted the most attention and thus
directs viewers' attention to that portion of the painting.
36
Figure 21: Aggregated fixations from 131 subjects viewing Paolo Veronese's Christ addressing a Kneeling
Woman. Subjects' gaze is drawn to the two main figures. (Original image © National Gallery, London,
annotations @ IBS, University of Derby, UK, courtesy of David Wooding.)
Other artwork has also addressed focus and attention. An ECS display was used to track
users' point of gaze from a distance and display artwork with visually highlighted attentive areas
(Reingold, 2002). The approach of dynamically filtering information based on users' interest has been
applied in different applications. Most of this work uses eye-gaze tracking to obtain
people's points of attention on the display.
In conclusion, different visual effects can convey different meanings, and different foci on the
same display can generate different interpretations. Through the manipulation of visual
effects, we can direct people's attention to certain portions of the screen.
Moreover, work on GCD (gaze-contingent displays) has produced multi-resolution and
highlighting visual effects (Figure 22). However, almost all of this work started from a functional
perspective: it either tried to save rendering power for large displays (Reingold, 2002) or to save
transmission bandwidth. As Loschky and McConkie concluded, researchers wanted
to create a display that, although distinguishable from a full-resolution image, would not
37
degrade visual task performance (Loschky, 2005). In other words, the hypothesis was that
highlighting or partial-blur effects should, at worst, not harm task performance. In contrast,
we use these highlighting effects as a positive visual factor for efficient communication.
Figure 22: Gaze-Contingent Display. The system tracks the viewer's gaze and displays the portion the viewer focuses on
in full resolution, while rendering the other parts in low resolution to save rendering power for large displays.
Motivation for including synthetic focal points is derived from the constraining size of screen
displays; a system may want to call attention to specific details while at the same time keeping
the viewers aware of the global context. Moreover, additional methods may be employed to bring
the user's attention to the most relevant, central information on the display, as shown with the
spotlight technique (Azam Khan, 2005).
As discussed, most attention-based display systems have been designed to address
very specific problems (Rainer Stiefelhagen, 2001). FocalSpace collectively utilizes
focal-point techniques and background subtraction to bring attention to dynamic points of
interest.
38
IMPROVING FOCUS BY DR
This chapter introduces the "layered space model" enabled by the FocalSpace
system, and various interaction cues that can be used to build the dynamic
spatial layers.
39
The "Layered Space Model" of FocalSpace
We divide the remote space into three discrete layers, as shown in Figure 23: (1) the
background; (2) the foreground, both active and inactive, which contains participants and
relevant objects; and (3) the augmented layer, which includes content such as digital
information contextually registered to the foreground and graphical user interface elements.
These layers appear in parallel to the user, but our system's understanding of both depth and
object notation allows us to keep a rich dataset of the distant space and its inhabitants.
The diminished reality effect and the other FocalSpace use cases are implemented on top of
these space layers. For example, the system can diminish/subtract the background layer by
blurring it out, and augment the foreground layer based on different interaction scenarios. This
technique of augmenting diminished reality balances cognitive load by adding information only
once unnecessary information has been removed.
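To make the model concrete, the following C++ sketch shows one minimal way such layer assignments could be represented and updated by interaction cues. It is an illustrative sketch only, not the actual FocalSpace implementation; all names and values are assumptions.

```cpp
#include <string>
#include <vector>
#include <iostream>

// Hypothetical sketch of the layered space model: each tracked element is
// assigned to one of the discrete layers described above.
enum class Layer {
    Background,          // irrelevant pixels, rendered with a diminish (blur) effect
    InactiveForeground,  // relevant but not currently active participants/objects
    ActiveForeground,    // currently active participants/objects, kept in focus
    Augmentation         // digital content registered to foreground elements
};

struct SceneElement {
    std::string name;  // e.g. a participant or a flip chart
    Layer layer;       // current layer assignment, updated by interaction cues
};

int main() {
    std::vector<SceneElement> scene = {
        {"participant A",  Layer::ActiveForeground},
        {"participant B",  Layer::InactiveForeground},
        {"flip chart",     Layer::InactiveForeground},
        {"office clutter", Layer::Background},
    };

    // A cue (e.g. proximity to the flip chart) promotes an element to the active layer.
    scene[2].layer = Layer::ActiveForeground;

    for (const auto& e : scene)
        std::cout << e.name << " -> layer " << static_cast<int>(e.layer) << "\n";
    return 0;
}
```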
[Figure 23 diagram: the remote view of active participants, objects and media, and environment is separated into a graphic/augmentation layer (augmented), active and inactive foreground layers, and a background layer (diminished).]
Figure 23: Dynamic Layered Space Model. FocalSpace divides the space into foreground layer, background layer,
augmented layer and graphic layer.
40
Foreground layer
Conference participants and objects of interest exist within the foreground layer. Using depth
sensing, audio, skeletal tracking, and physical-marker recognition, we are able to automatically
identify elements within the foreground. These elements are generally the conference
participants themselves, along with contextual objects such as planar surfaces and arbitrary
objects. The methods of detection are further discussed in the features & techniques section.
We divide the foreground into an active foreground layer and an inactive foreground layer. The
active foreground layer includes the tracked, currently active elements, while the inactive
foreground layer includes elements that are relevant and important but not currently active.
Elements on the active foreground layer are always clearly in focus on the remote display.
Diminished background layer
The background layer consists of the irrelevant visual elements, often the distracting noise
that is transmitted during a normal video conference. Visual and auditory clutter can add
unwanted distraction, especially when participants are in public or heavily trafficked areas.
In order to reduce visual noise and ultimately set off what becomes the active foreground, we
utilize the technique of diminished reality, in which the contextually inactive physical
space is removed through a number of different methods. Inactive elements are blurred to
simulate focus: objects within the immediate focus of the active foreground stay sharp, while
objects in the background become blurry, or out of focus.
Add-on Layer
Remote participants are able to observe and interact with the augmented foreground, which
presents a number of interactive menus and dialogs that are spatially and temporally registered
with objects found in the active foreground. These augmentations can include user nametags,
contextual dialogs, and shared drawings and documents, among other features otherwise invisible
to a remote viewer.
Finally, the GUI layer presents the user with a suite of controls to manage the FocalSpace
system. The GUI layer allows fine control over the diminished background layer, enabling the
user to select specific effects and their respective intensities.
Interaction Cues of "Layered Space Model"
Interaction cues directly influence people's focus and attention. We categorize the cues that
might indicate important and active foreground activities in videoconferencing. We
41
use the human body as a starting point to explore the relevant behaviors in video conferencing.
Physical artifacts are taken into consideration as well (Figure 24, Table 1).
[Figure 24 diagram: the human (see, talk, move, gesture) at the center, connected to the environment and to objects.]
Figure 24: To categorize the interaction cues that can be used to build the Layered Space Model, we use the human body
as an entry point to explore the relevant behaviors in video conferencing. The factors of environment and physical artifacts
are taken into consideration as well.
Human Behaviors     Interaction Cues in FocalSpace
Listen              Audio cue
Talk                Audio cue
See                 Gaze cue
Gesture             Gesture cue
Touch               Proximity cue; physical marker as an interactive cue
Move                Proximity cue; physical marker as an interactive cue

Table 1: Interaction Cues in FocalSpace
42
In most remote meetings, the talking heads are the primary focus, and the voice cue can be used
to track the active talking head. Beyond the voice cue, participants' iconic gestures, movements in
certain locations, and even eye contact should be taken into consideration as well.
The Layered Space Model uses depth sensors and various cues such as voice, gestures, and
physical markers to identify the foreground of the video and subsequently diminish the
background. The cues and related interactions are discussed in this section.
Audio cue
In group meetings, the most interesting foreground is typically the current speaker (Vertegaal,
1999). The primary method of determining the foreground is through detecting participants by
whether or not they are actively speaking. Audio conversation is a central component to remote
collaboration - as such, we can use it to determine implied activity (Figure 25). Remote
participants who begin to speak will be moved from inactive to active foreground layer and
automatically focused. In order to detect the absolute location of an active participant, we use a
combination of audio direction as derived from a microphone array and computer vision for
skeletal tracking. Once a substantial skeletal form is detected within a reasonable angle of the
detected audio, we can deterministically assign the corresponding RGB and depth pixels to a
participant.
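The following C++ sketch illustrates the kind of matching described above: the microphone array is assumed to report a horizontal source angle, and each tracked skeleton's head joint is compared against it. The angle tolerance and data layout are illustrative assumptions, not the actual implementation.

```cpp
#include <cmath>
#include <vector>
#include <optional>
#include <iostream>

// Hypothetical sketch of matching a detected voice direction to a tracked skeleton.
struct Skeleton {
    int id;
    float headX, headZ;  // head joint position: x (meters, left/right), z (meters, depth)
};

// Horizontal angle of the head joint as seen from the sensor, in degrees.
float skeletonAngleDeg(const Skeleton& s) {
    return std::atan2(s.headX, s.headZ) * 180.0f / 3.14159265f;
}

// Return the skeleton closest to the audio beam angle, if any lies within tolerance.
std::optional<Skeleton> matchSpeaker(const std::vector<Skeleton>& skeletons,
                                     float audioAngleDeg, float toleranceDeg = 10.0f) {
    std::optional<Skeleton> best;
    float bestDiff = toleranceDeg;
    for (const auto& s : skeletons) {
        float diff = std::fabs(skeletonAngleDeg(s) - audioAngleDeg);
        if (diff <= bestDiff) { bestDiff = diff; best = s; }
    }
    return best;
}

int main() {
    std::vector<Skeleton> tracked = {{1, -0.8f, 2.0f}, {2, 0.5f, 1.8f}};
    float voiceAngle = 14.0f;  // angle reported by the microphone array (assumed)
    if (auto speaker = matchSpeaker(tracked, voiceAngle))
        std::cout << "Active speaker: skeleton " << speaker->id << "\n";
    else
        std::cout << "No skeleton matches the voice direction\n";
    return 0;
}
```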
Figure 25: Voice-Activated Refocusing. (Left) User 1 is talking and in focus; (Right) User 2 is talking and in
focus. In FocalSpace, remote participants who begin to speak are placed into the active foreground layer
and automatically focused.
Gestural cue
FocalSpace enables the detection of pre-defined gestures. For trained users, gestures can be
used to tag content, trigger certain interactions, and invoke certain features. For new users,
FocalSpace can track their natural gestures.
43
Take one use case as an example: in a videoconference with many participants it can be hard to
get attention. In traditional meetings we typically have a meeting leader who keeps track of the
people who raise their hands. FocalSpace can track the gesture of "right hand up"
automatically and put the awaiting person into focus together with the currently active
participant (Figure 26).
Figure 26: Gesture-Activated Refocusing. Natural human gestures can be detected in FocalSpace. In one use case, the
system puts a person who raises a hand onto a waiting list, and people on the waiting list are brought into focus
as well.
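A minimal sketch of how the "hand up" cue could be detected from skeletal joints is shown below; the joint names, coordinate convention, and threshold are illustrative assumptions rather than the actual FocalSpace code.

```cpp
#include <iostream>

// Hypothetical sketch of the "hand up" cue: a raised hand is approximated by
// comparing the hand joint's height with the head joint's height.
struct Joint { float x, y, z; };  // y is the vertical axis (meters)

struct SkeletonFrame {
    Joint head;
    Joint rightHand;
};

bool isHandRaised(const SkeletonFrame& s, float margin = 0.10f) {
    // The hand counts as "up" once it is clearly above the head.
    return s.rightHand.y > s.head.y + margin;
}

int main() {
    SkeletonFrame sitting = {{0.0f, 1.20f, 2.0f}, {0.2f, 0.90f, 2.0f}};
    SkeletonFrame raising = {{0.0f, 1.20f, 2.0f}, {0.2f, 1.45f, 2.0f}};
    std::cout << "sitting: " << (isHandRaised(sitting) ? "hand up" : "no") << "\n";
    std::cout << "raising: " << (isHandRaised(raising) ? "hand up" : "no") << "\n";
    // In FocalSpace the positive case would move this participant onto the
    // waiting list and bring them into focus alongside the current speaker.
    return 0;
}
```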
Proximity cue
Not all forms of communication during remote collaboration are purely auditory. Many forms of
expression take the form of written words or illustrations. We explore the automatic detection of
activity around planar drawing surfaces, such as flip charts and whiteboards. When an active
participant moves within a certain distance of a planar surface, the surface is moved
from the inactive to the active foreground, activating automatic refocusing (Figure 27); a sketch
of this check follows the figure. The dynamic detection of the distance between an active human
body and a planar surface is what we define as the proximity cue.
44
Figure 27: Proximity-Activated Refocusing. (Top) The user is talking and in focus; (Bottom) when the user moves to
the flipchart and starts to sketch on it, the flipchart takes focus. In a calibrated space, the system knows which region
of the depth map belongs to the flip chart, and the proximity relationship between the flipchart and the user can be
tracked through the depth map.
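The sketch below illustrates the proximity check under the assumption that the flip chart's position is known from calibration; the coordinates and activation distance are illustrative values, not the real implementation.

```cpp
#include <cmath>
#include <iostream>

// Hypothetical sketch of the proximity cue: in a calibrated space the planar
// surface occupies a known region of the depth map, so the cue reduces to
// checking the distance between a tracked joint and that region.
struct Point3 { float x, y, z; };

struct PlanarSurface {
    Point3 center;         // calibrated center of the flip chart in sensor space
    float activationDist;  // distance (meters) below which the surface becomes active
};

float distance(const Point3& a, const Point3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

bool proximityCueFired(const Point3& handJoint, const PlanarSurface& surface) {
    return distance(handJoint, surface.center) < surface.activationDist;
}

int main() {
    PlanarSurface flipChart = {{1.5f, 1.2f, 2.5f}, 0.6f};
    Point3 handFar  = {0.0f, 1.0f, 2.0f};
    Point3 handNear = {1.3f, 1.1f, 2.4f};
    std::cout << "far:  " << (proximityCueFired(handFar,  flipChart) ? "focus flip chart" : "idle") << "\n";
    std::cout << "near: " << (proximityCueFired(handNear, flipChart) ? "focus flip chart" : "idle") << "\n";
    return 0;
}
```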
Physical marker as an interactive cue
To detect arbitrary physical objects, such as a prototype model in a design meeting, we
introduce physical markers (Figure 28). Defining the role of remote objects as tools of
expression and communication is an important component of the FocalSpace system.
Participants can invoke selection by moving a physical token with an identifiable marker to the
object; all arbitrary objects sharing the same 3-D locational range are then focused as
foreground.
Object-activated selection contains additional features, which augment the remote participants'
ability to distinguish the object's state through spatial, contextual augmentation. We discuss
this case further in the user scenarios section.
Moreover, physical markers can be used as commands in the video conference. For example, by
tagging a flip chart image with a physical "email" marker, the system can receive the
command, capture the image, and share it with the group mailing list.
45
Figure 28: Physical marker as an interactive cue to trigger tracking, refocusing, and augmentations.
Remote User Defined/Discretionary Selection
In the case of participants who wish to define an area, person, or object of interest at their
discretion, a user-defined selection mode is provided. Participants are given the ability to
override both the automatic voice and object-activated modes of selection by manually choosing
the area of interest. Interactive objects within a remote space are presented as hyperlink objects
with common user interface behaviors such as hover and click states. Participants are able to
select people and objects directly within the environment using a cursor, or from within an
enumerated list of active objects found within the accompanying graphical user interface.
Other Potential Cues
Beyond all the semantic cues mentioned above, there are more potential cues we could
look into. For example, the remote user might want to look at the person whom the other person
on the screen is looking at; in this case, the cue is the third person's eye gaze.
46
USER SCENARIOS
This chapter explains how FocalSpace can be useful in different use cases,
including filtering visual detritus and directing focus, saving display
space for augmentation, saving bandwidth, and preserving privacy.
47
We develop the applications on top of the "Layered Space Model" in three steps: attention,
augmentation, and transmission. First, FocalSpace cleans up the visual detritus in the
background and directs remote participants' attention to the central focus. Second, to augment or
enhance the foreground activities, we utilize the diminished space to augment the focus with
related digital content. Finally, with a blurred and compressed background, we save bandwidth
for transmission (Figure 29).
[Figure 29 diagram: attention → augmentation → transmission.]
Figure 29: We develop the application on top of the "Layered Space Model" in 3 steps: attention, augmentation, and
transmission. FocalSpace could clean up the visual detritus in the background and direct remote participants' attention to
the central focus. To augment or enhance the foreground activities, we utilize the diminished space to augment the focus
with related digital contents. Finally, with a blurred and compressed background, we save bandwidth for transmission.
Filtering Visual Detritus and Directing Focus
In some cases, the background layer contains unwanted noise that is transmitted
during a videoconference. Visual and auditory clutter adds unwanted distraction, especially
when participants are in a dynamic working environment. By removing the unwanted
background noise, we are able to increase communication bandwidth and direct participants'
focus toward the foreground activity in a cognitively natural and effective way (Figure 30). This
scenario demonstrates that FocalSpace allows effective video conferencing in flexible
environments.
Figure 30: FocalSpace can effectively diminish the background detritus and direct remote participants' focus to the
foreground information.
48
Saving display space for augmented information
VideoOrbits (Steve Mann, 2002) demonstrates ways to alter the original contents of
an advertisement board and augment digital information on its surface. The project uses
Diminished Reality as an approach to allow additional information to be inserted without
causing the user to experience information overload. On top of the FocalSpace system, we've
implemented several applications by augmenting the diminished reality. Through those
applications, we show how diminished space could be effectively utilized with augmented
contents that have high relevance to the foreground people or objects.
By augmenting participants, we create context to assist in conversation and collaboration.
Participants within FocalSpace can be augmented with rich meta-data about their personae.
These data can include the participant's name, title, email, Twitter account, calendar availability,
shared documents, and total speaking time, among other information. Participants are
augmented with virtual nametags that are spatially associated with their respective participant
(Figure 31).
Figure 31: (Left) Name tag and timer metaphor from a real-life scenario. (Right) Feature in FocalSpace: a participant
with an augmented nametag displaying the user's name and total talk time.
Beyond the talking itself, more dynamic activities in the foreground can be augmented. To make
information sharing flexible no matter where it happens, the system can track the behavior of the
active participants and display the best perspective of knowledge-sharing surfaces to the other
side. Participant augmentation can be activated through voice cue, gesture cue, or manual
selection. In manual mode, participants are presented as clickable objects that can be opened
and closed, bringing them in and out of the active foreground at the remote viewer's discretion.
In addition to augmenting the participants, we present a technique for augmenting objects
whereby the object's or surface's relevant portions become more visibly clear to the remote
participant. Motivated by the general difficulty in expressing visual concepts remotely,
49
FocalSpace provides spatially registered, contextual augmentations of pre-defined objects within
the system.
One such application is for sketching on surfaces. By capturing a video stream of planar drawing
surfaces with an additional camera, we are able to display a virtual representation of the surface
to the remote user shown perpendicularly to the paper itself. We demonstrate sketching on
paper flip charts and a whiteboard. This technique allows remote users to observe planar
surfaces in real time that would otherwise be invisible or illegible.
For example, if the system tracks that the person moves towards the flip chart and starts
drawing, a high-resolution front view of the flipchart video stream will be displayed to the remote
user automatically (Figure 32); sketching on the table can be displayed in a similar way to
remote users (Figure 33).
Figure 32: Contextual augmentation of local planar surfaces. The augmented perspectives are captured through
satellite web cameras. A proximity cue is the trigger to open up the augmented flipchart.
50
Figure 33: Sketching on the planar surface can be detected and augmented for the remote users.
If we go one step further, the digital sketching surface on a tablet or computer can be
augmented and shared without losing context or reference to the conversation as well. Both the
local and remote participants are able to explore visual concepts on a shared whiteboard, which
exists both in the physical space and within the virtual space. At any point, the remote
participant can click the spatially registered display window, bringing it to the GUI layer, and
begin to contribute (Figure 34).
Figure 34: Contextual augmentation of shared digital drawing surfaces. It is spatially registered with its respective
users. The augmented view of the canvas can be enlarged for local participants.
51
Saving Bandwidth for Transmission
Since the diminished background may not be informative for the remote user, the
system can reduce the transmission bandwidth while keeping the foreground information. Figure
35 shows the bandwidth evaluation results comparing FocalSpace with regular
videoconferencing. These video sequences are compressed with the same JPEG 2000 standard at
a rate of 0.05 bit/pixel. A general video conferencing system allocates the
bandwidth uniformly, so the quality of its foreground layer is worse than that of
FocalSpace.
This feature can easily be implemented in FocalSpace by using the region-of-interest (ROI)
rate-allocation features of the JPEG and MPEG standards, e.g., JPEG 2000 and the scalable video
coding extension of H.264/AVC.
Figure 35: Three degrees of blur, with dramatically changed data sizes. The original JPEG 2000 image is 309 KB; it
becomes 126 KB, 109 KB, and 7 KB at compression rates of 1.387828, 1.202384, and 0.079389, respectively.
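As a rough illustration of ROI rate allocation (not the actual encoder configuration), the following sketch splits a fixed overall bit budget between foreground and background pixels using assumed quality weights; a real JPEG 2000 ROI encoder performs this allocation internally.

```cpp
#include <iostream>

// Hypothetical sketch of region-of-interest (ROI) rate allocation: foreground
// pixels receive a much higher bit-per-pixel weight than background pixels.
// The frame size and weights are illustrative assumptions.
int main() {
    const long width = 640, height = 480;
    const long totalPixels = width * height;
    const long foregroundPixels = totalPixels / 5;  // assume ~20% of the frame is foreground
    const long backgroundPixels = totalPixels - foregroundPixels;

    const double totalBudgetBits = totalPixels * 0.05;  // 0.05 bit/pixel overall, as in the text
    const double fgWeight = 8.0, bgWeight = 1.0;         // assumed relative quality weights

    // Split the budget proportionally to weighted pixel counts.
    const double weightSum = fgWeight * foregroundPixels + bgWeight * backgroundPixels;
    const double fgBpp = totalBudgetBits * fgWeight / weightSum;
    const double bgBpp = totalBudgetBits * bgWeight / weightSum;

    std::cout << "uniform allocation:  0.05 bit/pixel everywhere\n";
    std::cout << "ROI foreground rate: " << fgBpp << " bit/pixel\n";
    std::cout << "ROI background rate: " << bgBpp << " bit/pixel\n";
    return 0;
}
```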
Keeping Privacy
As FocalSpace allows flexible control of the layered background and foreground, it can keep
information in the background safe, allowing a more flexible video conferencing environment.
This advantage can be utilized in certain situations, such as isolating irrelevant activities or
background noise from the scene displayed to the remote user.
52
IMPLEMENTATION
"SEEING DEPTH"
This chapter explains the system setup and software development of
FocalSpace.
53
System
Our interaction design focuses heavily on building the conferencing experience around existing
meeting environments. A common meeting-room setup often means a long or round
table with a TV or projected display at the front. By adding depth cameras in front of the display and
extra webcams pointing at parts of the planar surfaces in the meeting room, we can easily
adapt our system to existing rooms.
Setup
We divide the system into a central setup and peripheral devices. The central setup includes a
large display and three depth cameras. The peripheral setup includes other add-on
devices, such as high-resolution satellite webcams pointing at specific locations for
augmented-reality related scenarios.
The central system setup includes a large display and three depth cameras (Microsoft Kinect for
Windows), each with an integrated microphone array. The cameras are placed side by
side on top of the display to ensure good coverage of the meeting room, and participants can
sit around the table facing the display and cameras (Figure 36). If we align the three cameras at 120
degrees to one another, we can capture a seamless view of the space they cover. To
make the capture more flexible, a rotatable frame was built to hold the three cameras; by
adjusting the angle of the frame, different parts of the conference room can be captured.
For the peripheral setup, a number of standard web cameras are placed in the local
environment to capture drawing areas or objects of interest, as best fits the topic of the
conference (Figure 37). Some of these cameras are mounted near a whiteboard in order
to capture a higher-resolution image of the sketching, as well as over table surfaces, where a
lamp-mounted webcam captures paper sketching. Augmented views from the cameras can be triggered by
either automatic detection or manual selection.
54
Figure 36: FocalSpace central configuration. Three depth cameras are arranged so that 180 degrees of the
scene can be captured. A microphone array is integrated into each depth camera. Depth sensing and
human skeleton tracking are enabled by the three depth cameras.
Figure 37: Satellite cameras are pointing to the pre-defined areas, for potential augmentations. For example, the camera
pointing to the flip chart could give a high-resolution augmented view of the contents on the flip chart when someone
moves close to it.
55
Additionally, the local space may incorporate tablet devices, in our case an Apple iPad 2, for use
during collaborative drawing.
Tracking Techniques
We have discussed the different interaction cues that trigger the detection of space layers. The
detection of these cues is based on depth sensing in combination with human skeletal tracking
and physical-marker recognition. We chose human skeletal tracking and physical-marker
tracking over other methods, such as face recognition or background subtraction, as the means
for detecting the active foreground and creating visual effects on top of it, because of their
flexibility in continuously tracking moving bodies and objects with very little
calibration effort.
Because the system is designed for daily use in a flexible meeting environment,
our tracking method needs to work with a quick setup when someone walks into the
meeting room. We chose to implement a simple foreground tracking system that uses depth
cameras to help segment and track foreground elements. The depth camera gives an easy and
robust way to differentiate the foreground from the background: by reading the depth of each
captured pixel, it is easy to recognize and track a human body and physical markers within a
certain 3-D range.
Based on our analysis of foreground activities in the meeting room, the foreground to be tracked
is divided into three categories: human bodies, planar surfaces, and arbitrary objects. For human
body tracking, our implementation adapts the skeletal tracking algorithm supplied in the Microsoft
Kinect for Windows SDK (Kinect, 2012), which can automatically detect a human skeleton
within a certain range based on the grayscale depth image. For planar surface and arbitrary
object tracking, we use physical markers (ARtoolKit, 2012). In combination with the depth data,
we can obtain the 3-D location at which a physical marker is placed and further
segment the surface or artifact to which the marker is attached, using depth
differences and edge detection.
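The following sketch illustrates the depth-range segmentation idea on a toy depth map: once the marker's depth is known, pixels within a band around that depth are assigned to the tagged object. The tolerance and values are illustrative assumptions, not the actual implementation.

```cpp
#include <vector>
#include <cmath>
#include <iostream>

// Hypothetical sketch of marker-plus-depth segmentation: pixels whose depth lies
// within a band around the detected marker's depth are assigned to the tagged object.
int main() {
    const int width = 8, height = 4;
    // Toy depth map in meters; the tagged object sits at roughly 1.5 m.
    std::vector<float> depth = {
        3.0f, 3.0f, 1.5f, 1.5f, 1.6f, 3.1f, 3.0f, 3.0f,
        3.0f, 1.5f, 1.5f, 1.4f, 1.5f, 1.6f, 3.0f, 3.0f,
        3.0f, 3.0f, 1.5f, 1.5f, 1.5f, 3.0f, 3.0f, 3.0f,
        3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f,
    };

    const float markerDepth = 1.5f;  // depth of the detected AR tag
    const float tolerance   = 0.15f; // band around the marker depth

    std::vector<bool> objectMask(depth.size());
    for (size_t i = 0; i < depth.size(); ++i)
        objectMask[i] = std::fabs(depth[i] - markerDepth) <= tolerance;

    // Print the mask: '#' marks pixels segmented as part of the tagged object.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            std::cout << (objectMask[y * width + x] ? '#' : '.');
        std::cout << "\n";
    }
    return 0;
}
```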
56
Figure 38: The segmenting and tracking targets are divided into 3 categories: human body, planar surfaces,
and arbitrary object; Skeletal tracking is used to detect human body, and AR tags are used to detect other
foreground elements.
Image Processing Pipeline
The interactive display is built upon both per-pixel depth sensing and the tracking of
interaction cues. When an interaction cue is detected, it corresponds to a certain foreground
element: for example, when a "hand up" gesture is detected, it corresponds to a particular human
body, and when a proximity cue is detected, it corresponds to a pre-defined planar surface. We take
the audio cue as an example to explain the image processing pipeline (Figure 39).
57
[Figure 39 diagram: depth-camera video frames are matched and tracked across the RGB image space, depth image space, and skeletal space; guided by the sound sensor array, the active skeleton's RGB pixels are subtracted from the scene, the inactive background is rendered with a blur shader, and the two are combined into the FocalSpace output.]
Figure 39: Video conferencing pipeline for "voice-activated refocusing". The depth-camera-segmented depth
and RGB video frames are combined and sent to the on-screen graphics. In combination with spatial audio
tracking, the skeleton along the direction of the detected voice is segmented and put into
central focus through an OpenGL blur texture.
The image processing is driven by two factors: the physical location of the human body
skeleton and the horizontal angle of the voice source. The system continuously tracks the angle of
the voice source through the sound sensor array and tries to find skeletons along the sound angle.
If a matching skeleton is detected, the pixels within the same depth range as the head joint of
the skeleton are copied out into a texture with a transparent background and superimposed
on a computationally blurred layer of the entire scene.
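The compositing step can be sketched as follows: the whole frame is blurred, and the foreground pixels identified by the skeleton/depth match are copied back in sharp. A trivial one-dimensional grayscale buffer and box blur stand in here for the actual OpenGL blur shader; this is an illustration, not the real rendering code.

```cpp
#include <vector>
#include <iostream>

// Hypothetical sketch of the final compositing: blur everything, then keep the
// pixels belonging to the matched (active) skeleton sharp.
std::vector<float> boxBlur(const std::vector<float>& img) {
    std::vector<float> out(img.size());
    for (size_t i = 0; i < img.size(); ++i) {
        float sum = img[i];
        int count = 1;
        if (i > 0)              { sum += img[i - 1]; ++count; }
        if (i + 1 < img.size()) { sum += img[i + 1]; ++count; }
        out[i] = sum / count;
    }
    return out;
}

int main() {
    std::vector<float> frame = {0.1f, 0.9f, 0.2f, 0.8f, 0.3f, 0.7f};
    // Mask of pixels assigned to the active speaker (from the skeleton + depth match).
    std::vector<bool> foreground = {false, false, true, true, false, false};

    std::vector<float> blurred = boxBlur(frame);
    std::vector<float> composite = blurred;
    for (size_t i = 0; i < frame.size(); ++i)
        if (foreground[i]) composite[i] = frame[i];  // keep foreground sharp

    for (float v : composite) std::cout << v << " ";
    std::cout << "\n";
    return 0;
}
```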
User Interface
The front-end user interface was implemented in openFrameworks (Frameworks, 2012)
(Figure 40).
The active viewport is the main window for the teleconference and communication. This window
shows the focus effect for the active foreground and replaces the classic video chat window
with the enhanced FocalSpace view.
58
Further, a suite of system controls allows the remote user to toggle and operate a number of
highly granular effect filters that affect the active viewport rendering. While the default
Diminished Reality effect is to blur out the background, a "semi-transparent mask" or "black
out" can be selected as the background visual effect as well. The controls include three effect sliders
and a switch for automatic/manual foreground focus.
Finally, the augmented high-resolution views of the planar surfaces can be toggled through this
interface.
Figure 40: The front-end user interface. The slider bar can be auto-hidden.
59
USER PERSPECTIVES
This chapter introduces the feedback received and conclusion drawn from the
user evaluation.
60
We showcased two generations of FocalSpace at two four-day demo events. At the first event,
we used one depth camera and one microphone array; at the second event, we extended our
equipment to three depth cameras and three microphone arrays to capture a larger area.
Beyond the showcases, we selected two features, "directing remote participants' attention" and
"collaborative sketching", and conducted a user test comparing FocalSpace with
Skype.
We discuss the user feedback we obtained in the following sections.
Feedback from the Showcase
We showcased the FocalSpace prototype at 2 four-day demo events to an audience of about
200 people. After testing the responsiveness of the system and the friendliness of the user
interface, we gained valuable feedback.
Occasionally, the gesture cue would fail when another user was standing or moving between the
cameras. This motivated us to optimize our gesture detection algorithm and to identify
alternative cues for achieving the same effect. For example, in order to get attentive focus, a
participant could show a marked physical token instead of raising a hand, similar to the real-world scenario of bidding.
We found that the effectiveness of one of the use cases, "directing focus", depends highly on the
surrounding environment. During the four-day demo periods, when at least six
people were surrounding the system, users showed greater interest in the synthetic blur
effect for the background noise than in the later lab test.
We also noted that when FocalSpace is employed, user behavior can be both designed and
predicted by the state of the application (Figure 41). For example, when a user is placed in the
active foreground, his or her attentiveness increases and the likelihood of becoming distracted or
disinterested decreases. We noted that collocated users are more likely to respect those currently speaking, as
the implication of interrupting or 'talking over' their colleagues is now accompanied with a
dramatic visual impact. We are interested in studying the long-term effects and potential
behavioral implications uncovered by a system where users are more aware and cognizant of
their contributions.
61
Figure 41: FocalSpace during the showcases.
User test conducted at the lab
Goal
We conducted an experiment to assess the effectiveness of Diminished Reality in helping listeners
focus on foreground information during a videoconference.
Hypothesis 1:
The visual effect of "Diminishing Reality" improves effectiveness and accuracy for the absorption
of information from remote speakers. The advantage increases if there are more people in the
remote location, if the meeting lasts longer, or if the communication is in a narrative mode.
Hypothesis 2:
If the positive area selection (system following the attention) for diminishing reality is
synchronized with voice-activated refocusing (system directing the attention), listeners will have
more flexibility in controlling the environment and will improve initiatives in information
capturing.
Conditions
To examine hypothesis H1, we compared two conditions: clear display and voice activated
diminishing display.
To examine hypothesis H2, we compared two conditions: voice activated diminishing, and voice
activation plus eye gaze controlled diminishing display.
Setup
The basic system setup included a large display and three Microsoft Kinect for Windows cameras
placed side by side in front of the display. Participants could sit facing the display and
62
cameras. We placed the depth cameras so that 145 degrees of the continuous
conference table in front of the display could be captured on the screen. The angle of the sound
source could be detected within the same 145 degrees.
[Figure 42 diagram: a desktop computer plays pre-recorded video on a 26-inch display, with Kinect cameras and an eye-tracking device facing the subject.]
Figure 42: FocalSpace setup for user test.
Method
We recruited 25 volunteer participants for the experiment. The setup requires
a 26-inch screen, an eye-tracking device, a desktop computer, and three Kinect cameras. Figure 42
shows the basic configuration of the devices.
Following the setup of the FocalSpace system, the participants were asked to sit in front of
the 26-inch screen for three different video sessions. The first used a display without any
special visual effect; the second used a display with voice-activated refocusing effects to direct
listeners' attention; and the third used a display with eye-gaze-directed refocusing to follow
listeners' attention. All displays were driven by pre-recorded videos.
In the pre-recorded video, there are six remote speakers describing a scenario of a marketing
report. They also use graphics and posters for in-depth illustrations. We made the pre-recorded
video before starting the experiment in the lab.
Each time, the following conditions applied: (a) the same ambient environment, (b) an equal
amount of information communicated over a distance, (c) the same number of remote
participants, and (d) the same display and voice quality.
Throughout the user test, we used both quantitative and qualitative measurements. Quantitative
measurement compared effectiveness by (a) counting the portion of the information
63
captured by the subjects out of the total and (b) measuring the accuracy with which they
attributed information back to the speaker. Qualitative measurement concerned the problems
subjects normally experience when they perform remote communication; we asked them to
compare and comment on both interfaces.
Steps
Based on the hypotheses above and the system setup, we divided the experiment into two
parts, each designed to test one of the hypotheses.
1) To test hypothesis one, that a system with the diminishing setup is more effective
at information delivery than a normal one, we recruited 30 participants for the first part. All 30
subjects watched two session videos. After finishing each one, the subject
was asked to complete a questionnaire (with two open questions), and we compared
the accuracy and correctness of the two answer sheets. The participants could complete the
sessions at the same time in the same conference room.
2) To test hypothesis two, which adds positive area selection, we recruited another
10 participants. Each subject watched the two session videos individually. The
videos were the same as before; however, this group had the
opportunity to use a mouse to perform positive area selection (Table 2).
64
                        Length   Refocus effect   Subjects   Interactivity   Questionnaire                Interview & feedback
Hypothesis 1
  Session 1             15 m     no               30         no              yes                          no
  Session 2: Plan A     15 m     yes              30         no              yes (with open questions)    no
Hypothesis 2
  Session 1             15 m     no               10         no              yes                          no
  Session 2: Plan B     15 m     yes              10         yes             yes                          yes
Table 2: Session 1: 15 minutes, play the video of a scene with 6 people doing video conferencing, followed by a
questionnaire. 20 subjects in total were invited to this session. Session 2: 15 minutes, play the video of a scene with 6
people doing video conferencing, with the visual effect of voice-activated refocusing. The test was followed by a
questionnaire. 20 subjects in total were invited for this session.
Findings
Focus and Attention
All the participants agreed that when compared to the video conferencing display without
additive visual effects, our system provided an enhanced sensation of focus. One participant
mentioned: "When we are physically together I know where the sound is coming from and I can
turn my head to find the sound source; but without the blur effect, sometimes it's hard for me to
understand who is speaking, especially when there are a lot of people or the internet quality is
bad".
(1) Objects, participants, and space as part of the foreground: We found that most users'
concept of foreground information includes not only active participants, but also tools, artifacts,
and the space surrounding the activity.
22 out of 30 participants expressed a strong interest in the manual selection feature. For example,
one participant mentioned that if remote participants were talking about a physical object, she
would find the manual selection very useful when she wanted to "get closer" and see what
people are discussing, such as a physical model or prototype. She explained that not only the
people, but also the physical objects that are related to the discussion are of interest to her.
Another participant mentioned that if the remote group began to draw on the white board, he
65
would like to focus his attention onto the white board instead of the participants themselves; in
which case, the white board could be manually selected to display the augmented viewport.
(2) Peripheral Awareness: During the normal video conferencing test, 5 out of 6 non-native
English speakers found it difficult to remember the remote collaborator's name, while 2
out of 6 native speakers had the same issue. While using FocalSpace, people found the
augmented nametag, which displays the participant's name, very useful for assisting
communication and name retention.
While testing out manual focus, one participant explained that it was nice to give him a tool to
select different people and review their names.
Manual Selection of Positive Area
Through the comparison of FocalSpace with and without manual selection of positive area, we
found that in terms of efficiency and effectiveness, our system with positive area selection had
certain advantages (Table 3).
[Table 3 chart: user-reported ratings on a 1-5 scale (1 = much worse, 5 = much better) for ease of focus, speaker awareness, collaborative efficiency, collaborative effectiveness, and distraction prevention.]
Table 3: User-reported results for FocalSpace as compared to commonly used videoconference software.
The augmented perspective of the planar surfaces on the remote display gave users a real-time
rendering of, and access to, remote information beyond the talking heads. We noticed that the
participants gained a faster and more accurate understanding of the overall topic in this case.
Participants found the augmentation of shared drawing surfaces useful, noting that location
awareness was important.
User Perception
In general, users indicated that depending on their need for teleconferencing, be it work related
or personal, the system should adapt the interface accordingly. One user noted, "For talking with
friends or family, there is too much user interface and too many features. I would only be
66
interested in focus." Understanding this, we believe that future implementations of FocalSpace
could utilize adaptive rendering and graceful degradation depending on intended use.
67
EXTENDED APPLICATIONS
EXTENSION TO POST-CONFERENCE AND PRIVATE VIEW
In this chapter, we talk about how the tracking capability of FocalSpace can be
utilized to embed gesture recognition and smart tags in live and recorded video
conferences.
68
Record and Review
Currently, FocalSpace captures and displays in real time but discards the data once a session
has concluded. We believe that by tracking and storing the rich semantic information collected from
both the sensors and user activity, we can begin to build a new type of search-and-review
interface. With such an interface, hundreds of hours of conversation could easily be categorized
and filtered by participant, topic, object, or specific types of interaction. Recalling previous
teleconference sessions would allow users to adjust their focal points by manually selecting
different active foregrounds, which would enable them to review different perspectives on past
events.
Most video conferencing systems have recording capability, but without smart indexing the
large amount of video data is hard to navigate and review. As our system can dynamically
detect gestures, the human body, voice, and the environment, we propose a
series of ideas for smart indexing (Figure 43).
Figure 43: Switching between Live Mode and Review Mode
Mining Gestures for Navigating Video Archives
We addressed gesture detection and tracking in real-time video conferencing in an earlier
chapter. Here, however, the idea is to mine gestures to help users navigate archives. The
system provides passive sensing for new users and also gives long-term users ways to tag
the archive while they move naturally.
We explore and build up a gestural library. The design of the library is based on the social and
behavioral norms of workplace meetings; cultural factors are considered as well. Some
examples of potential gestures in videoconferencing are shown below (Table 4). In front
of depth cameras, all of these gestures can be defined and detected easily.
69
Table 4: Gesture examples with different intentions.
To prove this concept, we implemented the "thumbs up" gesture as a symbol of a good idea
(Figure 44). Whenever a participant hears a good idea or comment that he wants to tag in the
video, he can simply raise his right arm with a thumbs-up. The system notifies the participant
through a graphic hint once the gesture is detected. Later, when the recorded session is reviewed,
all the moments at which the gesture was detected are marked with a "thumb" icon along the
timeline.
Figure 44: "Good idea" gesture index. (Left) Gesture was detected in live mode. (Right) In the replay mode,
thumb icons show up along the timeline at the moment when the gesture was detected. Viewers can click the
thumb icon and revisit the moment when the corresponding gesture occurred.
70
Voice Index
Utilizing the capability of the system to identify the active foreground, we could also use the active
person, a physical object, or the environment as a smart index for later review. For example,
the system could detect and remember when a certain person started talking and embed
different "talking head" marks along the timeline (Figure 45). This feature enables reviewers
to revisit the moments when a certain speaker started to talk by clicking the "talking head" icon
along the timeline or by clicking the talking head directly in the main window.
Figure 45: "Talking head" mark. By clicking the "talking head" icon along the timeline, or by clicking the talking
head directly in the main window, reviewers could revisit the moment when a certain selected speaker started
to talk.
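Both the gesture and voice marks can be seen as instances of one simple index structure: timestamped, typed events attached to participants, filtered at review time to place icons along the timeline. The sketch below is an illustrative assumption of such a structure, not the actual implementation.

```cpp
#include <string>
#include <vector>
#include <iostream>

// Hypothetical sketch of a smart index shared by the gesture and voice marks:
// each detected cue is stored as a timestamped event, and the review interface
// filters events by type to place icons along the timeline.
struct IndexEvent {
    double timeSec;           // position in the recorded session
    std::string type;         // e.g. "thumbs_up", "speaker_start"
    std::string participant;  // who produced the cue
};

std::vector<IndexEvent> eventsOfType(const std::vector<IndexEvent>& index,
                                     const std::string& type) {
    std::vector<IndexEvent> out;
    for (const auto& e : index)
        if (e.type == type) out.push_back(e);
    return out;
}

int main() {
    std::vector<IndexEvent> sessionIndex = {
        {  42.0, "speaker_start", "Alice"},
        { 130.5, "thumbs_up",     "Bob"},
        { 301.2, "speaker_start", "Bob"},
        { 315.8, "thumbs_up",     "Alice"},
    };

    // The review mode would draw a thumb icon at each of these timestamps.
    for (const auto& e : eventsOfType(sessionIndex, "thumbs_up"))
        std::cout << "thumb icon at " << e.timeSec << " s (" << e.participant << ")\n";
    return 0;
}
```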
Portable Private View
By adding a private view on a mobile phone, tablet, or personal laptop, we aim to give
attendees more flexible access to the remote information. Through the private view, different
71
participants can get their own unique focus and start to explore the relevant digital augmentations
associated with the focus they selected:
* To pay more attention to the elements that are interesting to the particular user, rather than
following the automatic focus of the global view (Figure 46).
* To access augmented information (Figure 47).
* To carry the view across devices. For example, if a participant wants to step away from the table
temporarily during a videoconference on his desktop, he can easily copy the same view
from the computer to his mobile device.
Figure 46: Focus on a certain person even if he is not the current speaker, access his related digital
information, and open a private message channel
Figure 47: Focus on a certain active area, in this case the flipchart, and access the historical image archive of
the flipchart through a clickable interface.
72
EXTENDED DOMAIN
" YOU CAN DO MORE!"
In this chapter, we talk about how the design guideline of "improving focus
through reality" could be utilized in other application domains.
73
We believe that "DR+AR" as a design approach can be used in many fields. It is an effective
way to filter information and computationally mediate humans' perception of the real world.
As conceptual ideas, both FocalCockpit and FocalStadium have been explored. Both projects were
conducted with commercial companies, which indicates that our design framework can be put
into practice.
FocalCockpit
The purpose of FocalCockpit is to augment the driving experience while taking safety and
efficiency into consideration. This proposal is based on the increasing amount of accessible
information on the road. Such information includes real-time traffic conditions, facilities on the
road, inter-communication channels between vehicles, and so on.
To date, we have built mock-ups to explore the interaction concepts without real-time accessible
data. We describe some of the scenarios being discussed: for example, we envision drivers
having a fog-diminished view on a rainy day, or a sky-view map augmenting their
current location (similar to the distorted perspective of the road in the movie Inception). Please
refer to the following figures for some of the scenarios we have proposed (Figures 48, 49, 50,
51).
Figure 48: We envision drivers getting a fog-diminished view on a rainy day.
74
Figure 49: The real-time updating "bird's-eye view" map. The designer was inspired by "Inception".
Figure 50: Chatting on the road, a scenario of "car-to-car communication". During heavy traffic, drivers could
communicate with friends detected on the same road.
Figure 51: Finding parking spots: the display on the front window could highlight potential available parking spots in real
time.
75
FocalStadium
The FocalStadium concept has been developed based on the observation that activities in a
stadium are normally tracked through multiple channels: cameras from different viewing angles,
handheld cameras from judges and players, microphones on judges and players, coded markers on the
players' clothes, and so forth. This diversity of video, audio, sensory, and locational information makes
it possible to apply our DR+AR approach.
When people watch a sports game, they typically see it from a fixed angle.
On one hand, this makes it easy to get an overview of the game; on the other hand, they
cannot watch the game from other perspectives, focus on and get a clearer view of a specific
player, or go back in time to review an exciting moment. To enrich audiences' experience, we
mocked up a mobile application following the guideline of "improving focus through diminished
reality". In the example shown below, the starting point of an active action is chosen as the trigger
of focus (Figure 52). For example, in a 100-meter run, the system wants to focus on the
individual who is leading the race, while in the case of cricket, the system tends to focus on the
thrower and picker.
Figure 52: Focus on the currently most active player; the focus can be manually selected as well. Users can
access the relevant digital information on a foreground player, revisit the moment when he made an excellent
hit, or review the same action from a different perspective.
76
Performance-Activated Refocusing: What the focus should be is always the first question we ask
when we put the "DR+AR" concept into practice. In the earlier video
conferencing application, the current speaker should get the most attention, so voice activation was
chosen as the trigger for the focal point. Due to the inherently competitive nature of sports,
audience members care about the players who are giving the best or most active performance.
In this case, the starting point of an active action is chosen as the trigger of focus. For example, in a
100-meter run, the system wants to focus on the one who is leading the race, while in the case of
cricket, the system tends to focus on the thrower and other foreground players.
The same design concept can be extended to many sports categories. One approach to the
technical solution is to add bright patterns to the players' clothing. We envision the same design
approach being used for both the real-time stadium experience and live TV. When
audience members are sitting in front of the TV screen, the same effect can be achieved when
they point their mobile devices toward the screen. Potentially, this is another means of
communication through mobile devices and a way to interact with television.
77
CONCLUSION
This chapter outlines the future work and conclusion of the thesis project.
78
Professor Ramesh Raskar, the director of the Camera Culture group at the MIT Media Lab,
described his vision of "super-human vision" in an interview. He said that his interest lies in
creating "super-human abilities to visually interact with the world - with cameras that can
see the unseen, and displays that can sense the alteration of reality". His inspiring aim of "creating a
unique imaging platform that has an understanding of the world that far exceeds human ability,
that we can meaningfully abstract and synthesize something that's well within human
comprehensibility" is what our system, FocalSpace, is trying to achieve on an abstract level.
Camera Man
"Camera Man" is our metaphor for FocalSpace (Figure 53). We've built up a system as a
computational cameraman, who could set up spotlight, shooting assistant, perspectives, autofocus in real time.
Figure 53: The "cameraman" metaphor for FocalSpace. Just like a cameraman who can set up spotlights, assist the
shot, choose perspectives, and auto-focus in real time, FocalSpace is a system that can automatically display information
from the right perspective, assist communication, and give the spotlight to the central focus.
79
Future Work
Specific Tasks
As one of the next steps, we want to find out how FocalSpace can be utilized for more specific
tasks. FocalSpace has been designed as a flexible system for remote communication. As such,
we are also interested in investigating its implications as a support system for remote tasks with
more specific purposes, such as long-form presentations and education. We believe that the study
of embodied gesture and movement for presentation, and the augmentation of context- and
state-aware storytelling, may lead to interesting new discoveries in remote communication.
Cloud-based solution for FocalSpace
Even though FocalSpace utilizes the layered space framework, each system is currently connected
directly to all other systems. However, FocalSpace is compatible with a cloud-based solution, because
the layered space framework enables the platform to easily reconstruct individual content from a
single transmitted source. To produce this individual content, the central server in the cloud
applies simple "copy-and-paste" processing, so the total distribution complexity and bandwidth
are reduced dramatically. Furthermore, moving some of the processing to the cloud
server also benefits mobile device users. As additional future work, we plan to store the
data on a cloud server and allow users to log on and analyze/access these data.
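The following sketch illustrates the intended "copy-and-paste" idea: a single layered source is uploaded once, and the server composes each client's view by reusing the shared background and the requested foreground segment. All names and data here are illustrative assumptions, not the actual implementation.

```cpp
#include <string>
#include <vector>
#include <iostream>

// Hypothetical sketch of the cloud idea: one layered source (foreground
// segments plus one diminished background) is uploaded once, and the server
// composes an individual view per client by simple copying rather than
// re-encoding a separate stream for every receiver.
struct LayeredFrame {
    std::string background;                // diminished (blurred) background, sent once
    std::vector<std::string> foregrounds;  // per-element foreground segments
};

std::string composeForClient(const LayeredFrame& src, size_t focusedElement) {
    // "Copy-and-paste": reuse the shared background and the requested segment.
    std::string view = src.background;
    if (focusedElement < src.foregrounds.size())
        view += " + " + src.foregrounds[focusedElement] + " (sharp)";
    return view;
}

int main() {
    LayeredFrame frame = {"blurred room", {"Alice", "Bob", "flip chart"}};
    // Two clients with different focal choices are served from the same source data.
    std::cout << "client 1: " << composeForClient(frame, 0) << "\n";
    std::cout << "client 2: " << composeForClient(frame, 2) << "\n";
    return 0;
}
```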
Conclusion
We have adopted an extension of the well-known depth-of-field effect that allows elements of
the video conference to be blurred depending on relevant interaction cues rather than on their
distance from the camera. By using depth sensors and various cues such as voice, gestures,
and physical markers, FocalSpace constructs the "Layered Space Model", identifies the foreground
of the video, and subsequently blurs out the background. The technique makes it possible to
diminish the background noise and present the relevant foreground elements to remote users in
high quality, without consuming large amounts of bandwidth.
Because of the similarity to the familiar depth-of-field effect, which is an intrinsic property of the
human eye, we believe our approach to Diminished Reality is a natural
metaphor for transmitting remote video streams and can be used quite effortlessly by most
users. Our preliminary user study also supports this assumption.
FocalSpace can be used not only for cleaning up unwanted noise but also for directing remote
participants' focus, saving transmission bandwidth, preserving privacy, and saving display
space for additional information such as augmented reality content.
80
We believe that FocalSpace contributes to video conferencing by observing the space we inhabit
as a richly layered, semantic object through depth sensing. While much remains to be done, our
work offers a glimpse at how emerging depth cameras can enable a flexible video conferencing
setup, which brings semantic cues and interpretation of the 3D space to remote communication.
Throughout the thesis, we have conveyed our belief in "less is more". We carefully choose the ways in
which we alter humans' perception of reality in order to enhance, improve, or entertain real-life experiences.
Technology enables a variety of tracking and computational reconstruction; however, it is the
designers who define how these accessible technologies serve real life. To achieve a good
design, we have to look deeply into humans' conscious and subconscious behaviors. In all the use
cases described in the thesis, such as conferencing, driving, and watching sports games,
FocalSpace is very sensitive to participants' attention and focus, as all the interactive features
are built upon tracking and understanding users' intentions and attention. For example, listeners
care not only about the current speaker, but also about the people whom the current speaker is
watching or referring to. Paul Dourish and Sara Bly noted that awareness in the workspace
"involves knowing who is "around", what activities are occurring, who is talking with whom;
it provides a view of one another in the daily work environments" (Paul Dourish, 1992). A
better understanding of target users' attention and focus is what we will pursue in the long term.
81
BIBLIOGRAPHY
Zhang, C. R. (2006). Lightweight background blurring for video conferencing applications. ICIP
2006 (pp. 481-484). ACM.
Wooding, D. (2002). Fixation maps: Quantifying eye-movement traces. ETRA '02. New York:
ACM.
youKu. (2012). youKu. Retrieved April 2, 2012, from http://youku.com
Vertegaal, R. (1999). The GAZE Groupware System: Mediating Joint Attention in Multiparty
Communication and Collaboration. Proc. of CHI '99 Conference on Human Factors in Computing
Systems. ACM.
AugmentedReality. (2012). Augmented reality. Retrieved 2012, from
http://en.wikipedia.org/wiki/Augmented_reality
Azuma, R. (2004). Overview of augmented reality. SIGGRAPH '04. New York: ACM.
Azam Khan, J. K. (2005). Spotlight: directing users' attention on large displays. CHI'05 (pp. 791-798). New York: ACM.
ART+COM. (2012). ART+COM. Retrieved 2012, from http://artcom.de
Artvertiser. (2012). Artvertiser. Retrieved April 3, 2012, from http://theartvertiser.com/
ARtoolKit. (2012). ARtoolKit. Retrieved April 8, 2012, from
http://handheldar.icg.tugraz.at/artoolkitplus.php
Cisco. (2012). WebEx. Retrieved from http://www.webex.com/
Chris Woebken, K. O. (2012). Animal Superpower. Retrieved 2012, from
http://chriswoebken.com/animalsuperpowers.html
E.A. Bier, M. C. (1993). Toolglass and magic lenses: The see-through interface. SIGGRAPH'93
(pp. 73-80). ACM.
David Holman, R.V. (2004). Attentive display: paintings as attentive user interfaces. CHI EA '04
(pp. 1127-1130). New York: ACM.
DDB, T. (2012). Retrieved May 1, 2012, from Tribal DDB: www.tribalddb.nl
Doug DeCarlo, A. S. (2002). Stylization and abstraction of photographs. 29th annual conference
on Computer Graphics and Interactive Techniques (SIGGRAPH '02) (pp. 769-776). New York: ACM.
Furnas, G. W. (1986). Generalized fisheye views. ACM Conference on Human Factors in
Computing Systems, SIGCHI (pp. 16-23). New York: ACM.
openFrameworks. (2012). Retrieved January 2, 2012, from
http://openframeworks.cc
H. Ishii, M. K. (1994, August). Iterative Design of Seamless Collaboration Media.
Communications of the ACM, 37 (8), 83-97.
Kinect. (2012). Kinect for Windows SDK. Retrieved March 1, 2012, from
http://www.microsoft.com/en-us/kinectforwindows/
Lieberman, J. (2012). Moore Pattern. Retrieved 2012, from http://bea.st/sight/moorePattern/
Loschky, L. M. (2005). How late can you update? Detecting blur and transients in gaze-contingent multi-resolutional displays. Human Factors and Ergonomics Society 49th Annual
Meeting (pp. 1527-1530). Santa Monica.
Munzner, T. (1998). Drawing large graphs with H3Viewer and Site Manager. Graph Drawing'98
(pp. 384-393). Springer.
M. C. Stone, K. F. (1994). The movable filter as a user interface tool. ACM CHI'94 (pp. 306-312).
ACM.
M. Sarkar, S. S. (1993). Stretching the rubber sheet: A metaphor for visualizing large layouts on
small screens. ACM Symposium on User Interface Software and Technology (pp. 81-91). ACM.
Sarkar, M., & Brown, M. H. (1994). Graphical fisheye views. Communications of the ACM, 37 (12), 73-83.
Morikawa, O., & Maesako, T. (1998). HyperMirror: Toward Pleasant-to-use Video Mediated Communication
System. CSCW (pp. 149-158). ACM.
Okada, K. M. (1994). Multiparty Videoconferencing at Virtual Social Distance: MAJIC Design.
CSCW'94 (pp. 385-393). ACM.
Paul Dourish, S. B. (1992). Portholes: supporting awareness in a distributed work group. SIGCHI
conference on Human factors in computing systems (CHI '92) (pp. 541-547). New York: ACM.
Scalado. (2012). Scalado. Retrieved 2012, from http://www.scalado.com/display/en/Hom
Skype. (2012). Retrieved from Skype: www.skype.com
o. w. sound. (2012). Retrieved April 20, 2012, from http://ows.clients.vellance.net/ows/
Steve Mann, J. F. (2002). EyeTap devices for augmented, deliberately diminished, or otherwise
altered visual perception of rigid planar patches of real-world scenes. Presence: Teleoperators
and Virtual Environments, 11 (2), 158-175.
Stefan Agamanolis, A. W. (1997). Reflection of Presence: Toward More Natural and Responsive
Telecollaboration. SPIE Multimedia Networks, 3228A.
Stratton, G. M. (1896). Some preliminary experiments on vision without inversion of the retinal
image. Psychological Review, 3 (6), 611-7.
Rainer Stiefelhagen, J. Y. (2001). Estimating focus of attention based on gaze and sound. PUI
'01 (pp. 1-9). New York: ACM.
Ramesh Raskar, G. W. (1998). The Office of the Future: A Unified Approach to Image-Based
Modeling and Spatially Immersive Displays. SIGGRAPH 1998. Orlando: ACM.
Ramesh Raskar, K.-H. T. (2004). Non-photorealistic camera: depth edge detection and stylized
rendering using multi-flash imaging. In J. Marks (Ed.), ACM SIGGRAPH 2004 Papers (SIGGRAPH '04) (pp. 679-688). New York, NY, USA: ACM.
Reingold, E. L. (2002). Gaze-contingent multi-resolutional displays: An integrative review. Human
Factors.
Robert C. Edson, D. M. (1971). Patent No. 3601530. US.
Robert Kosara, S. M. (2001). Semantic Depth of Field. IEEE Symposium on Information
Visualization 2001 (INFOVIS'01). Washington: IEEE Computer Society.
Tracy Jenkin, J. M. (2005). eyeView: focus+context views for large group video conferences. CHI
'05 extended abstracts on Human factors in computing systems (CHI EA '05) (pp. 1497-1500).
New York, NY, USA: ACM.