The Long-Term Evaluation of Fisherman
in a Partial-Attention Environment

Xiaobin Shen
University of Melbourne
xrshen@unimelb.edu.au

Andrew Vande Moere
University of Sydney
andrew@arch.usyd.edu.au

Peter Eades
National ICT Australia
University of Sydney
peter@it.usyd.edu.au

Seokhee Hong
University of Sydney
seokhee.hong@usyd.edu.au

ABSTRACT
Ambient display is a specific subfield of information visualization that uses only the partial visual and cognitive attention of its users. Conducting an evaluation while drawing only partial user attention is a challenging problem: many standard information visualization evaluation methods, which assume full attention, may not suit the evaluation of ambient displays.

Inspired by concepts in the social and behavioral sciences, we categorize the evaluation of ambient displays into two methodologies: intrusive and non-intrusive. The major difference between these two approaches is the level of user involvement, as an intrusive evaluation requires higher user involvement than a non-intrusive evaluation.

Based on our long-term (5-month) non-intrusive evaluation of Fisherman presented in [16], this paper provides a detailed discussion of the actual technical and experimental setup for unobtrusively measuring user gaze over a long period using a face-tracking camera and IR sensors. In addition, this paper demonstrates a solution to the ethical problem of using video cameras to collect data in a semi-public place. Finally, a quantitative measure of "interest" is proposed and three remarks are drawn from the results.

Keywords
Ambient displays, intrusive evaluation, information visualization, human computer interaction

1. INTRODUCTION
Ambient displays to some extent originate from the ubiquitous computing ideal, which was first proposed by Weiser [18]. He stated that "the most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it". Currently, many terms describe work in this general research direction, such as disappearing computing [3], tangible computing [8], pervasive computing [6], peripheral display [11], ambient display [10], informative art [4], notification system [12] or ambient information system [14]. The qualitative differences between some of these terms are not immediately obvious, although some subtle disparities might exist. In this paper, we use the term "ambient display" generically for the subfield of visualization research that conveys information through the periphery of user attention. This paper treats ambient display as a specific type of information visualization characterized by two design principles: attention and aesthetics. Information visualization demands full attention (i.e. users explore, zoom and select information mainly in the primary focus of their attention [18]), while an ambient display only requires partial attention, so that human attention can also be committed to other tasks at hand. In addition, aesthetics is a secondary consideration in the design of most information visualization applications (versus the focus on effectiveness and functional design [9]). Aesthetics is a key issue in the development of ambient displays, whose aims are to be visually unobtrusive in the architectural space, to draw user interest by way of curiosity and ambiguity, and to encourage comprehension by providing a positive user experience.

Several ambient display applications [8, 11, 12, 15, 2] have already been designed and developed, but relatively little progress has been made in determining appropriate evaluation strategies. As effective evaluation methods aim to measure and improve a display's performance, we believe that further research in evaluation methodologies should be a priority. In particular, it is still an open question whether the user interest in, and comprehensibility of, an ambient display alters over time, especially when its initial novelty effect wears off.

Following on from our previous paper [16], this paper addresses more technical and practical aspects of the question "How can we conduct an evaluation of an information display system for partial user attention?" More specifically, we further discuss the differences between intrusive and non-intrusive evaluation from the participants' points of view; we detail the actual experimental setup used to collect meaningful usage data with a face-tracking camera and IR sensors; we present a solution to the ethical problem of a video camera offending personal privacy; and we draw new conclusions based on these new results.

2. INTRUSIVE AND NON-INTRUSIVE EVALUATION
Our previous paper [16] proposed two evaluation styles: intrusive and non-intrusive. In an intrusive evaluation the user's normal behavior is consciously "disrupted" by the evaluation experiment, whereas in a non-intrusive evaluation it is not. A brief discussion of the differences from the participants' point of view follows.
An intrusive evaluation works with a predefined number of participants in each experimental setup, while, because of its unobtrusive, real-world setting, a non-intrusive evaluation cannot predict the size of its participant cohort. Many intrusive evaluation experiments only require a small number of participants to reveal significant results. For a non-intrusive evaluation, a larger pool of participants is typically required; in our case we consider all the potential participants passing by an ambient display in a semi-public environment. We believe a well-designed long-term non-intrusive evaluation should have a sufficient number of participants to reveal significant results, in comparison to an intrusive evaluation.

Furthermore, for an intrusive evaluation it may be difficult to choose participants with various backgrounds, while a non-intrusive study has fewer limitations in this respect, because it counts every user passing by the display as a potential evaluation subject. In some circumstances this can lead to a broad variety of participants from various backgrounds, which can improve the validity of experimental results. A non-intrusive evaluation also aims to draw less user awareness than an intrusive evaluation. This leads to important privacy issues for non-intrusive evaluations; these issues are quite different from the ethical issues in intrusive studies.

Finally, an intrusive evaluation normally imposes a higher cognitive load on the participant, due to the required high level of user involvement, high short-term memory load and typical psycho-physiological stress. This can affect the validity of results [1]. In contrast, a non-intrusive evaluation can potentially produce better results because of its lower cognitive load, as participants are not consciously aware that they are being observed.
3. NON-INTRUSIVE EVALUATION OF FISHERMAN
In this section, a detailed discussion of the non-intrusive evaluation of Fisherman is presented, partly based on our previous paper [16]. A quantitative measure of "interest" is proposed that combines the total number of participants passing by the display and the total number of participants looking at the display (we assume that the more interest the display holds, the more participants will look at it). Several remarks are made in this section.

The general aim of this work is to discover the relationship between comprehension of, and interest in, the ambient display over time. We hypothesize that comprehension of and subject interest in Fisherman increase with time, on the basis that participants do not keep up their interest in the display unless they understand it.

3.1 The experiment
The actual experimental environment and detailed settings are shown in Figure 1. Specifically, the display was located in a purpose-built frame, which also enclosed an infrared sensor and a video camera¹. The Fisherman metaphor was described on an A4-size sheet of paper mounted on the frame, for all passers-by to see (see Figure 1).

¹ Since the system included a sensor and camera in a semi-public place, legal opinion was obtained to ensure that the system complied with privacy legislation.

Figure 1. Implementation of Fisherman

The frame was placed in a semi-public area in a research institute. As the actual users of the display are mainly academic researchers, many have some knowledge of the general idea of ambient displays, although none are experts. The study lasted five months, from September 2005 to January 2006. One should note that every single person passing by Fisherman was a subject of this experiment, as it was based on the assumption that an effective evaluation study of a publicly accessible ambient display should be derived from actual "use" of the display in a real-life environment.

Three questionnaires were scheduled within the five-month time span. The first questionnaire was scheduled in the middle of the second month (17-19 Oct., 2005); the second at the end of the fourth month (19-20 Dec., 2005); and the last at the end of the final month (26-27 Jan., 2006). About 30 researchers and students regularly passed by Fisherman to enter or leave their offices. Participants for each questionnaire were randomly chosen from this group of frequent users. Participants were not allowed to look at the display while answering questions.

Each questionnaire was designed to measure three attributes: the comprehension, usefulness and aesthetics of Fisherman. Our comprehension questions [16] were: CQ1: Does Fisherman convey any information? CQ2: How many types of information are represented in Fisherman? CQ3: What kind of information does Fisherman represent? CQ4: Have you ever noticed changes in Fisherman? The usefulness questions were: UQ1: Is Fisherman useful to you? UQ2: Why? The aesthetic questions were: AQ1: Do you think Fisherman is visually appealing? AQ2: If possible, would you put Fisherman in your home/office? AQ3: Why?

Because of its unobtrusive setting in an everyday environment, privacy issues become very important. Australia, the state of New South Wales, and local authorities have a number of relevant laws (see, for example, [5]), and institutional and professional guidelines also need to be respected. It is clear from the legislation that cameras mounted inside a building can only be used for security purposes, and 11 additional principles guide actual video capture, recording, access and storage. Here, we used a camera and an IR sensor to collect data on user gaze events.

To resolve these ethical issues, we used a modified face detection and recognition program based on Intel OpenCV [7], so that our system only recorded the number of faces (rather than storing the actual image or video files). A log file was saved to the hard disk on every day of the experiment; each file lists only two pieces of information: the face detection time and the number of faces detected. This modified OpenCV program ran continuously for 5 months, with regular weekly checks to ensure accuracy.
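To make this setup concrete, the following is a minimal sketch of such a face-count logger written against the modern OpenCV C++ API; the original study used the earlier Intel OpenCV release, so this is an illustration rather than the original program, and the cascade file, camera index and log-file naming are assumptions.

    // Face-count logger sketch: detect frontal faces in the camera feed and append
    // "time,face_count" to a per-day log file. No image data is ever written.
    #include <opencv2/objdetect.hpp>
    #include <opencv2/videoio.hpp>
    #include <opencv2/imgproc.hpp>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <ctime>

    int main() {
        cv::CascadeClassifier cascade;
        if (!cascade.load("haarcascade_frontalface_default.xml")) return 1;  // assumed path
        cv::VideoCapture cam(0);                                             // assumed camera index
        if (!cam.isOpened()) return 1;

        cv::Mat frame, gray;
        while (cam.read(frame)) {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
            cv::equalizeHist(gray, gray);

            std::vector<cv::Rect> faces;
            cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(60, 60));

            if (!faces.empty()) {
                std::time_t now = std::time(nullptr);
                char day[16], stamp[16];
                std::strftime(day, sizeof(day), "%Y-%m-%d", std::localtime(&now));
                std::strftime(stamp, sizeof(stamp), "%H:%M:%S", std::localtime(&now));
                // One file per day; only detection time and face count are stored.
                std::ofstream log(std::string(day) + "_faces.csv", std::ios::app);
                log << stamp << "," << faces.size() << "\n";
            }
        }
        return 0;
    }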
Furthermore, to meet the New South Wales state government legislation, a paper notice announcing a continuously running web camera was mounted on the frame (see Figure 1).
The functions of the camera and IR sensor were as follows:

• The Intel OpenCV face detection program (camera) was used to discover whether subjects that passed by looked at Fisherman or not. The face detection program only identified human faces when subjects looked at the display.

• The Intel OpenCV face recognition program (camera) was used to determine how many different subject faces looked at the display within one day.

• The IR sensor was used to count how many subjects passed by Fisherman.

The purpose of using the IR sensor was to obtain a count of the number of subjects passing by that is more accurate than the count given by the face detection program. The IR sensor had a 25-pin parallel interface connected to the local PC, and a small C++ script counted the number of subjects passing by the display each day and saved the count in an individual daily file.
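A minimal sketch of such a pass counter is given below, assuming a Linux host that exposes the port through the ppdev driver and a sensor wired so that a beam break pulls the ACK status line low; the wiring, polling rate and file naming are assumptions rather than details reported in the paper.

    // Parallel-port pass counter sketch: poll one status line driven by the IR
    // sensor, count falling edges, and keep a daily running total in a text file.
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/ppdev.h>
    #include <linux/parport.h>
    #include <fstream>
    #include <string>
    #include <ctime>

    int main() {
        int fd = open("/dev/parport0", O_RDWR);
        if (fd < 0 || ioctl(fd, PPCLAIM) < 0) return 1;

        long passes = 0;                     // running total (daily reset omitted)
        bool beamBroken = false;
        while (true) {
            unsigned char status = 0;
            if (ioctl(fd, PPRSTATUS, &status) < 0) break;

            bool active = (status & PARPORT_STATUS_ACK) == 0;  // assumed: low = beam broken
            if (active && !beamBroken) {                        // count each new break once
                ++passes;
                std::time_t now = std::time(nullptr);
                char day[16];
                std::strftime(day, sizeof(day), "%Y-%m-%d", std::localtime(&now));
                std::ofstream out(std::string(day) + "_ir.txt", std::ios::trunc);
                out << passes << "\n";
            }
            beamBroken = active;
            usleep(20000);                    // ~50 Hz polling
        }
        ioctl(fd, PPRELEASE);
        close(fd);
        return 0;
    }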
Two threshold settings of the face detection program were used (a sketch applying both thresholds follows the list):

• A participant is only counted if he/she stays in front of the display for more than 10 seconds.

• A participant is counted as making a second visit if he/she left the display for more than 1 minute.
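A minimal sketch of how these two thresholds could be applied to the stream of detection timestamps is shown below; the input format (seconds since midnight, one entry per frame in which a face was detected) is an assumption.

    // Threshold sketch: group detection timestamps into visits separated by gaps
    // of more than 60 s, and count a visit as a "look" only if it lasts >= 10 s.
    #include <cstddef>
    #include <vector>
    #include <cstdio>

    int countLooks(const std::vector<double>& t,
                   double minDwell = 10.0, double newVisitGap = 60.0) {
        int looks = 0;
        std::size_t i = 0;
        while (i < t.size()) {
            std::size_t j = i;
            // extend the current visit while consecutive detections are close together
            while (j + 1 < t.size() && t[j + 1] - t[j] <= newVisitGap) ++j;
            if (t[j] - t[i] >= minDwell) ++looks;  // long enough to count as a look
            i = j + 1;                             // the next run is a new visit
        }
        return looks;
    }

    int main() {
        // two visits: a 15 s dwell, then (after a 5-minute gap) a 3 s glance
        std::vector<double> times = {100, 104, 108, 115, 420, 423};
        std::printf("looks counted: %d\n", countLooks(times));  // prints 1
        return 0;
    }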
Also, to meet the ethical requirements, the subjects' faces were not pre-recorded in any database. The face recognition only distinguishes between different subjects, instead of attempting to recognize the identity of each face.

We ran a pilot study which indicated that the performance of the face recognition depended on the local environmental conditions (i.e. lighting conditions, camera pose angle, or even different image resolutions). Of course this is a well-known practical problem for face recognition, and it is expected that current research will improve matters considerably. In this paper, we simply use the Intel OpenCV face detection program as a tool to collect data; we expect that similar experiments in the future will yield more accurate data.

To make a rough estimate of the error rate of the face detection program, its results were calibrated against the IR sensor data. For example, if the face detection program reported a participant passing by the display at 10:30am on 18 Oct., 2005, but there was no corresponding record from the IR sensor, we treated this as an error made by the face detection program. Our pilot study shows that the error rate of the face detection program is about 30% and that the errors occur consistently over time. This error rate sounds high, but it meets our experimental requirements.
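A minimal sketch of this calibration check is given below, assuming both programs log event timestamps and that a face-detection event counts as an error when no IR record falls within a small matching window; the 30-second window and the input format are assumptions.

    // Calibration sketch: the fraction of face-detection events with no IR-sensor
    // record within +/- "window" seconds is taken as the detector's error rate.
    #include <algorithm>
    #include <vector>
    #include <cstdio>

    double faceDetectionErrorRate(const std::vector<double>& faceEvents,
                                  std::vector<double> irEvents,
                                  double window = 30.0) {
        std::sort(irEvents.begin(), irEvents.end());
        int unmatched = 0;
        for (double t : faceEvents) {
            // first IR record not earlier than t - window
            auto it = std::lower_bound(irEvents.begin(), irEvents.end(), t - window);
            bool matched = (it != irEvents.end() && *it <= t + window);
            if (!matched) ++unmatched;
        }
        return faceEvents.empty() ? 0.0
                                  : static_cast<double>(unmatched) / faceEvents.size();
    }

    int main() {
        std::vector<double> face = {37800, 38100, 40000};  // seconds since midnight
        std::vector<double> ir   = {37795, 38090};         // no IR record near 40000
        std::printf("error rate: %.2f\n", faceDetectionErrorRate(face, ir));  // 0.33
        return 0;
    }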
3.2 Results
Three parameters were analyzed in the non-intrusive evaluation of Fisherman:

• The Mean Comprehension Rate (MCR), based on the answers to the comprehension questions (CQ1-CQ4). A higher MCR indicates a better understanding of the display.

• The Total number of Subjects Passing by Fisherman (TSP) in one day, measured using the IR sensor.

• The Total number of Subjects Looking at Fisherman (TSL) in one day, measured by the facial detection system.

It is clear that TSL ≤ TSP, as TSP also counts subjects passing by Fisherman without looking at the display. In this paper, we propose a quantitative measure of "interest", defined as:

ES = TSL / TSP    (1)

We hypothesize that the more interest an ambient display can attract, the more participants will be disrupted in their primary tasks, have a look, and engage in the secondary task.
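For illustration, a minimal sketch of computing the daily ES value of Equation (1) from the kind of daily logs sketched earlier is given below; it treats every logged detection event as one look, so in practice the dwell/revisit filtering above would be applied first, and the file names and formats are assumptions.

    // Daily ES sketch: TSL is approximated by the number of logged face events,
    // TSP is the IR sensor's daily total; ES = TSL / TSP.
    #include <fstream>
    #include <string>
    #include <cstdio>

    int main(int argc, char** argv) {
        std::string day = (argc > 1) ? argv[1] : "2005-10-18";  // illustrative date

        std::ifstream faces(day + "_faces.csv");                // one line per detection event
        long tsl = 0;
        for (std::string line; std::getline(faces, line);) ++tsl;

        std::ifstream ir(day + "_ir.txt");                      // single running total
        long tsp = 0;
        ir >> tsp;

        if (tsp > 0)
            std::printf("%s  TSL=%ld  TSP=%ld  ES=%.3f\n",
                        day.c_str(), tsl, tsp, static_cast<double>(tsl) / tsp);
        return 0;
    }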
The total number of participants passing by Fisherman within the 5 months was about 28,388. The average number of participants passing by was approximately 5,678 per month and 1,419 per week. The total number of participants passing by Fisherman in each month is shown in Figure 2.

Figure 2. Total number of participants passing by Fisherman (weekly totals, September 2005 to January 2006)

The total number of participants who looked at the display within each week is shown in Figure 3. It shows that the peak number of visits occurred at the beginning of the study. This may be because of the novelty effect, which is a strong factor in drawing the attention of passers-by. It also shows that the number of visits decreased dramatically after two weeks and stabilized after about four weeks.

Figure 3. Total number of subjects who looked at Fisherman (weekly totals, September 2005 to January 2006)

Furthermore, we discovered that the largest numbers of visits occurred in three separate time periods (see Figure 4): early morning (8:50AM-10:00AM), lunch time (12:00PM-1:45PM) and late afternoon (4:50PM-5:45PM). This reflects arrival at work, lunchtime, and leaving work. Note that subjects seem to have more time at lunchtime than in the morning and afternoon "rush" hours.

Figure 4. The number of visits in November 2005

Figure 5 shows user performance with respect to comprehension of the display across the three questionnaires. The results in Figure 5 show that the Mean Comprehension Rate (MCR) for each question of the questionnaire increases with time. This result supports our hypothesis that the comprehension of an ambient display such as Fisherman increases over time.

Figure 5. Results of Mean Comprehension Rate

Table 1 shows the mean value of "interest" in each week together with its standard deviation (in each cell, the first value is the mean "interest" and the second is the standard deviation). From Table 1, it seems that interest decreases at the beginning and then starts to stabilize. This can be explained by the novelty effect: many participants were initially interested enough to take a look. Over time, some participants lost interest. At the same time, some participants appreciated Fisherman so much that they went to check the display a couple of times a day.

Table 1. Mean value of "interest" (mean/standard deviation) in each week

          Week 1       Week 2       Week 3       Week 4
Sep., 05  34.8%/0.1    32.9%/0.2    16.9%/0.12   16.7%/0.1
Oct., 05  8.4%/0.03    9.0%/0.04    8.1%/0.03    7.2%/0.01
Nov., 05  7.4%/0.03    6.1%/0.02    5.7%/0.03    5.3%/0.02
Dec., 05  4.3%/0.02    4.7%/0.01    4.1%/0.01    Holiday
Jan., 06  Holiday      4.1%/0.01    3.9%/0.01    4.1%/0.01
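For illustration, the following is a minimal sketch of the per-week aggregation behind each Table 1 cell, i.e. the mean and standard deviation of the daily ES values of one week; the sample numbers are made up and are not the study's data.

    // Weekly aggregation sketch: mean and (population) standard deviation of the
    // daily ES values collected during one week.
    #include <cmath>
    #include <vector>
    #include <cstdio>

    void weeklyStats(const std::vector<double>& es, double& mean, double& stdev) {
        mean = 0.0;
        for (double v : es) mean += v;
        mean /= es.size();
        double var = 0.0;
        for (double v : es) var += (v - mean) * (v - mean);
        stdev = std::sqrt(var / es.size());
    }

    int main() {
        std::vector<double> week = {0.36, 0.33, 0.35, 0.34, 0.36};  // five illustrative days
        double m = 0.0, s = 0.0;
        weeklyStats(week, m, s);
        std::printf("mean = %.1f%%, std = %.2f\n", m * 100.0, s);   // mean = 34.8%, std = 0.01
        return 0;
    }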
Finally, our aesthetic judgment measurement achieved very good results. Table 2 shows the results of aesthetics questions AQ1 and AQ2.

Table 2. Results of the aesthetics questions

       1st test   2nd test   3rd test
AQ1    100%       100%       100%
AQ2    56%        70%        74%

Table 2 shows that 100% of participants found Fisherman visually appealing in all three tests (AQ1: Do you think Fisherman is visually appealing?). Furthermore, Table 2 also shows that the percentage of participants who would want Fisherman in their home or office increased over time (AQ2: If possible, would you put Fisherman in your home/office?).

3.3 Remarks
The results of the evaluation study of Fisherman show that "interest" in Fisherman peaked at the beginning of the experiment and stabilized after about one month. This can be partially explained by the novelty of the display easily drawing the interest of passers-by, although many later stopped visiting Fisherman for various reasons. Our post-questionnaire and informal feedback suggest two possible reasons for this observation:

• The data source does not interest users.

• There is a lack of reference in the visual metaphor. For example, a typical comment from a subject was: "I notice the color, the number of trees and the position of the boat changing, but I can't get precise information from this change. Also I can't tell the difference between small percentages of change in these three metaphors. There is a lack of reference for the difference between heaviest and heavier fog."

In addition, from a further analysis combining comprehension and "interest", we conclude that user comprehension increases over time, while participants' interest in Fisherman stabilized after a period of time. This leads us to the following new conclusion:
Remark 1. An effective ambient display can be understood over
time, but should also retain its interest over time.
A question in many evaluations is: "When should an evaluation study be conducted?" Our case study shows that to measure the true value of an ambient display such as Fisherman, one should wait until the value of "interest" stabilizes over time. This is because a stable "interest" value means that the display has integrated into the environment and no longer draws unusual attention from users. As a result, testing an ambient display immediately after installation might significantly skew the results due to the novelty effect.

Remark 2. A non-intrusive evaluation should not be conducted until the display has integrated into the environment and the "interest" measure has sufficiently stabilized.

After conducting this experiment, we also found that the boundary between intrusive and non-intrusive evaluation methodologies is not necessarily well defined. Our previously described categorization is closer to two extreme endpoints of a continuous range than to separate buckets. Thus, it is possible that an experiment planned as a non-intrusive evaluation becomes intrusive in some way. For example, our questionnaire interviews can result in unintended attention being paid to the Fisherman experiment itself, potentially even influencing the results or renewing interest in the display.

The major difference between these two evaluation methodologies is the level of user involvement, with an intrusive evaluation requiring higher user involvement than a non-intrusive evaluation. Intrusive evaluation seems to be ideal for the quantitative measurement of parameters. As commonly applied intrusive evaluations are task-oriented and often occur in well-controlled laboratory environments, most existing evaluation methods in information visualization fall into the intrusive category. In contrast, a non-intrusive evaluation relies on tracing users through video/image processing or alternative, unobtrusive sensors to collect more candid results that require little or no user interruption. Many of these techniques are still under development (e.g. it remains difficult to robustly distinguish two different faces under varying environmental conditions).

Remark 3. Non-intrusive evaluation is a better way to evaluate ambient displays, but it may be limited by current sensor technologies and by important privacy concerns.
4. CONCLUSION
This paper focused on how to conduct a long-term evaluation study of an ambient display without requiring focused user attention. Firstly, it briefly discussed the difference between intrusive and non-intrusive evaluations, as proposed in our previous paper [16]. Secondly, a non-intrusive evaluation case study was presented and its technical implementation described in detail. In this case study, a quantitative measure of "interest" was proposed to quantify the impact of Fisherman. The results show that user comprehension increases over time, while user interest decreases initially and then stabilizes. Finally, three remarks are drawn based on these new results.
This work is still in progress. Our future plans include further experiments to gain experience in both intrusive and non-intrusive evaluation studies.

5. REFERENCES
[1] Chalmers, A.P. The role of cognitive theory in human-computer interface. Computers in Human Behavior, 2003. 19: 593-607.
[2] Cleveland, W.S. et al. Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 1984. 79(387): 531-546.
[3] Disappearing Computer. http://www.disappearing-computer.net, accessed 23 Oct., 2007.
[4] Future Application Lab. http://www.viktoria.se/fal/, accessed 23 Oct., 2007.
[5] Human Resource Committee. National statement on ethical conduct in research involving humans: Part 18 - Privacy of information. http://www.nhmrc.gov.au/publications/humans/part18.htm, accessed 23 Oct., 2007.
[6] IBM Pervasive Computing. http://wireless.ibm.com/pvc/cy/, accessed 23 Oct., 2007.
[7] Intel OpenCV. http://www.intel.com/research/, accessed 23 Oct., 2007.
[8] Ishii, H. et al. Tangible bits: towards seamless interfaces between people, bits and atoms. In Proceedings of CHI'97 (Atlanta, USA), ACM Press, 234-241.
[9] Lau, A. et al. Towards a model of information aesthetics in information visualization. In Proceedings of the 11th International Conference on Information Visualization, 2007, 87-92.
[10] Mankoff, J. et al. Heuristic evaluation of ambient displays. In Proceedings of CHI'03, ACM Press, 169-176.
[11] Matthews, T. et al. A toolkit for managing user attention in peripheral displays. In Proceedings of UIST'04, 247-256.
[12] McCrickard, D.S. et al. A model for notification systems evaluation - Assessing user goals for multitasking activity. ACM Transactions on Computer-Human Interaction, 10(4): 312-338.
[13] Peters, B. Remote testing versus lab testing. http://boltpeters.com/articles/versus.html, accessed 23 Oct., 2007.
[14] Pousman, Z. et al. A taxonomy of ambient information systems: Four patterns of design. In Proceedings of AVI'06, ACM Press, 67-74.
[15] Shami et al. Context of use evaluation of peripheral displays. In Proceedings of INTERACT'05, Springer, 579-587.
[16] Shen, X. et al. Intrusive and non-intrusive evaluation of ambient displays. In Workshop on Designing and Evaluating Ambient Information Systems at Pervasive'07, 2007, 30-36.
[17] Somervell, J. et al. An evaluation of information visualization in attention-limited environments. In Proceedings of the Symposium on Data Visualisation 2002, Barcelona, Spain.
[18] Weiser, M. The computer for the 21st century. Scientific American, 1991. 265(3): 66-75.