MS-Word version - Igbinedion University Okada

HCI
A LOOK INTO THE WORLD
OF
HUMAN-COMPUTER
INTERFACE
PART ONE
Compiled by Omorogbe Harry
TABLE OF CONTENTS
Chapter One
Introduction and Overview
History and Background
Earlier development and Foundation of the Field
Pioneers
The need for HCI
Strategic Themes
Basic Interaction
Direct Manipulation of graphical objects
Application Types
Current Development
Technological Trends
Up-and-Coming Areas
Visualization and Biological Field
Chapter Two
Concept and Design in HCI
Design and Evaluation Methods
Concepts of User Interface design
Principle of User Interface Design
Ergonomic Guidelines for User-Interface Design
General Principles to follow when designing any program
Human Issues
Importance of HCI
Chapter Three
HCI and Web: problems and Promises
Issues in HCI design in the Web Medium
How screens Display Colours
Web-Safe Colours
Contributors to HCI
Chapter Four
Gesture Recognition
Augmented Reality
Computer Supported Cooperative Work
CHAPTER ONE
HUMAN AND COMPUTER INTERFACE
Connecting with your computer - Human-computer interaction and Artificial
Intelligence
INTRODUCTION AND OVERVIEW
"Computer, this is captain Jeanway. Abort the self-destruction sequence. Authorization
code, 89453432..."
"Voice, confirmed. Authorization code, confirmed. Abort the self-destruction
sequence...Unable to comply, System malfunction..."
BANG!!!
......
If you are a trekker, you will undoubtedly recognize the above conversation. Yes, it is
from Star Trek, a television series spawned by one of the most popular science fiction
franchises of the century. However, if you simply have not heard of "Star Trek", do not
worry, because we only need to know that the above is a human-computer interaction,
one which will hopefully be possible in the future (except for the "BANG" part).
Actually, a conversation as simple as the one above between a human and a computer is
far more difficult for today's technology to accomplish than you may have imagined. It
involves speech recognition, natural language understanding, artificial intelligence, and
natural voice output, all of which are topics in the study of Human-Computer Interaction
(HCI). Simply put, Human-Computer Interaction is an interdisciplinary study of how
humans interact with computers, which includes user interface design, human perception
and cognitive science, artificial intelligence, and virtual reality. With the explosive
growth of raw computing power and accompanying technologies, computers have
become essential to everyday life, and because of this, HCI, the science of how humans
interact with computers, is attracting more and more attention these days.
Comprehensively, Human-Computer Interaction (HCI) is the study of how people
design, implement, and use interactive computer systems, and how computers affect
individuals, organizations, and society. This encompasses not only ease of use but also
new interaction techniques for supporting user tasks, providing better access to
information, and creating more powerful forms of communication. It involves input and
output devices and the interaction techniques that use them; how information is
presented and requested; how the computer's actions are controlled and monitored; all
forms of help, documentation, and training; the tools used to design, build, test, and
evaluate user interfaces; and the processes that developers follow when creating
interfaces.
HCI is a research area of increasingly central significance to computer science, other
scientific and engineering disciplines, and an ever expanding array of application
domains. This more prominent role follows from the widely perceived need to expand
the focus of computer science research beyond traditional hardware and software issues
to attempt to better understand how technology can more effectively support people in
accomplishing their goals.
At the same time that a human-centered approach to system development is of growing
significance, factors conspire to make the design and development of systems even more
difficult than in the past. This increased difficulty follows from the disappearance of
boundaries between applications as we start to support people's real activities; between
machines as we move to distributed computing; between media as we expand systems to
include video, sound, graphics, and communication facilities; and between people as we
begin to realize the importance of supporting organizations and group activities.
Research in Human-Computer Interaction (HCI) has been spectacularly successful,
and has fundamentally changed computing. Just one example is the ubiquitous graphical
interface used by Microsoft Windows 95, which is based on the Macintosh, which is
based on work at Xerox PARC, which in turn is based on early research at the Stanford
Research Laboratory (now SRI) and at the Massachusetts Institute of Technology.
Another example is that virtually all software written today employs user interface
toolkits and interface builders, concepts which were developed first at universities. Even
the spectacular growth of the World-Wide Web is a direct result of HCI research:
applying hypertext technology to browsers allows one to traverse a link across the world
with a click of the mouse. Interface improvements, more than anything else, have triggered
this explosive growth. Furthermore, the research that will lead to the user interfaces for
the computers of tomorrow is happening at universities and a few corporate research
labs.
This lecture note tries to briefly summarize many of the important research
developments in Human-Computer Interaction (HCI) technology. By "research," I mean
exploratory work at universities and government and corporate research labs (such as
Xerox PARC) that is not directly related to products. By "HCI technology," I am
referring to the computer side of HCI. A companion work on the history of the "human
side," discussing the contributions from psychology, design, human factors and
ergonomics would also be appropriate.
Figure 1 shows time lines for some of the technologies discussed in this lecture note. Of course,
a deeper analysis would reveal much interaction between the university, corporate
research and commercial activity streams. It is important to appreciate that years of
research are involved in creating and making these technologies ready for widespread
use. The same will be true for the HCI technologies that will provide the interfaces of
tomorrow.
It is clearly impossible to list every system and source in a lecture note of this scope, but
I have tried to represent the earliest and most influential systems, although a number of
other surveys of HCI topics are also available.
The technologies covered in this material include fundamental interaction styles like
direct manipulation, the mouse pointing device, and windows; several important kinds of
application areas, such as drawing, text editing and spreadsheets; the technologies that
will likely have the biggest impact on interfaces of the future, such as gesture
recognition, multimedia, Computer-Supported Cooperative Work, and 3D; and the
technologies used to create interfaces using the other technologies, such as user interface
management systems, toolkits, and interface builders.
Figure 1: Approximate time lines showing where work was performed on some major
technologies discussed in this lecture note.
Contributors to HCI
HCI is a multidisciplinary field. The main contributions come from computer science,
cognitive psychology, and ergonomics and human factors. However, other areas of
interest include artificial intelligence, (graphic) design, engineering, and even
philosophy, sociology, and anthropology:
Fig 2: Diagram of the contributors to HCI: computer science, artificial intelligence,
cognitive psychology, ergonomics and human factors, engineering, philosophy, design,
sociology, and anthropology, all surrounding HCI at the center.
Early development
What we take for granted today is actually the accomplishment of over 30 years of
continuing research in the area. For instance, consider direct manipulation of graphical
objects: the now ubiquitous direct manipulation interface, where visible objects on the
screen are directly manipulated with a pointing device, was first demonstrated by Ivan
Sutherland in Sketchpad, which was his 1963 MIT PhD thesis. Sketchpad supported the
manipulation of objects using a light-pen, including grabbing objects, moving them,
changing their size, and using constraints. Following that was William Newman's
Reaction Handler, which was created at Imperial College, London in 1967. Reaction
Handler provided direct manipulation of graphics and introduced "Light Handles," a
form of graphical potentiometer that was probably the first "widget." Another early
system was AMBIT/G (implemented at MIT's Lincoln Labs, 1968). It employed iconic
representations, gesture recognition, dynamic menus with items selected using a pointing
device, selection of icons by pointing, and moded and mode-free styles of interaction.
Many of the interaction techniques popular in direct manipulation interfaces, such as
how objects and text are selected, opened, and manipulated, were researched at Xerox
PARC in the 1970's. In particular, the idea of "WYSIWYG" (what you see is what you
get) originated there with systems such as the Bravo text editor and the Draw drawing
program. The first commercial systems to make extensive use of direct manipulation
were the Xerox Star (1981), the Apple Lisa (1982), and the Macintosh (1984). Today,
when most people take for granted the ability to drag an icon or drop a file on their
computer, few stop to think that these conveniences are the result of some 30 years of
global research.
Pioneers
Major technologies emerged in the same period, including text editing, the mouse,
windows, gesture recognition, and computer-aided design, and in most of those fields
researchers have made astonishing progress that we can easily discern today. Among
all the institutions working on HCI, a few pioneers are worth mentioning here.
Xerox PARC was one of the most innovative organizations in early HCI research and
development. It was a major contributor to many important interface ideas such as direct
manipulation of graphical objects, the mouse, windows, etc. The MIT AI Lab, IBM, and
AT&T Bell Labs are also among the most prominent contributors to early HCI
development. Because of the collective efforts and contributions of various organizations
and individuals, the way humans interact with computers has been revolutionized since
the 1960s, and after 30 years of research, more exciting fields are emerging day by day.
The need for HCI (Prospective)
Although one is encouraged by past research success in HCI and excited by the potential
of current research, I want to emphasize how central a strong research effort is to future
practical use of computational and network technologies. For example, popular
discussion of the National Information Infrastructure (NII) envisions the development of
an information marketplace that can enrich people's economic, social, cultural, and
political lives. For such an information marketplace, or, in fact, many other applications,
to be successful requires solutions to a series of significant research issues that all revolve
around better understanding how to build effective human-centered systems. The
following sections discuss selected strategic themes, technology trends, and opportunities
to be addressed by HCI research.
Strategic Themes
If one steps back from the details of current HCI research, a number of themes are visible.
Although I cannot hope to do justice here to elaborating these or a number of other
themes that arose in workshop discussions, it is clear that HCI research has now started
to crystallize as a critical discipline, intimately involved in virtually all uses of computer
technologies and decisive to successful applications. Here I expand on just a few themes:
 Universal Access to Large and Complex Distributed Information: As the "global
information infrastructure" expands at unprecedented rates, there are dramatic changes
taking place in the kind of people who access the available information and the types of
information involved. Virtually all entities (from large corporations to individuals) are
engaged in activities that increasingly involve accessing databases, and their livelihood
and/or competitiveness depend heavily on the effectiveness and efficiency of that access.
As a result, the potential user community of database and other information systems is
becoming startlingly large and rather nontechnical, with most users bound to remain
permanent novices with respect to many of the diverse information sources they can
access. It is therefore urgently necessary and strategically critical to develop user
interfaces that require minimal technical sophistication and expertise by the users and
support a wide variety of information-intensive tasks.
Information-access interfaces must offer great flexibility on how queries are expressed
and how data are visualized; they must be able to deal with several new kinds of data,
e.g., multimedia, free text, documents, the Web itself; and they must permit several new
styles of interaction beyond the typical, two-step query-specification/result-visualization
loop, e.g., data browsing, filtering, and dynamic and incremental querying. Fundamental
research is required on visual query languages, user-defined and constraint-based
visualizations, visual metaphors, and generic and customizable interfaces, and advances
seem most likely to come from collaborations between the HCI and database research
communities.
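The dynamic, incremental querying style mentioned above can be illustrated with a small sketch: rather than the two-step query-specification/result-visualization loop, the visible result set is re-filtered on every keystroke. The records and field names below are invented purely for illustration.

```python
# Sketch of dynamic, incremental querying: each keystroke immediately
# narrows the visible result set instead of waiting for a submitted query.
# The data and field names here are hypothetical.

records = [
    {"title": "Sketchpad", "year": 1963},
    {"title": "NLS", "year": 1968},
    {"title": "Star", "year": 1981},
    {"title": "Macintosh", "year": 1984},
]

def incremental_filter(records, typed_so_far):
    """Return the records whose title contains the partial query."""
    q = typed_so_far.lower()
    return [r for r in records if q in r["title"].lower()]

# Simulate the user typing "s", then "st", then "sta":
for partial in ["s", "st", "sta"]:
    matches = [r["title"] for r in incremental_filter(records, partial)]
    print(partial, "->", matches)
```

A real system would also re-render the visualization on each keystroke; the point of the sketch is only the tight query-refinement loop.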
Information-discovery interfaces must support a collaboration between humans and
computers, e.g., for data mining. Because of our limited memory and cognitive abilities,
the growing volume of available information has increasingly forced us to delegate the
discovery process to computers, greatly underemphasizing the key role played by
humans. Discovery should be viewed as an interactive process in which the system gives
users the necessary support to analyze terabytes of data, and users give the system the
feedback necessary to better focus its search. Fundamental issues for the future include
how best to array tasks between people and computers, create systems that adapt to
different kinds of users, and support the changing context of tasks. Also, the system
could suggest appropriate discovery techniques depending on data characteristics, as well
as data visualizations, and help integrate what are currently different tools into a
homogeneous environment.
 Education and Life-Long Learning: Computationally assisted access to
information has important implications for education and learning as evidenced in
current discussions of "collaboratories" and "virtual universities." Education is a domain
that is fundamentally intertwined with human-computer interaction. HCI research
includes both the development and evaluation of new educational technologies such as
multimedia systems, interactive simulations, and computer-assisted instructional
materials. For example, consider distance learning situations involving individuals far
away from schools. What types of learning environments, tools, and media effectively
deliver the knowledge and understanding that these individuals seek? Furthermore, what
constitutes an effective educational technology? Do particular media or types of
simulations foster different types of learning? These questions apply not only to
secondary and university students, but also to adults through life-long learning. Virtually
every current occupation involves workers who encounter new technologies and require
additional training. How can computer-assisted instructional systems engage individuals
and help them to learn new ideas? HCI research is crucial to answering these important
questions.
 Electronic Commerce: Another important theme revolves around the increasing
role of computation in our economic life and highlights central HCI issues that go
beyond usability to concerns with privacy, security, and trust. Although currently there is
much hyperbole, as with most Internet technologies, over the next decade
commercialization of the Internet may mean that digital commerce replaces much
traditional commerce. The Internet makes possible services that could potentially be
quite adaptive and responsive to consumer wishes. Digital commerce may require
dramatic changes to internal processes as well as the invention of new processes. For
digital commerce to be successful, the technology surrounding it will have to be
affordable, widely available, simple to use, and secure. Interface issues are, of course,
key.
 End-User Programming: An important reason that the WWW has been so
successful is that everyone can create his or her own pages. With the advent of
WYSIWYG HTML page-editing tools, it will be even easier. However, for "active" pages
that use forms, animations, or computation, a professional programmer is required to
write the required code in a programming language like Perl or Java. The situation is
the same for the desktop, where applications are becoming increasingly programmable
(e.g., by writing Visual Basic scripts for Microsoft Word), but only to those with training
in programming. Applying the principles and methods of HCI to the design of
programming languages and programming systems for end-users should bring to
everyone the ability to program Web pages and desktop applications.
End-user programming will be increasingly important in the future. No matter how
successful interface designers are, systems will still need to be customized to the needs
of particular users. Although there will likely be generic structures, for example, in an
email filtering system, that can be shared, such systems and agents will always need to
be tailored to meet personal requirements. The use of various scripting languages to meet
such needs is widespread, but better interfaces and understandings of end-user
programming are needed.
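The email-filtering example above captures the flavor of end-user tailoring: the generic machinery is shared, while each user supplies personal rules that are simple enough to edit without programming training. The sketch below is a toy illustration of that split (all rule contents and folder names are hypothetical):

```python
# Minimal sketch of end-user-tailorable mail filtering: a generic engine
# plus per-user rules expressed as plain data. Rules are invented examples.

def make_rule(field, contains, folder):
    """A rule files a message into `folder` when `field` contains `contains`."""
    return {"field": field, "contains": contains.lower(), "folder": folder}

def file_message(message, rules, default_folder="inbox"):
    """Return the folder assigned by the first matching rule, else the default."""
    for rule in rules:
        if rule["contains"] in message.get(rule["field"], "").lower():
            return rule["folder"]
    return default_folder

# Each user tailors the shared machinery with personal rules:
my_rules = [
    make_rule("subject", "meeting", "calendar"),
    make_rule("sender", "newsletter", "reading"),
]

print(file_message({"sender": "boss@example.com", "subject": "Meeting at 3"},
                   my_rules))
```

The design point is that the rules, unlike a Perl or Visual Basic script, are data a permanent novice could plausibly inspect and edit.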
 Information Visualization: This area focuses on graphical mechanisms designed to
show the structure of information and improve the cost structure of access. Previous
approaches have studied novel visualizations for information, such as the "Information
Visualizer", history-enriched digital objects for displaying graphical abstractions of
interaction history, and dotplots for visualizing self-similarity in millions of lines of text
and code. Other approaches provide novel techniques for displaying data, e.g., dynamic
queries, visual query languages, zoomable interfaces for supporting multiscale interfaces,
and lenses to provide alternative views of information. Another branch of research is
studying automatic selection of visualizations based on properties of the data and the
user's tasks.
The importance of information visualization will increase as people have access to larger
and more diverse sources of information (e.g., digital libraries, large databases), which
are becoming universally available with the WWW. Visualizing the WWW itself and
other communication networks is also an important aim of information visualization
systems. The rich variety of information may be handled by giving the users the ability
to tailor the visualization to a particular application, to the size of the data set, or to the
device (e.g., 2D vs. 3D capabilities, large vs. small screens). Research challenges include
making the specification, exploration, and evolution of visualizations interactive and
accessible to a variety of users. Tools should be designed that support a range of tailoring
capabilities: from specifying visualizations from scratch to minor adaptations of existing
visualizations. Incorporating automatic generation of information visualization with userdefined approaches is another interesting open problem, for example when the userdefined visualization is underconstrained.
One fundamental issue for information visualization is how to characterize the
expressiveness of visualization and judge its adequacy to represent a data set. For
example, the "readability" of a visualization of a graph may depend on (often conflicting)
aesthetic criteria, such as the minimization of edge crossings and of the area of the graph,
and the maximization of symmetries. For other types of visualization, the criteria are
quite ad hoc. Therefore, more foundation work is needed for establishing general
principles.
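One of the aesthetic criteria just mentioned, minimizing edge crossings, can at least be measured directly for a straight-line drawing of a graph. The brute-force sketch below counts crossings given node positions (a proper-intersection test only, ignoring edges that share an endpoint):

```python
# Sketch: counting edge crossings in a straight-line graph drawing, one of
# the (often conflicting) readability criteria for graph visualizations.

def ccw(a, b, c):
    """True if points a, b, c make a counter-clockwise turn."""
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """Proper intersection test for segments p1-p2 and p3-p4."""
    return (ccw(p1, p3, p4) != ccw(p2, p3, p4)
            and ccw(p1, p2, p3) != ccw(p1, p2, p4))

def count_crossings(positions, edges):
    """Count crossing pairs among edges drawn as straight lines,
    skipping pairs that share a node."""
    n = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            a, b = edges[i]
            c, d = edges[j]
            if {a, b} & {c, d}:
                continue  # edges sharing a node do not "cross"
            if segments_cross(positions[a], positions[b],
                              positions[c], positions[d]):
                n += 1
    return n

# A 4-node drawing: the two diagonals of a unit square cross once.
pos = {"A": (0, 0), "B": (1, 1), "C": (0, 1), "D": (1, 0)}
print(count_crossings(pos, [("A", "B"), ("C", "D")]))
```

A layout algorithm could use such a count as one term of an objective function, trading it off against area and symmetry as the text notes.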
 Computer-Mediated Communication: Examples of computer-mediated
communication range from work that led to extraordinarily successful applications such
as email to that involved in newer forms of communication via computers, such as
real-time video and audio interactions. Research in Computer Supported Cooperative Work
(CSCW) confronts complex issues associated with integration of several technologies
(e.g., telephone, video, 3D graphics, cable, modem, fax, email), support for multi-person
activities (which have particularly difficult interface development challenges), and issues
of security, privacy, and trust.
The unpredicted shift of focus to the Internet, intranets, and the World-Wide Web has
ended a period in which the focus was on the interaction between an individual and a
computer system, with relatively little attention to group and organizational contexts.
Computer-mediated human communication raises a host of new interface issues.
Additional challenges arise in coordinating the activities of computer-supported group
members, either by providing shared access to common on-line resources and letting
people structure their work around them, or by formally representing work processes to
enable a system to guide the work. The CSCW subcommunity of human-computer
interaction has grown rapidly, drawing from diverse disciplines. Social theory and social
science, management studies, communication studies, and education are among the relevant
areas of knowledge and expertise. Techniques drawn from these areas, including
ethnographic approaches to understanding group activity, have become important
adjuncts to more familiar usability methods.
Mounting demands for more function, greater availability, and interoperability affect
requirements in all areas. For example, the great increase in accessible information shifts
the research agenda toward more sophisticated information retrieval techniques.
Approaches to dealing with the new requirements through formal or de facto standards
can determine where research is pointless, as well as where it is useful. As traditional
applications are integrated into the Web, social aspects of computing are extended.
Basic Interactions
Direct Manipulation of graphical objects: The now ubiquitous direct manipulation
interface, where visible objects on the screen are directly manipulated with a pointing
device, was first demonstrated by Ivan Sutherland in Sketchpad, which was his 1963
MIT PhD thesis. Sketchpad supported the manipulation of objects using a light-pen,
including grabbing objects, moving them, changing size, and using constraints. It
contained the seeds of myriad important interface ideas. The system was built at Lincoln
Labs with support from the Air Force and NSF. William Newman's Reaction Handler,
created at Imperial College, London (1966-67) provided direct manipulation of graphics,
and introduced "Light Handles," a form of graphical potentiometer, that was probably the
first "widget." Another early system was AMBIT/G (implemented at MIT's Lincoln
Labs, 1968, ARPA funded). It employed, among other interface techniques, iconic
representations, gesture recognition, dynamic menus with items selected using a pointing
device, selection of icons by pointing, and moded and mode-free styles of interaction.
David Canfield Smith coined the term "icons" in his 1975 Stanford PhD thesis on
Pygmalion (funded by ARPA and NIMH) and Smith later popularized icons as one of
the chief designers of the Xerox Star. Many of the interaction techniques popular in
direct manipulation interfaces, such as how objects and text are selected, opened, and
manipulated, were researched at Xerox PARC in the 1970's. In particular, the idea of
"WYSIWYG" (what you see is what you get) originated there with systems such as the
Bravo text editor and the Draw drawing program. The concept of direct manipulation
interfaces for everyone was envisioned by Alan Kay of Xerox PARC in a 1977 article
about the "Dynabook". The first commercial systems to make extensive use of Direct
Manipulation were the Xerox Star (1981), the Apple Lisa (1982) and Macintosh (1984).
Ben Shneiderman at the University of Maryland coined the term "Direct Manipulation"
in 1982 and identified the components and gave psychological foundations.
The Mouse: The mouse was developed at Stanford Research Laboratory (now SRI) in
1965 as part of the NLS project (funding from ARPA, NASA, and Rome ADC) to be a
cheap replacement for light-pens, which had been used at least since 1954. Many of the
current uses of the mouse were demonstrated by Doug Engelbart as part of NLS in a
movie created in 1968. The mouse was then made famous as a practical input device by
Xerox PARC in the 1970's. It first appeared commercially as part of the Xerox Star
(1981), the Three Rivers Computer Company's PERQ (1981), the Apple Lisa (1982), and
Apple Macintosh (1984).
Windows: Multiple tiled windows were demonstrated in Engelbart's NLS in 1968.
Early research at Stanford on systems like COPILOT (1974) and at MIT with the
EMACS text editor (1974) also demonstrated tiled windows. Alan Kay proposed the idea
of overlapping windows in his 1969 University of Utah PhD thesis and they first
appeared in 1974 in his Smalltalk system at Xerox PARC, and soon after in the InterLisp
system. Some of the first commercial uses of windows were on Lisp Machines Inc.
(LMI) and Symbolics Lisp Machines (1979), which grew out of MIT AI Lab projects.
The Cedar Window Manager from Xerox PARC was the first major tiled window
manager (1981), followed soon by the Andrew window manager by Carnegie Mellon
University's Information Technology Center (1983, funded by IBM). The main
commercial systems popularizing windows were the Xerox Star (1981), the Apple Lisa
(1982), and most importantly the Apple Macintosh (1984). The early versions of the Star
and Microsoft Windows were tiled, but eventually they supported overlapping windows
like the Lisa and Macintosh. The X Window System, a current international standard,
was developed at MIT in 1984.
Application Types
Drawing programs: Much of the current technology was demonstrated in Sutherland's
1963 Sketchpad system. The use of a mouse for graphics was demonstrated in NLS
(1965). In 1968 Ken Pulfer and Grant Bechthold at the National Research Council of
Canada built a mouse out of wood patterned after Engelbart's and used it with a keyframe animation system to draw all the frames of a movie. A subsequent movie,
"Hunger" in 1971 won a number of awards, and was drawn using a tablet instead of the
mouse (funding by the National Film Board of Canada). William Newman's Markup
(1975) was the first drawing program for Xerox PARC's Alto, followed shortly by
Patrick Baudelaire's Draw which added handling of lines and curves. The first computer
painting program was probably Dick Shoup's "Superpaint" at PARC (1974-75).
Text Editing: In 1962 at the Stanford Research Lab, Engelbart proposed, and later
implemented a word processor with automatic word wrap, search and replace,
user-definable macros, scrolling text, and commands to move, copy, and delete characters,
words, or blocks of text. Stanford's TV Edit (1965) was one of the first CRT-based
display editors that was widely used. The Hypertext Editing System from Brown
University had screen editing and formatting of arbitrary-sized strings with a light pen in
1967 (funding from IBM). NLS demonstrated mouse-based editing in 1968. TECO from
MIT was an early screen-editor (1967) and EMACS developed from it in 1974. Xerox
PARC's Bravo was the first WYSIWYG editor-formatter (1974). It was designed by
Butler Lampson and Charles Simonyi who had started working on these concepts around
1970 while at Berkeley. The first commercial WYSIWYG editors were the Star,
LisaWrite, and then MacWrite.
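Among the features Engelbart demonstrated, automatic word wrap is simple enough to sketch with the classic greedy line-filling algorithm. This is a modern simplified sketch, not a reconstruction of any historical editor:

```python
def word_wrap(text, width):
    """Greedy word wrap: keep adding words to the current line while they
    fit, otherwise start a new line. A simplified sketch only."""
    lines, current = [], ""
    for word in text.split():
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

for line in word_wrap("the quick brown fox jumps over the lazy dog", 15):
    print(line)
```

Real editors refine this with hyphenation and proportional fonts, but the break-on-overflow loop is the core of "automatic" wrapping.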
Spreadsheets: The initial spreadsheet was VisiCalc which was developed by Frankston
and Bricklin (1977-8) for the Apple II while they were students at MIT and the Harvard
Business School. The solver was based on a dependency-directed backtracking algorithm
by Sussman and Stallman at the MIT AI Lab.
Hypertext: The idea for hypertext (where documents are linked to related documents)
is credited to Vannevar Bush's famous MEMEX idea from 1945. Ted Nelson coined the
term "hypertext" in 1965. Engelbart's NLS system at the Stanford Research Laboratories
in 1965 made extensive use of linking (funding from ARPA, NASA, and Rome ADC).
The "NLS Journal" was one of the first on-line journals, and it included full linking of
articles (1970). The Hypertext Editing System, jointly designed by Andy van Dam, Ted
Nelson, and two students at Brown University (funding from IBM) was distributed
extensively. The University of Vermont's PROMIS (1976) was the first Hypertext
system released to the user community. It was used to link patient and patient care
information at the University of Vermont's medical center. The ZOG project (1977) from
CMU was another early hypertext system, and was funded by ONR and DARPA. Ben
Shneiderman's Hyperties was the first system where highlighted items in the text could
be clicked on to go to other pages (1983, Univ. of Maryland). HyperCard from Apple
(1988) significantly helped to bring the idea to a wide audience. There have been many
other hypertext systems through the years. Tim Berners-Lee used the hypertext idea to
create the World Wide Web in 1990 at the government-funded European Particle Physics
Laboratory (CERN). Mosaic, the first popular hypertext browser for the World-Wide
Web was developed at the Univ. of Illinois' National Center for Supercomputer
Applications (NCSA).
Computer Aided Design (CAD): The same 1963 IFIPS conference at which Sketchpad
was presented also contained a number of CAD systems, including Doug Ross's
Computer-Aided Design Project at MIT in the Electronic Systems Lab and Coons' work
at MIT with Sketchpad. Timothy Johnson's pioneering work on the interactive 3D CAD
system Sketchpad 3 was his 1963 MIT MS thesis (funded by the Air Force). The first
CAD/CAM system in industry was probably General Motor's DAC-1 (about 1963).
Video Games: The first graphical video game was probably Spacewar by Slug
Russell of MIT in 1962 for the PDP-1, which also featured the first computer joysticks.
The early computer Adventure game was created by Will Crowther at BBN, and Don
Woods developed this into a more sophisticated Adventure game at Stanford in 1976.
Conway's game of LIFE was implemented on computers at MIT and Stanford in 1970.
The first popular commercial game was Pong (about 1976).
UIMSs and Toolkits: The first User Interface Management System (UIMS) was
William Newman's Reaction Handler created at Imperial College, London (1966-67 with
SRC funding). Most of the early work took place at universities (University of Toronto
with Canadian government funding; George Washington University with NASA, NSF,
DOE, and NBS funding; Brigham Young University with industrial funding). The term
UIMS was coined by David Kasik at Boeing (1982). Early window managers such as
Smalltalk (1974) and InterLisp, both from Xerox PARC, came with a few widgets, such
as popup menus and scrollbars. The Xerox Star (1981) was the first commercial system
to have a large collection of widgets and to use dialog boxes. The Apple Macintosh
(1984) was the first to actively promote its toolkit for use by other developers to enforce
a consistent interface. An early C++ toolkit was InterViews, developed at Stanford
(1988, industrial funding). Much of current research is now being performed at
universities, including Garnet and Amulet at CMU (ARPA funded), MasterMind at
Georgia Tech (ARPA funded), and Artkit at Georgia Tech (funding from NSF and Intel).
There are, of course, many other examples of HCI research that should be included in a
complete history, including work that led to drawing programs, paint programs,
animation systems, text editing, spreadsheets, multimedia, 3D, virtual reality, interface
builders, event-driven architectures, usability engineering, and a very long list of other
significant developments. Although our brief history here has had to be selective, what
we hope is clear is that there are many years of productive HCI research behind our
current interfaces and that it has been research results that have led to the successful
interfaces of today.
For the future, HCI researchers are developing interfaces that will greatly facilitate
interaction and make computers useful to a wider population. These technologies
include: handwriting and gesture recognition, speech and natural language
understanding, multiscale zoomable interfaces, "intelligent agents" to help users
understand systems and find information, end-user programming systems so people can
create and tailor their own applications, and much, much more. New methods and tools
promise to make the process of developing user interfaces significantly easier but the
challenges are many as we expand the modalities that interface designers employ and as
computing systems become an increasingly central part of virtually every aspect of our
lives.
As HCI has matured as a discipline, a set of generally agreed-upon principles is
emerging that is taught in courses on HCI at the undergraduate and graduate
level. These principles should be taught to every CS undergraduate, since virtually all
programmers will be involved in designing and implementing user interfaces during their
careers. These principles are described in other publications and include task
analysis, user-centered design, and evaluation methods.
Technological Trends
Again, the number and variety of trends identified in these discussions outstrip the space I
have here for reporting. One can see large general trends that are moving the field from
concerns about connectivity, as the networked world becomes a reality, to compatibility,
as applications increasingly need to run across different platforms and code begins to
move over networks as easily as data, to issues of coordination, as we come to understand the
need to support multi-person and organizational activities. I will limit the discussion here to
a few instances of these general trends.
• Computational Devices and Ubiquitous Computing: One of the most notable
trends in computing is the increase in the variety of computational devices with which
users interact. In addition to workstations and desktop personal computers, users are
faced with (to mention only a few) laptops, PDAs, and LiveBoards. In the near future,
Internet telephony will be universally available, and the much-heralded Internet
appliance may allow interactions through the user's television and local cable connection.
In the more distant future, wearable devices may become more widely available. All
these technologies have been considered under the heading of "Ubiquitous Computing"
because they involve using computers everywhere, not just on desks.
The introduction of such devices presents a number of challenges to the discipline of
HCI. First, there is the tension between the design of interfaces appropriate to the device
in question and the need to offer a uniform interface for an application across a range of
devices. The computational devices differ greatly, most notably in the sizes and
resolutions of displays, but also in the available input devices, the stance of the user (is
the user standing, sitting at a desk, or on a couch?), the physical support of the device (is
the device sitting on a desk, mounted on a wall, or held by the user, and is the device
immediately in front of the user or across the room?), and the social context of the
device's use (is the device meant to be used in a private office, a meeting room, a busy
street, or a living room?). On the other hand, applications offered across a number of
devices need to offer uniform interfaces, both so that users can quickly learn to use a
familiar application on new devices, and so that a given application can retain its identity
and recognizability, regardless of the device on which it is operating.
Development of systems meeting the described requirements will involve user testing
and research into design of displays and input devices, as well as into design of effective
interfaces, but some systems have already begun to address these problems. Some
browsers for the World-Wide Web attempt to offer interfaces that are appropriate to the
devices on which they run and yet offer some uniformity. At times this can be difficult.
For example, the frames feature of HTML causes a browser to attempt to divide up a
user's display without any knowledge of the characteristics of that display. Although
building applications that adapt their interfaces to the characteristics of the device on
which they are running is one potential direction of research in this area, perhaps a more
promising one is to separate the interface from the application and give the responsibility
of maintaining the interface to the device itself. A standard set of protocols would allow
the application to negotiate the setup of an interface, and later to interact with that
interface and, indirectly, with the user. Such multimodal architectures could address the
problems of generating an appropriate interface, as well as providing better support for
users with specific disabilities. The architectures could also be distributed, and the
building blocks of forthcoming distributed applications could become accessible from
assorted computational devices.
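The interface-negotiation idea described above can be illustrated with a toy sketch. Everything here is hypothetical — the abstract_ui dictionary, the render function, and the device capability flags are invented for illustration — but it shows how one abstract interface description might be realized differently by devices with different capabilities:

```python
# Hypothetical sketch: the application describes its interface abstractly,
# and each device renders it according to its own capabilities, rather
# than the application hard-coding a layout for every device.

abstract_ui = {
    "title": "Mail",
    "actions": ["compose", "reply", "delete"],
}

def render(ui, device):
    """Device-side rendering: pick a concrete presentation per capability."""
    if device["display"] == "large":
        # A desktop can afford a toolbar of labeled buttons.
        return [f"[button:{a}]" for a in ui["actions"]]
    elif device["has_speech"]:
        # A phone or appliance may expose the same actions by voice.
        return [f"say '{a}'" for a in ui["actions"]]
    else:
        # A tiny display falls back to a numbered menu.
        return [f"{i + 1}. {a}" for i, a in enumerate(ui["actions"])]

desktop = {"display": "large", "has_speech": False}
phone = {"display": "small", "has_speech": True}

print(render(abstract_ui, desktop))
print(render(abstract_ui, phone))
```

The point of the sketch is the separation: the application never learns the display size; the device owns the mapping from abstract actions to widgets, menus, or speech.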
• Speed, Size, and Bandwidth: The rate of increase of processor speed and storage
(transistor density of semiconductor chips doubles roughly every 18 months according to
Moore's law) suggests a bright future for interactive technologies. An important
constraint on utilizing the full power afforded by these technological advances, however,
may be network bandwidth. Given the overwhelming trends towards global networked
computing, and even the network as computer, the implications of limited bandwidth
deserve careful scrutiny. The bottleneck is the "last mile" connecting the Internet to
individual homes and small offices. Individuals who do not get access through large
employers may be stuck at roughly the present bandwidth rate (28.8 kilobits per
second) at least until the turn of the century. The rate needed for delivery of television-quality
video, one of the promises of the National Information Infrastructure, is 4-6
megabits per second, many times that amount. What are the implications for strategic HCI research
of potentially massive local processing power together with limited bandwidth?
Increases in processor speed and memory suggest that if the information can be collected
and cached from the network and/or local sources, local interactive techniques based on
signal processing and work context could be utilized to the fullest. With advances in
speech and video processing, interfaces that actively watch, listen, catalog, and assist
become possible. With increased CPU speed we might design interactive techniques
based on work context rather than isolated event handling. Fast event dispatch becomes
less important than helpful action. Tools might pursue multiple redundant paths, leaving
the user to choose and approve rather than manually specify. We can afford to "waste"
time and space on indexing information and tasks that may never be used, solely for the
purpose of optimizing user effort. With increased storage capacity it becomes potentially
possible to store every piece of interactive information that a user or even a virtual
community ever sees. The processes of sifting, sorting, finding, and arranging increase in
importance relative to the editing and browsing that characterize today's interfaces.
When it is physically possible to store every paper, e-mail, voice-mail and phone
conversation in a user's working life, the question arises of how to provide effective
access.
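The bandwidth gap in this section can be made concrete with a quick back-of-the-envelope calculation, using the figures quoted above: a 28.8 kilobit-per-second modem (28,800 bits per second) against the low end of the 4-6 megabit-per-second rate needed for television-quality video:

```python
# Rough comparison of the "last mile" bottleneck described above.
# Figures are illustrative, taken directly from the text.

MODEM_BPS = 28_800          # late-1990s dial-up modem: 28.8 kilobits/s
VIDEO_BPS = 4_000_000       # low end of the 4-6 megabit/s TV-quality rate

# Seconds of modem time needed to carry one second of TV-quality video:
ratio = VIDEO_BPS / MODEM_BPS
print(f"One second of video needs {ratio:.0f} seconds over the modem")
# prints: One second of video needs 139 seconds over the modem
```

In other words, the modem is more than two orders of magnitude too slow for live video, which is why the text emphasizes local caching and processing over raw delivery.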
• Speech, Handwriting, Natural Language, and Other Modalities: The use of speech
will increase the need to allow user-centered presentation of information. Where the
form and mode of the output generated by computer-based systems is currently defined
by the system designer, a new trend may be to increasingly allow the user to determine
the way in which the computer will interact and to support multiple modalities at the
same time. For instance, the user may determine that in a given situation, textual natural
language output is preferred to speech, or that pictures may be more appropriate than
words. These distinctions will be made dynamically, based on the abilities of the user or
the limitations of the presentation environment. As the computing environment used to
present data becomes distinct from the environment used to create or store information,
interface systems will need to support information adaptation as a fundamental property
of information delivery.
• 3D and Virtual Reality: Another trend is the migration from two-dimensional
presentation space (or a 2 1/2 dimensional space, in the case of overlapping windows) to
three dimensional spaces. The beginning of this in terms of a conventional presentation
environment is the definition of the Virtual Reality Modeling Language (VRML). Further
evidence is the use of integrated 3D input and output control in virtual reality systems.
The notions of selecting and interacting with information will need to be revised, and
techniques for navigation through information spaces will need to be radically altered
from the present page-based models. Three-dimensional technologies offer significant
opportunities for human-computer interfaces. Application areas that may benefit from
three-dimensional interfaces include training and simulation, as well as interactive
exploration of complex data environments.
A central aspect of three-dimensional interfaces is "near-real-time" interactivity, the
ability for the system to respond quickly enough that the effect of direct manipulation is
achieved. Near-real-time interactivity implies strong performance demands that touch on
all aspects of an application, from data management through computation to graphical
rendering. Designing interfaces and applications to meet these demands in an
application-independent manner presents a major challenge to the HCI community.
Maintaining the required performance in the context of an unpredictable user-configured
environment implies a "time-critical" capability, where the system automatically and
gracefully degrades quality in order to maintain performance. The design of general
algorithms for time-critical applications is a new area and a significant challenge.
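As a rough illustration of such a time-critical capability, the following sketch monitors how long each frame takes and adjusts a detail level to stay within an interactive budget. The frame budget, detail levels, and the render_frame stub are all hypothetical, chosen only to make the feedback loop visible:

```python
# Toy sketch of time-critical graceful degradation: measure each frame,
# lower rendering quality when over budget, restore it when comfortably under.

import time

FRAME_BUDGET = 1.0 / 30  # target: at least 30 frames per second

def render_frame(detail_level):
    # Stand-in for real rendering work; higher detail costs more time.
    time.sleep(detail_level * 0.01)

detail = 5  # start at the highest of 5 hypothetical detail levels
for frame in range(10):
    start = time.perf_counter()
    render_frame(detail)
    elapsed = time.perf_counter() - start
    if elapsed > FRAME_BUDGET and detail > 1:
        detail -= 1          # over budget: degrade quality gracefully
    elif elapsed < FRAME_BUDGET / 2 and detail < 5:
        detail += 1          # comfortably under budget: restore quality

print(f"settled at detail level {detail}")
```

The interesting design question, as the text notes, is doing this in an application-independent way: real systems must decide *what* to degrade (geometry, texture, update rate) rather than a single scalar "detail" knob.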
CHAPTER TWO
CURRENT DEVELOPMENT
The current development of HCI is focused on advanced user interface design, human
perception and cognitive science, artificial intelligence, and virtual reality.
Human Perception and Cognitive Science
Why do we always need to type into the computer in order for it to do something for us?
A very active subfield of HCI these days is human perception and cognitive science. The
goal is to enable computers to recognize human actions the same way humans perceive
things. Key subfields include natural language and speech recognition, gesture
recognition, and others. Natural language interfaces enable the user to communicate with the
computer in their natural languages. Some applications of such interfaces are database
queries, information retrieval from texts and so-called expert systems. Current advances
in recognition of spoken language improve the usability of many types of natural
language systems. Communication with computers using spoken language will have a
lasting impact upon the work environment, opening up completely new areas of
application for information technology. In recent years a substantial amount of research
has been invested in applying the computer science tool of computational complexity
theory to natural language and linguistic theory, and scientists have found that Word
Grammar Recognition is computationally intractable (NP-hard, in fact). Thus, we still
have a long way to go before we can conquer this important field of study.
Reasoning, Intelligent Filtering, and Artificial Intelligence
To realize the full potential of HCI, the computer has to share the reasoning involved in
interpreting and intelligently filtering the input provided by the human to the computer
or, conversely, the information presented to the human. Currently, many scientists and
researchers are involved in developing the scientific principles underlying the reasoning
mechanism. The approaches vary widely, but all are based on fundamental
directions such as case-based reasoning, learning, computer-aided
instruction, natural language processing, and expert systems. Among these, computer-aided
instruction (CAI) also has its origins in the 1960s. These systems were designed to
tutor users, thus augmenting, or perhaps substituting for, human teachers. Expert systems
are software tools that attempt to model some aspect of human reasoning within a
domain of knowledge. Initially, expert systems relied on human experts for their
knowledge (an early success in this field was MYCIN [11], developed in the early 1970s
under Edward Shortliffe). Now, scientists are focusing on building expert systems that
do not rely on human experts.
Virtual Reality
From the days when we used wires and punch cards to input data to the computer and received
output via blinking lights, to today's easy-to-use, easy-to-manipulate GUIs, the
advancement in the user interface has been astonishing. However, many novice computer users
still find computers hard to access; moreover, even for the experienced user,
current computer interfaces are still restrictive in some sense: one cannot
communicate with computers in all the ways one wants. A complete theory of
communication must be able to account for all the ways that people communicate, not
just natural language. Therefore, virtual reality has become the ultimate goal of computer
interface design. Virtual reality has its origins in the 1950s, when the first video-based
flight simulator systems were developed for the military. These days, it receives more
and more attention not only from scientists but from the general population. (The popularity
of the movie "The Matrix" is one demonstration.)
Up-and-Coming Areas
Gesture Recognition: The first pen-based input device, the RAND tablet, was funded
by ARPA. Sketchpad used light-pen gestures (1963). Teitelman in 1964 developed the
first trainable gesture recognizer. A very early demonstration of gesture recognition was
Tom Ellis' GRAIL system on the RAND tablet (1964, ARPA funded). It was quite
common in light-pen-based systems to include some gesture recognition, for example in
the AMBIT/G system (1968 -- ARPA funded). A gesture-based text editor using proofreading symbols was developed at CMU by Michael Coleman in 1969. Bill Buxton at the
University of Toronto has been studying gesture-based interactions since 1980. Gesture
recognition has been used in commercial CAD systems since the 1970s, and came to
universal notice with the Apple Newton in 1992.
Multi-Media: The FRESS project at Brown used multiple windows and integrated text
and graphics (1968, funding from industry). The Interactive Graphical Documents
project at Brown was the first hypermedia (as opposed to hypertext) system, and used
raster graphics and text, but not video (1979-1983, funded by ONR and NSF). The
Diamond project at BBN (starting in 1982, DARPA funded) explored combining
multimedia information (text, spreadsheets, graphics, speech). The Movie Manual at the
Architecture Machine Group (MIT) was one of the first to demonstrate mixed video and
computer graphics in 1983 (DARPA funded).
3-D: The first 3-D system was probably Timothy Johnson's 3-D CAD system
mentioned above (1963, funded by the Air Force). The "Lincoln Wand" by Larry
Roberts was an ultrasonic 3D location sensing system, developed at Lincoln Labs (1966,
ARPA funded). That system also had the first interactive 3-D hidden line elimination. An
early use was for molecular modeling. The late 60's and early 70's saw the flowering of
3D raster graphics research at the University of Utah with Dave Evans, Ivan Sutherland,
Romney, Gouraud, Phong, and Watkins, much of it government funded. Also, the
military-industrial flight simulation work of the 60's - 70's led the way to making 3-D
real-time with commercial systems from GE, Evans & Sutherland, and Singer/Link (funded by
NASA, Navy, etc.). Another important center of current research in 3-D is Fred Brooks'
lab at UNC.
Virtual Reality and "Augmented Reality": The original work on VR was performed
by Ivan Sutherland when he was at Harvard (1965-1968, funding by Air Force, CIA, and
Bell Labs). Very important early work was done by Tom Furness when he was at Wright-Patterson
AFB. Myron Krueger's early work at the University of Connecticut was
influential. Fred Brooks's and Henry Fuchs's groups at UNC did a lot of early research,
including the study of force feedback (1971, funding from US Atomic Energy
Commission and NSF). Much of the early research on head-mounted displays and on the
Data Glove was supported by NASA.
Computer Supported Cooperative Work. Doug Engelbart's 1968 demonstration of
NLS included the remote participation of multiple people at various sites (funding from
ARPA, NASA, and Rome ADC). Licklider and Taylor predicted on-line interactive
communities in a 1968 article and speculated about the problem of access being limited
to the privileged. Electronic mail, still the most widespread multi-user software, was
enabled by the ARPAnet, which became operational in 1969 and by the Ethernet from
Xerox PARC in 1973. An early computer conferencing system was Turoff's EIES system
at the New Jersey Institute of Technology (1975).
Natural language and speech: The fundamental research for speech and natural
language understanding and generation has been performed at CMU, MIT, SRI, BBN,
IBM, AT&T Bell Labs, and Bellcore, much of it government funded. Surveys of this
early work are available in the literature.
New Frontiers
Now let us take a look at some of the newest developments in HCI today.
Intelligent Room
The Intelligent Room is a project of the MIT Artificial Intelligence Lab. The goal of the
project, in the words of Michael H. Coen of the MIT AI Lab, is "creating spaces in which
computation is seamlessly used to enhance ordinary, everyday activities." The researchers want to
incorporate computers into the real world by embedding them in regular environments,
such as homes and offices, and allow people to interact with them the way they do with
other people. The user interfaces of these systems are not menus, mice, and keyboards
but instead gesture, speech, affect, context, and movement. Their applications are not
word processors and spreadsheets, but smart homes and personal assistants. "Instead of
making computer-interfaces for people, it is of more fundamental value to make people-interfaces
for computers."
They have built two Intelligent Rooms in the laboratory. They give the rooms cameras
for eyes and microphones for ears to make accessible the real-world phenomena
occurring within them. A multitude of computer vision and speech understanding
systems then help interpret human-level phenomena, such as what people are saying,
where they are standing, etc. By embedding user-interfaces this way, the fact that people
tend to point at what they are speaking about is no longer meaningless from a
computational viewpoint, and the researchers can then build systems that make use of this
information. Coupled with these natural interfaces is the expectation that the systems
are not only highly interactive (they talk back when spoken to) but, more importantly,
useful during ordinary activities. They enable tasks historically outside the
normal range of human-computer interaction by connecting computers to phenomena
(such as someone sneezing or walking into a room) that have traditionally been outside
the purview of contemporary user-interfaces. Thus, in the future, you can imagine that
elderly people's homes would call an ambulance if they saw anyone fall down. Similarly,
you can also imagine kitchen cabinets that automatically lock when young children
approach them.
Brain-Machine Interfaces
Scientists are not satisfied with communicating with computers using natural language or
gestures and movements. Instead, they ask why computers cannot simply do
what people have in mind. Out of questions like this come brain-machine
interfaces. Miguel Nicolelis, a Duke University neurobiologist, is one of the leading
researchers in this competitive and highly significant field. Only about a half-dozen
teams around the world are pursuing the same goals: gaining a better
understanding of how the mind works and then using that knowledge to build implant
systems that would make brain control of computers and other machines possible.
Nicolelis terms such systems "hybrid brain-machine interfaces" (HBMIs). Recently,
working with the Laboratory for Human and Machine Haptics at MIT, he was able to
send signals from individual neurons in the brain of Belle, a nocturnal owl monkey, to a robot,
which used the data to mimic the monkey's arm movements in real time. Scientists
predict that Brain-Machine Interfaces will allow human brains to control artificial
devices designed to restore lost sensory and motor functions. Paralysis sufferers, for
example, might gain control over a motorized wheelchair or a prosthetic arm, or perhaps,
even regain control over their own limbs. They believe the brain will prove capable of
readily assimilating human-made devices in much the same way that a musician grows to
feel that an instrument is a part of his or her own body. Ongoing experiments in other labs
are showing that the idea is credible. At Emory University, neurologist Phillip Kennedy
has helped severely paralyzed people communicate via a brain implant that allows them
to move a cursor on a computer screen. However, scientists still know relatively little
about how the electrical and chemical signals emitted by the brain's millions of neurons
let us perceive color and smell, or give rise to the precise movements of professional
dancers. Numerous stumbling blocks remain to be overcome before human brains can
interface reliably and comfortably with artificial devices, or before mind-controlled
prosthetic limbs become a reality. Among the key challenges is developing electrode devices and surgical
methods that will allow safe, long-term recording of neuronal activities.
Conclusion - a look at the future
In conclusion, Human Computer Interaction holds great promise. Exploiting this
tremendous potential can bring profound benefits in all areas of human concern. Just
imagine that one day, we will be able to tell computers to do what we want them to do,
use gestures and hand signals to command them, or directly invoke them through our
thoughts. One day, we will be able to call up an artificial intelligence from the computer,
or better yet, a hologram (YES! I am a diehard Star Trek fan), to perform the tasks that we
cannot accomplish, to aid in emergency situations, or simply to have someone
to talk to. How bright a future that would be, all thanks to the
research yet to be done in the field of Human-Computer Interaction.
CHAPTER THREE
CONCEPT AND DESIGN IN HCI
Design and Evaluation Methods
Design and evaluation methods have evolved rapidly as the focus of human-computer
interaction has expanded. Contributing to this are the versatility of software and the
downward price and upward performance spiral, which continually extend the
applications of software. The challenges overshadow those faced by designers using
previous media and assessment methods. Design and evaluation for a monochrome,
ASCII, stand-alone PC was challenging, and still does not routinely use more than ad
hoc methods and intuition. New methods are needed to address the complexities of
multimedia design, of supporting networked group activities, and of responding to
routine demands for ever-faster turnaround times.
More rapid evaluation methods will remain a focus, manifest in recent work on cognitive
walkthrough, heuristic evaluation, and other modifications of earlier cognitive modeling
and usability engineering approaches. Methods to deal with the greater complexity of
assessing use in group settings are moving from research into the mainstream.
Ethnographic observation, participatory design, and scenario-based design are being
streamlined. Contextual inquiry and design is an example of a method intended to
quickly obtain a rich understanding of an activity and transfer that understanding to all
design team members.
As well as developing and refining the procedures of design and evaluation methods, we
need to understand the conditions under which they work. Are some better for individual
tasks, some excellent for supporting groupware? Are some useful very early in the
conceptual phase of design, others best when a specific interface design has already been
detailed, and some restricted to when a prototype is in existence? In addition, for proven
and promising techniques to become widespread, they need to be incorporated into the
education of UI designers. Undergraduate curricula should require such courses for a
subset of their students; continuing education courses need to be developed to address
the needs of practicing designers.
Tools
All the forms of computer-human interaction discussed here will need to be supported by
appropriate tools. The interfaces of the future will use multiple modalities for input and
output (speech and other sounds, gestures, handwriting, animation, and video), multiple
screen sizes (from tiny to huge), and have an "intelligent" component ("wizards" or
"agents" to adapt the interface to the different wishes and needs of the various users).
The tools used to construct these interfaces will have to be substantially different from
those of today. Whereas most of today's tools support widgets such as menus and
dialog boxes well, these will make up only a tiny fraction of the interfaces of the future. Instead, the tools
will need to access and control in some standard way the main application data structures
and internals, so the speech system and agents can know what the user is talking about
and doing. If the user says "delete the red truck," the speech system needs access to the
objects to see which one is to be deleted. Otherwise, each application will have to deal
with its own speech interpretation, which is undesirable. Furthermore, an agent might
notice that this is the third red truck that was deleted, and propose to delete the rest. If
confirmed, the agent will need to be able to find the rest of the trucks that meet the
criteria. Increasingly, future user interfaces will be built around standardized data
structures or "knowledge bases" to make these facilities available without requiring each
application to rebuild them.
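The role of such a standardized data structure can be sketched as follows. The ObjectStore class and its query API below are hypothetical, invented purely to illustrate how a speech system or agent could resolve "the red truck" and later find "the rest of the trucks that meet the criteria" without any application-specific parsing:

```python
from dataclasses import dataclass

# Hypothetical shared object store: an application registers its domain
# objects here in a standard form, so speech systems and agents can query
# them instead of each application interpreting speech on its own.

@dataclass
class SceneObject:
    kind: str        # e.g. "truck"
    color: str       # e.g. "red"

class ObjectStore:
    def __init__(self):
        self._objects = []

    def add(self, obj):
        self._objects.append(obj)

    def find(self, **criteria):
        """Return all objects whose attributes match the given criteria."""
        return [o for o in self._objects
                if all(getattr(o, k) == v for k, v in criteria.items())]

    def delete(self, obj):
        self._objects.remove(obj)

# An agent resolving "delete the red truck":
store = ObjectStore()
store.add(SceneObject("truck", "red"))
store.add(SceneObject("truck", "blue"))
store.add(SceneObject("car", "red"))

for match in store.find(kind="truck", color="red"):
    store.delete(match)

# Later, the agent can re-run a broader query to propose deleting the rest:
print([(o.kind, o.color) for o in store.find()])
# prints: [('truck', 'blue'), ('car', 'red')]
```

The same store would let an agent notice patterns ("this is the third red truck deleted") by observing queries and deletions, rather than by re-parsing each application's private data.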
These procedures should be supported by the system-building tools themselves. This
would make the evaluation of ideas extremely easy for designers, allowing ubiquitous
evaluation to become a routine aspect of system design.
Concepts of User Interface Design
Learnability vs. Usability
Many people consider the primary criterion for a good user interface to be the degree to
which it is easy to learn. This is indeed a laudable quality of any user interface, but it is
not necessarily the most important.
The goal of the user interface should be foremost in the design process. Consider the
example of a visitor information system located on a kiosk. In this case it makes perfect
sense that the primary goal for the interface designers should be ease of operation for the
first-time user. The more the interface walks the user through the system step by step, the
more successful the interface would be.
In contrast, consider a data entry system used daily by an office of heads-down
operators. Here the primary goal should be that the operators can input as much
information as possible as efficiently as possible. Once the users have learned how to use
the interface, anything intended to make first-time use easier will only get in the way.
User interface design is not a "one size fits all" process. Every system has its own
considerations and accompanying design goals. The Requirements Phase is designed to
elicit from the design team the kind of information that should make these goals clear.
Metaphors and Idioms
The True Role of Metaphors in the GUI
When the GUI first entered the market, it was heralded most of all for its use of
metaphors. Careful consideration of what really made the GUI successful, however,
would appear to indicate that the use of metaphors was actually a little further down in
the list. Metaphors were really nothing new. The term computer "file" was chosen as a
metaphor for a collection of separate but related items held in a single container. This
term dates back to the very early days of computers.
The single most significant aspect of the GUI was the way in which it presented all
possible options to the users rather than requiring them to memorize commands and enter
them without error. This has nothing to do with metaphor and everything to do with
focusing the user interface on the needs of the user rather than mandating that the user
conform to the needs of the computer. The visual aspect of the GUI was also a
tremendous advancement. People often confuse this visual presentation with pure
metaphor, but closer inspection reveals that this is not necessarily the case. The
"desktop" metaphor was the first thing to hit users of the GUI. Since it was a global
metaphor and the small pictures of folders, documents, and diskettes played directly into
it, people bought the entire interface as one big metaphor. But there are significant
aspects of the GUI that have nothing to do with metaphor.
Metaphors vs. Idioms
If someone says that a person "wants to have his cake and eat it too," we can intuit the
meaning of the expression through its metaphoric content. The cake is a metaphor for
that which we desire, and the expectation of both possessing it and consuming it is
metaphoric for the assumption that acquisition of our desires comes at no cost. But if
someone says that his pet turtle "croaked," it is not possible to intuit the meaning through
the metaphoric content of the expression. The expression "croaked" is an idiom. We
know instantly that the turtle didn't make a funny noise but rather that it died. The
meaning of the idiom must be learned, but it is learned quickly and, once learned,
retained indefinitely.
Most visual elements of the GUI are better thought of as idioms. A scroll bar, for
example, is not a metaphor for anything in the physical world. It is an entirely new
construct, yet it performs an obvious function, its operation is easily mastered, and users
easily remember how it works. It is the visual aspect of the scroll bar that allows it to be
learned so quickly. Users operate it with visual clues rather than remembering the keys
for line up, line down, page up, page down, etc.
Metaphors Can Hinder As Well As Help
The use of metaphor can be helpful when it fits well into a situation, but it is not a
panacea and is not guaranteed to add value. The use of icons as metaphors for functions
is a good example. It is a gamble whether someone will understand the connection
between an icon and its function. Anyone who has played Pictionary knows that the
meaning of a picture is not always clear.
Consider the Microsoft Word 5.0 toolbar. Some icons are readily identifiable, some are
not. The meaning of the identifiable icons will likely be gleaned at a glance, but even that
is not guaranteed. The unidentifiable icons, however, can be utterly perplexing, and rather
than helping they can create confusion and frustration. And with so many pictographs
crammed into such a small space, the whole thing reads like a row of enigmatic, ancient
Egyptian hieroglyphs.
The Netscape toolbar, by contrast, can be considered to be much more graceful and
useful. The buttons are a bit larger, which makes them generally more readable. Their
added size also allows the inclusion of text labels indicating the command to which the
icon corresponds. Once the meaning of each icon has been learned, the icon can serve
as a visual mnemonic; until then, the text label clearly and unambiguously conveys the
function the button will initiate.
The Netscape toolbar admittedly consumes more valuable window real estate than the
Microsoft Word toolbar does. There are keystroke shortcuts for every button, however,
and users who have mastered them can easily hide the toolbar from view. Users who
prefer to use the toolbar are probably willing to sacrifice that small bit of real estate in
order to have a toolbar that is presentable and easy to use.
The "Global Metaphor" Quagmire
One major pitfall into which metaphors can lead us is the "Global Metaphor," which is a
metaphor that is intended to encompass an entire application. The "desktop" concept is
an example of a global metaphor.
The global metaphor becomes a quagmire when reality begins to diverge from the
metaphor. Consider carefully the desktop metaphor. It can be seen how it deviates from
reality immediately. The trash can is a wonderful metaphor for the deletion function, but
trash cans are generally not situated on the top of a desk.
The use of the trash can to eject a disk is a perfect example of contorting the metaphor to
accommodate the divergence from reality. The expectation is that "trashing" a disk will
delete its contents, yet the interface designers needed a way to eject a disk and the trash
can came closer than anything else. Once learned it becomes an idiom that works fine,
but it is initially counter-intuitive to the point that it is shocking.
The vertical aspect of the desktop also subverts the metaphor. It's closer to a refrigerator
on which one can randomly place differently shaped magnets, or the old-fashioned
displays on which TV weathermen placed various symbols. The fact that the desktop
metaphor has to be explained to first-time users is an indication that it might not be
terribly intuitive.
The global metaphor is an example of the "bigger is better" mentality. Metaphors are
perceived as being useful, so some people assume that the more all-encompassing a
metaphor is the more useful it will be. As in all other situations, the usefulness of a
global metaphor is dictated by the overall goals of the interface. If the goal of the
interface is to present a non-threatening face on a system that will be used primarily by
non-technical first-time users, a global metaphor might be useful. But if the goal of the
interface is to input large quantities of data quickly and effectively, a global interface
might be an enormous hindrance.
Don't Throw The Baby Out With The Bath Water
While metaphors aren't always as useful as other solutions, it is important to note that in
the right situation they can be a vital part of a quality user interface. The folder is a
particularly useful and successful metaphor. Its purpose is immediately apparent, and by
placing one folder inside another the user creates a naturally intuitive hierarchy. The
counterpart in the character user interface is the directory/subdirectory construct. This
has no clear correspondence to anything in the physical world, and many non-technical
people have difficulty grasping the concept.
The bottom line is that if a metaphor works naturally, by all means use it. But at the first
hint that the metaphor is not clearly understood, or has to be contorted in order to
accommodate reality, consider carefully whether it will really help or not.
Intuitiveness
It is generally perceived that the most fundamental quality of any good user interface
should be that it is intuitive. The problem is that "intuitive" means different things to
different people. To some an intuitive user interface is one that users can figure out for
themselves. There are some instances where this is helpful, but generally the didactic
elements geared for the first-time user will hamper the effectiveness of intermediate or
advanced users.
A much better definition of an intuitive user interface is one that is easy to learn. This
does not mean that no instruction is required, but that it is minimal and that users can
"pick it up" quickly and easily. First-time users might not intuit how to operate a scroll
bar, but once it is explained they generally find it to be an intuitive idiom.
Icons, when clearly unambiguous, can help to make a user interface intuitive. But the
user interface designer should never overlook the usefulness of good old-fashioned text
labels. Icons depicting portrait or landscape orientation, for example, are clearly
unambiguous and perhaps more intuitive than the labels themselves, but without the label
"orientation" they might make no sense at all.
Labels should be concise, cogent, and unambiguous. A good practice is to make labels
conform to the terminology of the business that the application supports. This is a good
way to pack a lot of meaning into a very few words.
Designing intuitive user interfaces is far more an art than a science. It draws more upon
skills of psychology and cognitive reasoning than computer engineering or even graphic
design. The process of Usability Testing, however, can assess the intuitiveness of a user
interface in an objective manner. Designing an intuitive user interface is like playing a
good game of tennis. Instructors can tell you how to do it, but it can only be achieved
through hard work and practice with a lot of wins and losses on the way.
Consistency
Consistency between applications is always good, but within an application it is
essential. The standard GUI design elements go a long way to bring a level of
consistency to every panel, but "look and feel" issues must be considered as well. The
use of labels and icons must always be consistent. The same label or icon should always
mean the same thing, and conversely the same thing should always be represented by the
same label or icon.
In addition to consistency of labeling, objects should also be placed in a consistent
manner. Consider the example of the Employee Essentials Address Update panels
(available through Bear Access).
There is a different panel for every address that can be updated, each with its own set of
fields to be displayed and modified. Note that each panel is clearly labeled, with the label
appearing in the same location on every panel. A button bank appears in the same place
along the left side of every panel. Some buttons must change to accommodate the needs
of any given panel, but positionality was used consistently. The closer buttons are to the
top the less likely they are to change, and the closer to the bottom the more likely.
Note especially the matrix of buttons at the top left corner of every panel. These buttons
are the same in every panel of the entire Employee Essentials application. They are
known as "permanent objects." Early navigators used stars and constellations as
unchanging reference points around which they could plot their courses. Similarly,
modern aviation navigators use stationary radar beacons. They know that wherever the
plane is, they can count on the radar beacon always being in the same place.
User interface designers should always provide permanent objects as unchanging
reference points around which the users can navigate. If they ever get lost or disoriented,
they should be able to quickly find the permanent objects and from there get to where
they need to be. On the Macintosh, the apple menu and applications menu are examples
of permanent objects. No matter what application the user is in, those objects will appear
on the screen.
Almost all Macintosh applications provide "File" and "Edit" as the first two pull-down
menus. The "File" menu generally has "New" "Open" "Close" "Save" and "Save As" as
the first selections in the menu, and "Quit" as the last selection. The "Edit" menu
generally has "Cut," "Copy," and "Paste" as the first selections. The ubiquity of these
conventions has caused them to become permanent objects. The users can count on
finding them in virtually all circumstances, and from there do what they need to do.
Bear Access itself is becoming a permanent object at Cornell. If a user is at an unfamiliar
workstation, all he or she needs to do is locate Bear Access, and from there an extensive
suite of applications will be available.
Simplicity
The complexity of computers and the information systems they support often causes us
to overlook Occam's Razor: the principle that the most graceful solution to any problem
is usually the simplest one.
A good gauge of simplicity is often the number of panels that must be displayed and the
number of mouse clicks or keystrokes that are required to accomplish a particular task.
All of these should be minimized. The fewer things users have to see and do in order to
get their work done, the happier and more effective they will be.
A good example of this is the way in which the user sets the document type in Microsoft
Word version 5.0 as compared to version 4.0. In version 4.0, the user clicks a button on
the save dialog that presents another panel in which there is a selection of radio buttons
indicating all the valid file types. In version 5.0, there is simply a popup list on the save
dialog. This requires fewer panels to be displayed and fewer mouse clicks to be made,
and yet accomplishes exactly the same task.
A pitfall that should be avoided is "featuritis," providing an over-abundance of features
that do not add value to the user interface. New tools that are available to developers
allow all kinds of things to be done that weren't possible before, but it is important not to
add features just because it's possible to do so. The indiscriminate inclusion of features
can confuse the users and lead to "window pollution." Features should not be included on
a user interface unless there is a compelling need for them and they add significant value
to the application.
Prevention
A fundamental tenet of graphic user interfaces is that it is preferable to prevent users
from performing an inappropriate task in the first place rather than allowing the task to
be performed and presenting a message afterwards saying that it couldn't be done. This is
accomplished by disabling, or "graying out" certain elements under certain conditions.
Consider the average save dialog. A document cannot be saved if it has not been given a
name. Note how the Save button is disabled when the name field is blank, but is enabled
when a name has been entered.
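The enable/disable rule behind such a dialog can be isolated from any particular GUI toolkit. Here is a minimal sketch in Python; the function name is illustrative, not taken from any real toolkit:

```python
def save_button_enabled(name_field: str) -> bool:
    """Prevention: decide whether the Save button should be enabled.

    A document cannot be saved without a name, so rather than letting
    the user click Save and then reporting a failure afterwards, the
    button is simply disabled (grayed out) while the name is blank.
    """
    return name_field.strip() != ""
```

A GUI would call this on every keystroke in the name field and gray the button out whenever it returns False.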
Forgiveness
One of the advantages of graphic user interfaces is that with all the options plainly laid
out for users, they are free to explore and discover things for themselves. But this
requires that there always be a way out if they find themselves somewhere they realize
they shouldn't be, and that special care is taken to make it particularly difficult to "shoot
themselves in the foot." A good tip to keep users from inadvertently causing damage is to
avoid the use of the Okay button in critical situations. It is much better to have button
labels that clearly indicate the action that will be taken.
Consider the example when the user closes a document that contains changes that have
not been saved. It can be very misleading to have a message that says "Continue without
saving?" and a default button labeled "Okay." It is much better to have a dialog that says
"Document has been changed" and a default button labeled "Save", with a "Don't save"
button to allow the user not to save changes if that is, in fact, the desired action.
Likewise, it can be helpful in potentially dangerous situations to have the Cancel button
be the default button so that it must be a deliberate action on the part of the user to
execute the function. An example is a confirmation dialog when a record is being
deleted.
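These rules can be captured as a small function that builds the dialog as plain data; a sketch under the assumption that the dialog is described by a dictionary (the field names are hypothetical):

```python
def close_dialog(document_changed):
    """Forgiveness: describe the dialog shown when a document closes.

    Button labels name the action that will be taken ("Save",
    "Don't Save") instead of a vague "Okay", and the safe,
    non-destructive action is the default.
    """
    if not document_changed:
        return None  # nothing can be lost, so close silently
    return {
        "message": "Document has been changed",
        "buttons": ["Save", "Don't Save", "Cancel"],
        "default": "Save",  # the safe choice is the default
    }
```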
Aesthetics
Finally, it is important that a user interface be aesthetically pleasing. It is possible for a
user interface to be intuitive, easy to use, and efficient and still not be terribly nice to
look at. While aesthetics do not directly impact the effectiveness of a user interface, users
will be happier and therefore more productive if they are presented with an attractive
user interface.
CHAPTER FOUR
Principles for User-Interface Design.
This section represents a compilation of fundamental principles for designing user
interfaces, which have been drawn from various books on interface design, as well as my
own experience. Most of these principles can be applied to either command-line or
graphical environments. I welcome suggestions for changes and additions -- I would like
this to be viewed as an "open-source" evolving section.
The principle of user profiling
-- Know who your user is.
Before we can answer the question "How do we make our user-interfaces better", we
must first answer the question: Better for whom? A design that is better for a technically
skilled user might not be better for a non-technical businessman or an artist.
One way around this problem is to create user models. [TOG91] has an excellent chapter
on brainstorming towards creating "profiles" of possible users. The result of this process
is a detailed description of one or more "average" users, with specific details such as:
 What are the user's goals?
 What are the user's skills and experience?
 What are the user's needs?
Armed with this information, we can then proceed to answer the question: How do we
leverage the user's strengths and create an interface that helps them achieve their goals?
In the case of a large general-purpose piece of software such as an operating system,
there may be many different kinds of potential users. In this case it may be more useful
to come up with a list of user dichotomies, such as "skilled vs. unskilled", "young vs.
old", etc., or some other means of specifying a continuum or collection of user types.
Another way of answering this question is to talk to some real users. Direct contact
between end-users and developers has often radically transformed the development
process.
The principle of metaphor
-- Borrow behaviors from systems familiar to your users.
Frequently a complex software system can be understood more easily if the user
interface is depicted in a way that resembles some commonplace system. The ubiquitous
"Desktop metaphor" is an overused and trite example. Another is the tape deck metaphor
seen on many audio and video player programs. In addition to the standard transport
controls (play, rewind, etc.), the tape deck metaphor can be extended in ways that are
quite natural, with functions such as time-counters and cueing buttons. This concept of
"extendibility" is what distinguishes a powerful metaphor from a weak one.
There are several factors to consider when using a metaphor:
 Once a metaphor is chosen, it should be spread widely throughout the interface,
rather than used once at a specific point. Even better would be to use the same
metaphor spread over several applications (the tape transport controls described
above is a good example.) Don't bother thinking up a metaphor which is only
going to apply to a single button.
 There's no reason why an application cannot incorporate several different
metaphors, as long as they don't clash. Music sequencers, for example, often
incorporate both "tape transport" and "sheet music" metaphors.
 Metaphor isn't always necessary. In many cases the natural function of the
software itself is easier to comprehend than any real-world analog of it. Don't
strain a metaphor in adapting it to the program's real function. Nor should you
strain the meaning of a particular program feature in order to adapt it to a
metaphor.
 Incorporating a metaphor is not without certain risks. In particular, whenever
physical objects are represented in a computer system, we inherit not only the
beneficial functions of those objects but also the detrimental aspects.
 Be aware that some metaphors don't cross cultural boundaries well. For example,
Americans would instantly recognize the common U.S. Mailbox (with a rounded
top, a flat bottom, and a little red flag on the side), but there are no mailboxes of
this style in Europe.
The principle of feature exposure
-- Let the user see clearly what functions are available
Software developers tend to have little difficulty keeping large, complex mental models
in their heads. But not everyone prefers to "live in their heads" -- instead, they prefer to
concentrate on analyzing the sensory details of the environment, rather than spending
large amounts of time refining and perfecting abstract models. Both types of personality
(labeled "Intuitive" and "Sensable" in the Myers-Briggs personality classification) can be
equally intelligent, but they focus on different aspects of life. According to some
psychological studies, "Sensables" outnumber "Intuitives" in the general population by
about three to one.
Intuitives prefer user interfaces that utilize the power of abstract models -- command
lines, scripts, plug-ins, macros, etc. Sensables prefer user interfaces that utilize their
perceptual abilities -- in other words, they like interfaces where the features are "up
front" and "in their face". Toolbars and dialog boxes are an example of interfaces that are
pleasing to this personality type.
This doesn't mean that you have to make everything a GUI. What it does mean, for both
GUI and command line programs, is that the features of the program need to be easily
exposed so that a quick visual scan can determine what the program actually does. In
some cases, such as a toolbar, the program features are exposed by default. In other
cases, such as a printer configuration dialog, the exposures of the underlying printer state
(i.e. the buttons and controls which depict the conceptual printing model) are contained
in a dialog box which is brought up by a user action (a feature which is itself exposed in
a menu).
Of course, there may be cases where you don't wish to expose a feature right away,
because you don't want to overwhelm the beginning user with too much detail. In this
case, it is best to structure the application like the layers of an onion, where peeling away
each layer of skin reveals a layer beneath. There are various levels of "hiding": Here's a
partial list of them in order from most exposed to least exposed:
 Toolbar (completely exposed)
 Menu item (exposed by trivial user gesture)
 Submenu item (exposed by somewhat more involved user gesture)
 Dialog box (exposed by explicit user command)
 Secondary dialog box (invoked by button in first dialog box)
 "Advanced user mode" controls -- exposed when user selects "advanced" option
 Scripted functions
The above notwithstanding, in no case should the primary interface of the application be
a reflection of the true complexity of the underlying implementation. Instead, both the
interface and the implementation should strive to match a simplified conceptual model
(in other words, the design) of what the application does. For example, when an error
occurs, the explanation of the error should be phrased in a way that relates to the current
user-centered activity, and not in terms of the low-level fault that caused the error.
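One way to honor this is to translate low-level faults into messages phrased around the user's current activity. A sketch in Python; the mapping and the wording are illustrative only:

```python
def user_error_message(fault, activity):
    """Phrase an error in terms of the user's current activity,
    not the low-level fault that caused it."""
    reasons = {
        FileNotFoundError: "the file could not be found",
        PermissionError: "you do not have permission to access it",
        IsADirectoryError: "that name refers to a folder, not a document",
    }
    # Fall back to a generic phrase for faults we cannot explain.
    reason = reasons.get(type(fault), "an unexpected problem occurred")
    return f"Could not {activity}: {reason}."
```

The caller passes in its user-centered activity ("open the document", "print the report"), so the same fault is explained differently depending on what the user was trying to do.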
The principle of coherence
-- The behavior of the program should be internally and externally consistent
There's been some argument over whether interfaces should strive to be "intuitive", or
whether an intuitive interface is even possible. However, it is certainly arguable that an
interface should be coherent -- in other words logical, consistent, and easily followed.
("Coherent" literally means "stick together", and that's exactly what the parts of an
interface design should do.)
Internal consistency means that the program's behaviors make "sense" with respect to
other parts of the program. For example, if one attribute of an object (e.g. color) is
modifiable using a pop-up menu, then it is to be expected that other attributes of the
object would also be editable in a similar fashion. One should strive towards the
principle of "least surprise".
External consistency means that the program is consistent with the environment in which
it runs. This includes consistency with both the operating system and the typical suite of
applications that run within that operating system. One of the most widely recognized
forms of external coherence is compliance with user-interface standards. There are many
others, however, such as the use of standardized scripting languages, plug-in
architectures or configuration methods.
The principle of state visualization
-- Changes in behavior should be reflected in the appearance of the program
Each change in the behavior of the program should be accompanied by a corresponding
change in the appearance of the interface. One of the big criticisms of "modes" in
interfaces is that many of the classic "bad example" programs have modes that are
visually indistinguishable from one another.
Similarly, when a program changes its appearance, it should be in response to a behavior
change; a program that changes its appearance for no apparent reason will quickly teach
the user not to depend on appearances for clues to the program's state.
One of the most important kinds of state is the current selection, in other words the
object or set of objects that will be affected by the next command. It is important that this
internal state be visualized in a way that is consistent, clear, and unambiguous. For
example, one common mistake seen in a number of multi-document applications is to
forget to "dim" the selection when the window goes out of focus. The result of this is that
a user, looking at several windows at once, each with a similar-looking selection, may be
confused as to exactly which selection will be affected when they hit the "delete" key.
This is especially true if the user has been focusing on the selection highlight, and not on
the window frame, and consequently has failed to notice which window is the active one.
(Selection rules are one of those areas that are covered poorly by most UI style
guidelines, which tend to concentrate on "widgets", although the Mac and Amiga
guidelines each have a chapter on this topic.)
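The dimming rule can be stated in a few lines, independent of any widget set. A sketch, with class and method names invented for illustration:

```python
class DocumentWindow:
    """State visualization: the selection highlight reflects focus."""

    def __init__(self):
        self.focused = False
        self.selection = []

    def selection_style(self):
        # Dim the highlight when the window is out of focus, so the
        # user can always see which selection the next command (such
        # as the "delete" key) will actually affect.
        if not self.selection:
            return "none"
        return "highlight" if self.focused else "dimmed"
```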
The principle of shortcuts
-- Provide both concrete and abstract ways of getting a task done
Once a user has become experienced with an application, she will start to build a mental
model of that application. She will be able to predict with high accuracy what the results
of any particular user gesture will be in any given context. At this point, the program's
attempts to make things "easy" by breaking up complex actions into simple steps may
seem cumbersome. Additionally, as this mental model grows, there will be less and less
need to look at the "in your face" exposure of the application's feature set. Instead,
pre-memorized "shortcuts" should be available to allow rapid access to more powerful
functions.
There are various levels of shortcuts, each one more abstract than its predecessor. For
example, in the emacs editor commands can be invoked directly by name, by menu bar,
by a modified keystroke combination, or by a single keystroke. Each of these is more
"accelerated" than its predecessor.
There can also be alternate methods of invoking commands that are designed to increase
power rather than to accelerate speed. A "recordable macro" facility is one of these, as is
a regular-expression search and replace. The important thing about these more powerful
(and more abstract) methods is that they should not be the most exposed methods of
accomplishing the task. This is why emacs has the non-regexp version of search assigned
to the easy-to-remember "C-s" key.
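A command registry makes it natural to expose the same function at several levels of acceleration, in the spirit of the emacs example above. A sketch; the registry API is hypothetical:

```python
# Shortcuts: one command, several increasingly abstract ways to reach it.
commands = {}

def register(name, func, menu_path=None, key=None):
    """Bind one function to a name, a menu location, and a keystroke."""
    commands[name] = {"func": func, "menu": menu_path, "key": key}

def invoke_by_name(name):
    # Least accelerated: invoke the command directly by its full name.
    return commands[name]["func"]()

def invoke_by_key(key):
    # Most accelerated: a single keystroke reaches the same function.
    for entry in commands.values():
        if entry["key"] == key:
            return entry["func"]()
    raise KeyError(key)

register("search-forward", lambda: "searching",
         menu_path="Edit/Find", key="C-s")
```

Because every path leads to the same underlying function, users can graduate from menus to keystrokes without relearning anything.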
The principle of focus
-- Some aspects of the UI attract attention more than others do
The human eye is a highly non-linear device. For example, it possesses edge-detection
hardware, which is why we see Mach bands whenever two closely matched areas of
color come into contact. It also has motion-detection hardware. As a consequence, our
eyes are drawn to animated areas of the display more readily than static areas. Changes
to these areas will be noticed readily.
The mouse cursor is probably the most intensely observed object on the screen -- it's not
only a moving object, but mouse users quickly acquire the habit of tracking it with their
eyes in order to navigate. This is why global state changes are often signaled by changes
to the appearance of the cursor, such as the well-known "hourglass cursor". It's nearly
impossible to miss.
The text cursor is another example of a highly eye-attractive object. Changing its
appearance can signal a number of different and useful state changes.
The principle of grammar
-- A user interface is a kind of language -- know what the rules are
Many of the operations within a user interface require both a subject (an object to be
operated upon), and a verb (an operation to perform on the object). This naturally
suggests that actions in the user interface form a kind of grammar. The grammatical
metaphor can be extended quite a bit, and there are elements of some programs that can
be clearly identified as adverbs, adjectives and such.
The two most common grammars are known as "Action->Object" and "Object->Action".
In Action->Object, the operation (or tool) is selected first. When a subsequent object is
chosen, the tool immediately operates upon the object. The selection of the tool persists
from one operation to the next, so that many objects can be operated on one by one
without having to re-select the tool. Action->Object is also known as "modality",
because the tool selection is a "mode" which changes the operation of the program. An
example of this style is a paint program -- a tool such as a paintbrush or eraser is
selected, which can then make many brush strokes before a new tool is selected.
In the Object->Action case, the object is selected first and persists from one operation to
the next. Individual actions are then chosen which operate on the currently selected
object or objects. This is the method seen in most word processors -- first a range of text
is selected, and then a text style such as bold, italic or a font change can be selected.
Object->Action has been called "non-modal" because all behaviors that can be applied to
the object are always available. One powerful type of Object->Action is called "direct
manipulation", where the object itself is a kind of tool -- an example is dragging the
object to a new position or resizing it.
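The two grammars can be contrasted in a few lines of Python; the class names are invented for illustration:

```python
class PaintCanvas:
    """Action->Object: the tool (the verb) is chosen first and
    persists, so many objects can be operated on one by one."""

    def __init__(self):
        self.tool = "brush"
        self.strokes = []

    def select_tool(self, tool):
        self.tool = tool  # entering a "mode"

    def click(self, spot):
        self.strokes.append((self.tool, spot))


class TextEditor:
    """Object->Action: the selection (the subject) is chosen first
    and persists, so several actions can be applied to it in turn."""

    def __init__(self, text):
        self.text = text
        self.selection = None
        self.applied = []

    def select(self, start, end):
        self.selection = (start, end)

    def apply(self, style):
        self.applied.append((style, self.selection))
```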
Modality has been much criticized in user-interface literature because early programs
were highly modal and had hideous interfaces. However, while non-modality is the clear
winner in many situations, there are a large number of situations in life that are clearly
modal. For example, in carpentry, it’s generally more efficient to hammer in a whole
bunch of nails at once than to hammer in one nail, put down the hammer, pick up the
measuring tape, mark the position of the next nail, pick up the drill, etc.
The principle of help
-- Understand the different kinds of help a user needs
An essay in [LAUR91] states that there are five basic types of help, corresponding to
the five basic questions that users ask:
 Goal-oriented: "What kinds of things can I do with this program?"
 Descriptive: "What is this? What does this do?"
 Procedural: "How do I do this?"
 Interpretive: "Why did this happen?"
 Navigational: "Where am I?"
The essay goes on to describe in detail the different strategies for answering these
questions, and shows how each of these questions requires a different sort of help
interface in order for the user to be able to adequately phrase the question to the
application.
For example, "about boxes" are one way of addressing questions of type 1.
Questions of type 2 can be answered with a standard "help browser", "tool tips" or other
kinds of context-sensitive help. A help browser can also be useful in responding to
questions of the third type, but these can sometimes be more efficiently addressed using
"cue cards", interactive "guides", or "wizards" which guide the user through the process
step-by-step. The fourth type has not been well addressed in current applications,
although well-written error messages can help. The fifth type can be answered by proper
overall interface design, or by creating an application "roadmap". None of the solutions
listed in this paragraph are final or ideal; they are simply the ones in common use by
many applications today.
The principle of safety
-- Let the user develop confidence by providing a safety net
Ted Nelson once said "Using DOS is like juggling with straight razors. Using a Mac is
like shaving with a bowling pin."
Each human mind has an "envelope of risk", that is to say a minimum and maximum
range of risk levels which it finds comfortable. A person who finds herself in a
situation that is too risky for her comfort will generally take steps to reduce that risk.
Conversely, when a person's life becomes too safe -- in other words, when the risk level
drops below the minimum threshold of the risk envelope -- she will often engage in
actions that increase her level of risk.
This comfort envelope varies for different people and in different situations. In the case
of computer interfaces, a level of risk that is comfortable for a novice user might make a
"power-user" feel uncomfortably swaddled in safety.
It's important for new users that they feel safe. They don't trust themselves or their skills
to do the right thing. Many novice users think poorly not only of their technical skills,
but of their intellectual capabilities in general (witness the popularity of the "...for
Dummies" series of tutorial books.) In many cases these fears are groundless, but they
need to be addressed. Novice users need to be assured that they will be protected from
their own lack of skill. A program with no safety net will make this type of user feel
uncomfortable or frustrated to the point that they may cease using the program. The "Are
you sure?" dialog box and multi-level undo features are vital for this type of user.
At the same time, an expert user must be able to use the program as a virtuoso. She must
not be hampered by guard rails or helmet laws. However, expert users are also smart
enough to turn off the safety checks -- if the application allows it. This is why "safety
level" is one of the more important application configuration options.
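The multi-level undo mentioned above can be sketched as a pair of stacks of command
pairs. This is a minimal illustration, not any particular toolkit's API; the class and
parameter names are invented for the example.

```python
class UndoManager:
    """Multi-level undo/redo kept as two stacks of (do, undo) command pairs."""

    def __init__(self, limit=100):
        self.limit = limit          # depth of the safety net; could be a "safety level" option
        self.undo_stack = []
        self.redo_stack = []

    def execute(self, do, undo):
        do()
        self.undo_stack.append((do, undo))
        if len(self.undo_stack) > self.limit:
            self.undo_stack.pop(0)  # discard the oldest step once the limit is reached
        self.redo_stack.clear()     # a fresh action invalidates the redo history

    def undo(self):
        if self.undo_stack:
            do, undo = self.undo_stack.pop()
            undo()
            self.redo_stack.append((do, undo))

    def redo(self):
        if self.redo_stack:
            do, undo = self.redo_stack.pop()
            do()
            self.undo_stack.append((do, undo))
```

Exposing `limit` as a configuration option is one concrete way to let expert users
loosen the safety net while novices keep it deep.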
Finally, it should be noted that many things in life are not meant to be easy. Physical
exercise is one -- "no pain, no gain". A concert performance in Carnegie Hall, a
marathon, or the Guinness World Record would be far less impressive if anybody could
do it. This is especially pertinent in the design of computer game interfaces, which
operate under somewhat different principles than those listed here (although many of the
principles in fact do apply).
The principle of context
-- Limit user activity to one well-defined context unless there's a good reason not to
Each user action takes place within a given context -- the current document, the current
selection, the current dialog box. A set of operations that is valid in one context may not
be valid in another. Even within a single document, there may be multiple levels -- for
example, in a structured drawing application, selecting a text object (which can be
moved or resized) is generally considered a different state from selecting an individual
character within that text object.
It's usually a good idea to avoid mixing these levels. For example, imagine an application
that allows users to select a range of text characters within a document, and also allows
them to select one or more whole documents (the latter being a distinct concept from
selecting all of the characters in a document). In such a case, it's probably best if the
program disallows selecting both characters and documents in the same selection. One
unobtrusive way to do this is to "dim" the selection that is not applicable in the current
context. In the example above, if the user had a range of text selected and then selected
a document, the range of selected characters could become dim, indicating that the
selection was not currently pertinent.
the nature of the application and the relationship between the contexts.
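The dimming behaviour described above amounts to tracking which selection context is
current. A minimal sketch (class and context names are invented for illustration):

```python
class SelectionModel:
    """Tracks two selection contexts and reports which one should be dimmed."""

    def __init__(self):
        self.char_range = None      # (start, end) of selected characters in a document
        self.documents = set()      # selection of whole documents
        self.active = None          # the context that is currently pertinent

    def select_chars(self, start, end):
        self.char_range = (start, end)
        self.active = "chars"       # any document selection is now dimmed

    def select_document(self, doc_id):
        self.documents.add(doc_id)
        self.active = "docs"        # any character selection is now dimmed

    def is_dimmed(self, context):
        """A context is drawn dimmed when it exists but is not the active one."""
        return self.active is not None and context != self.active
```

The view layer would consult `is_dimmed()` when painting each selection highlight.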
Another thing to keep in mind is the relationship between contexts. For example, it is
often the case that the user is working in a particular task-space, when suddenly a dialog
box will pop up asking the user for confirmation of an action. This sudden shift of
context may leave the user wondering how the new context relates to the old. This
confusion is exacerbated by the terseness of writing style that is common amongst
application writers. Rather than the "Are you sure?" confirmation mentioned earlier,
something like "You have two unsaved documents. Do you want to quit anyway?" would
help to keep the user anchored in their current context.
The principle of aesthetics
-- Create a program of beauty
It's not necessary that each program be a visual work of art. But it's important that it not
be ugly. There are a number of simple principles of graphical design that can easily be
learned, the most basic of which was coined by artist and science fiction writer William
Rotsler: "Never do anything that looks to someone else like a mistake." The specific
example Rotsler used was a painting of a Conan-esque barbarian warrior swinging a
mighty broadsword. In this picture, the tip of the broadsword was just off the edge of the
picture. "What that looks like", said Rotsler, "is a picture that's been badly cropped. They
should have had the tip of the sword either clearly within the frame or clearly out of it."
An interface example can be seen in the placement of buttons -- imagine five buttons,
each with five different labels that are almost the same size. Because the buttons are
packed using an automated-layout algorithm, each button is almost but not exactly the
same size. As a result, though the author has placed much care into his layout, it looks
carelessly done. A solution would be to have the packing algorithm know that buttons
that are almost the same size look better if they are exactly the same size -- in other
words, to encode some of the rules of graphical design into the layout algorithm. Similar
arguments hold for manual widget layout.
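The size-equalization rule described above can be encoded directly in a layout routine.
The sketch below clusters widths that are within a tolerance of each other and snaps
each cluster to its widest member; the tolerance value is an assumption for illustration.

```python
def equalize_widths(widths, tolerance=8):
    """Snap button widths that are almost equal (within `tolerance` pixels)
    to the same value, so near-equal buttons become exactly equal."""
    order = sorted(range(len(widths)), key=lambda i: widths[i])
    result = list(widths)
    cluster = [order[0]]
    for i in order[1:]:
        if widths[i] - widths[cluster[0]] <= tolerance:
            cluster.append(i)       # close enough: same cluster
        else:
            peak = max(widths[j] for j in cluster)
            for j in cluster:
                result[j] = peak    # snap the finished cluster to its widest member
            cluster = [i]
    peak = max(widths[j] for j in cluster)
    for j in cluster:
        result[j] = peak
    return result

# Five labels that measure almost, but not exactly, the same; one genuinely wider:
print(equalize_widths([74, 76, 72, 75, 120]))  # -> [76, 76, 76, 76, 120]
```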
Another area of aesthetics to consider is the temporal dimension. Users don't like using
programs that feel sluggish or slow. There are many tricks that can be used to make a
slow program "feel" snappy, such as the use of off-screen bitmaps for rendering, which
can then be blitted forward in a single operation. (A pet peeve of this particular author is
buttons that flicker when the button is being activated or the window is being resized.
Multiply redundant refreshing of buttons when changing state is one common cause of
this.)
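The off-screen rendering trick mentioned above follows a simple pattern: draw
everything into a hidden buffer, then copy the finished image to the screen in one
step. A toolkit-free sketch using a list of character rows in place of a real bitmap:

```python
def render_frame(width, height, draw_ops):
    """Draw every operation into an off-screen buffer first, then return the
    finished buffer to be copied ("blitted") to the screen in a single
    operation -- avoiding the flicker caused by drawing piecemeal on-screen."""
    buffer = [[" " for _ in range(width)] for _ in range(height)]
    for x, y, ch in draw_ops:       # each op paints one cell in this sketch
        buffer[y][x] = ch
    return buffer                   # caller blits this in one step

frame = render_frame(4, 2, [(0, 0, "#"), (3, 1, "#")])
```

In a real toolkit the buffer would be an off-screen bitmap or image object, and the
blit would be the toolkit's single copy-to-window call.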
The principle of user testing
-- Recruit help in spotting the inevitable defects in your design
In many cases a good software designer can spot fundamental defects in a user interface.
However, there are many kinds of defects which are not so easy to spot, and in fact an
experienced software designer is often less capable of spotting them than the average
person. In other cases, a bug can only be detected while watching someone else use the
program.
User-interface testing, that is, the testing of user-interfaces using actual end-users, has
been shown to be an extraordinarily effective technique for discovering design defects.
However, there are specific techniques that can be used to maximize the effectiveness of
end-user testing. These are outlined in both [TOG91] and [LAUR91] and can be
summarized in the following steps:
 Set up the observation. Design realistic tasks for the users, and then recruit end-users
who have the same experience level as users of your product (avoid recruiting users
who are already familiar with your product, however).
 Describe to the user the purpose of the observation. Let them know that you're
testing the product, not them, and that they can quit at any time. Make sure that
they understand that if anything bad happens, it's not their fault, and that it's helping
you to find problems.
 Talk about and demonstrate the equipment in the room.
 Explain how to "think aloud". Ask them to verbalize what they are thinking about
as they use the product, and let them know you'll remind them to do so if they
forget.
 Explain that you will not provide help.
 Describe the tasks and introduce the product.
 Ask if there are any questions before you start; then begin the observation.
 Conclude the observation. Tell them what you found out and answer any of their
questions.
 Use the results.
User testing can occur at any time during the project; however, it's often more efficient to
build a mock-up or prototype of the application and test that before building the real
program. It's much easier to deal with a design defect before it's implemented than after.
Tognazzini suggests that you need no more than three people per design iteration -- any
more than that and you are just confirming problems already found.
The principle of humility
-- Listen to what ordinary people have to say
Some of the most valuable insights can be gained by simply watching other people
attempt to use your program. Others can come from listening to their opinions about the
product. Of course, you don't have to do exactly everything they say. It's important to
realize that each of you, user and developer, has only part of the picture. The ideal is to
take a lot of user opinions, plus your insights as a developer and reduce them into an
elegant and seamless whole -- a design which, though it may not satisfy everyone, will
satisfy the greatest needs of the greatest number of people.
One must be true to one's vision. A product built entirely from customer feedback is
doomed to mediocrity, because what users want most are the features that they cannot
anticipate.
But a single designer's intuition about what is good and bad in an application is
insufficient. Program creators are a small, and not terribly representative, subset of the
general computing population.
Some things designers should keep in mind about their users:
Most people have a biased idea as to what the "average" person is like. This is
because most of our interpersonal relationships are in some way self-selected. It's a rare
person whose daily life brings them into contact with other people from a full range of
personality types and backgrounds. As a result, we tend to think that others think "mostly
like we do." Designers are no exception. Most people have some sort of core
competency, and can be expected to perform well within that domain.
The skill of using a computer (also known as "computer literacy") is actually much
harder than it appears.
The lack of "computer literacy" is not an indication of a lack of basic intelligence. While
native intelligence does contribute to one's ability to use a computer effectively, there are
other factors which seem to be just as significant, such as a love of exploring complex
systems, and an attitude of playful experimentation. Much of the fluency with computer
interfaces derives from play -- and those who have dedicated themselves to "serious"
tasks such as running a business, curing disease, or helping victims of tragedy may lack
the time or patience to be able to devote effort to it.
A high proportion of programmers are introverts, compared to the general population.
This doesn't mean that they don't like people, but rather that there are specific social
situations that make them uncomfortable. Many of them lack social skills, and retreat
into the world of logic and programming as an escape. As a result, they are not
experienced people-watchers.
The best way to avoid misconceptions about users is to spend some time with them,
especially while they are actually using a computer. Do this long enough, and eventually
you will get a "feel" for how the average non-technical person thinks. This will increase
your ability to spot defects, although it will never make you infallible, and will never be
a substitute for user testing.
ERGONOMIC GUIDELINES FOR USER-INTERFACE DESIGN
The following points are guidelines to good software interface design, not an absolute set
of rules to be blindly followed. These guidelines apply to the content of screens. In
addition to following these guidelines, effective software also necessitates using
techniques, such as 'storyboarding', to ensure that the flow of information from screen to
screen is logical, follows user expectations, and follows task requirements.
Consistency ("Principle of least astonishment")
 certain aspects of an interface should behave in consistent ways at all times for all
screens
 terminology should be consistent between screens
 icons should be consistent between screens
 colors should be consistent between screens of similar function
Simplicity
 break complex tasks into simpler tasks
 break long sequences into separate steps
 keep tasks easy by using icons, words etc.
 use icons/objects that are familiar to the user
Human Memory Limitations
 organize information into a small number of "chunks"
 try to create short linear sequences of tasks
 don't flash important information onto the screen for brief time periods
 organize data fields to match user expectations, or to organize user input (e.g.
auto formatting phone numbers)
 provide cues/navigation aids for the user to know where they are in the software
or at what stage they are in an operation
 provide reminders or warnings as appropriate
 provide ongoing feedback on what is happening and/or what has just happened
 let users recognize rather than recall information
 minimize working memory loads by limiting the length of sequences and the quantity
of information - avoid icon mania!
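The phone-number example above (organizing user input into the chunks users expect)
can be sketched as follows; the North American number format is an assumption for
illustration.

```python
def format_phone(digits):
    """Group a 10-digit string into the familiar (XXX) XXX-XXXX chunks,
    easing the load on the user's working memory."""
    cleaned = "".join(ch for ch in digits if ch.isdigit())
    if len(cleaned) != 10:
        return digits               # leave unexpected input untouched
    return f"({cleaned[:3]}) {cleaned[3:6]}-{cleaned[6:]}"

print(format_phone("8005551234"))   # -> (800) 555-1234
```

In a real form this would run as the user types or when the field loses focus, so the
user never has to format the number themselves.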
Cognitive Directness
 minimize mental transformations of information (e.g. using 'control+shift+esc+8'
to indent a paragraph)
 use meaningful icons/letters
 use appropriate visual cues, such as direction arrows
 use 'real-world' metaphors whenever possible (e.g. desktop metaphor, folder
metaphor, trash can metaphor etc.)
Feedback
 provide informative feedback at the appropriate points
 provide appropriate articulatory feedback - feedback that confirms the physical
operation you just did (e.g. you typed 'help' and 'help' appears on the screen). This
includes all forms of feedback, such as auditory feedback (e.g. system beeps,
mouse clicks, key clicks etc.)
 provide appropriate semantic feedback - feedback that confirms the intention of
an action (e.g. highlighting an item being chosen from a list)
 provide appropriate status indicators to show the user the progress with a lengthy
operation (e.g. the copy bar when copying files, an hour glass icon when a
process is being executed etc.)
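The status-indicator advice above amounts to reporting progress at regular intervals
during a lengthy operation. A minimal console sketch (the function name and the use of
a text progress line are illustrative; a GUI would update a progress bar widget instead):

```python
import sys
import time

def copy_with_progress(chunks, delay=0.0):
    """Process a sequence of work items, reporting percent-complete as we go,
    so the user can see that the lengthy operation is actually progressing."""
    chunks = list(chunks)
    total = len(chunks)
    for i, chunk in enumerate(chunks, start=1):
        time.sleep(delay)           # stand-in for the real work on `chunk`
        percent = 100 * i // total
        sys.stdout.write(f"\rCopying... {percent}%")
        sys.stdout.flush()
    sys.stdout.write("\n")
    return total

copy_with_progress(range(5))
```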
System messages
 provide user-centered wording in messages (e.g. "there was a problem in copying
the file to your disk" rather than "execution error 159")
 avoid ambiguous messages (e.g. "hit 'any' key to continue" - there is no 'any' key
and there's no need to hit a key; reword to say 'press the return key to continue')
 avoid using threatening or alarming messages (e.g. fatal error, run aborted, kill
job, catastrophic error)
 use specific, constructive words in error messages (e.g. avoid general messages
such as 'invalid entry' and use specifics such as 'please enter your name')
 make the system 'take the blame' for errors (e.g. "illegal command" versus
"unrecognized command")
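Mapping internal error codes to user-centered wording, as recommended above, can be
done with a simple lookup table. The codes and message texts below are invented for
illustration.

```python
# Hypothetical internal error codes mapped to user-centered, blame-free wording.
USER_MESSAGES = {
    159: "There was a problem copying the file to your disk. "
         "Please check that the disk has enough free space.",
    201: "That command wasn't recognized. Type 'help' to see the available commands.",
}

def user_message(code):
    """Translate an internal error code into constructive, non-threatening text,
    falling back to a generic but still polite message for unknown codes."""
    return USER_MESSAGES.get(
        code, f"Something went wrong (internal code {code}). Please try again.")

print(user_message(159))
```

Keeping the wording in one table also makes it easy to review every message for tone
and consistency in one place.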
Anthropomorphization
 don't anthropomorphize (i.e. don't attribute human characteristics to objects) -
avoid the "Have a nice day" messages from your computer
Modality
 use modes cautiously - a mode is an interface state where what the user does has
different actions than in other states (e.g. changing the shape of the cursor can
indicate whether the user is in an editing mode or a browsing mode)
 minimize preemptive modes, especially irreversible preemptive modes - a
preemptive mode is one where the user must complete one task before
proceeding to the next. In a preemptive mode other software functions are
inaccessible (e.g. file save dialog boxes)
 make user actions easily reversible - use 'undo' commands, but use these
sparingly
 allow escape routes from operations
Attention
 use attention grabbing techniques cautiously (e.g. avoid overusing 'blinks' on web
pages, flashing messages, 'you have mail', bold colors etc.)
 don't use more than 4 different font sizes per screen
 use serif or sans serif fonts appropriately as the visual task situation demands
 don't use all uppercase letters - use an uppercase/lowercase mix
 don't overuse audio or video
 use colors appropriately and make use of expectations (e.g. don't have an OK
button colored red! use green for OK, yellow for 'caution', and red for 'danger' or
'stop')
 don't use more than 4 different colors on a screen
 don't use blue for text (hard to read), blue is a good background color
 don't put red text on a blue background
 use high contrast color combinations
 use colors consistently
 use only 2 levels of intensity on a single screen
 Use underlining, bold, inverse video or other markers sparingly
 on text screens don't use more than 3 fonts on a single screen
Display issues
 maintain display inertia - make sure the screen changes little from one screen to
the next within a functional task situation
 organize screen complexity
 eliminate unnecessary information
 use concise, unambiguous wording for instructions and messages
 use easy to recognize icons
 use a balanced screen layout - don't put too much information at the top of the
screen - try to balance information in each screen quadrant
 use plenty of 'white space' around text blocks - use at least 50% white space for
text screens
 group information logically
 structure the information rather than just presenting a narrative format
(comprehension can be 40% faster for a structured format)
Individual differences
 accommodate individual differences in user experience (from the novice to the
computer literate)
 accommodate user preferences by allowing some degree of customization of
screen layout, appearance, icons etc.
 allow alternative forms for commands (e.g. key combinations through menu
selections)
Web page design
Download speed is a critical aspect of web page design. Remember that when you check
your pages locally in your browser you aren't experiencing normal web delays!
Regardless of your modem speed, pages will download only as fast as the slowest link
in the 'chain' from the server to the browser allows. The following tips will help to
speed downloads and aid comprehension of your web page materials:
 avoid using 'blinks' unless these are absolutely necessary - blinks are distracting;
use fonts, sizes and colors to attract attention instead
 keep backgrounds simple and muted
 minimize audio and video use, this really slows download time
 use animated files (e.g. animated .GIFs) sparingly
 use thumbnail .GIFs linked to larger .GIFs
 specify .GIF size (HEIGHT, WIDTH) - this speeds download times
 use 'ALTs' for .GIFs where only the .GIF provides the link - this provides linked
text information to those only browsing in text mode
 use image maps sparingly - they are slow and can be annoying - using an
invisible table can often give similar results with much faster downloads
 use frames sparingly and consistently - use absolute widths for frames, scroll
bars, avoid menus for small numbers of items, also check that users don't get
stuck in a frame
 avoid 'construction signs' - web pages are meant to be dynamic and therefore
should be changed/updated regularly - they are always under construction - try to
tell users when content was last changed and what changes were made
 minimize use of Java, Javascript, Applets (e.g. ticker tape status bars) - they are
cute but often provide little useful information content and slow downloads
 remember that 50% of users have monitors 15" or less and at 640 x 480
resolution, so use a maximum window width of 620 pixels or flexible window
widths and test your pages in your browser at low screen resolutions and limited
colors (256 or less)
 provide contact information at the home page location in your site
General principles to follow when designing any programme.
A good interface will fade into the background and the user will focus on the task at
hand.
Human Issues
Baeker and Buxton (pg. 40) state that the "beliefs and expectations with which she (the
computer user) sits down at her terminal or personal computer are a direct result of her
concept of what the computer is like and what the computer has become". Thus Hansen
(cited in Shneiderman, 1986) states that one should "know the user". This includes all
aspects of the user's experience of computerized systems as well as their personal
preferences.
Previous computer experience and design expectations.
For example a user who has only had experience in the windows environment is unlikely
to benefit from a DOS look and feel, even if the programme is functionally adequate for
all their programming needs. This is vitally important when one remembers that the
computer, for most users, is simply one of an array of tools that can be used to perform a
certain task. If the tool is not readily accessible and easy to use it will be discarded in
preference of another.
Cultural Issues
Certain images, graphics and language may be offensive to one group of users, and care
must be taken to avoid inadvertently offending any one on the basis of culture, race,
creed, gender or sexual orientation. Muslim users may be offended (or alienated) by
popping champagne bottles, whilst indirectly comparing a Zulu user to an animal
(cartoon of a monkey) would equally offend and alienate this group. Language should be
inoffensive, and gender neutral.
Differently abled users
Any computer programme may be used by people with physical challenges, e.g. the blind
and deaf. Even in areas where physically disabled users are unlikely to be expected,
there may be occasions when a user is temporarily disabled and still needs access to the
equipment. For instance, if a hand is in a plaster cast, would the user still be able to
access the information? Sound should include textual alternatives, and visual graphics should
have descriptions.
Colour Vision Deficiency (colour blindness) is more prevalent than one realizes; make
sure that any important colour coding and contrasts take this into account. Table 1
outlines the more common discrimination confusions in fairly technical terms whilst
Fowler and Stanwick (1995 pgs. 309, 310) state that "Color blindness or weakness has
four basic varieties.
(i) Green blindness - individuals confuse greens, yellows, and reds (6.39 percent)
(ii) Red blindness - individuals confuse various shades of red (2.04 percent)
(iii) Blue blindness - individuals confuse blues (0.0003 percent)
(iv) Total color blindness, which affects no more than 0.005 percent of both sexes."
The Macintosh Human Interface Guidelines also warns against this problem stating
"people with color-deficient vision wouldn't recognize the use of color to indicate
selection. Therefore, you shouldn't use color as the only means of communicating
important information. Color should be used redundantly. It shouldn't be the only thing
that distinguishes two objects; there should be other cues, such as text labels, shape,
location, pattern, or sound." and suggests that all images should be developed in black
and white first. (For more information about the use of colour see the section heading
"Colour".)
Learning Time
Nelson (cited in Baeker and Buxton, 1987) stated that "any system which cannot be well
taught to a layman in ten minutes, by a tutor in the presence of a responding set-up, is too
complicated". Factors that lead to the shortening of the learning time include familiarity,
consistency and the use of an accessible metaphor. If a user can visualize the structure of
a system and is able to predict the outcome of interactions, they will have more
confidence with quicker interactions and a lower error rate.
Menus and selection objects
Menu systems and graphical iconic symbolization are not necessarily universally
understood. Various authors point to the following guidelines when creating selection
items:
(i) All graphic representations should have textual descriptions.
(ii) Consistency of terminology should apply to all options throughout the system.
(iii) Avoid the use of jargon and keep phrasing concise.
(iv) Keywords should be placed where the user will scan them first.
(v) Group similar items in a menu map, or if this is not possible use other
instinctive alternatives such as alphabetic order.
(vi) Avoid multiple screen traversals for selection purposes.
(vii) Avoid ambiguity.
(viii) Consistency throughout is vital.
Icon Tips
Pictorial literacy is not a given. Interpretations of graphics are often dependent on
culture, experience and exposure to a specific medium (see Amory and Mars, 1994 and
Andrews, 1994). One pertinent example is that arrows are not a universal symbol of
direction. It is for this reason that most authorities in Interface design recommend that all
buttons, icons etc be labeled.
Fowler and Stanwick (pages 57, 58) suggest that there are two standard sizes for icons,
16 pixels square and 32 pixels square. They quote William Horton's book "The Icon
Book" as suggesting that "Design is easier if the grid has an odd number of pixels along
each side. This is because an odd number provides a central pixel around which to focus
design". They go on to state that each icon should have a label, which should be the same
as (or an abbreviation of) the title of the corresponding window.
Navigation Issues
Navigation issues vary between Multimedia and WebPages but the common issues
include links to the first screen/page, next screen/page, backtrack facilities and every
system should have a quick exit button. See the section on the use of metaphor for
commonly used buttons. All applications should have short cuts for expert users.
Sound
All aspects of design should adhere to the concept of adding meaning: if there is no
enhancement of accessibility for the user, then there is no need for the information,
graphic or media to be added. Similarly, sound should be inserted only if it enhances
meaning, and it should not distract the user's attention.
Wherever possible, allow the user interactive control to play, stop, rewind and pause. It
is also useful to be aware that some users may be disturbed by a faceless voice. Many
applications display a picture or video of a person when a voice recording is played.
Mixed Media
When using a combination of media (e.g. sound, text, animation and video), be careful
that the user's attention is not distracted by one or other of the media: animation and
sound can work well together, but animation and text presented simultaneously is likely
to be distracting.
Messages and Status reports
Concise, brief, unambiguous, clearly visible and consistently placed on screen.
Feedback
Immediate, positive and instructional
Tone
Respect for the user and subject material is imperative. Avoid slang, misplaced humour
and potentially offensive insinuations.
Screen Layout and Design
The layout of the screen is a controversial issue; what is aesthetically pleasing to one
person may be considered dull and boring or, conversely, garish to another. The
following locally designed pages may best illustrate this:
Novice designers should aim for elegant simplicity and consistency. It helps to divide the
screen into a grid where similar types of information are consistently placed. This helps
the designer form a visual sense of balance across screens, and the consistency will aid
the user to quickly locate the important information. Users typically suffer from
"cognitive overload" from too much information and too many diverse media used
simultaneously.
Font should be legible, and care must be taken to ensure that the users machine is likely
to have a similar font to the one selected so that there is a level of predictability in the
final display. A mixture of too many fonts detracts from legibility, rather use a maximum
of two fonts and vary the size and weights to change the emphasis or draw attention to
different areas of information. All screens should be titled, and the titles should match
the names of the interaction that brought the user to the screen. White space consistently
used can separate the screen into logical groups of information and make it more legible.
Colour
Most people involved with the development of interactive course material cannot afford
the expertise and skills of a graphic design artist. This is often obvious in the end results
and if at all possible it is recommended that a graphic artist be included in a team of
developers. However, for those who are in the unfortunate position of a "do or die"
scenario the following advice may assist. Most authors suggest the use of a maximum of
four colours.
Use colours to colour code similar items, but remember that colour coding is only useful
if the user knows the code (red=stop, green=go); the metaphor should be a familiar one
to the users, otherwise lengthy explanations are necessary and counter-productive. Also,
colours are often used to depict various items (e.g. in medical illustrations red is used to
depict arteries and yellow to depict nerves); switching or changing these colours could be
confusing for the user.
In dense screens colour coding can assist the user to identify grouped material - choose
your colours carefully so as to accommodate people with Colour Discrimination
Deficiencies as far as possible.
If material is to be printed by the user, remember to design graphics with patterns as well
as colour coding. Most people only have access to black and white printers.
Consider contrasts carefully. If you have a dark background, use light foregrounds (this
combination is good for long-distance viewing such as slide shows or projected
computer screens). Use light backgrounds and dark foregrounds for situations with high
ambient light e.g. overhead projectors.
Note that different wavelengths of colour come into focus at different points in the eye
(See Figure 3). It is difficult for people to focus on red and blue simultaneously.
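Contrast can also be checked numerically. The sketch below computes the WCAG
relative-luminance contrast ratio, which is a widely used stand-in for the contrast
guidance above (the function names are illustrative):

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB colour given as 0-255 components."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Ratio from 1:1 (identical colours) to 21:1 (black on white); WCAG asks
    for at least 4.5:1 for normal body text."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # -> 21.0
```

Running candidate foreground/background pairs through such a check catches weak
combinations before any user sees them.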
Colour confusions commonly perceived by people suffering from colour vision
deficiencies (adapted from Travis, 1991, pg. 59):

Type of Defect   Incidence in %   Typical Confusions                             White Matches
Achromatopsia    0.003            All colours look like shades of grey           Many colours
Protanopia       1                Bluish-green & brown; green, olive,            Blue-green
                                  tan & red-orange; blue & red-purple;
                                  violet & purple
Deuteranopia     1                Dull green & pink; olive & brown;              Blue-green
                                  yellow-green & red-orange; greenish-blue,
                                  dull blue & purple
Tritanopia       0.004            Green & greenish-blue; oranges & red-purples   Yellow-orange
The use of metaphor in the interface design
Imposing a metaphor on a virtual world allows the user to better predict the
outcomes of new interactions. It also allows the designer to work with a model which
will guide the development in a consistency of interactions and representations. Obvious
metaphors are those of the "desktop" for office automation software, and the "paint brush
and easel" for graphics packages. Care should be taken that the analogy is familiar to the
users' experience of the "real world" and similar enough to be incorporated without
excessive explanation.
Another common metaphor for navigational buttons is the VCR or tape deck buttons,
which are familiar to most users. e.g.
Forward
Back
Fast forward
Rewind
Stop
Interactivity
Interactivity has been lauded as the most promising development in CAL since the
euphoria of AI collapsed. However, interactivity should be more than a simple point-and-click
scenario. Truly interactive systems based on a constructivist approach would
include drag and drop, text entries and other forms of interaction to develop a user's
knowledge of the subject material.
Learning Styles
Individuals typically have their own preferences in the way that they perceive, collect
and process information. These methods are referred to as "Learning Styles". The
Academic Skills Center at Western Michigan University offers the following breakdown
of learning styles:
 Print - learns through reading. (Allow printouts for these students.)
 Aural - learns by listening - will enjoy audio tapes and listening to what other
learners have to say. (Voice-overs will assist these users.)
 Interactive - enjoys discussions with other students on a one-to-one basis or in
small groups. (CMC would assist many of these students.)
 Visual - learns by looking at pictures, graphs, slides, demonstrations and films.
(Colour coding will work well with these types of students.)
 Haptic - learns through the sense of touch. (Drag-and-drop interactions could help
here.)
 Kinesthetic - learns through movement. (Animation could help students with this
type of preference.)
 Olfactory - uses the sense of smell in learning. (Any ideas?)
Learners will not typically use only one of the above list but a combination of them,
favouring one method over another e.g. some learners work well in a group environment
using visual and interactive learning styles whilst others prefer to learn on their own, but
still use a visual style.
Many, although not all of the above can be used in the development of Interactive
Multimedia Course Material.
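The pairings of learning style and supporting interface feature listed above can be collected into a simple lookup table. The sketch below, in TypeScript, takes its feature names from the parenthetical suggestions in the list; the pairings are illustrative, not validated guidelines.

```typescript
// Mapping each learning style above to an interface feature that may
// support it, following the suggestions in the list.
const styleSupports: Record<string, string> = {
  print: "printable pages / printouts",
  aural: "voice-over and audio",
  interactive: "computer-mediated communication (CMC)",
  visual: "colour coding, graphics and film clips",
  haptic: "drag-and-drop interactions",
  kinesthetic: "animation",
};

// A learner typically combines several styles, so a course might query
// more than one entry when choosing media for a topic.
function featuresFor(styles: string[]): string[] {
  return styles.filter((s) => s in styleSupports).map((s) => styleSupports[s]);
}
```

For example, a learner who favours visual and haptic styles would be offered both colour-coded graphics and drag-and-drop interactions.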
Instructional Events
Gagne (1973, p. 303) states that "control of the external events in the learning situation is
what is typically meant by the word 'instruction'". He then lists these events as:
 Gaining and controlling attention.
 Informing the learner of expected outcomes.
 Stimulating recall of relevant prerequisite capabilities.
 Presenting the stimuli inherent to the learning task.
 Offering guidance for learning.
 Providing feedback.
 Appraising performance.
 Making provisions for transferability.
 Insuring retention.
Importance of HCI
Users expect highly effective and easy-to-learn interfaces and developers now realize the
crucial role the interface plays. Surveys show that over 50% of the design and
programming effort on projects is devoted to the user interface portion. The human-computer interface is critical to the success of products in the marketplace, as well as the safety, usefulness, and pleasure of using computer-based systems.
There is substantial empirical evidence that employing the processes, techniques, and
tools developed by the HCI community can dramatically decrease costs and increase
productivity. For example, one study reported savings due to the use of usability
engineering of $41,700 in a small application used by 23,000 marketing personnel, and
$6,800,000 for a large business application used by 240,000 employees. Savings were
attributed to decreased task time, fewer errors, greatly reduced user disruption, reduced
burden on support staff, elimination of training, and avoidance of changes in software
after release. Another analysis estimates the mean benefit for finding each usability
problem at $19,300. A usability analysis of a proposed workstation saved a telephone
company $2 million per year in operating costs. A mathematical model based on eleven
studies suggests that using software that has undergone thorough usability engineering
will save a small project $39,000, a medium project $613,000 and a large project
$8,200,000. By estimating all the costs associated with usability engineering, another
study found that the benefits can be up to 5000 times the cost.
There are also well-known catastrophes that have resulted from not paying enough
attention to the human-computer interface. For example, the complicated user interface
of the Aegis tracking system was a contributing cause to the erroneous downing of an
Iranian passenger plane, and the US Stark's inability to cope with Iraqi Exocet missiles
was partly attributed to the human-computer interface. Problems with the interfaces of
military and commercial airplane cockpits have been named as a likely cause for several
crashes, including the Cali crash of December 1995. Sometimes the implementation of
the user interface can be at fault. A number of people died from radiation overdoses
partially as a result of faulty cursor handling code in the Therac-25.
Effective user interfaces to complex applications are indispensable. The recognition of
their importance in other disciplines is increasing and with it the necessary
interdisciplinary collaboration needed to fully address many challenging research
problems. For example, for artificial intelligence technologies such as agents, speech,
and learning and adaptive systems, effective interfaces are fundamental to general
acceptance. HCI sub disciplines such as information visualization and algorithm
animation are used in computational geometry, databases, information retrieval, parallel
and distributed computation, electronic commerce and digital libraries, and education.
HCI requirements resulting from multimedia, distributed computing, real-time graphics,
multimodal input and output, ubiquitous computing, and other new interface
technologies shape the research problems currently being investigated in disciplines such
as operating systems, databases, and networking. New programming languages such as
Java result from the need to program new types of distributed interfaces on multiple
platforms. As more and more of software designers' time and code are devoted to the
user interface, software engineering must increase its focus on HCI.
Differences between locally presented multimedia course material & World Wide
Web delivered material
There are a number of subtle differences between the interface for locally presented multimedia course material and that which is delivered via the WWW.
Response Time
Probably the most significant comes about as a result of the difference in response time.
Locally delivered material can usually rely on a quick response and display time, whilst
internet-delivered material has a slower response time. As users generally do not like to wait for information, internet material should be more concise than locally delivered material. This has particular relevance to menus and navigational issues;
Shneiderman (page 106) states that
"deep menu trees or complex traversals become annoying to the user if the systems
response time is slow, resulting in long and multiple delays. With slow display rates,
lengthy menus become annoying because of the volume of text that must be displayed. In
positive terms, if the response time is long, then create menus with more items on each
menu to reduce the number of menus necessary. If the display rate is slow, create menus
with fewer items to reduce the display time."
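Shneiderman's advice can be read as a small decision rule: trade menu breadth against display volume depending on which delay dominates. The TypeScript sketch below makes the rule explicit; the numeric thresholds are our own illustrative assumptions, not values from the text.

```typescript
// Sketch of Shneiderman's menu guidance: a slow response time favours
// broader menus (fewer traversals), while a slow display rate favours
// shorter menus (less text per screen).
interface MenuAdvice {
  itemsPerMenu: "more" | "fewer";
  reason: string;
}

function adviseMenuBreadth(responseTimeMs: number, displayRateCps: number): MenuAdvice {
  // Thresholds are illustrative assumptions, not empirical values.
  const SLOW_RESPONSE_MS = 2000; // long wait between selections
  const SLOW_DISPLAY_CPS = 300;  // characters drawn per second

  if (displayRateCps < SLOW_DISPLAY_CPS) {
    return { itemsPerMenu: "fewer", reason: "slow display rate: reduce text per menu" };
  }
  if (responseTimeMs > SLOW_RESPONSE_MS) {
    return { itemsPerMenu: "more", reason: "slow response time: reduce the number of menu levels" };
  }
  return { itemsPerMenu: "more", reason: "no penalty: broad-and-shallow menus are generally preferred" };
}
```

With a fast display but a two-second-plus response time, the rule recommends broader menus; with a slow display rate, it recommends shorter ones, matching the quotation above.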
It is important to ensure that colour graphics do not unnecessarily slow down the display
of information. Web pages are particularly prone to slow response rates if large graphic
files are necessary. Similarly in the development of multimedia CAL, care should be
taken to reduce the number of colours in the graphic file to 256 as this allows quicker
display times and compatibility with most computer colour monitors. However, the interpretation of colours varies from monitor to monitor, and the visual implications should be tested on as many different display screens as possible.
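One concrete way to keep colour handling predictable on 256-colour displays is to restrict graphics to the 216 "web-safe" colours, whose red, green and blue channels each use only the six levels 0, 51, 102, 153, 204 and 255. The following TypeScript sketch snaps an arbitrary 24-bit colour to that palette:

```typescript
// Snap a 24-bit RGB colour to the nearest web-safe value. Web-safe
// colours use six levels per channel (multiples of 51), giving
// 6 * 6 * 6 = 216 colours that render consistently on 256-colour displays.
function toWebSafe(r: number, g: number, b: number): [number, number, number] {
  const snap = (c: number) => Math.round(c / 51) * 51;
  return [snap(r), snap(g), snap(b)];
}
```

For instance, the colour (30, 130, 250) snaps to the web-safe (51, 153, 255).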
CHAPTER FIVE
HCI AND WEB DESIGN
Problems and Promises
In this section, we will examine the relationship between the activity of designing
information sites for the World Wide Web and the field of Human Computer
Interactions. From the perspective of HCI, web site design offers some interesting
problems that are not present in the creation of traditional, stand-alone software products.
Because the WWW's rise to prominence is so recent, HCI is only now beginning to address these new issues. A challenge for the field, therefore, will be to
rigorously examine the process of web site design and offer recommendations and
guidelines, as it has with the areas of software and hypermedia publishing. That such
counsel is needed by web site designers becomes readily apparent when looking at the
multitude of badly conceived and poorly designed sites now populating the web. As
Borges and his collaborators point out, "the proliferation of pages with poor usability
suggests that most of the designers of WWW pages have little knowledge of user
interface design and usability engineering. This is a serious problem that needs to be
addressed...". There are, in fact, a great number of guidelines currently published on
the WWW offering advice on how to design effective and pleasing sites. Unfortunately,
very few of these are grounded in the theories or empirical data that have been developed
in HCI. In fact, as Morris notes, "at this point, HCI as a discipline has had a relatively
limited impact upon the development of the web". It is our contention, however, that it is precisely the field of HCI that has the most to offer web site designers. Therefore, part of this chapter will be devoted to examining different areas within the HCI literature that might be of most use to the individuals who are creating and maintaining web sites.
This section is divided into two main parts. In the first section, we will identify some of
the new and unique issues that designing for the medium of the web present to the field
of HCI. In the second section, we will discuss areas of the HCI literature that are
particularly useful to web designers and propose a method for web site design that is based upon these literatures.
Issues in HCI design in Web Medium
Web and Traditional Software Design
The question can be raised as to how similar the activity of designing World Wide Web
sites is to the design of more "traditional" software and hypermedia products. The very
fact that we are doing a project that attempts to relate web design to the established HCI
literature suggests that we believe there are important similarities between designing for
the web and designing other types of software. Yet there are obviously some important
differences as well -- differences that the field of HCI is only beginning to consider. The
most obvious dissimilarities involve the levels of technical knowledge necessary for
design, and the types of entities that carry out the design process. While the creation of
traditional "stand-alone" software applications requires extensive technical expertise,
and is largely the province of specialized companies, designing web sites requires
relatively little technical knowledge, and can easily be done by almost anyone. But such
surface distinctions, while important to note, are not what primarily concern us. Rather, we are more interested in how the medium of the World Wide Web presents a set
of challenges and issues to designers that are different to those presented to creators of
traditional software products. Although there are undoubtedly some similarities in the
process of creating web sites and stand-alone software, there are also some significant
variations that result from the distinct characteristics of the mediums they are intended
for. Put simply, the WWW is a very different environment from a single computer
system or limited network, and designing applications to be displayed on it presents the
designer with a number of unique issues that they must consider.
Perhaps the most fundamental aspect of the web medium that designers must come to
terms with is that it is platform independent, which means that materials on the web can
be accessed by a wide variety of computer systems and browser software. Because the
WWW is system and browser independent, and because the different systems/browsers
have varying capabilities and features, the designer of a web site does not know and
cannot control:
1) How their pages will be visibly rendered for any particular user (e.g., a
pleasing and coherent layout on one system/browser may look terrible and be
confusing on another), nor
2) What functionality of the site will be supported by the configurations of
different users (e.g., important layout features like tables may not work in all
browsers).
Thus, designers of web sites have to account for the fact that they will have only a
limited amount of control over the interface that their site will present to a visitor. As
Simon Shum notes, "there has never been a hypertext system that was so large that no-one could be sure what hardware or software the end users might be using". The user interface design community has had to get to grips with the concept of designing with this uncertainty. Creators of sites who want their work to be accessible and usable to a wide audience either have to design it in a way that will allow all major systems/browsers to view it effectively (designing for the "lowest common denominator"), or they have to consider providing different versions of the same site that are optimized for different types of users. While the former option may be unacceptable
for designers who want to incorporate the latest technological advances into their sites,
and the latter option requires extra work on the part of designers (who would have to
present multiple versions of the same site), these are really the only options for dealing
with the uncertainty caused by the independent nature of the WWW.
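The two strategies just described, designing for the lowest common denominator versus serving multiple optimized versions, can be sketched as a simple version chooser. The capability flags below are hypothetical; real browser detection is considerably messier.

```typescript
// Hypothetical capability flags a server or page might test for.
interface Capabilities {
  supportsTables: boolean;
  supportsImages: boolean;
}

// Serve the "full" site only when every required feature is present;
// otherwise fall back to a lowest-common-denominator "basic" version.
function chooseVersion(caps: Capabilities): "full" | "basic" {
  return caps.supportsTables && caps.supportsImages ? "full" : "basic";
}
```

A text-only browser such as Lynx would receive the basic version, while a feature-rich browser would receive the full one.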
Level of Interface to User
A second unique feature that has to be considered by designers is that web pages
represent "third-level interfaces" for a user. Above the level of the individual web page, a
user is also interacting with browser software and an operating system, each which
provide their own interfaces to the user. The most important levels to focus on, for our purposes, are those of the browser and of the individual web sites/pages. A web site, as it
is experienced by a visitor, really has a dual-interface: one that is provided by their
browser software, and the other which is provided by the site designer. Both the browser
and the site levels are important, in that each provide mechanisms that determine how a
user will interact with the site and how they will navigate the site. Browsers, for their
part, display the individual web pages, and provide at least a minimal set of navigation
options for the user. Different browsers, however, vary in their capabilities for visually
rendering pages and supporting other features -- ranging from the text-only capabilities
of the Lynx browser to more advanced software packages like the latest versions of
Netscape Navigator and Microsoft Explorer, which support a wide variety of media types
(text, images, video, audio) and features (Java, JavaScript, Vbscript, tables, etc.).
Browsers also vary in the navigation mechanisms that they offer to users. While all
browsers support basic backtracking and jumping movements, the more advanced browsers also incorporate features identified in the hypertext literature as aiding navigation, such as history lists, bookmarking, and footprinting. At the level of the individual
web site, user navigation is affected by the access mechanisms that are presented (e.g.,
site overview maps, tables of contents, navigation bars, etc.), as well as the hypertext
links embedded within the pages. Because this "dual interface" will affect users' interaction with and navigation through a web site, and because the platform-independent nature of the WWW means that site designers cannot know which types of systems and browsers will access their sites, the creators of these sites have only limited
control over the user interface that will be presented to a visitor. Site designers need to
carefully consider a number of issues, therefore, regarding the functionality and
navigation facilities which their site provides, and how these will relate to and be
affected by a variety of browser platforms.
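The basic backtracking that all browsers provide can be modelled as a history list with a current position; visiting a new page discards any "forward" entries. The sketch below is our own illustration of that model, not any browser's actual implementation.

```typescript
// A minimal model of browser backtracking: a list of visited pages
// with a cursor. Visiting a new page truncates the forward history.
class NavHistory {
  private pages: string[] = [];
  private pos = -1;

  visit(url: string): void {
    this.pages = this.pages.slice(0, this.pos + 1); // drop forward entries
    this.pages.push(url);
    this.pos = this.pages.length - 1;
  }

  back(): string | null {
    if (this.pos > 0) this.pos--;
    return this.current();
  }

  forward(): string | null {
    if (this.pos < this.pages.length - 1) this.pos++;
    return this.current();
  }

  current(): string | null {
    return this.pos >= 0 ? this.pages[this.pos] : null;
  }
}
```

Site-level navigation aids (overview maps, navigation bars) sit on top of this browser-level mechanism, which is why the two levels together form the visitor's effective interface.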
Access Speed
A third unique issue that confronts designers of web sites relates to the question of
access speed. Because access to a web site comes via a connection to the global Internet, and is therefore affected by bandwidth constraints and network traffic, users of the
WWW will likely experience some (greater or lesser) delay in the system's response to
their actions. This can cause a number of problems. Slow connections, whatever their
cause, not only serve to frustrate a user -- and increase the chance that they will abandon
a site if it is responding too slowly -- but it also delays feedback to the user as well.
Because connections to web sites are typically asynchronous, the system will respond to
a user only after she takes some action. And if there is too great a delay between action
and reaction, confusion, anxiety, or frustration may result. Discussing hypermedia
systems, Jakob Nielsen notes that "...the response time for the display of the destination
node is critical for the user's feeling of navigating an information space freely". And if
the connection to a particular site is slow, users may feel that they are not fully in
control. While the need for adequate speed is largely taken for granted in most software
application development, and in usability research on these products, it is an important
issue that faces web users and designers alike. Although some of the factors that affect
access time, such as user's connection speed and network traffic levels, are beyond the
control of web designers, there are obviously some steps that site creators can take to
minimize the potential difficulties. In general, web pages that are smaller and less
graphically-intensive will load faster than those which are larger and more graphically
rich. Web designers, therefore, can ensure that their sites will be accessed as quickly as
possible by keeping the file size of their pages fairly low. But such a solution may not
always be considered optimal for designers, who might want to capitalize upon the
multimedia capabilities that the WWW offers. Thus, trade-offs are inevitable, and there
is no single best solution for any case. Such trade-offs between access speed and presentation are a much less important issue for developers of other software products, and as Shum notes, "web designers must therefore prioritise different criteria to the ones they might use in designing a smaller scale hypertext or multimedia CD-ROM, in order to balance interactivity with acceptable speed of access".
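A rough back-of-the-envelope calculation illustrates the trade-off: transfer time grows linearly with page weight for a given connection speed. The figures below are illustrative only, since latency, request count and network congestion also matter.

```typescript
// Estimated transfer time for a page over a given link. This ignores
// latency and per-request overhead, so it is a lower bound at best.
function estimateLoadSeconds(pageKilobytes: number, linkKilobitsPerSec: number): number {
  const kilobits = pageKilobytes * 8; // 1 byte = 8 bits
  return kilobits / linkKilobitsPerSec;
}
```

For example, a 70 KB page over a 56 kbit/s modem takes roughly ten seconds to transfer, long enough to frustrate many visitors, whereas the same page over a faster link is nearly instantaneous.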
Interface Tools
The issues of platform independence, dual user interfaces, and access time all pose
challenges to web authors, who must carefully consider the issues raised by these factors
when deciding how to best design their sites. Unfortunately, they are also faced with the
additional problem of having a much more limited set of interface tools to work with.
Compared to the range of potential tools and techniques available to authors of stand-alone software applications, the web designer has a relatively primitive set of resources
to work with. According to Richard Miller, "HTML's limited set of objects and
interaction styles is a step backwards for interface design compared to the growth of
interactive computing over the last 30 years". Not only do web designers have fewer interface widgets at their disposal, but the nature of the web medium also makes it difficult or impossible to tightly couple relationships between interface elements, or to utilize
some navigation aids identified as beneficial in hypermedia research (such as multiple
windowing, user annotation, zooming, etc.). Thus, web site designers are faced not only with a lack of control over the interfaces that their sites present, but they also have fewer resources to draw upon to maximize the potential of these interfaces.
Nature of the Web
The final special issue regarding web site design to be discussed is how the dynamic
nature of web sites affects their creation. Whereas the first four issues that have been
examined all present problems to the web designer, the dynamism inherent in the WWW
may actually prove advantageous for these authors. In the case of traditional software
development, the design cycle is fairly well bounded, and when the product is released to
the public, there is little or nothing that can be done to change it. This places a great
burden on the development team, who must ensure that the product meets all of its
predefined requirements and is relatively bug-free before it can be released. If problems
arise afterwards, they can only be remedied through costly and time-consuming methods,
and significant changes to the product may have to wait until the next version is
developed. Web sites, however, are much easier to change after they have been
"released" to the public. While this does not mean that site creators can afford to be lax
in their initial design efforts, it does mean that if problems with the site become apparent
after it has been mounted, they are relatively easy to change. This means that the iterative
design cycle for web sites can be much less bounded, and may continue after the site is
implemented in order to modify problem areas. In fact, because of the dynamic nature of
the web medium, it is probable that a site will undergo constant revision and change.
While this offers site designers a greater degree of flexibility, some care needs to be
taken to make sure that the site is not changed so often or so much as to create confusion
among repeat visitors.
The five issues that were discussed above all relate to differences that exist between
designing web sites and traditional software applications. Although these issues may
present special conditions that web site designers must consider, the discussion was not
intended to imply that designing for the web is a more difficult process than creating
other forms of software. In fact, by almost any measure, web authoring is a much simpler
task than creating stand-alone software products. The above discussion was merely
intended to highlight the fact that the process of creating web sites is in some ways
unique, and that designers in this medium are faced with different types of considerations
than those faced by individuals in the software industry. To be sure, there are also some
common considerations that creators in both fields face, such as how to structure the
design process, how to construct a meaningful navigation system for a hyperspace, and
how to create a usable interface. This discussion of the differences between web
authoring and traditional software publishing, therefore, should not suggest that the
existing areas of the HCI literature which are oriented toward "traditional" software
issues are not useful to web designers. In fact, there are many areas within the HCI field
that have a great deal to offer web designers. And with the growing importance of the
WWW, more attention within the HCI community has been directed at this new medium.
Too many individuals who are currently producing web sites seem to feel that this
activity is somehow sui generis, and has little to learn from the body of accumulated
knowledge about such issues as design methodology, hypermedia development, and
interface design. While we agree that the web medium is in some ways unique, we would reject any contention that designing for the web is so different as to render existing work in the field of HCI irrelevant to it. In fact, it is apparent that individuals who produce web sites should be more familiar with what the field of HCI has to offer. The
question can be raised, then, as to what areas of HCI are most relevant to web designers.
The following section of this chapter will address this issue.
Areas in HCI that are important to Web Design
A blanket statement such as "the HCI literature is important for web designers" is not
very useful because the field itself is so broad and varied. Although arguments could be
made for including many different strands of HCI into a discussion of relevant areas for
web design, we will discuss only four areas that we think are particularly significant: the
literatures on software design methodology, hypermedia, user interface design, and
usability. Before moving on to discuss these areas, we feel that a few caveats are in
order. First, given the context of the assignment and the fact that we are addressing several different segments of HCI, our review of the literature in these areas will be fairly selective; we make no pretension of having thoroughly surveyed these four areas. Also, we have attempted, to the degree that is possible, to include fairly recent works that are explicitly oriented toward issues involving the WWW. Finally, we have included a few relevant works that are outside of the HCI field, strictly defined. Our discussion of the literature will not be formally segmented into different sections. Instead, we will
examine various relevant threads in the course of proposing a method for designing web sites that is based upon our interpretation of these literatures.
Although we have spent a considerable amount of time identifying some ways that web design differs from other types of software development, the general processes involved in this activity can be similar to those employed by authors of traditional software. Levi
and Conrad argue that building web sites "can and should be viewed as a major software
development effort.... The life cycle of web creation is identical to that of traditional
software: requirements gathering, analysis, design, implementation, testing, and
deployment". Although they do not identify it specifically by name, it seems apparent
that the general type of methodology that they see being suited for web design is the
User Centered Design (UCD) approach. We would concur that a design effort for the web would be well served by employing a UCD perspective, but would argue that it should be specifically tailored to take into account the specific types of tasks required for authoring a hypermedia application. While the general UCD approach is fairly generic, therefore lending itself to a wide range of projects and design sequences, we believe that it is also flexible enough to be applied to different types of design efforts. Before suggesting such
specifications for a UCD approach to be used in the context of web development,
however, we will identify the basic aspects of the user-centered design process that we
feel make it particularly valuable for web site creators. Then we will examine in greater
detail the specific stages of the web design approach that we believe is most valuable,
drawing on the different areas of the HCI literature that were identified above for
support.
The main strength of the UCD approach, in our opinion, is that it represents a set of
general principles that underline the process of design rather than any specific sequence
of tasks to be carried out. These general principles include an early and continuous focus
upon users and their requirements, an iterative approach that intersperses design efforts and user testing throughout various stages of the development cycle, and an emphasis upon operational criteria for usability assessments. While such a philosophical
underpinning can lend itself to different types of design-phase sequences, the UCD
approach is often used with the fairly standard software design process of requirements
analysis, design, implementation, testing, and maintenance. In general, this type of
process can be useful to employ in the task of designing web sites. Some modifications
should be made in a few areas, however, to recognize the specific challenges involved in
creating a hypermedia information product, to emphasize the value of user testing
throughout the design process, and to recognize that web design is often carried out in
different contexts and by different types of individuals than is the case with traditional
software products.
The earliest stages of designing a web site should involve a modified type of
requirements analysis suggested by the basic software design model. As the general
principles of the UCD approach suggest, much of the emphasis here should be devoted to
identifying the prospective audience for the site and specifying what their needs may be.
Given the distributed nature of the WWW, and the fact that the audience for a particular
site can conceivably be very broad, it is likely that this task can be carried out only at the
level of generalities. But as Shneiderman points out, even when broad user communities
are anticipated, there are usually underlying assumptions about who the primary
audiences may be, and it is helpful to make these assumptions explicit. After identifying
potential users, it is also helpful to assess what kinds of tasks they will likely want or
need to perform when visiting a web site. How this is to be done is a matter of some
controversy. In the development of many traditional software products, a formal task
analysis is carried out, and some authors writing about web site design, such as Rice et
al., seem to favor such an approach. Other writers on software development, however, believe that task analysis can be carried out in a more informal manner, utilizing methods such as user observation or imagined scenarios. We believe that a formalized approach to
task analysis is unlikely to be widely practical or appealing to the web design
community. As Dillon and McKnight note, "...the fact that hypermedia-based interfaces
are frequently being used in novel applications renders it very difficult to perform formal task analysis, specifically in the context of usage, or elicit user requirements to any degree of precision". While Dillon and McKnight were not discussing the WWW
specifically, the extremely distributed nature of the web's user population should only
amplify their sentiments. Beyond the fact that the potentially broad nature of web site audiences makes it hard or impossible to conduct formal task analysis upon users, the
level of specialized knowledge required for utilizing this method is likely to be absent in
many real-world cases of web design. Thus, more informal methods to identify user
requirements may be a more realistic alternative.
While the identification of potential users and their tasks should be an important element
of the early stage of web site development, care must be taken to also consider the goals
and requirements of the site's stakeholders as well. Taken in tandem with the information gained through an analysis of users and their tasks, the articulation of the site owner's purposes should help designers identify the basic information content to be included in the site and the types of features that will need to be incorporated into the design. Such
preparatory work is important to provide a firm foundation for the subsequent design
phases in the site's development.
The actual stage of design for a web site should be carried out in line with the general
principles of the UCD approach. In other words, the process should be an iterative one
that involves developing and testing prototypes at various stages, and the results of these
tests should be fed back into the design efforts. But the generalized model of software
development identified above, which portrays design as a sort of undifferentiated stage,
is not very helpful here, as it provides little guidance about what types of tasks need to be
carried out to effectively design a web site's architecture and interface. It is in this respect
that the Object-Oriented Hypermedia Design Method (OOHDM) proposed by Schwabe
et al. seems to be particularly useful to consider.
Schwabe and his collaborators contend that designing a web site is tantamount to
designing a hypermedia application, and believe that their OOHDM model is directly
applicable to this process. Their model is partially compatible with a UCD approach, in
that the different stages of the design process are "performed in a mix of incremental,
iterative, and prototype-based development styles". The central core of their
methodology, however, is based upon formal modeling, and they eschew the type of user testing that we believe is important to include in web site development. Nonetheless,
the fact that this model is based explicitly upon the specific requirements of hypermedia
development, and the general structure of the design process that they set out makes this
method important to consider. For our purposes, the most valuable and interesting aspect
of OOHDM is that it breaks the design process into separate "activities," each of which
focuses on a different aspect of an application's architecture or interface: concept design,
navigational design, and abstract interface design (which are followed by
implementation). This general structure, and the specific types of concerns and
"products" that they identify as the foci of their different "activities," are very useful to a
web design process, and can, I believe, be incorporated within a generalized usercentered approach. But the specific modeling techniques which they employ are probably
less practical in the context of the web design community, which seems to be largely
comprised of individuals who are not HCI experts. Therefore, we would propose to keep
the "outer shell" of the OOHDM model and incorporate it within a user-centered design
approach, while jettisoning the methodological core of formal modeling. It should be
recognized, therefore, that the discussion of the various phases of the design process that
follows represents my own adaptation of the basic OOHDM structure within a user-centered approach.
The first activity suggested by the OOHDM model is conceptual design. In this stage of
design, the basic topography of the web site will begin to be specified. The earlier work
carried out in the requirements analysis stage should have identified the basic
information content of the web site. The primary task at this stage is to organize this
content into meaningful and understandable categories. General issues that need to be
addressed in this phase are what types of information should be grouped together and
how to organize these groupings within some coherent categorization scheme. More
specific issues may involve decisions on page length (whether to divide related content
into fewer but longer pages, or shorter pages) and labels to be applied to the categories
that have been identified. The product of these efforts will be the identification and
specification of the information nodes that will constitute the core of the web site.
Even in this early stage of design, it is a good idea to conduct user tests, for as Miller
notes, "...the earlier one starts [testing], the larger the payoffs in time savings and user
satisfaction". It is quite possible that the designers may have grouped information and
created categories in ways that do not make sense to potential users, and their
assumptions should therefore be tested. The terminology adopted by designers also needs
to be examined, because as researchers like Gray have found, users often understand
categories and words to have meanings other than the ones the author intended them to
have. One method that can be used as a test of conceptual clarity and terminology is card
sorting. According to Nielsen, "card sorting is a common usability technique that is often
used to discover users' mental models of an information space". In designing the internal
web site for Sun Microsystems, Nielsen has used this method with a small number of
individuals to examine how they think information should be grouped together and what
labels they feel should be applied to the groupings. If users have rather different ideas
from the designers about how information should be organized within the site (or how
items should be labeled), the designers should reconsider their initial categorizations and
redesign as they feel necessary.
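Card-sort results of the kind Nielsen describes can also be summarized quantitatively. The sketch below, using an invented set of cards and three hypothetical participants, counts how often each pair of cards lands in the same pile; pairs grouped together by every participant are strong candidates for a shared category.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(sorts):
    """Count how often each pair of cards lands in the same group.
    `sorts` holds one card sort per participant; each sort is a
    list of groups, and each group is a list of card labels."""
    counts = defaultdict(int)
    for sort in sorts:
        for group in sort:
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] += 1
    return counts

# Three hypothetical participants sorting five cards.
sorts = [
    [["Pricing", "Plans"], ["Docs", "Tutorials", "FAQ"]],
    [["Pricing", "Plans", "FAQ"], ["Docs", "Tutorials"]],
    [["Pricing", "Plans"], ["Docs", "FAQ"], ["Tutorials"]],
]
counts = cooccurrence(sorts)
# Pairs grouped together by every participant are strong candidates
# for a shared category in the site's architecture.
strong = [pair for pair, n in counts.items() if n == len(sorts)]
print(strong)
```

With real data, designers would also look for pairs with low agreement, since those mark categories whose boundaries users find unclear.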
While the concept design phase begins to provide the site with some organization, by
virtue of preparing the information nodes that will be offered, the second stage of
structural and navigational design shapes the way that these nodes will be related to
each other and identifies the means by which the site's structure will be made apparent
and accessible to visitors. There are two primary types of tasks that should be carried out
in this stage. First, the designers need to establish the basic structure and relationships of
the information categories identified in the concept design phase, and determine how the
various nodes will be connected. Decisions have to be made about what type of
organizational structure will be imposed upon the site, whether it be linear, hierarchical,
or some other form. Such decisions may be influenced by the predetermined purposes of
the site and the expected types of tasks that prospective users will perform, as different
kinds of structures lend themselves better to different tasks. Identifying the basic
structure of the site will also allow designers to plan the relationships of categories at
both the global level (relations between different levels of categories) and the local level
(relations between nodes within similar levels), and connect them accordingly.
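The global- and local-level relations described above can be made concrete with a simple site-map structure. The sketch below uses invented section names; `parent` captures global (between-level) relations and `siblings` local (within-level) ones.

```python
# Hypothetical hierarchical site map: each entry maps a page to its
# child pages. Section names are invented for the example.
site = {
    "Home": ["Products", "Support", "About"],
    "Products": ["Widgets", "Gadgets"],
    "Support": ["FAQ", "Contact"],
}

def children(page):
    return site.get(page, [])

def parent(page):
    """Global-level relation: the node one level above `page`."""
    for node, kids in site.items():
        if page in kids:
            return node
    return None

def siblings(page):
    """Local-level relation: nodes that share the page's parent."""
    p = parent(page)
    return [s for s in children(p) if s != page] if p else []

print(parent("FAQ"), siblings("Widgets"))
```

A linear structure would instead chain nodes in a list; the hierarchical form shown here suits sites where visitors drill down from general to specific content.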
After the primary structural framework of the site has been specified, the designers then
need to decide how the topography of the information space will be made apparent and
accessible to visitors. Once site creators have developed a model of the information in a
site, they should begin to prepare navigation tools that will clarify its
organization. This is a critical task, because as research has shown, users of hypertext
systems can often suffer from the problems of disorientation and large cognitive
overhead. Since users may have trouble understanding the structure of the hyper-space
they are in, and since electronic text often suffers from a problem of homogeneity [61],
designers need to take care to make the organization of their site explicit to visitors, and
to provide mechanisms that will allow users to understand their present location and
successfully navigate throughout the site. These issues can be addressed by determining
what types of access structures and navigation tools will be provided to visitors. As was
mentioned earlier, all web browsers provide at least minimal navigation support
(backtracking and jumping), and some of the more popular versions also provide more
advanced options as well (history lists, bookmarking). While these mechanisms can be of
use to a visitor, the site designer cannot count on any particular range of features (except
for the most basic ones offered in all browsers) being available or understandable to the
individuals who are viewing their site. Designers must focus on what they can control,
and therefore must develop a suite of access structures and navigation aids that are clear
and accessible to all visitors, independent of the particular software they are using to
access the site. Consulting the HCI literature, particularly in the areas of hypertext
development, can offer some guidance to site creators on what types of mechanisms can
be adopted. Thuring et al., for example, argue that designers can help increase the
coherence of a site for the user and convey the structure of a hyperspace by providing a
graphical overview. Such a mechanism is widely cited in the literature as being of value.
But because not all users will be able (or choose) to view graphics, other types of access
mechanisms should also be provided. It is therefore suggested that designers employ an
array of navigation devices, including detailed, text-based tables of contents and topical
indexes. Overviews, tables of contents, and indices can help a visitor develop a
sense of the site's organization and structure, and provide means for them to navigate to
desired locations. Designers should also consider how to develop more localized
navigation tools to be used on individual pages as well. Providing users with
well-designed navigation bars on pages can help them maintain a sense of location and
context, while also providing them with an important means to move freely throughout
the information space. In order to be useful to visitors, however, such tools need to be
created so that they are predictable and consistent in the ways that they can be used and
the results that they produce.
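As one illustration of a predictable, consistent localized navigation tool, the sketch below generates a breadcrumb trail from a page's site-relative path. The path scheme and markup are assumptions made for the example, not a prescribed format.

```python
def breadcrumb(path, separator=" > "):
    """Build a breadcrumb trail from a site-relative path such as
    'products/widgets/specs.html'. Every ancestor section becomes a
    link, so the visitor can both see and move to any enclosing
    level of the hierarchy."""
    parts = [p for p in path.strip("/").split("/") if p]
    crumbs, href = ["<a href='/'>Home</a>"], ""
    for part in parts:
        href += "/" + part
        # Derive a readable label from the path segment.
        label = part.rsplit(".", 1)[0].replace("-", " ").title()
        crumbs.append(f"<a href='{href}'>{label}</a>")
    return separator.join(crumbs)

trail = breadcrumb("products/widgets/specs.html")
print(trail)
```

Because the trail is derived mechanically from the page's location, it behaves identically on every page, which is exactly the predictability and consistency the text calls for.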
As was the case with the first stage of the design process, the structural and navigational
design phase should be accompanied by user testing. Basic questions that should be
addressed in the testing are whether potential users understand the overall structure of the
site, whether they can find information in the site, and whether they can effectively
navigate between different sections of the site. The tests might be conducted in a free,
exploratory fashion, in which users are allowed to determine their own course of action,
and designers look for areas of user confusion, slow-down, or mistakes. Or the users can
be given specific scenarios and tasks to accomplish, with designers gauging how well
they performed. In either case, designers will probably want to ask users to "think
aloud" while they work so that their thoughts are made explicit. Because the site's
interface has not yet been developed, the tests will likely have to be conducted through
the use of paper prototypes. The use of such "lo-fi" prototypes is widely accepted as
being a valid technique for usability testing, and as Nielsen points out, "for some
projects, user reactions to prototypes with few or even no working features can give you
significant insight into the usability of your design". When using paper prototypes,
however, testers must take care to "...explain the limitations and missing features to
users. Once this is clear, you can learn a lot from user interaction with what is there --
and learn what their expectations are for what's not". If users experience significant
problems with the design that is presented to them in these prototypes, the creators of the
site need to make necessary adjustments and test their revisions accordingly.
The final stage of the design process is interface design. While the earlier phases of the
site's development have specified its content, organization, and structure, the site still
does not have a "face" to present to a visitor. Developing the "look and feel" of the site
takes place in this stage. There are actually a number of different types of tasks that have
to be performed here: interface elements (including things like icons, buttons, graphics,
etc.) have to be created and selected, basic features of the site (forms, search engines,
applets, etc.) have to be incorporated, and all of these things -- along with the basic
information content -- need to be combined in detailed page layouts. It is likely that the
stage will be carried out in an iterative fashion, in which successively more detailed and
specified interfaces are developed, instead of trying to produce a "final" interface all at
once. As Nielsen notes, "current practice in usability engineering is to refine user
interfaces iteratively since one cannot design them exactly right the first time around".
In this chapter, we have examined the activity of designing World Wide Web sites and
how this relates to the field of Human Computer Interaction. Although I discussed at
some length the ways in which the medium of the web presents unique challenges to
designers -- challenges not yet adequately addressed in the HCI literature -- we have also
attempted to demonstrate that the process of developing web sites can be grounded
within the existing body of work in this field. In doing so, we have proposed a method
for creating web sites that builds upon the several strands from within the HCI literature.
(A short summary of this design method is included below.) Whether or not this
particular model is useful to the people who are actually designing web sites, it is
important that these individuals become more aware of what the field of HCI has to offer
them. For the vast potential of this exciting new medium is being threatened by the
proliferation of confusing and unusable sites. Simon Buckingham Shum feels that "...the
Web, as the fastest growing interactive system in the world, offers a golden opportunity
for HCI to make a difference". And as the web becomes increasingly important as a
means of communication, information sharing, and commerce, we believe that HCI will
begin to have a larger impact upon the web design community. The stakes will be too
high for this field to be ignored.
CHAPTER SIX
CURRENT RESEARCH (UP-AND-COMING AREA)
Gesture Recognition
A primary goal of gesture recognition research is to create a system which can identify
specific human gestures and use them to convey information or for device control.
Also, the primary goal of virtual environments (VE) is to provide natural, efficient,
powerful, and flexible interaction. Gesture as an input modality can help meet these
requirements. Human gestures are certainly natural and flexible, and may often be
efficient and powerful, especially as compared with alternative interaction modes. This
section will cover automatic gesture recognition, particularly computer vision based
techniques that do not require the user to wear extra sensors, clothing or equipment.
The traditional two-dimensional (2D), keyboard- and mouse- oriented graphical user
interface (GUI) is not well suited for virtual environments. Synthetic environments
provide the opportunity to utilize several different sensing modalities and technologies
and to integrate them into the user experience. Devices which sense body position and
orientation, direction of gaze, speech and sound, facial expression, galvanic skin
response, and other aspects of human behavior or state can be used to mediate
communication between the human and the environment. Combinations of
communication modalities and sensing devices can produce a wide range of unimodal
and multimodal interface techniques. The potential for these techniques to support
natural and powerful interfaces for communication in VEs appears promising.
If interaction technologies are overly obtrusive, awkward, or constraining, the user’s
experience with the synthetic environment is severely degraded. If the interaction itself
draws attention to the technology, rather than the task at hand, or imposes a high
cognitive load on the user, it becomes a burden and an obstacle to a successful VE
experience. Therefore, there is focused interest in technologies that are unobtrusive and
passive.
To support gesture recognition, human position and movement must be tracked and
interpreted in order to recognize semantically meaningful gestures. While tracking of a
user’s head position or hand configuration may be quite useful for directly controlling
objects or inputting parameters, people naturally express communicative acts through
higher-level constructs. The output of position (and other) sensing must be interpreted to
allow users to communicate more naturally and effortlessly through gesture.
Gesture is used for control and navigation in CAVEs (Cave Automatic Virtual
Environments) and in other VEs, such as smart rooms, virtual work environments, and
performance spaces. In addition, gesture may be perceived by the environment in order
to be transmitted elsewhere (e.g., as a compression technique, to be reconstructed at the
receiver). Gesture recognition may also influence – intentionally or unintentionally – a
system’s model of the user’s state. For example, a look of frustration may cause a system
to slow down its presentation of information, or the urgency of a gesture may cause the
system to speed up. Gesture may also be used as a communication backchannel (i.e.,
visual or verbal behaviors such as nodding or saying “uh-huh” to indicate “I’m with you,
continue”, or raising a finger to indicate the desire to interrupt) to indicate agreement,
participation, attention, conversation turn taking, etc.
Given that the human body can express a huge variety of gestures, what is appropriate to
sense? Clearly the position and orientation of each body part – the parameters of an
articulated body model – would be useful, as well as features that are derived from those
measurements, such as velocity and acceleration. Facial expressions are very expressive.
More subtle cues such as hand tension, overall muscle tension, locations of self-contact,
and even pupil dilation may be of use.
To help understand what gestures are, an examination of how other researchers view
gestures is useful. How do biologists and sociologists define "gesture"? How is
information encoded in gestures? We also explore how humans use gestures to
communicate with and command other people. Furthermore, engineering researchers
have designed a variety of "gesture" recognition systems - how do they define and use
gestures?
Biological and Sociological Definition and Classification of Gestures
From a biological and sociological perspective, gestures are loosely defined, thus,
researchers are free to visualize and classify gestures as they see fit. Speech and
handwriting recognition research provide methods for designing recognition systems and
useful measures for classifying such systems. Gesture recognition systems which are
used to control memory and display, devices in a local environment, and devices in a
remote environment are examined for the same reason.
People frequently use gestures to communicate. Gestures are used for everything from
pointing at a person to get their attention to conveying information about spatial and
temporal characteristics. Evidence indicates that gesturing does not simply embellish
spoken language, but is part of the language generation process.
Biologists define "gesture" broadly, stating, "the notion of gesture is to embrace all
kinds of instances where an individual engages in movements whose communicative
intent is paramount, manifest, and openly acknowledged". Gestures associated with
speech are referred to as gesticulation. Gestures which function independently of speech
are referred to as autonomous. Autonomous gestures can be organized into their own
communicative language, such as American Sign Language (ASL). Autonomous
gestures can also represent motion commands. In the following subsections, some
various ways in which biologists and sociologists define gestures are examined to
discover if there are gestures ideal for use in communication and device control.
Gesture Dichotomies
One classification method categorizes gestures using four dichotomies: act-symbol,
opacity-transparency, autonomous semiotic (semiotic refers to a general philosophical
theory of signs and systems that deals with their function in both artificially constructed
and natural languages)-multisemiotic, and centrifugal-centripetal (intentional).
The act-symbol dichotomy refers to the notion that some gestures are pure actions, while
others are intended as symbols. For instance, an action gesture occurs when a person
chops wood or counts money, while a symbolic gesture occurs when a person makes the
"okay" sign or puts their thumb out to hitchhike. Naturally, some action gestures can also
be interpreted as symbols (semiogenesis), as illustrated in a spy novel, when an agent
carrying an object in one hand has important meaning. This dichotomy shows that
researchers can use gestures which represent actual motions for use in controlling
devices.
The opacity-transparency dichotomy refers to the ease with which others can interpret
gestures. Transparency is often associated with universality, a belief which states that
some gestures have standard cross-cultural meanings. In reality, gesture meanings are
very culturally dependent. Within a society, gestures have standard meanings, but no
known body motion or gesture has the same meaning in all societies. Even in ASL, few
signs are so clearly transparent that a non-signer can guess their meaning without
additional clues. Fortunately, this means that gestures used for device control can be
freely chosen. Additionally, gestures can be culturally defined to have specific meaning.
The centrifugal-centripetal dichotomy refers to the intentionality of a gesture. Centrifugal
gestures are directed toward a specific object, while centripetal gestures are not.
Researchers usually are concerned with gestures which are directed toward the control of
a specific object or the communication with a specific person or group of people.
Gestures which are elements of an autonomous semiotic system are those used in a
gesture language, such as ASL. On the other hand, gestures which are created as partial
elements of multisemiotic activity are gestures which accompany other languages, such
as oral ones. Gesture recognition researchers are usually concerned with gestures which
are created as their own independent, semiotic language, though there are some
exceptions.
The Nature of Gesture
Gestures are expressive, meaningful body motions – i.e., physical movements of the
fingers, hands, arms, head, face, or body with the intent to convey information or interact
with the environment. Cadoz (1994) described three functional roles of human gesture:
 Semiotic – to communicate meaningful information.
 Ergotic – to manipulate the environment.
 Epistemic – to discover the environment through tactile experience.
Gesture recognition is the process by which gestures made by the user are made known
to the system. One could argue that in GUI-based systems, standard mouse and keyboard
actions used for selecting items and issuing commands are gestures; here the interest is in
less trivial cases. While static position (also referred to as posture, configuration, or
pose) is not technically considered gesture, it is included for the purposes of this section.
In VEs users need to communicate in a variety of ways, to the system itself and also to
other users or remote environments. Communication tasks include specifying commands
and/or parameters for:
 navigating through a space;
 specifying items of interest;
 manipulating objects in the environment;
 changing object values;
 controlling virtual objects; and
 issuing task-specific commands.
In addition to user-initiated communication, a VE system may benefit from observing a
user’s behavior for purposes such as:
 analysis of usability;
 analysis of user tasks;
 monitoring of changes in a user’s state;
 better understanding a user’s intent or emphasis; and
 communicating user behavior to other users or environments.
Messages can be expressed through gesture in many ways. For example, an emotion such
as sadness can be communicated through facial expression, a lowered head position,
relaxed muscles, and lethargic movement. Similarly, a gesture to indicate “Stop!” can be
simply a raised hand with the palm facing forward, or an exaggerated waving of both
hands above the head. In general, there exists a many-to-one mapping from concept to
gesture (i.e., gestures are ambiguous); there is also a many-to-one mapping from gesture
to concept (i.e., gestures are not completely specified). And, like speech and
handwriting, gestures vary among individuals, they vary from instance to instance for a
given individual, and they are subject to the effects of co-articulation.
An interesting real-world example of the use of gestures in visual communications is a
U.S. Army field manual (Anonymous, 1987) that serves as a reference and guide to
commonly used visual signals, including hand and arm gestures for a variety of
situations. The manual describes visual signals used to transmit standardized messages
rapidly over short distances.
Despite the richness and complexity of gestural communication, researchers have made
progress in beginning to understand and describe the nature of gesture. Kendon (1972)
described a “gesture continuum,” depicted in Figure 2, defining five different kinds of
gestures:
 Gesticulation. Spontaneous movements of the hands and arms that accompany
speech.
 Language-like gestures. Gesticulation that is integrated into a spoken utterance,
replacing a particular spoken word or phrase.
 Pantomimes. Gestures that depict objects or actions, with or without
accompanying speech.
 Emblems. Familiar gestures such as “V for victory”, “thumbs up”, and assorted
rude gestures (these are often culturally specific).
 Sign languages. Linguistic systems, such as American Sign Language, which are
well defined.
As the list progresses, the association with speech declines, language properties increase,
spontaneity decreases, and social regulation increases.
Within the first category – spontaneous, speech-associated gesture – McNeill (1992)
defined four gesture types:
 Iconic. Representational gestures depicting some feature of the object, action or
event being described.
 Metaphoric. Gestures that represent a common metaphor, rather than the object
or event directly.
 Beat. Small, formless gestures, often associated with word emphasis.
 Deictic. Pointing gestures that refer to people, objects, or events in space or time.
These types of gesture modify the content of accompanying speech and may often help
to disambiguate speech – similar to the role of spoken intonation. Cassell et al. (1994)
describe a system that models the relationship between speech and gesture and generates
interactive dialogs between three-dimensional (3D) animated characters that gesture as
they speak. These spontaneous gestures (gesticulation in Kendon’s Continuum) make up
some 90% of human gestures. People even gesture when they are on the telephone, and
blind people regularly gesture when speaking to one another. Across cultures, speech-associated gesture is natural and common. For human-computer interaction (HCI) to be
truly natural, technology to understand both speech and gesture together must be
developed.
Despite the importance of this type of gesture in normal human-to-human interaction,
most research to date in HCI, and most VE technology, focuses on the right side of the
continuum, where gestures tend to be less ambiguous, less spontaneous and natural, more
learned, and more culture-specific. Emblematic gestures and gestural languages, although perhaps less
spontaneous and natural, carry more clear semantic meaning and may be more
appropriate for the kinds of command-and-control interaction that VEs tend to support.
The main exception to this is work in recognizing and integrating deictic (mainly
pointing) gestures, beginning with the well-known Put That There system by Bolt
(1980). Some part of this section will focus on symbolic gestures (which include
emblematic gestures and predefined gesture languages) and deictic gestures.
Representations of Gesture
The concept of gesture is loosely defined, and depends on the context of the interaction.
Recognition of natural, continuous gestures requires temporally segmenting gestures.
Automatically segmenting gestures is difficult, and is often finessed or ignored in current
systems by requiring a starting position in time and/or space. Similar to this is the
problem of distinguishing intentional gestures from other “random” movements. There is
no standard way to do gesture recognition – a variety of representations and
classification schemes are used. However, most gesture recognition systems share some
common structure.
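One crude way to finesse the temporal segmentation problem is to treat sustained motion as gesture and near-rest as a boundary. The sketch below, whose hand-speed samples and threshold values are invented for illustration, splits a speed stream into candidate gesture segments; a real system would also have to cope with co-articulation and unintentional movement.

```python
def segment_gestures(speeds, threshold=0.2, min_len=3):
    """Split a stream of hand-speed samples into candidate gesture
    segments: runs of at least `min_len` frames whose speed exceeds
    a rest threshold. Returns (start, end) frame index pairs."""
    segments, start = [], None
    for i, v in enumerate(speeds):
        if v > threshold and start is None:
            start = i                      # motion begins
        elif v <= threshold and start is not None:
            if i - start >= min_len:       # long enough to count
                segments.append((start, i))
            start = None                   # back at rest
    if start is not None and len(speeds) - start >= min_len:
        segments.append((start, len(speeds)))
    return segments

# Invented hand-speed samples: two bursts of motion around rest.
speeds = [0.0, 0.1, 0.5, 0.9, 0.7, 0.1, 0.0, 0.6, 0.8, 0.9, 0.1]
print(segment_gestures(speeds))
```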
Gestures can be static, where the user assumes a certain pose or configuration, or
dynamic, defined by movement. McNeill (1992) defines three phases of a dynamic
gesture: pre-stroke, stroke, and post-stroke. Some gestures have both static and dynamic
elements, where the pose is important in one or more of the gesture phases; this is
particularly relevant in sign languages. When gestures are produced continuously, each
gesture is affected by the gesture that preceded it, and possibly by the gesture that
follows it. These co-articulations may be taken into account as a system is trained.
There are several aspects of a gesture that may be relevant and therefore may need to be
represented explicitly. Hummels and Stappers (1998) describe four aspects of a gesture
which may be important to its meaning:
 Spatial information – where it occurs, locations a gesture refers to.
 Pathic information – the path that a gesture takes.
 Symbolic information – the sign that a gesture makes.
 Affective information – the emotional quality of a gesture.
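These four aspects could be captured explicitly in a simple record type. The sketch below is illustrative only; the field names and values are assumptions for the example, not a standard representation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GestureObservation:
    """One observed gesture, recording the four aspects above.
    Field names and values are illustrative, not a standard format."""
    spatial: Tuple[float, float, float]                  # where it occurs
    path: List[Tuple[float, float, float]] = field(default_factory=list)
    symbol: str = ""                                     # sign it makes
    affect: str = "neutral"                              # emotional quality

wave = GestureObservation(
    spatial=(0.4, 1.2, 0.1),
    path=[(0.4, 1.2, 0.1), (0.5, 1.3, 0.1), (0.4, 1.2, 0.1)],
    symbol="wave",
    affect="friendly",
)
print(wave.symbol, len(wave.path))
```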
In order to infer these aspects of gesture, human position, configuration, and movement
must be sensed. This can be done directly with sensing devices such as magnetic field
trackers, instrumented gloves, and datasuits, which are attached to the user, or indirectly
using cameras and computer vision techniques. Each sensing technology differs along
several dimensions, including accuracy, resolution, latency, range of motion, user
comfort, and cost. The integration of multiple sensors in gesture recognition is a complex
task, since each sensing technology varies along these dimensions.
Although the output from these sensors can be used to directly control parameters such
as navigation speed and direction or movement of a virtual object, here the interest is
primarily in the interpretation of sensor data to recognize gestural information.
The output of initial sensor processing is a time-varying sequence of parameters
describing positions, velocities, and angles of relevant body parts and features. These
should (but often do not) include a representation of uncertainty that indicates limitations
of the sensor and processing algorithms. Recognizing gestures from these parameters is a
pattern recognition task that typically involves transforming input into the appropriate
representation (feature space) and then classifying it from a database of predefined
gesture representations. The parameters produced by the sensors may be transformed
into a global coordinate space, processed to produce sensor-independent features, or used
directly in the classification step.
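The pipeline just described -- transform sensor output into a sensor-independent feature space, then classify against stored gesture representations -- can be sketched minimally. The normalization scheme and the two-joint "poses" below are invented for illustration.

```python
import math

def normalize(points):
    """Translate joints so the first point (say, the wrist) is the
    origin and scale so the farthest joint is at distance 1, giving
    a crude position- and size-invariant feature vector."""
    ox, oy = points[0]
    shifted = [(x - ox, y - oy) for x, y in points]
    scale = max(math.hypot(x, y) for x, y in shifted) or 1.0
    return [(x / scale, y / scale) for x, y in shifted]

def classify(sample, templates):
    """Return the label of the stored template closest to `sample`."""
    def dist(a, b):
        return sum(math.hypot(ax - bx, ay - by)
                   for (ax, ay), (bx, by) in zip(a, b))
    return min(templates, key=lambda label: dist(sample, templates[label]))

# Hypothetical two-joint "poses": wrist plus fingertip.
templates = {
    "point_right": normalize([(0, 0), (2, 0)]),
    "point_up":    normalize([(0, 0), (0, 2)]),
}
sample = normalize([(5, 5), (8.9, 5.2)])  # roughly rightward, elsewhere in space
print(classify(sample, templates))
```

Because both sample and templates pass through the same normalization, the classifier captures some of the "invariant properties" the text mentions: the sample matches "point_right" even though it occurs at a different position and scale than the template.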
Because gestures are highly variable, from one person to another and from one example
to another within a single person, it is essential to capture the essence of a gesture – its
invariant properties – and use this to represent the gesture. Besides the choice of
representation itself, a significant issue in building gesture recognition systems is how to
create and update the database of known gestures. Hand-coding gestures to be
recognized only works for trivial systems; in general, a system needs to be trained
through some kind of learning. As with speech recognition systems, there is often a
tradeoff between accuracy and generality – the more accuracy desired, the more user-specific training is required. In addition, systems may be fully trained before use, or
they may adapt over time to the current user.
Static gesture, or pose, recognition can be accomplished by a straightforward
implementation, using template matching, geometric feature classification, neural
networks, or other standard pattern recognition techniques to classify pose. Dynamic
gesture recognition, however, requires consideration of temporal events. This is typically
accomplished through the use of techniques such as time-compressing templates,
dynamic time warping, hidden Markov models (HMMs), and Bayesian networks. Some
examples will be presented in the following sections.
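Of the dynamic techniques listed, dynamic time warping is the simplest to illustrate. The sketch below compares 1-D trajectories invented for the example; the same wave-like gesture performed at a different speed scores a much smaller distance than an unrelated hold gesture of equal length.

```python
def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences,
    letting gestures performed at different speeds be compared.
    Real systems warp multi-dimensional feature trajectories."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best way to reach (i, j): stretch a, stretch b, or advance both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Invented 1-D trajectories: a wave performed fast and slow, and a hold.
wave_fast = [0, 1, 0, -1, 0]
wave_slow = [0, 0.5, 1, 0.5, 0, -0.5, -1, -0.5, 0]
hold      = [1, 1, 1, 1, 1, 1, 1, 1, 1]
print(dtw(wave_fast, wave_slow), dtw(wave_fast, hold))
```

A nearest-template classifier over DTW distances would therefore match the two waves despite their different durations, which is exactly the property that makes time-warping methods attractive for dynamic gesture recognition.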
Gesture Typologies
Another standard gesture classification scheme uses three categories: arbitrary, mimetic,
and deictic.
In mimetic gestures, motions form an object's main shape or representative feature. For
instance, a chin sweeping gesture can be used to represent a goat by alluding to its beard.
These gestures are intended to be transparent. Mimetic gestures are useful in gesture
language representations.
Deictic gestures are used to point at important objects, and each gesture is transparent
within its given context. These gestures can be specific, general, or functional. Specific
gestures refer to one object. General gestures refer to a class of objects. Functional
gestures represent intentions, such as pointing to a chair to ask for permission to sit.
Deictic gestures are also useful in gesture language representations.
Arbitrary gestures are those whose interpretation must be learned due to their opacity.
Although they are not common in a cultural setting, once learned they can be used and
understood without any complementary verbal information. An example is the set of
gestures used for crane operation. Arbitrary gestures are useful because they can be
specifically created for use in device control, defined as needed and, once learned,
understood without additional verbal information.
Voice and Handwriting Recognition: Parallel Issues for Gesture Recognition
Speech and handwriting recognition systems are similar to gesture recognition systems,
because all of these systems perform recognition of something that moves, leaving a
"trajectory" in space and time. By exploring the literature of speech and handwriting
recognition, classification and identification schemes can be studied which might aid in
developing a gesture recognition system.
Typical speech recognition systems match transformed speech against a stored
representation. Most systems use some form of spectral representation, such as spectral
templates or hidden Markov models (HMM). Speech recognition systems are classified
along the following dimensions:
 Speaker-dependent versus speaker-independent: Can the system recognize the speech of many different individuals without training, or does it have to be trained for a specific voice? Currently, speaker-dependent systems are more accurate, because they do not need to account for large variations in words.
 Discrete or Continuous: Does the speaker need to separate individual words by
short silences or can the system recognize continuous sentences? Isolated-word
recognition systems have a high accuracy rate, in part because the systems know
when each word has ended.
 Vocabulary size: The vocabulary is usually task-dependent. All other things being equal, a small vocabulary is easier to recognize than a large one.
 Recognition Rate: Commercial products strive for at least a 95% recognition
rate. Although this rate seems very high, these results occur in laboratory
environments. Also, studies have shown that humans have an individual word
recognition rate of 99.2%.
State of the art speech recognition systems, which have the capability to understand a
large vocabulary, use HMMs. HMMs are also used by a number of gesture recognition
systems (see also the literature on control of memory and display). In some speech
recognition systems, the states of an HMM represent phonetic units. A state transition
defines the probability of the next state's occurrence. The term hidden refers to the type
of Markov model in which the observations are a probabilistic function of the current
state. A complete specification of a hidden Markov model requires the following
information: the state transition probability distribution, the observation symbol
probability distribution, and the initial state distribution. An HMM is created for each
word (string of phonemes) in a given lexicon. One of the tasks in isolated speech
recognition is to measure an observed sequence of phonetic units and determine which
HMM was most likely to generate such a sequence.
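The three distributions listed above (state transitions, observation symbols, and initial states) are enough to score an observation sequence against a word model; in isolated-word recognition, the word whose HMM scores highest wins. The following sketch implements the standard forward algorithm for that scoring step. The two-state model and all probabilities are invented for illustration.

```python
import numpy as np

def forward_likelihood(A, B, pi, observations):
    """Probability that an HMM generated the observation sequence.

    A  -- state transition probability distribution (N x N)
    B  -- observation symbol probability distribution (N states x M symbols)
    pi -- initial state distribution (N)
    """
    alpha = pi * B[:, observations[0]]       # initialize with the first symbol
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]      # propagate states, then emit
    return alpha.sum()

# Toy two-state model with two observable symbols (all numbers invented).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition probabilities
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation symbol probabilities
pi = np.array([0.6, 0.4])                 # initial state distribution

print(forward_likelihood(A, B, pi, [0, 1, 1]))
```

To recognize an isolated word (or gesture), one would evaluate the observed sequence under every model in the lexicon and pick the model with the highest likelihood.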
From some points of view, handwriting can be considered a type of gesture. On-line
(also called "real time" or "dynamic") recognition machines identify handwriting as a
user writes. On-line devices have the advantage of capturing the dynamic information of
writing, including the number of strokes, the ordering of strokes, and the direction and
velocity profile of each stroke. On-line recognition systems are also interactive, allowing
users to correct recognition errors, adapt to the system, or see the immediate results of an
editing command.
Most on-line tablets capture writing as a sequence of coordinate points. Recognition is
complicated, in part, because there are many different ways of generating the same
character. For example, the letter E's four lines can be drawn in any order.
Handwriting tablets must take into account character blending and merging, which is
similar to the continuous speech problem. Also, different characters can look quite
similar. To tackle these problems, handwriting tablets pre-process the characters, and
then perform some type of shape recognition. Preprocessing typically involves properly
spacing the characters and filtering out noise from the tablet. The more complicated
processing occurs during character recognition.
Features based on both static and dynamic character information can be used for
recognition. Some systems using binary decision trees prune possible characters by
examining simple features first, such as searching for the dots above the letters "i" and
"j". Other systems create zones which define the directions a pen point can travel
(usually eight), and a character is defined in terms of a connected set of zones. A lookup
table or a dictionary is used to classify the characters.
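A minimal sketch of this zone-based approach follows, assuming eight directional zones and a small dictionary. The stroke data, the zone numbering (0 = east, counter-clockwise), and the dictionary entry are all invented for illustration.

```python
import math

def direction_zones(points, n_zones=8):
    """Quantize successive pen positions into one of n_zones directions,
    as in zone-based handwriting recognizers."""
    zones = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Angle of travel between consecutive sampled pen positions.
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        zones.append(int(round(angle / (2 * math.pi / n_zones))) % n_zones)
    return zones

# A stroke moving right, then up: zones 0 (east) then 2 (north).
stroke = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(direction_zones(stroke))  # -> [0, 0, 2, 2]

# A hypothetical lookup table mapping zone sequences to character labels.
lookup = {(0, 0, 2, 2): "corner-stroke"}
print(lookup.get(tuple(direction_zones(stroke))))  # -> corner-stroke
```

A production system would also normalize for stroke length and allow approximate matches, but the sequence-of-zones representation is the core idea.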
Another scheme draws its classification method from signal processing, in which curves
from unknown forms are matched against prototype characters. They are matched as
functions of time or as Fourier coefficients. To reduce errors, an elastic matching scheme
(stretching and bending drawn curves) is used. These methods tend to be
computationally intensive.
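Elastic matching of this kind is commonly realized with dynamic time warping, which the text also lists among the techniques for dynamic gesture recognition. The sketch below matches two 1-D curves while allowing local stretching and compression in time; the sample curves are invented for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D curves: an elastic
    match that tolerates local stretching and compression in time."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch curve a
                                 D[i][j - 1],      # stretch curve b
                                 D[i - 1][j - 1])  # match point to point
    return D[n][m]

# The same ramp drawn at different speeds matches perfectly...
slow = [0, 0, 1, 2, 3, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))          # -> 0.0
# ...while a reversed (different) shape accumulates a large distance.
print(dtw_distance(slow, [3, 2, 1, 0]))  # -> 12.0
```

The quadratic table fill illustrates why these methods are described as computationally intensive; practical systems restrict the warping window to keep the cost manageable.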
Alternatively, pen strokes can be divided into basic components, which are then
connected by rules and matched to characters. This method is called Analysis-by-Synthesis. Similar systems use dynamic programming methods to match real and
modeled strokes.
This examination of handwriting tablets reveals that the dynamic features of characters
make on-line recognition possible and, as in speech, it is easier to recognize isolated
characters. Most systems lag in recognition by more than a second, and the recognition
rates are not very high; reported rates of 95% are achieved only with very careful writing. Such systems are best used for filling out forms which have predefined prototypes and set areas for characters. For a more detailed overview of handwriting tablets, consult the literature.
Pen-based gesture recognition
Recognizing gestures from 2D input devices such as a pen or mouse has been considered
for some time. The early Sketchpad system in 1963 used light-pen gestures, for example.
Some commercial systems have used pen gestures since the 1970s. There are examples
of gesture recognition for document editing, for air traffic control, and for design tasks
such as editing splines. More recently, systems such as the OGI QuickSet system have
demonstrated the utility of pen-based gesture recognition in concert with speech
recognition to control a virtual environment. QuickSet recognizes 68 pen gestures,
including map symbols, editing gestures, route indicators, area indicators, and taps.
Oviatt (1996) has demonstrated significant benefits of using speech and pen gestures
together in certain tasks. Zeleznik (1996) and Landay and Myers (1995) developed
interfaces that recognize gestures from pen-based sketching.
A significant benefit of pen-based gestural systems is that sensing and interpretation is
relatively straightforward as compared with vision-based techniques. There have been
commercially available Personal Digital Assistants (PDAs) for several years, starting
with the Apple Newton, and more recently the 3Com PalmPilot and various Windows
CE devices. These PDAs perform handwriting recognition and allow users to invoke
operations by various, albeit quite limited, pen gestures. Long, Landay, and Rowe (1998)
survey problems and benefits of these gestural interfaces and provide insight for interface
designers.
Although pen-based gesture recognition is promising for many HCI environments, it
presumes the availability of, and proximity to, a flat surface or screen. In VEs, this is
often too constraining – techniques that allow the user to move around and interact in
more natural ways are more compelling. The next two sections cover two primary
technologies for gesture recognition in virtual environments: instrumented gloves and
vision-based interfaces.
Tracker-based gesture recognition
There are a number of commercially available tracking systems, which can be used as
input to gesture recognition, primarily for tracking eye gaze, hand configuration, and
overall body position. Each sensor type has its strengths and weaknesses in the context of
VE interaction. While eye gaze can be useful in a gestural interface, the focus here is on
gestures based on input from tracking the hands and body.
Instrumented gloves
People naturally use their hands for a wide variety of manipulation and communication
tasks. Besides being quite convenient, hands are extremely dexterous and expressive,
with approximately 29 degrees of freedom (including the wrist). In his comprehensive
thesis on whole hand input, Sturman (1992) showed that the hand can be used as a
sophisticated input and control device in a wide variety of application domains,
providing real-time control of complex tasks with many degrees of freedom. He analyzed
task characteristics and requirements, hand action capabilities, and device capabilities,
and discussed important issues in developing whole-hand input techniques. Sturman
suggested a taxonomy of whole-hand input that categorizes input techniques along two
dimensions:
 Classes of hand actions: continuous or discrete.
 Interpretation of hand actions: direct, mapped, or symbolic.
The resulting six categories describe the styles of whole-hand input. A given interaction task can then be evaluated to determine which style best suits it. Mulder (1996) presented an
overview of hand gestures in human-computer interaction, discussing the classification
of hand movement, standard hand gestures, and hand gesture interface design.
For several years, commercial devices have been available which measure, to various
degrees of precision, accuracy, and completeness, the position and configuration of the
hand. These include “data gloves” and exoskeleton devices mounted on the hand and
fingers (the term “instrumented glove” is used to include both types).
Some advantages of instrumented gloves include:
 direct measurement of hand and finger parameters (joint angles, 3D spatial
information, wrist rotation);
 provides data at a high sampling frequency;
 easy to use;
 no line-of-sight occlusion problems;
 relatively low cost versions available; and
 data is translation-independent (within the range of motion).
Disadvantages of instrumented gloves include:
 calibration can be difficult;
 tethered gloves reduce range of motion and comfort;
 data from inexpensive systems can be very noisy;
 accurate systems are expensive; and
 the user is forced to wear a somewhat cumbersome device.
Many projects have used hand input from instrumented gloves for “point, reach, and
grab” operations or more sophisticated gestural interfaces. Latoschik and Wachsmuth
(1997) present a multi-agent architecture for detecting pointing gestures in a multimedia
application. Väänänen and Böhm (1992) developed a neural network system that
recognized static gestures and allows the user to interactively teach new gestures to the
system. Böhm et al. (1994) extend that work to dynamic gestures using a Kohohen
Feature Map (KFM) for data reduction.
Baudel and Beaudouin-Lafon (1993) developed a system to provide gestural input to a
computer while giving a presentation – this work included a gesture notation and set of
guidelines for designing gestural command sets. Fels and Hinton (1995) used an adaptive
neural network interface to translate hand gestures to speech. Kadous (1996) used glove
input to recognize Australian sign language; Takahashi and Kishino (1991) for the
Japanese Kana manual alphabet. The system of Lee and Xu (1996) could learn and
recognize new gestures online.
Despite the fact that many, if not most, gestures involve two hands, most of the research
efforts in glove-based gesture recognition use only one glove for input. The features that
are used for recognition, and the degree to which dynamic gestures are considered vary
quite a bit.
The HIT Lab at the University of Washington developed GloveGRASP, a C/C++ class
library that allows software developers to add gesture recognition capabilities to SGI
systems, including user-dependent training and one- or two-handed gesture recognition.
A commercial version of this system is available from General Reality.
Body suits
It is well known that, by viewing only a small number of strategically placed dots on the human body, people can easily perceive complex movement patterns such as the activities, gestures, identities, and other aspects of bodies in motion. One way to approach the recognition of human movements and postures is to optically measure the 3D positions of several such markers attached to the body and then recover the time-varying articulated structure of the body. The articulated structure may also be measured
more directly by sensing joint angles and positions using electromechanical body
sensors. Although some of the optical systems only require dots or small balls to be
placed on top of a user’s clothing, all of these body motion capture systems are referred
to herein generically as “body suits.”
Body suits have advantages and disadvantages that are similar to those of instrumented
gloves: they can provide reliable data at a high sampling rate (at least for electromagnetic
devices), but they are expensive and very cumbersome. Calibration is typically non-trivial. Optical systems usually use several cameras and process their data offline –
their major advantage is the lack of wires and a tether.
Body suits have been used, often along with instrumented gloves, in several gesture
recognition systems. Wexelblat (1994) implemented a continuous gesture analysis
system using a data suit, “data gloves,” and an eye tracker. In this system, data from the
sensors is segmented in time (between movement and inaction), key features are
extracted, motion is analyzed, and a set of special-purpose gesture recognizers look for
significant changes. Marrin and Picard (1998) have developed an instrumented jacket for
an orchestral conductor that includes physiological monitoring to study the correlation
between affect, gesture, and musical expression.
Although current optical and electromechanical tracking technologies are cumbersome
and therefore contrary to the desire for more natural interfaces, it is likely that advances
in sensor technology will enable a new generation of devices (including stationary field
sensing devices, gloves, watches, and rings) that are just as useful as current trackers but
much less obtrusive. Similarly, instrumented body suits, which are currently exceedingly
cumbersome, may be displaced by sensing technologies embedded in belts, shoes,
eyeglasses, and even shirts and pants. While sensing technology has a long way to go to
reach these ideals, passive sensing using computer vision techniques is beginning to
make headway as a user-friendly interface technology.
Note that although some of the body tracking methods in this section use cameras and
computer vision techniques to track joint or limb positions, they require the user to wear
special markers. In the next section only passive techniques that do not require the user
to wear any special markers or equipment are considered.
Passive vision-based gesture recognition
The most significant disadvantage of the tracker-based systems is that they are
cumbersome. This detracts from the immersive nature of a VE by requiring the user to
don an unnatural device that cannot easily be ignored, and which often requires
significant effort to put on and calibrate. Even optical systems with markers applied to
the body suffer from these shortcomings, albeit not as severely. What many have wished
for is a technology that provides real-time data useful for analyzing and recognizing
human motion that is passive and non-obtrusive. Computer vision techniques have the
potential to meet these requirements.
Vision-based interfaces use one or more cameras to capture images, at a frame rate of 30
Hz or more, and interpret those images to produce visual features that can be used to
interpret human activity and recognize gestures. Typically the camera locations are fixed
in the environment, although they may also be mounted on moving platforms or on other
people. For the past decade, there has been a significant amount of research in the
computer vision community on detecting and recognizing faces, analyzing facial
expression, extracting lip and facial motion to aid speech recognition, interpreting human
activity, and recognizing particular gestures.
Unlike sensors worn on the body, vision approaches to body tracking have to contend
with occlusions. From the point of view of a given camera, there are always parts of the
user’s body that are occluded and therefore not visible – e.g., the backside of the user is
not visible when the camera is in front. More significantly, self-occlusion often prevents
a full view of the fingers, hands, arms, and body from a single view. Multiple cameras
can be used, but this adds correspondence and integration problems.
The occlusion problem makes full body tracking difficult, if not impossible, without a
strong model of body kinematics and perhaps dynamics. However, recovering all the
parameters of body motion may not be a prerequisite for gesture recognition. The fact
that people can recognize gestures leads to three possible conclusions:
(1) The parameters that cannot be directly observed are inferred.
(2) These parameters are not needed to accomplish the task.
(3) Some are inferred and others are ignored.
It is a mistake to consider vision and tracking devices (such as instrumented gloves and
body suits) as alternative paths to the same end. Although there is overlap in what they
can provide, these technologies in general produce qualitatively and quantitatively
different outputs which enable different analysis and interpretation. For example,
tracking devices can in principle detect fast and subtle movements of the fingers while a
user is waving his hands, while human vision in that case will at best get a general sense
of the type of finger motion. Similarly, vision can use properties like texture and color in
its analysis of gesture, while tracking devices do not. From a research perspective, these
observations imply that it may not be an optimal strategy to merely substitute vision at a
later date into a system that was developed to use an instrumented glove or a body suit –
or vice versa.
Unlike special devices that measure human position and motion, vision uses a
multipurpose sensor; the same device used to recognize gestures can be used to
recognize other objects in the environment and also to transmit video for
teleconferencing, surveillance, and other purposes. There is a growing interest in CMOS-based cameras, which promise miniaturized, low-cost, low-power cameras integrated
with processing circuitry on a single chip. With its integrated processing, such a sensor
could conceivably output motion or gesture parameters to the virtual environment.
Currently, most computer vision systems for recognition look something like Figure 3.
Analog cameras feed their signal into a digitizer board, or framegrabber, which may do a
DMA transfer directly to host memory. Digital cameras bypass the analog-to-digital
conversion and go straight to memory. There may be a preprocessing step, where images
are normalized, enhanced, or transformed in some manner, and then a feature extraction
step. The features – which may be any of a variety of 2D or 3D features, statistical
properties, or estimated body parameters – are analyzed and classified as a particular
gesture if appropriate.
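The pipeline just described (capture, preprocessing, feature extraction, classification) might be sketched as follows. The specific features (image moments of a thresholded silhouette), the gesture prototypes, and the test frame are all invented for illustration.

```python
import numpy as np

def preprocess(frame):
    """Normalize the image to zero mean / unit variance (one common choice)."""
    frame = frame.astype(float)
    return (frame - frame.mean()) / (frame.std() + 1e-8)

def extract_features(frame):
    """Crude 2D features: first- and second-order image moments of the
    above-threshold region (centroid and spread)."""
    ys, xs = np.nonzero(frame > 0)
    if len(xs) == 0:
        return np.zeros(4)
    return np.array([xs.mean(), ys.mean(), xs.std(), ys.std()])

def classify(features, prototypes):
    """Nearest-prototype classification of the feature vector."""
    return min(prototypes, key=lambda g: np.linalg.norm(features - prototypes[g]))

# Hypothetical 8x8 frame containing a bright rectangular "hand" region.
frame = np.zeros((8, 8))
frame[2:6, 3:5] = 255
# Hypothetical feature prototypes for two gestures.
prototypes = {"point": np.array([3.5, 3.5, 0.5, 1.1]),
              "wave":  np.array([6.0, 1.0, 2.0, 0.3])}

features = extract_features(preprocess(frame))
print(classify(features, prototypes))  # -> point
```

Each stage in a real system is far more sophisticated, but the data flow from digitized frame to classified gesture follows this shape.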
Vision-based systems for gesture recognition vary along a number of dimensions, most
notably:
 Number of cameras. How many cameras are used? If more than one, are they
combined early (stereo) or late (multi-view)?
 Speed and latency. Is the system real-time (i.e., fast enough, with low enough
latency, to support interaction)?
 Structured environment. Are there restrictions on the background, the lighting,
the speed of movement, etc.?
 User requirements. Must the user wear anything special (e.g., markers, gloves,
long sleeves)? Is anything disallowed (e.g., glasses, beard, rings)?
 Primary features. What low-level features are computed (edges, regions,
silhouettes, moments, histograms, etc.)?
 Two- or three-dimensional representation. Does the system construct a 3D model
of the body part(s), or is classification done on some other (view-based)
representation?
 Representation of time. How is the temporal aspect of gesture represented and
used in recognition (e.g., via a state machine, dynamic time warping, HMMs,
time-compressed template)?
Head and face gestures
When people interact with one another, they use an assortment of cues from the head and
face to convey information. These gestures may be intentional or unintentional, they may
be the primary communication mode or backchannels, and they can span the range from
extremely subtle to highly exaggerated. Some examples of head and face gestures include:
 nodding or shaking the head;
 direction of eye gaze;
 raising the eyebrows;
 opening the mouth to speak;
 winking;
 flaring the nostrils; and
 looks of surprise, happiness, disgust, anger, sadness, etc.
People display a wide range of facial expressions. Ekman and Friesen (1978) developed
a system called FACS for measuring facial movement and coding expression; this
description forms the core representation for many facial expression analysis systems.
A real-time system to recognize actions of the head and facial features was developed by
Zelinsky and Heinzmann (1996), who used feature template tracking in a Kalman filter
framework to recognize thirteen head/face gestures. Moses et al. (1995) used fast contour
tracking to determine facial expression from a mouth contour. Essa and Pentland (1997)
used optical flow information with a physical muscle model of the face to produce
accurate estimates of facial motion. This system was also used to generate spatio-temporal motion-energy templates of the whole face for each different expression – these templates were then used for expression recognition. Oliver et al. (1997) describe a real-time system for tracking the face and mouth that recognized facial expressions and head
movements. Otsuka and Ohya (1998) model coarticulation in facial expressions and use
an HMM for recognition.
Black and Yacoob (1995) used local parametric motion models to track and recognize
both rigid and non-rigid facial motions. Demonstrations of this system show facial
expressions being detected from television talk show guests and news anchors (in non-real time). La Cascia et al. (1998) extended this approach using texture-mapped surface
models and non-planar parameterized motion models to better capture the facial motion.
Hand and Arm gestures
Hand and arm gestures receive the most attention among those who study gesture – in
fact, many (if not most) references to gesture recognition only consider hand and arm
gestures. The vast majority of automatic recognition systems are for deictic gestures
(pointing), emblematic gestures (isolated signs), and sign languages (with a limited
vocabulary and syntax). Some are components of bimodal systems, integrated with
speech recognition. Some produce precise hand and arm configurations, while others produce only coarse motion.
Stark and Kohler (1995) developed the ZYKLOP system for recognizing hand poses and
gestures in real-time. After segmenting the hand from the background and extracting
features such as shape moments and fingertip positions, the hand posture is classified.
Temporal gesture recognition is then performed on the sequence of hand poses and their
motion trajectory. A small number of hand poses comprises the gesture catalog, while a
sequence of these makes a gesture. Similarly, Maggioni and Kämmerer (1998) described
the GestureComputer, which recognized both hand gestures and head movements. Other
systems that recognize hand postures amidst complex visual backgrounds are reported by
Weng and Cui (1998) and Triesch and von der Malsburg (1996).
There has been a lot of interest in creating devices to automatically interpret various sign
languages to aid the deaf community. One of the first to use computer vision without
requiring the user to wear anything special was built by Starner (1995), who used HMMs
to recognize a limited vocabulary of ASL sentences. A more recent effort, which uses
HMMs to recognize Sign Language of the Netherlands is described by Assan and Grobel
(1997).
The recognition of hand and arm gestures has been applied to entertainment applications.
Freeman et al. (1996) developed a real-time system to recognize hand poses using image
moments and orientation histograms, and applied it to interactive video games. Cutler
and Turk (1998) described a system for children to play virtual instruments and interact
with lifelike characters by classifying measurements based on optical flow. A nice
overview of work up to 1995 in hand gesture modeling, analysis, and synthesis is
presented by Huang and Pavlovic (1995).
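Orientation histograms of the kind used by Freeman et al. summarize the distribution of local edge directions in an image, giving a compact, lighting-tolerant description of hand shape. The sketch below is a simplified approximation, not their exact formulation; the silhouette image and bin count are invented.

```python
import numpy as np

def orientation_histogram(image, n_bins=8):
    """Histogram of local gradient orientations, weighted by gradient
    magnitude, normalized to sum to one."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    # Quantize each pixel's orientation into one of n_bins direction bins.
    bins = (angle / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist / (hist.sum() + 1e-8)   # normalize for scale invariance

# Hypothetical 16x16 silhouette of a hand region.
image = np.zeros((16, 16))
image[4:12, 6:10] = 1.0
h = orientation_histogram(image)
print(h.round(3))
```

Pose recognition then reduces to comparing the observed histogram against stored histograms for each known pose, e.g. by Euclidean distance.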
Body gestures
This section includes tracking full body motion, recognizing body gestures, and
recognizing human activity. Activity may be defined over a much longer period of time
than what is normally considered a gesture; for example, two people meeting in an open
area, stopping to talk and then continuing on their way may be considered a recognizable
activity. Bobick (1997) proposed a taxonomy of motion understanding in terms of:
 Movement. The atomic elements of motion.
 Activity. A sequence of movements or static configurations.
 Action. High-level description of what is happening in context.
Most research to date has focused on the first two levels.
The Pfinder system (Wren et al., 1996) developed at the MIT Media Lab has been used
by a number of groups to do body tracking and gesture recognition. It forms a 2D
representation of the body, using statistical models of color and shape. The body model
provides an effective interface for applications such as video games, interpretive dance,
navigation, and interaction with virtual characters. Lucente et al. (1998) combined
Pfinder with speech recognition in an interactive environment called Visualization
Space, allowing a user to manipulate virtual objects and navigate through virtual worlds.
Paradiso and Sparacino (1997) used Pfinder to create an interactive performance space
where a dancer can generate music and graphics through their body movements – for
example, hand and body gestures can trigger rhythmic and melodic changes in the music.
Systems that analyze human motion in VEs may be quite useful in medical rehabilitation
(see Chapter 46, this Volume) and athletic and military training (see Chapter 43, this
Volume). For example, a system like the one developed by Boyd and Little (1998) to
recognize human gaits could potentially be used to evaluate rehabilitation progress.
Yamamoto et al. (1998) describe a system that used computer vision to analyze body
motion in order to evaluate the performance of skiers.
Davis and Bobick (1997) used a view-based approach by representing and recognizing
human action based on “temporal templates,” where a single image template captures the
recent history of motion. This technique was used in the KidsRoom system, an
interactive, immersive, narrative environment for children.
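A temporal template of this kind can be approximated by a motion-history image: each pixel holds a recency value that decays over time unless fresh motion refreshes it, so a single image captures where motion happened and how recently. The sketch below is a minimal illustration; the frame sequence and parameters are invented.

```python
import numpy as np

def motion_history(frames, tau=5, threshold=10):
    """Build a motion-history image from a sequence of grayscale frames.

    Pixels where the frame difference exceeds threshold are set to tau;
    all other pixels decay by one per frame (floored at zero)."""
    mhi = np.zeros_like(frames[0], dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(int) - prev.astype(int)) > threshold
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi

# A bright square sliding rightward across a dark background.
frames = []
for t in range(4):
    f = np.zeros((5, 8), dtype=np.uint8)
    f[1:4, t:t + 2] = 200
    frames.append(f)

mhi = motion_history(frames)
# Recently moving columns hold high values; older motion has decayed.
print(mhi[2])
```

Recognition then compares the resulting template (e.g. via moment-based shape descriptors) against stored templates for each known action.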
Video surveillance and monitoring of human activity has received significant attention in
recent years. For example, the W4 system developed at the University of Maryland
(Haritaoglu et al., 1998) tracks people and detects patterns of activity.
System Architecture Concepts for Gesture Recognition Systems
Based on how humans use gestures, on the analysis of speech and handwriting recognition systems, and on the analysis of other gesture recognition systems, requirements for a gesture recognition system can be detailed. Some requirements and tasks are:
 Choose gestures which fit a useful environment.
 Create a system which can recognize non-perfect, human-created gestures.
 Create a system which can use both a gesture's static and dynamic information components.
 Perform gesture recognition with image data presented at field rate (or as fast as
possible).
 Recognize the gesture as quickly as possible, even before the full gesture is
completed.
 Use a recognition method which requires a small amount of computational time
and memory.
 Create an expandable system which can recognize additional types of gestures.
 Pair gestures with appropriate responses (language definitions or device
command responses).
 Create an environment which allows the use of gestures for remote control of
devices.
Conclusions
Although several research efforts have been referenced in this chapter, these are just a
sampling; many more have been omitted for the sake of brevity. Good sources for much
of the work in gesture recognition can be found in the proceedings of the Gesture
Workshops and the International Conference on Automatic Face and Gesture
Recognition.
There is still much to be done before gestural interfaces, which track and recognize
human activities, can become pervasive and cost-effective for the masses. However,
much progress has been made in the past decade and with the continuing march towards
computers and sensors that are faster, smaller, and more ubiquitous, there is cause for
optimism. As PDAs and pen-based computing continue to proliferate, pen-based 2D
gestures should become more common, and some of the technology will transfer to 3D
hand, head, and body gestural interfaces. Similarly, technology developed in surveillance
and security areas will also find uses in gesture recognition for virtual environments.
There are many open questions in this area. There has been little activity in evaluating
usability (see Chapter 34, this Volume) and understanding performance requirements and
limitations of gestural interaction. Error rates are reported from 1% to 50%, depending
on the difficulty and generality of the scenario. There are currently no common databases
or metrics with which to compare research results. Can gesture recognition systems adapt
to variations among individuals, or will extensive individual training be required? What
about individual variation due to fatigue and other factors? How good do gesture
recognition systems need to be to become truly useful in mass applications?
Each technology discussed in this chapter has its benefits and limitations. Devices that
are worn or held – pens, gloves, body suits – are currently more advanced, as evidenced
by the fact that there are many commercial products available. However, passive sensing
(using cameras or other sensors) promises to be more powerful, more general, and less
obtrusive than other technologies. It is likely that both camps will continue to improve
and co-exist, often being used together in systems, and that new sensing technologies will
arise to give even more choice to VE developers.
Augmented Reality
Overview
This section surveys the field of Augmented Reality, in which 3-D virtual objects are
integrated into a 3-D real environment in real time. It describes the medical,
manufacturing, visualization, path planning, entertainment and military applications that
have been explored. This section describes the characteristics of Augmented Reality
systems, including a detailed discussion of the tradeoffs between optical and video
blending approaches. Registration and sensing errors are two of the biggest problems in
building effective Augmented Reality systems, so this section summarizes current efforts
to overcome these problems. Future directions and areas requiring further research are
discussed. This survey provides a starting point for anyone interested in researching or
using Augmented Reality.
Introduction
This section surveys the current state-of-the-art in Augmented Reality. It describes work
performed at many different sites and explains the issues and problems encountered
when building Augmented Reality systems. It summarizes the tradeoffs and approaches
taken so far to overcome these problems and speculates on future directions that deserve
exploration.
This survey does not present new research results; its contribution comes from
consolidating existing information from many sources and publishing an extensive
bibliography of papers in this field. While several other introductory papers have been
written on this subject, this survey is more comprehensive and up-to-date. This survey
provides a good beginning point for anyone interested in starting research in this area.
Definition
Augmented Reality (AR) is a variation of Virtual Environments (VE), or Virtual Reality
as it is more commonly called. VE technologies completely immerse a user inside a
synthetic environment. While immersed, the user cannot see the real world around him.
In contrast, AR allows the user to see the real world, with virtual objects superimposed
upon or composited with the real world. Therefore, AR supplements reality, rather than
completely replacing it. Ideally, it would appear to the user that the virtual and real
objects coexisted in the same space, similar to the effects achieved in the film "Who
Framed Roger Rabbit?" Figure 5 shows an example of what this might look like. It
shows a real desk with a real phone. Inside this room are also a virtual lamp and two
virtual chairs. Note that the objects are combined in 3-D, so that the virtual lamp covers
the real table, and the real table covers parts of the two virtual chairs. AR can be thought
of as the "middle ground" between VE (completely synthetic) and telepresence
(completely real).
Figure 5: Real desk with virtual lamp and two virtual chairs. (Courtesy ECRC)
Some researchers define AR in a way that requires the use of Head-Mounted Displays
(HMDs). To avoid limiting AR to specific technologies, this survey defines AR as
systems that have the following three characteristics:
1) Combines real and virtual
2) Interactive in real time
3) Registered in 3-D
This definition allows other technologies besides HMDs while retaining the essential
components of AR. For example, it does not include film or 2-D overlays. Films like
"Jurassic Park" feature photorealistic virtual objects seamlessly blended with a real
environment in 3-D, but they are not interactive media. 2-D virtual overlays on top of
live video can be done at interactive rates, but the overlays are not combined with the
real world in 3-D. However, this definition does allow monitor-based interfaces,
monocular systems, see-through HMDs, and various other combining technologies.
Potential system configurations are discussed further in the Characteristics section below.
Motivation
Why is Augmented Reality an interesting topic? Why is combining real and virtual
objects in 3-D useful? Augmented Reality enhances a user's perception of and interaction
with the real world. The virtual objects display information that the user cannot directly
detect with his own senses. The information conveyed by the virtual objects helps a user
perform real-world tasks. AR is a specific example of what Fred Brooks calls
Intelligence Amplification (IA): using the computer as a tool to make a task easier for a
human to perform.
At least six classes of potential AR applications have been explored: medical
visualization, maintenance and repair, annotation, robot path planning,
entertainment, and military aircraft navigation and targeting. The next section
describes work that has been done in each area. While these do not cover every potential
application area of this technology, they do cover the areas explored so far.
Applications
Medical
Doctors could use Augmented Reality as a visualization and training aid for surgery. It
may be possible to collect 3-D datasets of a patient in real time, using non-invasive
sensors like Magnetic Resonance Imaging (MRI), Computed Tomography scans (CT), or
ultrasound imaging. These datasets could then be rendered and combined in real time
with a view of the real patient. In effect, this would give a doctor "X-ray vision" inside a
patient. This would be very useful during minimally-invasive surgery, which reduces the
trauma of an operation by using small incisions or no incisions at all. A problem with
minimally-invasive techniques is that they reduce the doctor's ability to see inside the
patient, making surgery more difficult. AR technology could provide an internal view
without the need for larger incisions.
AR might also be helpful for general medical visualization tasks in the surgical room.
Surgeons can detect some features with the naked eye that they cannot see in MRI or CT
scans, and vice-versa. AR would give surgeons access to both types of data
simultaneously. This might also guide precision tasks, such as displaying where to drill a
hole into the skull for brain surgery or where to perform a needle biopsy of a tiny tumor.
The information from the non-invasive sensors would be directly displayed on the
patient, showing exactly where to perform the operation.
AR might also be useful for training purposes. Virtual instructions could remind a novice
surgeon of the required steps, without the need to look away from a patient to consult a
manual. Virtual objects could also identify organs and specify locations to avoid
disturbing.
Several projects are exploring this application area. At UNC Chapel Hill, a research
group has conducted trial runs of scanning the womb of a pregnant woman with an
ultrasound sensor, generating a 3-D representation of the fetus inside the womb and
displaying that in a see-through HMD (Figure 6). The goal is to endow the doctor with
the ability to see the moving, kicking fetus lying inside the womb, with the hope that this
one day may become a "3-D stethoscope". More recent efforts have focused on a needle
biopsy of a breast tumor. Figure 7 shows a mockup of a breast biopsy operation, where
the virtual objects identify the location of the tumor and guide the needle to its target.
Other groups at the MIT AI Lab, General Electric, and elsewhere are investigating
displaying MRI or CT data, directly registered onto the patient.
Figure 6: Virtual fetus inside womb of pregnant patient. (Courtesy UNC Chapel Hill Dept. of Computer Science.)
Figure 7: Mockup of breast tumor biopsy. 3-D graphics guide needle insertion. (Courtesy UNC Chapel Hill Dept. of Computer Science.)
Manufacturing and repair
Another category of Augmented Reality applications is the assembly, maintenance, and
repair of complex machinery. Instructions might be easier to understand if they were
available, not as manuals with text and pictures, but rather as 3-D drawings
superimposed upon the actual equipment, showing step-by-step the tasks that need to be
done and how to do them. These superimposed 3-D drawings can be animated, making
the directions even more explicit.
Several research projects have demonstrated prototypes in this area. Steve Feiner's group
at Columbia built a laser printer maintenance application, shown in Figures 8 and 9.
Figure 8 shows an external view, and Figure 9 shows the user's view, where the
computer-generated wireframe is telling the user to remove the paper tray. A group at
Boeing is developing AR technology to guide a technician in building a wiring harness
that forms part of an airplane's electrical system. Storing these instructions in electronic
form will save space and reduce costs. Currently, technicians use large physical layout
boards to construct such harnesses, and Boeing requires several warehouses to store all
these boards. Such space might be freed for other uses if this application proves
successful. Boeing is using a Technology Reinvestment Program (TRP) grant to
investigate putting this technology onto the factory floor. Figure 10 shows an external
view of Adam Janin using a prototype AR system to build a wire bundle. Eventually, AR
might be used for any complicated machinery, such as automobile engines.
Figure 8: External view of Columbia printer maintenance application. Note that all objects must be tracked. (Courtesy Steve Feiner, Blair MacIntyre,
and Dorée Seligmann, Columbia University.)
Figure 9: Prototype laser printer maintenance application, displaying how to remove the paper tray. (Courtesy Steve Feiner, Blair
MacIntyre, and Dorée Seligmann, Columbia University.)
Figure 10: Adam Janin demonstrates Boeing's prototype wire bundle assembly application. (Courtesy David Mizell, Boeing)
Annotation and visualization
AR could be used to annotate objects and environments with public or private
information. Applications using public information assume the availability of public
databases to draw upon. For example, a hand-held display could provide information
about the contents of library shelves as the user walks around the library. At the
European Computer-Industry Research Centre (ECRC), a user can point at parts of an
engine model and the AR system displays the name of the part that is being pointed at.
Figure 11 shows this, where the user points at the exhaust manifold on an engine model
and the label "exhaust manifold" appears.
Figure 11: Engine model part labels appear as user points at them. (Courtesy ECRC)
Alternately, these annotations might be private notes attached to specific objects.
Researchers at Columbia demonstrated this with the notion of attaching windows from a
standard user interface onto specific locations in the world, or attached to specific objects
as reminders. Figure 12 shows a window superimposed as a label upon a student. He
wears a tracking device, so the computer knows his location. As the student moves
around, the label follows his location, providing the AR user with a reminder of what he
needs to talk to the student about.
Figure 12: Windows displayed on top of specific real-world objects. (Courtesy Steve Feiner, Blair MacIntyre, Marcus Haupt, and
Eliot Solomon, Columbia University.)
AR might aid general visualization tasks as well. An architect with a see-through HMD
might be able to look out a window and see how a proposed new skyscraper would
change her view. If a database containing information about a building's structure was
available, AR might give architects "X-ray vision" inside a building, showing where the
pipes, electric lines, and structural supports are inside the walls. Researchers at the
University of Toronto have built a system called Augmented Reality through Graphic
Overlays on Stereovideo (ARGOS), which among other things is used to make images easier to
understand during difficult viewing conditions. Figure 13 shows wireframe lines drawn
on top of a space shuttle bay interior, while in orbit. The lines make it easier to see the
geometry of the shuttle bay. Similarly, virtual lines and objects could aid navigation and
scene understanding during poor visibility conditions, such as underwater or in fog.
Figure 13: Virtual lines help display geometry of shuttle bay, as seen in orbit. (Courtesy David Drascic and Paul Milgram, U.
Toronto.)
Robot path planning
Teleoperation of a robot is often a difficult problem, especially when the robot is far
away, with long delays in the communication link. Under these circumstances, instead of
controlling the robot directly, it may be preferable to control a virtual version of
the robot. The user plans and specifies the robot's actions by manipulating the local
virtual version in real time, and the results are displayed directly on the real world. Once the
plan is tested and finalized, the user tells the real robot to execute the specified plan.
This avoids pilot-induced oscillations caused by the lengthy delays. The virtual versions
can also predict the effects of manipulating the environment, thus serving as a planning
and previewing tool to aid the user in performing the desired task. The ARGOS system
has demonstrated that stereoscopic AR is an easier and more accurate way of doing robot
path planning than traditional monoscopic interfaces. Others have also used registered
overlays with telepresence systems. Figure 14 shows how a virtual outline can represent
a future location of a robot arm.
Figure 14: Virtual lines show a planned motion of a robot arm (Courtesy David Drascic and Paul Milgram, U.
Toronto.)
Entertainment
At SIGGRAPH '95, several exhibitors showed "Virtual Sets" that merge real actors with
virtual backgrounds, in real time and in 3-D. The actors stand in front of a large blue
screen, while a computer-controlled motion camera records the scene. Since the camera's
location is tracked, and the actor's motions are scripted, it is possible to digitally
composite the actor into a 3-D virtual background. For example, the actor might appear
to stand inside a large virtual spinning ring, where the front part of the ring covers the
actor while the rear part of the ring is covered by the actor. The entertainment industry
sees this as a way to reduce production costs: creating and storing sets virtually is
potentially cheaper than constantly building new physical sets from scratch. The ALIVE
project from the MIT Media Lab goes one step further by populating the environment
with intelligent virtual creatures that respond to user actions [Maes95].
Military aircraft
For many years, military aircraft and helicopters have used Head-Up Displays (HUDs)
and Helmet-Mounted Sights (HMS) to superimpose vector graphics upon the pilot's view
of the real world. Besides providing basic navigation and flight information, these
graphics are sometimes registered with targets in the environment, providing a way to
aim the aircraft's weapons. For example, the chin turret in a helicopter gunship can be
slaved to the pilot's HMS, so the pilot can aim the chin turret simply by looking at the
target. Future generations of combat aircraft will be developed with an HMD built into
the pilot's helmet.
Characteristics
This section discusses the characteristics of AR systems and the design issues encountered
when building them. It describes the basic characteristics of augmentation, which can be
accomplished with either optical or video technologies, and compares their relative strengths
and weaknesses. Blending the real and virtual poses problems with focus and contrast, and
some applications require portable AR systems to be truly effective. Finally, the section
summarizes these characteristics by comparing the requirements of AR against
those for Virtual Environments.
Augmentation
Besides adding objects to a real environment, Augmented Reality also has the potential
to remove them. Current work has focused on adding virtual objects to a real
environment. However, graphic overlays might also be used to remove or hide parts of
the real environment from a user. For example, to remove a desk in the real environment,
draw a representation of the real walls and floors behind the desk and "paint" that over
the real desk, effectively removing it from the user's sight. This has been done in feature
films. Doing this interactively in an AR system will be much harder, but this removal
may not need to be photorealistic to be effective.
Augmented Reality might apply to all senses, not just sight. So far, researchers have
focused on blending real and virtual images and graphics. However, AR could be
extended to include sound. The user would wear headphones equipped with microphones
on the outside. The headphones would add synthetic, directional 3–D sound, while the
external microphones would detect incoming sounds from the environment. This would
give the system a chance to mask or cover up selected real sounds from the environment
by generating a masking signal that exactly canceled the incoming real sound. While
this would not be easy to do, it might be possible. Another example is haptics. Gloves
with devices that provide tactile feedback might augment real forces in the environment.
For example, a user might run his hand over the surface of a real desk. Simulating such a
hard surface virtually is fairly difficult, but it is easy to do in reality. Then the tactile
effectors in the glove can augment the feel of the desk, perhaps making it feel rough in
certain spots. This capability might be useful in some applications, such as providing an
additional cue that a virtual object is at a particular location on a real desk.
Optical vs. video
A basic design decision in building an AR system is how to accomplish the combining of
real and virtual. Two basic choices are available: optical and video technologies. Each
has particular advantages and disadvantages. This section compares the two and notes
the tradeoffs.
A see-through HMD is one device used to combine real and virtual. Standard closed-view HMDs do not allow any direct view of the real world. In contrast, a see-through
HMD lets the user see the real world, with virtual objects superimposed by optical or
video technologies.
Optical see-through HMDs work by placing optical combiners in front of the user's eyes.
These combiners are partially transmissive, so that the user can look directly through
them to see the real world. The combiners are also partially reflective, so that the user
sees virtual images bounced off the combiners from head-mounted monitors. This
approach is similar in nature to Head-Up Displays (HUDs) commonly used in military
aircraft, except that the combiners are attached to the head. Thus, optical see-through
HMDs have sometimes been described as a "HUD on a head" [Wanstall89]. Figure 15
shows a conceptual diagram of an optical see-through HMD. Figure 16 shows two
optical see-through HMDs made by Hughes Electronics.
The optical combiners usually reduce the amount of light that the user sees from the real
world. Since the combiners act like half-silvered mirrors, they only let in some of the
light from the real world, so that they can reflect some of the light from the monitors into
the user's eyes. For example, the HMD described in [Holmgren92] transmits about 30%
of the incoming light from the real world. Choosing the level of blending is a design
problem. More sophisticated combiners might vary the level of contributions based upon
the wavelength of light. For example, such a combiner might be set to reflect all light of
a certain wavelength and none at any other wavelengths. This would be ideal with a
monochrome monitor. Virtually all the light from the monitor would be reflected into the
user's eyes, while almost all the light from the real world (except at the particular
wavelength) would reach the user's eyes. However, most existing optical see-through
HMDs do reduce the amount of light from the real world, so they act like a pair of
sunglasses when the power is cut off.
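As a rough illustration of this blending, the light reaching the eye can be modeled as a weighted sum of the two sources. This is a sketch only: the 30% transmission value echoes the figure quoted above, while the reflectance value is an assumed placeholder.

```python
def combiner_blend(real_lum, virtual_lum, transmit=0.30, reflect=0.60):
    """Luminance reaching the eye through a half-silvered combiner:
    a fraction of the real scene (transmitted) plus a fraction of the
    head-mounted monitor image (reflected). transmit=0.30 matches the
    ~30% figure quoted above; reflect=0.60 is assumed for illustration."""
    return transmit * real_lum + reflect * virtual_lum
```

With these values, a real scene of 100 units contributes only 30 units to what the user sees, which is why such displays darken the world like sunglasses when the monitors are powered off.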
Figure 15: Optical see-through HMD conceptual diagram
Figure 16: Two optical see-through HMDs, made by Hughes Electronics
In contrast, video see-through HMDs work by combining a closed-view HMD with one
or two head-mounted video cameras. The video cameras provide the user's view of the
real world. Video from these cameras is combined with the graphic images created by
the scene generator, blending the real and virtual. The result is sent to the monitors in
front of the user's eyes in the closed-view HMD. Figure 17 shows a conceptual diagram
of a video see-through HMD. Figure 18 shows an actual video see-through HMD, with
two video cameras mounted on top of a Flight Helmet.
Figure 17: Video see-through HMD conceptual diagram
Figure 18: An actual video see-through HMD. (Courtesy Jannick Rolland, Frank Biocca, and UNC Chapel Hill Dept. of Computer
Science. Photo by Alex Treml.)
Video composition can be done in more than one way. A simple way is to use chroma-keying, a technique used in many video special effects. The background of the computer
graphic images is set to a specific color, say green, which none of the virtual objects use.
Then the combining step replaces all green areas with the corresponding parts from the
video of the real world. This has the effect of superimposing the virtual objects over the
real world. A more sophisticated composition would use depth information. If the system
had depth information at each pixel for the real world images, it could combine the real
and virtual images by a pixel-by-pixel depth comparison. This would allow real objects
to cover virtual objects and vice-versa.
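Both composition strategies can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from any of the systems described here:

```python
import numpy as np

def chroma_key(virtual, real, key=(0, 255, 0)):
    """Replace key-colored pixels of the rendered virtual frame with
    the corresponding pixels from the real-world video frame."""
    mask = np.all(virtual == key, axis=-1)  # True where background color
    out = virtual.copy()
    out[mask] = real[mask]
    return out

def depth_composite(virtual, vdepth, real, rdepth):
    """Pixel-by-pixel depth comparison: the nearer surface wins, so
    real objects can occlude virtual objects and vice-versa."""
    mask = rdepth < vdepth  # True where the real surface is closer
    out = virtual.copy()
    out[mask] = real[mask]
    return out
```

Chroma-keying always paints the virtual objects over the real scene, while the depth comparison lets whichever surface is nearer win at each pixel, giving correct mutual occlusion.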
AR systems can also be built using monitor-based configurations, instead of see-through
HMDs. Figure 19 shows how a monitor-based system might be built. In this case, one or
two video cameras view the environment. The cameras may be static or mobile. In the
mobile case, the cameras might move around by being attached to a robot, with their
locations tracked. The video of the real world and the graphic images generated by a
scene generator are combined, just as in the video see-through HMD case, and displayed
in a monitor in front of the user. The user does not wear the display device. Optionally,
the images may be displayed in stereo on the monitor, which then requires the user to
wear a pair of stereo glasses. Figure 20 shows an external view of the ARGOS system,
which uses a monitor-based configuration.
Figure 19: Monitor-based AR conceptual diagram
Figure 20: External view of the ARGOS system, an example of monitor-based
AR. (Courtesy David Drascic and Paul Milgram, U. Toronto.)
Finally, a monitor-based optical configuration is also possible. This is similar to Figure
15 except that the user does not wear the monitors or combiners on her head. Instead, the
monitors and combiners are fixed in space, and the user positions her head to look
through the combiners. This is typical of Head-Up Displays on military aircraft, and at
least one such configuration has been proposed for a medical application.
The rest of this section compares the relative advantages and disadvantages of optical
and video approaches, starting with optical. An optical approach has the following
advantages over a video approach:
Simplicity: Optical blending is simpler and cheaper than video blending. Optical
approaches have only one "stream" of video to worry about: the graphic images.
The real world is seen directly through the combiners, and that time delay is
generally a few nanoseconds. Video blending, on the other hand, must deal with
separate video streams for the real and virtual images. Both streams have inherent
delays in the tens of milliseconds. Digitizing video images usually adds at least
one frame time of delay to the video stream, where a frame time is how long it
takes to completely update an image. A monitor that completely refreshes the
screen at 60 Hz has a frame time of 16.67ms. The two streams of real and virtual
images must be properly synchronized or temporal distortion results. Also,
optical see-through HMDs with narrow field-of-view combiners offer views of
the real world that have little distortion. Video cameras almost always have some
amount of distortion that must be compensated for, along with any distortion
from the optics in front of the display devices. Since video requires cameras and
combiners that optical approaches do not need, video will probably be more
expensive and complicated to build than optical-based systems.
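The frame-time arithmetic above can be made concrete with a small sketch (the function names and the one-frame digitizing cost are illustrative assumptions):

```python
def frame_time_ms(refresh_hz):
    """Time to completely refresh one image at the given rate."""
    return 1000.0 / refresh_hz

def video_path_delay_ms(refresh_hz, digitize_frames=1, extra_ms=0.0):
    """Rough lower bound on the delay of the real-world video stream:
    digitizing usually costs at least one frame time, plus any further
    processing delay."""
    return digitize_frames * frame_time_ms(refresh_hz) + extra_ms
```

At 60 Hz, frame_time_ms(60) gives about 16.67 ms, matching the figure quoted above; the optical path, by contrast, carries only nanoseconds of delay.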
Resolution: Video blending limits the resolution of what the user sees, both real
and virtual, to the resolution of the display devices. With current displays, this
resolution is far less than the resolving power of the fovea. Optical see-through
also shows the graphic images at the resolution of the display device, but the
user's view of the real world is not degraded. Thus, video reduces the resolution
of the real world, while optical see-through does not.
Safety: Video see-through HMDs are essentially modified closed-view HMDs. If
the power is cut off, the user is effectively blind. This is a safety concern in some
applications. In contrast, when power is removed from an optical see-through
HMD, the user still has a direct view of the real world. The HMD then becomes a
pair of heavy sunglasses, but the user can still see.
No eye offset: With video see-through, the user's view of the real world is
provided by the video cameras. In essence, this puts his "eyes" where the video
cameras are. In most configurations, the cameras are not located exactly where
the user's eyes are, creating an offset between the cameras and the real eyes. The
distance separating the cameras may also not be exactly the same as the user's
interpupillary distance (IPD). This difference between camera locations and eye
locations introduces displacements from what the user sees compared to what he
expects to see. For example, if the cameras are above the user's eyes, he will see
the world from a vantage point slightly taller than he is used to. Video see-through can avoid the eye offset problem through the use of mirrors to create
another set of optical paths that mimic the paths directly into the user's eyes.
Using those paths, the cameras will see what the user's eyes would normally see
without the HMD. However, this adds complexity to the HMD design. Offset is
generally not a difficult design problem for optical see-through displays. While
the user's eye can rotate with respect to the position of the HMD, the resulting
errors are tiny. Using the eye's center of rotation as the viewpoint in the computer
graphics model should eliminate any need for eye tracking in an optical see-through HMD.
Video blending offers the following advantages over optical blending:
Flexibility in composition strategies: A basic problem with optical see-through is
that the virtual objects do not completely obscure the real world objects, because
the optical combiners allow light from both virtual and real sources. Building an
optical see-through HMD that can selectively shut out the light from the real
world is difficult. In a normal optical system, the objects are designed to be in
focus at only one point in the optical path: the user's eye. Any filter that would
selectively block out light must be placed in the optical path at a point where the
image is in focus, which obviously cannot be the user's eye. Therefore, the optical
system must have two places where the image is in focus: at the user's eye and the
point of the hypothetical filter. This makes the optical design much more difficult
and complex. No existing optical see-through HMD blocks incoming light in this
fashion. Thus, the virtual objects appear ghost-like and semi-transparent. This
damages the illusion of reality because occlusion is one of the strongest depth
cues. In contrast, video see-through is far more flexible about how it merges the
real and virtual images. Since both the real and virtual are available in digital
form, video see-through compositors can, on a pixel-by-pixel basis, take the real,
or the virtual, or some blend between the two to simulate transparency. Because
of this flexibility, video see-through may ultimately produce more compelling
environments than optical see-through approaches.
Wide field-of-view: Distortions in optical systems are a function of the radial
distance away from the optical axis. The further one looks away from the center
of the view, the larger the distortions get. A digitized image taken through a
distorted optical system can be undistorted by applying image processing
techniques to unwarp the image, provided that the optical distortion is well
characterized. This requires significant amounts of computation, but this
constraint will be less important in the future as computers become faster. It is
harder to build wide field-of-view displays with optical see-through techniques.
Any distortions of the user's view of the real world must be corrected optically,
rather than digitally, because the system has no digitized image of the real world
to manipulate. Complex optics are expensive and add weight to the HMD. Wide
field-of-view systems are an exception to the general trend of optical approaches
being simpler and cheaper than video approaches.
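The digital unwarping step described above can be sketched as follows. This assumes the common one-parameter radial model, where a distorted point is x_d = x_u * (1 + k1 * r^2); inverting it by fixed-point iteration is a standard approach, not one taken from the text:

```python
import numpy as np

def undistort_points(pts, k1, center=(0.0, 0.0)):
    """Invert simple radial lens distortion x_d = x_u * (1 + k1 * r^2)
    by fixed-point iteration (adequate for small k1). pts is an (N, 2)
    array of distorted image coordinates."""
    pts = np.asarray(pts, dtype=float) - center
    undist = pts.copy()
    for _ in range(10):
        r2 = (undist ** 2).sum(axis=-1, keepdims=True)  # squared radius per point
        undist = pts / (1.0 + k1 * r2)
    return undist + center
```

Applying a correction like this at every pixel of every frame is the computationally heavy step alluded to above, which becomes less of a constraint as computers get faster.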
Real and virtual view delays can be matched: Video offers an approach for
reducing or avoiding problems caused by temporal mismatches between the real
and virtual images. Optical see-through HMDs offer an almost instantaneous
view of the real world but a delayed view of the virtual. This temporal mismatch
can cause problems. With video approaches, it is possible to delay the video of
the real world to match the delay from the virtual image stream.
Additional registration strategies: In optical see-through, the only information
the system has about the user's head location comes from the head tracker. Video
blending provides another source of information: the digitized image of the real
scene. This digitized image means that video approaches can employ additional
registration strategies unavailable to optical approaches.
Easier to match the brightness of real and virtual objects: This is discussed in
the Focus and contrast section below.
Both optical and video technologies have their roles, and the choice of technology
depends on the application requirements. Many of the mechanical assembly and repair
prototypes use optical approaches, possibly because of the cost and safety issues. If
successful, the equipment would have to be replicated in large numbers to equip workers
on a factory floor. In contrast, most of the prototypes for medical applications use video
approaches, probably for the flexibility in blending real and virtual and for the additional
registration strategies offered.
Focus and contrast
Focus can be a problem for both optical and video approaches. Ideally, the virtual should
match the real. In a video-based system, the combined virtual and real image will be
projected at the same distance by the monitor or HMD optics. However, depending on
the video camera's depth-of-field and focus settings, parts of the real world may not be in
focus. In typical graphics software, everything is rendered with a pinhole model, so all
the graphic objects, regardless of distance, are in focus. To overcome this, the graphics
could be rendered to simulate a limited depth-of-field, and the video camera might have
an autofocus lens.
In the optical case, the virtual image is projected at some distance away from the user.
This distance may be adjustable, although it is often fixed. Therefore, while the real
objects are at varying distances from the user, the virtual objects are all projected to the
same distance. If the virtual and real distances are not matched for the particular objects
that the user is looking at, it may not be possible to clearly view both simultaneously.
Contrast is another issue because of the large dynamic range in real environments and in
what the human eye can detect. Ideally, the brightness of the real and virtual objects
should be appropriately matched. Unfortunately, in the worst case scenario, this means
the system must match a very large range of brightness levels. The eye is a logarithmic
detector, where the brightest light that it can handle is about eleven orders of magnitude
greater than the smallest, including both dark-adapted and light-adapted eyes. In any one
adaptation state, the eye can cover about six orders of magnitude. Most display devices
cannot come close to this level of contrast. This is a particular problem with optical
technologies, because the user has a direct view of the real world. If the real environment
is too bright, it will wash out the virtual image. If the real environment is too dark, the
virtual image will wash out the real world. Contrast problems are not as severe with
video, because the video cameras themselves have limited dynamic response, and the
view of both the real and virtual is generated by the monitor, so everything must be
clipped or compressed into the monitor's dynamic range.
Compiled by Omorogbe Harry
88
HCI
Portability
In almost all Virtual Environment systems, the user is not encouraged to walk around
much. Instead, the user navigates by "flying" through the environment, walking on a
treadmill, or driving some mockup of a vehicle. Whatever the technology, the result is
that the user stays in one place in the real world.
Some AR applications, however, will need to support a user who will walk around a
large environment. AR requires that the user actually be at the place where the task is to
take place. "Flying," as performed in a VE system, is no longer an option. If a mechanic
needs to go to the other side of a jet engine, she must physically move herself and the
display devices she wears. Therefore, AR systems will place a premium on portability,
especially the ability to walk around outdoors, away from controlled environments. The
scene generator, the HMD, and the tracking system must all be self-contained and
capable of surviving exposure to the environment. If this capability is achieved, many
more applications that have not been tried will become available. For example, the
ability to annotate the surrounding environment could be useful to soldiers, hikers, or
tourists in an unfamiliar location.
Comparison against virtual environments
The overall requirements of AR can be summarized by comparing them against the
requirements for Virtual Environments, for the three basic subsystems that they require.
1) Scene generator: Rendering is not currently one of the major problems in AR. VE
systems have much higher requirements for realistic images because they completely
replace the real world with the virtual environment. In AR, the virtual images only
supplement the real world. Therefore, fewer virtual objects need to be drawn, and they
do not necessarily have to be realistically rendered in order to serve the purposes of the
application. For example, in the annotation applications, text and 3-D wireframe
drawings might suffice. Ideally, photorealistic graphic objects would be seamlessly
merged with the real environment, but more basic problems have to be solved first.
2) Display device: The display devices used in AR may have less stringent requirements
than VE systems demand, again because AR does not replace the real world. For
example, monochrome displays may be adequate for some AR applications, while
virtually all VE systems today use full color. Optical see-through HMDs with a small
field-of-view may be satisfactory because the user can still see the real world with his
peripheral vision; the see-through HMD does not shut off the user's normal field-of-view.
Furthermore, the resolution of the monitor in an optical see-through HMD might be
lower than what a user would tolerate in a VE application, since the optical see-through
HMD does not reduce the resolution of the real environment.
3) Tracking and sensing: While in the previous two cases AR had lower requirements
than VE, that is not the case for tracking and sensing. In this area, the requirements for
AR are much stricter than those for VE systems. A major reason for this is the
registration problem, which is described in the next section. The other factors that make
the tracking and sensing requirements higher are described in the next few pages.
Registration
The registration problem
One of the most basic problems currently limiting Augmented Reality applications is the
registration problem. The objects in the real and virtual worlds must be properly aligned
with respect to each other, or the illusion that the two worlds coexist will be
compromised. More seriously, many applications demand accurate registration. For
example, recall the needle biopsy application. If the virtual object is not where the real
tumor is, the surgeon will miss the tumor and the biopsy will fail. Without accurate
registration, Augmented Reality will not be accepted in many applications.
Registration problems also exist in Virtual Environments, but they are not nearly as
serious because they are harder to detect than in Augmented Reality. Since the user only
sees virtual objects in VE applications, registration errors result in visual-kinesthetic and
visual-proprioceptive conflicts. Such conflicts between different human senses may be a
source of motion sickness [Pausch92]. Because the kinesthetic and proprioceptive
systems are much less sensitive than the visual system, visual-kinesthetic and visual-proprioceptive conflicts are less noticeable than visual-visual conflicts. For example, a
user wearing a closed-view HMD might hold up her real hand and see a virtual hand.
This virtual hand should be displayed exactly where she would see her real hand, if she
were not wearing an HMD. But if the virtual hand is wrong by five millimeters, she may
not detect that unless actively looking for such errors. The same error is much more
obvious in a see-through HMD, where the conflict is visual-visual.
Furthermore, a phenomenon known as visual capture makes it even more difficult to
detect such registration errors. Visual capture is the tendency of the brain to believe what
it sees rather than what it feels, hears, etc. That is, visual information tends to override all
other senses. When watching a television program, a viewer believes the sounds come
from the mouths of the actors on the screen, even though they actually come from a
speaker in the TV. Ventriloquism works because of visual capture. Similarly, a user
might believe that her hand is where the virtual hand is drawn, rather than where her real
hand actually is, because of visual capture. This effect increases the amount of
registration error users can tolerate in Virtual Environment systems. If the errors are
systematic, users might even be able to adapt to the new environment, given a long
exposure time of several hours or days.
Augmented Reality demands much more accurate registration than Virtual Environments
[Azuma93]. Imagine the same scenario of a user holding up her hand, but this time
wearing a see-through HMD. Registration errors now result in visual-visual conflicts
between the images of the virtual and real hands. Such conflicts are easy to detect
because of the resolution of the human eye and the sensitivity of the human visual
system to differences. Even tiny offsets in the images of the real and virtual hands are
easy to detect.
What angular accuracy is needed for good registration in Augmented Reality? A simple
demonstration will show the order of magnitude required. Take out a dime and hold it at
arm's length, so that it looks like a circle. The diameter of the dime covers about 1.2 to
2.0 degrees of arc, depending on your arm length. In comparison, the width of a full
moon is about 0.5 degrees of arc! Now imagine a virtual object superimposed on a real
object, but offset by the diameter of the full moon. Such a difference would be easy to
detect. Thus, the angular accuracy required is a small fraction of a degree. The lower
limit is bounded by the resolving power of the human eye itself. The central part of the
retina is called the fovea, which has the highest density of color-detecting cones, about
120 per degree of arc, corresponding to a spacing of half a minute of arc. Observers can
differentiate between a dark and light bar grating when each bar subtends about one
minute of arc, and under special circumstances they can detect even smaller differences.
However, existing HMD trackers and displays are not capable of providing one minute
of arc in accuracy, so the present achievable accuracy is much worse than that ultimate
lower bound. In practice, errors of a few pixels are detectable in modern HMDs.
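The back-of-the-envelope figures above are easy to verify. The sketch below reproduces the 1.2 to 2.0 degree range, assuming a dime diameter of 17.9 mm and two illustrative arm lengths:

```python
import math

def angular_size_deg(diameter_m, distance_m):
    # Visual angle subtended by an object of the given diameter
    # viewed face-on from the given distance.
    return math.degrees(2 * math.atan(diameter_m / (2 * distance_m)))

dime = 0.0179            # US dime diameter in meters
for arm in (0.5, 0.85):  # short and long arm lengths (assumed values)
    print(round(angular_size_deg(dime, arm), 2))
```

The same function shows the full moon's roughly 0.5 degree width: `angular_size_deg(3474e3, 384400e3)` is about 0.52 degrees.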
Registration of real and virtual objects is not limited to AR. Special-effects artists
seamlessly integrate computer-generated 3-D objects with live actors in film and video.
The difference lies in the amount of control available. With film, a director can carefully
plan each shot, and artists can spend hours per frame, adjusting each by hand if
necessary, to achieve perfect registration. As an interactive medium, AR is far more
difficult to work with. The AR system cannot control the motions of the HMD wearer.
The user looks where she wants, and the system must respond within tens of
milliseconds.
Registration errors are difficult to adequately control because of the high accuracy
requirements and the numerous sources of error. These sources of error can be divided
into two types: static and dynamic. Static errors are the ones that cause registration errors
even when the user's viewpoint and the objects in the environment remain completely
still. Dynamic errors are the ones that have no effect until either the viewpoint or the
objects begin moving.
For current HMD-based systems, dynamic errors are by far the largest contributors to
registration errors, but static errors cannot be ignored either. The next two sections
discuss static and dynamic errors and what has been done to reduce them. See
[Holloway95] for a thorough analysis of the sources and magnitudes of registration
errors.
Static errors
The four main sources of static errors are:
 Optical distortion
 Errors in the tracking system
 Mechanical misalignments
 Incorrect viewing parameters (e.g., field of view, tracker-to-eye position and
orientation, interpupillary distance)
1) Distortion in the optics: Optical distortions exist in most camera and lens systems,
both in the cameras that record the real environment and in the optics used for the
display. Because distortions are usually a function of the radial distance away from the
optical axis, wide field-of-view displays can be especially vulnerable to this error. Near
the center of the field-of-view, images are relatively undistorted, but far away from the
center, image distortion can be large. For example, straight lines may appear curved. In a
see-through HMD with narrow field-of-view displays, the optical combiners add
virtually no distortion, so the user's view of the real world is not warped. However, the
optics used to focus and magnify the graphic images from the display monitors can
introduce distortion. This mapping of distorted virtual images on top of an undistorted
view of the real world causes static registration errors. The cameras and displays may
also have nonlinear distortions that cause errors.
Optical distortions are usually systematic errors, so they can be mapped and
compensated. This mapping may not be trivial, but it is often possible. For example, the distortion of one commonly used set of HMD optics has been characterized in the literature. The distortions might be compensated by additional optics; such a design has been described for a video see-through HMD. This can be a difficult design problem, though, and it will add weight, which is
not desirable in HMDs. An alternate approach is to do the compensation digitally. This
can be done by image warping techniques, both on the digitized video and the graphic
images. Typically, this involves predistorting the images so that they will appear
undistorted after being displayed. Another way to perform digital compensation on the
graphics is to apply the predistortion functions on the vertices of the polygons, in screen
space, before rendering. This requires subdividing polygons that cover large areas in
screen space. Both digital compensation methods can be computationally expensive,
often requiring special hardware to accomplish in real time. Holloway determined that
the additional system delay required by the distortion compensation adds more
registration error than the distortion compensation removes, for typical head motion.
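The vertex-predistortion idea can be illustrated with a toy radial model. The single-coefficient distortion and the fixed-point inversion below are simplifying assumptions for illustration, not the optics of any particular HMD:

```python
import numpy as np

def distort(pts, k):
    # Toy one-coefficient radial model applied by the optics:
    # r -> r * (1 + k * r^2), with pts centered on the optical axis.
    r2 = np.sum(pts**2, axis=1, keepdims=True)
    return pts * (1 + k * r2)

def predistort(pts, k, iters=20):
    # Numerically invert the radial model by fixed-point iteration,
    # so that distort(predistort(p)) ~= p for every vertex.
    out = pts
    for _ in range(iters):
        r2 = np.sum(out**2, axis=1, keepdims=True)
        out = pts / (1 + k * r2)
    return out

# Predistorting polygon vertices in screen space, then letting the
# optics distort them, yields the original (undistorted) positions.
verts = np.array([[0.5, 0.5], [-0.3, 0.8]])
restored = distort(predistort(verts, k=0.2), k=0.2)
```

Applying the predistortion per vertex is exactly why polygons covering large screen areas must be subdivided: the correction is nonlinear, but rendering interpolates linearly between vertices.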
2) Errors in the tracking system: Errors in the reported outputs from the tracking and
sensing systems are often the most serious type of static registration errors. These
distortions are not easy to measure and eliminate, because that requires another "3-D
ruler" that is more accurate than the tracker being tested. These errors are often non-systematic and difficult to fully characterize. Almost all commercially available tracking systems are not accurate enough to satisfy the requirements of AR systems. The Sensing section below discusses this important topic further.
3) Mechanical misalignments: Mechanical misalignments are discrepancies between the
model or specification of the hardware and the actual physical properties of the real
system. For example, the combiners, optics, and monitors in an optical see-through
HMD may not be at the expected distances or orientations with respect to each other. If
the frame is not sufficiently rigid, the various component parts may change their relative
positions as the user moves around, causing errors. Mechanical misalignments can cause
subtle changes in the position and orientation of the projected virtual images that are
difficult to compensate. While some alignment errors can be calibrated, for many others
it may be more effective to "build it right" initially.
4) Incorrect viewing parameters: Incorrect viewing parameters, the last major source of
static registration errors, can be thought of as a special case of alignment errors where
calibration techniques can be applied. Viewing parameters specify how to convert the
reported head or camera locations into viewing matrices used by the scene generator to
draw the graphic images. For an HMD-based system, these parameters include:
 Center of projection and viewport dimensions
 Offset, both in translation and orientation, between the location of the head
tracker and the user's eyes
 Field of view
Incorrect viewing parameters cause systematic static errors. Take the example of a head
tracker located above a user's eyes. If the vertical translation offsets between the tracker
and the eyes are too small, all the virtual objects will appear lower than they should.
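This systematic offset error can be made concrete with a small sketch; the tracker height and the offset values are invented numbers for illustration:

```python
import numpy as np

def eye_position(tracker_pos, tracker_to_eye_offset):
    # The renderer places the eye at the reported tracker location
    # plus the calibrated tracker-to-eye offset.
    return tracker_pos + tracker_to_eye_offset

tracker = np.array([0.0, 1.70, 0.0])        # tracker on top of the head
true_offset = np.array([0.0, -0.10, 0.0])   # eyes really 10 cm below it
wrong_offset = np.array([0.0, -0.07, 0.0])  # offset underestimated by 3 cm

# Rendering with the wrong offset treats the eye as 3 cm higher than it
# is, so every virtual object appears lower in the user's actual view.
error = eye_position(tracker, wrong_offset) - eye_position(tracker, true_offset)
```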
In some systems, the viewing parameters are estimated by manual adjustments, in a non-systematic fashion. Such approaches proceed as follows: place a real object in the
environment and attempt to register a virtual object with that real object. While wearing
the HMD or positioning the cameras, move to one viewpoint or a few selected
viewpoints and manually adjust the location of the virtual object and the other viewing
parameters until the registration "looks right." This may achieve satisfactory results if the
environment and the viewpoint remain static. However, such approaches require a skilled
user and generally do not achieve robust results for many viewpoints. Achieving good
registration from a single viewpoint is much easier than registration from a wide variety
of viewpoints using a single set of parameters. Usually what happens is satisfactory
registration at one viewpoint, but when the user walks to a significantly different
viewpoint, the registration is inaccurate because of incorrect viewing parameters or
tracker distortions. This means many different sets of parameters must be used, which is
a less than satisfactory solution.
Another approach is to directly measure the parameters, using various measuring tools
and sensors. For example, a commonly-used optometrist's tool can measure the
interpupillary distance. Rulers might measure the offsets between the tracker and eye
positions. Cameras could be placed where the user's eyes would normally be in an optical
see-through HMD. By recording what the camera sees, through the see-through HMD, of
the real environment, one might be able to determine several viewing parameters. So far,
direct measurement techniques have enjoyed limited success.
View-based tasks are another approach to calibration. These ask the user to perform
various tasks that set up geometric constraints. By performing several tasks, enough
information is gathered to determine the viewing parameters. For example, [Azuma94]
asked a user wearing an optical see-through HMD to look straight through a narrow pipe
mounted in the real environment. This sets up the constraint that the user's eye must be
located along a line through the center of the pipe. Combining this with other tasks
created enough constraints to measure all the viewing parameters. [Caudell92] used a
different set of tasks, involving lining up two circles that specified a cone in the real
environment. [Oishi96] moves virtual cursors to appear on top of beacons in the real
environment. All view-based tasks rely upon the user accurately performing the specified
task and assume the tracker is accurate. If the tracking and sensing equipment is not
accurate, then multiple measurements must be taken and optimizers used to find the
"best-fit" solution.
For video-based systems, an extensive body of literature exists in the robotics and
photogrammetry communities on camera calibration techniques. Such techniques
compute a camera's viewing parameters by taking several pictures of an object of fixed
and sometimes unknown geometry. These pictures must be taken from different
locations. Matching points in the 2-D images with corresponding 3-D points on the
object sets up mathematical constraints. With enough pictures, these constraints
determine the viewing parameters and the 3-D location of the calibration object.
Alternately, they can serve to drive an optimization routine that will search for the best
set of viewing parameters that fits the collected data. Several AR systems have used
camera calibration techniques.
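As a sketch of the underlying mathematics, the Direct Linear Transform below estimates a 3x4 projection matrix from known 3-D/2-D point correspondences. Real calibration pipelines add coordinate normalization, lens-distortion terms, and nonlinear refinement; this minimal version assumes exact, noise-free matches:

```python
import numpy as np

def dlt_projection_matrix(X, x):
    """Estimate a 3x4 camera projection matrix P from n >= 6 known 3-D
    points X (n,3) in general position and their 2-D images x (n,2).
    Each correspondence contributes two linear constraints on the 12
    entries of P; the SVD's last right singular vector is the
    least-squares solution."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Pt = [Xw, Yw, Zw, 1.0]
        # Constraints from x cross (P @ X) = 0, in homogeneous form:
        A.append([0.0] * 4 + [-w for w in Pt] + [v * w for w in Pt])
        A.append(Pt + [0.0] * 4 + [-u * w for w in Pt])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)
```

The recovered matrix is defined only up to scale, which is why reprojected image points, rather than the matrix entries, are the right thing to check.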
Dynamic errors
Dynamic errors occur because of system delays, or lags. The end-to-end system delay is
defined as the time difference between the moment that the tracking system measures the
position and orientation of the viewpoint to the moment when the generated images
corresponding to that position and orientation appear in the displays. These delays exist
because each component in an Augmented Reality system requires some time to do its
job. The delays in the tracking subsystem, the communication delays, the time it takes
the scene generator to draw the appropriate images in the frame buffers, and the scanout
time from the frame buffer to the displays all contribute to end-to-end lag. End-to-end
delays of 100 ms are fairly typical on existing systems. Simpler systems can have less
delay, but other systems have more. Delays of 250 ms or more can exist on slow, heavily
loaded, or networked systems.
End-to-end system delays cause registration errors only when motion occurs. Assume
that the viewpoint and all objects remain still. Then the lag does not cause registration
errors. No matter how long the delay is, the images generated are appropriate, since
nothing has moved since the time the tracker measurement was taken. Compare this to
the case with motion. For example, assume a user wears a see-through HMD and moves
her head. The tracker measures the head at an initial time t. The images corresponding to
time t will not appear until some future time t2, because of the end-to-end system delays.
During this delay, the user's head remains in motion, so when the images computed at
time t finally appear, the user sees them at a different location than the one they were
computed for. Thus, the images are incorrect for the time they are actually viewed. To
the user, the virtual objects appear to "swim around" and "lag behind" the real objects.
This was graphically demonstrated in a videotape of UNC's ultrasound experiment
shown at SIGGRAPH '92. In Figure 21, the picture on the left shows what the
registration looks like when everything stands still. The virtual gray trapezoidal region
represents what the ultrasound wand is scanning. This virtual trapezoid should be
attached to the tip of the real ultrasound wand. This is the case in the picture on the left,
where the tip of the wand is visible at the bottom of the picture, to the left of the "UNC"
letters. But when the head or the wand moves, large dynamic registration errors occur, as
shown in the picture on the right. The tip of the wand is now far away from the virtual
trapezoid. Also note the motion blur in the background, which is caused by the user's
head motion.
Figure 21: Effect of motion and system delays on registration. Picture on the left is a static scene. Picture on the right shows motion.
(Courtesy UNC Chapel Hill Dept. of Computer Science)
System delays seriously hurt the illusion that the real and virtual worlds coexist because
they cause large registration errors. With a typical end-to-end lag of 100 ms and a
moderate head rotation rate of 50 degrees per second, the angular dynamic error is 5
degrees. At a 68 cm arm length, this results in registration errors of almost 60 mm.
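The arithmetic is straightforward to check (the helper name below is invented for illustration):

```python
import math

def dynamic_error_mm(lag_s, head_rate_deg_s, distance_m):
    # Angular error accumulated during the system lag, converted to a
    # linear offset at the given viewing distance.
    angle = math.radians(lag_s * head_rate_deg_s)
    return distance_m * math.tan(angle) * 1000.0

# The scenario from the text: 100 ms lag, 50 deg/s head rotation,
# objects at a 68 cm arm length -> roughly 60 mm of error.
err = dynamic_error_mm(0.100, 50.0, 0.68)
```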
System delay is the largest single source of registration error in existing AR systems,
outweighing all others combined. Methods used to reduce dynamic registration fall under
four main categories:
 Reduce system lag
 Reduce apparent lag
 Match temporal streams (with video-based systems)
 Predict future locations
1) Reduce system lag: The most direct approach is simply to reduce, or ideally eliminate,
the system delays. If there are no delays, there are no dynamic errors. Unfortunately,
modern scene generators are usually built for throughput, not minimal latency. It is
sometimes possible to reconfigure the software to sacrifice throughput to minimize
latency. For example, the SLATS system completes rendering a pair of interlaced NTSC
images in one field time (16.67 ms) on Pixel-Planes 5. Being careful about synchronizing
pipeline tasks can also reduce the end-to-end lag.
System delays are not likely to completely disappear anytime soon. Some believe that
the current course of technological development will automatically solve this problem.
Unfortunately, it is difficult to reduce system delays to the point where they are no
longer an issue. Recall that registration errors must be kept to a small fraction of a
degree. At the moderate head rotation rate of 50 degrees per second, system lag must be
10 ms or less to keep angular errors below 0.5 degrees. Just scanning out a frame buffer
to a display at 60 Hz requires 16.67 ms. It might be possible to build an HMD system
with less than 10 ms of lag, but the drastic cut in throughput and the expense required to
construct the system would make alternate solutions attractive. Minimizing system delay
is important, but reducing delay to the point where it is no longer a source of registration
error is not currently practical.
2) Reduce apparent lag: Image deflection is a clever technique for reducing the amount
of apparent system delay for systems that only use head orientation. It is a way to
incorporate more recent orientation measurements into the late stages of the rendering
pipeline. Therefore, it is a feed-forward technique. The scene generator renders an image
much larger than needed to fill the display. Then just before scanout, the system reads
the most recent orientation report. The orientation value is used to select the fraction of
the frame buffer to send to the display, since small orientation changes are equivalent to
shifting the frame buffer output horizontally and vertically.
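A minimal sketch of image deflection, assuming an oversized framebuffer and a simple linear degrees-per-pixel mapping (both illustrative assumptions):

```python
import numpy as np

def deflect(framebuffer, rendered_yaw_pitch, latest_yaw_pitch,
            view_size, deg_per_pixel):
    """Select the sub-window of an oversized rendered framebuffer that
    corresponds to the most recent orientation reading, just before
    scanout. A small orientation change is approximated as a pure
    horizontal/vertical shift of the output window."""
    h, w = framebuffer.shape[:2]
    vh, vw = view_size
    # Offset of the view window from the center of the oversized render,
    # driven by how far the head turned since the frame was rendered.
    dx = int(round((latest_yaw_pitch[0] - rendered_yaw_pitch[0]) / deg_per_pixel))
    dy = int(round((latest_yaw_pitch[1] - rendered_yaw_pitch[1]) / deg_per_pixel))
    top = (h - vh) // 2 + dy
    left = (w - vw) // 2 + dx
    return framebuffer[top:top + vh, left:left + vw]
```

Because only the window selection changes, the correction costs almost nothing, which is what makes the technique a feed-forward fix for orientation-only lag.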
Image deflection does not work on translation, but image warping techniques might.
After the scene generator renders the image based upon the head tracker reading, small
adjustments in orientation and translation could be done after rendering by warping the
image. These techniques assume knowledge of the depth at every pixel, and the warp
must be done much more quickly than rerendering the entire image.
3) Match temporal streams: In video-based AR systems, the video camera and
digitization hardware impose inherent delays on the user's view of the real world. This is
potentially a blessing when reducing dynamic errors, because it allows the temporal
streams of the real and virtual images to be matched. Additional delay is added to the
video from the real world to match the scene generator delays in generating the virtual
images. This additional delay to the video stream will probably not remain constant,
since the scene generator delay will vary with the complexity of the rendered scene.
Therefore, the system must dynamically synchronize the two streams.
Note that while this reduces conflicts between the real and virtual, now both the real and
virtual objects are delayed in time. While this may not be bothersome for small delays, it
is a major problem in the related area of telepresence systems and will not be easy to
overcome. For long delays, this can produce negative effects such as pilot-induced
oscillation.
4) Predict: The last method is to predict the future viewpoint and object locations. If the
future locations are known, the scene can be rendered with these future locations, rather
than the measured locations. Then when the scene finally appears, the viewpoints and
objects have moved to the predicted locations, and the graphic images are correct at the
time they are viewed. For short system delays (under ~80 ms), prediction has been
shown to reduce dynamic errors by up to an order of magnitude [Azuma94]. Accurate
predictions require a system built for real-time measurements and computation. Using
inertial sensors makes predictions more accurate by a factor of 2-3. Predictors have been
developed for a few AR systems, but the majority were implemented and evaluated with
VE systems. More work needs to be done on ways of comparing the theoretical
performance of various predictors and in developing prediction models that better match
actual head motion.
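A minimal predictor can be sketched with a constant-velocity model; real systems use Kalman filters and inertial rate sensors, as noted above:

```python
def predict_orientation(samples, latency_s):
    """Predict head yaw at the time the image will actually be displayed.
    Estimates the angular rate from the last two tracker samples and
    extrapolates across the known end-to-end latency. samples is a list
    of (time_s, yaw_deg) pairs, oldest first; a constant-velocity model
    is a simplifying assumption."""
    (t0, y0), (t1, y1) = samples[-2], samples[-1]
    rate = (y1 - y0) / (t1 - t0)
    return y1 + rate * latency_s
```

For example, a head turning at 50 degrees per second with 100 ms of latency is predicted 5 degrees ahead of the last measurement, which is exactly the dynamic error that would otherwise appear.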
Vision-based techniques
Mike Bajura and Ulrich Neumann point out that registration based solely on the
information from the tracking system is like building an "open-loop" controller. The
system has no feedback on how closely the real and virtual actually match. Without
feedback, it is difficult to build a system that achieves perfect matches. However, video-based approaches can use image processing or computer vision techniques to aid
registration. Since video-based AR systems have a digitized image of the real
environment, it may be possible to detect features in the environment and use those to
enforce registration. They call this a "closed-loop" approach, since the digitized image
provides a mechanism for bringing feedback into the system.
This is not a trivial task. This detection and matching must run in real time and must be
robust. This often requires special hardware and sensors. However, it is also not an "AI-complete" problem because this is simpler than the general computer vision problem.
For example, in some AR applications it is acceptable to place fiducials in the
environment. These fiducials may be LEDs or special markers. Recent ultrasound
experiments at UNC Chapel Hill have used colored dots as fiducials. The locations or
patterns of the fiducials are assumed to be known. Image processing detects the locations
of the fiducials, and then those are used to make corrections that enforce proper
registration.
These routines assume that one or more fiducials are visible at all times; without them,
the registration can fall apart. But when the fiducials are visible, the results can be
accurate to one pixel, which is as about close as one can get with video techniques.
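The fiducial-detection step can be sketched as thresholding a grayscale image and taking blob centroids via flood fill. Real systems use colored dots or LEDs, subpixel refinement, and often special hardware; this minimal version assumes bright, well-separated fiducials:

```python
import numpy as np

def fiducial_centroids(image, threshold):
    """Return (x, y) image-space centroids of bright blobs in a
    grayscale image, found by thresholding followed by 4-connected
    flood fill over each unvisited bright region."""
    mask = image > threshold
    seen = np.zeros_like(mask)
    h, w = mask.shape
    centroids = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                stack, pixels = [(sy, sx)], []
                seen[sy, sx] = True
                while stack:  # flood fill one connected component
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                ys, xs = zip(*pixels)
                centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centroids
```

The detected centroids would then feed the correction step described above, pulling the rendered virtual objects into registration with the observed fiducial locations.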
Figure 22, taken from [Bajura95], shows a virtual arrow and a virtual chimney exactly
aligned with their desired points on two real objects. The real objects each have an LED
to aid the registration. Figures 23 through 25 show registration from [Mellor95a], which
uses dots with a circular pattern as the fiducials. The registration is also nearly perfect.
Figure 26 demonstrates merging virtual objects with the real environment, using colored
dots as the fiducials in a video-based approach. In the picture on the left, the stacks of
cards in the center are real, but the ones on the right are virtual. Notice that they
penetrate one of the blocks. In the image on the right, a virtual spiral object
interpenetrates the real blocks and table and also casts virtual shadows upon the real
objects.
Figure 22: A virtual arrow and virtual chimney aligned with two real objects. (Courtesy Mike Bajura, UNC Chapel Hill Dept. of
Computer Science, and Ulrich Neumann, USC)
Figure 23: Real skull with five fiducials. (Courtesy J.P. Mellor, MIT AI Lab)
Figure 24: Virtual wireframe skull registered with real skull. (Courtesy J.P. Mellor, MIT AI Lab)
Figure 25: Virtual wireframe skull registered with real skull moved to a different position. (Courtesy J.P.
Mellor, MIT AI Lab)
Figure 26: Virtual cards and spiral object merged with real blocks and table. (Courtesy Andrei State, UNC
Chapel Hill Dept. of Computer Science.)
Instead of fiducials, [Uenohara95] uses template matching to achieve registration.
Template images of the real object are taken from a variety of viewpoints. These are used
to search the digitized image for the real object. Once that is found, a virtual wireframe
can be superimposed on the real object.
Recent approaches in video-based matching avoid the need for any calibration.
[Kutukalos96] represents virtual objects in a non-Euclidean, affine frame of reference
that allows rendering without knowledge of camera parameters. [Iu96] extracts contours
from the video of the real world, and then uses an optimization technique to match the
contours of the rendered 3-D virtual object with the contour extracted from the video.
Note that calibration-free approaches may not recover all the information required to
perform all potential AR tasks. For example, these two approaches do not recover true
depth information, which is useful when compositing the real and the virtual.
Techniques that use fiducials as the sole tracking source determine the relative projective
relationship between the objects in the environment and the video camera. While this is
enough to ensure registration, it does not provide all the information one might need in
some AR applications, such as the absolute (rather than relative) locations of the objects
and the camera. Absolute locations are needed to include virtual and real objects that are
not tracked by the video camera, such as a 3-D pointer or other virtual objects not
directly tied to real objects in the scene.
Additional sensors besides video cameras can aid registration. Both [Mellor95a]
[Mellor95b] and [Grimson94] [Grimson95] use a laser rangefinder to acquire an initial
depth map of the real object in the environment. Given a matching virtual model, the
system can match the depth maps from the real and virtual until they are properly
aligned, and that provides the information needed for registration.
Another way to reduce the difficulty of the problem is to accept the fact that the system
may not be robust and may not be able to perform all tasks automatically. Then it can ask
the user to perform certain tasks. The system in [Sharma94] expects manual intervention
when the vision algorithms fail to identify a part because the view is obscured. The
calibration techniques in [Tuceryan95] are heavily based on computer vision techniques,
but they ask the user to manually intervene by specifying correspondences when
necessary.
Current status
The registration requirements for AR are difficult to satisfy, but a few systems have
achieved good results. [Azuma94] is an open-loop system that shows registration
typically within ±5 millimeters from many viewpoints for an object at about arm's
length. Closed-loop systems, however, have demonstrated nearly perfect registration,
accurate to within a pixel.
The registration problem is far from solved. Many systems assume a static viewpoint,
static objects, or even both. Even if the viewpoint or objects are allowed to move, they
are often restricted in how far they can travel. Registration is shown under controlled
circumstances, often with only a small number of real-world objects, or where the
objects are already well-known to the system. For example, registration may only work
on one object marked with fiducials, and not on any other objects in the scene. Much
more work needs to be done to increase the domains in which registration is robust.
Duplicating registration methods remains a nontrivial task, due to both the complexity of
the methods and the additional hardware required. If simple yet effective solutions could
be developed, that would speed the acceptance of AR systems.
Sensing
Accurate registration and positioning of virtual objects in the real environment requires
accurate tracking of the user's head and sensing the locations of other objects in the
environment. The biggest single obstacle to building effective Augmented Reality
systems is the requirement of accurate, long-range sensors and trackers that report the
locations of the user and the surrounding objects in the environment. Commercial
trackers are aimed at the needs of Virtual Environments and motion capture applications.
Compared to those two applications, Augmented Reality has much stricter accuracy
requirements and demands larger working volumes. No tracker currently provides high
accuracy at long ranges in real time. More work needs to be done to develop sensors and
trackers that can meet these stringent requirements. Specifically, AR demands more from
trackers and sensors in three areas:
 Greater input variety and bandwidth
 Higher accuracy
 Longer range
Input variety and bandwidth
VE systems are primarily built to handle output bandwidth: the images displayed, sounds
generated, etc. The input bandwidth is tiny: the locations of the user's head and hands,
the outputs from the buttons and other control devices, etc. AR systems, however, will
need a greater variety of input sensors and much more input bandwidth. There are a
greater variety of possible input sensors than output displays. Outputs are limited to the
five human senses. Inputs can come from anything a sensor can detect. Robinett
speculates that Augmented Reality may be useful in any application that requires
displaying information not directly available or detectable by human senses by making
that information visible (or audible, touchable, etc.). Recall that the proposed medical applications discussed earlier use CT, MRI and ultrasound sensors as inputs. Other future
applications might use sensors to extend the user's visual range into infrared or
ultraviolet frequencies, and remote sensors would let users view objects hidden by walls
or hills. Conceptually, anything not detectable by human senses but detectable by
machines might be transduced into something that a user can sense in an AR system.
Range data is a particular input that is vital for many AR applications. The AR system
knows the distance to the virtual objects, because that model is built into the system. But
the AR system may not know where all the real objects are in the environment. The
system might assume that the entire environment is measured at the beginning and
remains static thereafter. However, some useful applications will require a dynamic
environment, in which real objects move, so the objects must be tracked in real time.
However, for some applications a depth map of the real environment would be sufficient.
That would allow real objects to occlude virtual objects through a pixel-by-pixel depth
value comparison. Acquiring this depth map in real time is not trivial. Sensors like laser
rangefinders might be used. Many computer vision techniques for recovering shape
through various strategies (e.g., "shape from stereo," or "shape from shading") have been
tried. A recent work uses intensity-based matching from a pair of stereo images to do
depth recovery. Recovering depth through existing vision techniques is difficult to do
robustly in real time.
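The pixel-by-pixel depth comparison mentioned above can be sketched in a few lines. This is a minimal illustration assuming NumPy and hypothetical depth maps already registered to the same camera frame; the function name and toy data are not from any cited system.

```python
import numpy as np

def composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth):
    """Composite virtual imagery over a real camera frame, letting real
    objects occlude virtual ones wherever the real surface is closer.

    real_rgb, virt_rgb     : (H, W, 3) colour images
    real_depth, virt_depth : (H, W) depth maps in metres
    """
    # A virtual pixel is visible only where the virtual surface is
    # nearer to the camera than the sensed real surface.
    virt_visible = virt_depth < real_depth
    out = real_rgb.copy()
    out[virt_visible] = virt_rgb[virt_visible]
    return out

# Toy 2x2 example: a virtual object at 1 m, a real wall at 2 m, except one
# pixel where a real object sits at 0.5 m and should occlude the virtual one.
real = np.zeros((2, 2, 3), dtype=np.uint8)      # black real background
virt = np.full((2, 2, 3), 255, dtype=np.uint8)  # white virtual object
real_d = np.array([[2.0, 2.0], [2.0, 0.5]])
virt_d = np.full((2, 2), 1.0)
frame = composite_with_occlusion(real, real_d, virt, virt_d)
# The pixel at (1, 1) stays real (black); the rest show the virtual object.
```

The hard part in practice is not this comparison but acquiring `real_depth` in real time, as the text notes.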
Finally, some annotation applications require access to a detailed database of the
environment, which is a type of input to the system. For example, the architectural
application of "seeing into the walls" assumes that the system has a database of where all
the pipes, wires and other hidden objects are within the building. Such a database may
not be readily available, and even if it is, it may not be in a format that is easily usable.
For example, the data may not be grouped to segregate the parts of the model that
represent wires from the parts that represent pipes. Thus, a significant modelling effort
may be required and should be taken into consideration when building an AR
application.
High accuracy
The accuracy requirements for the trackers and sensors are driven by the accuracies
needed for visual registration, as described in the previous section. For many approaches,
the registration is only as accurate as the tracker. Therefore, the AR system needs
trackers that are accurate to around a millimeter and a tiny fraction of a degree, across
the entire working range of the tracker.
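To see why such tight tolerances matter, a back-of-the-envelope calculation shows how a small angular tracking error translates into lateral misregistration at arm's length. The function name and the 500 mm viewing distance are illustrative assumptions, not figures from any cited system.

```python
import math

def registration_offset_mm(distance_mm, angular_error_deg):
    """Lateral misregistration caused by an angular tracking error,
    for an object at the given viewing distance (simple geometry)."""
    return distance_mm * math.tan(math.radians(angular_error_deg))

# An object at roughly arm's length (~500 mm):
for err_deg in (0.05, 0.1, 0.5, 1.0):
    off = registration_offset_mm(500, err_deg)
    print(f"{err_deg:4.2f} deg -> {off:5.2f} mm offset")
# Even half a degree of orientation error already exceeds
# the millimeter-level registration target.
```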
Few trackers can meet this specification, and every technology has weaknesses. Some
mechanical trackers are accurate enough, although they tether the user to a limited
working volume. Magnetic trackers are vulnerable to distortion by metal in the
environment, which exists in many desired AR application environments. Ultrasonic
trackers suffer from noise and are difficult to make accurate at long ranges because of
variations in the ambient temperature. Optical technologies have distortion and
calibration problems. Inertial trackers drift with time. Of the individual technologies,
optical technologies show the most promise due to trends toward high-resolution digital
cameras, real-time photogrammetric techniques, and structured light sources that result
in more signal strength at long distances. Future tracking systems that can meet the
stringent requirements of AR will probably be hybrid systems, such as a combination of
inertial and optical technologies. Using multiple technologies opens the possibility of
covering for each technology's weaknesses by combining their strengths.
Attempts have been made to calibrate the distortions in commonly-used magnetic
tracking systems. These have succeeded at removing much of the gross error from the
tracker at long ranges, but not to the level required by AR systems. For example, mean
errors at long ranges can be reduced from several inches to around one inch.
The requirements for registering other sensor modes are not nearly as stringent. For
example, the human auditory system is not very good at localizing deep bass sounds,
which is why subwoofer placement is not critical in a home theater system.
Long range
Few trackers are built for accuracy at long ranges, since most VE applications do not
require long ranges. Motion capture applications track an actor's body parts to control a computer-animated character or to analyse an actor's movements. This is fine for position recovery, but not for orientation: orientation must be derived from the computed positions, and even tiny errors in those positions can cause orientation errors of a few degrees, which is too large for AR systems.
Two scalable tracking systems for HMDs have been described in the literature. A
scalable system is one that can be expanded to cover any desired range, simply by adding
more modular components to the system. This is done by building a cellular tracking
system, where only nearby sources and sensors are used to track a user. As the user
walks around, the set of sources and sensors changes, thus achieving large working
volumes while avoiding long distances between the current working set of sources and
sensors. While scalable trackers can be effective, they are complex and by their very
nature have many components, making them relatively expensive to construct.
The Global Positioning System (GPS) is used to track the locations of vehicles almost
anywhere on the planet. It might be useful as one part of a long range tracker for AR
systems. However, by itself it will not be sufficient. The best reported accuracy, achieved when GPS is run in differential mode, is approximately one centimeter, and even that assumes many measurements are integrated, so the accuracy is not available in real time. That is not sufficiently accurate to recover orientation from a set of positions on a user.
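The claim that centimeter-level position fixes cannot recover orientation can be checked with simple geometry. A hedged sketch, assuming orientation is derived from two antennas a fixed baseline apart; the function and the numbers are illustrative, not from any cited system.

```python
import math

def worst_case_heading_error_deg(position_error_m, baseline_m):
    """Worst-case heading error when orientation is derived from two
    position fixes a fixed baseline apart, each with the given error."""
    # In the worst case the two fixes are off in opposite directions
    # perpendicular to the baseline, tilting the apparent heading.
    return math.degrees(math.atan2(2 * position_error_m, baseline_m))

# Differential GPS (~1 cm error) with antennas 0.5 m apart on a helmet:
err = worst_case_heading_error_deg(0.01, 0.5)
print(f"up to {err:.1f} degrees of heading error")  # roughly 2.3 degrees
```

A few degrees of heading error is exactly the magnitude the text identifies as too large for AR.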
Tracking an AR system outdoors in real time with the required accuracy has not been
demonstrated and remains an open problem.
Future directions
This section identifies areas and approaches that require further research to produce
improved AR systems.
Hybrid approaches: Future tracking systems may be hybrids, because combining
approaches can cover weaknesses. The same may be true for other problems in
AR. For example, current registration strategies generally focus on a single
strategy. Future systems may be more robust if several techniques are combined.
An example is combining vision-based techniques with prediction. If the fiducials
are not available, the system switches to open-loop prediction to reduce the
registration errors, rather than breaking down completely. The predicted
viewpoints in turn produce a more accurate initial location estimate for the
vision-based techniques.
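The fallback behaviour described above can be sketched in a few lines. This is a minimal, hypothetical illustration: a one-dimensional pose and a constant-velocity predictor stand in for a real inertial motion model and fiducial tracker.

```python
class ConstantVelocityPredictor:
    """Minimal open-loop motion model: extrapolate the last two poses."""
    def __init__(self):
        self.prev, self.last = 0.0, 0.0   # 1-D pose, for illustration only

    def predict(self):
        return self.last + (self.last - self.prev)

    def correct(self, pose):
        self.prev, self.last = self.last, pose

def estimate_viewpoint(measured_pose, predictor):
    """One step of a hybrid tracker: use the vision-based measurement
    when fiducials were found (measured_pose is not None); otherwise
    fall back to open-loop prediction instead of failing outright."""
    if measured_pose is not None:
        predictor.correct(measured_pose)  # keep the motion model honest
        return measured_pose
    return predictor.predict()            # degrade gracefully

p = ConstantVelocityPredictor()
estimate_viewpoint(1.0, p)          # fiducials visible
estimate_viewpoint(2.0, p)          # fiducials visible: moving +1 per frame
pose = estimate_viewpoint(None, p)  # fiducials lost -> predicted pose 3.0
```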
Real-time systems and time-critical computing: Many VE systems are not truly
run in real time. Instead, it is common to build the system, often on UNIX, and
then see how fast it runs. This may be sufficient for some VE applications. Since
everything is virtual, all the objects are automatically synchronized with each
other. AR is a different story. Now the virtual and real must be synchronized, and
the real world "runs" in real time. Therefore, effective AR systems must be built
with real-time performance in mind. Accurate timestamps must be available.
Operating systems must not arbitrarily swap out the AR software process at any
time, for arbitrary durations. Systems must be built to guarantee completion
within specified time budgets, rather than just "running as quickly as possible."
These are characteristics of flight simulators and a few VE systems. Constructing
and debugging real-time systems is often painful and difficult, but the
requirements for AR demand real-time performance.
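A minimal sketch of the time-budgeted loop such a system needs, as opposed to a run-as-fast-as-possible loop. The 60 Hz budget and the structure are illustrative assumptions; a real AR system would rely on operating-system real-time scheduling guarantees rather than cooperative sleeping.

```python
import time

FRAME_BUDGET = 1 / 60          # seconds available per frame

def run_frames(render_frame, n_frames):
    """Frame loop that enforces a per-frame time budget. Each frame is
    timestamped so lag can be measured, and an overrun is detected
    rather than silently absorbed."""
    overruns = 0
    for _ in range(n_frames):
        start = time.monotonic()
        render_frame(start)              # pass the timestamp to the renderer
        elapsed = time.monotonic() - start
        if elapsed > FRAME_BUDGET:
            overruns += 1                # a real system would degrade quality
        else:
            time.sleep(FRAME_BUDGET - elapsed)
    return overruns

overruns = run_frames(lambda ts: None, 5)   # trivial renderer never overruns
```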
Perceptual and psychophysical studies: Augmented Reality is an area ripe for
psychophysical studies. How much lag can a user detect? How much registration
error is detectable when the head is moving? Besides questions on perception,
psychological experiments that explore performance issues are also needed. How
much does head-motion prediction improve user performance on a specific task?
How much registration error is tolerable for a specific application before
performance on that task degrades substantially? Is the allowable error larger
while the user moves her head versus when she stands still? Furthermore, not
much is known about potential optical illusions caused by errors or conflicts in
the simultaneous display of real and virtual objects.
Few experiments in this area have been performed. Jannick Rolland, Frank
Biocca and their students conducted a study of the effect caused by eye
displacements in video see-through HMDs. They found that users partially
adapted to the eye displacement, but they also had negative aftereffects after
removing the HMD. Steve Ellis' group at NASA Ames has conducted work on
perceived depth in a see-through HMD. ATR has also conducted a study.
Portability: The previous section explained why some potential AR applications
require giving the user the ability to walk around large environments, even
outdoors. This requires making the equipment self-contained and portable.
Existing tracking technology is not capable of tracking a user outdoors at the
required accuracy.
Multimodal displays: Almost all work in AR has focused on the visual sense:
virtual graphic objects and overlays. But in the previous section I explained that
augmentation might apply to all other senses as well. In particular, adding and
removing 3-D sound is a capability that could be useful in some AR applications.
Social and political issues: Technological issues are not the only ones that need
to be considered when building a real application. There are also social and
political dimensions when getting new technologies into the hands of real users.
Sometimes, perception is what counts, even if the technological reality is
different. For example, if workers perceive lasers to be a health risk, they may
refuse to use a system with lasers in the display or in the trackers, even if those
lasers are eye safe. Ergonomics and ease of use are paramount considerations.
Whether AR is truly a cost-effective solution in its proposed applications has yet
to be determined. Another important factor is whether or not the technology is
perceived as a threat to jobs, as a replacement for workers, especially with many
corporations undergoing recent layoffs. AR may do well in this regard, because it
is intended as a tool to make the user's job easier, rather than something that
completely replaces the human worker. Although technology transfer is not
normally a subject of academic papers, it is a real problem. Social and political
concerns should not be ignored during attempts to move AR out of the research
lab and into the hands of real users.
Conclusion
Augmented Reality is far behind Virtual Environments in maturity. Several commercial
vendors sell complete, turnkey Virtual Environment systems. However, no commercial
vendor currently sells an HMD-based Augmented Reality system. A few monitor-based
"virtual set" systems are available, but today AR systems are primarily found in
academic and industrial research laboratories.
The first deployed HMD-based AR systems will probably be in the application of aircraft
manufacturing. Both Boeing and McDonnell Douglas are exploring this technology. The
former uses optical approaches, while the latter is pursuing video approaches. Boeing has
performed trial runs with workers using a prototype system but has not yet made any
deployment decisions. Annotation and visualization applications in restricted, limited-range environments are deployable today, although much more work needs to be done to
make them cost effective and flexible. Applications in medical visualization will take
longer. Prototype visualization aids have been used on an experimental basis, but the
stringent registration requirements and ramifications of mistakes will postpone common
usage for many years. AR will probably be used for medical training before it is
commonly used in surgery.
The next generation of combat aircraft will have Helmet-Mounted Sights with graphics
registered to targets in the environment [Wanstall89]. These displays, combined with
short-range steerable missiles that can shoot at targets off-boresight, give a tremendous
combat advantage to pilots in dogfights. Instead of having to be directly behind his target
in order to shoot at it, a pilot can now shoot at anything within a 60-90 degree cone of his
aircraft's forward centerline. Russia and Israel currently have systems with this
capability, and the U.S. is expected to field the AIM-9X missile with its associated
Helmet-Mounted Sight in 2002. Registration errors due to delays are a major problem in
this application.
Augmented Reality is a relatively new field, where most of the research efforts have
occurred in the past four years, as shown by the references listed at the end of this paper.
The SIGGRAPH "Rediscovering Our Fire" report identified Augmented Reality as one
of four areas where SIGGRAPH should encourage more submissions. Because of the
numerous challenges and unexplored avenues in this area, AR will remain a vibrant area
of research for at least the next several years.
One area where a breakthrough is required is tracking an HMD outdoors at the accuracy
required by AR. If this is accomplished, several interesting applications will become
possible. Two examples are described here: navigation maps and visualization of past
and future environments.
The first application is a navigation aid to people walking outdoors. These individuals
could be soldiers advancing upon their objective, hikers lost in the woods, or tourists
seeking directions to their intended destination. Today, these individuals must pull out a
physical map and associate what they see in the real environment around them with the
markings on the 2-D map. If landmarks are not easily identifiable, this association can
be difficult to perform, as anyone lost in the woods can attest. An AR system makes
navigation easier by performing the association step automatically. If the user's position
and orientation are known, and the AR system has access to a digital map of the area,
then the AR system can draw the map in 3-D directly upon the user's view. The user
looks at a nearby mountain and sees graphics directly overlaid on the real environment
explaining the mountain's name, how tall it is, how far away it is, and where the trail is
that leads to the top.
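The association step the AR system performs automatically can be illustrated with a toy pinhole projection. A sketch under simplifying assumptions: a level camera, yaw-only heading, and an illustrative focal length and image centre.

```python
import math

def project_landmark(user_pos, heading_deg, landmark, f_px=800, cx=320, cy=240):
    """Project a landmark (east, north, up, in metres) into screen pixels
    for a user at user_pos facing heading_deg (0 = north), using a simple
    pinhole camera. Returns None if the landmark is behind the viewer."""
    dx = landmark[0] - user_pos[0]
    dy = landmark[1] - user_pos[1]
    dz = landmark[2] - user_pos[2]
    # Rotate world offsets into the camera frame (yaw only, level camera).
    h = math.radians(heading_deg)
    right = dx * math.cos(h) - dy * math.sin(h)
    forward = dx * math.sin(h) + dy * math.cos(h)
    if forward <= 0:
        return None                       # behind the viewer, not drawn
    u = cx + f_px * right / forward       # horizontal pixel coordinate
    v = cy - f_px * dz / forward          # vertical pixel coordinate
    return (u, v)

# A mountain summit 1000 m due north and 200 m above a north-facing user:
px = project_landmark((0, 0, 0), 0, (0, 1000, 200))
# Lands on the vertical centre line, above image centre: (320.0, 80.0)
```

An annotation (name, height, distance) would then be drawn at the returned pixel position; accurate `user_pos` and `heading_deg` are exactly what the outdoor tracking problem above fails to provide today.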
The second application is visualization of locations and events as they were in the past or
as they will be after future changes are performed. Tourists that visit historical sites, such
as a Civil War battlefield or the Acropolis in Athens, Greece, do not see these locations
as they were in the past, due to changes over time. It is often difficult for a modern
visitor to imagine what these sites really looked like in the past. To help, some historical
sites stage "Living History" events where volunteers wear ancient clothes and reenact
historical events. A tourist equipped with an outdoors AR system could see a computergenerated version of Living History. The HMD could cover up modern buildings and
monuments in the background and show, directly on the grounds at Gettysburg, where
the Union and Confederate troops were at the fateful moment of Pickett's charge. The
gutted interior of the modern Parthenon would be filled in by computer-generated
representations of what it looked like in 430 BC, including the long-vanished gold statue
of Athena in the middle. Tourists and students walking around the grounds with such AR
displays would gain a much better understanding of these historical sites and the
important events that took place there. Similarly, AR displays could show what proposed
architectural changes would look like before they are carried out. An urban designer
could show clients and politicians what a new stadium would look like as they walked
around the adjoining neighborhood, to better understand how the stadium project will
affect nearby residents.
After the basic problems with AR are solved, the ultimate goal will be to generate virtual
objects that are so realistic that they are virtually indistinguishable from the real
environment. Photorealism has been demonstrated in feature films, but accomplishing
this in an interactive application will be much harder. Lighting conditions, surface
reflections, and other properties must be measured automatically, in real time. More
sophisticated lighting, texturing, and shading capabilities must run at interactive rates in
future scene generators. Registration must be nearly perfect, without manual intervention
or adjustments. While these are difficult problems, they are probably not insurmountable.
It took about 25 years to progress from drawing stick figures on a screen to the
photorealistic dinosaurs in "Jurassic Park." Within another 25 years, we should be able to
wear a pair of AR glasses outdoors to see and interact with photorealistic dinosaurs
eating a tree in our backyard.
Computer Supported Cooperative Work (CSCW)
Overview
The power of the web as a new medium derives not only from its ability to allow people
to communicate across vast distances and to different times, but also from the ability of
machines to help people communicate and manage information. The web is a complex
distributed system, and object technology has been an important part of the managing of
the complexity of the web from its creation.
Despite the growth of interest in the field of Computer Supported Cooperative Work (CSCW), and the increasingly large number of systems that have been developed, it is still the case that few systems have been adopted for widespread use. This is particularly
true for widely dispersed, cross-organisational working groups where problems of
heterogeneity in computing hardware and software environments inhibit the deployment
of CSCW technologies. With a lightweight and extensible client-server architecture,
client implementations for all popular computing platforms, and an existing user base
numbered in millions, the World Wide Web offers great potential in solving some of
these problems to provide an 'enabling technology' for CSCW applications. I illustrate this potential using work with the BSCW shared workspace system, an extension to the Web architecture that provides basic facilities for collaborative information sharing from unmodified Web browsers. I conclude that despite limitations in the range of applications that can be directly supported, building on the strengths of the Web can give significant benefits in easing the development and deployment of CSCW applications.
Introduction
Over the last decade the level of interest in the field of Computer Supported Cooperative
Work (CSCW) has grown enormously and an ever-increasing number of systems have
been developed with the goal of supporting collaborative work. These efforts have led to
a greater understanding of the complexity of group work and the implications of this
complexity, in terms of the flexibility required of supporting computer systems, have
driven much of the recent work in the field. Despite these advances, however, it is still
the case that few cooperative systems are in widespread use and most exist only as
laboratory-based prototypes. This is particularly true for widely dispersed working
groups, where electronic mail and simple file-transfer programs remain the state-of-the-art in providing computer support for collaborative work.
In this section I examine the World Wide Web as a technology for enabling development
of more effective Computer Supported Cooperative Work (CSCW) systems. The Web
provides a simple client-server architecture with client programs (browsers) implemented
for all popular computing platforms and a central server component that can be extended
through a standard API. The Web has been extremely successful in providing a simple
method for users to search, browse and retrieve information as well as publish
information of their own, but does not currently offer features for more collaborative
forms of information sharing such as joint document production.
There are a number of reasons to suggest the Web might be a suitable focus for
developers of CSCW systems. For widely dispersed working groups, where members
may be in different organisations and different countries, issues of integration and
interoperability often make it difficult to deploy existing groupware applications.
Although non-computer-based solutions such as telephone and video conferencing
technologies provide some support for collaboration, empirical evidence suggests that
computer systems providing access to shared information, at any time and place and
using minimal technical infrastructure, are the main requirement of groups collaborating
in decentralised working environments. By offering an extensible centralised architecture
and cross-platform browser implementations, increasingly deployed and integrated with
user environments, the Web may provide a means of introducing CSCW systems which
offer much richer support for collaboration than email and FTP, and thus serve as an
'enabling technology' for CSCW.
In the following section I discuss the need for such enabling technologies for CSCW to
address problems of system development and deployment. I then give an overview of the
Web architecture and components and critically examine these in the context of CSCW
systems development. I suggest that the Web is limited in the range of CSCW systems
that can be developed on the basic architecture and, in its current form, is most suited for
asynchronous, centralised CSCW applications with no strong requirements for
notification, disconnected working and rich user interfaces. I describe benefits of the Web as a platform for deploying such applications in real work domains, and conclude with a discussion of some current developments that may ease the limitations of the Web as a platform for system development and increase its utility as an enabling technology for CSCW.
What is CSCW?
Computer Supported Cooperative Work, or CSCW, is a rapidly growing multidisciplinary field. As personal workstations get more powerful and as networks get faster
and wider, the stage seems to be set for using computers not only to help accomplish our
everyday, personal tasks but also to help us communicate and work with others. Indeed,
group activities occupy a large amount of our time: meetings, telephone calls, mail
(electronic or not), but also informal encounters in corridors, coordination with
secretaries, team workers or managers, etc. In fact, so much of our work is group work that it is surprising to see how poorly computer systems support group activities. For example, many documents (such as this research work) are created by multiple authors, yet no
commercial tool currently allows a group of authors to create such shared documents as
easily as one can create a single-author document. We have all experienced the
nightmares of multiple copies being edited in parallel, format conversion, mail and file
transfers, etc.
CSCW is a research area that examines issues relating to the design of computer systems
to support people working together. This seemingly all-encompassing definition is in
part a reaction to what has been seen as a set of implicit design assumptions in many
computer applications - that they are intended to support users in doing their work on their own. In cases where a scarce resource (such as early computers themselves, or a database, or even a digital library) has to be shared, system designers have minimised the effects of this shared activity and tried to create the illusion of the (presumed ideal)
case of exclusive access to resources. We see the same assumptions in discussion of
digital libraries as a way of offering access to resources without the need to compete with
(or even be aware of the existence of) other library users.
By contrast, CSCW acknowledges that people work together as a way of managing
complex tasks. Despite the wilder claims of Artificial Intelligence, not all these tasks can
be automated. Thus it is sensible to design systems that allow people to collaborate more
effectively. This can also open up opportunities for collaboration that have previously
been impossible, overly complex or too expensive; such as working not merely with
colleagues in the same office, but via video and audio links with colleagues in a different
building or on a different continent. CSCW has a strong interdisciplinary tradition, drawing researchers from computer science, sociology, management, psychology and communication. Although the bulk of this article is about how CSCW might be used in libraries, it is also my contention that CSCW should be informed by work in library and information science.
The world of CSCW is often described in terms of the time and space in which a
collaborative activity occurs. Collaboration can be between people in the same place (co-located) or different places (remote). Collaboration can be at the same time (synchronous) or separated in time (asynchronous). Figure 27 illustrates the possibilities.
Figure 27 - The CSCW spatial and temporal quadrants
Examples from the various quadrants are:
 same time, same place: meeting support tools.
 same time, different place: video conferencing.
 different time, same place: a design team's shared room containing specialist
equipment.
 different time, different place: email systems.
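The four quadrants above can be captured in a small lookup table; this is a trivial sketch of the time/space matrix, with the tool examples taken from the list above.

```python
def cscw_quadrant(same_time, same_place):
    """Classify a collaboration setting into the CSCW time/space matrix
    and give a representative class of tool for that quadrant."""
    quadrants = {
        (True,  True):  ("synchronous co-located",  "meeting support tools"),
        (True,  False): ("synchronous remote",      "video conferencing"),
        (False, True):  ("asynchronous co-located", "design team's shared room"),
        (False, False): ("asynchronous remote",     "email systems"),
    }
    return quadrants[(same_time, same_place)]

print(cscw_quadrant(True, False))
# ('synchronous remote', 'video conferencing')
```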
CSCW radically changes the status of the computer. Until now, the computer has been
used as a tool to solve problems. With CSCW, the computer/network is a medium: a
means to communicate with other human beings, a vector for information rather than a
box that stores and crunches data. If we look at the history of technology, new media
have been much more difficult to invent, create and operate than new tools. From this
perspective, it is not surprising that CSCW has not yet realized its full potential, even in
the research community. I hope this report will help readers to better understand the
challenges and promises of CSCW and encourage new developments both in research
and in industry.
CSCW is not recent. Back in the late 1960s, Doug Engelbart created the NLS/Augment system that featured most of the functions that today's systems are trying to implement, such as real-time shared editing of outlines, shared annotations of documents, and video conferencing. The field really emerged in the 1980s and has been growing since then, boosted in recent years by the explosion of the Internet and the World Wide Web.
The Web itself is not a very collaborative system: pages can be easily published but it is
impossible (or very difficult) to share them, e.g. to know when someone is reading a
particular page or when a page has been modified. The range and complexity of the problems to be solved in supporting cooperative activities is overwhelming: data sharing, concurrency control, conflict management, access control, performance, reliability; the list goes on.
In addition to these technical difficulties, there is another, maybe harder, problem in
implementing groupware: people. For a medium to work, there must be an audience that
accepts using it. Usability research has stressed the need to take users into account when designing, developing and evaluating interactive software. For groupware,
usability issues go beyond the now well-understood (if not always well-applied) methods
from psychology and design. They involve social sciences to understand how people
work together, how an organization imposes and/or adapts to the work practices of its
workers, etc. In many CSCW projects, ethnographic studies have been conducted to
better understand the nature of the problem and the possible solutions. A large body of
the research work in CSCW is conducted by social scientists, often within
multidisciplinary teams. Computer scientists often ignore or look down upon this aspect
of CSCW and almost always misunderstand it. User-centered design is essential to ensure
that computer scientists solve the right problems in the right way. Traditional software works as soon as it "does the job"; interactive software works better if it is easy to use
rather than if it has more functions; groupware works only if it is compatible with the
work practices of its users.
A large part of this section is devoted to the exploration of these problems and the state
of the art of their solutions. In fact, CSCW is challenging most of the assumptions that
were explicitly or implicitly embodied in the design of our current computer systems.
CSCW tools, or groupware, are by nature distributed and interactive. To succeed in the
marketplace, they must be safe (authentication), interoperable (from network protocols to
operating systems and GUI platforms), fault-tolerant and robust (you don't want to be
slowed down or lose your data if another participant in the session uses a slow
connection or experiences a crash).
The need for enabling technologies for CSCW
Most of the CSCW systems developed to date have been constructed
in laboratories as research prototypes. This is perhaps not surprising, as CSCW systems
place novel requirements on underlying technology such as distributed systems and
databases, and many of the mechanisms developed to support multi-user interaction do
not address issues of cooperation such as activity awareness and coordination. This has
focused much attention on the development of mechanisms to support floor
management, user interface `coupling', update propagation and so on, and has resulted in
a range of experimental systems tailored to the particular issues being investigated. The
proprietary and incompatible architectures on which many are based, the esoteric
hardware and software required and the lack of integration with existing application
programs and data formats inhibits deployment outside the laboratory and within the
intended application domain.
It might be argued that this situation is not unduly problematic; issues of system
deployment are `implementation concerns' and would be addressed by re-implementation
of system prototypes. The lack of system deployment does however pose a serious
question to CSCW: if systems built to investigate particular models or mechanisms are
never deployed and evaluated in use, how can we determine the effectiveness of these
models and mechanisms in supporting cooperative work? A central concern of CSCW is
the need for systems which are sensitive to their contexts of use, and a body of empirical
data exists to show the problems caused when systems are introduced which do not
resonate with existing work practice. When systems do not leave the research laboratory
it is difficult to see how the models and mechanisms they propose can be assessed other
than from a technical perspective.
Recent calls for CSCW systems to be designed so they can be evaluated in use and for a
more situated approach to system evaluation reflect this need to migrate CSCW systems
out of the laboratory and into the field if we are to eventually provide more effective
systems. This migration is far from trivial, as the diversity of machines, operating
systems and application software, which characterises the real work domain, is often far
removed from the homogeneity of the laboratory. This is particularly true for working
groups, which cross departmental or organisational boundaries, where issues of
integration and interoperability mean it is extremely unlikely that systems developed as
research prototypes can be directly deployed. Adaptation or re-implementation of system
prototypes for deployment outside the laboratory is usually beyond the resources of most
research projects, suggesting that the issue of system deployment and the attendant
problems should not be tackled at the end of the prototype development, but should be a
central focus of the system design.
Developing CSCW systems that integrate smoothly with the systems, applications and
data formats already in place in the work domain adds considerably to what is already a
complex design task. A number of researchers have pointed to the need for tools to assist
with the development of CSCW systems, removing some of the complexity of user
interface, application and distributed systems programming which developers currently
face. Such `enabling technologies' would ease problems of system development and
allow a more evolutionary approach--an approach otherwise prohibited by the
investment necessary to create system prototypes and the need to commit to policy
decisions at an early stage in a system's design. Work in CSCW is already addressing
these issues through the development of toolkits or application frameworks with
components which can be instantiated and combined to create groupware systems.
Toolkits such as GroupKit are by now relatively mature, and seem to reduce the
complexity of CSCW system development in much the same way that user interface
toolkits allow rapid development of single-user interfaces.
 As I have shown, the desire for enabling technologies for CSCW lies not only in
easing problems of prototype construction but also in facilitating deployment, and
thereby evaluation, of system prototypes in real work domains. As yet, most
CSCW toolkits focus primarily on system development; issues of cross-platform
deployment, integration with existing applications and so on are secondary. In
this regard more than any other, the World Wide Web seems to offer potential as
an enabling technology for CSCW:
 Web client programs (browsers) are available for all popular computing
platforms and operating systems, providing access to information in a platform
independent manner,
 Browsers offer a simple user interface and consistent information presentation
across these platforms, and are themselves extensible through association of
external `helper applications',
 Browsers are already part of the computing environment in an increasing number
of organisations, requiring no additional installation or maintenance of software
for users to cooperate using the Web,
 Many organisations have also installed their own Web servers as part of an
Internet presence or a corporate Intranet and have familiarity with server
maintenance and, in many cases, server extension through programming the
server API.
As a basis for deployment of CSCW applications in real work domains, the level of
acceptance and penetration of Web technology in commercial and academic
environments is grounds alone for suggesting that CSCW should pay serious attention to
the World Wide Web.
Supporting Collaboration within Widely-distributed Work-groups
Most shared workspace systems were conceived as a means of supporting the work of
widely-dispersed work-groups, particularly those involved in large research and
development projects. Members of such projects may come from a number of
organisations, in different countries, yet need to share and exchange information
and often collaborate over its production. Geographical distribution prohibits
frequent face-to-face meetings, and such groups would clearly benefit from computer
support for the collaborative aspects of their work. Unfortunately, the lack of common computing
infrastructure within the group often prohibits deployment of such technology and causes
serious problems for system developers who must pay close attention to issues of
heterogeneous machines, networks, and application software.
As a consequence of these problems, despite over 10 years of research in the field of
CSCW (Computer Supported Cooperative Work), email and ftp remain the state-of-the-art
in supporting collaboration within widely-distributed work-groups. Although such
tools facilitate information exchange, they provide little support for information sharing,
whereby details of users' changes, annotations and so on are made visible and available
to all other participants. A conclusion drawn by many is that for more powerful CSCW
technologies to flourish, a common infrastructure that addresses problems of integration
is required, allowing developers to focus on application details rather than complexities
of different system configurations. The W3 is the first real example of such a common
infrastructure, offering huge potential to CSCW system developers, through:
 platform, network and operating system transparency,
 integration with end-user environments and application programs,
 a simple and consistent user interface across platforms,
 an application programmer interface for 'bolt-on' functionality, and
 ease of deployment facilitating rapid system prototyping.
Given this potential, it is unsurprising that a number of W3-based collaboration systems
have been developed. We can classify these systems in four broad categories, based on
the extent to which they depart from existing W3 standards:
Purely W3-based: Such systems use standard W3 clients, comply with HTML and HTTP
standards, and only extend server functionality using the CGI interface. Any additional
client functionality is provided by helper applications (we do not include client APIs
here, such as CCI, as they are not standardised across clients and platforms). An example
of such a purely W3-based system is reported in.
Customised servers: As category 1, but requiring special-purpose servers to provide
behaviour beyond the possibilities offered by CGI. Such systems still support standard W3 clients
and protocols, but the enhancements may reduce the portability of the server itself.
InterNotes is an example of such a customised server.
Customised clients: As category 1 (and sometimes 2), but requiring particular or modified clients
(often to support non-standard HTML tags), or non-standard client APIs, and could not
necessarily be used with different platforms or different clients. These systems do
however support the HTTP protocol. The Sesame client for Ubique's Virtual Places
system is an example.
Web-related: Such systems may provide a W3 interface, but support only limited
interaction using the HTTP protocol. The Worlds system is an example of this category.
In this classification, the degree of W3 compliance decreases from 1 to 4; one might say
that a system in Category 1 inherits all the benefits of the W3 listed above, while a
system in Category 4 gives the developer a free hand in the choice of protocols, interface
toolkits and so on, but few of the benefits. A major goal of this work was to produce a
useful and usable system--one that could be deployed in the target domain and refined on
the basis of actual usage feedback. This led to the following design goals:
 No modification to the HTTP protocol
 No modifications to HTML, or client customisation other than through Helper
applications
 All server customisation to be performed through the CGI interface
The following section describes the current version of the system we developed, and we
then return to these three design goals to discuss the system's implementation.
The Web as enabling technology for CSCW
``[The Web] was developed to be a pool of human knowledge, which would allow
collaborators in remote sites to share their ideas and all aspects of a common project''
(Berners-Lee et al. 1994, page 76). From its inception the Web was intended as a tool to
support a richer, more active form of information sharing than is currently the case. Early
implementations at CERN allowed the browsing of pages as is common today, but also
supported annotation of these pages and addition of links between arbitrary pages, not
just from pages on local servers the user can access and edit. Some of these concepts
were carried through to early drafts of the standards for Web protocols and architecture
which described features such as remote publishing of hypertext pages and check in/out
support for locking documents to ensure consistency in a multi-author environment. To
date these aspects have largely been sidelined while development of Web browsers,
servers and protocols has focused on more `passive' aspects of information browsing. In
this section I examine the Web as it currently exists as a platform for developing and
deploying CSCW technologies, following a brief overview of the components on which
it is based.
Developing Web-based CSCW applications
Despite the lack of direct support for collaboration, the current Web architecture does
hide some of the complexity of deploying applications in a distributed, heterogeneous
environment. The most common method of doing this is by extending a Web server
through the CGI with new application functionality or `glue' code to an existing
application, presenting the application user interface as a series of HTML pages, which
can be displayed by standard Web browsers. With this approach developers can take
advantage of the existing base of browsers as client programs for their applications but
must accept the constraints of the basic Web architecture and protocols as currently
implemented and the limitations of existing Web browsers. These constraints are severe,
inhibiting the development and deployment of CSCW applications in a number of areas:
Communication: There is no support for server-server, (server-initiated) server-client or
client-client communication, which is problematic for applications where the server needs to play
an active role (e.g. to notify users of changes to information or maintain information
consistency over several servers). One consequence is that applications are now in
common use which poll Web servers periodically to check if pages have been updated,
allowing users to monitor Web sites of interest (e.g. Netscape's SmartMarks). Users can
specify a very small time interval between checks, even for pages which change rarely,
leading to huge amounts of unnecessary traffic on the Internet and `hits' on Web servers.
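Some of this waste could be avoided with HTTP's conditional requests: a client that remembers a page's Last-Modified stamp can ask for a full transfer only if the page has actually changed, and receive a cheap 304 reply otherwise. The following Python sketch captures the logic only; the function names are my own, and this is not how any particular monitoring tool mentioned above is implemented:

```python
def build_poll_headers(last_modified=None):
    """Request headers for a conditional GET: with If-Modified-Since set,
    an unchanged page costs only a small 304 reply, not a full transfer."""
    return {"If-Modified-Since": last_modified} if last_modified else {}

def interpret_poll(status, previous_stamp, new_stamp=None):
    """Map an HTTP status code to (page_changed, stamp_to_remember)."""
    if status == 304:          # Not Modified: no page body was transferred
        return False, previous_stamp
    if status == 200:          # full response: treat the page as changed
        return True, new_stamp or previous_stamp
    raise ValueError("unexpected status %d" % status)
```

A well-behaved watcher would also honour a sensible minimum polling interval; the conditional request only reduces the cost of each poll, not their number.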
Pure centralised architecture: The architecture provides no support for distribution of
information or computation between clients and servers or replication across servers.
Expensive, powerful and fault-tolerant machines are required to run a Web server if it is
to scale to a large number of users. Even simple computations are not performed by the
client, for example to check if a user has filled in all the fields of a form, resulting in
unnecessary network traffic, server loading and slow feedback times for the user. The
lack of support for replication means that disconnected working is not possible.
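To make the cost concrete: the check mentioned above, that a user has filled in every field of a form, is a few lines of code, yet in the basic Web architecture it still triggers a full round-trip to the server. A hypothetical server-side version in Python (field names invented for the example):

```python
def missing_fields(form, required):
    """Return the names of required form fields left blank or omitted."""
    return [name for name in required if not form.get(name, "").strip()]

# Example: one network round-trip just to discover an empty field.
submitted = {"name": "Ada", "email": ""}
errors = missing_fields(submitted, ["name", "email"])
```

Moving exactly this sort of trivial check to the client is the motivation for the scripting tools discussed later in this section.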
No guaranteed `Quality of Service': The HTTP protocol does not support the
specification of guaranteed transmission rates between servers and clients. Data transfer
is often `bursty', subject to network and server loading which might vary considerably
during a single transmission. This is unsuitable for transmission of (real-time) continuous
media like audio and video, and alternative protocols such as RTP, the Real-time
Transport Protocol, have been proposed for these media types.
User interface design: HTML is not a user interface design toolkit, and although mark-up tags are provided for simple form-filling widgets like input fields, these do not support
features now common in desktop user interfaces such as drag and drop, multiple
selection and semantic feedback. Although some browser vendors have introduced new
tags to provide features like multiple, independent screen areas (Netscape Frames), they
do little to broaden the possibilities for user interface design (and are not supported by all
browsers). A fundamental problem here is the lack of server-client notification (see
above); it is easy for the interface to become inconsistent with the information on the
central server and is only updated when the user reloads the entire page.
Some of these limitations are not so much problems with Web components like HTTP
and HTML but more with the current implementations of browsers and servers. For
example there is no reason why a server could not act as a client and vice versa to allow
a form of update propagation and notification. (In fact some servers can send requests as
well as handle them, often to act as a `proxy' for routing requests through a firewall.)
These limitations do however restrict the kinds of CSCW systems which can be
developed as extensions of the Web using the CGI, and suggest that the Web in its
current form is largely unsuitable for developing systems which require highly-interactive
user interfaces, rapid feedback and `feedthrough' (user interface updates in
response to others' interactions), or a high degree of synchronous notification.
Of course, extending a Web server through the CGI programming interface is not the
only method of deploying a CSCW system on the Web, and more radical approaches can
remove some of the constraints of the basic architecture. Based on the extent to which a
developer must modify the basic Web components, we can identify the following
approaches:
Extending through CGI: As described above, where no modifications are required to
protocols, browsers or servers. Any additional client functionality is provided through
`helper' applications. The BSCW system described in the next section is an example of
such a system.
Customising/building a server: Building a special-purpose Web server may be
necessary to achieve adequate performance or security, or to introduce new functionality
such as server-initiated notification. This approach requires deployment of the server
software and any other application code, but is sometimes a better method of enabling
access to existing applications from the Web in a more flexible and secure manner than
CGI. The BASIS WEBserver, which enables Web access to the BASISplus document
management system, is a good example of this.
Customising/building a client: Building a special-purpose client allows applications
other than Web browsers to communicate with Web servers using HTTP, such as the
`coordinator' clients developed for the WebFlow distributed workflow system (Grasso et
al. 1997). Customising a client may also be necessary to interpret non-standard HTML
tags such as those proposed by Vitali and Durand (1995) for version control to support
collaborative editing of HTML documents. Custom clients can be used in conjunction
with custom servers to provide additional services; as part of their Virtual Places system,
the Ubique client can interact with the Virtual Places server to provide synchronous
communication and a form of `presence awareness'.
Providing a Web interface: Some systems such as Worlds (Fitzpatrick et al. 1995)
provide a Web interface but are not designed specifically for deployment on the Web.
These applications use other means of providing the user interface, managing data and
event distribution and so on, and only limited interaction is possible using a Web
browser and HTTP.
Using this classification the flexibility for the developer increases from 1 to 4, and many
of the problems identified above can be solved by specialising or replacing components
such as clients and servers to provide richer mechanisms for the user interface, update
propagation and so on. Of course, this level of flexibility is bought at the price of the
innovation required from developers to build or integrate these components, and it
should be obvious that very soon we may find ourselves back at square one: with a
system which cannot be deployed outside the laboratory due to particular hardware and
software requirements and a lack of integration with existing user environments. In this
case, if our goal is eventual deployment and evaluation in real work domains, there
seems little point in using the Web as a platform for CSCW system development.
Despite these problems, however, I would strongly argue that the Web is an `enabling
technology for CSCW'. The limitations identified above mean the Web is more suited to
asynchronous, centralised applications with no strong requirements for synchronous
notification, disconnected working and rich user interfaces. The advantages however--an
accepted technology, integrated with existing user environments and extensible through
the server API without requiring additional client software on users' machines--indicate
that here we have a method of deploying and evaluating basic mechanisms to support
collaboration in real work domains. Further, the rapid pace of development in Web
technologies suggests that many proprietary and experimental features, which address
some of the current limitations, could become standards in the future. Of course much
depends on the willingness of the main browser vendors (currently Netscape and
Microsoft) to agree on and implement these features, but this does not seem to have been
a problem to date. As Web technology matures some of the current problems with
CSCW development on the Web should be solved.
Experiences and perspectives of the Web as enabling technology for CSCW
In this section I am concerned primarily with the Web as a potential enabling technology
for CSCW systems, rather than possibilities for enhancing the Web itself with
mechanisms to make it more `collaborative'. I therefore focus my discussion on the role
of the Web as a vehicle for developing and deploying CSCW systems, instead of a target
of CSCW research in its own right, and thus orient more to the utility of current and
future Web standards for CSCW systems rather than possible modifications to these
standards as informed by CSCW research. This last however is clearly an area where
CSCW and the Web have much to say to each other; for example, the phenomenon that
the Web is currently a `lonely place' is an argument put forward by Lea et al. (1997) for
their work on Virtual Societies, and the goal of adding group `awareness' mechanisms to
augment the Web is receiving increasing attention from the CSCW community (see for
example Greenberg and Roseman 1996, Palfreyman and Rodden 1996). The topic of
awareness is only one of several issues which might be included in a research agenda for
CSCW with respect to augmenting the basic Web architecture, protocols and
technologies.
I have taken the position that for CSCW the role of an enabling technology is twofold,
easing problems of both development and deployment of CSCW systems in real-world
domains, and that deployment is best achieved when systems integrate smoothly with
existing Web technologies. I now discuss the possibilities and problems of developing
CSCW systems on the Web, before reviewing recent developments which might address
these problems and broaden the range of CSCW systems that can be supported.
Experiences of developing Web-based CSCW systems
The current standards for Web components like HTML and HTTP reflect the emphasis
to date on the Web as a tool for information browsing. This allows information providers
to design and retain control of the form and content of their information and `publish' it
via a Web server. Consumers can then access the information by sending requests via
their Web browsers. The CGI server programming interface allows extension of the Web
within this `provider-consumer' framework, so that servers can generate responses on-the-fly as well as serve static Web pages stored in files on the server file system.
Our experiences with collaborative systems suggest that, as a tool for application
development, it is straightforward to extend the Web with application functionality or
interface to an existing application. The method of passing request details through the
CGI programming interface is simple and allows developers to write extension programs
in most programming languages, with no need to link extension code with the server.
Extension programs must generate and return HTML, which again is straightforward. In
combination with a high-level, interpreted programming language such as Python, this
arrangement allows extremely rapid prototyping and testing using a standard Web client
and server.
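As a rough illustration of the approach just described, the following hypothetical CGI program in Python reads the request details the server passes in the environment and writes a generated HTML page to standard output. The parameter name `user` and the page content are invented for the example, not taken from any system discussed here:

```python
import os
from urllib.parse import parse_qs

def render(query_string):
    """Build a complete CGI response: header line, blank line, HTML body."""
    params = parse_qs(query_string)
    user = params.get("user", ["anonymous"])[0]
    body = f"<html><body><h1>Hello, {user}</h1></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The server sets QUERY_STRING (among other variables) before
    # invoking the script; the response goes to standard output.
    print(render(os.environ.get("QUERY_STRING", "")))
```

Because the extension runs as a separate process and communicates only through the environment and standard output, it can be written in any language and needs no linking against the server, which is exactly what makes this style of prototyping so quick.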
The CGI approach does however inherit all the problems of the request-response model
of the Web. One of these is the feedback delay caused by the round-trip to the server to
service every user interaction. When requesting documents or even HTML pages this
delay may be acceptable, but for simple requests, especially those that change only the
state of the interface, this delay is a problem. For example, with the BSCW system it is
possible for users to fold in/out the object action and description lines using the `A' and
`D' buttons, and with the adjacent checkbox buttons select all/none of the objects in a
folder listing. Using these features requires a request to the server to generate a modified
HTML page, and when interacting via the Internet (as do most of the users of our public
server) network delays represent a much larger component of the total time to service the
request than processing time. In designing a user interface for a Web-based application,
developers must take care to reduce the number of required trips to the server, possibly
by allowing the user to `batch' requests at the client (using multiple HTML forms for
example).
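One way to batch requests, as suggested above, is to present a whole folder listing as a single HTML form, so that any number of selections travels to the server in one request. A sketch in Python; the field names and action URL are invented, not taken from BSCW:

```python
def folder_form(objects, action="/hypothetical/handler"):
    """Render a folder listing as ONE form: however many objects the
    user ticks, the selections cost a single request, not one each."""
    rows = "\n".join(
        f'<input type="checkbox" name="sel" value="{name}"> {name}<br>'
        for name in objects
    )
    return (f'<form method="POST" action="{action}">\n{rows}\n'
            '<input type="submit" value="Apply">\n</form>')
```

The trade-off is coarser feedback: the user sees the effect of all selections only after the single submit, rather than after each interaction.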
At the server side the simplicity of the CGI approach can also be problematic. The
execution of extension programs in separate processes which are passed details of the
request may allow rapid development, but gives the developer no chance to modify
server behaviour or request information which is not passed explicitly through the CGI.
Where the default behaviour is adequate, as is the case for the user authentication
features used directly by BSCW for example, there are no problems. Where features are
inadequate for an application's needs the developer cannot modify these but must either
re-implement them using the CGI or build a custom HTTP server (Trevor et al. 1996).
The Web is best suited as a development platform for applications which do not need to
step outside the information provider-consumer model, currently enshrined in existing
standards and browser and server implementations. When this is required, it is often
necessary to provide additional components at the server or client (in the form of helper
applications). The latter removes one of the main advantages of the Web, which is the
ability to deploy systems without requiring development of client programs that run
across platforms or installation of additional software by users. For BSCW, the need to
upload documents to the server has required considerable effort to produce versions of
the (very simple) helper, which operate on PC, Macintosh and Unix machines. For this
and other aspects such as synchronous notification, information replication and so on the
basic Web standards and components offer no support and developers must provide their
own solutions.
Much of the work in the Web standards community is focusing on refinement of
protocols, client and server architectures to improve the speed and reliability with which
requests can be handled, and not on providing more flexible and powerful components
for application development. This emphasis is not surprising; the growth of the Web has
been so rapid that aspects of the HTTP protocol in particular must urgently be redesigned to ensure the Web architecture can continue to scale to millions of users
worldwide. However, this growth has also led to demand from users and third-party
vendors for extensions to Web components to allow richer support for different media
types, user interfaces and so on. To meet this demand, server and browser vendors have
proposed a number of mechanisms and built support for these in their products.
There is some evidence that this practice is key in the continuing development of the
Web. An example of this is the support for HTML page editing and remote publishing,
identified as an area requiring support by a number of vendors including Netscape (with
the Navigator Gold browser), Microsoft (with FrontPage) and GNN (with GNNPress).
Although the solutions offered are currently incompatible, all have a need for uploading
documents to a Web server and this has prompted efforts to agree a standard method for
doing this. The World Wide Web Consortium (W3C) has recently established a working
group on "Distributed Authoring and Versioning on the World Wide Web" to examine
requirements and work towards specifications to support this activity.
Similarly, the need for richer Web pages has led to tools like Java and JavaScript being
supported by the major browser vendors and becoming de-facto standards. Where
relevant, these de-facto standards have also filtered into the documented standards
process, as is the case with some of the proprietary extensions to HTML now part of the
latest proposed standard, HTML 3.2.
Broadening the possibilities for CSCW
As Internet technologies continue to penetrate and impact upon marketing, finance,
publishing, organisational IT and so on, the demand for extension and innovation will
increase. The growth of the corporate Intranet, for example, raises requirements of
information replication, workflow services and the like, while commerce applications
require much higher levels of security and privacy than are currently supported.
Although many vendors will seek to provide proprietary solutions to these requirements,
and thus lock corporate customers into particular technological solutions, it is also clear
that technologies are emerging which have the potential to broaden the possibilities for
third-parties to customise Web components and develop new extensions in a more
generic fashion.
The problems of the CGI approach to extending an existing Web server are well known,
and vendors of Web server technology are seeking to provide more flexible solutions for
developers. For example, in designing the API for the Apache server (currently the most
well-deployed Web server on the Internet), the developers sought to allow ``third-party
developers to easily change aspects of the server functionality which you can't easily
access from CGI'' (Thau 1996, page 1113). Similar developments in browser
programming interfaces such as Netscape's `Plug-in' development kit and Microsoft's
`ActiveX' environment are intended to extend the capabilities of standard Web browsers
to handle new media types directly, embed Web browsers in other applications and more.
Such advances in client and server programming interfaces allow development of much
richer CSCW systems, better integrated with desktop environments than is possible with
basic Web components. In the main however these developments are specialised to
particular browsers or servers or operate only on particular platforms, and do not offer
the same advantages as the basic components for cross-platform deployment of CSCW
systems. Although some vendors have announced support for others' programming
interfaces, it remains to be seen how this will work in practice as they (particularly
browser vendors) seek to differentiate their products on the basis of the richness of
features supported.
An approach which is independent of particular client and server programming
interfaces seems to offer more potential in this regard. One area receiving much
attention is that of `mobile code' where, in addition to data in HTML or other formats, a
browser might download small application programs or `applets' which are executed on
the local machine, taking input and displaying output via the Web browser. This should
remove many of the constraints on the application developer: applets can be designed
which provide much richer user interfaces than are possible with HTML; computation
can be moved to the client, for example to check for valid input data and thus reduce
network traffic, server loading and feedback lags; applets supporting special protocols
can be developed which handle different media types and so on.
Although there are many problems to be overcome, most notably security concerns when
code downloaded over the Internet is executed on the user's machine, significant progress
has been made in this area. The latest Web browsers from Netscape, Microsoft and IBM
now provide support for applets written in Sun's Java programming language. For
tasks requiring less power than a full programming language, scripting tools like
Netscape's JavaScript follow similar principles but are less ambitious, allowing HTML
pages to be extended with code fragments to pass responsibility for simple computations
from the server to the client.
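The kind of simple client-side computation described above can be sketched as a small JavaScript fragment. The field names and validation rules here are purely illustrative assumptions, not taken from any real system:

```javascript
// Validate form input on the client before anything is sent to the server.
// The fields ("email", "quantity") and rules are hypothetical examples.
function validateOrder(order) {
  const errors = [];
  if (!order.email || !/^[^@\s]+@[^@\s]+$/.test(order.email)) {
    errors.push("email: not a valid address");
  }
  if (!Number.isInteger(order.quantity) || order.quantity < 1) {
    errors.push("quantity: must be a positive whole number");
  }
  // Only contact the server when the local checks pass, saving a network
  // round trip (and its feedback lag) for obviously bad input.
  return { ok: errors.length === 0, errors };
}
```

A page would call `validateOrder` before submitting the form, so that only valid data triggers a request to the server; this is exactly the reduction in network traffic, server loading and feedback lag argued for above.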
I see these developments as broadening the role of the Web as an enabling technology for CSCW, increasing the range of CSCW systems which can be developed while not
compromising the benefits of cross-platform system deployment. In the BSCW project
they are making use of both Java and JavaScript to overcome problems with the basic
Web components and provide richer collaboration services to users of the BSCW system.
With JavaScript they have augmented the HTML user interface of the BSCW system to
remove the need to send requests to the server for changes in user interface state,
including the folding of actions and descriptions and the select all/none behaviour
discussed above. With Java they are being more ambitious, designing applets which
provide synchronous collaboration services such as event notification, presence
awareness, simple text chat and more, which can be presented in standard Web browsers
alongside the existing BSCW HTML user interface. An early prototype of this work is
discussed in (Bentley et al. 1995).
Collaborative Computing – Some Concrete Examples
Collaborative computing allows users to work together on documents and projects,
usually in real time, by taking advantage of underlying network communication systems.
Whole new categories of software have been developed for collaborative computing, and
many existing applications now include features that let people work together over
networks. Here are some examples:
 Application suites such as Microsoft Office and Exchange, Lotus Notes, and Novell GroupWise that provide messaging, scheduling, document co-authoring, rules-based message management, workflow routing, and discussion groups.
 Videoconferencing applications that allow users to collaborate over local
networks, private WANs, or over the Internet. See “Videoconferencing and Desktop
Video” for more information.
 Internet collaboration tools that provide virtual meetings, group discussions, chat
rooms, whiteboards, document exchange, workflow routing, and many other features.
Multicasting is an enabling technology for groupware and collaborative work on the
Internet that reduces bandwidth requirements. A single packet can be addressed to a
group, rather than having to send a packet to each member of the group. See
“Multicasting” for more details.
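The scale of the bandwidth saving is easy to make concrete: with unicast the sender transmits one copy of each packet per recipient, while with multicast it transmits one copy regardless of group size. A toy calculation (the function names and numbers are illustrative only, not from any product):

```javascript
// Packets the sender must transmit for a stream of `packets` packets
// delivered to `members` group members.
function unicastLoad(packets, members) {
  return packets * members; // one separate copy per recipient
}
function multicastLoad(packets) {
  return packets; // one copy, addressed to the group
}
```

For a 1,000-packet stream to a 50-member group, the sender's cost drops from 50,000 transmissions to 1,000; the multicast-enabled routers replicate the packet only where delivery paths diverge.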
Good examples of collaborative applications designed for Internet use are Microsoft’s
NetMeeting and NetShow. NetMeeting allows intranet and Internet users to collaborate
with applications over the Internet, while NetShow lets users set up audio and graphic
(nonvideo) conferences. These products are described below as examples of the type of
collaborative applications available in the intranet/Internet environment. More
information about the products is available at http://www.microsoft.com.
NetMeeting
NetMeeting uses Internet phone voice communications and conferencing standards to
provide multiuser applications and data sharing over intranets or the Internet. Two or
more users can work together and collaborate in real time using application sharing,
whiteboard, and chat functionality. NetMeeting is included in Microsoft’s Internet
Explorer.
NetMeeting can be used for common collaborative activities such as virtual meetings. It
can also be used for customer service applications, telecommuting, distance learning, and
technical support. The product is based on ITU (International Telecommunication
Union) standards, so it is compatible with other products based on the same standards.
Some of NetMeeting’s built-in features are listed here.
INTERNET PHONE
Provides point-to-point audio conferencing over the Internet. A sound card with attached
microphone and speaker is required.
ULS (USER LOCATION SERVICE) DIRECTORY
Locates users who are currently running NetMeeting so you can participate in a
conference. Internet service providers can implement their own ULS server to establish a
community of NetMeeting users.
MULTIPOINT DATA CONFERENCING
Provides a multipoint link among people who require virtual meetings. Users can share
applications, exchange information through a shared clipboard, transfer files, use a
shared whiteboard, and use text-based chat features.
APPLICATION SHARING
Allows a user to share an application running on his computer with other people in a
conference. Works with most Windows-based programs. As one user works with a
program, other people in the conference see the actions of that user. Users may take turns
editing or controlling the application.
SHARED CLIPBOARD
Allows users to easily exchange information by using familiar cut, copy, and paste
operations.
FILE TRANSFER
Lets you transfer a file to another person by simply choosing a person in the conference
and specifying a file. File transfers occur in the background as the meeting progresses.
WHITEBOARD
Provides a common drawing surface that is shared by all users in a conference. Users can
sketch pictures, draw diagrams, or paste in graphics from other applications and make
changes as necessary for all to see.
CHAT
Provides real-time text-based messaging among members of a conference.
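The multipoint behaviour these features share — anything one member contributes is relayed to every other member of the conference — can be sketched as a tiny in-memory model. The class and method names here are hypothetical, not NetMeeting's actual interfaces:

```javascript
// A toy conference: members join, and any chat message is relayed
// to every member except its sender.
class Conference {
  constructor() {
    this.members = new Map(); // member name -> received messages
  }
  join(name) {
    this.members.set(name, []);
  }
  chat(from, text) {
    for (const [name, inbox] of this.members) {
      if (name !== from) inbox.push(`${from}: ${text}`);
    }
  }
  inbox(name) {
    return this.members.get(name);
  }
}
```

A real conferencing product performs the same fan-out over the network rather than in memory, but the control flow — one sender, delivery to all other conference members — is the same.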
NetShow
NetShow is basically a low-bandwidth alternative to videoconferencing. It provides live multicast audio, file transfer, on-demand streamed audio, illustrated audio, and video.
It is also a development platform on which software developers can create add-on
products. According to Microsoft, NetShow provides “complete information-sharing
solutions, spanning the spectrum from one-to-one, fully interactive meetings to broadly
distributed, one-way, live, or stored presentations.”
NetShow takes advantage of important Internet and network communication
technologies to minimize traffic while providing useful tools for multiuser collaboration.
IP multicasting is used to distribute identical information to many users at the same time.
This avoids the need to send the same information to each user separately and
dramatically reduces network traffic. Routers on the network must be multicast-enabled
to take advantage of these features.
NetShow also uses streaming technology, which allows users to see or hear information
as it arrives, rather than wait for it to be completely transferred.
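The difference between streaming and whole-file transfer is when the first byte becomes useful. A sketch, with an async iterator standing in for data arriving over the network (the chunk source is simulated, not a real streaming protocol):

```javascript
// Simulated network source: yields chunks with a delay, like arriving packets.
async function* networkChunks(chunks, delayMs) {
  for (const c of chunks) {
    await new Promise(resolve => setTimeout(resolve, delayMs));
    yield c;
  }
}

// Streaming consumer: acts on each chunk as soon as it arrives...
async function playStreamed(source, play) {
  for await (const chunk of source) play(chunk);
}

// ...versus a download-then-play consumer, which buffers everything first.
async function playDownloaded(source, play) {
  const all = [];
  for await (const chunk of source) all.push(chunk);
  play(all.join(""));
}
```

With streaming, `play` is called once per chunk as each arrives, so playback begins almost immediately; with the download approach, `play` is called exactly once, only after the full transfer completes.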
Other Products
A number of other companies are working on collaborative products that do many of the
same things as NetMeeting and NetShow. Netscape Conference and SuiteSpot are
similar products. SuiteSpot integrates up to ten collaborative applications into a single
package. Additional information is available at http://www.netscape.com.
Netscape Collabra Server, which is included in the SuiteSpot enterprise suite of applications, lets people work together over intranets or over the Internet. Companies can
create discussion forums and open those forums to partners and customers. Collabra
Server employs the standard NNTP (Network News Transfer Protocol), which
allows discussions to be opened to any NNTP-compliant client on the Internet. In
addition, discussions can be secured and encrypted.
Another interesting product is one called CyberHub from Blaxxun Interactive
(http://www.blaxxun.com). It provides a high-end virtual meeting environment that uses
3-D graphics and VRML (Virtual Reality Modeling Language).