A LOOK INTO THE WORLD OF HUMAN-COMPUTER INTERFACE
PART ONE

Compiled by Omorogbe Harry

TABLE OF CONTENTS

Chapter One
Introduction and Overview
History and Background
Earlier Development and Foundation of the Field
Pioneers
The Need for HCI
Strategic Themes
Basic Interaction
Direct Manipulation of Graphical Objects
Application Types
Current Development
Technological Trends
Up-and-Coming Areas
Visualization and Biological Field

Chapter Two
Concept and Design in HCI
Design and Evaluation Methods
Concepts of User Interface Design
Principles of User Interface Design
Ergonomic Guidelines for User-Interface Design
General Principles to Follow When Designing Any Programme
Human Issues
Importance of HCI

Chapter Three
HCI and the Web: Problems and Promises
Issues in HCI Design in the Web Medium
How Screens Display Colours
Web-Safe Colours
Contributors to HCI

Chapter Four
Gesture Recognition
Augmented Reality
Computer Supported Cooperative Work

CHAPTER ONE
HUMAN AND COMPUTER INTERFACE

Connecting with your computer - Human-computer interaction and Artificial Intelligence

INTRODUCTION AND OVERVIEW

"Computer, this is Captain Janeway. Abort the self-destruct sequence. Authorization code, 89453432..."

"Voice, confirmed. Authorization code, confirmed. Abort the self-destruct sequence... Unable to comply. System malfunction..."

BANG!!!

If you are a trekker, you will undoubtedly recognize the above conversation. Yes, it is from Star Trek, a television series spawned by one of the most popular science fiction franchises of the century. However, if you simply have not heard of Star Trek, do not worry, because we only need to know that the above is a human-computer interaction, one which will hopefully happen in the future (except for the "BANG" part). Actually, a conversation as simple as the above between a human and a computer is far more difficult for today's technology to accomplish than you may have imagined. It involves speech recognition, natural language understanding, artificial intelligence, and natural voice output, all of which are topics in the study of Human-Computer Interaction (HCI).

Simply put, Human-Computer Interaction is an interdisciplinary study of how humans interact with computers, which includes user interface design, human perception and cognitive science, artificial intelligence, and virtual reality. With the explosive growth of raw computing power and accompanying technologies, computers have become essential to everyday life, and because of this, HCI, the science of how humans interact with computers, is attracting more and more attention these days.

Comprehensively, Human-Computer Interaction (HCI) is the study of how people design, implement, and use interactive computer systems, and how computers affect individuals, organizations, and society. This encompasses not only ease of use but also new interaction techniques for supporting user tasks, providing better access to information, and creating more powerful forms of communication. It involves input and output devices and the interaction techniques that use them; how information is presented and requested; how the computer's actions are controlled and monitored; all forms of help, documentation, and training; the tools used to design, build, test, and evaluate user interfaces; and the processes that developers follow when creating interfaces.
HCI is a research area of increasingly central significance to computer science, other scientific and engineering disciplines, and an ever-expanding array of application domains. This more prominent role follows from the widely perceived need to expand the focus of computer science research beyond traditional hardware and software issues to attempt to better understand how technology can more effectively support people in accomplishing their goals.

At the same time that a human-centered approach to system development is of growing significance, several factors conspire to make the design and development of systems even more difficult than in the past. This increased difficulty follows from the disappearance of boundaries between applications as we start to support people's real activities; between machines as we move to distributed computing; between media as we expand systems to include video, sound, graphics, and communication facilities; and between people as we begin to realize the importance of supporting organizations and group activities.

Research in Human-Computer Interaction (HCI) has been spectacularly successful, and has fundamentally changed computing. Just one example is the ubiquitous graphical interface used by Microsoft Windows 95, which is based on the Macintosh, which is based on work at Xerox PARC, which in turn is based on early research at the Stanford Research Laboratory (now SRI) and at the Massachusetts Institute of Technology. Another example is that virtually all software written today employs user interface toolkits and interface builders, concepts which were developed first at universities. Even the spectacular growth of the World-Wide Web is a direct result of HCI research: applying hypertext technology to browsers allows one to traverse a link across the world with a click of the mouse. Interface improvements, more than anything else, have triggered this explosive growth. Furthermore, the research that will lead to the user interfaces for the computers of tomorrow is happening at universities and a few corporate research labs.

This lecture note tries to briefly summarize many of the important research developments in Human-Computer Interaction (HCI) technology. By "research," I mean exploratory work at universities and government and corporate research labs (such as Xerox PARC) that is not directly related to products. By "HCI technology," I am referring to the computer side of HCI. A companion work on the history of the "human side," discussing the contributions from psychology, design, human factors and ergonomics, would also be appropriate.

Figure 1 shows time lines for some of the technologies discussed in this material. Of course, a deeper analysis would reveal much interaction between the university, corporate research, and commercial activity streams. It is important to appreciate that years of research are involved in creating these technologies and making them ready for widespread use. The same will be true for the HCI technologies that will provide the interfaces of tomorrow. It is clearly impossible to list every system and source in a lecture note of this scope, but I have tried to represent the earliest and most influential systems, although there are a number of other surveys of HCI topics.
The technologies covered in this material include fundamental interaction styles like direct manipulation, the mouse pointing device, and windows; several important kinds of application areas, such as drawing, text editing, and spreadsheets; the technologies that will likely have the biggest impact on interfaces of the future, such as gesture recognition, multimedia, computer supported cooperative work, and 3D; and the technologies used to create interfaces using the other technologies, such as user interface management systems, toolkits, and interface builders.

Figure 1: Approximate time lines showing where work was performed on some major technologies discussed in this material.

Contributors to HCI

HCI is a multidisciplinary field. The main contributions come from computer science, cognitive psychology, and ergonomics and human factors. However, other areas of interest include artificial intelligence, (graphic) design, engineering, and even philosophy, sociology, and anthropology.

Fig 2: Diagram of contributors to HCI (computer science, artificial intelligence, cognitive psychology, ergonomics and human factors, engineering, philosophy, design, sociology, and anthropology, all surrounding HCI).

Early Development

What we take for granted today is actually the accomplishment of over 30 years of continuing research in the area. For instance, direct manipulation of graphical objects: the now ubiquitous direct manipulation interface, where visible objects on the screen are directly manipulated with a pointing device, was first demonstrated by Ivan Sutherland in Sketchpad, which was his 1963 MIT PhD thesis. Sketchpad supported the manipulation of objects using a light-pen, including grabbing objects, moving them, changing their size, and using constraints. Following that was William Newman's Reaction Handler, created at Imperial College, London in 1967. Reaction Handler provided direct manipulation of graphics, and introduced "Light Handles," a form of graphical potentiometer, that was probably the first "widget." Another early system was AMBIT/G (implemented at MIT's Lincoln Labs, 1968). It employed iconic representations, gesture recognition, dynamic menus with items selected using a pointing device, selection of icons by pointing, and moded and mode-free styles of interaction.

Many of the interaction techniques popular in direct manipulation interfaces, such as how objects and text are selected, opened, and manipulated, were researched at Xerox PARC in the 1970's. In particular, the idea of "WYSIWYG" (what you see is what you get) originated there with systems such as the Bravo text editor and the Draw drawing program. The first commercial systems to make extensive use of direct manipulation were the Xerox Star (1981), the Apple Lisa (1982) and Macintosh (1984). Today, when most people take for granted the ability to drag an icon or drop a file on their computer, how many realize that these are the fruits of 30 years of global research?

Pioneers

Major technologies emerged in the same period, including text editing, the mouse, windows, gesture recognition and computer-aided design, and in most of those fields researchers have made astonishing progress which we can easily discern today. Among all the facilities working on HCI, there are a few pioneers worth mentioning here. Xerox PARC is one of the most innovative organizations in early HCI research and development.
It is a major contributor to many important interface ideas such as direct manipulation of graphical objects, the mouse, windows, etc. The MIT AI Lab, IBM, and AT&T Bell Labs are also among the organizations most prominent in early HCI development. Because of the collective efforts and contributions of various organizations and individuals, the way humans interact with computers has been revolutionized since the 1960s. And after 30 years of research, more exciting fields are emerging day by day.

The Need for HCI (Prospective)

Although one is encouraged by past research success in HCI and excited by the potential of current research, I want to emphasize how central a strong research effort is to the future practical use of computational and network technologies. For example, popular discussion of the National Information Infrastructure (NII) envisions the development of an information marketplace that can enrich people's economic, social, cultural, and political lives. For such an information marketplace, or, in fact, many other applications, to be successful requires solutions to a series of significant research issues that all revolve around better understanding how to build effective human-centered systems. The following sections discuss selected strategic themes, technology trends, and opportunities to be addressed by HCI research.

Strategic Themes

If one steps back from the details of current HCI research, a number of themes are visible. Although I cannot hope to do justice here to elaborating these or a number of other themes that arose in workshop discussions, it is clear that HCI research has now started to crystallize as a critical discipline, intimately involved in virtually all uses of computer technologies and decisive to successful applications. Here I expand on just a few themes.

Universal Access to Large and Complex Distributed Information: As the "global information infrastructure" expands at unprecedented rates, there are dramatic changes taking place in the kinds of people who access the available information and the types of information involved. Virtually all entities (from large corporations to individuals) are engaged in activities that increasingly involve accessing databases, and their livelihood and/or competitiveness depend heavily on the effectiveness and efficiency of that access. As a result, the potential user community of database and other information systems is becoming startlingly large and rather nontechnical, with most users bound to remain permanent novices with respect to many of the diverse information sources they can access. It is therefore urgently necessary and strategically critical to develop user interfaces that require minimal technical sophistication and expertise from users and support a wide variety of information-intensive tasks. Information-access interfaces must offer great flexibility in how queries are expressed and how data are visualized; they must be able to deal with several new kinds of data, e.g., multimedia, free text, documents, the Web itself; and they must permit several new styles of interaction beyond the typical two-step query-specification/result-visualization loop, e.g., data browsing, filtering, and dynamic and incremental querying.
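To make the last point concrete, dynamic and incremental querying means that every adjustment of a filter control immediately re-runs the query and refreshes the display, so users explore a data set without ever composing a formal query. The following minimal Python sketch illustrates the idea; the record fields, the slider callback, and the render function are invented for illustration and are not drawn from any particular system.

# Sketch of the "dynamic query" idea: each adjustment of a filter control
# re-runs the query and redraws the view at once, replacing the traditional
# specify-then-view loop. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class Record:
    name: str
    year: int
    size_mb: float

DATA = [
    Record("report.doc", 1995, 0.3),
    Record("demo.avi", 1997, 42.0),
    Record("notes.txt", 1993, 0.01),
]

def dynamic_query(data, year_min, year_max, max_size_mb):
    """Incremental filter: cheap enough to re-run on every slider move."""
    return [r for r in data
            if year_min <= r.year <= year_max and r.size_mb <= max_size_mb]

def render(records):
    """Hypothetical redraw of the visualization."""
    print(f"{len(records)} match(es):", ", ".join(r.name for r in records))

def on_slider_change(year_range, max_size):
    """Called for each control adjustment; the display updates immediately."""
    render(dynamic_query(DATA, *year_range, max_size))

on_slider_change((1994, 1998), 50.0)   # user drags the year slider
on_slider_change((1994, 1998), 1.0)    # then tightens the size limit

The essential design point is that the filter is cheap enough to re-run on every control movement, so the display stays continuously in sync with the query.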
Fundamental research is required on visual query languages, user-defined and constraint-based visualizations, visual metaphors, and generic and customizable interfaces, and advances seem most likely to come from collaborations between the HCI and database research communities. Information-discovery interfaces must support a collaboration between humans and computers, e.g., for data mining. Because of our limited memory and cognitive abilities, the growing volume of available information has increasingly forced us to delegate the discovery process to computers, greatly underemphasizing the key role played by humans. Discovery should be viewed as an interactive process in which the system gives users the necessary support to analyze terabytes of data, and users give the system the feedback necessary to better focus its search. Fundamental issues for the future include how best to divide tasks between people and computers, how to create systems that adapt to different kinds of users, and how to support the changing context of tasks. The system could also suggest appropriate discovery techniques depending on data characteristics, as well as data visualizations, and help integrate what are currently different tools into a homogeneous environment.

Education and Life-Long Learning: Computationally assisted access to information has important implications for education and learning, as evidenced in current discussions of "collaboratories" and "virtual universities." Education is a domain that is fundamentally intertwined with human-computer interaction. HCI research includes both the development and the evaluation of new educational technologies such as multimedia systems, interactive simulations, and computer-assisted instructional materials. For example, consider distance learning situations involving individuals far away from schools. What types of learning environments, tools, and media effectively deliver the knowledge and understanding that these individuals seek? Furthermore, what constitutes an effective educational technology? Do particular media or types of simulations foster different types of learning? These questions apply not only to secondary and university students, but also to adults through life-long learning. Virtually every current occupation involves workers who encounter new technologies and require additional training. How can computer-assisted instructional systems engage individuals and help them to learn new ideas? HCI research is crucial to answering these important questions.

Electronic Commerce: Another important theme revolves around the increasing role of computation in our economic life, and it highlights central HCI issues that go beyond usability to concerns with privacy, security, and trust. Although currently there is much hyperbole, as with most Internet technologies, over the next decade commercialization of the Internet may mean that digital commerce replaces much traditional commerce. The Internet makes possible services that could potentially be quite adaptive and responsive to consumer wishes. Digital commerce may require dramatic changes to internal processes as well as the invention of new processes. For digital commerce to be successful, the technology surrounding it will have to be affordable, widely available, simple to use, and secure. Interface issues are, of course, key.

End-User Programming: An important reason that the WWW has been so successful is that everyone can create his or her own pages.
With the advent of WYSIWYG HTML page-editing tools, it will be even easier. However, for "active" pages that use forms, animations, or computation, a professional programmer is required to write the necessary code in a programming language like Perl or Java. The situation is the same for the desktop, where applications are becoming increasingly programmable (e.g., by writing Visual Basic scripts for Microsoft Word), but only to those with training in programming. Applying the principles and methods of HCI to the design of programming languages and programming systems for end-users should bring to everyone the ability to program Web pages and desktop applications.

End-user programming will be increasingly important in the future. No matter how successful interface designers are, systems will still need to be customized to the needs of particular users. Although there will likely be generic structures, for example in an email filtering system, that can be shared, such systems and agents will always need to be tailored to meet personal requirements. The use of various scripting languages to meet such needs is widespread, but better interfaces and understandings of end-user programming are needed.

Information Visualization: This area focuses on graphical mechanisms designed to show the structure of information and improve the cost structure of access. Previous approaches have studied novel visualizations for information, such as the "Information Visualizer," history-enriched digital objects for displaying graphical abstractions of interaction history, and dotplots for visualizing self-similarity in millions of lines of text and code. Other approaches provide novel techniques for displaying data, e.g., dynamic queries, visual query languages, zoomable interfaces for supporting multiscale interfaces, and lenses to provide alternative views of information. Another branch of research is studying the automatic selection of visualizations based on properties of the data and the user's tasks. The importance of information visualization will increase as people gain access to larger and more diverse sources of information (e.g., digital libraries, large databases), which are becoming universally available with the WWW. Visualizing the WWW itself and other communication networks is also an important aim of information visualization systems. The rich variety of information may be handled by giving users the ability to tailor the visualization to a particular application, to the size of the data set, or to the device (e.g., 2D vs. 3D capabilities, large vs. small screens). Research challenges include making the specification, exploration, and evolution of visualizations interactive and accessible to a variety of users. Tools should be designed that support a range of tailoring capabilities: from specifying visualizations from scratch to minor adaptations of existing visualizations. Incorporating automatic generation of information visualization with user-defined approaches is another interesting open problem, for example when the user-defined visualization is underconstrained.

One fundamental issue for information visualization is how to characterize the expressiveness of a visualization and judge its adequacy to represent a data set. For example, the "readability" of a visualization of a graph may depend on (often conflicting) aesthetic criteria, such as the minimization of edge crossings and of the area of the graph, and the maximization of symmetries; a concrete sketch of one such criterion follows below.
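The edge-crossing criterion just mentioned is easy to make precise. The sketch below, under the simplifying assumptions of a straight-line drawing and no collinear endpoints, counts crossings between node-disjoint edges; a layout algorithm could use such a score to compare candidate drawings. The graph and coordinates are invented for illustration.

# Sketch: scoring one readability criterion from the text, edge crossings,
# for a straight-line drawing of a graph. Layouts minimizing this count
# (among other, often conflicting criteria) tend to be easier to read.

def ccw(a, b, c):
    """True if points a, b, c make a counter-clockwise turn."""
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """Proper intersection test for segments p1p2 and p3p4 (no collinear cases)."""
    return (ccw(p1, p3, p4) != ccw(p2, p3, p4) and
            ccw(p1, p2, p3) != ccw(p1, p2, p4))

def crossing_count(positions, edges):
    """Count crossings between all pairs of edges that share no endpoint."""
    count = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            (a, b), (c, d) = edges[i], edges[j]
            if {a, b} & {c, d}:
                continue  # edges sharing a node cannot properly cross
            if segments_cross(positions[a], positions[b],
                              positions[c], positions[d]):
                count += 1
    return count

pos = {"A": (0, 0), "B": (2, 2), "C": (0, 2), "D": (2, 0)}
print(crossing_count(pos, [("A", "B"), ("C", "D")]))  # prints 1: an X crossing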
For other types of visualization, the criteria are quite ad hoc. Therefore, more foundational work is needed to establish general principles.

Computer-Mediated Communication: Examples of computer-mediated communication range from work that led to extraordinarily successful applications such as email to work on newer forms of communication via computers, such as real-time video and audio interactions. Research in Computer Supported Cooperative Work (CSCW) confronts complex issues associated with the integration of several technologies (e.g., telephone, video, 3D graphics, cable, modem, fax, email), support for multi-person activities (which pose particularly difficult interface development challenges), and issues of security, privacy, and trust. The unpredicted shift of focus to the Internet, intranets, and the World-Wide Web has ended a period in which the focus was on the interaction between an individual and a computer system, with relatively little attention to group and organizational contexts. Computer-mediated human communication raises a host of new interface issues. Additional challenges arise in coordinating the activities of computer-supported group members, either by providing shared access to common on-line resources and letting people structure their work around them, or by formally representing work processes to enable a system to guide the work.

The CSCW subcommunity of human-computer interaction has grown rapidly, drawing from diverse disciplines. Social theory and social science, management studies, communication studies, and education are among the relevant areas of knowledge and expertise. Techniques drawn from these areas, including ethnographic approaches to understanding group activity, have become important adjuncts to more familiar usability methods. Mounting demands for more function, greater availability, and interoperability affect requirements in all areas. For example, the great increase in accessible information shifts the research agenda toward more sophisticated information retrieval techniques. Approaches to dealing with the new requirements through formal or de facto standards can determine where research is pointless, as well as where it is useful. As traditional applications are integrated into the Web, the social aspects of computing are extended.

Basic Interactions

Direct Manipulation of Graphical Objects: The now ubiquitous direct manipulation interface, where visible objects on the screen are directly manipulated with a pointing device, was first demonstrated by Ivan Sutherland in Sketchpad, which was his 1963 MIT PhD thesis. Sketchpad supported the manipulation of objects using a light-pen, including grabbing objects, moving them, changing size, and using constraints. It contained the seeds of myriad important interface ideas. The system was built at Lincoln Labs with support from the Air Force and NSF. William Newman's Reaction Handler, created at Imperial College, London (1966-67), provided direct manipulation of graphics, and introduced "Light Handles," a form of graphical potentiometer, that was probably the first "widget." Another early system was AMBIT/G (implemented at MIT's Lincoln Labs, 1968, ARPA funded). It employed, among other interface techniques, iconic representations, gesture recognition, dynamic menus with items selected using a pointing device, selection of icons by pointing, and moded and mode-free styles of interaction.
David Canfield Smith coined the term "icons" in his 1975 Stanford PhD thesis on Pygmalion (funded by ARPA and NIMH), and Smith later popularized icons as one of the chief designers of the Xerox Star. Many of the interaction techniques popular in direct manipulation interfaces, such as how objects and text are selected, opened, and manipulated, were researched at Xerox PARC in the 1970's. In particular, the idea of "WYSIWYG" (what you see is what you get) originated there with systems such as the Bravo text editor and the Draw drawing program. The concept of direct manipulation interfaces for everyone was envisioned by Alan Kay of Xerox PARC in a 1977 article about the "Dynabook." The first commercial systems to make extensive use of direct manipulation were the Xerox Star (1981), the Apple Lisa (1982) and Macintosh (1984). Ben Shneiderman at the University of Maryland coined the term "Direct Manipulation" in 1982, identified its components, and gave it psychological foundations.

The Mouse: The mouse was developed at the Stanford Research Laboratory (now SRI) in 1965 as part of the NLS project (funding from ARPA, NASA, and Rome ADC) to be a cheap replacement for light-pens, which had been used at least since 1954. Many of the current uses of the mouse were demonstrated by Doug Engelbart as part of NLS in a movie created in 1968. The mouse was then made famous as a practical input device by Xerox PARC in the 1970's. It first appeared commercially as part of the Xerox Star (1981), the Three Rivers Computer Company's PERQ (1981), the Apple Lisa (1982), and the Apple Macintosh (1984).

Windows: Multiple tiled windows were demonstrated in Engelbart's NLS in 1968. Early research at Stanford on systems like COPILOT (1974) and at MIT with the EMACS text editor (1974) also demonstrated tiled windows. Alan Kay proposed the idea of overlapping windows in his 1969 University of Utah PhD thesis, and they first appeared in 1974 in his Smalltalk system at Xerox PARC, and soon after in the InterLisp system. Some of the first commercial uses of windows were on Lisp Machines Inc. (LMI) and Symbolics Lisp Machines (1979), which grew out of MIT AI Lab projects. The Cedar Window Manager from Xerox PARC was the first major tiled window manager (1981), followed soon by the Andrew window manager from Carnegie Mellon University's Information Technology Center (1983, funded by IBM). The main commercial systems popularizing windows were the Xerox Star (1981), the Apple Lisa (1982), and most importantly the Apple Macintosh (1984). The early versions of the Star and Microsoft Windows were tiled, but eventually they supported overlapping windows like the Lisa and Macintosh. The X Window System, a current international standard, was developed at MIT in 1984.

Application Types

Drawing Programs: Much of the current technology was demonstrated in Sutherland's 1963 Sketchpad system. The use of a mouse for graphics was demonstrated in NLS (1965). In 1968 Ken Pulfer and Grant Bechthold at the National Research Council of Canada built a mouse out of wood, patterned after Engelbart's, and used it with a key-frame animation system to draw all the frames of a movie. A subsequent movie, "Hunger" (1971), won a number of awards and was drawn using a tablet instead of the mouse (funding by the National Film Board of Canada). William Newman's Markup (1975) was the first drawing program for Xerox PARC's Alto, followed shortly by Patrick Baudelaire's Draw, which added handling of lines and curves.
The first computer painting program was probably Dick Shoup's "Superpaint" at PARC (1974-75).

Text Editing: In 1962 at the Stanford Research Lab, Engelbart proposed, and later implemented, a word processor with automatic word wrap, search and replace, user-definable macros, scrolling text, and commands to move, copy, and delete characters, words, or blocks of text. Stanford's TV Edit (1965) was one of the first CRT-based display editors that was widely used. The Hypertext Editing System from Brown University had screen editing and formatting of arbitrary-sized strings with a light pen in 1967 (funding from IBM). NLS demonstrated mouse-based editing in 1968. TECO from MIT was an early screen editor (1967), and EMACS developed from it in 1974. Xerox PARC's Bravo was the first WYSIWYG editor-formatter (1974). It was designed by Butler Lampson and Charles Simonyi, who had started working on these concepts around 1970 while at Berkeley. The first commercial WYSIWYG editors were the Star, LisaWrite and then MacWrite.

Spreadsheets: The initial spreadsheet was VisiCalc, developed by Frankston and Bricklin (1977-8) for the Apple II while they were students at MIT and the Harvard Business School. The solver was based on a dependency-directed backtracking algorithm by Sussman and Stallman at the MIT AI Lab.

Hypertext: The idea for hypertext (where documents are linked to related documents) is credited to Vannevar Bush's famous MEMEX idea from 1945. Ted Nelson coined the term "hypertext" in 1965. Engelbart's NLS system at the Stanford Research Laboratory in 1965 made extensive use of linking (funding from ARPA, NASA, and Rome ADC). The "NLS Journal" was one of the first on-line journals, and it included full linking of articles (1970). The Hypertext Editing System, jointly designed by Andy van Dam, Ted Nelson, and two students at Brown University (funding from IBM), was distributed extensively. The University of Vermont's PROMIS (1976) was the first hypertext system released to the user community. It was used to link patient and patient care information at the University of Vermont's medical center. The ZOG project (1977) from CMU was another early hypertext system, funded by ONR and DARPA. Ben Shneiderman's Hyperties was the first system where highlighted items in the text could be clicked on to go to other pages (1983, Univ. of Maryland). HyperCard from Apple (1988) significantly helped to bring the idea to a wide audience. There have been many other hypertext systems through the years. Tim Berners-Lee used the hypertext idea to create the World Wide Web in 1990 at the government-funded European Particle Physics Laboratory (CERN). Mosaic, the first popular hypertext browser for the World-Wide Web, was developed at the Univ. of Illinois' National Center for Supercomputing Applications (NCSA).

Computer-Aided Design (CAD): The same 1963 IFIPS conference at which Sketchpad was presented also contained a number of CAD systems, including Doug Ross's Computer-Aided Design Project at MIT in the Electronic Systems Lab and Coons' work at MIT with Sketchpad. Timothy Johnson's pioneering work on the interactive 3D CAD system Sketchpad 3 was his 1963 MIT MS thesis (funded by the Air Force). The first CAD/CAM system in industry was probably General Motors' DAC-1 (about 1963).

Video Games: The first graphical video game was probably Spacewar by Slug Russell of MIT in 1962 for the PDP-1, which included the first computer joysticks.
The early computer Adventure game was created by Will Crowther at BBN, and Don Woods developed it into a more sophisticated Adventure game at Stanford. Conway's game of LIFE was implemented on computers at MIT and Stanford in 1970. The first popular commercial game was Pong (about 1976).

UIMSs and Toolkits: The first User Interface Management System (UIMS) was William Newman's Reaction Handler, created at Imperial College, London (1966-67, with SRC funding). Most of the early work took place at universities (the University of Toronto with Canadian government funding; George Washington University with NASA, NSF, DOE, and NBS funding; Brigham Young University with industrial funding). The term UIMS was coined by David Kasik at Boeing (1982). Early window managers such as Smalltalk (1974) and InterLisp, both from Xerox PARC, came with a few widgets, such as popup menus and scrollbars. The Xerox Star (1981) was the first commercial system to have a large collection of widgets and to use dialog boxes. The Apple Macintosh (1984) was the first to actively promote its toolkit for use by other developers, to enforce a consistent interface. An early C++ toolkit was InterViews, developed at Stanford (1988, industrial funding). Much of current research is now being performed at universities, including Garnet and Amulet at CMU (ARPA funded), MasterMind at Georgia Tech (ARPA funded), and Artkit at Georgia Tech (funding from NSF and Intel).

There are, of course, many other examples of HCI research that should be included in a complete history, including work that led to drawing programs, paint programs, animation systems, text editing, spreadsheets, multimedia, 3D, virtual reality, interface builders, event-driven architectures, usability engineering, and a very long list of other significant developments. Although our brief history here has had to be selective, what we hope is clear is that there are many years of productive HCI research behind our current interfaces, and that it has been research results that have led to the successful interfaces of today.

For the future, HCI researchers are developing interfaces that will greatly facilitate interaction and make computers useful to a wider population. These technologies include: handwriting and gesture recognition, speech and natural language understanding, multiscale zoomable interfaces, "intelligent agents" to help users understand systems and find information, end-user programming systems so people can create and tailor their own applications, and much, much more. New methods and tools promise to make the process of developing user interfaces significantly easier, but the challenges are many as we expand the modalities that interface designers employ and as computing systems become an increasingly central part of virtually every aspect of our lives.

As HCI has matured as a discipline, a set of principles has emerged that are generally agreed upon and that are taught in courses on HCI at the undergraduate and graduate level. These principles should be taught to every CS undergraduate, since virtually all programmers will be involved in designing and implementing user interfaces during their careers. These principles, described in other publications, include task analysis, user-centered design, and evaluation methods.

Technological Trends

Again, the number and variety of trends identified in these discussions outstrip the space I have here for reporting.
One can see large general trends that are moving the field from concerns about connectivity, as the networked world becomes a reality, to compatibility, as applications increasingly need to run across different platforms and code begins to move over networks as easily as data, to issues of coordination, as we understand the need to support multi-person and organizational activities. I will limit the discussion here to a few instances of these general trends.

Computational Devices and Ubiquitous Computing: One of the most notable trends in computing is the increase in the variety of computational devices with which users interact. In addition to workstations and desktop personal computers, users are faced with (to mention only a few) laptops, PDAs, and LiveBoards. In the near future, Internet telephony will be universally available, and the much-heralded Internet appliance may allow interactions through the user's television and local cable connection. In the more distant future, wearable devices may become more widely available. All these technologies have been considered under the heading of "Ubiquitous Computing" because they involve using computers everywhere, not just on desks.

The introduction of such devices presents a number of challenges to the discipline of HCI. First, there is a tension between designing interfaces appropriate to the device in question and the need to offer a uniform interface for an application across a range of devices. The computational devices differ greatly, most notably in the sizes and resolutions of their displays, but also in the available input devices, the stance of the user (is the user standing, sitting at a desk, or on a couch?), the physical support of the device (is the device sitting on a desk, mounted on a wall, or held by the user, and is the device immediately in front of the user or across the room?), and the social context of the device's use (is the device meant to be used in a private office, a meeting room, a busy street, or a living room?). On the other hand, applications offered across a number of devices need to offer uniform interfaces, both so that users can quickly learn to use a familiar application on new devices, and so that a given application can retain its identity and recognizability, regardless of the device on which it is operating.

Development of systems meeting these requirements will involve user testing and research into the design of displays and input devices, as well as into the design of effective interfaces, but some systems have already begun to address these problems. Some browsers for the World-Wide Web attempt to offer interfaces that are appropriate to the devices on which they run and yet offer some uniformity. At times this can be difficult. For example, the frames feature of HTML causes a browser to attempt to divide up a user's display without any knowledge of the characteristics of that display. Although building applications that adapt their interfaces to the characteristics of the device on which they are running is one potential direction of research in this area, perhaps a more promising one is to separate the interface from the application and give the responsibility of maintaining the interface to the device itself. A standard set of protocols would allow the application to negotiate the setup of an interface, and later to interact with that interface and, indirectly, with the user.
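To make the negotiation idea concrete, here is a hypothetical sketch of such a protocol: the application describes its interface abstractly, and each device maps that description onto its own display and input hardware. The message shapes, field names, and device profiles below are invented for illustration; no actual standard is being quoted.

# Hypothetical sketch of interface negotiation: the application states what
# it needs in abstract terms; the device, which knows its own hardware,
# replies with a concrete setup. All names are invented for illustration.

import json

def application_request():
    """Abstract interface description, free of device specifics."""
    return {
        "type": "interface-setup",
        "elements": [
            {"id": "query", "role": "text-input", "label": "Search"},
            {"id": "results", "role": "list", "selectable": True},
        ],
    }

def device_negotiate(request, profile):
    """The device, not the application, decides concrete presentation."""
    widget = "on-screen-keyboard" if profile["input"] == "touch" else "text-field"
    per_page = 4 if profile["display"] == "small" else 25
    return {
        "type": "interface-ready",
        "bindings": {"query": widget,
                     "results": {"widget": "list", "page_size": per_page}},
    }

pda = {"display": "small", "input": "touch"}
workstation = {"display": "large", "input": "keyboard"}

req = application_request()
for profile in (pda, workstation):
    # The same abstract request yields a different concrete interface per device.
    print(json.dumps(device_negotiate(req, profile)["bindings"]))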
Such multimodal architectures could address the problems of generating an appropriate interface, as well as providing better support for users with specific disabilities. The architectures could also be distributed, and the building blocks of forthcoming distributed applications could become accessible from assorted computational devices.

Speed, Size, and Bandwidth: The rate of increase of processor speed and storage (the transistor density of semiconductor chips doubles roughly every 18 months, according to Moore's law) suggests a bright future for interactive technologies. An important constraint on utilizing the full power afforded by these technological advances, however, may be network bandwidth. Given the overwhelming trends towards global networked computing, and even the network as computer, the implications of limited bandwidth deserve careful scrutiny. The bottleneck is the "last mile" connecting the Internet to individual homes and small offices. Individuals who do not get access through large employers may be stuck at roughly the present bandwidth rate (28,800 bits per second) at least until the turn of the century. The rate needed for delivery of television-quality video, one of the promises of the National Information Infrastructure, is 4-6 megabits per second, more than a hundred times that amount (at 28,800 bits per second, each second of a 5-megabit-per-second video stream would take roughly three minutes to arrive).

What are the implications for strategic HCI research of potentially massive local processing power together with limited bandwidth? Increases in processor speed and memory suggest that if the information can be collected and cached from the network and/or local sources, local interactive techniques based on signal processing and work context could be utilized to the fullest. With advances in speech and video processing, interfaces that actively watch, listen, catalog, and assist become possible. With increased CPU speed we might design interactive techniques based on work context rather than isolated event handling. Fast event dispatch becomes less important than helpful action. Tools might pursue multiple redundant paths, leaving the user to choose and approve rather than manually specify. We can afford to "waste" time and space on indexing information and tasks that may never be used, solely for the purpose of optimizing user effort. With increased storage capacity it becomes potentially possible to store every piece of interactive information that a user, or even a virtual community, ever sees. The processes of sifting, sorting, finding and arranging increase in importance relative to the editing and browsing that characterize today's interfaces. When it is physically possible to store every paper, e-mail, voice-mail and phone conversation in a user's working life, the question arises of how to provide effective access.

Speech, Handwriting, Natural Language, and Other Modalities: The use of speech will increase the need to allow user-centered presentation of information. Where the form and mode of the output generated by computer-based systems is currently defined by the system designer, a new trend may be to increasingly allow the user to determine the way in which the computer will interact, and to support multiple modalities at the same time. For instance, the user may determine that in a given situation textual natural language output is preferred to speech, or that pictures may be more appropriate than words. These distinctions will be made dynamically, based on the abilities of the user or the limitations of the presentation environment.
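A minimal sketch of such a dynamic modality choice might look as follows; the rules and attribute names are invented examples, and a real system would draw on a richer user model and actual sensing of the environment.

# Sketch of dynamic output-modality selection: the same message is rendered
# as speech, text, or a picture depending on the user's abilities and the
# current environment. The rules below are illustrative, not prescriptive.

def choose_modality(user, environment):
    """Pick an output channel per interaction, not once at design time."""
    if user.get("visually_impaired"):
        return "speech"
    if environment.get("noisy"):
        return "text"      # speech output would be lost here
    if environment.get("screen") == "large":
        return "picture"   # graphics preferred where the display allows
    return "text"

def present(message, user, environment):
    mode = choose_modality(user, environment)
    print(f"[{mode}] {message}")

present("Meeting moved to 3pm", {}, {"noisy": True, "screen": "small"})
present("Meeting moved to 3pm", {"visually_impaired": True}, {"noisy": False})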
As the computing environment used to present data becomes distinct from the environment used to create or store information, interface systems will need to support information adaptation as a fundamental property of information delivery.

3D and Virtual Reality: Another trend is the migration from two-dimensional presentation space (or a 2 1/2-dimensional space, in the case of overlapping windows) to three-dimensional spaces. The beginning of this, in terms of a conventional presentation environment, is the definition of the Virtual Reality Modeling Language (VRML). Other evidence is the use of integrated 3D input and output control in virtual reality systems. The notions of selecting and interacting with information will need to be revised, and techniques for navigation through information spaces will need to be radically altered from the present page-based models. Three-dimensional technologies offer significant opportunities for human-computer interfaces. Application areas that may benefit from three-dimensional interfaces include training and simulation, as well as interactive exploration of complex data environments. A central aspect of three-dimensional interfaces is "near-real-time" interactivity, the ability for the system to respond quickly enough that the effect of direct manipulation is achieved. Near-real-time interactivity implies strong performance demands that touch on all aspects of an application, from data management through computation to graphical rendering. Designing interfaces and applications to meet these demands in an application-independent manner presents a major challenge to the HCI community. Maintaining the required performance in the context of an unpredictable user-configured environment implies a "time-critical" capability, where the system automatically and gracefully degrades quality in order to maintain performance. The design of general algorithms for time-critical applications is a new area and a significant challenge.

CHAPTER TWO
CURRENT DEVELOPMENT

The current development of HCI is focused on advanced user interface design, human perception and cognitive science, Artificial Intelligence, virtual reality, etc.

Human Perception and Cognitive Science

Why do we always need to type into the computer in order for it to do something for us? A very active subfield of HCI these days is human perception and cognitive science. The goal is to enable computers to recognize human actions the same way humans perceive things. The subfields in focus include natural language and speech recognition, gesture recognition, etc. Natural language interfaces enable users to communicate with the computer in their natural language. Some applications of such interfaces are database queries, information retrieval from texts, and so-called expert systems. Current advances in the recognition of spoken language improve the usability of many types of natural language systems. Communication with computers using spoken language will have a lasting impact upon the work environment, opening up completely new areas of application for information technology. In recent years a substantial amount of research has been invested in applying the computer science tool of computational complexity theory to natural language and linguistic theory, and scientists have found that word grammar recognition is computationally intractable (NP-hard, in fact).
Thus, we still have a long way to go before we can conquer this important field of study.

Reasoning, Intelligent Filtering, and Artificial Intelligence

To realize the full potential of HCI, the computer has to share the reasoning involved in interpreting and intelligently filtering the input provided by the human to the computer or, conversely, the information presented to the human. Currently, many scientists and researchers are involved in developing the scientific principles underlying such reasoning mechanisms. The approaches used vary widely, but all of them are based on fundamental directions such as case-based reasoning, learning, computer-aided instruction, natural language processing and expert systems. Among those, computer-aided instruction (CAI) has its origins in the 1960s. These systems were designed to tutor users, thus augmenting, or perhaps substituting for, human teachers. Expert systems are software tools that attempt to model some aspect of human reasoning within a domain of knowledge. Initially, expert systems relied on human experts for their knowledge (an early success in this field was MYCIN [11], developed in the early 1970s under Edward Shortliffe). Now, scientists are focusing on building expert systems that do not rely on human experts.

Virtual Reality

From the days when we used wires and punch cards to input data to the computer and received output via blinking lights, to today's easy-to-use, easy-to-manipulate GUI, the advancement in the user interface has been astonishing; however, many novice computer users still find that computers are hard to access; moreover, even to the experienced user, the current computer interface is still restricting in some sense, that is, one cannot communicate with computers in all the ways one wants. A complete theory of communication must be able to account for all the ways that people communicate, not just natural language. Therefore, virtual reality becomes the ultimate goal of computer interface design. Virtual reality has its origins in the 1950s, when the first video-based flight simulator systems were developed for the military. These days, it receives more and more attention not only from scientists but from the mass population (the popularity of the movie "The Matrix" is one demonstration).

Up-and-Coming Areas

Gesture Recognition: The first pen-based input device, the RAND tablet, was funded by ARPA. Sketchpad used light-pen gestures (1963). Teitelman in 1964 developed the first trainable gesture recognizer. A very early demonstration of gesture recognition was Tom Ellis' GRAIL system on the RAND tablet (1964, ARPA funded). It was quite common in light-pen-based systems to include some gesture recognition, for example in the AMBIT/G system (1968 -- ARPA funded). A gesture-based text editor using proofreading symbols was developed at CMU by Michael Coleman in 1969. Bill Buxton at the University of Toronto has been studying gesture-based interactions since 1980. Gesture recognition has been used in commercial CAD systems since the 1970s, and came to universal notice with the Apple Newton in 1992.

Multi-Media: The FRESS project at Brown used multiple windows and integrated text and graphics (1968, funding from industry). The Interactive Graphical Documents project at Brown was the first hypermedia (as opposed to hypertext) system, and used raster graphics and text, but not video (1979-1983, funded by ONR and NSF).
The Diamond project at BBN (starting in 1982, DARPA funded) explored combining multimedia information (text, spreadsheets, graphics, speech). The Movie Manual at the Architecture Machine Group (MIT) was one of the first to demonstrate mixed video and computer graphics, in 1983 (DARPA funded).

3-D: The first 3-D system was probably Timothy Johnson's 3-D CAD system mentioned above (1963, funded by the Air Force). The "Lincoln Wand" by Larry Roberts was an ultrasonic 3D location sensing system, developed at Lincoln Labs (1966, ARPA funded). That system also had the first interactive 3-D hidden line elimination. An early use was for molecular modeling. The late 60's and early 70's saw the flowering of 3D raster graphics research at the University of Utah with Dave Evans, Ivan Sutherland, Romney, Gouraud, Phong, and Watkins, much of it government funded. Also, the military-industrial flight simulation work of the 60's-70's led the way to making 3-D real-time, with commercial systems from GE, Evans & Sutherland, and Singer/Link (funded by NASA, the Navy, etc.). Another important center of current research in 3-D is Fred Brooks' lab at UNC.

Virtual Reality and "Augmented Reality": The original work on VR was performed by Ivan Sutherland when he was at Harvard (1965-1968, funding by the Air Force, CIA, and Bell Labs). Very important early work was done by Tom Furness when he was at Wright-Patterson AFB. Myron Krueger's early work at the University of Connecticut was influential. Fred Brooks' and Henry Fuchs' groups at UNC did a lot of early research, including the study of force feedback (1971, funding from the US Atomic Energy Commission and NSF). Much of the early research on head-mounted displays and on the Data Glove was supported by NASA.

Computer Supported Cooperative Work: Doug Engelbart's 1968 demonstration of NLS included the remote participation of multiple people at various sites (funding from ARPA, NASA, and Rome ADC). Licklider and Taylor predicted on-line interactive communities in a 1968 article and speculated about the problem of access being limited to the privileged. Electronic mail, still the most widespread multi-user software, was enabled by the ARPAnet, which became operational in 1969, and by the Ethernet from Xerox PARC in 1973. An early computer conferencing system was Turoff's EIES system at the New Jersey Institute of Technology (1975).

Natural Language and Speech: The fundamental research for speech and natural language understanding and generation has been performed at CMU, MIT, SRI, BBN, IBM, AT&T Bell Labs and Bellcore, much of it government funded.

New Frontiers

Now let us take a look at some of the newest developments in HCI today.

Intelligent Room

The Intelligent Room is a project of the MIT Artificial Intelligence Lab. The goal of the project, as stated by Michael H. Coen of the MIT AI Lab, is "creating spaces in which computation is seamlessly used to enhance ordinary, everyday activities." They want to incorporate computers into the real world by embedding them in regular environments, such as homes and offices, and allow people to interact with them the way they do with other people. The user interfaces of these systems are not menus, mice, and keyboards but instead gesture, speech, affect, context, and movement. Their applications are not word processors and spreadsheets, but smart homes and personal assistants.
"Instead of making computer-interface for people, it is of more fundamental value to make peopleinterfaces for computers." They have built two Intelligent Rooms in the laboratory. They give the rooms cameras for eyes and microphones for ears to make accessible the real-world phenomena occurring within them. A multitude of computer vision and speech understanding systems then help interpret human-level phenomena, such as what people are saying, Compiled by Omorogbe Harry 19 HCI where they are standing, etc. By embedding user-interfaces this way, the fact that people tend to point at what they are speaking about is no longer meaningless from a computational viewpoint and they can then use build systems that make use of the information. Coupled with their natural interfaces is the expectation that these systems are not only highly interactive, they talk back when spoken to, but more importantly, that they are useful during ordinary activities. They enable talks historically outside the normal range of human-computer interaction by connecting computers to phenomena (such as someone sneezing or walking into a room) that have traditionally been outside the purview of contemporary user-interfaces. Thus, in the future, you can imagine that elderly people's homes would call an ambulance if they saw anyone fall down. Similarly, you can also imagine kitchen cabinets that automatically lock when young children approach them. Brain-Machine Interfaces Scientists are not satisfied with communicating with computers using natural language or gestures and movements. Instead, they ask a question why can not computers just do what people have in mind. Out of questions like this, there come brain-machine interfaces. Miguel Nicolelis, a Duke University neurobiologist, is one of the leading researchers in this competitive and highly significant field. There are only about a halfdozen teams around the world are pursuing the same goals: gaining a better understanding of how the mind works and then using that knowledge to build implant systems that would make brain control of computers and other machines possible. Nicolelis terms such systems "hybrid brain-machine interfaces" (HBMIs) Recently, working with the Laboratory for Human and Machine Haptics at MIT, he was able to send signals from individual neurons in Belle's, a nocturnal owl monkey, brain to a robot, which used the data to mimic the monkey's arm movements in real time. Scientists predict that Brain-Machine Interfaces will allow human brains to control artificial devices designed to restore lost sensory and motor functions. Paralysis sufferers, for example, might gain control over a motorized wheelchair or a prosthetic arm, or perhaps, even regain control over their own limbs. They believe the brain will prove capable of readily assimilating human-made devices in much the same way that a musician grows to feel that here instrument is a part of his/her own body. Ongoing experiments in other labs are showing that the idea is credible. At Emory University, neurologist Phillip Kennedy has helped severely paralyzed people communicate via a brain implant that allows them to move a cursor on a computer screen. However, scientists still know relatively little about how the electrical and chemical signals emitted by the brain's millions of neurons let us perceive color and smell, or give rise to the precise movements of professional dancers. 
Numerous stumbling blocks remain to be overcome before human brains can interface reliably and comfortably with artificial devices, or before mind-controlled prosthetic limbs become practical. Among the key challenges is developing electrode devices and surgical methods that will allow safe, long-term recording of neuronal activities.

Conclusion - A Look at the Future

In conclusion, Human-Computer Interaction holds great promise. Exploiting this tremendous potential can bring profound benefits in all areas of human concern. Just imagine that one day we will be able to tell computers to do what we want them to do, use gestures and hand signals to command them, or directly invoke them through our thoughts. One day, we will be able to call out an artificial intelligence from the computer, or better yet, a hologram (YES! I am a diehard Star Trek fan), to perform the tasks that we cannot accomplish, to aid in emergency situations, or simply to have someone to listen and to talk to. How bright a future that would be, and all thanks to the research that is going to be done in the Human-Computer Interaction field.

CHAPTER THREE
CONCEPT AND DESIGN IN HCI

Design and Evaluation Methods

Design and evaluation methods have evolved rapidly as the focus of human-computer interaction has expanded. Contributing to this are the versatility of software and the downward price and upward performance spiral, which continually extend the applications of software. The challenges overshadow those faced by designers using previous media and assessment methods. Design and evaluation for a monochrome, ASCII, stand-alone PC was challenging enough, and even that still does not routinely use more than ad hoc methods and intuition. New methods are needed to address the complexities of multimedia design, of supporting networked group activities, and of responding to routine demands for ever-faster turnaround times. More rapid evaluation methods will remain a focus, manifest in recent work on cognitive walkthrough, heuristic evaluation, and other modifications of earlier cognitive modeling and usability engineering approaches. Methods to deal with the greater complexity of assessing use in group settings are moving from research into the mainstream. Ethnographic observation, participatory design, and scenario-based design are being streamlined. Contextual inquiry and design is an example of a method intended to quickly obtain a rich understanding of an activity and transfer that understanding to all design team members.

As well as developing and refining the procedures of design and evaluation methods, we need to understand the conditions under which they work. Are some better for individual tasks, and some excellent for supporting groupware? Are some useful very early in the conceptual phase of design, others best when a specific interface design has already been detailed, and some restricted to when a prototype is in existence? In addition, for proven and promising techniques to become widespread, they need to be incorporated into the education of UI designers. Undergraduate curricula should require such courses for a subset of their students; continuing education courses need to be developed to address the needs of practicing designers.

Tools

All the forms of computer-human interaction discussed here will need to be supported by appropriate tools.
The interfaces of the future will use multiple modalities for input and output (speech and other sounds, gestures, handwriting, animation, and video), multiple screen sizes (from tiny to huge), and have an "intelligent" component ("wizards" or "agents" that adapt the interface to the different wishes and needs of the various users). The tools used to construct these interfaces will have to be substantially different from those of today. Whereas most of today's tools support widgets such as menus and dialog boxes well, these will make up only a tiny fraction of the interfaces of the future. Instead, the tools will need to access and control, in some standard way, the main application data structures and internals, so that speech systems and agents can know what the user is talking about and doing. If the user says "delete the red truck," the speech system needs access to the objects to see which one is to be deleted. Otherwise, each application will have to deal with its own speech interpretation, which is undesirable. Furthermore, an agent might notice that this is the third red truck that was deleted, and propose to delete the rest. If confirmed, the agent will need to be able to find the rest of the trucks that meet the criteria. Increasingly, future user interfaces will be built around standardized data structures or "knowledge bases" to make these facilities available without requiring each application to rebuild them. These procedures should be supported by the system-building tools themselves. This would make the evaluation of ideas extremely easy for designers, allowing ubiquitous evaluation to become a routine aspect of system design.

Concepts of User Interface Design

Learnability vs. Usability
Many people consider the primary criterion for a good user interface to be the degree to which it is easy to learn. This is indeed a laudable quality of any user interface, but it is not necessarily the most important. The goal of the user interface should be foremost in the design process. Consider the example of a visitor information system located on a kiosk. In this case it makes perfect sense that the primary goal for the interface designers should be ease of operation for the first-time user. The more the interface walks the user through the system step by step, the more successful the interface will be. In contrast, consider a data entry system used daily by an office of heads-down operators. Here the primary goal should be that the operators can input as much information as possible, as efficiently as possible. Once the users have learned how to use the interface, anything intended to make first-time use easier will only get in the way. User interface design is not a "one size fits all" process. Every system has its own considerations and accompanying design goals. The Requirements Phase is designed to elicit from the design team the kind of information that should make these goals clear.

Metaphors and Idioms

The True Role of Metaphors in the GUI
When the GUI first entered the market, it was heralded most of all for its use of metaphors. Careful consideration of what really made the GUI successful, however, would appear to indicate that the use of metaphors was actually a little further down the list. Metaphors were really nothing new. The term computer "file" was chosen as a metaphor for a collection of separate but related items held in a single container. This term dates back to the very early days of computers.
The single most significant aspect of the GUI was the way in which it presented all possible options to the users rather than requiring them to memorize commands and enter them without error. This has nothing to do with metaphor and everything to do with focusing the user interface on the needs of the user rather than mandating that the user conform to the needs of the computer. The visual aspect of the GUI was also a tremendous advancement. People often confuse this visual presentation with pure metaphor, but closer inspection reveals that this is not necessarily the case. The "desktop" metaphor was the first thing to hit users of the GUI. Since it was a global metaphor, and the small pictures of folders, documents, and diskettes played directly into it, people bought the entire interface as one big metaphor. But there are significant aspects of the GUI that have nothing to do with metaphor.

Metaphors vs. Idioms
If someone says that a person "wants to have his cake and eat it too," we can intuit the meaning of the expression through its metaphoric content. The cake is a metaphor for that which we desire, and the expectation of both possessing it and consuming it is metaphoric for the assumption that acquisition of our desires comes at no cost. But if someone says that his pet turtle "croaked," it is not possible to intuit the meaning through the metaphoric content of the expression. The expression "croaked" is an idiom. We know instantly that the turtle didn't make a funny noise but rather that it died. The meaning of an idiom must be learned, but it is learned quickly and, once learned, retained indefinitely. Most visual elements of the GUI are better thought of as idioms. A scroll bar, for example, is not a metaphor for anything in the physical world. It is an entirely new construct, yet it performs an obvious function, its operation is easily mastered, and users easily remember how it works. It is the visual aspect of the scroll bar that allows it to be learned so quickly. Users operate it with visual cues rather than remembering the keys for line up, line down, page up, page down, and so on.

Metaphors Can Hinder As Well As Help
The use of metaphor can be helpful when it fits well into a situation, but it is not a panacea and is not guaranteed to add value. The use of icons as metaphors for functions is a good example. It is a gamble whether someone will understand the connection between an icon and its function. Anyone who has played Pictionary knows that the meaning of a picture is not always clear. Consider the Microsoft Word 5.0 toolbar. Some icons are readily identifiable, some are not. The meaning of an identifiable icon will likely be gleaned from the icon itself, but even that is not guaranteed. The unidentifiable icons, however, can be utterly perplexing, and rather than helping they can create confusion and frustration. And with so many pictographs crammed into such a small space, the whole thing reads like a row of enigmatic, ancient Egyptian hieroglyphs. The Netscape toolbar, by contrast, is much more graceful and useful. The buttons are a bit larger, which makes them generally more readable. Their added size also allows the inclusion of text labels indicating the command to which each icon corresponds. Once the meaning of each icon has been learned, the icon can serve as a visual mnemonic; until then, the text label clearly and unambiguously relays the function the button will initiate.
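The value of pairing icons with text labels is easy to demonstrate in code. The following is a minimal sketch written for this note in Python's standard Tkinter toolkit (it is not taken from either of the toolbars discussed above); the blank placeholder image stands in for real icon artwork, an assumption made to keep the example self-contained:

import tkinter as tk

root = tk.Tk()
root.title("Labeled toolbar sketch")

toolbar = tk.Frame(root, bd=1, relief=tk.RAISED)
toolbar.pack(side=tk.TOP, fill=tk.X)

# A blank 16x16 image stands in for real icon artwork (an assumption
# for self-containment); in practice you would load a file, e.g.
# tk.PhotoImage(file="back.gif").
blank_icon = tk.PhotoImage(width=16, height=16)

for caption in ("Back", "Forward", "Reload"):
    # compound=tk.TOP stacks the image above the caption, so the text
    # label relays the function until the icon itself has been learned.
    button = tk.Button(toolbar, image=blank_icon, text=caption,
                       compound=tk.TOP, width=64)
    button.pack(side=tk.LEFT, padx=2, pady=2)

root.mainloop()

Until the icons are learned, the captions carry the meaning; afterwards the images serve as the visual mnemonics described above.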
The Netscape toolbar admittedly consumes more valuable window real estate than the Microsoft Word toolbar does. There are keystroke shortcuts for every button, however, and users who have mastered them can easily hide the toolbar from view. Users who prefer to use the toolbar are probably willing to sacrifice that small bit of real estate in order to have a toolbar that is presentable and easy to use.

The "Global Metaphor" Quagmire
One major pitfall into which metaphors can lead us is the "global metaphor": a metaphor that is intended to encompass an entire application. The "desktop" concept is an example of a global metaphor. The global metaphor becomes a quagmire when reality begins to diverge from the metaphor. Consider the desktop metaphor carefully and you can see immediately how it deviates from reality. The trash can is a wonderful metaphor for the deletion function, but trash cans are generally not situated on top of a desk. The use of the trash can to eject a disk is a perfect example of contorting the metaphor to accommodate the divergence from reality. The expectation is that "trashing" a disk will delete its contents, yet the interface designers needed a way to eject a disk and the trash can came closer than anything else. Once learned it becomes an idiom that works fine, but it is initially counter-intuitive to the point of being shocking. The vertical aspect of the desktop also subverts the metaphor. It is closer to a refrigerator on which one can randomly place differently shaped magnets, or the old-fashioned displays on which TV weathermen placed various symbols. The fact that the desktop metaphor has to be explained to first-time users is an indication that it might not be terribly intuitive. The global metaphor is an example of the "bigger is better" mentality. Metaphors are perceived as being useful, so some people assume that the more all-encompassing a metaphor is, the more useful it will be. As in all other situations, the usefulness of a global metaphor is dictated by the overall goals of the interface. If the goal of the interface is to present a non-threatening face on a system that will be used primarily by non-technical first-time users, a global metaphor might be useful. But if the goal of the interface is to input large quantities of data quickly and effectively, a global metaphor might be an enormous hindrance.

Don't Throw The Baby Out With The Bath Water
While metaphors aren't always as useful as other solutions, it is important to note that in the right situation they can be a vital part of a quality user interface. The folder is a particularly useful and successful metaphor. Its purpose is immediately apparent, and by placing one folder inside another the user creates a naturally intuitive hierarchy. The counterpart in the character user interface is the directory/subdirectory construct, which has no clear correspondence to anything in the physical world, and many non-technical people have difficulty grasping the concept. The bottom line is that if a metaphor works naturally, by all means use it. But at the first hint that the metaphor is not clearly understood, or has to be contorted in order to accommodate reality, consider seriously whether it will really help.

Intuitiveness
It is generally perceived that the most fundamental quality of any good user interface should be that it is intuitive.
The problem is that "intuitive" means different things to different people. To some an intuitive user interface is one that users can figure out for themselves. There are some instances where this is helpful, but generally the didactic elements geared for the first-time user will hamper the effectiveness of intermediate or advanced users. A much better definition of an intuitive user interface is one that is easy to learn. This does not mean that no instruction is required, but that it is minimal and that users can "pick it up" quickly and easily. First-time users might not intuit how to operate a scroll bar, but once it is explained they generally find it to be an intuitive idiom. Icons, when clearly unambiguous, can help to make a user interface intuitive. But the user interface designer should never overlook the usefulness of good old-fashioned text labels. Icons depicting portrait or landscape orientation, for example, are clearly unambiguous and perhaps more intuitive than the labels themselves, but without the label of "orientation," they could make no sense at all. Labels should be concise, cogent, and unambiguous. A good practice is to make labels conform to the terminology of the business that the application supports. This is a good way to pack a lot of meaning into a very few words. Designing intuitive user interfaces is far more an art than a science. It draws more upon skills of psychology and cognitive reasoning than computer engineering or even graphic design. The process of Usability Testing, however, can assess the intuitiveness of a user interface in an objective manner. Designing an intuitive user interface is like playing a good game of tennis. Instructors can tell you how to do it, but it can only be achieved through hard work and practice with a lot of wins and losses on the way. Consistency Consistency between applications is always good, but within an application it is essential. The standard GUI design elements go a long way to bring a level of consistency to every panel, but "look and feel" issues must be considered as well. The use of labels and icons must always be consistent. The same label or icon should always mean the same thing, and conversely the same thing should always be represented by the same label or icon. In addition to consistency of labeling, objects should also be placed in a consistent manner. Consider the example of the Employee Essentials Address Update panels (available through Bear Access). Compiled by Omorogbe Harry 26 HCI There is a different panel for every address that can be updated, each with its own set of fields to be displayed and modified. Note that each panel is clearly labeled, with the label appearing in the same location on every panel. A button bank appears in the same place along the left side of every panel. Some buttons must change to accommodate the needs of any given panel, but positionality was used consistently. The closer buttons are to the top the less likely they are to change, and the closer to the bottom the more likely. Note especially the matrix of buttons at the top left corner of every panel. These buttons are the same in every panel of the entire Employee Essentials application. They are known as "permanent objects." Early navigators used stars and constellations as unchanging reference points around which they could plot their courses. Similarly, modern aviation navigators use stationary radar beacons. They know that wherever the plane is, they can count on the radar beacon always being in the same place. 
User interface designers should likewise provide permanent objects as unchanging reference points around which the users can navigate. If users ever get lost or disoriented, they should be able to quickly find the permanent objects and from there get to where they need to be. On the Macintosh, the apple menu and applications menu are examples of permanent objects. No matter what application the user is in, those objects will appear on the screen. Almost all Macintosh applications provide "File" and "Edit" as the first two pull-down menus. The "File" menu generally has "New," "Open," "Close," "Save," and "Save As" as the first selections in the menu, and "Quit" as the last selection. The "Edit" menu generally has "Cut," "Copy," and "Paste" as the first selections. The ubiquity of these conventions has caused them to become permanent objects. The users can count on finding them in virtually all circumstances, and from there do what they need to do. Bear Access itself is becoming a permanent object at Cornell. If a user is at an unfamiliar workstation, all he or she needs to do is locate Bear Access, and from there an extensive suite of applications will be available.

Simplicity
The complexity of computers and the information systems they support often causes us to overlook Occam's Razor, the principle that the most graceful solution to any problem is the one which is the most simple. A good gauge of simplicity is often the number of panels that must be displayed and the number of mouse clicks or keystrokes that are required to accomplish a particular task. All of these should be minimized. The fewer things users have to see and do in order to get their work done, the happier and more effective they will be. A good example of this is the way in which the user sets the document type in Microsoft Word version 5.0 as compared to version 4.0. In version 4.0, the user clicks a button on the save dialog that presents another panel containing a selection of radio buttons indicating all the valid file types. In version 5.0, there is simply a pop-up list on the save dialog. This requires fewer panels to be displayed and fewer mouse clicks to be made, and yet accomplishes exactly the same task. A pitfall that should be avoided is "featuritis": providing an over-abundance of features that do not add value to the user interface. New tools that are available to developers allow all kinds of things to be done that weren't possible before, but it is important not to add features just because it is possible to do so. The indiscriminate inclusion of features can confuse the users and lead to "window pollution." Features should not be included in a user interface unless there is a compelling need for them and they add significant value to the application.

Prevention
A fundamental tenet of graphical user interfaces is that it is preferable to prevent users from performing an inappropriate task in the first place rather than allowing the task to be performed and presenting a message afterwards saying that it couldn't be done. This is accomplished by disabling, or "graying out," certain elements under certain conditions. Consider the average save dialog. A document cannot be saved if it has not been given a name. Note how the Save button is disabled when the name field is blank, but is enabled once a name has been entered.
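A minimal sketch of this prevention pattern, again in Tkinter and invented for this note rather than taken from any particular save dialog, wires the Save button's state to the contents of the name field:

import tkinter as tk

root = tk.Tk()
root.title("Save dialog sketch")

name_var = tk.StringVar()
tk.Label(root, text="Name:").pack(side=tk.LEFT, padx=4, pady=8)
tk.Entry(root, textvariable=name_var).pack(side=tk.LEFT)

# The Save button starts disabled: prevention, not an error message.
save_button = tk.Button(root, text="Save", state=tk.DISABLED)
save_button.pack(side=tk.LEFT, padx=4)

def on_name_change(*_args):
    # Enable Save only while the field holds something besides spaces.
    state = tk.NORMAL if name_var.get().strip() else tk.DISABLED
    save_button.config(state=state)

name_var.trace_add("write", on_name_change)
root.mainloop()

The design choice is that the interface never has to apologize afterwards: an unnameable save simply cannot be requested in the first place.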
Forgiveness
One of the advantages of graphical user interfaces is that, with all the options plainly laid out, users are free to explore and discover things for themselves. But this requires that there always be a way out if they find themselves somewhere they realize they shouldn't be, and that special care is taken to make it particularly difficult for them to "shoot themselves in the foot." A good tip for keeping users from inadvertently causing damage is to avoid the use of the Okay button in critical situations. It is much better to have button labels that clearly indicate the action that will be taken. Consider the example of a user closing a document that contains unsaved changes. It can be very misleading to show a message that says "Continue without saving?" with a default button labeled "Okay." It is much better to have a dialog that says "Document has been changed" and a default button labeled "Save," with a "Don't Save" button to allow the user not to save changes if that is, in fact, the desired action. Likewise, it can be helpful in potentially dangerous situations to make the Cancel button the default, so that executing the function requires a deliberate action on the part of the user. An example is a confirmation dialog when a record is being deleted.

Aesthetics
Finally, it is important that a user interface be aesthetically pleasing. It is possible for a user interface to be intuitive, easy to use, and efficient and still not be terribly nice to look at. While aesthetics do not directly impact the effectiveness of a user interface, users will be happier, and therefore more productive, if they are presented with an attractive user interface.

CHAPTER FOUR
Principles for User-Interface Design
This section represents a compilation of fundamental principles for designing user interfaces, which have been drawn from various books on interface design, as well as my own experience. Most of these principles can be applied to either command-line or graphical environments. I welcome suggestions for changes and additions -- I would like this to be viewed as an "open-source" evolving section.

The principle of user profiling -- Know who your user is
Before we can answer the question "How do we make our user interfaces better?", we must first answer the question: better for whom? A design that is better for a technically skilled user might not be better for a non-technical businessman or an artist. One way around this problem is to create user models. [TOG91] has an excellent chapter on brainstorming towards creating "profiles" of possible users. The result of this process is a detailed description of one or more "average" users, with specific details such as:
- What are the user's goals?
- What are the user's skills and experience?
- What are the user's needs?
Armed with this information, we can then proceed to answer the question: how do we leverage the user's strengths and create an interface that helps them achieve their goals? In the case of a large general-purpose piece of software such as an operating system, there may be many different kinds of potential users. In this case it may be more useful to come up with a list of user dichotomies, such as "skilled vs. unskilled", "young vs. old", etc., or some other means of specifying a continuum or collection of user types. Another way of answering this question is to talk to some real users.
Direct contact between end-users and developers has often radically transformed the development process.

The principle of metaphor -- Borrow behaviors from systems familiar to your users
Frequently a complex software system can be understood more easily if the user interface is depicted in a way that resembles some commonplace system. The ubiquitous "desktop metaphor" is an overused and trite example. Another is the tape-deck metaphor seen on many audio and video player programs. In addition to the standard transport controls (play, rewind, etc.), the tape-deck metaphor can be extended in ways that are quite natural, with functions such as time counters and cueing buttons. This concept of "extendibility" is what distinguishes a powerful metaphor from a weak one. There are several factors to consider when using a metaphor:
- Once a metaphor is chosen, it should be spread widely throughout the interface, rather than used once at a specific point. Even better is to use the same metaphor across several applications (the tape transport controls described above are a good example). Don't bother thinking up a metaphor which is only going to apply to a single button.
- There's no reason why an application cannot incorporate several different metaphors, as long as they don't clash. Music sequencers, for example, often incorporate both "tape transport" and "sheet music" metaphors.
- Metaphor isn't always necessary. In many cases the natural function of the software itself is easier to comprehend than any real-world analog of it.
- Don't strain a metaphor in adapting it to the program's real function. Nor should you strain the meaning of a particular program feature in order to adapt it to a metaphor.
- Incorporating a metaphor is not without certain risks. In particular, whenever physical objects are represented in a computer system, we inherit not only the beneficial functions of those objects but also the detrimental aspects.
- Be aware that some metaphors don't cross cultural boundaries well. For example, Americans would instantly recognize the common U.S. mailbox (with a rounded top, a flat bottom, and a little red flag on the side), but there are no mailboxes of this style in Europe.

The principle of feature exposure -- Let the user see clearly what functions are available
Software developers tend to have little difficulty keeping large, complex mental models in their heads. But not everyone prefers to "live in their heads" -- some prefer to concentrate on analyzing the sensory details of the environment rather than spending large amounts of time refining and perfecting abstract models. Both types of personality (labeled "Intuitive" and "Sensable" in the Myers-Briggs personality classification) can be equally intelligent, but they focus on different aspects of life. Note that according to some psychological studies, "Sensables" outnumber "Intuitives" in the general population by about three to one. Intuitives prefer user interfaces that utilize the power of abstract models -- command lines, scripts, plug-ins, macros, etc. Sensables prefer user interfaces that utilize their perceptual abilities -- in other words, they like interfaces where the features are "up front" and "in their face". Toolbars and dialog boxes are examples of interfaces that are pleasing to this personality type. This doesn't mean that you have to make everything a GUI.
What it does mean, for both GUI and command-line programs, is that the features of the program need to be easily exposed, so that a quick visual scan can determine what the program actually does. In some cases, such as a toolbar, the program features are exposed by default. In other cases, such as a printer configuration dialog, the exposures of the underlying printer state (i.e. the buttons and controls which depict the conceptual printing model) are contained in a dialog box which is brought up by a user action (a feature which is itself exposed in a menu). Of course, there may be cases where you don't wish to expose a feature right away, because you don't want to overwhelm the beginning user with too much detail. In such cases, it is best to structure the application like the layers of an onion, where peeling away each layer of skin reveals a layer beneath. There are various levels of "hiding"; here is a partial list of them, in order from most exposed to least exposed:
- Toolbar (completely exposed)
- Menu item (exposed by trivial user gesture)
- Submenu item (exposed by somewhat more involved user gesture)
- Dialog box (exposed by explicit user command)
- Secondary dialog box (invoked by a button in the first dialog box)
- "Advanced user mode" controls (exposed when the user selects the "advanced" option)
- Scripted functions
The above notwithstanding, in no case should the primary interface of the application be a reflection of the true complexity of the underlying implementation. Instead, both the interface and the implementation should strive to match a simplified conceptual model (in other words, the design) of what the application does. For example, when an error occurs, the explanation of the error should be phrased in a way that relates to the current user-centered activity, and not in terms of the low-level fault that caused the error.

The principle of coherence -- The behavior of the program should be internally and externally consistent
There has been some argument over whether interfaces should strive to be "intuitive", or whether an intuitive interface is even possible. However, it is certainly arguable that an interface should be coherent -- in other words logical, consistent, and easily followed. ("Coherent" literally means "stick together", and that is exactly what the parts of an interface design should do.) Internal consistency means that the program's behaviors make "sense" with respect to other parts of the program. For example, if one attribute of an object (e.g. color) is modifiable using a pop-up menu, then it is to be expected that other attributes of the object would also be editable in a similar fashion. One should strive towards the principle of "least surprise". External consistency means that the program is consistent with the environment in which it runs. This includes consistency with both the operating system and the typical suite of applications that run within that operating system. One of the most widely recognized forms of external coherence is compliance with user-interface standards. There are many others, however, such as the use of standardized scripting languages, plug-in architectures, or configuration methods.

The principle of state visualization -- Changes in behavior should be reflected in the appearance of the program
Each change in the behavior of the program should be accompanied by a corresponding change in the appearance of the interface.
One of the big criticisms of "modes" in interfaces is that many of the classic "bad example" programs have modes that are visually indistinguishable from one another. Similarly, when a program changes its appearance, it should be in response to a behavior change; a program that changes its appearance for no apparent reason will quickly teach the user not to depend on appearances for clues as to the program's state. One of the most important kinds of state is the current selection; in other words, the object or set of objects that will be affected by the next command. It is important that this internal state be visualized in a way that is consistent, clear, and unambiguous. For example, one common mistake seen in a number of multi-document applications is forgetting to "dim" the selection when the window goes out of focus. The result of this is that a user, looking at several windows at once, each with a similar-looking selection, may be confused as to exactly which selection will be affected when they hit the "delete" key. This is especially true if the user has been focusing on the selection highlight, and not on the window frame, and consequently has failed to notice which window is the active one. (Selection rules are one of those areas that are covered poorly by most UI style guidelines, which tend to concentrate on "widgets", although the Mac and Amiga guidelines each have a chapter on this topic.)

The principle of shortcuts -- Provide both concrete and abstract ways of getting a task done
Once a user has become experienced with an application, she will start to build a mental model of that application. She will be able to predict with high accuracy what the results of any particular user gesture will be in any given context. At this point, the program's attempts to make things "easy" by breaking up complex actions into simple steps may seem cumbersome. Additionally, as this mental model grows, there will be less and less need for the "in your face" exposure of the application's feature set. Instead, pre-memorized "shortcuts" should be available to allow rapid access to more powerful functions. There are various levels of shortcuts, each one more abstract than its predecessor. For example, in the emacs editor, commands can be invoked directly by name, by menu bar, by a modified keystroke combination, or by a single keystroke. Each of these is more "accelerated" than its predecessor. There can also be alternate methods of invoking commands that are designed to increase power rather than speed. A "recordable macro" facility is one of these, as is a regular-expression search and replace. The important thing about these more powerful (and more abstract) methods is that they should not be the most exposed methods of accomplishing the task. This is why emacs has the non-regexp version of search assigned to the easy-to-remember "C-s" key.

The principle of focus -- Some aspects of the UI attract attention more than others do
The human eye is a highly non-linear device. For example, it possesses edge-detection hardware, which is why we see Mach bands whenever two closely matched areas of color come into contact. It also has motion-detection hardware. As a consequence, our eyes are drawn to animated areas of the display more readily than static areas. Changes to these areas will be noticed readily.
The mouse cursor is probably the most intensely observed object on the screen -- it is not only a moving object, but mouse users quickly acquire the habit of tracking it with their eyes in order to navigate. This is why global state changes are often signaled by changes to the appearance of the cursor, such as the well-known "hourglass cursor". It is nearly impossible to miss. The text cursor is another example of a highly eye-attractive object. Changing its appearance can signal a number of different and useful state changes.

The principle of grammar -- A user interface is a kind of language; know what the rules are
Many of the operations within a user interface require both a subject (an object to be operated upon) and a verb (an operation to perform on the object). This naturally suggests that actions in the user interface form a kind of grammar. The grammatical metaphor can be extended quite a bit, and there are elements of some programs that can be clearly identified as adverbs, adjectives, and such. The two most common grammars are known as "Action->Object" and "Object->Action". In Action->Object, the operation (or tool) is selected first. When a subsequent object is chosen, the tool immediately operates upon the object. The selection of the tool persists from one operation to the next, so that many objects can be operated on one by one without having to re-select the tool. Action->Object is also known as "modality", because the tool selection is a "mode" that changes the operation of the program. An example of this style is a paint program: a tool such as a paintbrush or eraser is selected and can then make many brush strokes before a new tool is selected. In the Object->Action case, the object is selected first and persists from one operation to the next. Individual actions are then chosen which operate on the currently selected object or objects. This is the method seen in most word processors: first a range of text is selected, and then a text style such as bold, italic, or a font change can be applied. Object->Action has been called "non-modal" because all behaviors that can be applied to the object are always available. One powerful type of Object->Action is called "direct manipulation", where the object itself is a kind of tool; an example is dragging the object to a new position or resizing it. Modality has been much criticized in user-interface literature because early programs were highly modal and had hideous interfaces. However, while non-modality is the clear winner in many situations, there are a large number of situations in life that are clearly modal. For example, in carpentry, it is generally more efficient to hammer in a whole bunch of nails at once than to hammer in one nail, put down the hammer, pick up the measuring tape, mark the position of the next nail, pick up the drill, and so on.

The principle of help -- Understand the different kinds of help a user needs
An essay in [LAUR91] states that there are five basic types of help, corresponding to the five basic questions that users ask:
- Goal-oriented: "What kinds of things can I do with this program?"
- Descriptive: "What is this? What does this do?"
- Procedural: "How do I do this?"
- Interpretive: "Why did this happen?"
- Navigational: "Where am I?"
The essay goes on to describe in detail the different strategies for answering these questions, and shows how each of these questions requires a different sort of help interface in order for the user to be able to adequately phrase the question to the application. For example, "about boxes" are one way of addressing questions of the first type. Questions of the second type can be answered with a standard "help browser", "tool tips", or other kinds of context-sensitive help. A help browser can also be useful in responding to questions of the third type, but these can sometimes be more efficiently addressed using "cue cards", interactive "guides", or "wizards" that guide the user through the process step by step. The fourth type has not been well addressed in current applications, although well-written error messages can help. The fifth type can be answered by proper overall interface design, or by creating an application "roadmap". None of the solutions listed here are final or ideal; they are simply the ones in common use by many applications today.

The principle of safety -- Let the user develop confidence by providing a safety net
Ted Nelson once said, "Using DOS is like juggling with straight razors. Using a Mac is like shaving with a bowling pin." Each human mind has an "envelope of risk": a minimum and maximum range of risk levels which it finds comfortable. A person who finds herself in a situation that is too risky for her comfort will generally take steps to reduce that risk. Conversely, when a person's life becomes too safe -- in other words, when the risk level drops below the minimum threshold of the risk envelope -- she will often engage in actions that increase her level of risk. This comfort envelope varies for different people and in different situations. In the case of computer interfaces, a level of risk that is comfortable for a novice user might make a "power user" feel uncomfortably swaddled in safety. It is important for new users that they feel safe. They don't trust themselves or their skills to do the right thing. Many novice users think poorly not only of their technical skills, but of their intellectual capabilities in general (witness the popularity of the "...for Dummies" series of tutorial books). In many cases these fears are groundless, but they need to be addressed. Novice users need to be assured that they will be protected from their own lack of skill. A program with no safety net will make this type of user feel uncomfortable or frustrated, to the point that they may stop using the program. The "Are you sure?" dialog box and multi-level undo features are vital for this type of user. At the same time, an expert user must be able to use the program as a virtuoso. She must not be hampered by guard rails or helmet laws. Expert users, however, are also smart enough to turn off the safety checks -- if the application allows it. This is why "safety level" is one of the more important application configuration options. Finally, it should be noted that many things in life are not meant to be easy. Physical exercise is one: "no pain, no gain". A concert performance in Carnegie Hall, a marathon, or a Guinness world record would be far less impressive if anybody could do it. This is especially pertinent in the design of computer game interfaces, which operate under somewhat different principles than those listed here (although many of the principles in fact do apply).
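Multi-level undo, mentioned above as part of the safety net, is commonly built as a pair of stacks of reversible commands. The sketch below is a generic Python illustration written for this note; the class and method names are invented, not drawn from any cited source:

class UndoStack:
    """Minimal multi-level undo: every action is a (do, undo) pair."""

    def __init__(self):
        self._done = []    # actions that can be undone
        self._undone = []  # actions that can be redone

    def execute(self, do, undo):
        do()
        self._done.append((do, undo))
        self._undone.clear()  # a fresh action invalidates redo history

    def undo(self):
        if self._done:
            do, undo = self._done.pop()
            undo()
            self._undone.append((do, undo))

    def redo(self):
        if self._undone:
            do, undo = self._undone.pop()
            do()
            self._done.append((do, undo))


# Example: reversible edits to a document modelled as a list of lines.
doc = []
history = UndoStack()
history.execute(lambda: doc.append("first line"), lambda: doc.pop())
history.execute(lambda: doc.append("second line"), lambda: doc.pop())
history.undo()    # doc is back to ["first line"]
history.redo()    # doc is ["first line", "second line"] again

Because every action carries its own inverse, the depth of the safety net is limited only by the length of the stack, which suits the novice while staying out of the expert's way.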
The principle of context -- Limit user activity to one well-defined context unless there's a good reason not to
Each user action takes place within a given context: the current document, the current selection, the current dialog box. A set of operations that is valid in one context may not be valid in another. Even within a single document, there may be multiple levels. For example, in a structured drawing application, selecting a text object (which can be moved or resized) is generally considered a different state from selecting an individual character within that text object. It's usually a good idea to avoid mixing these levels. For example, imagine an application that allows users to select a range of text characters within a document, and also allows them to select one or more whole documents (the latter being a distinct concept from selecting all of the characters in a document). In such a case, it's probably best if the program disallows selecting both characters and documents in the same selection. One unobtrusive way to do this is to "dim" the selection that is not applicable in the current context. In the example above, if the user had a range of text selected and then selected a document, the range of selected characters could become dim, indicating that that selection was not currently pertinent. The exact solution chosen will of course depend on the nature of the application and the relationship between the contexts. Another thing to keep in mind is the relationship between contexts. For example, it is often the case that the user is working in a particular task space when suddenly a dialog box pops up asking for confirmation of an action. This sudden shift of context may leave the user wondering how the new context relates to the old. The confusion is exacerbated by the terseness of writing style that is common amongst application writers. Rather than the "Are you sure?" confirmation mentioned earlier, something like "There are two documents unsaved. Do you want to quit anyway?" would help to keep the user anchored in their current context.

The principle of aesthetics -- Create a program of beauty
It is not necessary that each program be a visual work of art. But it is important that it not be ugly. There are a number of simple principles of graphical design that can easily be learned, the most basic of which was coined by artist and science fiction writer William Rotsler: "Never do anything that looks to someone else like a mistake." The specific example Rotsler used was a painting of a Conan-esque barbarian warrior swinging a mighty broadsword, in which the tip of the broadsword was just off the edge of the picture. "What that looks like", said Rotsler, "is a picture that's been badly cropped. They should have had the tip of the sword either clearly within the frame or clearly out of it." An interface example can be seen in the placement of buttons: imagine five buttons with five different labels of similar length. Because the buttons are packed using an automated layout algorithm, each button is almost but not exactly the same size. As a result, though the author has put much care into his layout, it looks carelessly done. A solution is to have the packing algorithm know that buttons that are almost the same size look better if they are exactly the same size -- in other words, to encode some of the rules of graphical design into the layout algorithm.
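Encoding that particular rule is straightforward: size every button in a group to fit its widest member. The following hypothetical Python sketch assumes fixed per-character and padding widths purely for illustration; a real toolkit would supply measured font metrics instead:

def uniform_button_width(labels, char_width=8, padding=16):
    # Buttons that are almost the same size look like a mistake; sizing
    # the whole group to its widest member reads as deliberate design.
    # char_width and padding are assumed constants, not real metrics.
    widest = max(len(label) for label in labels)
    return widest * char_width + padding

labels = ["Save", "Save As", "Revert", "Print", "Close"]
width = uniform_button_width(labels)
for label in labels:
    print(f"{label:<8} -> {width}px")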
Similar arguments hold for manual widget layout. Another area of aesthetics to consider is the temporal dimension. Users don't like programs that feel sluggish or slow. There are many tricks that can be used to make a slow program "feel" snappy, such as the use of off-screen bitmaps for rendering, which can then be blitted forward in a single operation. (A pet peeve of this particular author is buttons that flicker when the button is being activated or the window is being resized. Multiply redundant refreshing of buttons when changing state is one common cause of this.)

The principle of user testing -- Recruit help in spotting the inevitable defects in your design
In many cases a good software designer can spot fundamental defects in a user interface. However, there are many kinds of defects which are not so easy to spot, and in fact an experienced software designer is often less capable of spotting them than the average person. In other cases, a bug can only be detected while watching someone else use the program. User-interface testing, that is, the testing of user interfaces using actual end-users, has been shown to be an extraordinarily effective technique for discovering design defects. However, there are specific techniques that can be used to maximize the effectiveness of end-user testing. These are outlined in both [TOG91] and [LAUR91] and can be summarized in the following steps:
- Set up the observation. Design realistic tasks for the users, then recruit end-users who have the same experience level as the users of your product (avoid recruiting users who are already familiar with your product, however).
- Describe to the user the purpose of the observation. Let them know that you're testing the product, not them, and that they can quit at any time. Make sure they understand that if anything bad happens, it's not their fault, and that it is helping you to find problems.
- Talk about and demonstrate the equipment in the room.
- Explain how to "think aloud". Ask them to verbalize what they are thinking about as they use the product, and let them know you'll remind them to do so if they forget.
- Explain that you will not provide help.
- Describe the tasks and introduce the product. Ask if there are any questions before you start; then begin the observation.
- Conclude the observation. Tell them what you found out and answer any of their questions.
- Use the results.
User testing can occur at any time during the project; however, it is often more efficient to build a mock-up or prototype of the application and test that before building the real program. It is much easier to deal with a design defect before it is implemented than after. Tognazzini suggests that you need no more than three people per design iteration -- any more than that and you are just confirming problems already found.

The principle of humility -- Listen to what ordinary people have to say
Some of the most valuable insights can be gained by simply watching other people attempt to use your program. Others can come from listening to their opinions about the product. Of course, you don't have to do exactly everything they say. It is important to realize that each of you, user and developer, has only part of the picture. The ideal is to take a lot of user opinions, plus your insights as a developer, and reduce them into an elegant and seamless whole -- a design which, though it may not satisfy everyone, will satisfy the greatest needs of the greatest number of people. One must be true to one's vision.
A product built entirely from customer feedback is doomed to mediocrity, because what users want most are the features that they cannot anticipate. But a single designer's intuition about what is good and bad in an application is insufficient. Program creators are a small, and not terribly representative, subset of the general computing population. Some things designers should keep in mind about their users:
- Most people have a biased idea as to what the "average" person is like. This is because most of our interpersonal relationships are in some way self-selected. It is a rare person whose daily life brings them into contact with the full range of personality types and backgrounds. As a result, we tend to think that others think "mostly like we do," and designers are no exception.
- Most people have some sort of core competency and can be expected to perform well within that domain. The skill of using a computer (also known as "computer literacy") is actually much harder than it appears. The lack of "computer literacy" is not an indication of a lack of basic intelligence. While native intelligence does contribute to one's ability to use a computer effectively, there are other factors which seem to be just as significant, such as a love of exploring complex systems and an attitude of playful experimentation. Much of the fluency with computer interfaces derives from play -- and those who have dedicated themselves to "serious" tasks such as running a business, curing disease, or helping victims of tragedy may lack the time or patience to devote effort to it.
- A high proportion of programmers are introverts compared to the general population. This doesn't mean that they don't like people, but rather that there are specific social situations that make them uncomfortable. Many of them lack social skills and retreat into the world of logic and programming as an escape; as a result, they are not experienced people-watchers.
The best way to avoid misconceptions about users is to spend some time with them, especially while they are actually using a computer. Do this long enough and eventually you will get a "feel" for how the average non-technical person thinks. This will increase your ability to spot defects, although it will never be foolproof, and it will never be a substitute for user testing.

ERGONOMIC GUIDELINES FOR USER-INTERFACE DESIGN
The following points are guidelines for good software interface design, not an absolute set of rules to be blindly followed. These guidelines apply to the content of screens. In addition to following these guidelines, effective software also necessitates using techniques such as "storyboarding" to ensure that the flow of information from screen to screen is logical, follows user expectations, and follows task requirements.

Consistency ("Principle of least astonishment")
- certain aspects of an interface should behave in consistent ways at all times for all screens
- terminology should be consistent between screens
- icons should be consistent between screens
- colors should be consistent between screens of similar function

Simplicity
- break complex tasks into simpler tasks
- break long sequences into separate steps
- keep tasks easy by using icons, words, etc.
- use icons/objects that are familiar to the user

Human Memory Limitations
- organize information into a small number of "chunks"
- try to create short linear sequences of tasks
- don't flash important information onto the screen for brief time periods
- organize data fields to match user expectations, or to organize user input (e.g. automatically formatting phone numbers)
- provide cues/navigation aids so the user knows where they are in the software and at what stage they are in an operation
- provide reminders or warnings as appropriate
- provide ongoing feedback on what is happening and/or what has just happened
- let users recognize rather than recall information
- minimize working memory load by limiting the length of sequences and the quantity of information - avoid icon mania!

Cognitive Directness
- minimize mental transformations of information (e.g. avoid requiring 'control+shift+esc+8' to indent a paragraph)
- use meaningful icons/letters
- use appropriate visual cues, such as direction arrows
- use 'real-world' metaphors whenever possible (e.g. the desktop metaphor, folder metaphor, trash can metaphor, etc.)

Feedback
- provide informative feedback at the appropriate points
- provide appropriate articulatory feedback - feedback that confirms the physical operation you just did (e.g. you typed 'help' and 'help' appears on the screen). This includes all forms of feedback, such as auditory feedback (e.g. system beeps, mouse clicks, key clicks, etc.)
- provide appropriate semantic feedback - feedback that confirms the intention of an action (e.g. highlighting an item being chosen from a list)
- provide appropriate status indicators to show the user the progress of a lengthy operation (e.g. the copy bar when copying files, an hourglass icon when a process is being executed, etc.)

System messages
- provide user-centered wording in messages (e.g. "there was a problem in copying the file to your disk" rather than "execution error 159")
- avoid ambiguous messages (e.g. "hit 'any' key to continue" - there is no 'any' key, and there is no need to hit a key at all; reword to say "press the return key to continue")
- avoid using threatening or alarming messages (e.g. fatal error, run aborted, kill job, catastrophic error)
- use specific, constructive words in error messages (e.g. avoid general messages such as 'invalid entry' and use specifics such as 'please enter your name')
- make the system 'take the blame' for errors (e.g. "unrecognized command" rather than "illegal command")

Anthropomorphization
- don't anthropomorphize (i.e. don't attribute human characteristics to objects)
- avoid the "Have a nice day" messages from your computer

Modality
- use modes cautiously - a mode is an interface state in which what the user does has different effects than in other states (e.g. changing the shape of the cursor can indicate whether the user is in an editing mode or a browsing mode)
- minimize preemptive modes, especially irreversible ones - a preemptive mode is one where the user must complete one task before proceeding to the next, and in which other software functions are inaccessible (e.g. file save dialog boxes)
- make user actions easily reversible - use 'undo' commands, but use these sparingly
- allow escape routes from operations

Attention
- use attention-grabbing techniques cautiously (e.g. avoid overusing 'blinks' on web pages, flashing messages, 'you have mail', bold colors, etc.)
- don't use more than 4 different font sizes per screen
- use serif or sans-serif fonts as the visual task situation demands
- don't use all uppercase letters - use an uppercase/lowercase mix
- don't overuse audio or video
- use colors appropriately and make use of expectations (e.g. don't color an OK button red! use green for OK, yellow for 'caution', and red for 'danger' or 'stop')
- don't use more than 4 different colors on a screen
- don't use blue for text (it is hard to read); blue is a good background color
- don't put red text on a blue background
- use high-contrast color combinations
- use colors consistently
- use only 2 levels of intensity on a single screen
- use underlining, bold, inverse video, or other markers sparingly on text screens
- don't use more than 3 fonts on a single screen

Display issues
- maintain display inertia - make sure the screen changes little from one screen to the next within a functional task situation
- organize screen complexity
- eliminate unnecessary information
- use concise, unambiguous wording for instructions and messages
- use easy-to-recognize icons
- use a balanced screen layout - don't put too much information at the top of the screen; try to balance information in each screen quadrant
- use plenty of 'white space' around text blocks - use at least 50% white space for text screens
- group information logically
- structure the information rather than just presenting a narrative format (comprehension can be 40% faster for a structured format)

Individual differences
- accommodate individual differences in user experience (from the novice to the computer-literate)
- accommodate user preferences by allowing some degree of customization of screen layout, appearance, icons, etc.
- allow alternative forms for commands (e.g. key combinations as well as menu selections)

Web page design
Download speed is a critical aspect of web page design. Remember that when you check your pages locally in your browser you aren't experiencing normal web delays! Regardless of your modem speed, pages will only download at the rate of the slowest link in the 'chain' from the server to the browser. The following tips will help to speed downloads and aid comprehension of your web page materials:
- avoid using 'blinks' unless these are absolutely necessary - blinks are distracting; use fonts, sizes, and colors to attract attention
- keep backgrounds simple and muted
- minimize audio and video use; these really slow download times
- use animated files (e.g. animated .GIFs) sparingly
- use thumbnail .GIFs linked to larger .GIFs
- specify .GIF sizes (HEIGHT, WIDTH) - this speeds download times
- use 'ALT' text for .GIFs where only the .GIF provides the link - this provides linked text information to those browsing in text-only mode
- use image maps sparingly - they are slow and can be annoying; using an invisible table can often give similar results with much faster downloads
- use frames sparingly and consistently - use absolute widths for frames and scroll bars, avoid menus for small numbers of items, and check that users don't get stuck in a frame
- avoid 'construction signs' - web pages are meant to be dynamic and should therefore be changed/updated regularly (they are always under construction); try to tell users when content was last changed and what changes were made
- minimize the use of Java, JavaScript, and applets (e.g. ticker-tape status bars) - they are cute but often provide little useful information content and slow downloads
- remember that 50% of users have monitors of 15 inches or less running at 640 x 480 resolution, so use a maximum window width of 620 pixels or flexible window widths, and test your pages in your browser at low screen resolutions and with limited colors (256 or fewer)
- provide contact information at the home page location in your site

General principles to follow when designing any programme
A good interface will fade into the background and let the user focus on the task at hand.

Human Issues
Baeker and Buxton (pg. 40) state that the "beliefs and expectations with which she (the computer user) sits down at her terminal or personal computer are a direct result of her concept of what the computer is like and what the computer has become"; thus Hansen (cited in Shneiderman, 1986) states that one should "know the user". This includes all aspects of the user's experience of computerized systems, as well as their personal preferences.

Previous computer experience and design expectations
For example, a user who has only had experience in the Windows environment is unlikely to benefit from a DOS look and feel, even if the programme is functionally adequate for all their needs. This is vitally important when one remembers that the computer, for most users, is simply one of an array of tools that can be used to perform a certain task. If the tool is not readily accessible and easy to use, it will be discarded in preference of another.

Cultural Issues
Certain images, graphics, and language may be offensive to one group of users, and care must be taken to avoid inadvertently offending anyone on the basis of culture, race, creed, gender, or sexual orientation. Muslim users may be offended (or alienated) by popping champagne bottles, whilst indirectly comparing a Zulu user to an animal (a cartoon of a monkey) would equally offend and alienate that group. Language should be inoffensive and gender-neutral.

Differently abled users
Any computer programme may be used by people with physical challenges, e.g. the blind and the deaf. Even in settings where users with permanent disabilities are unlikely, there may be occasions when a user is temporarily disabled and still needs access to the equipment: for instance, if a hand is in a plaster cast, would the user still be able to access the information? Sound should include textual alternatives, and visual graphics should have descriptions. Colour Vision Deficiency (colour blindness) is more prevalent than one realizes; make sure that any important colour coding and contrasts take this into account. Table 1 outlines the more common discrimination confusions in fairly technical terms, whilst Fowler and Stanwick (1995, pgs. 309-310) state that "Color blindness or weakness has four basic varieties.
(i) Green blindness - individuals confuse greens, yellows, and reds (6.39 percent)
(ii) Red blindness - individuals confuse various shades of red (2.04 percent)
(iii) Blue blindness - individuals confuse blues (0.0003 percent)
(iv) Total color blindness, which affects no more than 0.005 percent of both sexes."
The Macintosh Human Interface Guidelines also warn against this problem, stating that "people with color-deficient vision wouldn't recognize the use of color to indicate selection. Therefore, you shouldn't use color as the only means of communicating important information. Color should be used redundantly.
It shouldn't be the only thing that distinguishes two objects; there should be other cues, such as text labels, shape, location, pattern, or sound." The guidelines suggest that all images should be developed in black and white first. (For more information about the use of colour, see the section headed "Colour".)

Learning Time
Nelson (cited in Baecker and Buxton, 1987) stated that "any system which cannot be well taught to a layman in ten minutes, by a tutor in the presence of a responding set-up, is too complicated". Factors that shorten the learning time include familiarity, consistency and the use of an accessible metaphor. If users can visualize the structure of a system and are able to predict the outcome of interactions, they will have more confidence, quicker interactions and a lower error rate.

Menus and selection objects
Menu systems and graphical iconic symbolization are not necessarily universally understood. Various authors point to the following guidelines when creating selection items:
(i) All graphic representations should have textual descriptions.
(ii) Consistency of terminology should apply to all options throughout the system.
(iii) Avoid the use of jargon and keep phrasing concise.
(iv) Put keywords first, since they are what the user scans.
(v) Group similar items in a menu map, or if this is not possible use other intuitive alternatives such as alphabetical order.
(vi) Avoid multiple screen traversals for selection purposes.
(vii) Avoid ambiguity.
(viii) Consistency throughout is vital.

Icon Tips
Pictorial literacy is not a given. Interpretations of graphics are often dependent on culture, experience and exposure to a specific medium (see Amory and Mars, 1994, and Andrews, 1994). One pertinent example is that arrows are not a universal symbol of direction. It is for this reason that most authorities in interface design recommend that all buttons, icons, etc. be labeled. Fowler and Stanwick (pages 57, 58) suggest that there are two standard sizes for icons: 16 pixels square and 32 pixels square. They quote William Horton's book "The Icon Book" as suggesting that "Design is easier if the grid has an odd number of pixels along each side. This is because an odd number provides a central pixel around which to focus design". They go on to state that each icon should have a label, which should be the same as (or an abbreviation of) the title of the corresponding window.

Navigation Issues
Navigation issues vary between multimedia and web pages, but the common ones include links to the first screen/page and the next screen/page, backtrack facilities, and a quick exit button for every system. See the section on the use of metaphor for commonly used buttons. All applications should have shortcuts for expert users.

Sound
All aspects of design should adhere to the principle of adding meaning: if there is no enhancement of accessibility for the user, then there is no need for the information, graphic or media to be added. Similarly, sound should only be inserted if it enhances meaning, and it should not distract the user's attention. Wherever possible, allow the user interactive control to play, stop, rewind and pause. It is also useful to be aware that some users may be disturbed by a faceless voice; many applications display a picture or video of a person when a voice recording is played.

Mixed Media
When using a combination of media, e.g.
sound, text, animation and video, be careful that the user's attention is not distracted by one or other of the media. For example, animation and sound can work well together, but animation and text presented simultaneously is likely to be distracting.

Messages and Status Reports
Concise, brief, unambiguous, clearly visible and consistently placed on screen.

Feedback
Immediate, positive and instructional.

Tone
Respect for the user and the subject material is imperative. Avoid slang, misplaced humour and potentially offensive insinuations.

Screen Layout and Design
The layout of the screen is a controversial issue; what is aesthetically pleasing to one person may be considered dull and boring or, conversely, garish to another. Novice designers should aim for elegant simplicity and consistency. It helps to divide the screen into a grid where similar types of information are consistently placed. This helps the designer form a visual sense of balance across screens, and the consistency will help the user quickly locate the important information. Users typically suffer from "cognitive overload" when too much information and too many diverse media are used simultaneously. Fonts should be legible, and care must be taken to ensure that the user's machine is likely to have a font similar to the one selected, so that there is a level of predictability in the final display. A mixture of too many fonts detracts from legibility; rather, use a maximum of two fonts and vary the sizes and weights to change the emphasis or draw attention to different areas of information. All screens should be titled, and the titles should match the names of the interactions that brought the user to the screen. White space, consistently used, can separate the screen into logical groups of information and make it more legible.

Colour
Most people involved with the development of interactive course material cannot afford the expertise and skills of a graphic design artist. This is often obvious in the end results, and if at all possible it is recommended that a graphic artist be included in the team of developers. However, for those in the unfortunate position of a "do or die" scenario, the following advice may assist. Most authors suggest the use of a maximum of four colours. Use colours to colour-code similar items, but remember that colour coding is only useful if the user knows the code (red = stop, green = go); the metaphor should be a familiar one to the users, otherwise lengthy explanations are necessary and counterproductive. Also, colours are often used to depict particular items (e.g. in medical illustrations red is used to depict arteries and yellow to depict nerves); switching or changing these colours could be confusing for the user. In dense screens colour coding can assist the user to identify grouped material - choose your colours carefully so as to accommodate people with colour discrimination deficiencies as far as possible. If material is to be printed by the user, remember to design graphics with patterns as well as colour coding, since most people only have access to black and white printers. Consider contrasts carefully. If you have a dark background, use light foregrounds (this combination is good for long-distance viewing such as slide shows or projected computer screens). Use light backgrounds and dark foregrounds for situations with high ambient light, e.g. overhead projectors.
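This contrast advice can be made quantitative. The sketch below uses the relative-luminance contrast ratio later standardized in the W3C's WCAG guidelines - offered here as an illustrative aid, not something the authors cited above prescribe - to score foreground/background pairs:

```typescript
// Contrast ratio between two sRGB colours, using the WCAG relative-luminance
// formula. Higher is more legible; roughly 4.5:1 or better suits body text.
type RGB = [number, number, number]; // channel values 0-255

function luminance([r, g, b]: RGB): number {
  const lin = (c: number) => {
    const s = c / 255; // linearize each gamma-encoded channel
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

function contrastRatio(a: RGB, b: RGB): number {
  const [hi, lo] = [luminance(a), luminance(b)].sort((x, y) => y - x);
  return (hi + 0.05) / (lo + 0.05);
}

console.log(contrastRatio([0, 0, 0], [255, 255, 255]).toFixed(1)); // 21.0
console.log(contrastRatio([255, 0, 0], [0, 0, 255]).toFixed(1));   // ~2.1
```

Note how the red-on-blue pairing warned against earlier scores barely a tenth of the black-on-white ratio; the guideline and the number express the same fact.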
Note that different wavelengths of colour come into focus at different points in the eye (see Figure 3); it is difficult for people to focus on red and blue simultaneously.

Table 1: Colour confusions commonly perceived by people suffering from colour vision deficiencies (adapted from Travis, 1991, pg. 59).

Type of Defect | Incidence in % | Typical Confusions | White Matches
Achromatopsia | 0.003 | all colours look like shades of grey | many colours
Protanopia | 1 | bluish-green & brown; green, olive, tan & red-orange; blue & red-purple; violet & purple | blue-green
Deuteranopia | 1 | dull green & pink; olive & brown; yellow-green & red-orange; greenish-blue, dull blue & purple | blue-green
Tritanopia | 0.004 | green & greenish blue; oranges & red-purples | yellow-orange

The use of metaphor in interface design
Imposing a metaphor on a virtual world allows the user to better predict the outcomes of new interactions. It also allows the designer to work with a model which will guide the development towards consistency of interactions and representations. Obvious metaphors are the "desktop" for office automation software, and the "paint brush and easel" for graphics packages. Care should be taken that the analogy is familiar from the users' experience of the "real world" and similar enough to be incorporated without excessive explanation. Another common metaphor for navigational buttons is the set of VCR or tape deck buttons, which are familiar to most users: Forward, Back, Fast forward, Rewind and Stop.

Interactivity
Interactivity has been lauded as the most promising development in CAL since the euphoria over AI collapsed. However, interactivity should be more than a simple point-and-click scenario. Truly interactive systems based on a constructivist approach would include drag and drop, text entries and other forms of interaction to develop a user's knowledge of the subject material.

Learning Styles
Individuals typically have their own preferences in the way that they perceive, collect and process information. These preferences are referred to as "learning styles". The Academic Skills Center at Western Michigan University offers the following breakdown of learning styles:
- Print - learns through reading (allow printouts for these students).
- Aural - learns by listening, and will enjoy audio tapes and listening to what other learners have to say (voice-over will assist these users).
- Interactive - enjoys discussions with other students on a one-to-one basis or in small groups (CMC would assist many of these students).
- Visual - learns by looking at pictures, graphs, slides, demonstrations and films (colour coding will work well with these students).
- Haptic - learns through the sense of touch (drag-and-drop interactions could help here).
- Kinesthetic - learns through movement (animation could help students with this preference).
- Olfactory - uses the sense of smell in learning (any ideas?).

Learners will not typically use only one of the above but a combination of them, favouring one method over another; e.g. some learners work well in a group environment using visual and interactive learning styles, whilst others prefer to learn on their own, but still use a visual style. Many, although not all, of the above can be catered for in the development of interactive multimedia course material.

Instructional Events
Gagne (1973, p. 303) states that "control of the external events in the learning situation is what is typically meant by the word 'instruction'".
He then lists these events as:
- Gaining and controlling attention.
- Informing the learner of expected outcomes.
- Stimulating recall of relevant prerequisite capabilities.
- Presenting the stimuli inherent to the learning task.
- Offering guidance for learning.
- Providing feedback.
- Appraising performance.
- Making provisions for transferability.
- Ensuring retention.

Importance of HCI
Users expect highly effective and easy-to-learn interfaces, and developers now realize the crucial role the interface plays. Surveys show that over 50% of the design and programming effort on projects is devoted to the user interface portion. The human-computer interface is critical to the success of products in the marketplace, as well as to the safety, usefulness, and pleasure of using computer-based systems. There is substantial empirical evidence that employing the processes, techniques, and tools developed by the HCI community can dramatically decrease costs and increase productivity. For example, one study reported savings, due to the use of usability engineering, of $41,700 in a small application used by 23,000 marketing personnel, and of $6,800,000 for a large business application used by 240,000 employees. Savings were attributed to decreased task time, fewer errors, greatly reduced user disruption, reduced burden on support staff, elimination of training, and avoidance of changes in software after release. Another analysis estimates the mean benefit of finding each usability problem at $19,300. A usability analysis of a proposed workstation saved a telephone company $2 million per year in operating costs. A mathematical model based on eleven studies suggests that using software that has undergone thorough usability engineering will save a small project $39,000, a medium project $613,000 and a large project $8,200,000. By estimating all the costs associated with usability engineering, another study found that the benefits can be up to 5000 times the cost. (A sketch of this kind of cost-benefit arithmetic appears at the end of this section.)

There are also well-known catastrophes that have resulted from not paying enough attention to the human-computer interface. For example, the complicated user interface of the Aegis tracking system was a contributing cause of the erroneous downing of an Iranian passenger plane, and the USS Stark's inability to cope with Iraqi Exocet missiles was partly attributed to the human-computer interface. Problems with the interfaces of military and commercial airplane cockpits have been named as a likely cause of several crashes, including the Cali crash of December 1995. Sometimes the implementation of the user interface can be at fault: a number of people died from radiation overdoses partially as a result of faulty cursor-handling code in the Therac-25.

Effective user interfaces to complex applications are indispensable. The recognition of their importance in other disciplines is increasing, and with it the interdisciplinary collaboration needed to fully address many challenging research problems. For example, for artificial intelligence technologies such as agents, speech, and learning and adaptive systems, effective interfaces are fundamental to general acceptance. HCI subdisciplines such as information visualization and algorithm animation are used in computational geometry, databases, information retrieval, parallel and distributed computation, electronic commerce and digital libraries, and education.
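The arithmetic behind savings figures like those quoted above is easy to reproduce. The sketch below uses entirely hypothetical inputs (the cited studies applied the same style of calculation to measured values), showing how small per-task savings scale across a large user population:

```typescript
// Rough usability cost-benefit model; every input here is hypothetical.
interface Estimate {
  users: number;        // employees using the application
  tasksPerDay: number;  // affected tasks per user per day
  secondsSaved: number; // time saved per task after the redesign
  workDays: number;     // working days per year
  hourlyCost: number;   // loaded cost of an employee-hour, in dollars
}

function annualSavings(e: Estimate): number {
  const hoursSaved =
    (e.users * e.tasksPerDay * e.secondsSaved * e.workDays) / 3600;
  return hoursSaved * e.hourlyCost;
}

// 23,000 users saving 3 seconds on one daily task adds up quickly:
const est = { users: 23000, tasksPerDay: 1, secondsSaved: 3, workDays: 230, hourlyCost: 25 };
console.log(annualSavings(est).toFixed(0)); // 110208 -> about $110,000 per year
```

Even a pessimistic per-task saving becomes large when multiplied across thousands of users, which is why the reported figures grow so sharply with project size.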
HCI requirements resulting from multimedia, distributed computing, real-time graphics, multimodal input and output, ubiquitous computing, and other new interface technologies shape the research problems currently being investigated in disciplines such as operating systems, databases, and networking. New programming languages such as Java result from the need to program new types of distributed interfaces on multiple platforms. As more and more of software designers' time and code is devoted to the user interface, software engineering must increase its focus on HCI.

Differences between locally presented multimedia course material and World Wide Web delivered material
There are a number of subtle differences between the interface for locally presented multimedia course material and that for material delivered via the WWW.

Response Time
Probably the most significant difference comes about as a result of the difference in response time. Locally delivered material can usually rely on quick response and display times, whilst internet-delivered material has a slow response time. As users generally do not like to wait for information, internet material should be more detailed and lengthy than locally delivered material, so that each delay yields more content. This has particular relevance to menus and navigational issues; Shneiderman (page 106) states that "deep menu trees or complex traversals become annoying to the user if the system's response time is slow, resulting in long and multiple delays. With slow display rates, lengthy menus become annoying because of the volume of text that must be displayed. In positive terms, if the response time is long, then create menus with more items on each menu to reduce the number of menus necessary. If the display rate is slow, create menus with fewer items to reduce the display time." It is important to ensure that colour graphics do not unnecessarily slow down the display of information. Web pages are particularly prone to slow response rates if large graphic files are used. Similarly, in the development of multimedia CAL, care should be taken to reduce the number of colours in a graphic file to 256, as this allows quicker display times and compatibility with most computer colour monitors. However, the rendering of colours varies from monitor to monitor, and the visual implications should be tested on as many different display screens as possible.

CHAPTER FIVE
HCI AND WEB DESIGN
Problems and Promises

In this chapter, we will examine the relationship between the activity of designing information sites for the World Wide Web and the field of Human-Computer Interaction. From the perspective of HCI, web site design offers some interesting problems that are not present in the creation of traditional, stand-alone software products. Because of the recency of the WWW's rise to prominence, HCI is only now beginning to address these new issues. A challenge for the field, therefore, will be to rigorously examine the process of web site design and offer recommendations and guidelines, as it has for the areas of software and hypermedia publishing. That such counsel is needed by web site designers becomes readily apparent when looking at the multitude of badly conceived and poorly designed sites now populating the web. As Borges and his collaborators point out, "the proliferation of pages with poor usability suggests that most of the designers of WWW pages have little knowledge of user interface design and usability engineering.
This is a serious problem that needs to be addressed...". There are, in fact, a great number of guidelines currently published on the WWW offering advice on how to design effective and pleasing sites. Unfortunately, very few of these are grounded in the theories or empirical data that have been developed in HCI. In fact, as Morris notes, "at this point, HCI as a discipline has had a relatively limited impact upon the development of the web". It is our contention, however, that it is precisely the field of HCI that has the most to offer web site designers. Therefore, part of this chapter will be devoted to examining the areas of the HCI literature that might be of most use to the individuals who are creating and maintaining web sites.

This chapter is divided into two main parts. In the first part, we will identify some of the new and unique issues that designing for the medium of the web presents to the field of HCI. In the second part, we will discuss areas of the HCI literature that are particularly useful to web designers and propose a method for web site design that is based upon them.

Issues in HCI design in the Web Medium

Web and Traditional Software Design
The question can be raised as to how similar the activity of designing World Wide Web sites is to the design of more "traditional" software and hypermedia products. The very fact that we are attempting to relate web design to the established HCI literature suggests that we believe there are important similarities between designing for the web and designing other types of software. Yet there are obviously some important differences as well -- differences that the field of HCI is only beginning to consider. The most obvious dissimilarities involve the levels of technical knowledge necessary for design, and the types of entities that carry out the design process. While the creation of traditional "stand-alone" software applications requires extensive technical expertise, and is largely the province of specialized companies, designing web sites requires relatively little technical knowledge and can easily be done by almost anyone. But such surface distinctions, while important to note, are not what primarily concerns us. Rather, we are more interested in how the medium of the World Wide Web presents a set of challenges and issues to designers that are different from those presented to creators of traditional software products. Although there are undoubtedly some similarities in the process of creating web sites and stand-alone software, there are also some significant variations that result from the distinct characteristics of the media they are intended for. Put simply, the WWW is a very different environment from a single computer system or limited network, and designing applications to be displayed on it presents the designer with a number of unique issues to consider. Perhaps the most fundamental aspect of the web medium that designers must come to terms with is that it is platform independent, which means that materials on the web can be accessed by a wide variety of computer systems and browser software.
Because the WWW is system and browser independent, and because different systems and browsers have varying capabilities and features, the designer of a web site does not know and cannot control: 1) how their pages will be visibly rendered for any particular user (e.g., a pleasing and coherent layout on one system/browser may look terrible and be confusing on another); nor 2) what functionality of the site will be supported by the configurations of different users (e.g., important layout features like tables may not work in all browsers). Thus, designers of web sites have to account for the fact that they will have only a limited amount of control over the interface that their site will present to a visitor. As Simon Shum notes, "there has never been a hypertext system that was so large that no one could be sure what hardware or software the end users might be using. The user interface design community has had to get to grips with the concept of designing with this uncertainty." Creators of sites who want their work to be accessible and usable to a wide audience either have to design it in a way that will allow all major systems/browsers to view it effectively (designing for the "lowest common denominator"), or they have to consider providing different versions of the same site that are optimized for different types of users. While the former option may be unacceptable for designers who want to incorporate the latest technological advances into their sites, and the latter option requires extra work on the part of designers (who would have to maintain multiple versions of the same site), these are really the only options for dealing with the uncertainty caused by the platform-independent nature of the WWW.

Level of Interface to the User
A second unique feature that has to be considered by designers is that web pages represent "third-level interfaces" for a user: above the level of the individual web page, a user is also interacting with browser software and an operating system, each of which provides its own interface to the user. The most important levels to focus on, for our purposes, are those of the browser and of the individual web sites/pages. A web site, as experienced by a visitor, really has a dual interface: one provided by their browser software, and the other provided by the site designer. Both the browser and the site levels are important, in that each provides mechanisms that determine how a user will interact with the site and how they will navigate it. Browsers, for their part, display the individual web pages and provide at least a minimal set of navigation options for the user. Different browsers, however, vary in their capabilities for visually rendering pages and supporting other features -- ranging from the text-only capabilities of the Lynx browser to more advanced software packages like the latest versions of Netscape Navigator and Microsoft Explorer, which support a wide variety of media types (text, images, video, audio) and features (Java, JavaScript, VBScript, tables, etc.). Browsers also vary in the navigation mechanisms that they offer to users. While all browsers support basic backtracking and jumping movements, the more advanced browsers also incorporate features identified in the hypertext literature as aiding navigation - features like history lists, bookmarking, and footprinting.
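One practical response to this uncertainty is defensive feature-testing: probe for an optional capability before relying on it, and fall back to something universally supported when the probe fails. The browser-side sketch below illustrates the pattern; it uses a modern capability (localStorage) as a stand-in for the optional features of the era, and the rememberLocation helper is purely hypothetical:

```typescript
// Probe for an optional capability instead of assuming it exists.
function supportsLocalStorage(): boolean {
  try {
    window.localStorage.setItem("probe", "1");
    window.localStorage.removeItem("probe");
    return true;
  } catch {
    return false; // the capability is missing or disabled
  }
}

// Hypothetical helper: use the richer mechanism when available, otherwise
// degrade to plain HTML that every browser can render.
function rememberLocation(url: string): void {
  if (supportsLocalStorage()) {
    window.localStorage.setItem("lastVisited", url);
  } else {
    document.body.insertAdjacentHTML(
      "beforeend",
      `<p>Bookmark this page: <a href="${url}">${url}</a></p>`
    );
  }
}
```

The same degrade-gracefully logic applies to tables, frames, or scripting: the page should remain usable when the probe fails, which is the programmatic version of designing for the "lowest common denominator".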
At the level of the individual web site, user navigation is affected by the access mechanisms that are presented (e.g., site overview maps, tables of contents, navigation bars, etc.), as well as by the hypertext links embedded within the pages. Because the "dual interface" will affect users' interaction with and navigation through a web site, and because the platform-independent nature of the WWW means that site designers cannot know which types of systems and browsers will access their sites, the creators of these sites have only limited control over the user interface that will be presented to a visitor. Site designers therefore need to carefully consider a number of issues regarding the functionality and navigation facilities which their site provides, and how these will relate to and be affected by a variety of browser platforms.

Access Speed
A third unique issue that confronts designers of web sites relates to the question of access speed. Because access to a web site comes via a connection to the global Internet, and is therefore affected by bandwidth constraints and network traffic, users of the WWW will likely experience some (greater or lesser) delay in the system's response to their actions. This can cause a number of problems. Slow connections, whatever their cause, not only serve to frustrate users -- and increase the chance that they will abandon a site if it is responding too slowly -- but also delay feedback to the user. Because connections to web sites are typically asynchronous, the system will respond to a user only after she takes some action, and if there is too great a delay between action and reaction, confusion, anxiety, or frustration may result. Discussing hypermedia systems, Jakob Nielsen notes that "...the response time for the display of the destination node is critical for the user's feeling of navigating an information space freely". If the connection to a particular site is slow, users may feel that they are not fully in control. While the need for adequate speed is largely taken for granted in most software application development, and in usability research on these products, it is an important issue that faces web users and designers alike. Although some of the factors that affect access time, such as the user's connection speed and network traffic levels, are beyond the control of web designers, there are obviously some steps that site creators can take to minimize the potential difficulties. In general, web pages that are smaller and less graphically intensive will load faster than those which are larger and more graphically rich. Web designers, therefore, can ensure that their sites will be accessed as quickly as possible by keeping the file size of their pages fairly low. But such a solution may not always be considered optimal by designers, who might want to capitalize upon the multimedia capabilities that the WWW offers. Thus, trade-offs are inevitable, and there is no single best solution for every case. Such trade-offs between access speed and presentation are a much less important issue for developers of other software products, and as Shum notes, "web designers must therefore prioritise different criteria to ones they might use in designing a smaller scale hypertext or multimedia CD-ROM, in order to balance interactivity with acceptable speed of access".
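Because page weight is the one factor the designer fully controls, it is worth estimating worst-case download times during design. A back-of-the-envelope sketch, with a hypothetical page inventory and the connection speed given in bits per second:

```typescript
// Lower-bound download estimate: protocol overhead and congestion make real
// transfers slower, so treat the result as a best case.
interface Page {
  htmlBytes: number;
  imageBytes: number[]; // one entry per inline image
}

function downloadSeconds(page: Page, bitsPerSecond: number): number {
  const totalBytes =
    page.htmlBytes + page.imageBytes.reduce((sum, b) => sum + b, 0);
  return (totalBytes * 8) / bitsPerSecond;
}

// A 30 KB page with three 25 KB images over a 28.8 kbps modem:
const page = { htmlBytes: 30000, imageBytes: [25000, 25000, 25000] };
console.log(downloadSeconds(page, 28800).toFixed(1)); // 29.2 seconds
```

Nearly half a minute for a modest page makes the earlier advice about thumbnails and restrained graphics concrete.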
Interface Tools
The issues of platform independence, dual user interfaces, and access time all pose challenges to web authors, who must carefully consider the issues raised by these factors when deciding how best to design their sites. Unfortunately, they are also faced with the additional problem of having a much more limited set of interface tools to work with. Compared to the range of tools and techniques available to authors of stand-alone software applications, the web designer has a relatively primitive set of resources. According to Richard Miller, "HTML's limited set of objects and interaction styles is a step backwards for interface design compared to the growth of interactive computing over the last 30 years". Not only do web designers have fewer interface widgets at their disposal, but the nature of the web medium also makes it difficult or impossible to tightly couple relationships between interface elements, or to utilize some navigation aids identified as beneficial in hypermedia research (such as multiple windowing, user annotation, zooming, etc.). Thus, web site designers are faced not only with a lack of control over the interfaces that their sites present, but also with fewer resources to draw upon to maximize the potential of those interfaces.

Nature of the Web
The final special issue regarding web site design to be discussed is how the dynamic nature of web sites affects their creation. Whereas the first four issues examined all present problems to the web designer, the dynamism inherent in the WWW may actually prove advantageous for these authors. In the case of traditional software development, the design cycle is fairly well bounded, and when the product is released to the public, there is little or nothing that can be done to change it. This places a great burden on the development team, who must ensure that the product meets all of its predefined requirements and is relatively bug-free before it can be released. If problems arise afterwards, they can only be remedied through costly and time-consuming methods, and significant changes to the product may have to wait until the next version is developed. Web sites, however, are much easier to change after they have been "released" to the public. While this does not mean that site creators can afford to be lax in their initial design efforts, it does mean that if problems with the site become apparent after it has been mounted, they are relatively easy to fix. The iterative design cycle for web sites can thus be much less bounded, and may continue after the site is implemented in order to modify problem areas. In fact, because of the dynamic nature of the web medium, it is probable that a site will undergo constant revision and change. While this offers site designers a greater degree of flexibility, some care needs to be taken to make sure that the site is not changed so often or so much as to create confusion among repeat visitors.

The five issues discussed above all relate to differences that exist between designing web sites and traditional software applications. Although these issues may present special conditions that web site designers must consider, the discussion was not intended to imply that designing for the web is a more difficult process than creating other forms of software. In fact, by almost any measure, web authoring is a much simpler task than creating stand-alone software products.
The above discussion was merely intended to highlight the fact that the process of creating web sites is in some ways unique, and that designers in this medium are faced with different types of considerations than those faced by individuals in the software industry. To be sure, there are also some common considerations that creators in both fields face, such as how to structure the design process, how to construct a meaningful navigation system for a hyperspace, and how to create a usable interface. This discussion of the differences between web authoring and traditional software publishing, therefore, should not suggest that the existing areas of the HCI literature which are oriented toward "traditional" software issues are not useful to web designers. In fact, there are many areas within the HCI field that have a great deal to offer web designers, and with the growing importance of the WWW, more attention within the HCI community has been directed at this new medium. Too many individuals who are currently producing web sites seem to feel that this activity is somehow sui generis, and has little to learn from the body of accumulated knowledge about such issues as design methodology, hypermedia development, and interface design. While we agree that the web medium is in some ways unique, we would reject any contention that designing for the web is so different as to render existing work in the field of HCI irrelevant to it. In fact, it is apparent that individuals who produce web sites should be more familiar with what the field of HCI has to offer. The question can be raised, then, as to which areas of HCI are most relevant to web designers. The following section of this chapter will address this issue.

Areas in HCI that are important to Web Design
A blanket statement such as "the HCI literature is important for web designers" is not very useful, because the field itself is so broad and varied. Although arguments could be made for including many different strands of HCI in a discussion of relevant areas for web design, we will discuss only four areas that we think are particularly significant: the literatures on software design methodology, hypermedia, user interface design, and usability. Before moving on to discuss these areas, a few caveats are in order. First, given the scope of this chapter and the fact that we are addressing several different segments of HCI, our review of the literature in these areas will be fairly selective; we make no pretension of having thoroughly surveyed these four areas. Also, we have attempted, to the degree that is possible, to include fairly recent works that are explicitly oriented toward issues involving the WWW. Finally, we have included a few relevant works that are outside of the HCI field, strictly defined. Our discussion of the literature will not be formally segmented into different sections; instead, we will examine various relevant threads in the course of proposing a method for designing web sites that is based upon our interpretation of these literatures.

Although we have spent a considerable amount of time identifying some ways in which web design differs from other types of software development, the general processes involved in this activity can be similar to those employed by authors of traditional software. Levi and Conrad argue that building web sites "can and should be viewed as a major software development effort....
The life cycle of web creation is identical to that of traditional software: requirements gathering, analysis, design, implementation, testing, and deployment". Although they do not identify it specifically by name, it seems apparent that the general type of methodology that they see as being suited to web design is the User-Centered Design (UCD) approach. We would concur that a design effort for the web would be well served by employing a UCD perspective, but would argue that it should be specifically tailored to take into account the particular types of tasks required for authoring a hypermedia application. While the general UCD approach is fairly generic, therefore lending itself to a wide range of projects and design sequences, we believe that it is also flexible enough to be applied to different types of design efforts. Before suggesting such specifications for a UCD approach to be used in the context of web development, however, we will identify the basic aspects of the user-centered design process that we feel make it particularly valuable for web site creators. Then we will examine in greater detail the specific stages of the web design approach that we believe is most valuable, drawing on the different areas of the HCI literature that were identified above for support.

The main strength of the UCD approach, in our opinion, is that it represents a set of general principles that underlie the process of design, rather than any specific sequence of tasks to be carried out. These general principles include an early and continuous focus upon users and their requirements, an iterative approach that intersperses design efforts and user testing throughout various stages of the development cycle, and an emphasis upon operational criteria for usability assessments. While such a philosophical underpinning can lend itself to different types of design-phase sequences, the UCD approach is often used with the fairly standard software design process of requirements analysis, design, implementation, testing, and maintenance. In general, this type of process can be useful to employ in the task of designing web sites. Some modifications should be made in a few areas, however, to recognize the specific challenges involved in creating a hypermedia information product, to emphasize the value of user testing throughout the design process, and to recognize that web design is often carried out in different contexts and by different types of individuals than is the case with traditional software products.

The earliest stages of designing a web site should involve a modified form of the requirements analysis suggested by the basic software design model. As the general principles of the UCD approach suggest, much of the emphasis here should be devoted to identifying the prospective audience for the site and specifying what their needs may be. Given the distributed nature of the WWW, and the fact that the audience for a particular site can conceivably be very broad, it is likely that this task can be carried out only at the level of generalities. But as Shneiderman points out, even when broad user communities are anticipated, there are usually underlying assumptions about who the primary audiences may be, and it is helpful to make these assumptions explicit. After identifying potential users, it is also helpful to assess what kinds of tasks they will likely want or need to perform when visiting the web site. How this is to be done is a matter of some controversy.
In the development of many traditional software products, a formal task analysis is carried out, and some authors writing about web site design, such as Rice et al., seem to favor such an approach. Other works on software development, however, suggest that task analysis can be carried out in a more informal manner, utilizing methods such as user observation or imagined scenarios. We believe that a formalized approach to task analysis is unlikely to be widely practical or appealing to the web design community. As Dillon and McKnight note, "...the fact that hypermedia-based interfaces are frequently being used in novel applications renders it very difficult to perform formal task analysis, specifically in the context of usage, or elicit user requirements to any degree of precision". While Dillon and McKnight were not discussing the WWW specifically, the extremely distributed nature of the web's user population should only amplify their sentiments. Beyond the fact that the potentially broad nature of web site audiences makes it hard or impossible to conduct formal task analysis upon users, the level of specialized knowledge required for utilizing this method is likely to be absent in many real-world cases of web design. Thus, more informal methods of identifying user requirements may be a more realistic alternative. While the identification of potential users and their tasks should be an important element of the early stage of web site development, care must be taken to consider the goals and requirements of the site's stakeholders as well. Taken in tandem with the information gained through an analysis of users and their tasks, the articulation of the site owner's purposes should help designers identify the basic information content to be included in the site and the types of features that will need to be incorporated into the design. Such preparatory work is important to provide a firm foundation for the subsequent design phases in the site's development.

The actual design stage for a web site should be carried out in line with the general principles of the UCD approach. In other words, the process should be an iterative one that involves developing and testing prototypes at various stages, and the results of these tests should be fed back into the design efforts. But the generalized model of software development identified above, which portrays design as a sort of undifferentiated stage, is not very helpful here, as it provides little guidance about what types of tasks need to be carried out to effectively design a web site's architecture and interface. It is in this respect that the Object-Oriented Hypermedia Design Method (OOHDM) proposed by Schwabe et al. seems particularly useful to consider. Schwabe and his collaborators contend that designing a web site is tantamount to designing a hypermedia application, and believe that their OOHDM model is directly applicable to this process. Their model is partially compatible with a UCD approach, in that the different stages of the design process are "performed in a mix of incremental, iterative, and prototype-based development styles". The central core of their methodology, however, is based upon formal modeling, and they eschew the type of user testing that we believe is important to include in web site development.
Nonetheless, the fact that this model is based explicitly upon the specific requirements of hypermedia development, and the general structure of the design process that they set out, make this method important to consider. For our purposes, the most valuable and interesting aspect of OOHDM is that it breaks the design process into separate "activities," each of which focuses on a different aspect of an application's architecture or interface: conceptual design, navigational design, and abstract interface design (which are followed by implementation). This general structure, and the specific types of concerns and "products" that they identify as the foci of their different "activities," are very useful to a web design process, and can, we believe, be incorporated within a generalized user-centered approach. But the specific modeling techniques which they employ are probably less practical in the context of the web design community, which seems to be largely composed of individuals who are not HCI experts. Therefore, we propose to keep the "outer shell" of the OOHDM model and incorporate it within a user-centered design approach, while jettisoning the methodological core of formal modeling. It should be recognized, therefore, that the discussion of the various phases of the design process that follows represents our own adaptation of the basic OOHDM structure within a user-centered approach.

The first activity suggested by the OOHDM model is conceptual design. In this stage, the basic topography of the web site begins to be specified. The earlier work carried out in the requirements analysis stage should have identified the basic information content of the web site; the primary task at this stage is to organize this content into meaningful and understandable categories. General issues that need to be addressed in this phase are what types of information should be grouped together and how to organize these groupings within some coherent categorization scheme. More specific issues may involve decisions on page length (whether to divide related content into fewer but longer pages, or more but shorter pages) and the labels to be applied to the categories that have been identified. The product of these efforts will be the identification and specification of the information nodes that will constitute the core of the web site. Even in this early stage of design, it is a good idea to conduct user tests, for as Miller notes, "...the earlier one starts [testing], the larger the payoffs in time savings and user satisfaction". It is quite possible that the designers may have grouped information and created categories in ways that do not make sense to potential users, and their assumptions should therefore be tested. The terminology adopted by designers also needs to be examined, because as researchers like Gray have found, users often understand categories and words to have meanings other than the ones the author intended them to have. One method that can be used as a test of conceptual clarity and terminology is card sorting. According to Nielsen, "card sorting is a common usability technique that is often used to discover users' mental models of an information space". In designing the internal web site for Sun Microsystems, Nielsen used this method with a small number of individuals to examine how they think information should be grouped together and what labels they feel should be applied to the groupings (a sketch of how such sorts are typically aggregated follows below).
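Card-sort results are commonly summarized by counting how often each pair of items lands in the same group across participants, yielding a similarity score that can guide the groupings and labels discussed above. A minimal sketch, with hypothetical item names:

```typescript
// Pairwise similarity from card sorts: the fraction of participants who
// placed both items in the same group.
type Sort = string[][]; // one participant's grouping of item names

function similarity(sorts: Sort[], a: string, b: string): number {
  let together = 0;
  for (const groups of sorts) {
    if (groups.some((g) => g.includes(a) && g.includes(b))) together++;
  }
  return together / sorts.length;
}

// Two hypothetical participants sorting four site topics:
const sorts: Sort[] = [
  [["pricing", "ordering"], ["support", "contact"]],
  [["pricing", "ordering", "support"], ["contact"]],
];
console.log(similarity(sorts, "pricing", "ordering")); // 1.0 - always together
console.log(similarity(sorts, "support", "contact"));  // 0.5 - opinions split
```

Pairs that most participants keep together are strong candidates for the same category; pairs that split opinion flag places where the designers' scheme needs rethinking or further testing.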
If users have rather different ideas from the designers about how information should be organized within the site (or how items should be labeled), the designers should reconsider their initial categorizations and redesign as they feel necessary.

While the conceptual design phase begins to provide the site with some organization, by virtue of preparing the information nodes that will be offered, the second stage of structural and navigational design shapes the way that these nodes will be related to each other and identifies the means by which the site's structure will be made apparent and accessible to visitors. There are two primary types of tasks that should be carried out in this stage. First, the designers need to establish the basic structure and relationships of the information categories identified in the conceptual design phase, and determine how the various nodes will be connected. Decisions have to be made about what type of organizational structure will be imposed upon the site, whether it be linear, hierarchical, or some other form. Such decisions may be influenced by the predetermined purposes of the site and the expected types of tasks that prospective users will perform, as different kinds of structures lend themselves better to different tasks. Identifying the basic structure of the site will also allow designers to plan the relationships of categories at both the global level (relations between different levels of categories) and the local level (relations between nodes within similar levels), and connect them accordingly.

After the primary structural framework of the site has been specified, the designers then need to decide how the topography of the information space will be made apparent and accessible to visitors. Once site creators have developed a model of the information in a site, they should begin to prepare navigation tools that will clarify its organization. This is a critical task, because as research has shown, users of hypertext systems can often suffer from the problems of disorientation and large cognitive overhead. Since users may have trouble understanding the structure of the hyperspace they are in, and since electronic text often suffers from a problem of homogeneity [61], designers need to take care to make the organization of their site explicit to visitors, and to provide mechanisms that will allow users to understand their present location and successfully navigate throughout the site. These issues can be addressed by determining what types of access structures and navigation tools will be provided to visitors. As was mentioned earlier, all web browsers provide at least minimal navigation support (backtracking and jumping), and some of the more popular versions also provide more advanced options as well (history lists, bookmarking). While these mechanisms can be of use to a visitor, the site designer cannot count on any particular range of features (except for the most basic ones offered in all browsers) being available or understandable to the individuals who are viewing their site. Designers must focus on what they can control, and therefore must develop a suite of access structures and navigation aids that are clear and accessible to all visitors, independent of the particular software they are using to access the site. Consulting the HCI literature, particularly in the area of hypertext development, can offer some guidance to site creators on what types of mechanisms can be adopted.
Thuring et al., for example, argue that designers can help increase the coherence of a site for the user and convey the structure of a hyperspace by providing a graphical overview, and such a mechanism is widely cited in the literature as being of value. But because not all users will be able (or choose) to view graphics, other types of access mechanisms should also be provided; it has been suggested that designers employ an array of navigation devices, including detailed, text-based tables of contents and topical indexes. Overviews, tables of contents, and indexes can help a visitor develop a sense of the site's organization and structure, and provide means for them to navigate to desired locations. Designers should also consider how to develop more localized navigation tools to be used on individual pages as well. Providing users with well designed navigation bars on pages can help them maintain a sense of location and context, while also providing them with an important means to move freely throughout the information space. In order to be useful to visitors, however, such tools need to be created so that they are predictable and consistent in the ways that they can be used and the results that they produce.

As was the case with the first stage of the design process, the structural and navigational design phase should be accompanied by user testing. Basic questions that should be addressed in the testing are whether potential users understand the overall structure of the site, whether they can find information in the site, and whether they can effectively navigate between different sections of the site. The tests might be conducted in a free, exploratory fashion, in which users are allowed to determine their own course of action, and designers look for areas of user confusion, slow-down, or mistakes. Or the users can be given specific scenarios and tasks to accomplish, with designers gauging how well they performed. In either case, designers will probably want to ask users to "think aloud" while they work so that their thoughts are made explicit. Because the site's interface has not yet been developed, the tests will likely have to be conducted through the use of paper prototypes. The use of such "lo-fi" prototypes is widely accepted as a valid technique for usability testing, and as Nielsen points out, "for some projects, user reactions to prototypes with few or even no working features can give you significant insight into the usability of your design". When using paper prototypes, however, testers must take care to "...explain the limitations and missing features to users. Once this is clear, you can learn a lot from user interaction with what is there - and learn what their expectations are for what's not". If users experience significant problems with the design that is presented to them in these prototypes, the creators of the site need to make the necessary adjustments and test their revisions accordingly.

The final stage of the design process is interface design. While the earlier phases of the site's development have specified its content, organization, and structure, the site still does not have a "face" to present to a visitor. Developing the "look and feel" of the site takes place in this stage. There are actually a number of different types of tasks that have to be performed here: interface elements (including things like icons, buttons, graphics, etc.)
have to be created and selected, basic features of the site (forms, search engines, applets, etc.) have to be incorporated, and all of these things -- along with the basic information content -- need to be combined in detailed page layouts. It is likely that this stage will be carried out in an iterative fashion, in which successively more detailed and specified interfaces are developed, instead of trying to produce a "final" interface all at once. As Nielsen notes, "current practice in usability engineering is to refine user interfaces iteratively since one cannot design them exactly right the first time around".

In this chapter, we have examined the activity of designing World Wide Web sites and how it relates to the field of Human-Computer Interaction. Although we discussed at some length the ways in which the medium of the web presents unique challenges to designers -- challenges not yet adequately addressed in the HCI literature -- we have also attempted to demonstrate that the process of developing web sites can be grounded within the existing body of work in this field. In doing so, we have proposed a method for creating web sites that builds upon several strands from within the HCI literature. Whether or not this particular model is useful to the people who are actually designing web sites, it is important that these individuals become more aware of what the field of HCI has to offer them, for the vast potential of this exciting new medium is being threatened by the proliferation of confusing and unusable sites. Simon Buckingham Shum feels that "...the Web, as the fastest growing interactive system in the world, offers a golden opportunity for HCI to make a difference". And as the web becomes increasingly important as a means of communication, information sharing, and commerce, we believe that HCI will begin to have a larger impact upon the web design community. The stakes will be too high for this field to be ignored.

CHAPTER SIX
CURRENT RESEARCH (UP-AND-COMING AREAS)

Gesture Recognition
A primary goal of gesture recognition research is to create a system which can identify specific human gestures and use them to convey information or for device control. The primary goal of virtual environments (VEs), likewise, is to provide natural, efficient, powerful, and flexible interaction, and gesture as an input modality can help meet these requirements. Human gestures are certainly natural and flexible, and may often be efficient and powerful, especially as compared with alternative interaction modes. This section will cover automatic gesture recognition, particularly computer-vision-based techniques that do not require the user to wear extra sensors, clothing or equipment.

The traditional two-dimensional (2D), keyboard- and mouse-oriented graphical user interface (GUI) is not well suited to virtual environments. Synthetic environments provide the opportunity to utilize several different sensing modalities and technologies and to integrate them into the user experience. Devices which sense body position and orientation, direction of gaze, speech and sound, facial expression, galvanic skin response, and other aspects of human behavior or state can be used to mediate communication between the human and the environment. Combinations of communication modalities and sensing devices can produce a wide range of unimodal and multimodal interface techniques.
The potential for these techniques to support natural and powerful interfaces for communication in VEs appears promising. If interaction technologies are overly obtrusive, awkward, or constraining, the user's experience with the synthetic environment is severely degraded. If the interaction itself draws attention to the technology, rather than the task at hand, or imposes a high cognitive load on the user, it becomes a burden and an obstacle to a successful VE experience. Therefore, there is focused interest in technologies that are unobtrusive and passive.

To support gesture recognition, human position and movement must be tracked and interpreted in order to recognize semantically meaningful gestures. While tracking of a user's head position or hand configuration may be quite useful for directly controlling objects or inputting parameters, people naturally express communicative acts through higher-level constructs. The output of position (and other) sensing must be interpreted to allow users to communicate more naturally and effortlessly through gesture. Gesture is used for control and navigation in CAVEs (Cave Automatic Virtual Environments) and in other VEs, such as smart rooms, virtual work environments, and performance spaces. In addition, gesture may be perceived by the environment in order to be transmitted elsewhere (e.g., as a compression technique, to be reconstructed at the receiver). Gesture recognition may also influence - intentionally or unintentionally - a system's model of the user's state. For example, a look of frustration may cause a system to slow down its presentation of information, or the urgency of a gesture may cause the system to speed up. Gesture may also be used as a communication backchannel (i.e., visual or verbal behaviors such as nodding or saying "uh-huh" to indicate "I'm with you, continue", or raising a finger to indicate the desire to interrupt) to indicate agreement, participation, attention, conversation turn-taking, etc.

Given that the human body can express a huge variety of gestures, what is appropriate to sense? Clearly the position and orientation of each body part - the parameters of an articulated body model - would be useful, as well as features derived from those measurements, such as velocity and acceleration. Facial expressions are very expressive. More subtle cues such as hand tension, overall muscle tension, locations of self-contact, and even pupil dilation may be of use.

To help understand what gestures are, an examination of how other researchers view gestures is useful. How do biologists and sociologists define "gesture"? How is information encoded in gestures? We also explore how humans use gestures to communicate with and command other people. Furthermore, engineering researchers have designed a variety of "gesture" recognition systems - how do they define and use gestures?

Biological and Sociological Definition and Classification of Gestures
From a biological and sociological perspective, gestures are loosely defined; thus, researchers are free to visualize and classify gestures as they see fit. Speech and handwriting recognition research provides methods for designing recognition systems and useful measures for classifying such systems. Gesture recognition systems which are used to control memory and display, devices in a local environment, and devices in a remote environment are examined for the same reason. People frequently use gestures to communicate.
Gestures are used for everything from pointing at a person to get their attention to conveying information about space and temporal characteristics. Evidence indicates that gesturing does not simply embellish spoken language, but is part of the language generation process. Biologists define "gesture" broadly, stating, "the notion of gesture is to embrace all kinds of instances where an individual engages in movements whose communicative intent is paramount, manifest, and openly acknowledged". Gestures associated with speech are referred to as gesticulation. Gestures which function independently of speech are referred to as autonomous. Autonomous gestures can be organized into their own communicative language, such as American Sign Language (ASL). Autonomous gestures can also represent motion commands. In the following subsections, some of the various ways in which biologists and sociologists define gestures are examined to discover if there are gestures ideal for use in communication and device control.

Gesture Dichotomies

One classification method categorizes gestures using four dichotomies: act-symbol, opacity-transparency, autonomous semiotic-multisemiotic (semiotic refers to a general philosophical theory of signs and symbols that deals with their function in both artificially constructed and natural languages), and centrifugal-centripetal (intentionality).

The act-symbol dichotomy refers to the notion that some gestures are pure actions, while others are intended as symbols. For instance, an action gesture occurs when a person chops wood or counts money, while a symbolic gesture occurs when a person makes the "okay" sign or puts their thumb out to hitchhike. Naturally, some action gestures can also be interpreted as symbols (semiogenesis), as illustrated in spy novels, where the way an agent carries an object in one hand can carry important meaning. This dichotomy shows that researchers can use gestures which represent actual motions for use in controlling devices.

The opacity-transparency dichotomy refers to the ease with which others can interpret gestures. Transparency is often associated with universality, a belief which states that some gestures have standard cross-cultural meanings. In reality, gesture meanings are very culturally dependent. Within a society, gestures have standard meanings, but no known body motion or gesture has the same meaning in all societies. Even in ASL, few signs are so clearly transparent that a non-signer can guess their meaning without additional clues. Fortunately, this means that gestures used for device control can be freely chosen. Additionally, gestures can be culturally defined to have specific meaning.

The centrifugal-centripetal dichotomy refers to the intentionality of a gesture. Centrifugal gestures are directed toward a specific object, while centripetal gestures are not. Researchers usually are concerned with gestures which are directed toward the control of a specific object or the communication with a specific person or group of people.

Gestures which are elements of an autonomous semiotic system are those used in a gesture language, such as ASL. On the other hand, gestures which are created as partial elements of multisemiotic activity are gestures which accompany other languages, such as oral ones. Gesture recognition researchers are usually concerned with gestures which are created as their own independent, semiotic language, though there are some exceptions.
The Nature of Gesture

Gestures are expressive, meaningful body motions – i.e., physical movements of the fingers, hands, arms, head, face, or body with the intent to convey information or interact with the environment. Cadoz (1994) described three functional roles of human gesture:
Semiotic – to communicate meaningful information.
Ergotic – to manipulate the environment.
Epistemic – to discover the environment through tactile experience.

Gesture recognition is the process by which gestures made by the user are made known to the system. One could argue that in GUI-based systems, standard mouse and keyboard actions used for selecting items and issuing commands are gestures; here the interest is in less trivial cases. While static position (also referred to as posture, configuration, or pose) is not technically considered gesture, it is included for the purposes of this section.

In VEs users need to communicate in a variety of ways, to the system itself and also to other users or remote environments. Communication tasks include specifying commands and/or parameters for: navigating through a space; specifying items of interest; manipulating objects in the environment; changing object values; controlling virtual objects; and issuing task-specific commands. In addition to user-initiated communication, a VE system may benefit from observing a user's behavior for purposes such as: analysis of usability; analysis of user tasks; monitoring of changes in a user's state; better understanding a user's intent or emphasis; and communicating user behavior to other users or environments.

Messages can be expressed through gesture in many ways. For example, an emotion such as sadness can be communicated through facial expression, a lowered head position, relaxed muscles, and lethargic movement. Similarly, a gesture to indicate "Stop!" can be simply a raised hand with the palm facing forward, or an exaggerated waving of both hands above the head. In general, there exists a many-to-one mapping from concept to gesture (i.e., gestures are ambiguous); there is also a many-to-one mapping from gesture to concept (i.e., gestures are not completely specified). And, like speech and handwriting, gestures vary among individuals, they vary from instance to instance for a given individual, and they are subject to the effects of co-articulation.

An interesting real-world example of the use of gestures in visual communications is a U.S. Army field manual (Anonymous, 1987) that serves as a reference and guide to commonly used visual signals, including hand and arm gestures for a variety of situations. The manual describes visual signals used to transmit standardized messages rapidly over short distances.

Despite the richness and complexity of gestural communication, researchers have made progress in beginning to understand and describe the nature of gesture. Kendon (1972) described a "gesture continuum," depicted in Figure 2, defining five different kinds of gestures:
Gesticulation. Spontaneous movements of the hands and arms that accompany speech.
Language-like gestures. Gesticulation that is integrated into a spoken utterance, replacing a particular spoken word or phrase.
Pantomimes. Gestures that depict objects or actions, with or without accompanying speech.
Emblems. Familiar gestures such as "V for victory", "thumbs up", and assorted rude gestures (these are often culturally specific).
Sign languages. Linguistic systems, such as American Sign Language, which are well defined.
As the list progresses, the association with speech declines, language properties increase, spontaneity decreases, and social regulation increases.

Within the first category – spontaneous, speech-associated gesture – McNeill (1992) defined four gesture types:
Iconic. Representational gestures depicting some feature of the object, action or event being described.
Metaphoric. Gestures that represent a common metaphor, rather than the object or event directly.
Beat. Small, formless gestures, often associated with word emphasis.
Deictic. Pointing gestures that refer to people, objects, or events in space or time.

These types of gesture modify the content of accompanying speech and may often help to disambiguate speech – similar to the role of spoken intonation. Cassell et al. (1994) describe a system that models the relationship between speech and gesture and generates interactive dialogs between three-dimensional (3D) animated characters that gesture as they speak. These spontaneous gestures (gesticulation in Kendon's Continuum) make up some 90% of human gestures. People even gesture when they are on the telephone, and blind people regularly gesture when speaking to one another. Across cultures, speech-associated gesture is natural and common. For human-computer interaction (HCI) to be truly natural, technology to understand both speech and gesture together must be developed.

Despite the importance of this type of gesture in normal human-to-human interaction, most research to date in HCI, and most VE technology, focuses on the right side of the continuum, where gestures tend to be less ambiguous, less spontaneous and natural, more learned, and more culture-specific. Emblematic gestures and gestural languages, although perhaps less spontaneous and natural, carry more clear semantic meaning and may be more appropriate for the kinds of command-and-control interaction that VEs tend to support. The main exception to this is work in recognizing and integrating deictic (mainly pointing) gestures, beginning with the well-known Put That There system by Bolt (1980). Parts of this section will focus on symbolic gestures (which include emblematic gestures and predefined gesture languages) and deictic gestures.

Representations of Gesture

The concept of gesture is loosely defined, and depends on the context of the interaction. Recognition of natural, continuous gestures requires temporally segmenting gestures. Automatically segmenting gestures is difficult, and is often finessed or ignored in current systems by requiring a starting position in time and/or space. Similar to this is the problem of distinguishing intentional gestures from other "random" movements. There is no standard way to do gesture recognition – a variety of representations and classification schemes are used. However, most gesture recognition systems share some common structure.

Gestures can be static, where the user assumes a certain pose or configuration, or dynamic, defined by movement. McNeill (1992) defines three phases of a dynamic gesture: pre-stroke, stroke, and post-stroke. Some gestures have both static and dynamic elements, where the pose is important in one or more of the gesture phases; this is particularly relevant in sign languages. When gestures are produced continuously, each gesture is affected by the gesture that preceded it, and possibly by the gesture that follows it.
These co-articulations may be taken into account as a system is trained. There are several aspects of a gesture that may be relevant and therefore may need to be represented explicitly. Hummels and Stappers (1998) describe four aspects of a gesture which may be important to its meaning:
Spatial information – where it occurs, locations a gesture refers to.
Pathic information – the path that a gesture takes.
Symbolic information – the sign that a gesture makes.
Affective information – the emotional quality of a gesture.

In order to infer these aspects of gesture, human position, configuration, and movement must be sensed. This can be done directly with sensing devices such as magnetic field trackers, instrumented gloves, and datasuits, which are attached to the user, or indirectly using cameras and computer vision techniques. Each sensing technology differs along several dimensions, including accuracy, resolution, latency, range of motion, user comfort, and cost. The integration of multiple sensors in gesture recognition is a complex task, since each sensing technology varies along these dimensions. Although the output from these sensors can be used to directly control parameters such as navigation speed and direction or movement of a virtual object, here the interest is primarily in the interpretation of sensor data to recognize gestural information.

The output of initial sensor processing is a time-varying sequence of parameters describing positions, velocities, and angles of relevant body parts and features. These should (but often do not) include a representation of uncertainty that indicates limitations of the sensor and processing algorithms. Recognizing gestures from these parameters is a pattern recognition task that typically involves transforming input into the appropriate representation (feature space) and then classifying it from a database of predefined gesture representations. The parameters produced by the sensors may be transformed into a global coordinate space, processed to produce sensor-independent features, or used directly in the classification step. Because gestures are highly variable, from one person to another and from one example to another within a single person, it is essential to capture the essence of a gesture – its invariant properties – and use this to represent the gesture.

Besides the choice of representation itself, a significant issue in building gesture recognition systems is how to create and update the database of known gestures. Hand-coding gestures to be recognized only works for trivial systems; in general, a system needs to be trained through some kind of learning. As with speech recognition systems, there is often a tradeoff between accuracy and generality – the more accuracy desired, the more user-specific training is required. In addition, systems may be fully trained when in use, or they may adapt over time to the current user.

Static gesture, or pose, recognition can be accomplished by a straightforward implementation, using template matching, geometric feature classification, neural networks, or other standard pattern recognition techniques to classify pose. Dynamic gesture recognition, however, requires consideration of temporal events. This is typically accomplished through the use of techniques such as time-compressing templates, dynamic time warping, hidden Markov models (HMMs), and Bayesian networks. Some examples will be presented in the following sections.
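To make the dynamic case concrete, the following is a minimal Python sketch of dynamic time warping (DTW), one of the techniques just mentioned: an observed feature sequence is aligned against stored templates and assigned the label of the closest one. The feature vectors and gesture names here are hypothetical, standing in for per-frame hand positions or joint angles.

import math

def dtw_distance(seq_a, seq_b):
    """Align two gesture feature sequences and return their DTW cost."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # local distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

def classify(observed, templates):
    """Return the stored template name closest to the observed gesture."""
    return min(templates, key=lambda name: dtw_distance(observed, templates[name]))

# Hypothetical 2D trajectories: a "wave" and a "point" gesture template.
templates = {
    "wave":  [(0, 0), (1, 1), (0, 2), (1, 3), (0, 4)],
    "point": [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)],
}
print(classify([(0, 0), (1, 1), (0, 2), (1, 3)], templates))  # -> "wave"

In practice the local distance would be computed over richer features, and a rejection threshold would discard sequences that match no template well.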
Gesture Typologies

Another standard gesture classification scheme uses three categories: arbitrary, mimetic, and deictic.

In mimetic gestures, motions form an object's main shape or representative feature. For instance, a chin sweeping gesture can be used to represent a goat by alluding to its beard. These gestures are intended to be transparent. Mimetic gestures are useful in gesture language representations.

Deictic gestures are used to point at important objects, and each gesture is transparent within its given context. These gestures can be specific, general, or functional. Specific gestures refer to one object. General gestures refer to a class of objects. Functional gestures represent intentions, such as pointing to a chair to ask for permission to sit. Deictic gestures are also useful in gesture language representations.

Arbitrary gestures are those whose interpretation must be learned due to their opacity. Although they are not common in a cultural setting, once learned they can be used and understood without any complementary verbal information. An example is the set of gestures used for crane operation: such gestures are arbitrarily defined, yet understood without any additional verbal information. Arbitrary gestures are useful because they can be specifically created for use in device control.

Voice and Handwriting Recognition: Parallel Issues for Gesture Recognition

Speech and handwriting recognition systems are similar to gesture recognition systems, because all of these systems perform recognition of something that moves, leaving a "trajectory" in space and time. By exploring the literature of speech and handwriting recognition, classification and identification schemes can be studied which might aid in developing a gesture recognition system.

Typical speech recognition systems match transformed speech against a stored representation. Most systems use some form of spectral representation, such as spectral templates or hidden Markov models (HMM). Speech recognition systems are classified along the following dimensions:
Speaker dependent versus independent: Can the system recognize the speech of many different individuals without training, or does it have to be trained for a specific voice? Currently, speaker dependent systems are more accurate, because they do not need to account for large variations in words.
Discrete or continuous: Does the speaker need to separate individual words by short silences, or can the system recognize continuous sentences? Isolated-word recognition systems have a high accuracy rate, in part because the systems know when each word has ended.
Vocabulary size: The vocabulary is usually task dependent. All other things being equal, a small vocabulary is easier to recognize than a large one.
Recognition rate: Commercial products strive for at least a 95% recognition rate. Although this rate seems very high, these results occur in laboratory environments. Also, studies have shown that humans have an individual word recognition rate of 99.2%.

State of the art speech recognition systems, which have the capability to understand a large vocabulary, use HMMs. HMMs are also used by a number of gesture recognition systems, such as those used to control memory and display. In some speech recognition systems, the states of an HMM represent phonetic units. A state transition defines the probability of the next state's occurrence. The term hidden refers to the type of Markov model in which the observations are a probabilistic function of the current state. A complete specification of a hidden Markov model requires the following information: the state transition probability distribution, the observation symbol probability distribution, and the initial state distribution. An HMM is created for each word (string of phonemes) in a given lexicon. One of the tasks in isolated speech recognition is to measure an observed sequence of phonetic units and determine which HMM was most likely to generate such a sequence.
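As a concrete illustration of that evaluation task, here is a minimal Python sketch of the forward algorithm: given the three distributions that specify an HMM, it computes the likelihood of an observation sequence under each stored model and picks the most likely one. The two toy models and all their parameters are invented for illustration.

def forward_likelihood(pi, A, B, observations):
    """P(observations | model) via the forward algorithm.
    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability of emitting symbol k while in state i
    """
    alpha = [pi[i] * B[i][observations[0]] for i in range(len(pi))]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][obs]
                 for j in range(len(pi))]
    return sum(alpha)

# Two toy 2-state models over a 2-symbol alphabet; recognition picks the
# model (word, or equally a gesture) with the highest likelihood.
models = {
    "hello": ([0.9, 0.1], [[0.7, 0.3], [0.0, 1.0]], [[0.8, 0.2], [0.1, 0.9]]),
    "stop":  ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.2, 0.8], [0.9, 0.1]]),
}
obs = [0, 0, 1, 1]
best = max(models, key=lambda w: forward_likelihood(*models[w], obs))
print(best)  # -> "hello" for these made-up parameters

The same evaluation step carries over directly to gesture recognition, with quantized motion features taking the place of phonetic units.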
From some points of view, handwriting can be considered a type of gesture. On-line (also called "real time" or "dynamic") recognition machines identify handwriting as a user writes. On-line devices have the advantage of capturing the dynamic information of writing, including the number of strokes, the ordering of strokes, and the direction and velocity profile of each stroke. On-line recognition systems are also interactive, allowing users to correct recognition errors, adapt to the system, or see the immediate results of an editing command.

Most on-line tablets capture writing as a sequence of coordinate points. Recognition is complicated in part because there are many different ways of generating the same character. For example, the letter E's four lines can be drawn in any order. Handwriting tablets must take into account character blending and merging, which is similar to the continuous speech problem. Also, different characters can look quite similar. To tackle these problems, handwriting tablets pre-process the characters, and then perform some type of shape recognition. Preprocessing typically involves properly spacing the characters and filtering out noise from the tablet.

The more complicated processing occurs during character recognition. Features based on both static and dynamic character information can be used for recognition. Some systems using binary decision trees prune possible characters by examining simple features first, such as searching for the dots above the letters "i" and "j". Other systems create zones which define the directions a pen point can travel (usually eight), and a character is defined in terms of a connected set of zones. A lookup table or a dictionary is used to classify the characters. Another scheme draws its classification method from signal processing, in which curves from unknown forms are matched against prototype characters. They are matched as functions of time or as Fourier coefficients. To reduce errors, an elastic matching scheme (stretching and bending drawn curves) is used. These methods tend to be computationally intensive. Alternatively, pen strokes can be divided into basic components, which are then connected by rules and matched to characters. This method is called Analysis-by-Synthesis. Similar systems use dynamic programming methods to match real and modeled strokes.

This examination of handwriting tablets reveals that the dynamic features of characters make on-line recognition possible and, as in speech, it is easier to recognize isolated characters. Most systems lag in recognition by more than a second, and the recognition rates are not very high; reported rates of 95% are achieved only with very careful writing. They are best used for filling out forms which have predefined prototypes and set areas for characters. For a more detailed overview of handwriting tablets, consult the handwriting recognition literature.
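The zone scheme described above is easy to sketch. The following Python fragment quantizes a pen stroke into eight compass directions and looks the resulting code up in a small dictionary; the lookup entries are hypothetical, and coordinates use the mathematical (y-up) convention.

import math

DIRECTIONS = "E NE N NW W SW S SE".split()

def direction_code(points):
    """Quantize a stroke (list of (x, y) samples) into 8-direction symbols."""
    code = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        sector = int((angle + math.pi / 8) / (math.pi / 4)) % 8
        symbol = DIRECTIONS[sector]
        if not code or code[-1] != symbol:   # collapse repeated directions
            code.append(symbol)
    return tuple(code)

# Hypothetical dictionary mapping direction codes to characters/gestures.
LOOKUP = {
    ("E",): "dash",
    ("S",): "down-flick",
    ("E", "S"): "7-like stroke",
}
stroke = [(0, 0), (1, 0), (2, 0), (2, -1), (2, -2)]   # right, then down
print(LOOKUP.get(direction_code(stroke), "unknown"))  # -> "7-like stroke"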
Pen-based gesture recognition

Recognizing gestures from 2D input devices such as a pen or mouse has been considered for some time. The early Sketchpad system in 1963 used light-pen gestures, for example. Some commercial systems have used pen gestures since the 1970s. There are examples of gesture recognition for document editing, for air traffic control, and for design tasks such as editing splines. More recently, systems such as the OGI QuickSet system have demonstrated the utility of pen-based gesture recognition in concert with speech recognition to control a virtual environment. QuickSet recognizes 68 pen gestures, including map symbols, editing gestures, route indicators, area indicators, and taps. Oviatt (1996) has demonstrated significant benefits of using speech and pen gestures together in certain tasks. Zeleznik (1996) and Landay and Myers (1995) developed interfaces that recognize gestures from pen-based sketching. A significant benefit of pen-based gestural systems is that sensing and interpretation are relatively straightforward as compared with vision-based techniques.

There have been commercially available Personal Digital Assistants (PDAs) for several years, starting with the Apple Newton, and more recently the 3Com PalmPilot and various Windows CE devices. These PDAs perform handwriting recognition and allow users to invoke operations by various, albeit quite limited, pen gestures. Long, Landay, and Rowe (1998) survey problems and benefits of these gestural interfaces and provide insight for interface designers. Although pen-based gesture recognition is promising for many HCI environments, it presumes the availability of, and proximity to, a flat surface or screen. In VEs, this is often too constraining – techniques that allow the user to move around and interact in more natural ways are more compelling. The next two sections cover two primary technologies for gesture recognition in virtual environments: instrumented gloves and vision-based interfaces.

Tracker-based gesture recognition

There are a number of commercially available tracking systems, which can be used as input to gesture recognition, primarily for tracking eye gaze, hand configuration, and overall body position. Each sensor type has its strengths and weaknesses in the context of VE interaction. While eye gaze can be useful in a gestural interface, the focus here is on gestures based on input from tracking the hands and body.

Instrumented gloves

People naturally use their hands for a wide variety of manipulation and communication tasks. Besides being quite convenient, hands are extremely dexterous and expressive, with approximately 29 degrees of freedom (including the wrist). In his comprehensive thesis on whole hand input, Sturman (1992) showed that the hand can be used as a sophisticated input and control device in a wide variety of application domains, providing real-time control of complex tasks with many degrees of freedom. He analyzed task characteristics and requirements, hand action capabilities, and device capabilities, and discussed important issues in developing whole-hand input techniques. Sturman suggested a taxonomy of whole-hand input that categorizes input techniques along two dimensions:
Classes of hand actions: continuous or discrete.
Interpretation of hand actions: direct, mapped, or symbolic.
The resulting six categories describe the styles of whole-hand input. A given interaction task can be evaluated as to which style best suits the task.
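In the discrete, symbolically interpreted corner of that taxonomy, recognition often reduces to classifying a static posture from glove joint angles. Below is a minimal nearest-neighbor sketch in Python; the five flexion values and posture names are hypothetical, and a real glove would report many more degrees of freedom.

import math

# Calibrated joint-angle templates (degrees of flexion) for a few postures:
# (thumb, index, middle, ring, pinky).
POSTURES = {
    "fist":      (60, 90, 90, 90, 90),
    "point":     (60,  0, 90, 90, 90),
    "open_hand": ( 0,  0,  0,  0,  0),
    "thumbs_up": ( 0, 90, 90, 90, 90),
}

def classify_posture(sample, max_distance=40.0):
    """Return the nearest stored posture, or None if nothing is close enough."""
    name, dist = min(((n, math.dist(sample, t)) for n, t in POSTURES.items()),
                     key=lambda pair: pair[1])
    return name if dist <= max_distance else None

print(classify_posture((55, 5, 85, 88, 92)))   # -> "point"
print(classify_posture((30, 45, 45, 45, 45)))  # -> None (ambiguous pose)

The rejection threshold matters in practice: without it, every in-between hand shape would be forced onto some posture and trigger spurious commands.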
Mulder (1996) presented an overview of hand gestures in human-computer interaction, discussing the classification of hand movement, standard hand gestures, and hand gesture interface design. For several years, commercial devices have been available which measure, to various degrees of precision, accuracy, and completeness, the position and configuration of the hand. These include "data gloves" and exoskeleton devices mounted on the hand and fingers (the term "instrumented glove" is used to include both types).

Some advantages of instrumented gloves include: direct measurement of hand and finger parameters (joint angles, 3D spatial information, wrist rotation); data provided at a high sampling frequency; ease of use; no line-of-sight occlusion problems; availability of relatively low cost versions; and translation-independent data (within the range of motion). Disadvantages of instrumented gloves include: calibration can be difficult; tethered gloves reduce range of motion and comfort; data from inexpensive systems can be very noisy; accurate systems are expensive; and the user is forced to wear a somewhat cumbersome device.

Many projects have used hand input from instrumented gloves for "point, reach, and grab" operations or more sophisticated gestural interfaces. Latoschik and Wachsmuth (1997) present a multi-agent architecture for detecting pointing gestures in a multimedia application. Väänänen and Böhm (1992) developed a neural network system that recognized static gestures and allows the user to interactively teach new gestures to the system. Böhm et al. (1994) extend that work to dynamic gestures using a Kohonen Feature Map (KFM) for data reduction. Baudel and Beaudouin-Lafon (1993) developed a system to provide gestural input to a computer while giving a presentation – this work included a gesture notation and set of guidelines for designing gestural command sets. Fels and Hinton (1995) used an adaptive neural network interface to translate hand gestures to speech. Kadous (1996) used glove input to recognize Australian sign language; Takahashi and Kishino (1991) did so for the Japanese Kana manual alphabet. The system of Lee and Xu (1996) could learn and recognize new gestures online. Despite the fact that many, if not most, gestures involve two hands, most of the research efforts in glove-based gesture recognition use only one glove for input. The features that are used for recognition, and the degree to which dynamic gestures are considered, vary quite a bit. The HIT Lab at the University of Washington developed GloveGRASP, a C/C++ class library that allows software developers to add gesture recognition capabilities to SGI systems, including user-dependent training and one- or two-handed gesture recognition. A commercial version of this system is available from General Reality.

Body suits

It is well known that by viewing only a small number of strategically placed dots on the human body, people can easily perceive complex movement patterns such as the activities, gestures, identities, and other aspects of bodies in motion. One way to approach the recognition of human movements and postures is to optically measure the 3D positions of several such markers attached to the body and then recover the time-varying articulated structure of the body. The articulated structure may also be measured more directly by sensing joint angles and positions using electromechanical body sensors.
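Once markers are labeled, recovering articulated structure comes down to simple vector geometry. The Python sketch below, with made-up marker coordinates, computes an elbow angle from shoulder, elbow, and wrist markers.

import math

def joint_angle(a, b, c):
    """Angle (degrees) at point b formed by segments b->a and b->c."""
    u = tuple(ai - bi for ai, bi in zip(a, b))
    v = tuple(ci - bi for ci, bi in zip(c, b))
    dot = sum(ui * vi for ui, vi in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Hypothetical 3D marker positions (meters) from one captured frame.
shoulder, elbow, wrist = (0.0, 1.4, 0.0), (0.0, 1.1, 0.25), (0.2, 1.0, 0.5)
print(f"elbow angle: {joint_angle(shoulder, elbow, wrist):.1f} degrees")

Repeating this per frame yields the time-varying joint-angle sequences that feed the recognition techniques described earlier.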
Although some of the optical systems only require dots or small balls to be placed on top of a user's clothing, all of these body motion capture systems are referred to herein generically as "body suits." Body suits have advantages and disadvantages that are similar to those of instrumented gloves: they can provide reliable data at a high sampling rate (at least for electromagnetic devices), but they are expensive and very cumbersome. Calibration is typically nontrivial. The optical systems typically use several cameras and process their data offline – their major advantage is the lack of wires and a tether.

Body suits have been used, often along with instrumented gloves, in several gesture recognition systems. Wexelblat (1994) implemented a continuous gesture analysis system using a data suit, "data gloves," and an eye tracker. In this system, data from the sensors is segmented in time (between movement and inaction), key features are extracted, motion is analyzed, and a set of special-purpose gesture recognizers look for significant changes. Marrin and Picard (1998) have developed an instrumented jacket for an orchestral conductor that includes physiological monitoring to study the correlation between affect, gesture, and musical expression.

Although current optical and electromechanical tracking technologies are cumbersome and therefore contrary to the desire for more natural interfaces, it is likely that advances in sensor technology will enable a new generation of devices (including stationary field sensing devices, gloves, watches, and rings) that are just as useful as current trackers but much less obtrusive. Similarly, instrumented body suits, which are currently exceedingly cumbersome, may be displaced by sensing technologies embedded in belts, shoes, eyeglasses, and even shirts and pants. While sensing technology has a long way to go to reach these ideals, passive sensing using computer vision techniques is beginning to make headway as a user-friendly interface technology. Note that although some of the body tracking methods in this section use cameras and computer vision techniques to track joint or limb positions, they require the user to wear special markers. In the next section only passive techniques that do not require the user to wear any special markers or equipment are considered.

Passive vision-based gesture recognition

The most significant disadvantage of the tracker-based systems is that they are cumbersome. This detracts from the immersive nature of a VE by requiring the user to don an unnatural device that cannot easily be ignored, and which often requires significant effort to put on and calibrate. Even optical systems with markers applied to the body suffer from these shortcomings, albeit not as severely. What many have wished for is a technology that provides real-time data useful for analyzing and recognizing human motion that is passive and non-obtrusive. Computer vision techniques have the potential to meet these requirements. Vision-based interfaces use one or more cameras to capture images, at a frame rate of 30 Hz or more, and interpret those images to produce visual features that can be used to interpret human activity and recognize gestures. Typically the camera locations are fixed in the environment, although they may also be mounted on moving platforms or on other people.
For the past decade, there has been a significant amount of research in the computer vision community on detecting and recognizing faces, analyzing facial expression, extracting lip and facial motion to aid speech recognition, interpreting human activity, and recognizing particular gestures.

Unlike sensors worn on the body, vision approaches to body tracking have to contend with occlusions. From the point of view of a given camera, there are always parts of the user's body that are occluded and therefore not visible – e.g., the backside of the user is not visible when the camera is in front. More significantly, self-occlusion often prevents a full view of the fingers, hands, arms, and body from a single view. Multiple cameras can be used, but this adds correspondence and integration problems. The occlusion problem makes full body tracking difficult, if not impossible, without a strong model of body kinematics and perhaps dynamics. However, recovering all the parameters of body motion may not be a prerequisite for gesture recognition. The fact that people can recognize gestures leads to three possible conclusions: (1) The parameters that cannot be directly observed are inferred. (2) These parameters are not needed to accomplish the task. (3) Some are inferred and others are ignored.

It is a mistake to consider vision and tracking devices (such as instrumented gloves and body suits) as alternative paths to the same end. Although there is overlap in what they can provide, these technologies in general produce qualitatively and quantitatively different outputs which enable different analysis and interpretation. For example, tracking devices can in principle detect fast and subtle movements of the fingers while a user is waving his hands, while human vision in that case will at best get a general sense of the type of finger motion. Similarly, vision can use properties like texture and color in its analysis of gesture, while tracking devices do not. From a research perspective, these observations imply that it may not be an optimal strategy to merely substitute vision at a later date into a system that was developed to use an instrumented glove or a body suit – or vice versa.

Unlike special devices that measure human position and motion, vision uses a multipurpose sensor; the same device used to recognize gestures can be used to recognize other objects in the environment and also to transmit video for teleconferencing, surveillance, and other purposes. There is a growing interest in CMOS-based cameras, which promise miniaturized, low cost, low power cameras integrated with processing circuitry on a single chip. With its integrated processing, such a sensor could conceivably output motion or gesture parameters to the virtual environment.

Currently, most computer vision systems for recognition look something like Figure 3. Analog cameras feed their signal into a digitizer board, or framegrabber, which may do a DMA transfer directly to host memory. Digital cameras bypass the analog-to-digital conversion and go straight to memory. There may be a preprocessing step, where images are normalized, enhanced, or transformed in some manner, and then a feature extraction step. The features – which may be any of a variety of 2D or 3D features, statistical properties, or estimated body parameters – are analyzed and classified as a particular gesture if appropriate.
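A skeletal version of that pipeline, assuming the OpenCV library for capture and image processing, might look like the Python sketch below. The moment-based features and the placeholder classifier are illustrative stand-ins, not a complete recognizer.

import cv2

def extract_features(frame):
    """Normalize the frame and compute simple region features."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)        # preprocessing
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(blurred, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    m = cv2.moments(mask)                                 # feature extraction
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"], m["m00"])  # centroid, area

def classify(feature_history):
    """Placeholder: map a short window of features to a gesture label."""
    # A real system would apply template matching, DTW, HMMs, etc. here.
    return "unknown"

cap = cv2.VideoCapture(0)     # camera fixed in the environment
history = []
while True:
    ok, frame = cap.read()    # capture at roughly 30 Hz
    if not ok:
        break
    features = extract_features(frame)
    if features is not None:
        history.append(features)
        history = history[-30:]          # keep about one second of context
        label = classify(history)        # analysis and classification
cap.release()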
Vision-based systems for gesture recognition vary along a number of dimensions, most notably:
Number of cameras. How many cameras are used? If more than one, are they combined early (stereo) or late (multi-view)?
Speed and latency. Is the system real-time (i.e., fast enough, with low enough latency, to support interaction)?
Structured environment. Are there restrictions on the background, the lighting, the speed of movement, etc.?
User requirements. Must the user wear anything special (e.g., markers, gloves, long sleeves)? Is anything disallowed (e.g., glasses, beard, rings)?
Primary features. What low-level features are computed (edges, regions, silhouettes, moments, histograms, etc.)?
Two- or three-dimensional representation. Does the system construct a 3D model of the body part(s), or is classification done on some other (view-based) representation?
Representation of time. How is the temporal aspect of gesture represented and used in recognition (e.g., via a state machine, dynamic time warping, HMMs, time-compressed templates)?

Head and face gestures

When people interact with one another, they use an assortment of cues from the head and face to convey information. These gestures may be intentional or unintentional, they may be the primary communication mode or backchannels, and they can span the range from extremely subtle to highly exaggerated. Some examples of head and face gestures include: nodding or shaking the head; direction of eye gaze; raising the eyebrows; opening the mouth to speak; winking; flaring the nostrils; and looks of surprise, happiness, disgust, anger, sadness, etc.

People display a wide range of facial expressions. Ekman and Friesen (1978) developed a system called FACS for measuring facial movement and coding expression; this description forms the core representation for many facial expression analysis systems. A real-time system to recognize actions of the head and facial features was developed by Zelinsky and Heinzmann (1996), who used feature template tracking in a Kalman filter framework to recognize thirteen head/face gestures. Moses et al. (1995) used fast contour tracking to determine facial expression from a mouth contour. Essa and Pentland (1997) used optical flow information with a physical muscle model of the face to produce accurate estimates of facial motion. This system was also used to generate spatiotemporal motion-energy templates of the whole face for each different expression – these templates were then used for expression recognition. Oliver et al. (1997) describe a real-time system for tracking the face and mouth that recognized facial expressions and head movements. Otsuka and Ohya (1998) model coarticulation in facial expressions and use an HMM for recognition. Black and Yacoob (1995) used local parametric motion models to track and recognize both rigid and non-rigid facial motions. Demonstrations of this system show facial expressions being detected from television talk show guests and news anchors (in non-real time). La Cascia et al. (1998) extended this approach using texture-mapped surface models and non-planar parameterized motion models to better capture the facial motion.

Hand and arm gestures

Hand and arm gestures receive the most attention among those who study gesture – in fact, many (if not most) references to gesture recognition only consider hand and arm gestures. The vast majority of automatic recognition systems are for deictic gestures (pointing), emblematic gestures (isolated signs), and sign languages (with a limited vocabulary and syntax).
Some are components of bimodal systems, integrated with speech recognition. Some produce precise hand and arm configurations, while others produce only coarse motion.

Stark and Kohler (1995) developed the ZYKLOP system for recognizing hand poses and gestures in real-time. After segmenting the hand from the background and extracting features such as shape moments and fingertip positions, the hand posture is classified. Temporal gesture recognition is then performed on the sequence of hand poses and their motion trajectory. A small number of hand poses comprises the gesture catalog, while a sequence of these makes a gesture. Similarly, Maggioni and Kämmerer (1998) described the GestureComputer, which recognized both hand gestures and head movements. Other systems that recognize hand postures amidst complex visual backgrounds are reported by Weng and Cui (1998) and Triesch and von der Malsburg (1996).

There has been a lot of interest in creating devices to automatically interpret various sign languages to aid the deaf community. One of the first to use computer vision without requiring the user to wear anything special was built by Starner (1995), who used HMMs to recognize a limited vocabulary of ASL sentences. A more recent effort, which uses HMMs to recognize Sign Language of the Netherlands, is described by Assan and Grobel (1997).

The recognition of hand and arm gestures has been applied to entertainment applications. Freeman et al. (1996) developed a real-time system to recognize hand poses using image moments and orientation histograms, and applied it to interactive video games. Cutler and Turk (1998) described a system for children to play virtual instruments and interact with lifelike characters by classifying measurements based on optical flow. A nice overview of work up to 1995 in hand gesture modeling, analysis, and synthesis is presented by Huang and Pavlovic (1995).

Body gestures

This section includes tracking full body motion, recognizing body gestures, and recognizing human activity. Activity may be defined over a much longer period of time than what is normally considered a gesture; for example, two people meeting in an open area, stopping to talk and then continuing on their way may be considered a recognizable activity. Bobick (1997) proposed a taxonomy of motion understanding in terms of:
Movement. The atomic elements of motion.
Activity. A sequence of movements or static configurations.
Action. High-level description of what is happening in context.
Most research to date has focused on the first two levels.

The Pfinder system (Wren et al., 1996) developed at the MIT Media Lab has been used by a number of groups to do body tracking and gesture recognition. It forms a 2D representation of the body, using statistical models of color and shape. The body model provides an effective interface for applications such as video games, interpretive dance, navigation, and interaction with virtual characters. Lucente et al. (1998) combined Pfinder with speech recognition in an interactive environment called Visualization Space, allowing a user to manipulate virtual objects and navigate through virtual worlds. Paradiso and Sparacino (1997) used Pfinder to create an interactive performance space where a dancer can generate music and graphics through their body movements – for example, hand and body gestures can trigger rhythmic and melodic changes in the music.
Systems that analyze human motion in VEs may be quite useful in medical rehabilitation (see Chapter 46, this Volume) and athletic and military training (see Chapter 43, this Volume). For example, a system like the one developed by Boyd and Little (1998) to recognize human gaits could potentially be used to evaluate rehabilitation progress. Yamamoto et al. (1998) describe a system that used computer vision to analyze body motion in order to evaluate the performance of skiers. Davis and Bobick (1997) used a view-based approach by representing and recognizing human action based on "temporal templates," where a single image template captures the recent history of motion. This technique was used in the KidsRoom system, an interactive, immersive, narrative environment for children. Video surveillance and monitoring of human activity has received significant attention in recent years. For example, the W4 system developed at the University of Maryland (Haritaoglu et al., 1998) tracks people and detects patterns of activity.

System Architecture Concepts for Gesture Recognition Systems

Based on how humans use gestures, and on the analysis of speech and handwriting recognition systems and of other gesture recognition systems, requirements for a gesture recognition system can be detailed. Some requirements and tasks are:
Choose gestures which fit a useful environment.
Create a system which can recognize non-perfect, human-created gestures.
Create a system which can use both a gesture's static and dynamic information components.
Perform gesture recognition with image data presented at field rate (or as fast as possible).
Recognize the gesture as quickly as possible, even before the full gesture is completed.
Use a recognition method which requires a small amount of computational time and memory.
Create an expandable system which can recognize additional types of gestures.
Pair gestures with appropriate responses (language definitions or device command responses).
Create an environment which allows the use of gestures for remote control of devices.

Conclusions

Although several research efforts have been referenced in this chapter, these are just a sampling; many more have been omitted for the sake of brevity. Good sources for much of the work in gesture recognition can be found in the proceedings of the Gesture Workshops and the International Conference on Automatic Face and Gesture Recognition.

There is still much to be done before gestural interfaces, which track and recognize human activities, can become pervasive and cost-effective for the masses. However, much progress has been made in the past decade and, with the continuing march towards computers and sensors that are faster, smaller, and more ubiquitous, there is cause for optimism. As PDAs and pen-based computing continue to proliferate, pen-based 2D gestures should become more common, and some of the technology will transfer to 3D hand, head, and body gestural interfaces. Similarly, technology developed in surveillance and security areas will also find uses in gesture recognition for virtual environments.

There are many open questions in this area. There has been little activity in evaluating usability (see Chapter 34, this Volume) and understanding performance requirements and limitations of gestural interaction. Error rates are reported from 1% to 50%, depending on the difficulty and generality of the scenario.
There are currently no common databases or metrics with which to compare research results. Can gesture recognition systems adapt to variations among individuals, or will extensive individual training be required? What about individual variation due to fatigue and other factors? How good do gesture recognition systems need to be to become truly useful in mass applications?

Each technology discussed in this chapter has its benefits and limitations. Devices that are worn or held – pens, gloves, body suits – are currently more advanced, as evidenced by the fact that there are many commercial products available. However, passive sensing (using cameras or other sensors) promises to be more powerful, more general, and less obtrusive than other technologies. It is likely that both camps will continue to improve and co-exist, often being used together in systems, and that new sensing technologies will arise to give even more choice to VE developers.

Augmented Reality

Overview

This section surveys the field of Augmented Reality, in which 3-D virtual objects are integrated into a 3-D real environment in real time. It describes the medical, manufacturing, visualization, path planning, entertainment and military applications that have been explored. It also describes the characteristics of Augmented Reality systems, including a detailed discussion of the tradeoffs between optical and video blending approaches. Registration and sensing errors are two of the biggest problems in building effective Augmented Reality systems, so this section summarizes current efforts to overcome these problems. Future directions and areas requiring further research are discussed. This survey provides a starting point for anyone interested in researching or using Augmented Reality.

Introduction

This section surveys the current state-of-the-art in Augmented Reality. It describes work performed at many different sites and explains the issues and problems encountered when building Augmented Reality systems. It summarizes the tradeoffs and approaches taken so far to overcome these problems and speculates on future directions that deserve exploration. This survey does not present new research results; its contribution comes from consolidating existing information from many sources and publishing an extensive bibliography of papers in this field. While several other introductory papers have been written on this subject, this survey is more comprehensive and up-to-date. It provides a good beginning point for anyone interested in starting research in this area.

Definition

Augmented Reality (AR) is a variation of Virtual Environments (VE), or Virtual Reality as it is more commonly called. VE technologies completely immerse a user inside a synthetic environment. While immersed, the user cannot see the real world around him. In contrast, AR allows the user to see the real world, with virtual objects superimposed upon or composited with the real world. Therefore, AR supplements reality, rather than completely replacing it. Ideally, it would appear to the user that the virtual and real objects coexisted in the same space, similar to the effects achieved in the film "Who Framed Roger Rabbit?" Figure 5 shows an example of what this might look like. It shows a real desk with a real phone. Inside this room are also a virtual lamp and two virtual chairs.
Note that the objects are combined in 3-D, so that the virtual lamp covers the real table, and the real table covers parts of the two virtual chairs. AR can be thought of as the "middle ground" between VE (completely synthetic) and telepresence (completely real).

Figure 5: Real desk with virtual lamp and two virtual chairs. (Courtesy ECRC)

Some researchers define AR in a way that requires the use of Head-Mounted Displays (HMDs). To avoid limiting AR to specific technologies, this survey defines AR as systems that have the following three characteristics:
1) Combines real and virtual
2) Interactive in real time
3) Registered in 3-D
This definition allows other technologies besides HMDs while retaining the essential components of AR. For example, it does not include film or 2-D overlays. Films like "Jurassic Park" feature photorealistic virtual objects seamlessly blended with a real environment in 3-D, but they are not interactive media. 2-D virtual overlays on top of live video can be done at interactive rates, but the overlays are not combined with the real world in 3-D. However, this definition does allow monitor-based interfaces, monocular systems, see-through HMDs, and various other combining technologies. Potential system configurations are discussed further in Section 3.

Motivation

Why is Augmented Reality an interesting topic? Why is combining real and virtual objects in 3-D useful? Augmented Reality enhances a user's perception of and interaction with the real world. The virtual objects display information that the user cannot directly detect with his own senses. The information conveyed by the virtual objects helps a user perform real-world tasks. AR is a specific example of what Fred Brooks calls Intelligence Amplification (IA): using the computer as a tool to make a task easier for a human to perform.

At least six classes of potential AR applications have been explored: medical visualization, maintenance and repair, annotation, robot path planning, entertainment, and military aircraft navigation and targeting. The next section describes work that has been done in each area. While these do not cover every potential application area of this technology, they do cover the areas explored so far.

Applications

Medical

Doctors could use Augmented Reality as a visualization and training aid for surgery. It may be possible to collect 3-D datasets of a patient in real time, using non-invasive sensors like Magnetic Resonance Imaging (MRI), Computed Tomography scans (CT), or ultrasound imaging. These datasets could then be rendered and combined in real time with a view of the real patient. In effect, this would give a doctor "X-ray vision" inside a patient. This would be very useful during minimally-invasive surgery, which reduces the trauma of an operation by using small incisions or no incisions at all. A problem with minimally-invasive techniques is that they reduce the doctor's ability to see inside the patient, making surgery more difficult. AR technology could provide an internal view without the need for larger incisions. AR might also be helpful for general medical visualization tasks in the surgical room. Surgeons can detect some features with the naked eye that they cannot see in MRI or CT scans, and vice-versa. AR would give surgeons access to both types of data simultaneously. This might also guide precision tasks, such as displaying where to drill a hole into the skull for brain surgery or where to perform a needle biopsy of a tiny tumor.
The information from the non-invasive sensors would be directly displayed on the patient, showing exactly where to perform the operation. AR might also be useful for training purposes. Virtual instructions could remind a novice surgeon of the required steps, without the need to look away from a patient to consult a manual. Virtual objects could also identify organs and specify locations to avoid disturbing.

Several projects are exploring this application area. At UNC Chapel Hill, a research group has conducted trial runs of scanning the womb of a pregnant woman with an ultrasound sensor, generating a 3-D representation of the fetus inside the womb and displaying that in a see-through HMD (Figure 6). The goal is to endow the doctor with the ability to see the moving, kicking fetus lying inside the womb, with the hope that this one day may become a "3-D stethoscope". More recent efforts have focused on a needle biopsy of a breast tumor. Figure 7 shows a mockup of a breast biopsy operation, where the virtual objects identify the location of the tumor and guide the needle to its target. Other groups at the MIT AI Lab, General Electric, and elsewhere are investigating displaying MRI or CT data, directly registered onto the patient.

Figure 6: Virtual fetus inside womb of pregnant patient. (Courtesy UNC Chapel Hill Dept. of Computer Science.)

Figure 7: Mockup of breast tumor biopsy. 3-D graphics guide needle insertion. (Courtesy UNC Chapel Hill Dept. of Computer Science.)

Manufacturing and repair

Another category of Augmented Reality applications is the assembly, maintenance, and repair of complex machinery. Instructions might be easier to understand if they were available, not as manuals with text and pictures, but rather as 3-D drawings superimposed upon the actual equipment, showing step-by-step the tasks that need to be done and how to do them. These superimposed 3-D drawings can be animated, making the directions even more explicit.

Several research projects have demonstrated prototypes in this area. Steve Feiner's group at Columbia built a laser printer maintenance application, shown in Figures 8 and 9. Figure 8 shows an external view, and Figure 9 shows the user's view, where the computer-generated wireframe is telling the user to remove the paper tray. A group at Boeing is developing AR technology to guide a technician in building a wiring harness that forms part of an airplane's electrical system. Storing these instructions in electronic form will save space and reduce costs. Currently, technicians use large physical layout boards to construct such harnesses, and Boeing requires several warehouses to store all these boards. Such space might be emptied for other use if this application proves successful. Boeing is using a Technology Reinvestment Program (TRP) grant to investigate putting this technology onto the factory floor. Figure 10 shows an external view of Adam Janin using a prototype AR system to build a wire bundle. Eventually, AR might be used for any complicated machinery, such as automobile engines.

Figure 8: External view of Columbia printer maintenance application. Note that all objects must be tracked. (Courtesy Steve Feiner, Blair MacIntyre, and Dorée Seligmann, Columbia University.)

Figure 9: Prototype laser printer maintenance application, displaying how to remove the paper tray. (Courtesy Steve Feiner, Blair MacIntyre, and Dorée Seligmann, Columbia University.)
Figure 10: Adam Janin demonstrates Boeing's prototype wire bundle assembly application. (Courtesy David Mizell, Boeing)

Annotation and visualization

AR could be used to annotate objects and environments with public or private information. Applications using public information assume the availability of public databases to draw upon. For example, a hand-held display could provide information about the contents of library shelves as the user walks around the library. At the European Computer-Industry Research Centre (ECRC), a user can point at parts of an engine model and the AR system displays the name of the part that is being pointed at. Figure 11 shows this, where the user points at the exhaust manifold on an engine model and the label "exhaust manifold" appears.

Figure 11: Engine model part labels appear as user points at them. (Courtesy ECRC)

Alternately, these annotations might be private notes attached to specific objects. Researchers at Columbia demonstrated this with the notion of attaching windows from a standard user interface onto specific locations in the world, or attached to specific objects as reminders. Figure 12 shows a window superimposed as a label upon a student. He wears a tracking device, so the computer knows his location. As the student moves around, the label follows his location, providing the AR user with a reminder of what he needs to talk to the student about.

Figure 12: Windows displayed on top of specific real-world objects. (Courtesy Steve Feiner, Blair MacIntyre, Marcus Haupt, and Eliot Solomon, Columbia University.)

AR might aid general visualization tasks as well. An architect with a see-through HMD might be able to look out a window and see how a proposed new skyscraper would change her view. If a database containing information about a building's structure was available, AR might give architects "X-ray vision" inside a building, showing where the pipes, electric lines, and structural supports are inside the walls. Researchers at the University of Toronto have built a system called Augmented Reality through Graphic Overlays on Stereovideo (ARGOS), which among other things is used to make images easier to understand during difficult viewing conditions. Figure 13 shows wireframe lines drawn on top of a space shuttle bay interior, while in orbit. The lines make it easier to see the geometry of the shuttle bay. Similarly, virtual lines and objects could aid navigation and scene understanding during poor visibility conditions, such as underwater or in fog.

Figure 13: Virtual lines help display geometry of shuttle bay, as seen in orbit. (Courtesy David Drascic and Paul Milgram, U. Toronto.)

Robot path planning

Teleoperation of a robot is often a difficult problem, especially when the robot is far away, with long delays in the communication link. Under this circumstance, instead of controlling the robot directly, it may be preferable to control a virtual version of the robot. The user plans and specifies the robot's actions by manipulating the local virtual version, in real time. The results are directly displayed on the real world. Once the plan is tested and determined, the user then tells the real robot to execute the specified plan. This avoids pilot-induced oscillations caused by the lengthy delays. The virtual versions can also predict the effects of manipulating the environment, thus serving as a planning and previewing tool to aid the user in performing the desired task.
The ARGOS system has demonstrated that stereoscopic AR is an easier and more accurate way of doing robot path planning than traditional monoscopic interfaces. Others have also used registered overlays with telepresence systems. Figure 14 shows how a virtual outline can represent a future location of a robot arm.

Figure 14: Virtual lines show a planned motion of a robot arm. (Courtesy David Drascic and Paul Milgram, U. Toronto.)

Entertainment

At SIGGRAPH '95, several exhibitors showed "Virtual Sets" that merge real actors with virtual backgrounds, in real time and in 3-D. The actors stand in front of a large blue screen, while a computer-controlled motion camera records the scene. Since the camera's location is tracked, and the actor's motions are scripted, it is possible to digitally composite the actor into a 3-D virtual background. For example, the actor might appear to stand inside a large virtual spinning ring, where the front part of the ring covers the actor while the rear part of the ring is covered by the actor. The entertainment industry sees this as a way to reduce production costs: creating and storing sets virtually is potentially cheaper than constantly building new physical sets from scratch. The ALIVE project from the MIT Media Lab goes one step further by populating the environment with intelligent virtual creatures that respond to user actions [Maes95].

Military aircraft

For many years, military aircraft and helicopters have used Head-Up Displays (HUDs) and Helmet-Mounted Sights (HMS) to superimpose vector graphics upon the pilot's view of the real world. Besides providing basic navigation and flight information, these graphics are sometimes registered with targets in the environment, providing a way to aim the aircraft's weapons. For example, the chin turret in a helicopter gunship can be slaved to the pilot's HMS, so the pilot can aim the chin turret simply by looking at the target. Future generations of combat aircraft will be developed with an HMD built into the pilot's helmet.

Characteristics

This section discusses the characteristics of AR systems and the design issues encountered when building them. It first describes the basic characteristics of augmentation. There are two ways to accomplish this augmentation, optical and video technologies, and the section compares their characteristics and relative strengths and weaknesses. Blending the real and virtual poses problems with focus and contrast, and some applications require portable AR systems to be truly effective. Finally, the section summarizes these characteristics by comparing the requirements of AR against those for Virtual Environments.

Augmentation

Besides adding objects to a real environment, Augmented Reality also has the potential to remove them. Current work has focused on adding virtual objects to a real environment. However, graphic overlays might also be used to remove or hide parts of the real environment from a user. For example, to remove a desk in the real environment, draw a representation of the real walls and floors behind the desk and "paint" that over the real desk, effectively removing it from the user's sight. This has been done in feature films. Doing this interactively in an AR system will be much harder, but this removal may not need to be photorealistic to be effective. Augmented Reality might apply to all senses, not just sight.
So far, researchers have focused on blending real and virtual images and graphics. However, AR could be extended to include sound. The user would wear headphones equipped with microphones on the outside. The headphones would add synthetic, directional 3-D sound, while the external microphones would detect incoming sounds from the environment. This would give the system a chance to mask or cover up selected real sounds from the environment by generating a masking signal that exactly canceled the incoming real sound. While this would not be easy to do, it might be possible. Another example is haptics. Gloves with devices that provide tactile feedback might augment real forces in the environment. For example, a user might run his hand over the surface of a real desk. Simulating such a hard surface virtually is fairly difficult, but it is easy to do in reality. The tactile effectors in the glove can then augment the feel of the desk, perhaps making it feel rough in certain spots. This capability might be useful in some applications, such as providing an additional cue that a virtual object is at a particular location on a real desk.

Optical vs. video

A basic design decision in building an AR system is how to accomplish the combining of real and virtual. Two basic choices are available: optical and video technologies. Each has particular advantages and disadvantages. This section compares the two and notes the tradeoffs. A see-through HMD is one device used to combine real and virtual. Standard closed-view HMDs do not allow any direct view of the real world. In contrast, a see-through HMD lets the user see the real world, with virtual objects superimposed by optical or video technologies. Optical see-through HMDs work by placing optical combiners in front of the user's eyes. These combiners are partially transmissive, so that the user can look directly through them to see the real world. The combiners are also partially reflective, so that the user sees virtual images bounced off the combiners from head-mounted monitors. This approach is similar in nature to Head-Up Displays (HUDs) commonly used in military aircraft, except that the combiners are attached to the head. Thus, optical see-through HMDs have sometimes been described as a "HUD on a head" [Wanstall89]. Figure 15 shows a conceptual diagram of an optical see-through HMD. Figure 16 shows two optical see-through HMDs made by Hughes Electronics. The optical combiners usually reduce the amount of light that the user sees from the real world. Since the combiners act like half-silvered mirrors, they only let in some of the light from the real world, so that they can reflect some of the light from the monitors into the user's eyes. For example, the HMD described in [Holmgren92] transmits about 30% of the incoming light from the real world. Choosing the level of blending is a design problem. More sophisticated combiners might vary the level of contributions based upon the wavelength of light. For example, such a combiner might be set to reflect all light of a certain wavelength and none at any other wavelengths. This would be ideal with a monochrome monitor. Virtually all the light from the monitor would be reflected into the user's eyes, while almost all the light from the real world (except at the particular wavelength) would reach the user's eyes.
However, most existing optical see-through HMDs do reduce the amount of light from the real world, so they act like a pair of sunglasses when the power is cut off.

Figure 15: Optical see-through HMD conceptual diagram

Figure 16: Two optical see-through HMDs, made by Hughes Electronics

In contrast, video see-through HMDs work by combining a closed-view HMD with one or two head-mounted video cameras. The video cameras provide the user's view of the real world. Video from these cameras is combined with the graphic images created by the scene generator, blending the real and virtual. The result is sent to the monitors in front of the user's eyes in the closed-view HMD. Figure 17 shows a conceptual diagram of a video see-through HMD. Figure 18 shows an actual video see-through HMD, with two video cameras mounted on top of a Flight Helmet.

Figure 17: Video see-through HMD conceptual diagram

Figure 18: An actual video see-through HMD. (Courtesy Jannick Rolland, Frank Biocca, and UNC Chapel Hill Dept. of Computer Science. Photo by Alex Treml.)

Video composition can be done in more than one way. A simple way is to use chroma-keying, a technique used in many video special effects. The background of the computer graphic images is set to a specific color, say green, which none of the virtual objects use. Then the combining step replaces all green areas with the corresponding parts from the video of the real world. This has the effect of superimposing the virtual objects over the real world. A more sophisticated composition would use depth information. If the system had depth information at each pixel for the real world images, it could combine the real and virtual images by a pixel-by-pixel depth comparison. This would allow real objects to cover virtual objects and vice-versa. (A small sketch of the chroma-key step appears at the end of this subsection.)

AR systems can also be built using monitor-based configurations, instead of see-through HMDs. Figure 19 shows how a monitor-based system might be built. In this case, one or two video cameras view the environment. The cameras may be static or mobile. In the mobile case, the cameras might move around by being attached to a robot, with their locations tracked. The video of the real world and the graphic images generated by a scene generator are combined, just as in the video see-through HMD case, and displayed in a monitor in front of the user. The user does not wear the display device. Optionally, the images may be displayed in stereo on the monitor, which then requires the user to wear a pair of stereo glasses. Figure 20 shows an external view of the ARGOS system, which uses a monitor-based configuration.

Figure 19: Monitor-based AR conceptual diagram

Figure 20: External view of the ARGOS system, an example of monitor-based AR. (Courtesy David Drascic and Paul Milgram, U. Toronto.)

Finally, a monitor-based optical configuration is also possible. This is similar to Figure 15 except that the user does not wear the monitors or combiners on her head. Instead, the monitors and combiners are fixed in space, and the user positions her head to look through the combiners. This is typical of Head-Up Displays on military aircraft, and at least one such configuration has been proposed for a medical application.
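Before turning to the tradeoffs between the two approaches, here is the promised minimal sketch of the chroma-key composition step described above, written with NumPy; the reserved key color and array shapes are assumptions for illustration, not the details of any particular system.

```python
import numpy as np

def chroma_key(virtual_rgb, real_rgb, key=(0, 255, 0)):
    """Composite rendered graphics over camera video by chroma-keying.

    virtual_rgb, real_rgb -- HxWx3 uint8 images of the same size.
    Wherever the rendered image shows the reserved key color (a green
    that no virtual object uses), the camera pixel shows through;
    everywhere else the virtual pixel wins.
    """
    background = np.all(virtual_rgb == np.array(key, dtype=np.uint8), axis=-1)
    out = virtual_rgb.copy()
    out[background] = real_rgb[background]
    return out
```

The depth-based composition mentioned above generalizes this per-pixel decision; a sketch of that variant appears later, in the discussion of range sensing.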
The rest of this section compares the relative advantages and disadvantages of optical and video approaches, starting with optical. An optical approach has the following advantages over a video approach:

1) Simplicity: Optical blending is simpler and cheaper than video blending. Optical approaches have only one "stream" of video to worry about: the graphic images. The real world is seen directly through the combiners, and that time delay is generally a few nanoseconds. Video blending, on the other hand, must deal with separate video streams for the real and virtual images. Both streams have inherent delays in the tens of milliseconds. Digitizing video images usually adds at least one frame time of delay to the video stream, where a frame time is how long it takes to completely update an image. A monitor that completely refreshes the screen at 60 Hz has a frame time of 16.67 ms. The two streams of real and virtual images must be properly synchronized or temporal distortion results. Also, optical see-through HMDs with narrow field-of-view combiners offer views of the real world that have little distortion. Video cameras almost always have some amount of distortion that must be compensated for, along with any distortion from the optics in front of the display devices. Since video requires cameras and combiners that optical approaches do not need, video will probably be more expensive and complicated to build than optical-based systems.

2) Resolution: Video blending limits the resolution of what the user sees, both real and virtual, to the resolution of the display devices. With current displays, this resolution is far less than the resolving power of the fovea. Optical see-through also shows the graphic images at the resolution of the display device, but the user's view of the real world is not degraded. Thus, video reduces the resolution of the real world, while optical see-through does not.

3) Safety: Video see-through HMDs are essentially modified closed-view HMDs. If the power is cut off, the user is effectively blind. This is a safety concern in some applications. In contrast, when power is removed from an optical see-through HMD, the user still has a direct view of the real world. The HMD then becomes a pair of heavy sunglasses, but the user can still see.

4) No eye offset: With video see-through, the user's view of the real world is provided by the video cameras. In essence, this puts his "eyes" where the video cameras are. In most configurations, the cameras are not located exactly where the user's eyes are, creating an offset between the cameras and the real eyes. The distance separating the cameras may also not be exactly the same as the user's interpupillary distance (IPD). This difference between camera locations and eye locations introduces displacements from what the user sees compared to what he expects to see. For example, if the cameras are above the user's eyes, he will see the world from a vantage point slightly taller than he is used to. Video see-through can avoid the eye offset problem through the use of mirrors to create another set of optical paths that mimic the paths directly into the user's eyes. Using those paths, the cameras will see what the user's eyes would normally see without the HMD. However, this adds complexity to the HMD design. Offset is generally not a difficult design problem for optical see-through displays. While the user's eye can rotate with respect to the position of the HMD, the resulting errors are tiny.
Using the eye's center of rotation as the viewpoint in the computer graphics model should eliminate any need for eye tracking in an optical see-through HMD.

Video blending offers the following advantages over optical blending:

1) Flexibility in composition strategies: A basic problem with optical see-through is that the virtual objects do not completely obscure the real world objects, because the optical combiners allow light from both virtual and real sources. Building an optical see-through HMD that can selectively shut out the light from the real world is difficult. In a normal optical system, the objects are designed to be in focus at only one point in the optical path: the user's eye. Any filter that would selectively block out light must be placed in the optical path at a point where the image is in focus, which obviously cannot be the user's eye. Therefore, the optical system must have two places where the image is in focus: at the user's eye and at the point of the hypothetical filter. This makes the optical design much more difficult and complex. No existing optical see-through HMD blocks incoming light in this fashion. Thus, the virtual objects appear ghost-like and semi-transparent. This damages the illusion of reality because occlusion is one of the strongest depth cues. In contrast, video see-through is far more flexible about how it merges the real and virtual images. Since both the real and virtual are available in digital form, video see-through compositors can, on a pixel-by-pixel basis, take the real, or the virtual, or some blend between the two to simulate transparency. Because of this flexibility, video see-through may ultimately produce more compelling environments than optical see-through approaches.

2) Wide field-of-view: Distortions in optical systems are a function of the radial distance away from the optical axis. The further one looks away from the center of the view, the larger the distortions get. A digitized image taken through a distorted optical system can be undistorted by applying image processing techniques to unwarp the image, provided that the optical distortion is well characterized. This requires significant amounts of computation, but this constraint will be less important in the future as computers become faster. It is harder to build wide field-of-view displays with optical see-through techniques. Any distortions of the user's view of the real world must be corrected optically, rather than digitally, because the system has no digitized image of the real world to manipulate. Complex optics are expensive and add weight to the HMD. Wide field-of-view systems are an exception to the general trend of optical approaches being simpler and cheaper than video approaches.

3) Real and virtual view delays can be matched: Video offers an approach for reducing or avoiding problems caused by temporal mismatches between the real and virtual images. Optical see-through HMDs offer an almost instantaneous view of the real world but a delayed view of the virtual. This temporal mismatch can cause problems. With video approaches, it is possible to delay the video of the real world to match the delay from the virtual image stream.

4) Additional registration strategies: In optical see-through, the only information the system has about the user's head location comes from the head tracker. Video blending provides another source of information: the digitized image of the real scene.
This digitized image means that video approaches can employ additional registration strategies unavailable to optical approaches.

5) Easier to match the brightness of real and virtual objects: This is discussed below, under focus and contrast.

Both optical and video technologies have their roles, and the choice of technology depends on the application requirements. Many of the mechanical assembly and repair prototypes use optical approaches, possibly because of the cost and safety issues. If successful, the equipment would have to be replicated in large numbers to equip workers on a factory floor. In contrast, most of the prototypes for medical applications use video approaches, probably for the flexibility in blending real and virtual and for the additional registration strategies offered.

Focus and contrast

Focus can be a problem for both optical and video approaches. Ideally, the virtual should match the real. In a video-based system, the combined virtual and real image will be projected at the same distance by the monitor or HMD optics. However, depending on the video camera's depth-of-field and focus settings, parts of the real world may not be in focus. In typical graphics software, everything is rendered with a pinhole model, so all the graphic objects, regardless of distance, are in focus. To overcome this, the graphics could be rendered to simulate a limited depth-of-field, and the video camera might have an autofocus lens. In the optical case, the virtual image is projected at some distance away from the user. This distance may be adjustable, although it is often fixed. Therefore, while the real objects are at varying distances from the user, the virtual objects are all projected to the same distance. If the virtual and real distances are not matched for the particular objects that the user is looking at, it may not be possible to clearly view both simultaneously.

Contrast is another issue because of the large dynamic range in real environments and in what the human eye can detect. Ideally, the brightness of the real and virtual objects should be appropriately matched. Unfortunately, in the worst case scenario, this means the system must match a very large range of brightness levels. The eye is a logarithmic detector: the brightest light it can handle is about eleven orders of magnitude greater than the dimmest, including both dark-adapted and light-adapted eyes. In any one adaptation state, the eye can cover about six orders of magnitude. Most display devices cannot come close to this level of contrast. This is a particular problem with optical technologies, because the user has a direct view of the real world. If the real environment is too bright, it will wash out the virtual image. If the real environment is too dark, the virtual image will wash out the real world. Contrast problems are not as severe with video, because the video cameras themselves have limited dynamic response, and the view of both the real and virtual is generated by the monitor, so everything must be clipped or compressed into the monitor's dynamic range.

Portability

In almost all Virtual Environment systems, the user is not encouraged to walk around much. Instead, the user navigates by "flying" through the environment, walking on a treadmill, or driving some mockup of a vehicle. Whatever the technology, the result is that the user stays in one place in the real world. Some AR applications, however, will need to support a user who will walk around a large environment.
AR requires that the user actually be at the place where the task is to take place. "Flying," as performed in a VE system, is no longer an option. If a mechanic needs to go to the other side of a jet engine, she must physically move herself and the display devices she wears. Therefore, AR systems will place a premium on portability, especially the ability to walk around outdoors, away from controlled environments. The scene generator, the HMD, and the tracking system must all be self-contained and capable of surviving exposure to the environment. If this capability is achieved, many applications that have not yet been tried will become possible. For example, the ability to annotate the surrounding environment could be useful to soldiers, hikers, or tourists in an unfamiliar location.

Comparison against virtual environments

The overall requirements of AR can be summarized by comparing them against the requirements for Virtual Environments, for the three basic subsystems that they require.

1) Scene generator: Rendering is not currently one of the major problems in AR. VE systems have much higher requirements for realistic images because they completely replace the real world with the virtual environment. In AR, the virtual images only supplement the real world. Therefore, fewer virtual objects need to be drawn, and they do not necessarily have to be realistically rendered in order to serve the purposes of the application. For example, in the annotation applications, text and 3-D wireframe drawings might suffice. Ideally, photorealistic graphic objects would be seamlessly merged with the real environment, but more basic problems have to be solved first.

2) Display device: The display devices used in AR may have less stringent requirements than VE systems demand, again because AR does not replace the real world. For example, monochrome displays may be adequate for some AR applications, while virtually all VE systems today use full color. Optical see-through HMDs with a small field-of-view may be satisfactory because the user can still see the real world with his peripheral vision; the see-through HMD does not shut off the user's normal field-of-view. Furthermore, the resolution of the monitor in an optical see-through HMD might be lower than what a user would tolerate in a VE application, since the optical see-through HMD does not reduce the resolution of the real environment.

3) Tracking and sensing: While in the previous two cases AR had lower requirements than VE, that is not the case for tracking and sensing. In this area, the requirements for AR are much stricter than those for VE systems. A major reason for this is the registration problem, which is described in the next section. The other factors that make the tracking and sensing requirements higher are described in the next few pages.

Registration

The registration problem

One of the most basic problems currently limiting Augmented Reality applications is the registration problem. The objects in the real and virtual worlds must be properly aligned with respect to each other, or the illusion that the two worlds coexist will be compromised. More seriously, many applications demand accurate registration. For example, recall the needle biopsy application. If the virtual object is not where the real tumor is, the surgeon will miss the tumor and the biopsy will fail. Without accurate registration, Augmented Reality will not be accepted in many applications.
Registration problems also exist in Virtual Environments, but they are not nearly as serious because they are harder to detect than in Augmented Reality. Since the user only sees virtual objects in VE applications, registration errors result in visual-kinesthetic and visual-proprioceptive conflicts. Such conflicts between different human senses may be a source of motion sickness [Pausch92]. Because the kinesthetic and proprioceptive systems are much less sensitive than the visual system, visual-kinesthetic and visual-proprioceptive conflicts are less noticeable than visual-visual conflicts. For example, a user wearing a closed-view HMD might hold up her real hand and see a virtual hand. This virtual hand should be displayed exactly where she would see her real hand, if she were not wearing an HMD. But if the virtual hand is off by five millimeters, she may not detect that unless she is actively looking for such errors. The same error is much more obvious in a see-through HMD, where the conflict is visual-visual.

Furthermore, a phenomenon known as visual capture makes it even more difficult to detect such registration errors. Visual capture is the tendency of the brain to believe what it sees rather than what it feels, hears, etc. That is, visual information tends to override all other senses. When watching a television program, a viewer believes the sounds come from the mouths of the actors on the screen, even though they actually come from a speaker in the TV. Ventriloquism works because of visual capture. Similarly, a user might believe that her hand is where the virtual hand is drawn, rather than where her real hand actually is, because of visual capture. This effect increases the amount of registration error users can tolerate in Virtual Environment systems. If the errors are systematic, users might even be able to adapt to the new environment, given a long exposure time of several hours or days.

Augmented Reality demands much more accurate registration than Virtual Environments [Azuma93]. Imagine the same scenario of a user holding up her hand, but this time wearing a see-through HMD. Registration errors now result in visual-visual conflicts between the images of the virtual and real hands. Such conflicts are easy to detect because of the resolution of the human eye and the sensitivity of the human visual system to differences. Even tiny offsets in the images of the real and virtual hands are easy to detect.

What angular accuracy is needed for good registration in Augmented Reality? A simple demonstration will show the order of magnitude required. Take out a dime and hold it at arm's length, so that it looks like a circle. The diameter of the dime covers about 1.2 to 2.0 degrees of arc, depending on your arm length. In comparison, the width of a full moon is about 0.5 degrees of arc! Now imagine a virtual object superimposed on a real object, but offset by the diameter of the full moon. Such a difference would be easy to detect. Thus, the angular accuracy required is a small fraction of a degree. The lower limit is bounded by the resolving power of the human eye itself. The central part of the retina is called the fovea, which has the highest density of color-detecting cones, about 120 per degree of arc, corresponding to a spacing of half a minute of arc. Observers can differentiate between a dark and light bar grating when each bar subtends about one minute of arc, and under special circumstances they can detect even smaller differences.
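These figures are easy to check. A short computation, assuming a US dime of about 17.9 mm diameter:

```python
import math

def subtended_angle_deg(diameter_m, distance_m):
    """Full angle subtended by an object of a given diameter at a distance."""
    return math.degrees(2 * math.atan(diameter_m / (2 * distance_m)))

dime = 0.0179                        # ~17.9 mm across
for arm in (0.5, 0.85):              # short and long arm lengths, in meters
    print(f"dime at {arm} m: {subtended_angle_deg(dime, arm):.2f} deg")
# -> about 2.05 deg at 0.5 m and 1.21 deg at 0.85 m, matching the
#    1.2-2.0 degree range quoted above; the full moon is ~0.5 deg.
```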
However, existing HMD trackers and displays are not capable of providing one minute of arc in accuracy, so the presently achievable accuracy is much worse than that ultimate lower bound. In practice, errors of a few pixels are detectable in modern HMDs.

Registration of real and virtual objects is not limited to AR. Special-effects artists seamlessly integrate computer-generated 3-D objects with live actors in film and video. The difference lies in the amount of control available. With film, a director can carefully plan each shot, and artists can spend hours per frame, adjusting each by hand if necessary, to achieve perfect registration. As an interactive medium, AR is far more difficult to work with. The AR system cannot control the motions of the HMD wearer. The user looks where she wants, and the system must respond within tens of milliseconds.

Registration errors are difficult to adequately control because of the high accuracy requirements and the numerous sources of error. These sources of error can be divided into two types: static and dynamic. Static errors are the ones that cause registration errors even when the user's viewpoint and the objects in the environment remain completely still. Dynamic errors are the ones that have no effect until either the viewpoint or the objects begin moving. For current HMD-based systems, dynamic errors are by far the largest contributors to registration errors, but static errors cannot be ignored either. The next two sections discuss static and dynamic errors and what has been done to reduce them. See [Holloway95] for a thorough analysis of the sources and magnitudes of registration errors.

Static errors

The four main sources of static errors are:

Optical distortion
Errors in the tracking system
Mechanical misalignments
Incorrect viewing parameters (e.g., field of view, tracker-to-eye position and orientation, interpupillary distance)

1) Distortion in the optics: Optical distortions exist in most camera and lens systems, both in the cameras that record the real environment and in the optics used for the display. Because distortions are usually a function of the radial distance away from the optical axis, wide field-of-view displays can be especially vulnerable to this error. Near the center of the field-of-view, images are relatively undistorted, but far away from the center, image distortion can be large. For example, straight lines may appear curved. In a see-through HMD with narrow field-of-view displays, the optical combiners add virtually no distortion, so the user's view of the real world is not warped. However, the optics used to focus and magnify the graphic images from the display monitors can introduce distortion. This mapping of distorted virtual images on top of an undistorted view of the real world causes static registration errors. The cameras and displays may also have nonlinear distortions that cause errors. Optical distortions are usually systematic errors, so they can be mapped and compensated. This mapping may not be trivial, but it is often possible. For example, the distortion of one commonly used set of HMD optics has been measured and characterized in the literature. The distortions might be compensated by additional optics; such a design has been described for a video see-through HMD. This can be a difficult design problem, though, and it will add weight, which is not desirable in HMDs. An alternate approach is to do the compensation digitally. This can be done by image warping techniques, both on the digitized video and the graphic images.
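As a hedged sketch of such digital compensation, assume the simplest of the standard radial lens-distortion models, x' = x(1 + k1 r^2); real HMD optics generally need a higher-order, measured polynomial, and real systems usually bake the result into a precomputed warp table for speed.

```python
import numpy as np

def undistort_points(xy, k1):
    """Approximately undo radial distortion in normalized image coordinates.

    xy -- Nx2 array of distorted points, relative to the optical axis.
    k1 -- radial coefficient measured for the optics (assumed known).
    Uses the first-order inverse x ~= x' / (1 + k1 r'^2); an exact
    inverse would iterate this correction a few times.
    """
    r2 = np.sum(xy**2, axis=1, keepdims=True)
    return xy / (1.0 + k1 * r2)
```

Applying the same mapping in the opposite direction gives the predistortion of rendered images discussed next.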
Typically, this involves predistorting the images so that they will appear undistorted after being displayed. Another way to perform digital compensation on the graphics is to apply the predistortion functions on the vertices of the polygons, in screen space, before rendering. This requires subdividing polygons that cover large areas in screen space. Both digital compensation methods can be computationally expensive, often requiring special hardware to accomplish in real time. Holloway determined that, for typical head motion, the additional system delay required by the distortion compensation adds more registration error than the distortion compensation removes.

2) Errors in the tracking system: Errors in the reported outputs from the tracking and sensing systems are often the most serious type of static registration errors. These distortions are not easy to measure and eliminate, because that requires another "3-D ruler" that is more accurate than the tracker being tested. These errors are often nonsystematic and difficult to fully characterize. Almost all commercially available tracking systems are not accurate enough to satisfy the requirements of AR systems. The later section on sensing discusses this important topic further.

3) Mechanical misalignments: Mechanical misalignments are discrepancies between the model or specification of the hardware and the actual physical properties of the real system. For example, the combiners, optics, and monitors in an optical see-through HMD may not be at the expected distances or orientations with respect to each other. If the frame is not sufficiently rigid, the various component parts may change their relative positions as the user moves around, causing errors. Mechanical misalignments can cause subtle changes in the position and orientation of the projected virtual images that are difficult to compensate. While some alignment errors can be calibrated, for many others it may be more effective to "build it right" initially.

4) Incorrect viewing parameters: Incorrect viewing parameters, the last major source of static registration errors, can be thought of as a special case of alignment errors where calibration techniques can be applied. Viewing parameters specify how to convert the reported head or camera locations into viewing matrices used by the scene generator to draw the graphic images. For an HMD-based system, these parameters include:

Center of projection and viewport dimensions
Offset, both in translation and orientation, between the location of the head tracker and the user's eyes
Field of view

Incorrect viewing parameters cause systematic static errors. Take the example of a head tracker located above a user's eyes. If the vertical translation offset between the tracker and the eyes is too small, all the virtual objects will appear lower than they should. In some systems, the viewing parameters are estimated by manual adjustments, in a nonsystematic fashion. Such approaches proceed as follows: place a real object in the environment and attempt to register a virtual object with that real object. While wearing the HMD or positioning the cameras, move to one viewpoint or a few selected viewpoints and manually adjust the location of the virtual object and the other viewing parameters until the registration "looks right." This may achieve satisfactory results if the environment and the viewpoint remain static. However, such approaches require a skilled user and generally do not achieve robust results for many viewpoints.
Achieving good registration from a single viewpoint is much easier than registration from a wide variety of viewpoints using a single set of parameters. Usually what happens is satisfactory registration at one viewpoint, but when the user walks to a significantly different viewpoint, the registration is inaccurate because of incorrect viewing parameters or tracker distortions. This means many different sets of parameters must be used, which is a less than satisfactory solution.

Another approach is to directly measure the parameters, using various measuring tools and sensors. For example, a commonly used optometrist's tool can measure the interpupillary distance. Rulers might measure the offsets between the tracker and eye positions. Cameras could be placed where the user's eyes would normally be in an optical see-through HMD. By recording what the camera sees, through the see-through HMD, of the real environment, one might be able to determine several viewing parameters. So far, direct measurement techniques have enjoyed limited success.

View-based tasks are another approach to calibration. These ask the user to perform various tasks that set up geometric constraints. By performing several tasks, enough information is gathered to determine the viewing parameters. For example, [Azuma94] asked a user wearing an optical see-through HMD to look straight through a narrow pipe mounted in the real environment. This sets up the constraint that the user's eye must be located along a line through the center of the pipe. Combining this with other tasks created enough constraints to measure all the viewing parameters. [Caudell92] used a different set of tasks, involving lining up two circles that specified a cone in the real environment. [Oishi96] moves virtual cursors to appear on top of beacons in the real environment. All view-based tasks rely upon the user accurately performing the specified task and assume the tracker is accurate. If the tracking and sensing equipment is not accurate, then multiple measurements must be taken and optimizers used to find the "best-fit" solution.

For video-based systems, an extensive body of literature exists in the robotics and photogrammetry communities on camera calibration techniques. Such techniques compute a camera's viewing parameters by taking several pictures of an object of fixed and sometimes unknown geometry. These pictures must be taken from different locations. Matching points in the 2-D images with corresponding 3-D points on the object sets up mathematical constraints. With enough pictures, these constraints determine the viewing parameters and the 3-D location of the calibration object. Alternatively, they can serve to drive an optimization routine that will search for the best set of viewing parameters that fits the collected data. Several AR systems have used camera calibration techniques.
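To illustrate how such point correspondences constrain the viewing parameters, the sketch below estimates a 3x4 projection matrix from matched 3-D/2-D points using the classic direct linear transformation (DLT) formulation. This is a generic textbook method, offered as an assumption-laden example rather than the specific algorithm of any system cited above.

```python
import numpy as np

def dlt_projection_matrix(X, x):
    """Estimate a 3x4 camera projection matrix from >= 6 correspondences.

    X -- Nx3 known 3-D points on the calibration object
    x -- Nx2 matching pixel locations found in the images
    Each pair contributes two linear constraints on the 12 entries of P;
    the least-squares solution is the smallest right singular vector.
    """
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Ph = [Xw, Yw, Zw, 1.0]                      # homogeneous 3-D point
        A.append(Ph + [0.0] * 4 + [-u * c for c in Ph])
        A.append([0.0] * 4 + Ph + [-v * c for c in Ph])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)                     # defined up to scale
```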
Dynamic errors

Dynamic errors occur because of system delays, or lags. The end-to-end system delay is defined as the time difference between the moment that the tracking system measures the position and orientation of the viewpoint and the moment when the generated images corresponding to that position and orientation appear in the displays. These delays exist because each component in an Augmented Reality system requires some time to do its job. The delays in the tracking subsystem, the communication delays, the time it takes the scene generator to draw the appropriate images in the frame buffers, and the scanout time from the frame buffer to the displays all contribute to end-to-end lag. End-to-end delays of 100 ms are fairly typical on existing systems. Simpler systems can have less delay, but other systems have more. Delays of 250 ms or more can exist on slow, heavily loaded, or networked systems.

End-to-end system delays cause registration errors only when motion occurs. Assume that the viewpoint and all objects remain still. Then the lag does not cause registration errors. No matter how long the delay is, the images generated are appropriate, since nothing has moved since the time the tracker measurement was taken. Compare this to the case with motion. For example, assume a user wears a see-through HMD and moves her head. The tracker measures the head at an initial time t. The images corresponding to time t will not appear until some future time t2, because of the end-to-end system delays. During this delay, the user's head remains in motion, so when the images computed at time t finally appear, the user sees them at a different location than the one they were computed for. Thus, the images are incorrect for the time they are actually viewed. To the user, the virtual objects appear to "swim around" and "lag behind" the real objects. This was graphically demonstrated in a videotape of UNC's ultrasound experiment shown at SIGGRAPH '92. In Figure 21, the picture on the left shows what the registration looks like when everything stands still. The virtual gray trapezoidal region represents what the ultrasound wand is scanning. This virtual trapezoid should be attached to the tip of the real ultrasound wand. This is the case in the picture on the left, where the tip of the wand is visible at the bottom of the picture, to the left of the "UNC" letters. But when the head or the wand moves, large dynamic registration errors occur, as shown in the picture on the right. The tip of the wand is now far away from the virtual trapezoid. Also note the motion blur in the background, which is caused by the user's head motion.

Figure 21: Effect of motion and system delays on registration. Picture on the left is a static scene. Picture on the right shows motion. (Courtesy UNC Chapel Hill Dept. of Computer Science)

System delays seriously hurt the illusion that the real and virtual worlds coexist because they cause large registration errors. With a typical end-to-end lag of 100 ms and a moderate head rotation rate of 50 degrees per second, the angular dynamic error is 5 degrees. At a 68 cm arm length, this results in registration errors of almost 60 mm. System delay is the largest single source of registration error in existing AR systems, outweighing all others combined.
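The arithmetic behind those numbers is worth making explicit; a few lines suffice:

```python
import math

lag_s    = 0.100    # typical end-to-end delay, seconds
rate_dps = 50.0     # moderate head rotation rate, degrees per second
arm_m    = 0.68     # arm's-length working distance, meters

angular_err = rate_dps * lag_s                     # degrees of error
linear_err  = arm_m * math.radians(angular_err)    # small-angle arc length
print(f"{angular_err:.1f} deg -> {linear_err * 1000:.0f} mm at arm's length")
# -> 5.0 deg -> 59 mm: the "almost 60 mm" figure quoted above.

# Lag budget needed to keep errors under 0.5 deg at the same head rate:
print(f"max lag: {0.5 / rate_dps * 1000:.0f} ms")  # -> 10 ms
```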
Methods used to reduce dynamic registration errors fall under four main categories:

Reduce system lag
Reduce apparent lag
Match temporal streams (with video-based systems)
Predict future locations

1) Reduce system lag: The most direct approach is simply to reduce, or ideally eliminate, the system delays. If there are no delays, there are no dynamic errors. Unfortunately, modern scene generators are usually built for throughput, not minimal latency. It is sometimes possible to reconfigure the software to sacrifice throughput to minimize latency. For example, the SLATS system completes rendering a pair of interlaced NTSC images in one field time (16.67 ms) on Pixel-Planes 5. Being careful about synchronizing pipeline tasks can also reduce the end-to-end lag.

System delays are not likely to completely disappear anytime soon. Some believe that the current course of technological development will automatically solve this problem. Unfortunately, it is difficult to reduce system delays to the point where they are no longer an issue. Recall that registration errors must be kept to a small fraction of a degree. At the moderate head rotation rate of 50 degrees per second, system lag must be 10 ms or less to keep angular errors below 0.5 degrees. Just scanning out a frame buffer to a display at 60 Hz requires 16.67 ms. It might be possible to build an HMD system with less than 10 ms of lag, but the drastic cut in throughput and the expense required to construct the system would make alternate solutions attractive. Minimizing system delay is important, but reducing delay to the point where it is no longer a source of registration error is not currently practical.

2) Reduce apparent lag: Image deflection is a clever technique for reducing the amount of apparent system delay for systems that only use head orientation. It is a way to incorporate more recent orientation measurements into the late stages of the rendering pipeline. Therefore, it is a feed-forward technique. The scene generator renders an image much larger than needed to fill the display. Then just before scanout, the system reads the most recent orientation report. The orientation value is used to select the fraction of the frame buffer to send to the display, since small orientation changes are equivalent to shifting the frame buffer output horizontally and vertically. Image deflection does not work on translation, but image warping techniques might. After the scene generator renders the image based upon the head tracker reading, small adjustments in orientation and translation could be done after rendering by warping the image. These techniques assume knowledge of the depth at every pixel, and the warp must be done much more quickly than rerendering the entire image.

3) Match temporal streams: In video-based AR systems, the video camera and digitization hardware impose inherent delays on the user's view of the real world. This is potentially a blessing when reducing dynamic errors, because it allows the temporal streams of the real and virtual images to be matched. Additional delay is added to the video from the real world to match the scene generator delays in generating the virtual images. This additional delay to the video stream will probably not remain constant, since the scene generator delay will vary with the complexity of the rendered scene. Therefore, the system must dynamically synchronize the two streams. Note that while this reduces conflicts between the real and virtual, now both the real and virtual objects are delayed in time. While this may not be bothersome for small delays, it is a major problem in the related area of telepresence systems and will not be easy to overcome. For long delays, this can produce negative effects such as pilot-induced oscillation.
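A minimal sketch of the dynamic synchronization just described: buffer the camera frames, tagged with timestamps, and release the one whose grab time best matches the tracker reading used for the current virtual frame. The class and its interface are illustrative assumptions, not a description of any cited system.

```python
from collections import deque

class VideoDelayLine:
    """Delay the real-world video stream to match the (varying)
    scene-generator delay of the virtual stream."""

    def __init__(self):
        self.frames = deque()                 # (timestamp, frame) pairs

    def push(self, timestamp, frame):
        """Store a camera frame as soon as it is digitized."""
        self.frames.append((timestamp, frame))

    def pop_matching(self, tracker_time):
        """Return the camera frame grabbed closest to (and not after)
        the moment the tracker reading for this virtual frame was taken,
        discarding any older frames along the way."""
        best = None
        while self.frames and self.frames[0][0] <= tracker_time:
            best = self.frames.popleft()
        return best[1] if best else None
```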
4) Predict: The last method is to predict the future viewpoint and object locations. If the future locations are known, the scene can be rendered with these future locations, rather than the measured locations. Then when the scene finally appears, the viewpoints and objects have moved to the predicted locations, and the graphic images are correct at the time they are viewed. For short system delays (under ~80 ms), prediction has been shown to reduce dynamic errors by up to an order of magnitude [Azuma94]. Accurate predictions require a system built for real-time measurements and computation. Using inertial sensors makes predictions more accurate by a factor of 2-3. Predictors have been developed for a few AR systems, but the majority were implemented and evaluated with VE systems. More work needs to be done on ways of comparing the theoretical performance of various predictors and on developing prediction models that better match actual head motion.

Vision-based techniques

Mike Bajura and Ulrich Neumann point out that registration based solely on the information from the tracking system is like building an "open-loop" controller. The system has no feedback on how closely the real and virtual actually match. Without feedback, it is difficult to build a system that achieves perfect matches. However, video-based approaches can use image processing or computer vision techniques to aid registration. Since video-based AR systems have a digitized image of the real environment, it may be possible to detect features in the environment and use those to enforce registration. They call this a "closed-loop" approach, since the digitized image provides a mechanism for bringing feedback into the system.

This is not a trivial task. The detection and matching must run in real time and must be robust, which often requires special hardware and sensors. However, it is also not an "AI-complete" problem, because it is simpler than the general computer vision problem. For example, in some AR applications it is acceptable to place fiducials in the environment. These fiducials may be LEDs or special markers. Recent ultrasound experiments at UNC Chapel Hill have used colored dots as fiducials. The locations or patterns of the fiducials are assumed to be known. Image processing detects the locations of the fiducials, and then those are used to make corrections that enforce proper registration. These routines assume that one or more fiducials are visible at all times; without them, the registration can fall apart. But when the fiducials are visible, the results can be accurate to one pixel, which is about as close as one can get with video techniques.
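As an illustration of the fiducial step just described, the sketch below finds the centroid of a colored dot by thresholding. It is deliberately simplified (one dot, no blob segmentation, a caller-supplied color test) and is not the actual UNC pipeline.

```python
import numpy as np

def find_fiducial(rgb, is_marker_color):
    """Locate one colored-dot fiducial in a video frame.

    is_marker_color -- predicate over an HxWx3 array returning a boolean
    mask of marker pixels; a real system would also segment the mask into
    separate blobs and identify which known fiducial each blob is.
    """
    mask = is_marker_color(rgb)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                  # fiducial not visible: fall back
    return np.array([xs.mean(), ys.mean()])   # (u, v) centroid in pixels

# The measured centroid is compared against where the known fiducial
# *should* project under the tracked camera pose; the residual drives the
# correction that closes the registration loop.
```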
Figure 22, taken from [Bajura95], shows a virtual arrow and a virtual chimney exactly aligned with their desired points on two real objects. The real objects each have an LED to aid the registration. Figures 23 through 25 show registration from [Mellor95a], which uses dots with a circular pattern as the fiducials. The registration is also nearly perfect. Figure 26 demonstrates merging virtual objects with the real environment, using colored dots as the fiducials in a video-based approach. In the picture on the left, the stacks of cards in the center are real, but the ones on the right are virtual. Notice that they penetrate one of the blocks. In the image on the right, a virtual spiral object interpenetrates the real blocks and table and also casts virtual shadows upon the real objects.

Figure 22: A virtual arrow and virtual chimney aligned with two real objects. (Courtesy Mike Bajura, UNC Chapel Hill Dept. of Computer Science, and Ulrich Neumann, USC)

Figure 23: Real skull with five fiducials. (Courtesy J.P. Mellor, MIT AI Lab)

Figure 24: Virtual wireframe skull registered with real skull. (Courtesy J.P. Mellor, MIT AI Lab)

Figure 25: Virtual wireframe skull registered with real skull moved to a different position. (Courtesy J.P. Mellor, MIT AI Lab)

Figure 26: Virtual cards and spiral object merged with real blocks and table. (Courtesy Andrei State, UNC Chapel Hill Dept. of Computer Science.)

Instead of fiducials, [Uenohara95] uses template matching to achieve registration. Template images of the real object are taken from a variety of viewpoints. These are used to search the digitized image for the real object. Once that is found, a virtual wireframe can be superimposed on the real object. Recent approaches in video-based matching avoid the need for any calibration. [Kutukalos96] represents virtual objects in a non-Euclidean, affine frame of reference that allows rendering without knowledge of camera parameters. [Iu96] extracts contours from the video of the real world, and then uses an optimization technique to match the contours of the rendered 3-D virtual object with the contours extracted from the video. Note that calibration-free approaches may not recover all the information required to perform all potential AR tasks. For example, these two approaches do not recover true depth information, which is useful when compositing the real and the virtual.

Techniques that use fiducials as the sole tracking source determine the relative projective relationship between the objects in the environment and the video camera. While this is enough to ensure registration, it does not provide all the information one might need in some AR applications, such as the absolute (rather than relative) locations of the objects and the camera. Absolute locations are needed to include virtual and real objects that are not tracked by the video camera, such as a 3-D pointer or other virtual objects not directly tied to real objects in the scene.

Additional sensors besides video cameras can aid registration. Both [Mellor95a] [Mellor95b] and [Grimson94] [Grimson95] use a laser rangefinder to acquire an initial depth map of the real object in the environment. Given a matching virtual model, the system can match the depth maps from the real and virtual until they are properly aligned, and that provides the information needed for registration.

Another way to reduce the difficulty of the problem is to accept the fact that the system may not be robust and may not be able to perform all tasks automatically. Then it can ask the user to perform certain tasks. The system in [Sharma94] expects manual intervention when the vision algorithms fail to identify a part because the view is obscured. The calibration techniques in [Tuceryan95] are heavily based on computer vision techniques, but they ask the user to manually intervene by specifying correspondences when necessary.

Current status

The registration requirements for AR are difficult to satisfy, but a few systems have achieved good results. [Azuma94] is an open-loop system that shows registration typically within ±5 millimeters from many viewpoints for an object at about arm's length. Closed-loop systems, however, have demonstrated nearly perfect registration, accurate to within a pixel. The registration problem is far from solved. Many systems assume a static viewpoint, static objects, or even both. Even if the viewpoint or objects are allowed to move, they are often restricted in how far they can travel.
Registration has been shown only under controlled circumstances, often with only a small number of real-world objects, or with objects that are already well known to the system. For example, registration may only work on one object marked with fiducials, and not on any other objects in the scene. Much more work needs to be done to increase the domains in which registration is robust. Duplicating registration methods remains a nontrivial task, due to both the complexity of the methods and the additional hardware required. If simple yet effective solutions could be developed, that would speed the acceptance of AR systems.

Sensing

Accurate registration and positioning of virtual objects in the real environment requires accurate tracking of the user's head and sensing the locations of other objects in the environment. The biggest single obstacle to building effective Augmented Reality systems is the requirement of accurate, long-range sensors and trackers that report the locations of the user and the surrounding objects in the environment. Commercial trackers are aimed at the needs of Virtual Environments and motion capture applications. Compared to those two applications, Augmented Reality has much stricter accuracy requirements and demands larger working volumes. No tracker currently provides high accuracy at long ranges in real time. More work needs to be done to develop sensors and trackers that can meet these stringent requirements. Specifically, AR demands more from trackers and sensors in three areas:

Greater input variety and bandwidth
Higher accuracy
Longer range

Input variety and bandwidth

VE systems are primarily built to handle output bandwidth: the images displayed, sounds generated, etc. The input bandwidth is tiny: the locations of the user's head and hands, the outputs from the buttons and other control devices, etc. AR systems, however, will need a greater variety of input sensors and much more input bandwidth. There are a greater variety of possible input sensors than output displays. Outputs are limited to the five human senses. Inputs can come from anything a sensor can detect. Robinett speculates that Augmented Reality may be useful in any application that requires displaying information not directly available or detectable by human senses, by making that information visible (or audible, touchable, etc.). Recall that the proposed medical applications discussed earlier use CT, MRI, and ultrasound sensors as inputs. Other future applications might use sensors to extend the user's visual range into infrared or ultraviolet frequencies, and remote sensors would let users view objects hidden by walls or hills. Conceptually, anything not detectable by human senses but detectable by machines might be transduced into something that a user can sense in an AR system.

Range data is a particular input that is vital for many AR applications. The AR system knows the distance to the virtual objects, because that model is built into the system. But the AR system may not know where all the real objects are in the environment. The system might assume that the entire environment is measured at the beginning and remains static thereafter. However, some useful applications will require a dynamic environment, in which real objects move, so the objects must be tracked in real time. For some applications, though, a depth map of the real environment would be sufficient. That would allow real objects to occlude virtual objects through a pixel-by-pixel depth value comparison.
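Such a depth-keyed merge is the depth-information composition mentioned earlier in the optical-vs-video discussion. A minimal per-pixel sketch, assuming registered real and virtual depth maps are already available:

```python
import numpy as np

def composite_with_depth(real_rgb, real_depth, virt_rgb, virt_depth):
    """Merge real and virtual imagery by per-pixel depth comparison.

    real_rgb, virt_rgb     -- HxWx3 color images
    real_depth, virt_depth -- HxW depth maps in the same units
    Whichever surface is nearer at a pixel wins, so real objects can
    correctly occlude virtual ones and vice versa.
    """
    virtual_wins = virt_depth < real_depth       # HxW boolean mask
    out = real_rgb.copy()
    out[virtual_wins] = virt_rgb[virtual_wins]
    return out
```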
Acquiring this depth map in real time is not trivial. Sensors like laser rangefinders might be used. Many computer vision techniques for recovering shape through various strategies (e.g., "shape from stereo" or "shape from shading") have been tried. Recent work uses intensity-based matching from a pair of stereo images to do depth recovery. Recovering depth through existing vision techniques is difficult to do robustly in real time.

Finally, some annotation applications require access to a detailed database of the environment, which is a type of input to the system. For example, the architectural application of "seeing into the walls" assumes that the system has a database of where all the pipes, wires, and other hidden objects are within the building. Such a database may not be readily available, and even if it is, it may not be in a format that is easily usable. For example, the data may not be grouped to segregate the parts of the model that represent wires from the parts that represent pipes. Thus, a significant modeling effort may be required and should be taken into consideration when building an AR application.

High accuracy

The accuracy requirements for the trackers and sensors are driven by the accuracies needed for visual registration, as described in the previous section. For many approaches, the registration is only as accurate as the tracker. Therefore, the AR system needs trackers that are accurate to around a millimeter and a tiny fraction of a degree, across the entire working range of the tracker. Few trackers can meet this specification, and every technology has weaknesses. Some mechanical trackers are accurate enough, although they tether the user to a limited working volume. Magnetic trackers are vulnerable to distortion by metal in the environment, which exists in many desired AR application environments. Ultrasonic trackers suffer from noise and are difficult to make accurate at long ranges because of variations in the ambient temperature. Optical technologies have distortion and calibration problems. Inertial trackers drift with time. Of the individual technologies, optical technologies show the most promise, due to trends toward high-resolution digital cameras, real-time photogrammetric techniques, and structured light sources that result in more signal strength at long distances. Future tracking systems that can meet the stringent requirements of AR will probably be hybrid systems, such as a combination of inertial and optical technologies. Using multiple technologies opens the possibility of covering for each technology's weaknesses by combining their strengths.

Attempts have been made to calibrate the distortions in commonly used magnetic tracking systems. These have succeeded at removing much of the gross error from the tracker at long ranges, but not to the level required by AR systems. For example, mean errors at long ranges can be reduced from several inches to around one inch.

The requirements for registering other sensor modes are not nearly as stringent. For example, the human auditory system is not very good at localizing deep bass sounds, which is why subwoofer placement is not critical in a home theater system.

Long range

Few trackers are built for accuracy at long ranges, since most VE applications do not require long ranges. Motion capture applications track an actor's body parts to control a computer-animated character or to analyze an actor's movements.
Attempts have been made to calibrate the distortions in commonly used magnetic tracking systems. These have succeeded in removing much of the gross error from the tracker at long ranges, but not to the level required by AR systems. For example, mean errors at long ranges can be reduced from several inches to around one inch.

The requirements for registering other sensor modes are not nearly as stringent. For example, the human auditory system is not very good at localizing deep bass sounds, which is why subwoofer placement is not critical in a home theater system.

Long range
Few trackers are built for accuracy at long ranges, since most VE applications do not require long ranges. Motion-capture applications track an actor's body parts to control a computer-animated character or to analyse the actor's movements. This is fine for position recovery, but not for orientation, because orientation must be computed from the recovered positions: even tiny errors in those positions can cause orientation errors of a few degrees, which is too large for AR systems.

Two scalable tracking systems for HMDs have been described in the literature. A scalable system is one that can be expanded to cover any desired range, simply by adding more modular components. This is done by building a cellular tracking system, in which only nearby sources and sensors are used to track a user. As the user walks around, the set of sources and sensors changes, achieving large working volumes while avoiding long distances between the current working set of sources and sensors. While scalable trackers can be effective, they are complex and by their very nature have many components, making them relatively expensive to construct.

The Global Positioning System (GPS) is used to track the locations of vehicles almost anywhere on the planet. It might be useful as one part of a long-range tracker for AR systems, but by itself it will not be sufficient. The best reported accuracy, with GPS run in differential mode, is approximately one centimeter, and that assumes many measurements are integrated, so this accuracy is not available in real time. It is not sufficient to recover orientation from a set of positions on a user. Tracking an AR system outdoors in real time with the required accuracy has not been demonstrated and remains an open problem.

Future directions
This section identifies areas and approaches that require further research to produce improved AR systems.

Hybrid approaches: Future tracking systems may be hybrids, because combining approaches can cover weaknesses. The same may be true for other problems in AR. For example, current registration strategies generally focus on a single method; future systems may be more robust if several techniques are combined. An example is combining vision-based techniques with prediction: if the fiducials are not available, the system switches to open-loop prediction to reduce the registration errors, rather than breaking down completely, and the predicted viewpoints in turn provide a more accurate initial location estimate for the vision-based techniques.

Real-time systems and time-critical computing: Many VE systems do not truly run in real time. Instead, it is common to build the system, often on UNIX, and then see how fast it runs. This may be sufficient for some VE applications: since everything is virtual, all the objects are automatically synchronized with each other. AR is a different story. The virtual and the real must be synchronized, and the real world "runs" in real time. Therefore, effective AR systems must be built with real-time performance in mind. Accurate timestamps must be available. Operating systems must not arbitrarily swap out the AR software process at any time, for arbitrary durations. Systems must be built to guarantee completion within specified time budgets, rather than just "running as quickly as possible". These are characteristics of flight simulators and a few VE systems. Constructing and debugging real-time systems is often painful and difficult, but the requirements of AR demand real-time performance.
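To illustrate the difference between "as fast as possible" and a fixed time budget, here is a minimal sketch (an assumption of mine, not code from any AR system): each frame is timestamped and held to a fixed budget. The `track` and `render` callables and the 60 Hz figure are hypothetical.

```python
import time

FRAME_BUDGET = 1.0 / 60.0   # 60 Hz target; an illustrative figure

def frame_loop(track, render):
    """Fixed-budget loop: every frame is timestamped and must finish
    within FRAME_BUDGET, rather than 'running as fast as possible'."""
    while True:
        start = time.monotonic()
        pose = track(timestamp=start)        # stamped sensor reading
        render(pose, budget=FRAME_BUDGET)    # renderer should degrade
                                             # gracefully, not overrun
        remaining = FRAME_BUDGET - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)            # hold a constant frame rate
        # Note: a general-purpose OS gives no hard guarantee here; a
        # real deployment would need real-time scheduling support.
```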
Perceptual and psychophysical studies: Augmented Reality is an area ripe for psychophysical studies. How much lag can a user detect? How much registration error is detectable when the head is moving? Besides questions on perception, psychological experiments that explore performance issues are also needed. How much does head-motion prediction improve user performance on a specific task? How much registration error is tolerable for a specific application before performance on that task degrades substantially? Is the allowable error larger while the user moves her head than when she stands still? Furthermore, not much is known about potential optical illusions caused by errors or conflicts in the simultaneous display of real and virtual objects, and few experiments in this area have been performed. Jannick Rolland, Frank Biocca and their students conducted a study of the effect caused by eye displacements in video see-through HMDs. They found that users partially adapted to the eye displacement, but also suffered negative aftereffects after removing the HMD. Steve Ellis' group at NASA Ames has conducted work on perceived depth in a see-through HMD, and ATR has also conducted a study.

Portability: The previous section explained why some potential AR applications require giving the user the ability to walk around large environments, even outdoors. This requires making the equipment self-contained and portable. Existing tracking technology is not capable of tracking a user outdoors at the required accuracy.

Multimodal displays: Almost all work in AR has focused on the visual sense: virtual graphic objects and overlays. But as explained in the previous section, augmentation might apply to all the other senses as well. In particular, adding and removing 3-D sound is a capability that could be useful in some AR applications.

Social and political issues: Technological issues are not the only ones that need to be considered when building a real application. There are also social and political dimensions to getting new technologies into the hands of real users. Sometimes perception is what counts, even if the technological reality is different. For example, if workers perceive lasers to be a health risk, they may refuse to use a system with lasers in the display or in the trackers, even if those lasers are eye-safe. Ergonomics and ease of use are paramount considerations. Whether AR is truly a cost-effective solution in its proposed applications has yet to be determined. Another important factor is whether or not the technology is perceived as a threat to jobs, as a replacement for workers, especially with many corporations undergoing recent layoffs. AR may do well in this regard, because it is intended as a tool to make the user's job easier, rather than something that completely replaces the human worker. Although technology transfer is not normally a subject of academic papers, it is a real problem. Social and political concerns should not be ignored during attempts to move AR out of the research lab and into the hands of real users.

Conclusion
Augmented Reality is far behind Virtual Environments in maturity. Several commercial vendors sell complete, turnkey Virtual Environment systems, but no commercial vendor currently sells an HMD-based Augmented Reality system. A few monitor-based "virtual set" systems are available, but today AR systems are primarily found in academic and industrial research laboratories. The first deployed HMD-based AR systems will probably be in aircraft manufacturing; both Boeing and McDonnell Douglas are exploring this technology. The former uses optical approaches, while the latter is pursuing video approaches.
Boeing has performed trial runs with workers using a prototype system but has not yet made any deployment decisions. Annotation and visualization applications in restricted, limited-range environments are deployable today, although much more work needs to be done to make them cost-effective and flexible. Applications in medical visualization will take longer. Prototype visualization aids have been used on an experimental basis, but the stringent registration requirements and the ramifications of mistakes will postpone common usage for many years. AR will probably be used for medical training before it is commonly used in surgery.

The next generation of combat aircraft will have Helmet-Mounted Sights with graphics registered to targets in the environment [Wanstall89]. These displays, combined with short-range steerable missiles that can shoot at targets off-boresight, give a tremendous combat advantage to pilots in dogfights. Instead of having to be directly behind his target in order to shoot at it, a pilot can now shoot at anything within a 60-90 degree cone of his aircraft's forward centerline. Russia and Israel currently have systems with this capability, and the U.S. is expected to field the AIM-9X missile with its associated Helmet-Mounted Sight in 2002. Registration errors due to delays are a major problem in this application.

Augmented Reality is a relatively new field, where most of the research efforts have occurred in the past four years, as shown by the references listed at the end of this paper. The SIGGRAPH "Rediscovering Our Fire" report identified Augmented Reality as one of four areas where SIGGRAPH should encourage more submissions. Because of the numerous challenges and unexplored avenues in this area, AR will remain a vibrant area of research for at least the next several years.

One area where a breakthrough is required is tracking an HMD outdoors at the accuracy required by AR. If this is accomplished, several interesting applications will become possible. Two examples are described here: navigation maps and visualization of past and future environments.

The first application is a navigation aid for people walking outdoors. These individuals could be soldiers advancing upon their objective, hikers lost in the woods, or tourists seeking directions to their intended destination. Today, these individuals must pull out a physical map and associate what they see in the real environment around them with the markings on the 2-D map. If landmarks are not easily identifiable, this association can be difficult to perform, as anyone lost in the woods can attest. An AR system makes navigation easier by performing the association step automatically. If the user's position and orientation are known, and the AR system has access to a digital map of the area, then it can draw the map in 3-D directly upon the user's view. The user looks at a nearby mountain and sees graphics overlaid on the real environment explaining the mountain's name, how tall it is, how far away it is, and where the trail is that leads to the top.

The second application is visualization of locations and events as they were in the past or as they will be after future changes. Tourists who visit historical sites, such as a Civil War battlefield or the Acropolis in Athens, Greece, do not see these locations as they were in the past, because of changes over time.
It is often difficult for a modern visitor to imagine what these sites really looked like in the past. To help, some historical sites stage "Living History" events where volunteers wear period clothes and reenact historical events. A tourist equipped with an outdoor AR system could see a computer-generated version of Living History: the HMD could cover up modern buildings and monuments in the background and show, directly on the grounds at Gettysburg, where the Union and Confederate troops were at the fateful moment of Pickett's Charge. The gutted interior of the modern Parthenon would be filled in by computer-generated representations of what it looked like in 430 BC, including the long-vanished gold statue of Athena in the middle. Tourists and students walking around the grounds with such AR displays would gain a much better understanding of these historical sites and the important events that took place there. Similarly, AR displays could show what proposed architectural changes would look like before they are carried out. An urban designer could show clients and politicians what a new stadium would look like as they walked around the adjoining neighborhood, to better understand how the stadium project will affect nearby residents.

After the basic problems with AR are solved, the ultimate goal will be to generate virtual objects so realistic that they are virtually indistinguishable from the real environment. Photorealism has been demonstrated in feature films, but accomplishing it in an interactive application will be much harder. Lighting conditions, surface reflections, and other properties must be measured automatically, in real time. More sophisticated lighting, texturing, and shading capabilities must run at interactive rates in future scene generators. Registration must be nearly perfect, without manual intervention or adjustments. While these are difficult problems, they are probably not insurmountable. It took about 25 years to progress from drawing stick figures on a screen to the photorealistic dinosaurs in "Jurassic Park". Within another 25 years, we should be able to wear a pair of AR glasses outdoors to see and interact with photorealistic dinosaurs eating a tree in our backyard.

Computer Supported Cooperative Work (CSCW)
Overview
The power of the Web as a new medium derives not only from its ability to let people communicate across vast distances and times, but also from the ability of machines to help people communicate and manage information. The Web is a complex distributed system, and object technology has been an important part of managing that complexity since the Web's creation. Despite the growth of interest in the field of Computer Supported Cooperative Work (CSCW), and the increasingly large number of systems that have been developed, it is still the case that few systems have been adopted for widespread use. This is particularly true for widely dispersed, cross-organisational working groups, where problems of heterogeneity in computing hardware and software environments inhibit the deployment of CSCW technologies. With a lightweight and extensible client-server architecture, client implementations for all popular computing platforms, and an existing user base numbered in millions, the World Wide Web offers great potential for solving some of these problems and providing an `enabling technology' for CSCW applications.
I illustrate this potential using work with the BSCW shared workspace system, an extension to the Web architecture which provides basic facilities for collaborative information sharing from unmodified Web browsers. I conclude that, despite limitations in the range of applications that can be directly supported, building on the strengths of the Web can give significant benefits in easing the development and deployment of CSCW applications.

Introduction
Over the last decade the level of interest in the field of Computer Supported Cooperative Work (CSCW) has grown enormously, and an ever-increasing number of systems have been developed with the goal of supporting collaborative work. These efforts have led to a greater understanding of the complexity of group work, and the implications of this complexity, in terms of the flexibility required of supporting computer systems, have driven much of the recent work in the field. Despite these advances, however, it is still the case that few cooperative systems are in widespread use, and most exist only as laboratory-based prototypes. This is particularly true for widely dispersed working groups, where electronic mail and simple file-transfer programs remain the state of the art in providing computer support for collaborative work.

In this section I examine the World Wide Web as a technology for enabling the development of more effective CSCW systems. The Web provides a simple client-server architecture, with client programs (browsers) implemented for all popular computing platforms and a central server component that can be extended through a standard API. The Web has been extremely successful in providing a simple method for users to search, browse and retrieve information, as well as publish information of their own, but it does not currently offer features for more collaborative forms of information sharing such as joint document production.

There are a number of reasons to suggest that the Web might be a suitable focus for developers of CSCW systems. For widely dispersed working groups, where members may be in different organisations and different countries, issues of integration and interoperability often make it difficult to deploy existing groupware applications. Although non-computer-based solutions such as telephone and video conferencing technologies provide some support for collaboration, empirical evidence suggests that computer systems providing access to shared information, at any time and place and using minimal technical infrastructure, are the main requirement of groups collaborating in decentralised working environments. By offering an extensible centralised architecture and cross-platform browser implementations, increasingly deployed and integrated with user environments, the Web may provide a means of introducing CSCW systems which offer much richer support for collaboration than email and FTP, and thus serve as an `enabling technology' for CSCW. In the following section I discuss the need for such enabling technologies to address problems of system development and deployment. I then give an overview of the Web architecture and components and critically examine these in the context of CSCW systems development.
I suggest that the Web is limited in the range of CSCW systems that can be developed on the basic architecture and, in its current form, is most suited to asynchronous, centralised CSCW applications with no strong requirements for notification, disconnected working or rich user interfaces. I describe the benefits of the Web as a platform for deploying such applications in real work domains, and conclude with a discussion of some current developments which may ease the limitations of the Web as a platform for system development and increase its utility as an enabling technology for CSCW.

What is CSCW?
Computer Supported Cooperative Work, or CSCW, is a rapidly growing multidisciplinary field. As personal workstations get more powerful and networks get faster and wider, the stage seems to be set for using computers not only to help accomplish our everyday, personal tasks but also to help us communicate and work with others. Indeed, group activities occupy a large amount of our time: meetings, telephone calls, mail (electronic or not), but also informal encounters in corridors, coordination with secretaries, team members or managers, and so on. In fact, work is so much group work that it is surprising to see how poorly computer systems support group activities. For example, many documents (such as this work) are created by multiple authors, yet no commercial tool currently allows a group of authors to create such shared documents as easily as one can create a single-author document. We have all experienced the nightmares of multiple copies being edited in parallel, format conversion, mail and file transfers, and the like.

CSCW is a research area that examines issues relating to the design of computer systems to support people working together. This seemingly all-encompassing definition is in part a reaction to what has been seen as a set of implicit design assumptions in many computer applications: that they are intended to support users doing their work on their own. In cases where a scarce resource (such as early computers themselves, a database, or even a digital library) has to be shared, systems designers have minimised the effects of this shared activity and tried to create the illusion of the (presumed ideal) case of exclusive access to resources. We see the same assumptions in discussions of digital libraries as a way of offering access to resources without the need to compete with (or even be aware of the existence of) other library users. By contrast, CSCW acknowledges that people work together as a way of managing complex tasks. Despite the wilder claims of Artificial Intelligence, not all these tasks can be automated, so it is sensible to design systems that allow people to collaborate more effectively. This can also open up opportunities for collaboration that were previously impossible, overly complex or too expensive, such as working not merely with colleagues in the same office, but via video and audio links with colleagues in a different building or on a different continent. CSCW has a strong interdisciplinary tradition, drawing on researchers from computer science, sociology, management, psychology and communication. Although the bulk of this article is about how CSCW might be used in libraries, the contention is that CSCW should also be informed by work in library and information science.

The world of CSCW is often described in terms of the time and space in which a collaborative activity occurs.
Collaboration can be between people in the same place (co-located) or in different places (remote), and it can happen at the same time (synchronous) or be separated in time (asynchronous). Figure 27 illustrates the possibilities.

Figure 27 - The CSCW spatial and temporal quadrants

Examples from the various quadrants are:
same time, same place: meeting support tools.
same time, different place: video conferencing.
different time, same place: a design team's shared room containing specialist equipment.
different time, different place: email systems.

CSCW radically changes the status of the computer. Until now, the computer has been used as a tool to solve problems. With CSCW, the computer/network is a medium: a means to communicate with other human beings, a vector for information rather than a box that stores and crunches data. If we look at the history of technology, new media have been much more difficult to invent, create and operate than new tools. From this perspective, it is not surprising that CSCW has not yet realized its full potential, even in the research community. I hope this report will help readers to better understand the challenges and promises of CSCW and encourage new developments both in research and in industry.

CSCW is not recent. Back in the late 1960s, Doug Engelbart created the NLS/Augment system, which featured most of the functions that today's systems are trying to implement, such as real-time shared editing of outlines, shared annotations of documents, and videoconferencing. The field really emerged in the 1980s and has been growing since then, boosted in recent years by the explosion of the Internet and the World Wide Web. The Web itself is not a very collaborative system: pages can easily be published, but it is impossible (or very difficult) to share them, e.g. to know when someone is reading a particular page or when a page has been modified.

The range and complexity of the problems to be solved in supporting cooperative activities is rapidly overwhelming: data sharing, concurrency control, conflict management, access control, performance, reliability; the list goes on. In addition to these technical difficulties, there is another, maybe harder, problem in implementing groupware: people. For a medium to work, there must be an audience that accepts to use it. Usability studies have stressed the need to take users into account when designing, developing and evaluating interactive software. For groupware, usability issues go beyond the now well-understood (if not always well-applied) methods from psychology and design. They involve the social sciences, to understand how people work together, how an organization imposes and/or adapts to the work practices of its workers, and so on. In many CSCW projects, ethnographic studies have been conducted to better understand the nature of the problem and the possible solutions. A large body of the research work in CSCW is conducted by social scientists, often within multidisciplinary teams. Computer scientists often ignore or look down upon this aspect of CSCW and almost always misunderstand it. User-centered design is essential to ensure that computer scientists solve the right problems in the right way. Traditional software works as soon as it "does the job"; interactive software works better if it is easy to use rather than if it has more functions; groupware works only if it is compatible with the work practices of its users.
A large part of this section is devoted to the exploration of these problems and the state of the art of their solutions. In fact, CSCW challenges most of the assumptions that were explicitly or implicitly embodied in the design of our current computer systems. CSCW tools, or groupware, are by nature distributed and interactive. To succeed in the marketplace, they must be safe (authentication), interoperable (from network protocols to operating systems and GUI platforms), and fault-tolerant and robust (you don't want to be slowed down or lose your data if another participant in the session uses a slow connection or experiences a crash).

The need for enabling technologies for CSCW
Most of the CSCW systems developed to date have been constructed in laboratories as research prototypes. This is perhaps not surprising, as CSCW systems place novel requirements on underlying technology such as distributed systems and databases, and many of the mechanisms developed to support multi-user interaction do not address issues of cooperation such as activity awareness and coordination. This has focused much attention on the development of mechanisms to support floor management, user interface `coupling', update propagation and so on, and has resulted in a range of experimental systems tailored to the particular issues being investigated. The proprietary and incompatible architectures on which many are based, the esoteric hardware and software required, and the lack of integration with existing application programs and data formats inhibit deployment outside the laboratory and within the intended application domain.

It might be argued that this situation is not unduly problematic: issues of system deployment are `implementation concerns' which would be addressed by re-implementation of system prototypes. The lack of system deployment does however pose a serious question for CSCW: if systems built to investigate particular models or mechanisms are never deployed and evaluated in use, how can we determine the effectiveness of those models and mechanisms in supporting cooperative work? A central concern of CSCW is the need for systems which are sensitive to their contexts of use, and a body of empirical data exists to show the problems caused when systems are introduced which do not resonate with existing work practice. When systems do not leave the research laboratory, it is difficult to see how the models and mechanisms they propose can be assessed other than from a technical perspective. Recent calls for CSCW systems to be designed so they can be evaluated in use, and for a more situated approach to system evaluation, reflect this need to migrate CSCW systems out of the laboratory and into the field if we are eventually to provide more effective systems. This migration is far from trivial, as the diversity of machines, operating systems and application software which characterises the real work domain is often far removed from the homogeneity of the laboratory. This is particularly true for working groups which cross departmental or organisational boundaries, where issues of integration and interoperability mean it is extremely unlikely that systems developed as research prototypes can be directly deployed.
Adaptation or re-implementation of system prototypes for deployment outside the laboratory is usually beyond the resources of most research projects, suggesting that the issue of system deployment and its attendant problems should not be tackled at the end of prototype development, but should be a central focus of the system design. Developing CSCW systems that integrate smoothly with the systems, applications and data formats already in place in the work domain adds considerably to what is already a complex design task.

A number of researchers have pointed to the need for tools to assist with the development of CSCW systems, removing some of the complexity of user interface, application and distributed systems programming which developers currently face. Such `enabling technologies' would ease problems of system development and allow a more evolutionary approach--an approach otherwise prohibited by the investment necessary to create system prototypes and the need to commit to policy decisions at an early stage in a system's design. Work in CSCW is already addressing these issues through the development of toolkits or application frameworks with components which can be instantiated and combined to create groupware systems. Toolkits such as GroupKit are by now relatively mature, and seem to reduce the complexity of CSCW system development in much the same way that user interface toolkits allow rapid development of single-user interfaces. As I have shown, the desire for enabling technologies for CSCW lies not only in easing problems of prototype construction but also in facilitating deployment, and thereby evaluation, of system prototypes in real work domains. As yet, most CSCW toolkits focus primarily on system development, and issues of cross-platform deployment, integration with existing applications and so on are secondary.

In this regard more than any other, the World Wide Web seems to offer potential as an enabling technology for CSCW:
Web client programs (browsers) are available for all popular computing platforms and operating systems, providing access to information in a platform-independent manner;
browsers offer a simple user interface and consistent information presentation across these platforms, and are themselves extensible through association of external `helper applications';
browsers are already part of the computing environment in an increasing number of organisations, requiring no additional installation or maintenance of software for users to cooperate using the Web;
many organisations have also installed their own Web servers as part of an Internet presence or a corporate Intranet, and have familiarity with server maintenance and, in many cases, server extension through programming the server API.
As a basis for deployment of CSCW applications in real work domains, the level of acceptance and penetration of Web technology in commercial and academic environments is grounds alone for suggesting that CSCW should pay serious attention to the World Wide Web.

Supporting Collaboration within Widely-distributed Work-groups
Most shared workspace systems were conceived as a means of supporting the work of widely-dispersed work-groups, particularly those involved in large research and development projects. Members of such projects may come from a number of organisations, in different countries, yet need to share and exchange information and often collaborate over its production.
The geographical distribution prohibits frequent face-to-face meetings, so such groups would clearly benefit from computer support for the collaborative aspects of their work. Unfortunately, the lack of common computing infrastructure within the group often prohibits deployment of such technology and causes serious problems for system developers, who must pay close attention to issues of heterogeneous machines, networks, and application software. As a consequence of these problems, and despite over 10 years of research in the field of CSCW, email and FTP remain the state of the art in supporting collaboration within widely-distributed work-groups. Although such tools facilitate information exchange, they provide little support for information sharing, whereby details of users' changes, annotations and so on are made visible and available to all other participants. A conclusion drawn by many is that for more powerful CSCW technologies to flourish, a common infrastructure that addresses problems of integration is required, allowing developers to focus on application details rather than the complexities of different system configurations.

The W3 is the first real example of such a common infrastructure, offering huge potential to CSCW system developers through:
platform, network and operating system transparency,
integration with end-user environments and application programs,
a simple and consistent user interface across platforms,
an application programmer interface for `bolt-on' functionality, and
ease of deployment facilitating rapid system prototyping.

Given this potential, it is unsurprising that a number of W3-based collaboration systems have been developed. We can classify these systems in four broad categories, based on the extent to which they depart from existing W3 standards:
Purely W3-based: Such systems use standard W3 clients, comply with HTML and HTTP standards, and only extend server functionality using the CGI interface. Any additional client functionality is provided by helper applications (client APIs such as CCI are not included here, as they are not standardised across clients and platforms). An example of such a purely W3-based system is reported in the literature.
Customised servers: As Category 1, but requiring special-purpose servers to provide behaviour beyond the possibilities offered by CGI. Such systems still support standard W3 clients and protocols, but the enhancements may reduce the portability of the server itself. InterNotes is an example of such a customised server.
Customised clients: As Category 1 (and sometimes 2), but requiring particular or modified clients (often to support non-standard HTML tags) or non-standard client APIs, and not necessarily usable with different platforms or different clients. These systems do however support the HTTP protocol. The Sesame client for Ubique's Virtual Places system is an example.
Web-related: Such systems may provide a W3 interface, but support only limited interaction using the HTTP protocol. The Worlds system is an example of this category.

In this classification, the degree of W3 compliance decreases from 1 to 4; one might say that a system in Category 1 inherits all the benefits of the W3 listed above, while a system in Category 4 gives the developer a free hand in the choice of protocols, interface toolkits and so on, but few of the benefits.
A major goal of this work was to produce a useful and usable system--one that could be deployed in the target domain and refined on the basis of actual usage feedback. This meant setting a number of design goals:
No modification to the HTTP protocol
No modifications to HTML, and no client customisation other than through helper applications
All server customisation to be performed through the CGI interface
The following section describes the current version of the system we developed; we then return to these three design goals to discuss the system's implementation.

The Web as enabling technology for CSCW
``[The Web] was developed to be a pool of human knowledge, which would allow collaborators in remote sites to share their ideas and all aspects of a common project'' (Berners-Lee et al. 1994, page 76). From its inception the Web was intended as a tool to support a richer, more active form of information sharing than is currently the case. Early implementations at CERN allowed the browsing of pages as is common today, but also supported annotation of these pages and the addition of links between arbitrary pages, not just from pages on local servers the user could access and edit. Some of these concepts were carried through to early drafts of the standards for Web protocols and architecture, which described features such as remote publishing of hypertext pages and check-in/out support for locking documents to ensure consistency in a multi-author environment. To date these aspects have largely been sidelined, while development of Web browsers, servers and protocols has focused on more `passive' aspects of information browsing. In this section I examine the Web as it currently exists as a platform for developing and deploying CSCW technologies, following a brief overview of the components on which it is based.

Developing Web-based CSCW applications
Despite the lack of direct support for collaboration, the current Web architecture does hide some of the complexity of deploying applications in a distributed, heterogeneous environment. The most common method of doing this is to extend a Web server through the CGI with new application functionality, or with `glue' code to an existing application, presenting the application user interface as a series of HTML pages which can be displayed by standard Web browsers. With this approach developers can take advantage of the existing base of browsers as client programs for their applications, but must accept the constraints of the basic Web architecture and protocols as currently implemented, and the limitations of existing Web browsers. These constraints are severe, inhibiting the development and deployment of CSCW applications in a number of areas:
Communication: There is no support for server-server, (server-initiated) server-client or client-client communication, which is problematic for applications where the server needs to play an active role (e.g. to notify users of changes to information, or to maintain information consistency over several servers). One consequence is that applications are now in common use which poll Web servers periodically to check whether pages have been updated, allowing users to monitor Web sites of interest (e.g. Netscape's SmartMarks). Users can specify a very small time interval between checks, even for pages which change rarely, leading to huge amounts of unnecessary traffic on the Internet and `hits' on Web servers (a polling sketch follows this list).
Pure centralised architecture: The architecture provides no support for distribution of information or computation between clients and servers, or for replication across servers. Expensive, powerful and fault-tolerant machines are required to run a Web server if it is to scale to a large number of users. Even simple computations are not performed by the client, for example checking whether a user has filled in all the fields of a form, resulting in unnecessary network traffic, server loading and slow feedback times for the user. The lack of support for replication also means that disconnected working is not possible.
No guaranteed `Quality of Service': The HTTP protocol does not support the specification of guaranteed transmission rates between servers and clients. Data transfer is often `bursty', subject to network and server loading which may vary considerably during a single transmission. This is unsuitable for transmission of (real-time) continuous media like audio and video, and alternative protocols such as RTP, the `Real-Time Protocol', have been proposed for these media types.
User interface design: HTML is not a user interface design toolkit, and although markup tags are provided for simple form-filling widgets like input fields, these do not support features now common in desktop user interfaces such as drag and drop, multiple selection and semantic feedback. Although some browser vendors have introduced new tags to provide features like multiple, independent screen areas (Netscape Frames), these do little to broaden the possibilities for user interface design (and are not supported by all browsers). A fundamental problem here is the lack of server-client notification (see above); it is easy for the interface to become inconsistent with the information on the central server, and it is only updated when the user reloads the entire page.
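To make the polling problem in the Communication item concrete, here is a minimal sketch of the kind of monitor such tools run (my own illustration; the URL and interval are arbitrary). Even with a conditional request, so that an unchanged page costs little to transfer, every check is still a server `hit', and an aggressive interval multiplies traffic for pages that rarely change.

```python
import time
import urllib.error
import urllib.request

URL = "http://example.org/page.html"   # hypothetical page to watch
INTERVAL = 60                          # seconds between checks

last_modified = None
while True:
    request = urllib.request.Request(URL)
    if last_modified:
        # Conditional GET: the server replies 304 if nothing changed.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as reply:
            last_modified = reply.headers.get("Last-Modified")
            print("page fetched; Last-Modified:", last_modified)
    except urllib.error.HTTPError as err:
        if err.code != 304:            # 304 Not Modified is expected
            raise
    time.sleep(INTERVAL)               # each loop is still a server hit
```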
Some of these limitations are not so much problems with Web components like HTTP and HTML as with the current implementations of browsers and servers. For example, there is no reason why a server could not act as a client, and vice versa, to allow a form of update propagation and notification. (In fact some servers can send requests as well as handle them, often to act as a `proxy' for routing requests through a firewall.) These limitations do however restrict the kinds of CSCW systems which can be developed as extensions of the Web using the CGI, and suggest that the Web in its current form is largely unsuitable for developing systems which require highly interactive user interfaces, rapid feedback and `feedthrough' (user interface updates in response to others' interactions), or a high degree of synchronous notification.

Of course, extending a Web server through the CGI programming interface is not the only method of deploying a CSCW system on the Web, and more radical approaches can remove some of the constraints of the basic architecture. Based on the extent to which a developer must modify the basic Web components, we can identify the following approaches:
Extending through CGI: As described above, where no modifications are required to protocols, browsers or servers. Any additional client functionality is provided through `helper' applications. The BSCW system described in the next section is an example.
Customising/building a server: Building a special-purpose Web server may be necessary to achieve adequate performance or security, or to introduce new functionality such as server-initiated notification. This approach requires deployment of the server software and any other application code, but is sometimes a better method of enabling access to existing applications from the Web, in a more flexible and secure manner than CGI. The BASIS WEBserver, which enables Web access to the BASISplus document management system, is a good example.
Customising/building a client: Building a special-purpose client allows applications other than Web browsers to communicate with Web servers using HTTP, such as the `coordinator' clients developed for the WebFlow distributed workflow system (Grasso et al. 1997). Customising a client may also be necessary to interpret non-standard HTML tags, such as those proposed by Vitali and Durand (1995) for version control to support collaborative editing of HTML documents. Custom clients can be used in conjunction with custom servers to provide additional services; as part of the Virtual Places system, the Ubique client can interact with the Virtual Places server to provide synchronous communication and a form of `presence awareness'.
Providing a Web interface: Some systems, such as Worlds (Fitzpatrick et al. 1995), provide a Web interface but are not designed specifically for deployment on the Web. These applications use other means of providing the user interface, managing data and event distribution and so on, and only limited interaction is possible using a Web browser and HTTP.

Using this classification, the flexibility available to the developer increases from 1 to 4, and many of the problems identified above can be solved by specialising or replacing components such as clients and servers to provide richer mechanisms for the user interface, update propagation and so on. Of course, this flexibility is bought at the price of the innovation required from developers to build or integrate these components, and it should be obvious that very soon we may find ourselves back at square one: with a system which cannot be deployed outside the laboratory due to particular hardware and software requirements and a lack of integration with existing user environments. In this case, if our goal is eventual deployment and evaluation in real work domains, there seems little point in using the Web as a platform for CSCW system development.

Despite these problems, however, I would strongly argue that the Web is an `enabling technology' for CSCW. The limitations identified above mean that the Web is best suited to asynchronous, centralised applications with no strong requirements for synchronous notification, disconnected working or rich user interfaces. The advantages, however--an accepted technology, integrated with existing user environments and extensible through the server API without requiring additional client software on users' machines--indicate that here we have a method of deploying and evaluating basic mechanisms to support collaboration in real work domains. Further, the rapid pace of development in Web technologies suggests that many proprietary and experimental features which address some of the current limitations could become standards in the future. Much depends, of course, on the willingness of the main browser vendors (currently Netscape and Microsoft) to agree on and implement these features, but this does not seem to have been a problem to date. As Web technology matures, some of the current problems with CSCW development on the Web should be solved.
Experiences and perspectives of the Web as enabling technology for CSCW
In this section I am concerned primarily with the Web as a potential enabling technology for CSCW systems, rather than with possibilities for enhancing the Web itself with mechanisms to make it more `collaborative'. I therefore focus my discussion on the role of the Web as a vehicle for developing and deploying CSCW systems, instead of as a target of CSCW research in its own right, and thus orient more to the utility of current and future Web standards for CSCW systems than to possible modifications of these standards as informed by CSCW research. This last is clearly an area where CSCW and the Web have much to say to each other; for example, the phenomenon that the Web is currently a `lonely place' is an argument put forward by Lea et al. (1997) for their work on Virtual Societies, and the goal of adding group `awareness' mechanisms to augment the Web is receiving increasing attention from the CSCW community (see for example Greenberg and Roseman 1996, Palfreyman and Rodden 1996). The topic of awareness is only one of several issues which might be included in a research agenda for CSCW with respect to augmenting the basic Web architecture, protocols and technologies. I have taken the position that for CSCW the role of an enabling technology is twofold, easing problems of both development and deployment of CSCW systems in real-world domains, and that deployment is best achieved when systems integrate smoothly with existing Web technologies. I now discuss the possibilities and problems of developing CSCW systems on the Web, before reviewing recent developments which might address these problems and broaden the range of CSCW systems which can be supported.

Experiences of developing Web-based CSCW systems
The current standards for Web components like HTML and HTTP reflect the emphasis to date on the Web as a tool for information browsing. This allows information providers to design and retain control of the form and content of their information and `publish' it via a Web server. Consumers can then access the information by sending requests via their Web browsers. The CGI server programming interface allows extension of the Web within this `provider-consumer' framework, so that servers can generate responses on-the-fly as well as serve static Web pages stored in files on the server file system. Our experience with collaborative systems suggests that, as a tool for application development, it is straightforward to extend the Web with application functionality or an interface to an existing application. The method of passing request details through the CGI programming interface is simple, and allows developers to write extension programs in most programming languages, with no need to link extension code with the server. Extension programs must generate and return HTML, which again is straightforward. In combination with a high-level, interpreted programming language such as Python, this arrangement allows extremely rapid prototyping and testing using a standard Web client and server.
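As a flavour of this style of extension, here is a minimal sketch of a CGI program in Python (an illustration only, not BSCW code; the form field name is an assumption): the server passes request details to the script, which generates and returns HTML.

```python
#!/usr/bin/env python3
# Minimal CGI extension: read a form field, return a generated page.
import cgi

form = cgi.FieldStorage()                  # request details via CGI
name = form.getvalue("workspace", "home")  # hypothetical field name

# A CGI program writes an HTTP header, a blank line, then the body.
print("Content-Type: text/html")
print()
print(f"<html><body><h1>Folder listing: {name}</h1></body></html>")
```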
The CGI approach does however inherit all the problems of the request-response model of the Web. One of these is the feedback delay caused by the round trip to the server needed to service every user interaction. When requesting documents, or even HTML pages, this delay may be acceptable; but for simple requests, especially those that change only the state of the interface, it is a problem. For example, with the BSCW system users can fold in/out the object action and description lines using the `A' and `D' buttons, and with the adjacent checkbox buttons select all or none of the objects in a folder listing. Using these features requires a request to the server to generate a modified HTML page, and when interacting via the Internet (as most users of our public server do) network delays represent a much larger component of the total time to service the request than processing time. In designing a user interface for a Web-based application, developers must take care to reduce the number of required trips to the server, possibly by allowing the user to `batch' requests at the client (using multiple HTML forms, for example).

At the server side, the simplicity of the CGI approach can also be problematic. The execution of extension programs in separate processes, which are passed details of the request, may allow rapid development, but it gives the developer no chance to modify server behaviour or to access request information which is not passed explicitly through the CGI. Where the default behaviour is adequate, as is the case for the user authentication features used directly by BSCW, there are no problems. Where features are inadequate for an application's needs, the developer cannot modify them but must either re-implement them using the CGI or build a custom HTTP server (Trevor et al. 1996).

The Web is best suited as a development platform for applications which do not need to step outside the information provider-consumer model currently enshrined in existing standards and browser and server implementations. When this is required, it is often necessary to provide additional components at the server or client (in the form of helper applications). The latter removes one of the main advantages of the Web, which is the ability to deploy systems without requiring the development of client programs that run across platforms or the installation of additional software by users. For BSCW, the need to upload documents to the server has required considerable effort to produce versions of the (very simple) helper which operate on PC, Macintosh and Unix machines. For this, and for other aspects such as synchronous notification, information replication and so on, the basic Web standards and components offer no support, and developers must provide their own solutions.

Much of the work in the Web standards community is focusing on refinement of protocols and of client and server architectures to improve the speed and reliability with which requests can be handled, not on providing more flexible and powerful components for application development. This emphasis is not surprising; the growth of the Web has been so rapid that aspects of the HTTP protocol in particular must urgently be redesigned to ensure the Web architecture can continue to scale to millions of users worldwide. However, this growth has also led to demand from users and third-party vendors for extensions to Web components to allow richer support for different media types, user interfaces and so on. To meet this demand, server and browser vendors have proposed a number of mechanisms and built support for them into their products. There is some evidence that this practice is key to the continuing development of the Web.
An example of this is the support for HTML page editing and remote publishing, identified as an area requiring support by a number of vendors including Netscape (with the Navigator Gold browser), Microsoft (with FrontPage) and GNN (with GNNPress). Although the solutions offered are currently incompatible, all share a need for uploading documents to a Web server, and this has prompted efforts to agree a standard method for doing so. The World Wide Web Consortium (W3C) has recently established a working group on "Distributed Authoring and Versioning on the World Wide Web" to examine requirements and work towards specifications to support this activity. Similarly, the need for richer Web pages has led to tools like Java and JavaScript being supported by the major browser vendors and becoming de facto standards. Where relevant, these de facto standards have also filtered into the documented standards process, as is the case with some of the proprietary extensions to HTML now part of the latest proposed standard, HTML 3.2.

Broadening the possibilities for CSCW
As Internet technologies continue to penetrate and impact upon marketing, finance, publishing, organisational IT and so on, the demand for extension and innovation will increase. The growth of the corporate Intranet, for example, raises requirements for information replication, workflow services and the like, while commerce applications require much higher levels of security and privacy than are currently supported. Although many vendors will seek to provide proprietary solutions to these requirements, and thus lock corporate customers into particular technological solutions, it is also clear that technologies are emerging which have the potential to broaden the possibilities for third parties to customise Web components and develop new extensions in a more generic fashion.

The problems of the CGI approach to extending an existing Web server are well known, and vendors of Web server technology are seeking to provide more flexible solutions for developers. For example, in designing the API for the Apache server (currently the most widely deployed Web server on the Internet), the developers sought to allow ``third-party developers to easily change aspects of the server functionality which you can't easily access from CGI'' (Thau 1996, page 1113). Similar developments in browser programming interfaces, such as Netscape's `Plug-in' development kit and Microsoft's `ActiveX' environment, are intended to extend the capabilities of standard Web browsers to handle new media types directly, to embed Web browsers in other applications, and more. Such advances in client and server programming interfaces allow the development of much richer CSCW systems, better integrated with desktop environments than is possible with the basic Web components. In the main, however, these developments are specialised to particular browsers or servers, or operate only on particular platforms, and do not offer the same advantages as the basic components for cross-platform deployment of CSCW systems. Although some vendors have announced support for others' programming interfaces, it remains to be seen how this will work in practice as vendors (particularly browser vendors) seek to differentiate their products on the basis of the richness of features supported. An approach which is independent of particular client and server programming interfaces seems to offer more potential in this regard.
One area receiving much attention is that of `mobile code', where, in addition to data in HTML or other formats, a browser might download small application programs or `applets' which are executed on the local machine, taking input and displaying output via the Web browser. This removes many of the constraints on the application developer: applets can be designed which provide much richer user interfaces than are possible with HTML; computation can be moved to the client, for example to check for valid input data, thus reducing network traffic, server loading and feedback lags; and applets supporting special protocols can be developed which handle different media types, and so on. Although there are many problems to be overcome, most notably the security concerns raised when code downloaded over the Internet is executed on the user's machine, significant progress has been made in this area. The latest Web browsers from Netscape, Microsoft and IBM now provide support for applets written in Sun's Java programming language. For tasks requiring less power than a full programming language, scripting tools like Netscape's JavaScript follow similar principles but are less ambitious, allowing HTML pages to be extended with code fragments that pass responsibility for simple computations from the server to the client.

I see these developments as broadening the role of the Web as an enabling technology for CSCW, increasing the range of CSCW systems which can be developed while not compromising the benefits of cross-platform system deployment. In the BSCW project, both Java and JavaScript are being used to overcome problems with the basic Web components and provide richer collaboration services to users of the BSCW system. With JavaScript, the HTML user interface of BSCW has been augmented to remove the need to send requests to the server for changes in user interface state, including the folding of actions and descriptions and the select all/none behaviour discussed above. With Java the project is being more ambitious, designing applets which provide synchronous collaboration services such as event notification, presence awareness, simple text chat and more, which can be presented in standard Web browsers alongside the existing BSCW HTML user interface. An early prototype of this work is discussed in (Bentley et al. 1995).

Collaborative Computing - Some Examples
Collaborative computing allows users to work together on documents and projects, usually in real time, by taking advantage of underlying network communication systems. Whole new categories of software have been developed for collaborative computing, and many existing applications now include features that let people work together over networks. Here are some examples:
Application suites such as Microsoft Office and Exchange, Lotus Notes, and Novell GroupWise that provide messaging, scheduling, document coauthoring, rules-based message management, workflow routing, and discussion groups.
Videoconferencing applications that allow users to collaborate over local networks, private WANs, or the Internet. See "Videoconferencing and Desktop Video" for more information.
Internet collaboration tools that provide virtual meetings, group discussions, chat rooms, whiteboards, document exchange, workflow routing, and many other features.
Multicasting, an enabling technology for groupware and collaborative work on the Internet that reduces bandwidth requirements: a single packet can be addressed to a group, rather than a separate packet having to be sent to each member of the group (a minimal socket sketch follows this list). See "Multicasting" for more details.
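The following is a minimal sketch of group-addressed delivery using standard UDP sockets (an illustration only; the group address, port and payload are arbitrary). One send reaches every subscribed receiver, instead of one send per member.

```python
import socket
import struct

GROUP, PORT = "224.1.1.1", 5007   # arbitrary multicast group address

def send(message: bytes):
    # One packet, addressed to the group rather than to each member.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    s.sendto(message, (GROUP, PORT))

def receive():
    # Any number of hosts can join the group and hear the same packet.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    membership = struct.pack("4sl", socket.inet_aton(GROUP),
                             socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    print(s.recvfrom(1024))

send(b"meeting update")
```

As the text notes, routers along the path must be multicast-enabled for this to work beyond the local network.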
Collaborative Computing - Some Examples

Collaborative computing allows users to work together on documents and projects, usually in real time, by taking advantage of underlying network communication systems. Whole new categories of software have been developed for collaborative computing, and many existing applications now include features that let people work together over networks. Here are some examples:

Application suites such as Microsoft Office and Exchange, Lotus Notes, and Novell GroupWise that provide messaging, scheduling, document coauthoring, rules-based message management, workflow routing, and discussion groups.

Videoconferencing applications that allow users to collaborate over local networks, private WANs, or over the Internet. See "Videoconferencing and Desktop Video" for more information.

Internet collaboration tools that provide virtual meetings, group discussions, chat rooms, whiteboards, document exchange, workflow routing, and many other features.

Multicasting is an enabling technology for groupware and collaborative work on the Internet that reduces bandwidth requirements: a single packet can be addressed to a group, rather than having to send a separate packet to each member of the group. See "Multicasting" for more details.
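The sketch below illustrates the multicast idea in Java (the group address 230.0.0.1 and port 4446 are arbitrary example values): a single datagram sent to the group address reaches every host that has joined the group, without the sender enumerating receivers. It assumes a multicast-capable network; as noted below, routers must be multicast-enabled for this to work beyond the local segment.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// One datagram addressed to a multicast group is delivered to every
// member that has joined the group -- the sender never enumerates
// the receivers.
public class MulticastDemo {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("230.0.0.1"); // example group
        int port = 4446;

        MulticastSocket socket = new MulticastSocket(port);
        socket.joinGroup(group); // subscribe this host to the group

        // Send a single packet to the group address ...
        byte[] msg = "hello, group".getBytes();
        socket.send(new DatagramPacket(msg, msg.length, group, port));

        // ... and receive it back, as any other group member would.
        byte[] buf = new byte[256];
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        socket.receive(packet);
        System.out.println(new String(packet.getData(), 0, packet.getLength()));

        socket.leaveGroup(group);
        socket.close();
    }
}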
Good examples of collaborative applications designed for Internet use are Microsoft's NetMeeting and NetShow. NetMeeting allows intranet and Internet users to collaborate through shared applications over the Internet, while NetShow lets users set up audio and graphics (nonvideo) conferences. These products are described below as examples of the type of collaborative applications available in the intranet/Internet environment. More information about the products is available at http://www.microsoft.com.

NetMeeting

NetMeeting uses Internet phone voice communications and conferencing standards to provide multiuser application and data sharing over intranets or the Internet. Two or more users can work together and collaborate in real time using application sharing, whiteboard, and chat functionality. NetMeeting is included in Microsoft's Internet Explorer.

NetMeeting can be used for common collaborative activities such as virtual meetings. It can also be used for customer service applications, telecommuting, distance learning, and technical support. The product is based on ITU (International Telecommunication Union) standards, so it is compatible with other products based on the same standards. Some of NetMeeting's built-in features are listed here.

INTERNET PHONE
Provides point-to-point audio conferencing over the Internet. A sound card with attached microphone and speaker is required.

ULS (USER LOCATION SERVICE) DIRECTORY
Locates users who are currently running NetMeeting so you can participate in a conference. Internet service providers can implement their own ULS server to establish a community of NetMeeting users.

MULTIPOINT DATA CONFERENCING
Provides a multipoint link among people who require virtual meetings. Users can share applications, exchange information through a shared clipboard, transfer files, use a shared whiteboard, and use text-based chat features.

APPLICATION SHARING
Allows a user to share an application running on their computer with other people in a conference. Works with most Windows-based programs. As one user works with a program, other people in the conference see the actions of that user. Users may take turns editing or controlling the application.

SHARED CLIPBOARD
Allows users to easily exchange information by using familiar cut, copy, and paste operations.

FILE TRANSFER
Lets you transfer a file to another person by simply choosing a person in the conference and specifying a file. File transfers occur in the background as the meeting progresses.

WHITEBOARD
Provides a common drawing surface that is shared by all users in a conference. Users can sketch pictures, draw diagrams, or paste in graphics from other applications and make changes as necessary for all to see.

CHAT
Provides real-time text-based messaging among members of a conference.

NetShow

NetShow is basically a low-bandwidth alternative to videoconferencing. It provides live multicast audio, file transfer, and on-demand streamed audio, illustrated audio, and video. It is also a development platform on which software developers can create add-on products. According to Microsoft, NetShow provides "complete information-sharing solutions, spanning the spectrum from one-to-one, fully interactive meetings to broadly distributed, one-way, live, or stored presentations."

NetShow takes advantage of important Internet and network communication technologies to minimize traffic while providing useful tools for multiuser collaboration. IP multicasting is used to distribute identical information to many users at the same time; this avoids the need to send the same information to each user separately and dramatically reduces network traffic. Routers on the network must be multicast-enabled to take advantage of these features. NetShow also uses streaming technology, which allows users to see or hear information as it arrives, rather than wait for it to be completely transferred.

Other Products

A number of other companies are working on collaborative products that do many of the same things as NetMeeting and NetShow. Netscape Conference and SuiteSpot are similar products; SuiteSpot integrates up to ten collaborative applications into a single package. Additional information is available at http://www.netscape.com.

Netscape Collabra Server, which is included in the SuiteSpot enterprise suite of applications, lets people work together over intranets or over the Internet. Companies can create discussion forums and open those forums to partners and customers. Collabra Server employs the standards-based NNTP (Network News Transfer Protocol), allowing discussions to be opened to any NNTP-compliant client on the Internet (a minimal NNTP exchange is sketched at the end of this section). In addition, discussions can be secured and encrypted.

Another interesting product is one called CyberHub from Blaxxun Interactive (http://www.blaxxun.com). It provides a high-end virtual meeting environment that uses 3-D graphics and VRML (Virtual Reality Modeling Language).
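Because NNTP is a simple line-oriented text protocol, any client can talk to a standards-based discussion server such as Collabra. The sketch below (the host name is a placeholder; port 119 is the standard NNTP port) connects, prints the server greeting, requests the newsgroup list with the LIST command, and quits.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal NNTP exchange: connect, read the greeting, list the
// available newsgroups, and quit. NNTP commands are terminated
// with CRLF; multi-line responses end with a lone "." line.
public class NntpPeek {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "news.example.com"; // placeholder
        try (Socket s = new Socket(host, 119)) {
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
            PrintWriter out = new PrintWriter(s.getOutputStream());

            System.out.println(in.readLine()); // greeting, e.g. "200 ..."

            out.print("LIST\r\n");             // ask for the newsgroup list
            out.flush();
            String line;
            while ((line = in.readLine()) != null && !line.equals(".")) {
                System.out.println(line);      // "group last first posting-flag"
            }

            out.print("QUIT\r\n");
            out.flush();
        }
    }
}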