1) Introduction - [Background and History]

Early computers were usually batch programmed:
- Tasks (programs and data) were prepared off-line.
- A batch of tasks was loaded and run in sequence.
- Operator intervention was limited to loading batches; there was no intervention whilst a batch was running.

In recent years, new challenges and possibilities have emerged:
- Computers have become smaller and cheaper.
- New interaction technologies have been developed.
- Legislation has raised standards, particularly with regard to accessibility.

Accessibility means giving users with special needs the same level of access as other users. For example, people who are blind or visually impaired can operate GUIs with the aid of a screen-reader. However, compared to other users, screen-reader users usually tend to:
- work more slowly
- make more errors
- report higher levels of fatigue

Thus, while visually impaired people can use GUIs, it cannot be said that they have the same level of access as other users.

As computers become cheaper and more powerful, we are seeing a move from reactive systems to adaptive/proactive systems:

Reactive systems:
- The user always initiates actions.
- Large screens; focus on user attention.
- Little need for adaptivity.

Proactive systems:
- The system or the user can initiate actions.
- No screen / hands-free; user attention elsewhere.
- Adaptivity is essential for effective operation.

2) Introduction - [Issues and Topics]

"A user interface should be so simple that a beginner in an emergency can understand it within ten seconds."

"...Any application designed for people should be: easy to learn (and remember); useful, that is, contain functions that people really need in their work; and be easy and pleasant to use."

Four key concepts:
- Learnability - the time and effort required to reach a specified level of user performance.
- Throughput - tasks accomplished, speed of execution, errors made, etc.
- Flexibility - the extent to which the system can accommodate changes to the tasks and environments beyond those first specified.
- Attitude - the attitude engendered in users by the application.

For example, a travel agent (primary user) may use a system to search for hotels, flights, trains, etc., on behalf of a customer (secondary user).

3) Human Memory & Perception - [Memory]

Human memory has three distinct stages. Information is:
1. Received through one or more of the sensory memories, e.g. iconic (visual) memory, echoic (auditory) memory, haptic memory.
2. Selectively held in short-term memory while it is analyzed, after which it may be either discarded or...
3. Stored permanently in long-term memory.

Short-term memory can hold around seven items of information. However, it is not easy to define an "item of information". An item might be:
- a single digit, character or word, or...
- a long number or an entire phrase, if that number or phrase is already known by the person.

There are two types of long-term memory:
- Episodic memory represents our memory of events and experiences. It stores items in serial form, allowing us to reconstruct sequences of events and experiences from earlier points in our lives.
- Semantic memory is structured so that it represents relationships between the information it stores. It stores information without regard to the order in which it was acquired, or the sense through which it was acquired.

There are three main processes associated with long-term memory (LTM):
- storage/remembering
- forgetting
- information retrieval

Information passes into long-term memory via the short-term memory.
However, not everything that is held in short-term memory is eventually stored in long-term memory. The main factors that determine what is stored are:
- Rehearsal: repeated exposure to data, or consideration of it, increases the likelihood that it will be stored in LTM.
- Meaning: meaningful information is more likely to be stored in LTM than meaningless data.

There are two main theories to explain the loss of information from long-term memory: decay and interference.
- Decay: Ebbinghaus concluded that information is lost through natural decay.
- Interference: new information may replace or corrupt older information. For example, changing your telephone number may cause you to forget your old number. This is known as retroactive interference. However, there may also be times when older information 'resurfaces' and becomes confused with newer information. For example, you may suddenly recall an old telephone number and confuse it with your new one. This is known as proactive inhibition.

There are two types of information retrieval from LTM:
- Recall: the recovery of information as a result of a conscious search.
- Recognition: the automatic recovery of information as a result of an external stimulus.
Recognition is around twice as fast and three times as accurate as recall.

4) Human Memory & Perception - [Visual Perception]

The human visual system can be divided into two stages:
- physical reception of light
- processing and interpretation

The human visual system has both strengths and weaknesses:
- certain things cannot be seen even when present
- processing allows images to be constructed from incomplete information

Light passes through the cornea and is focused by the lens, producing an inverted image on the retina. The iris regulates the amount of light entering the eye. The retina is covered with photoreceptors, of two types:
- Rods: high sensitivity to light, monochrome (black-and-white) vision, low resolution.
- Cones: low sensitivity to light, colour vision (red, green and blue types), high resolution.

The eye contains:
- around 120 million rods, most of which are located around the periphery of the retina
- around 6 million cones, most of which are located in the fovea

The lens is flexible and can focus the image on different parts of the retina. This makes it possible to adapt between light and dark conditions:
- In bright conditions, light is focused on the fovea, giving high resolution and colour vision.
- In dark conditions, focus is shifted onto the periphery, giving greater sensitivity but reduced resolution and colour perception.

The retina contains ganglion cells, which perform some local processing of images. There are two types of ganglion cell:
- X-cells perform basic pattern recognition and are mainly concentrated in the fovea.
- Y-cells perform movement detection; they are more widely distributed than X-cells and predominate in the periphery.

The photoreceptors and ganglion cells are all connected to the optic nerve, which carries visual information to the brain. There are no photoreceptors in the area of the retina around the optic nerve, so there is a blind spot at this point. We are not usually aware of the blind spot because our brains 'fill in' the missing part of the image.

The luminance of an object depends on:
- the amount of light falling on its surface
- the reflective properties of the surface(s)

Contrast is related to luminance. It is the difference in luminance between the brightest and darkest areas of an image.
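The notes define contrast as a difference in luminance. In accessibility work (cf. the Web Content Accessibility Guidelines later in these notes), contrast between screen colours is usually operationalised as a ratio of relative luminances. The sketch below uses the WCAG 2.0 formulas; the colour values and the 4.5:1 threshold mentioned in the comments are standard WCAG figures, while the example colours are merely illustrative.

```python
# A minimal sketch of the WCAG 2.0 contrast-ratio calculation, which
# operationalises the luminance/contrast ideas above for screen colours.

def channel_to_linear(c8):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.0 formula)."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an sRGB colour, weighted for the eye's
    differing sensitivity to red, green and blue (cf. the cone types)."""
    r, g, b = (channel_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Contrast ratio between two colours, from 1:1 up to 21:1."""
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)),
                             reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

if __name__ == "__main__":
    print(f"black on white: {contrast_ratio((0, 0, 0), (255, 255, 255)):.1f}:1")
    # mid-grey on white comes out around 4.5:1, the WCAG AA threshold
    # for normal-size text:
    print(f"grey on white:  {contrast_ratio((119, 119, 119), (255, 255, 255)):.1f}:1")
```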
The human visual system compensates for bright or dark conditions by varying the relative percentage of rods and cones it uses.

The human eye can distinguish about 150 hues within the visible light spectrum. However, the total number of colours we can distinguish is much higher. This is because:
- Each of the pure hues can be mixed with white in various quantities to produce other colours. We refer to the spectrum of hues as fully-saturated colours; when mixed with white, we refer to them as partially-saturated or de-saturated colours.
- The brightness of each colour can be varied.

In practice, we use a limited number of primary colours, e.g.:
- Red, Green and Blue (RGB) when mixing light. This is known as additive mixing.
- Cyan, Magenta, Yellow and Black (CMYK) when mixing pigments. This is known as subtractive mixing.

Factors affecting our judgement of size include:
- Stereo vision - the difference between the images seen by each eye can be analysed to gauge distances.
- Head movement - small changes in viewing position produce changes in view that allow distance to be gauged.
- Monocular cues - relative size, relative height, relative motion.

When children learn to read, they initially read linearly, i.e. they:
- start at the beginning of the sentence
- read each word in turn
- identify the meaning of each word
- identify the meaning of the sentence
This is a very slow and inefficient method of reading. As they become more proficient at reading, they learn to scan text by spotting key-words. This process involves the following stages:
- Identify a word or character.
- Guess the meaning of the phrase or sentence.
- Confirm/disprove the guess.
- Revise the guess if necessary.

A number of methods are used to measure the readability of text:
- Average reading time: a group of people are asked to read the text, and the average time taken is noted.
- Fog Index: takes into account word-length, sentence-complexity, etc. (see the sketch below).
- Cloze Technique: subjects are asked to read a piece of text in which every fifth word is blanked out. The index is based on the percentage of blanked words that are guessed correctly.

Factors that affect the readability of text include:
- font-style and capitalization
- font size
- character spacing
- line lengths
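The notes say only that the Fog Index "takes into account word-length, sentence-complexity, etc." One widely used version is the Gunning Fog index: 0.4 × (average words per sentence + percentage of "complex" words, usually taken to be words of three or more syllables). The sketch below assumes that formulation; its syllable counter is a crude vowel-group heuristic, so treat the output as indicative only.

```python
# A rough sketch of the Gunning Fog readability index:
#   fog = 0.4 * (words_per_sentence + 100 * complex_words / words)
import re

def count_syllables(word):
    """Approximate syllables by counting groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)

sample = ("The user interface should be simple. "
          "Complicated terminology discourages inexperienced users.")
print(f"Fog index: {fog_index(sample):.1f}")  # higher = harder to read
```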
5) Human Memory & Perception - [Auditory Perception]

Like the visual system, the human auditory system can be divided into two stages:
- physical reception of sounds
- processing and interpretation

Like the visual system, the human auditory system has both strengths and weaknesses:
- certain things cannot be heard even when present
- processing allows sounds to be constructed from incomplete information

The principal characteristics of sound - as perceived by the listener - are pitch, loudness and timbre.

The perceived intensity of a sound depends upon:
- the sound pressure
- the distance between the source and the listener
- the duration of the sound
- the frequency of the sound

Our hearing system allows us to determine the location of sound sources with reasonable accuracy, subject to certain limitations:
- Stereo hearing allows us to locate the source of a sound by comparing the sound arriving at each ear and noting differences in amplitude and time of arrival.
- Head movement allows us to improve the localization accuracy of stereo hearing.
- Analysis of reflected vs direct sound allows us to localize in both the horizontal and vertical planes - to a limited extent.
- Familiarity affects localization accuracy.
- Judgment of distance is based partly on intensity - the quieter the sound, the farther away the source.

Sound localization (in both horizontal and vertical planes) can be improved by tailoring the sound distribution. This is done using Head-Related Transfer Functions (HRTFs). Ideally, HRTFs should be tailored to suit the individual; however, this is complex and costly. Researchers are currently trying to develop non-individualized HRTFs which will give a useful improvement in localization accuracy for a substantial percentage of the population.

Research suggests that the human auditory system includes a short-term store - a kind of mental 'tape loop' that always stores the last few seconds of sound. This is known as the Pre-categorical Acoustic Store, or PAS. Researchers disagree as to the length of the store: estimates range from as little as 10 seconds to as much as 60 seconds. However, there is significant evidence for the existence of such a store. The existence of this auditory store explains some of the following effects:

- Recall of unattended material.
- The Recency Effect: if someone listens to a voice reciting a list of digits (or characters, etc.) and is then asked to repeat the digits, he or she will recall the last few digits more reliably than the earlier ones. Typically the last 3-5 digits are recalled.
- The Auditory Suffix Effect: the recency effect (see above) is most noticeable when the speech or sound is followed by a period of silence. If a further sound occurs after (e.g.) a list has been spoken, recall is impaired. Conversely, if speech or sound is followed by complete silence, the period for which the last few seconds of it can be recalled extends significantly.

6) Human Memory & Perception - [Haptic Perception]

Haptic perception is the general term covering the various forms of perception based on touch. There are three types of sensory receptor in the skin:
- thermoreceptors respond to heat and cold
- mechanoreceptors respond to pressure
- nociceptors respond to intense heat, pressure or pain

In computing applications, we are mostly concerned with mechanoreceptors. Mechanoreceptors are of two types:
- rapidly-adapting mechanoreceptors react to rapid changes in pressure, but do not respond to continuous pressure
- slowly-adapting mechanoreceptors respond to continuous pressure

Sensory acuity is often measured using the two-point test. This simply involves pressing two small points (e.g. sharpened pencil tips) against the body. The two points are initially placed very close together, and then moved further apart until it becomes possible to feel two distinct pressure points rather than one. The smaller the distance at which both points can be detected, the greater the sensory acuity. The fingers and thumbs have the greatest acuity.

Sensory acuity varies considerably among individuals. It can be improved with training, within certain limits; for example, blind people who read Braille generally have better sensory acuity than non-Braille readers. However, certain medical conditions can lead to reduced sensory acuity.

Kinaesthetic Feedback

Another aspect of haptic perception is known as kinaesthetic feedback. Kinaesthetic receptors in our joints and muscles tell us where our limbs, fingers, etc., are relative to the rest of our body. Kinaesthetic receptors are of three types:
- rapidly-adapting kinaesthetic receptors respond only to changes in the position of limbs, etc.
- slowly-adapting kinaesthetic receptors respond both to changes in position and to the static position of limbs, etc.
- static receptors respond only to the static position of limbs, etc.
Kinaesthetic feedback is important in many rapid actions, e.g. typing or playing a musical instrument.

Haptic Memory

As with auditory perception, we have a short-term sensory memory for haptic experience. This is known as haptic memory. It functions in a very similar way to the auditory store, i.e.:
- haptic events are stored as they are experienced
- new experiences replace older ones in the memory, but...
- if no new haptic events are experienced, previous events remain in the store

7) Human Memory & Perception - [Speech Perception]

How do humans extract meaning from speech?

Early models assume a 'bottom-up' approach, i.e.:
- Separate the stream of speech into words.
- Identify each word and determine its meaning.
- Determine the meaning of the whole utterance.

More recent models assume a 'top-down' approach, i.e.:
- Analyze prosody and other cues to locate the key words.
- Identify the key words and guess the meaning of the utterance.
- If unsuccessful, analyze more words until the meaning has been extracted.

Even when the individual words are correctly recognized, speech is more difficult to analyze than written language. There are a number of reasons for this:
- Speech relies heavily on non-grammatical sentence forms (minor sentences).
- There is no punctuation.
- Repetition and re-phrasing are common.
- Efficient speech communication relies heavily on other communication channels - gesture, facial expression, etc.

8) User-Centered Design - [Intro]

The first stage in the design of an interface is to identify the requirements. This involves consideration of a number of questions, such as:
1. What area of expertise will the application be designed for?
2. Who are the users?
3. What do the users want to do with the application?
4. Where and how will the application be used?

1. What is the area of expertise?

The task of identifying domain knowledge is known as domain analysis. A common problem with domain analysis is that experts are so familiar with their field that they regard some domain knowledge as general knowledge, and are thus unable to accurately identify domain knowledge. Therefore, domain analysis should involve talking to both:
- experts in the relevant field(s)
- end-users (or potential end-users)

2. Who are the users?

It is important to know who the system is being designed for, so the designer should start by identifying the target users. One approach is to draw up a 'profile' which includes factors such as:
- age
- sex
- culture
- physical abilities and disabilities
- computing/IT knowledge and experience

However, it is very difficult to design for a large, loosely-defined group. A better approach is to segment the users into a number of smaller, tightly-defined groups. Each group can be represented by a profile of an imaginary user. These profiles are called personas. A persona:
- should cover all the factors listed above, but should also include other details, such as likes and dislikes, habits, etc.
- can be a composite, combining characteristics from a number of real people, but should be consistent and realistic
- should read as the description of a real person (see the sketch below)

In segmenting the users, it may also be necessary to distinguish between primary and secondary users. For example, in the case of a flight information system:
- the primary users might be travel agents
- the secondary users might be customers who book flights through travel agents
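The profile factors listed above can be captured as a simple record. The sketch below is one hypothetical way of doing so; the fields follow the notes, and the example persona ("Margaret") and all of her details are invented for illustration.

```python
# A sketch of how the persona factors listed above might be recorded.
from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    age: int
    sex: str
    culture: str
    abilities_disabilities: str
    it_experience: str
    likes_dislikes: list = field(default_factory=list)
    habits: list = field(default_factory=list)
    role: str = "primary"          # primary or secondary user

margaret = Persona(
    name="Margaret Shaw",
    age=58,
    sex="female",
    culture="British, semi-rural",
    abilities_disabilities="mild arthritis; wears reading glasses",
    it_experience="uses email and a travel-booking site weekly",
    likes_dislikes=["dislikes jargon", "likes step-by-step confirmation"],
    habits=["writes passwords in a notebook"],
    role="secondary",              # books flights through a travel agent
)
print(margaret.name, "-", margaret.it_experience)
```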
3. What do the users want to do?

In identifying needs we must distinguish between:
- Needs identified by professional designers/developers. These are often referred to as normative needs.
- The needs of the end-user. These can be difficult to determine. It often helps to think in terms of:
  - expressed needs - what end-users SAY they want
  - felt needs - what end-users ACTUALLY want (or would like) from the system

The principal methods used to identify user needs are:
- direct observation (where possible)
- questionnaires
- interviews

Ideally, you should observe people who are using the system for their own ends, unprompted by you. For example, if the task is to develop a better ATM interface, you could (with the bank's permission) use video to monitor people using existing ATMs. You could then note any problems they encounter.

An artefact is an object or aid used in the performance of a task. Examples of artefacts include:
- notes stuck to the computer detailing (e.g.) keyboard short-cuts
- reference manuals pinned in a prominent position
- manuals created by the users themselves

A questionnaire might cover:
- How much experience do they have with relevant systems?
- What kinds of tasks, queries, etc., have they carried out using this type of system?
- Did they encounter particular problems?
- If they have tried several computing systems, did they find one easier to use than another, and if so, in what way?

Interviews vs questionnaires: interviews are usually less structured than questionnaires, whereas questionnaires provide a more formal, structured setting, ensuring consistency between respondents.

4. How will the application be used?

For example, users of an ATM may only be able to devote part of their attention to the task because:
- they are surrounded by other people and feel pressured or concerned about their privacy
- they are simultaneously trying to control small children

9) User-Centered Design - [Conceptual Design]

...Nothing of note...

10) UCD - Guidelines - [Shneiderman's Golden Rules]

Shneiderman's Eight Golden Rules are widely-used general-purpose guidelines. Shneiderman has revised the rules a number of times since he first proposed them. The current set of rules is as follows:

1. Strive for consistency
   - Identical terminology should be used in menus, prompts, etc.
   - Consistent colour and layout should be used.
   - If exceptions have to be made, they should be comprehensible and limited in number.
2. Cater for universal usability
   - Recognise the needs of diverse users (range of ages, levels of expertise, special needs, etc.), e.g. explanations for novices, shortcuts for experts.
3. Offer informative feedback
   - For every user action there should be system feedback, tailored to the action: modest feedback for frequent and/or modest actions, more substantial feedback for infrequent and/or major actions.
4. Design dialogs to yield closure
   - Sequences of actions should be organized into groups with a beginning, middle, and end.
5. Prevent errors
   - As far as possible, design systems so that users cannot make errors, e.g. grey out inappropriate menu-items; do not allow typing of alphabetic characters into numeric fields (see the sketch after this list).
6. Permit easy reversal of actions
   - This relieves anxiety, since the user knows that errors can be undone.
7. Support internal locus of control
   - Operators want to feel that they are in charge of the system.
8. Reduce short-term memory load
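As a concrete illustration of rule 5, the sketch below uses the key-validation facility of Python's standard tkinter library to build a numeric field that simply refuses non-digit keystrokes, so the error cannot be made in the first place. The "Account number" label is illustrative; this is one possible realisation, not a prescribed one.

```python
# A minimal tkinter sketch of rule 5 (prevent errors): the entry field
# rejects any edit that would leave non-digit characters in the field.
import tkinter as tk

def digits_only(proposed: str) -> bool:
    """Validation callback: accept the edit only if the resulting
    field content would be empty or all digits."""
    return proposed == "" or proposed.isdigit()

root = tk.Tk()
root.title("Error prevention")
vcmd = (root.register(digits_only), "%P")  # %P = field value if edit allowed
tk.Label(root, text="Account number:").pack(side="left", padx=4)
tk.Entry(root, validate="key", validatecommand=vcmd).pack(side="left", padx=4)
root.mainloop()
```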
11) UCD - Guidelines - [Web Content Accessibility Guidelines]

The Web Content Accessibility Guidelines comprise 12 guidelines, which relate to four general principles: Perceivable, Operable, Understandable and Robust.

1. Perceivable
   - Provide text alternatives for any non-text content.
   - Provide alternatives for time-based media.
   - Create content that can be presented in different ways without losing information or structure.
   - Make it easier for users to see and hear content.
2. Operable
   - Make all functionality available from a keyboard.
   - Provide users enough time to read and use content.
   - Do not design content in a way that is known to cause seizures.
   - Provide ways to help users navigate and find content.
3. Understandable
   - Make text content readable and understandable.
   - Make Web pages appear and operate in predictable ways.
   - Help users avoid and correct mistakes.
4. Robust
   - Maximize compatibility with current and future user agents, including assistive technologies.

12) UCD - [Heuristics and Metrics]

Once a prototype system (or even a partial prototype) has been created, it can be analysed to see how usable it is. The two main approaches to testing are Heuristic Evaluation and Usability Metrics.

Heuristic Evaluation

In Heuristic Evaluation, a number of evaluators examine an interface and assess its compliance with a set of recognised usability principles (the heuristics). Heuristics are general rules which describe common properties of usable interfaces. The process is as follows:
- Each evaluator is asked to assess the interface in the light of the heuristics - not their own likes/dislikes, etc.
- Evaluators work alone, so that they cannot influence one another.
- Each evaluator should work through the interface several times.
- Evaluators should record their comments, or have them recorded by an observer.
- If an evaluator encounters problems with the interface, the experimenter should offer assistance, but not until the evaluator has assessed and commented upon the problem.
- Only when all the evaluators have assessed the system individually should the results be aggregated and the evaluators allowed to communicate with one another.

Usability Metrics

The term Usability Metrics refers to a range of techniques that are typically more expensive and time-consuming than Heuristic Evaluation, but yield more reliable results. Techniques based on usability metrics involve asking a group of users to perform a specified task (or set of tasks). The data gathered may include:
- success rate (task completion/non-completion, % of task completed)
- time
- errors (number of errors, time wasted by errors)
- user satisfaction

Example: Web Accessibility Testers

These work in a similar way to HTML validators, but analyse the target page for accessibility as well as for HTML code validity. They automatically check many of the accessibility issues listed in the Web Content Accessibility Guidelines, e.g.:
- inclusion of alt text, summaries, table header information, etc.
- contrast between foreground and background colours, etc.
Where a page is found to violate the guidelines, most testers identify the type of error and the line of HTML code on which it occurs.
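The sketch below illustrates the flavour of one such check - finding images with no alt text and reporting the offending line - using Python's standard html.parser module. The sample page is invented, and real testers check many more of the guidelines than this.

```python
# A minimal sketch of one check performed by web accessibility testers:
# finding <img> elements with no alt attribute, and reporting the line
# of HTML on which each occurs (cf. the "text alternatives" guideline).
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            line, _ = self.getpos()
            self.problems.append(f"line {line}: <img> with no alt attribute")

page = """<html><body>
<img src="logo.png" alt="Company logo">
<img src="chart.png">
</body></html>"""

checker = AltTextChecker()
checker.feed(page)
for problem in checker.problems:
    print(problem)   # line 3: <img> with no alt attribute
```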
13) UCD - Interaction Modelling - [Introduction]

Interaction models can be divided into two broad categories:
- Task analysis: models only what happens - or is observable - during interaction.
- Cognitive models: designed to incorporate some representation of the user's abilities, understanding, knowledge, etc.

Cognitive models can be broadly categorised as follows:
- Hierarchical representations of the user's task and goal structure. These models deal directly with the issues of formulating tasks and goals.
- Linguistic and grammatical models. These models deal with articulation and translation between the system and the user.
- Physical and device-level models. These models deal with articulation at the human motor level rather than at higher levels.

14) UCD - Interaction Modelling - [Goal & Task Hierarchies]

Probably the best-known and most influential model based on goal/task hierarchies is GOMS. GOMS stands for Goals, Operators, Methods and Selection:
- Goals: these describe what the user wishes to achieve.
- Operators: these represent the lowest level of analysis - the basic actions that the user must perform in order to use the system.
- Methods: it may be possible to achieve a goal using any of several alternative sub-goals or sequences of sub-goals. These are known as methods.
- Selection: where a goal may be achieved using several alternative methods, the choice of method is determined by a selection rule.

Note that GOMS, like many models based on goal/task hierarchies, does not take account of error.

Cognitive Complexity Theory

CCT has two descriptions which operate in parallel:
- a description of the user's goals, based on a GOMS-like hierarchy but expressed through production rules
- a description of the system state, expressed as generalised transition networks, a form of state transition network

15) UCD - Interaction Modelling - [Linguistic & Grammatical Models]

These use formalisms such as BNF (Backus-Naur Form) to describe interactions. The intention is to represent the cognitive difficulty of the interface so that it can be analysed.

Backus-Naur Form

BNF can be used to define the syntax of a language. BNF defines a language in terms of Terminal Symbols, Syntactic Constructs and Productions:
- Terminal Symbols: elementary symbols of a language, such as words and punctuation marks. In computing languages, these may be variable-names, operators, reserved words, etc.
- Syntactic Constructs (or non-terminal symbols): phrases, sentences, etc. In computing languages, these may be conditions, statements, programs, etc.
- Productions: sets of rules which determine how Syntactic Constructs are built.

16) UCD - IM - [Physical & Device Models - Fitts' Law]

Fitts' Law states that, for a given system, the time taken to move a pointer onto a target varies as a function of:
- the distance the pointer has to be moved
- the size of the target

Fitts' Law is normally stated as follows:

    tm = a + b log2(d/s + 1)

where:
- tm = movement time
- a = start/stop time
- b = device tracking speed
- d = distance moved
- s = target size (relative to the direction of movement)

a and b must be empirically determined for different operations, pointing devices, etc.

Some implications of Fitts' Law:
- Interaction times can be reduced by making targets large and distances small wherever possible, e.g. pop-up menus are generally faster to use than fixed menus.
- The efficiency of fixed, linear menus can be improved by placing frequently-used options near the start-point, and by placing the menu at (or near) the screen edge so that it becomes infinitely large in the direction of movement.
- Point-and-click operations are usually faster than dragging operations.
- The distance/size ratio determines acquisition time.
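A small sketch of the Fitts' Law calculation above follows. The values of a and b are placeholders: as the notes say, they must be determined empirically for each device and operation, and the distances and sizes here are likewise invented.

```python
# A sketch of the Fitts' Law movement-time calculation.
import math

def fitts_time(d, s, a=0.1, b=0.1):
    """Movement time tm = a + b * log2(d/s + 1).
    d = distance to target, s = target size (same units),
    a = start/stop time, b = device tracking speed (assumed values)."""
    return a + b * math.log2(d / s + 1)

# Enlarging the target, or reducing the distance, lowers the
# index of difficulty log2(d/s + 1) and hence the movement time:
print(f"{fitts_time(d=400, s=20):.3f} s")   # small, distant target
print(f"{fitts_time(d=400, s=40):.3f} s")   # larger target -> faster
print(f"{fitts_time(d=200, s=20):.3f} s")   # nearer target -> faster
```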
17) UCD - IM - [Physical & Device Models - KLM]

The Keystroke-Level Model (KLM) is designed to model unit-tasks within an interaction. These would typically be short command sequences, such as changing the font of a character. The KLM would rarely be used to model sequences lasting more than twenty seconds.

The Keystroke-Level Model divides tasks into two phases:
- Acquisition - the user builds a mental model of the task.
- Execution - the task is executed using the system's facilities.

The KLM does not attempt to model what happens during the acquisition phase; this must be done using other models or methods. However, the KLM models what happens during the execution phase in great detail. The execution phase is broken down into physical motor operations, system responses, and mental operations.

The KLM defines five types of motor operation:
- K: keystroking, i.e. striking a key, including a modifier key such as shift
- B: pressing a mouse button
- P: pointing, using the mouse or other pointing device, at a target
- H: homing, i.e. switching the hand between mouse and keyboard
- D: drawing lines using the mouse

The KLM also provides mental and system response operators:
- M: mentally preparing for a physical action
- R: response from the system; may be ignored in some cases, e.g. copy-typing

Suppose we wish to model the interaction involved in correcting a single-character error using a mouse-driven text editor. This involves pointing at the error, deleting the character, re-typing it, then returning to the original point in the text. This might be modelled as follows:

1. move hand to mouse                  H[mouse]
2. position cursor after bad character P B[LEFT]
3. return hand to keyboard             H[keyboard]
4. delete character                    M K[DELETE]
5. type correction                     K[char]
6. reposition insertion point          H[mouse] M P B[LEFT]

Once an operation has been decomposed in this way, the time required to perform it can be calculated. This is done by counting the number of each type of operation, multiplying by the time required for each type of operation, then summing the times:

    Texecute = TK + TB + TP + TH + TD + TM + TR

where each term is the total time spent on operations of that type. For example, the time required for the operation described above could be calculated as follows:

    Texecute = 2tB + 3tH + 2tK + 2tM + 2tP
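The sketch below automates the KLM calculation above. The per-operator times are assumed values, roughly in line with commonly quoted KLM figures; real values depend on the user and the device.

```python
# A sketch of the Keystroke-Level Model execution-time calculation.
OPERATOR_TIMES = {       # seconds per occurrence (assumed values)
    "K": 0.20,           # keystroke (average typist)
    "B": 0.10,           # press or release a mouse button
    "P": 1.10,           # point with the mouse
    "H": 0.40,           # home hands between keyboard and mouse
    "M": 1.35,           # mental preparation
}

def execution_time(operators: str) -> float:
    """Sum the operator times for a sequence such as 'HPBHMKKHMPB'."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# The single-character correction modelled above:
# H, PB, H, MK, K, HMPB  ->  2B + 3H + 2K + 2M + 2P
correction = "H" + "PB" + "H" + "MK" + "K" + "HMPB"
print(f"Texecute = {execution_time(correction):.2f} s")
```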
18) UCD - Usability Testing - [Experimental Design]

Testing can be carried out at various stages during design and development, e.g.:
- at a preliminary stage, to determine requirements, expectations, etc.
- during design, as a means of testing general concepts or individual elements of a proposed system
- at the prototype stage, to find out if the design meets expectations, etc.
The best approach is iterative testing, i.e. testing at each stage of the design and development cycle.

Usability testing may take different forms, depending upon the stage at which it is carried out and the type of data required:
- Surveys: subjects fill in a questionnaire or are interviewed. Surveys can be either:
  - Qualitative: the questionnaire contains 'open' questions that may elicit a wide range of responses, e.g. 'What did you like most about the web-site?'
  - Quantitative: the questionnaire contains questions or statements that require a 'yes/no' or numerical response, e.g. 'The performance was too slow', to which the user should indicate agreement or disagreement on a numerical scale. The results can be analysed statistically if required.
- Observation: users are observed (or videoed) using a system, and data is gathered on (e.g.) time taken to perform tasks, number of errors made, etc.
- Controlled Studies: these usually involve comparing a new system with a reference system. The comparison is based on measurable/observable factors such as time to complete a task, number of errors made, etc. The results would normally be analysed statistically.

Designing Controlled Studies

In order to carry out a controlled study we need:
- Two (or more) conditions to compare, e.g. performance of a task on a new/experimental system and on an existing system which serves as a reference.
- A task which can be performed on both systems.
- A prediction that can be tested.
- A set of variables, including an independent variable and one or more dependent variables.
- A number of subjects, who may need to be divided into groups.
- An experimental procedure.

Conditions

If we conduct a controlled study in which we compare a new system against a reference system, we use the following terminology:
- The condition in which the new system is used is known as the experimental condition.
- The condition in which the reference system is used is known as the control condition.

Variables

The independent variable is the one we wish to control. The dependent variable is the one we will measure in order to determine whether changing the independent variable has produced an effect. The dependent variable will be some measure of performance on the two systems, e.g.:
- task-completion time
- level of knowledge/skills acquired
- user satisfaction

Subjects and Groups

The subjects should be chosen to suit the system under test, e.g.:
- potential customers, if testing an eCommerce system
- students, if testing an eLearning application
- people with a relevant special need, if testing an accessible system

Having chosen the subjects, we also have to decide how to assign them to the conditions. The options are:
- Independent measures: divide the subjects randomly into groups, and test each group under a different condition (see the sketch at the end of this section).
- Matched subjects: as above, but match the groups according to relevant criteria (e.g. the average IQ score is the same for each group).
- Repeated measures: all subjects are tested under all conditions.

Quality of Data

When designing a test or questionnaire, careful thought should be given to the kind of data it will generate. If our aim (for example) is merely to gather ideas on how to improve a system, then a qualitative questionnaire will be suitable. However, if we hope to demonstrate that our system is better than existing systems in some way(s), we may want to use a statistical test to prove this. In this latter case, we will need to design our test or questionnaire carefully to ensure it yields testable data.

Statisticians classify data under the following headings:
- Nominal-scaled data: there is no numerical relationship between scores, e.g. a score of 2 is not necessarily higher than a score of 1.
- Ordinal-scaled data: a score of 2 is higher than a score of 1, but not necessarily twice as high. Data obtained from questionnaires is usually ordinal-scaled.
- Interval-scaled data: a score of 2 is exactly twice as high as a score of 1. Timing data is usually interval-scaled.
- Parametric data: the data must be interval-scaled (see above) and, in addition:
  - The scores must be drawn from a normal population: if we were to measure our subjects on factors which are important in the study (e.g. intelligence), the results would lie on a normal distribution (sometimes known as a bell-curve).
  - The scores must be drawn from a population that has normal variance: if we were to measure our subjects as described above, the spread of scores would be the same as that found in the general population.
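The sketch below shows the independent-measures option described above: subjects are divided randomly into a control group and an experimental group. The subject IDs and group size are invented for illustration.

```python
# A minimal sketch of random assignment for an independent-measures design.
import random

subjects = ["S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08"]
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(subjects)

half = len(subjects) // 2
control_group = subjects[:half]        # tested on the reference system
experimental_group = subjects[half:]   # tested on the new system

print("control:     ", control_group)
print("experimental:", experimental_group)
```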
19) UCD - Usability Testing - [Data Analysis]

The Frequency Distribution

As a first attempt at visualising the results, we might create a frequency distribution. This is a graph showing the frequency with which each score occurs under each condition. The frequency distribution might show us, for example, that the scores for the experimental group appear to be higher than the scores for the control group. This is a commonly-used descriptive method: it presents the data, without loss, in a form that allows the characteristics of the data to be understood more easily than is possible using just the raw data.

The Average

Descriptive measures are useful but have limitations; often we need to summarise the data in some way. One of the simplest ways to summarise data is by calculating the average. However, this tells us very little about the data. For small groups of subjects, a single very low or very high score (an outlier) can significantly affect the average. This would be obvious in a frequency distribution, but not in an average value. Therefore the average, while useful, does not capture all the features of the data.

The Variance

A more useful way of summarising data is to state the variance. The variance indicates the amount of dispersion in the scores. By quoting just two values - the variance and the average - we can summarise a set of scores in considerable detail.

Standard Deviation

Another widely-used measure of dispersion is the standard deviation. The standard deviation is simply the square root of the variance.

Standard Deviation and the Normal Distribution

A frequency distribution may show marked differences between two sets of scores, not only in their average values but also in their distribution. If we were to take samples from an infinite number of subjects and then chart the frequency distribution, we would probably find that the results show a normal distribution. The normal distribution has the following features:
- It is symmetrical, with most of the scores falling in the central region.
- Because it is symmetrical, all measures of central tendency (mean, mode, median) have the same value.
- It can be defined using only the mean and the standard deviation.
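The sketch below computes the descriptive measures discussed in this section - the frequency distribution, the average, the variance and the standard deviation - using Python's standard statistics module. The scores are invented task scores for a control group and an experimental group.

```python
# A sketch of the descriptive measures discussed above.
from collections import Counter
from statistics import mean, pvariance, pstdev

control =      [4, 5, 5, 6, 6, 6, 7, 7]
experimental = [6, 7, 7, 8, 8, 8, 9, 10]

for name, scores in [("control", control), ("experimental", experimental)]:
    freq = Counter(scores)                     # the frequency distribution
    print(f"{name}:")
    print("  frequencies:", dict(sorted(freq.items())))
    print(f"  mean = {mean(scores):.2f}, "
          f"variance = {pvariance(scores):.2f}, "
          f"std dev = {pstdev(scores):.2f}")   # sqrt of the variance
```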
20) UCD - Usability Testing - [Statistical Inference]

All the techniques described so far are intended either to describe or to summarise data. For many purposes this is sufficient, but sometimes we need to go further and attempt to prove that:
- there is a significant difference between two sets of experimental data, or
- there is a significant difference in a particular direction, e.g. that data-set A is better in some way than data-set B
This is known as drawing statistical inference.

Significance

When designing experiments we try to keep all possible factors stable with the exception of one, the independent variable, which we deliberately manipulate in some way. We then measure another variable, the dependent variable, to see how it has been affected by the change(s) in the independent variable. However, we cannot assume that all changes in the dependent variable are due to our manipulation of the independent variable; some changes will almost certainly occur by chance. The purpose of statistical testing is to determine the likelihood that the results occurred by chance.

We can never prove beyond doubt that any differences observed are the result of changes in the independent variable rather than mere chance occurrences. However, we can determine just how likely it is that a given result could have occurred by chance, and then use this figure to indicate the reliability of our findings.

Before testing, we formulate two hypotheses:
- Any differences arise purely as a result of chance variations. This is known as the null hypothesis.
- Any differences arise - at least in part - as a result of the change(s) in the independent variable. This is known as the alternate (or experimental) hypothesis.

Statistical tests allow us to determine the likelihood of our results having occurred purely by chance. Thus they allow us to decide whether we should accept the null hypothesis or the alternate hypothesis.

We usually express probability on a scale from 0 to 1. For example, p <= 0.05 accompanying a statistical finding indicates that the likelihood of the observed difference having occurred as a result of chance factors is less than one in 20. This is known as the significance level.

What is an appropriate level of significance to test for?
- If we choose a relatively high value of significance, we are more likely to obtain significant results, but the results will be wrong more often. Accepting a difference that is in fact due to chance is known as a Type 1 error.
- If we choose a very low value for significance, we can place more confidence in our results. However, we may fail to find a correlation when it does in fact exist. This is known as a Type 2 error.

One-Tailed and Two-Tailed Predictions

In formulating our prediction, we must also decide whether or not to predict the direction of any observed difference. If we predict only that there will be a difference, we are using a two-tailed test. If we predict the direction of the difference, we are using a one-tailed test.

Choice of Test

When choosing a test, the following factors should be taken into account:
- Two-sample or k-sample: most tests compare two groups of samples, e.g. the results obtained from comparative tests on two different systems. Some tests can be used to compare more than two groups of samples, e.g. the results obtained from comparative tests on three or four different systems.
- Related measures or independent measures: different tests are used depending upon whether the two (or more) groups from which the data is drawn are related or not.
- Nominal, ordinal, interval or parametric data.

These three factors - number of groups, relationship between groups, and quality of data - are the principal factors to be taken into account when designing a study and choosing a statistical test. There are tests available to suit each combination of these factors:

                      2-Sample Tests                               k-Sample Tests
Related samples       t-test (related samples) - parametric        Page's L test - ordinal-scaled
                      Wilcoxon - interval-scaled
                      Sign test - ordinal-scaled
Independent samples   t-test (independent samples) - parametric    Jonckheere trend test - ordinal-scaled
                      Mann-Whitney - ordinal-scaled
                      chi-squared test - nominal-scaled

Various software packages are available to carry out these and similar tests (see the sketch below). Therefore, the main task facing the designer of a usability test is to choose the right test, in accordance with the data being gathered.
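As an illustration of such a software package, the sketch below runs two of the independent-samples tests from the table. It assumes SciPy is installed (scipy.stats provides ttest_ind and mannwhitneyu, among others), and the timing data is invented.

```python
# A sketch of running tests from the table above with scipy.stats.
from scipy import stats

# Task-completion times (seconds) for two independent groups:
control =      [41.2, 38.5, 44.0, 39.9, 42.3, 40.8]
experimental = [35.1, 33.8, 37.2, 34.5, 36.0, 38.1]

# Independent-samples t-test (appropriate for parametric data):
t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.4f}")

# Mann-Whitney U test (appropriate for ordinal-scaled data):
u, p = stats.mannwhitneyu(experimental, control, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.4f}")

# If p <= 0.05, we would reject the null hypothesis at the 5% level.
```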