Using card sorts to elicit software developers’ categorizations of project risks
Satvere Sanghera & Gordon Rugg
c. 3,700 words
Abstract
Although risk is a topic which has been extensively described in several literatures, there has
been surprisingly little research into people’s individual categorization of risks. This article
describes how we used card sorts to elicit software developers’ own categorizations of a
variety of factors which could cause risks to software development projects. The results were
an interesting mixture of the expected and the unexpected: the developers’ categorizations
were in some ways richer than those of the comparison group, as might be expected, but were
in other ways less rich than those of the comparison group. The developers’ categorizations
were largely based on project management concepts, but were phrased in idiosyncratic ways,
suggesting that the developers were using these concepts as a starting point for their own
categorization, rather than applying them “by the book”. We conclude that this approach to
perception of risk is an effective one, and merits wider use.
Keywords: software project management; risk; card sorts
Introduction
Risk in general, and software project risks in particular, have received considerable attention
from practitioners and the research community. One praiseworthy result of this is an
extensive literature on risk management, together with corresponding methods and
guidelines for best practice, within software engineering. Another praiseworthy result is an
extensive literature on perception of risk, together with corresponding methods and
guidelines for best practice, within psychology. This would be a good thing, except that in
practice the two research communities have proceeded largely independently, often with
quite different conclusions. A further complication is that there is yet another extensive
literature, with its own body of practice, focusing on disasters and disaster management; this,
like the literature on human error, has evolved its own methods and guidelines for best
practice. This article describes how we used card sorts to investigate how software developers
actually perceived risk, with a view to establishing how much relationship this bore to the
various literatures on risk.
One relevant literature is the normative literature on risk analysis and risk management for
information systems (IS). This topic is covered in most IS textbooks, usually as part of IS
security. Standard approaches in this tradition include matrices of likelihood versus severity
of risks. These will presumably be familiar to most readers, so are not discussed in further
detail here.
The second literature which we took into account is the considerable literature on the
psychology of risk perception, much of it in the judgement/decision-making (J/DM)
tradition of Kahneman, Slovic & Tversky’s classic text [1]. This literature has produced
a substantial set of results showing that people are prone to serious and predictable errors in
areas such as estimation of probabilities. A classic example involves framing: people tend to
react differently to the same underlying risk depending on the way in which it is presented,
even though the different presentations are in fact formally equivalent. Another classic
example is that people tend to be over-optimistic about their own prospects in situations such
as estimating their own life expectancy, or the likelihood of failure when they are setting up a
new company.
The J/DM literature is significant for two reasons. One is that it clearly demonstrates a
predictable set of shortcomings and biases in risk estimation which apply even to experts.
These findings are not included in the vast majority of IS risk literature. The second
significant issue is that, although the reliability of the J/DM effects is not in dispute, there has
been considerable debate within the literature about their validity. Researchers such as Gerd
Gigerenzer have argued that the classic J/DM effects are largely an artefact of asking people
to estimate risks as probabilities; when the same problem is re-cast as a formally equivalent
frequency estimation, then the effects typically vanish. The balance of opinion in the field has
recently shifted towards the frequentist position, but there remains a general consensus that
people have problems with risk estimation when the task is presented in certain ways. An
excellent overview of work by Gigerenzer and other leading researchers in this tradition is
provided by [2].
One consistent finding from the J/DM literature is that people’s reaction to risks is dominated
by a few main underlying factors, the most prominent of which are severity of outcome, the
extent to which the risk is novel and unknown, and the amount of control that the individual
has over the situation. In general, the more prominent these factors are in a given situation,
the more likely it is that people will react in ways which might appear irrational. Given the
same severity of possible outcome, for instance, people tend to be much more concerned
about an unfamiliar risk which is outside their control than about a much more likely risk
which is familiar and perceived to be under their control – the classic example is people’s
nervousness about air travel (statistically very safe) as opposed to automobile accidents (a
much more common cause of death).
Taken in conjunction with other issues such as framing effects, this has serious implications
for public policy decisions involving new technological developments, such as automated
aviation systems to replace human operatives, since it can be extremely difficult to predict
how the public will react. However, despite the seriousness of this issue, comparatively little
research has involved asking people to categorize risks in their own terms. Most research has
involved researchers deriving underlying factors from statistical analysis of experimental
results – even when researchers have looked at individual differences in risk-taking behavior,
this has typically involved statistical analysis of individuals’ behaviors in experiments, rather
than asking the individuals to explain their mental models of risk [3].
The third literature which we considered is the “disaster literature” in the tradition of
researchers such as Charles Perrow [4], Nancy Leveson and Clark Turner [5]. This literature
typically examines the history of a particular incident in considerable detail, and frequently
uses the results to guide policy formulation. This literature can lead both into very specific
technical detail and also into the realities of actual working practices, as opposed to official
working practices – Leveson’s examination of the Therac-25 incident is a classic example of
this, with a detailed examination both of the programming errors that led to the device killing
several people, and of the working practices that led to the situation in which the
programming errors occurred.
We wanted to find out which issues were perceived by developers as being important ones in
software development project risk, and to see how these corresponded to the three main
literatures described above, plus the literature on human error, which is discussed below.
One promising way of investigating this was via card sorts. Card sorts have been widely used
on an informal basis for many years, but the formalization of one variety [6] has led to its
more systematic use in areas such as elicitation of software metrics [7] and of programmers’
categorization of software problems [8]. This variety of card sorts involves showing a
respondent a set of cards representing domain entities, and asking the respondent to sort the
cards into groups which they considered significant. After the respondent has sorted the
cards in relation to one criterion, they are asked to re-sort the same cards using a different
criterion, and to repeat this process until all the relevant criteria have been covered. The cards
may bear images such as screen dumps of Web pages [7], or may bear words, such as verbal
descriptions of programming problems [8].
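As a rough illustration (not part of the original study), the repeated single-criterion sorting procedure can be recorded as a simple data structure: one entry per sort, pairing the respondent’s chosen criterion with the groups of card numbers. The criterion names and groupings below are invented for the sketch.

```python
# Hypothetical record of repeated single-criterion card sorts.
# Each sort pairs a respondent-chosen criterion with a grouping
# of the nine card numbers; names and data here are invented.
sorts = [
    {"criterion": "people vs process",
     "groups": {"people": [1, 2, 3, 5, 9], "process": [4, 6, 7, 8]}},
    {"criterion": "avoidable early in the project",
     "groups": {"yes": [1, 4, 7], "no": [2, 3, 5, 6, 8, 9], "don't know": []}},
]

def sort_arity(sort):
    """Number of non-empty groups used in a single sort
    (a value of 2 is a dichotomous, or dyadic, sort)."""
    return sum(1 for cards in sort["groups"].values() if cards)

arities = [sort_arity(s) for s in sorts]
print(arities)  # [2, 2] -> both example sorts are dyadic
```

Recording each sort separately in this way makes it straightforward to count criteria per respondent and category counts per sort, the two quantities analysed in the results below.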
This method offered several promising advantages for research into how developers actually
categorized risks. One is that the method allows respondents to choose whichever criteria and
groups they want. Another is that the responses from card sorts are typically short, discrete
phrases which are less ambiguous and vague than typical responses to interviews and to
open questions in questionnaires. A third advantage is that the format of the method makes it
possible to include quite detailed descriptions of programming problems, project problems,
etc, in a tractable way. Previous researchers using this technique have consistently reported
that respondents reacted positively to the method, and that respondents would often say
explicitly when they thought they had listed all the criteria worth mentioning; this
provides a useful insurance against the risk of respondents giving trivial answers because of
issues such as demand characteristics in a research context.
For this study, we concentrated on the three literatures described above. We did not focus
primarily on the literature on human error because some important types of human error
occur at a subconscious level, and are therefore not amenable to examination via card sorts.
An example is frequency capture errors, where someone executes an action which they make
frequently, rather than a rarer (but correct in the circumstances) action. A typical example of
this is someone who usually enters their home through the front door, and who mistakenly
gets out the front door key when they want to enter via the back door instead.
The method we used was as follows.
The case study
There were two groups, each of six respondents, with both groups containing a mix of
Caucasian and Asian respondents, and with both groups containing three male and three
female respondents to allow for possible gender effects in card sorting [8]. One group
consisted of software developers with at least one year of commercial experience in
information systems projects. The other was a control group consisting of respondents with
no information systems experience beyond home computer use. The developers’ ages ranged
from 22 to 38, and the control group’s from 20 to 36.
The materials used were nine standard 15 cm × 10 cm filing cards, each numbered, and each
bearing a different description of a potential source of problems on a project. The full list,
with the same sequence of numbering as on the cards, is as follows:
1: Failure to understand who the project is for
2: Failure to appoint an executive user responsible for sponsoring the project
3: Failure to appoint a fully qualified and supported project manager
4: Failure to define the objectives of the project
5: Failure to secure commitments from people who are needed to assist with the project
6: Failure to estimate costs accurately
7: Failure to specify very precisely the end users’ requirements
8: Failure to provide a good working environment
9: Failure to tie in all the people involved in the project with contracts
The procedure used was the standard one described in [6]. Respondents were shown the
sorting process using pictures of houses as the example (to reduce the risk of cueing
respondents towards a particular set of risk-related responses). They were told that they
could use categories such as “Don’t know” or “Not applicable” if they wished. At the end of
the session, they were asked to identify the card which in their opinion bore the most
important risk for the outcome of a project, and the card which they felt was the least
important risk.
Results
The number of sorts performed ranged from two to twelve for the developers, and from two
to four for the control group. This is consistent with
the developers having more expertise, and therefore being able to generate more criteria for
sorting. In total, the developers generated thirty-five criteria, and the control group generated
eighteen. There were no significant differences between genders within each group in relation
to number of criteria, with male developers generating seventeen criteria compared to
eighteen from the female developers; the males in the control group generated nine criteria,
as did the females.
There were some interesting differences in the number of categories (i.e. groups of cards)
which respondents used within each sort. The most striking difference between developers
and the control group was that the developers used dyadic sorts (i.e. sorting the cards into
two groups) on eighteen occasions (51% of their sorts), whereas the control group only did
this on four occasions (22% of their sorts). The next most common pattern of sorting was into
three groups, used by the developers on nine occasions and by the control group on
ten occasions; both groups made some use of sorts into one group, four
groups, five groups and six groups, but never on more than three occasions. The reason for
this difference in dyadic sorting, compared with no major differences in the other sorts, is an
interesting question. Similar differences in use of dyadic categories were reported by Sue
Gerrard [9] in the domain of perceptions of women’s working dress, with males more likely
to use dichotomous sorts than females; one possible explanation is that the males were less
expert than the females in that domain, resulting in less rich categorization, but this is the
opposite of the pattern reported here. Other studies in a variety of domains have also found
clear differences between groups in relation to use of dyadic categories, relating to gender,
expertise and ethnicity, but the results do not form a consistent pattern. For instance, gender
has been clearly involved in at least one domain where there were no a priori reasons to
expect a gender difference (categorization of teaching materials), but in other domains, such
as this one, there is no visible gender effect. This issue is the subject of some debate in the
card sorts research community, but a full discussion of it and its implications is beyond the
scope of this article.
Content analysis began with analysis of verbatim agreement among names of criteria. This is
a useful way of checking for the use of codified knowledge, typically learned via formal
instruction. In some domains, different respondents typically use identical wording for
criterion names, reflecting what they have been taught on courses or at university; in others,
respondents vary widely in their terminology, reflecting independent learning. In this
domain, we found only four instances where verbatim agreement occurred, each of these
involving only two responses. The criteria involved were:
- requirements vs non-requirements problems
- environment
- point of failure/cause of failure
- human resources
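The verbatim-agreement check described above amounts to counting criterion names that recur word-for-word across respondents. A minimal sketch of such a count, using invented criterion names rather than the study’s actual data, might look like this:

```python
from collections import Counter

# Hypothetical criterion names elicited from different respondents;
# the real study's criteria are not reproduced here.
criteria = [
    "environment", "human resources", "environment",
    "cost vs schedule", "human resources", "staff morale",
]

# Names used verbatim by more than one respondent are taken as a
# possible marker of codified (formally taught) knowledge.
counts = Counter(name.strip().lower() for name in criteria)
verbatim_agreements = {name: n for name, n in counts.items() if n > 1}
print(verbatim_agreements)  # {'environment': 2, 'human resources': 2}
```

In practice near-matches (singular/plural, word order) would need human judgement, which is why the subsequent grouping into superordinate criteria was done by an independent judge rather than automatically.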
The next stage of content analysis involved grouping the individual criteria into
coarser-grained superordinate criteria, by aggregating criteria which were closely related. This
was done by an independent judge. The results from this are shown in Table 1 below.
Table 1: superordinate criteria

Superordinate Construct               Developers   Control Group   Total
Responsibility                        3            1               4
Outcome                               2            5               7
People involvement                    3            3               6
Project                               9            3               12
Measures                              2            0               2
Resources                             2            1               3
Objectives                            1            1               2
Problem                               2            0               2
Requirements                          1            1               2
Sponsorship                           1            0               1
Commitment                            1            0               1
Environment                           1            1               2
Analysis vs interpersonal             1            0               1
Challenge                             1            0               1
Expenditure                           1            0               1
Consequences                          1            0               1
Relevance with regard to list*        1            0               1
Business steps to develop a project   1            0               1
Objective vs cost                     0            1               1
Risk                                  0            1               1
Levels of management                  0            1               1
Total                                 34           19              53
Note: Relevance with regard to list* refers to the relevance of the item considering the presence
of other items on the list – a somewhat idiosyncratic criterion generated by one respondent.
There are various other ways in which content analysis can be performed on the criteria and
categories. An interesting result emerges when the criteria are categorized into “objective”
and “subjective”, where “objective” criteria involve observable and measurable factors such
as “resources”, and subjective criteria do not (for instance, “interesting problems”). Table 2
shows the results from this for the verbatim criteria.
Table 2: objective and subjective criteria

Respondent group   Objective criteria   Subjective criteria
Developers         30                   4
Control Group      17                   2
Total              47                   6
Discussion
The results show an interesting mixture of the expected and the unexpected. The developers
used considerably more criteria than the control group, presumably reflecting greater
expertise in dealing with projects and with project risks, as might be expected. However, the
developers used considerably less rich categorization within their criteria than the control
group, with a preference for dichotomous categories, which was a surprising result – both the
literature on expertise and the findings from other work with card sorts would normally
predict richer categorization by experts than by non-experts.
Similarly, the developers used a high proportion of superordinate criteria such as
“measures”, “sponsorship”, “expenditure” and “business steps to develop a project” which
were not used by the control group, and which are consistent with classic risk management
factors. This is what might be expected if the developers are experts and have been trained in
classic risk management; however, the low figures for verbatim agreement within the
developer group suggest that members of this group were not simply applying well-practised
standard procedures to categorizing the risks on the cards.
Another interesting finding was the high proportion of “objective” criteria used by both the
developer group and by the control group. As with the absence of gender differences in
dichotomous categorization, the reasons for this are obscure, but have practical implications
which would merit further research.
Significantly absent from these results was any mention of the classic three factors of dread,
severity and (lack of) control reported in the J/DM literature, even though the nature of card
sorting appears eminently suitable for allowing respondents to use these as the criteria for
sorting. A single study such as this one is not a sufficient basis for questioning the validity of
these three factors, but it does demonstrate that card sorting provides a good methodological
basis for further investigation of this issue.
The results from asking respondents to identify the most important and the least important
factor listed on a card were also interesting. Most developers believed that the most
important was card 7, “Failure to specify very precisely the end users’ requirements”,
whereas the majority of the control group believed that the most important was card 4,
“Failure to define the objectives of the project”. Although at first glance these factors may appear very
similar, there are some profound differences relating to the development process. There is a
significant difference between “specifying the end users’ requirements very precisely” and
“finding out what the end users want”. Quite often, users are not very sure what they want,
or have conflicting factions wanting different things; in such situations, a wise developer may
simply offer the users a precisely defined and feasible specification as a way of breaking the
deadlock [10].
As for the least important factor listed on the cards, the most common choice for the
developers (chosen by three of the six developers) was card 8, “Failure to provide a good
working environment for the project”. The most common choice among the control group
(again, chosen by three of the six) was card 9: “Failure to tie in all the people involved in the
project with contracts”. Informal discussion with the respondents confirmed that these results
were not a fluke; the developers believed that the working environment meant very little as
long as the job was done and the people were committed, usually (though not necessarily) via
a contract.
Conclusions
Card sorts provided a flexible, practical method for eliciting respondents’ categorization of
factors relating to project risk. Our findings suggest that software developers draw on classic
risk management concepts when dealing with risk-related problems, but that they do not do
this in a “management by numbers” way; instead, they use these concepts in a variety of
personalized ways. Our findings also suggest that the classic J/DM model of three main
underlying factors in perception of risk is not ubiquitous, and that its validity may be open
to debate, which is consistent with arguments from the frequentist approach to
judgement and decision-making.
There were also unexpected results in this study, such as the high proportion of dichotomous
sorts performed by the developers, and the high proportion of “objective” criteria used by
developers and by the control group. These would probably reward further investigation.
The study described here is primarily a demonstration of concept; further work would clearly
be needed to test the generalizability of these findings. In addition, it would be useful to
investigate in more depth some of the issues identified above – techniques such as laddering
and on-line self report would be well suited for this.
References
[1] D. Kahneman, P. Slovic & A. Tversky (Eds.)
Judgement under Uncertainty: Heuristics and Biases.
Cambridge University Press, Cambridge, UK, 1982
[2] G. Wright & P. Ayton (Eds.) Subjective probability.
John Wiley and Sons, Chichester, UK, 1994
[3] P. Bromiley & S. P. Curley
Individual differences in risk taking.
In: J. F. Yates (Ed.)
Risk Taking Behavior
John Wiley & Sons, Chichester, UK, 1992
[4] C. Perrow (1984)
Normal Accidents: Living with High-Risk Technologies
Basic Books, New York
[5] N. G. Leveson & C. S. Turner (1993)
An investigation of the Therac-25 Accidents.
IEEE Computer, 26(7), pp. 18-41
[6] G. Rugg and P. McGeorge (1997).
The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts.
Expert Systems, 14(2), pp. 80-93.
[7] L. Upchurch, G. Rugg & B. Kitchenham (2001).
Using Card Sorts to Elicit Web Page Quality Attributes.
IEEE Software, 18(4), pp. 84-89
[8] N. A. M. Maiden & M. Hare (1998)
Problem Domain Categories in Requirements Engineering
International Journal of Human – Computer Studies, 49, pp. 281-304
[9] S. Gerrard, The Working Wardrobe: Perceptions of Women’s Clothing at Work, Unpublished
Master’s thesis, London University, London, UK, 1995.
[10] B. Kitchenham (Keele University, UK)
Personal communication, April 2002
Authors’ biographies
Satvere Kaur Sanghera has a BSc in Sociology/Social Psychology from Bradford University, a
Postgraduate Diploma in Senior Secretarial Studies from Coventry Technical
College, and an MSc in Office Systems and Data Communications from University College
Northampton.
Dr Gordon Rugg is a Senior Lecturer in Computer Science at Keele University, UK, and a
Senior Visiting Research Fellow in the Department of Computer Science at the Open
University, UK. His first degree was in French and Linguistics from Reading University,
followed by a PhD in Psychology from Reading University. He is editor of Expert Systems: the
International Journal of Knowledge Engineering and Neural Nets. His research interests include
requirements acquisition, elicitation methods, and cross-cultural research. He can be
contacted by e-mail at: g.rugg@cs.keele.ac.uk