Digital Representations of Student Performance for Assessment

P. John Williams
University of Waikato, New Zealand
and
C. Paul Newhouse (Eds.)
Edith Cowan University, Australia
It was the belief that assessment is the driving force of curriculum that
motivated the authors of this monograph to embark on a program of research
and development into the use of digital technologies to support more authentic
forms of assessment. They perceived that in responding to the educational needs
of children in the 21st Century, curriculum needed to become more relevant and
engaging, but that change was unlikely without commensurate change in methods
and forms of assessment. This was particularly true for the high-stakes assessment
typically conducted at the conclusion of schooling as this tended to become the
focus of the implemented curriculum throughout the years of school. Therefore
the authors chose to focus on this area of assessment with the understanding
that this would inform assessment policy and practices generally in schools.
This book provides a conceptual framework and outlines a project in which digital
methods of representing students’ performance were developed and tested in the
subject areas of Applied Information Technology, Engineering, Italian and Physical
Education. The methodology and data collection processes are discussed, and
the data is analysed, providing the basis for conclusions and recommendations.
Digital Representations of Student Performance for Assessment
Edited by
P. John Williams
University of Waikato, New Zealand
and
C. Paul Newhouse
Edith Cowan University, Australia
A C.I.P. record for this book is available from the Library of Congress.
ISBN: 978-94-6209-339-3 (paperback)
ISBN: 978-94-6209-340-9 (hardback)
ISBN: 978-94-6209-341-6 (e-book)
Published by: Sense Publishers,
P.O. Box 21858,
3001 AW Rotterdam,
The Netherlands
https://www.sensepublishers.com/
Printed on acid-free paper
All Rights Reserved © 2013 Sense Publishers
No part of this work may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, microfilming,
recording or otherwise, without written permission from the Publisher, with the
exception of any material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work.
TABLE OF CONTENTS

Preface  vii

Introduction and Background  1
John Williams
    Significance and Rationale  1
    Statement of Problem and Research Question  3
    Method  6
    Recommendations  7

Literature Review and Conceptual Framework  9
Paul Newhouse
    Performance Assessment  9
    Computer-Supported Assessment  12
    Digital Forms of Performance Assessment  15
    Methods of Marking  23
    Conceptual Framework for the Study  26

Method and Analysis  29
John Williams and Alistair Campbell
    Samples  30
    Data Collection and Analysis  32
    Methodology Framework  38
    Developing the Assessment Tasks  41

Applied Information Technology  49
Paul Newhouse
    The Nature of the AIT Course  49
    Implementation and Technologies  53
    Online Repository  56
    Analytical Marking and Analysis  57
    Comparative Pairs Marking  63
    Conclusions About Marking Processes  71
    Student and Teacher Perceptions and Attitudes  76
    Comparison Between Classes  83
    Conclusions from the AIT Course  84
    Summary of Findings for AIT  89
    Recommendations from the AIT Data  95

Engineering Studies  99
John Williams
    Implementation and Technologies  99
    Engineering Case Studies  102
    Online Repository  102
    Analytical Marking and Analysis  102
    Comparative Pairs Marking and Analysis  105
    Conclusions About Marking Processes  109
    Conclusions from Student and Teacher Data  111
    Comparison between Classes  117
    Conclusions from Engineering Course  117
    Summary of Findings from Engineering Studies Case Studies  119
    Recommendations for Engineering  122

Italian Studies  125
Martin Cooper
    Implementation and Technologies  126
    Italian Case Studies  133
    Online Repository  133
    Analytical Marking and Analysis  135
    Comparative Pairs Marking and Analysis  140
    Conclusions About Marking Processes  148
    Conclusions from Student and Teacher Data  150
    Overall Conclusions from Italian Course  157
    Summary of Findings for Italian Studies  160

Physical Education Studies  169
Dawn Penney and Andrew Jones
    Implementation and Technologies  169
    Case Studies  175
    Online Repository  175
    Analytical Marking and Analysis  176
    Comparative Pairs Marking and Analysis  178
    Conclusions About Marking  184
    Conclusions from Student and Teacher Data  185
    Overall Conclusions from PES Course  189
    Summary of Findings from PES  191

Findings and Conclusions  197
Jeremy Pagram
    Findings  197
    General Conclusions  208

References  213
PREFACE
It was the belief that assessment is the driving force of curriculum that motivated
the authors of this monograph to embark on a program of research and development
into the use of digital technologies to support more authentic forms of assessment.
They perceived that in responding to the educational needs of children in the 21st
Century, curriculum needed to become more relevant and engaging, but that change
was unlikely without commensurate change in methods and forms of assessment.
This was particularly true for the high-stakes assessment typically conducted at
the conclusion of schooling as this tended to become the focus of the implemented
curriculum throughout the years of school. Therefore the authors chose to focus on
this area of assessment with the understanding that this would inform assessment
policy and practices generally in schools.
It is gratifying when a project which is researching at the cutting edge of
educational development leads to real change in educational practice, as was the
case in this project. A number of the recommendations made were implemented
around the time of the conclusion of the project. The recognition of the need for valid
and reliable high-stakes assessment, and the coinciding development of technologies
which can feasibly capture the performance of students in school, will help ensure
that the outcomes of this research continue to inform educational assessment
decision making.
We would like to thank all the chapter authors for their willingness to develop
their chapters, and also Cathy Buntting for her expertise in reviewing the manuscript
and then formatting it to such a high standard.
This monograph is the outcome of a three-year research project that was managed
by the Centre for Schooling and Learning Technologies (CSaLT) at Edith Cowan
University, and funded by the Australian Research Council Linkage Scheme and the
Curriculum Council of Western Australia.
The research was conducted under the leadership of Paul Newhouse and John
Williams, and the authors of the chapters in this book were the Investigators in the
project. A broader team of consultants, managers, advisors, research assistants,
postgraduate students, assessors and teachers all helped to ensure the project’s
successful conclusion.
A number of conference and journal outcomes have accompanied this project
and supported this book. They are listed after the References at the end of the book.
John Williams and Paul Newhouse
April, 2013
CHAPTER 1
JOHN WILLIAMS
INTRODUCTION AND BACKGROUND
This research was conducted in Western Australia (WA) over a period of three years,
concluding in 2011. This report of the research focuses on the findings, conclusions
and recommendations of the study, but contextualizes that within a rationale,
literature review and description of the methodology.
The study set out to investigate the use of digital forms of assessment in four
upper secondary school courses. It built on concerns that the assessment of student
achievement should, in many areas of the curriculum, include practical performance
and that this will only occur in a high-stakes context if the assessment can be shown
to validly and reliably measure the performance and be manageable in terms of cost
and school environment. The assessment examined in this research is summative in
nature (i.e. it is principally designed to determine the achievement of a student at
the end of a learning sequence rather than inform the planning of that sequence for
the student) with reliability referring to the extent to which results are repeatable,
and validity referring to the extent to which the results measure the targeted learning
outcomes.
The research specifically addressed a critical problem for the school systems in
Western Australia, which also has national and international significance. At the
same time the research advanced the knowledge base concerning the assessment of
practical performance by developing techniques to represent practical performance
in digital forms, collate these in online repositories, and judge their quality using a
standards-based marking method and trialling a comparative pairs marking method.
SIGNIFICANCE AND RATIONALE
From the 1990s, significant developments in computer technology have been the
emergence of low-cost, high-powered portable computers, and improvements in the
capabilities and operation of computer networks (e.g., intranets and the accessibility
of the Internet). These technologies have appeared in schools at an escalating rate.
During that same period school systems in Australia were moving towards a more
standards-based curriculum and investigating methods of efficiently and effectively
assessing students from this perspective.
In Western Australia this became critical with the development of high-stakes
senior secondary courses to be implemented over the latter half of the decade. In
some courses developments in technology dictated that students should be assessed
making use of that technology, while in many courses it was likely that at least some
of the intended learning outcomes were not able to be adequately assessed using
paper-based methods. Therefore it was important that a range of forms of assessment
were considered along with the potential for digital technologies to support them.
There is a critical need for research into the use of digital forms of representation
of student performance on complex tasks for the purposes of summative assessment
that are feasible within the constraints of school contexts. Internationally the need
for better forms of assessment is increasingly being seen as a critical component in
improving schooling, and is often discussed under the banner of ‘twenty-first century
skills’ (Kozma, 2009). Recently (March 2011), even the American President spoke
at length on the need to measure performance in ways other than traditional exams in
order to support students “learning about the world” and so that “education would
not be boring for kids” (eSchool News, 2011, p. 15). However, it is still necessary
that these alternative forms of assessment generate defensible measurements; that
is, are valid and reliable measures of the intended performance, particularly for
summative assessment.
An assessment needs to possess content, criterion and construct validity
(Dochy, 2009), the first being the extent to which the assessment addresses the
relevant knowledge domain; that is, its authenticity. Dochy (2009) sees the identification
of criterion and construct validity as being more problematic for new modes of
assessment focused on complex problems. Criterion validity is the extent to which
the assessment correlates with another assessment designed to measure the same
construct. Construct validity is the extent to which the assessment measures a
‘construct’ within a ‘conceptual network’ usually through estimating relationships
with other constructs. The value of the validity of an assessment is dependent on
the reliability of measurement that may be interpreted as the degree of agreement
between assessors (inter-rater reliability) or degree of consistency between
assessments (e.g., test-retest). Dochy questions this classical theory and argues for
the use of generalisability theory that seeks to include judgements from multiple
perspectives and from multiple assessments to generalise the behaviour of a student.
In essence this theory seeks to identify and explain sources of error in measurement
rather than minimise it.
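To make the inter-rater notion concrete, the following minimal sketch expresses agreement between two assessors as a Pearson correlation over a small set of hypothetical scores; the scores are invented for illustration and are not data from the study, whose own analyses are reported in the later chapters.

    # Minimal sketch: inter-rater reliability expressed as a Pearson correlation
    # between two assessors' scores for the same set of student work.
    # The scores below are hypothetical and are not data from the study.
    from math import sqrt

    def pearson(x, y):
        # Pearson correlation between two equal-length lists of scores.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    assessor_1 = [12, 18, 7, 15, 20, 9]   # hypothetical marks from assessor 1
    assessor_2 = [11, 17, 9, 14, 19, 10]  # hypothetical marks from assessor 2
    print(round(pearson(assessor_1, assessor_2), 2))  # a value near 1 indicates strong agreement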
This research investigated authentic digital forms of assessment with high levels
of reliability and manageability, which were capable of being scaled up for statewide implementation in a cost-effective manner. The findings of this research provide
guidelines for educators and administrators that reflect successful practice in using
Information and Communications Technology (ICT) to support standards-based courses.
The findings also provide significant benefit to the wider educational community,
particularly in terms of the development and provision of a nationally consistent
schooling system with accountability to standards in senior schooling systems.
STATEMENT OF PROBLEM AND RESEARCH QUESTION
The general aim of the study was to explore the potential of various digitally-based
forms for external assessment for senior secondary courses in Western Australia.
Specifically the study set out to determine the feasibility of four digital-assessment
forms in terms of manageability, cost, validity and reliability, and the need to support
a standards-based curriculum framework for students in schools across the state.
The problem being addressed was the need to provide students with assessment
opportunities in new courses which are, on the one hand, authentic, since many outcomes
do not lend themselves to being assessed using pen and paper over a three-hour
period, while on the other hand being able to be reliably and manageably assessed by
external examiners. That is, the external assessment for a course needs to accurately
and reliably assess the outcomes without a huge increase in the cost of assessment.
The main research question was:
How are digitally based representations of student work output on authentic
tasks most effectively used to support highly reliable summative assessments
of student performances for courses with a substantial practical component?
The study addresses this question by considering a number of subsidiary questions.
1. What are the benefits and constraints of each digitally based form to support
the summative assessment of student practical performance in senior secondary
courses in typical settings?
2. What is the feasibility of each digital form of assessment in terms of the four
dimensions: technical, pedagogic, manageability, and functional?
3. Does the paired comparison judgments method deliver reliable results when
applied to student practical performance across different courses?
The courses selected for the research were Applied Information Technology,
Engineering Studies, Italian Studies and Physical Education Studies. Following is a
summary of the specific issues in each of these courses.
Discussion of the Problem for Applied Information Technology (AIT)
In contrast to the other three courses in the project, in Applied Information Technology,
digital technologies provide the content for study as well as pedagogical support.
Therefore performance relates to using the technologies to demonstrate capability.
The syllabus states that the AIT course “provides opportunities for students
to develop knowledge and skills relevant to the use of ICT to meet everyday
challenges”. As such in the course students should “consider a variety of computer
applications for use in their own lives, business and the wider community”. In the
course students spend the majority of their time in class using digital technologies
to develop information solutions. It is therefore surprising that currently
the external assessment consists of a three-hour paper-based exam. This is despite
the fact that the syllabus stipulates that around 50% of the weighting of assessment
should be on production.
In early 2008 courses like AIT were changed with the decision that all senior
students were to sit an external examination. The ramifications of this decision were
likely to be widespread including that the ‘exam’ would have to be appropriate for
lower achieving students, it would dominate the course delivery and would involve
a lot more students, increasing the cost considerably. Originally it had been assumed
that because only higher achieving students were likely to be involved, the extra time
needed to collate a portfolio was reasonable and would only include higher quality
work that would be easier to mark. Another confounding change was the requirement
for the course to be packaged in a syllabus format with details of specific content
for each unit rather than what had been a definition of the boundaries of the content
with the opportunity to address the content to varying depths and across a range of
relevant contexts for the students and teacher. This also led to a shift of focus away
from outcomes towards content that immediately highlighted the issue of the variety
of relevant contexts that could be involved in the course and the issue of the rapidly
changing content of these areas of technology. This had not been such an issue with
the focus on outcomes because they could be applied to the range of contexts and did
not specify particular content that could quickly date. This has since led to the focus
for assessment being on assessment type rather than outcomes.
While students can include study in AIT towards university entry, this would be
of no value if the external assessment propels the course towards becoming mainly
‘book work’ rather than creative digital work. We are living in a society where
almost every avenue of work and life requires the use of digital tools and resources.
Whether a senior student is aiming to be a mechanic, doctor, accountant or travel
agent, study in AIT could begin to give them the skills, attitudes and understanding
that will support them in being more successful in work and life.
Therefore the research problem for the AIT course becomes that, to align with the
aims, rationale, outcomes, content and preferred pedagogy, assessment must include
students using digital technologies, but there are a number of ways in which that may
be achieved. The research question therefore becomes: which method of assessment, a
portfolio, a computer-based exam, or a combination of the two, is most feasible for the
course at this time?
Discussion of the Problem for Engineering Studies
In 2007 a new senior secondary subject, Engineering Studies, was introduced in
Western Australia. As a result, for the first time in Western Australia, achievements
in Engineering Studies could contribute to gaining tertiary entrance. Thus, an
assessment structure had to be designed and implemented that would measure
achievement in Engineering. The course was structured with a design core, and
then students could study one of three options: materials, structures and mechanical
systems; systems and control; or electrical/electronics.
The assessment structure had an internal and an external component. The teacher
submitted a mark for each student, representing design, production and response
activities throughout the year and worth 50% of the student’s final mark. A 3-hour
external written examination measured student knowledge on both the core and
specialization areas through a series of multiple-choice and short-answer questions,
and was combined and moderated with the school-based assessment mark.
For a practical and performance-based subject, the examination did not reflect
the essential nature of the subject. Consequently pedagogies were too theoretical
as teachers taught for the exam and had difficulty effectively connecting theory and
practice. The examination was therefore limited in relation to the course content,
concepts and outcomes that it embraced.
The practical examination developed in this project reaffirmed the need for
research to explore the possibilities that new technologies may open up to extend the
practical assessment in the course.
Discussion of the Problem for Italian Studies
In general, this research project has sought to explore digital assessment tasks that are
targeted at the measurement of student performance in the area being investigated.
However, Italian studies already had a tradition of assessing Italian oral performance
through a face-to-face examination where two markers assess each student’s
performance in real time. This examination is undertaken at a central location away
from the students’ school. Therefore the Italian component of the study has focused
on the exploration of different ways of digitally assessing oral performance that may
have advantages in terms of validity, reliability and logistics. Ultimately techniques
were trialled that both simulated a conversation using digital technologies and were
capable of being carried out within a typical school that teaches Italian.
Throughout the research process the usefulness of digital technologies to the daily
pedagogical practices of Italian teachers was also investigated and demonstrated.
In the final year of the project the scope of the research was expanded to cover
Listening and Responding, Viewing, Reading and Responding in addition to Oral
Communication. The final formal assessment task had components designed to
address these areas such as visual stimuli, and Italian audio for the students to
respond to.
Discussion of the Problem for Physical Education Studies
In 2007 a new senior secondary course, Physical Education Studies, was introduced in
WA. The development of the course meant that for the first time in WA, student
achievements in Physical Education Studies could contribute to gaining tertiary
entrance. A challenge and dilemma for the course developers was to determine the
nature of the achievements that could be encompassed in assessment and specifically,
an external examination. Differences in current practice across Australasia reflect an
ongoing lack of consensus about the examination requirements and arrangements
for senior physical education that can effectively address concerns to ensure validity,
reliability, equity and feasibility. More particularly, the differences centre on firstly,
whether and in what ways the skills, knowledge and understandings inherent in
practical performance can feasibly and reliably be assessed so as to align with
examination requirements; and secondly, how any such assessment can align with
an intent embedded in the new WA and a number of other course developments, to
seek to better integrate theoretical and practical dimensions of knowledge (see for
example, Macdonald & Brooker, 1997; Penney & Hay, 2008; Thorburn, 2007).
In these respects, the research sought to acknowledge and respond to Hay’s (2006,
p. 317) contention that:
…authentic assessment in PE should be based in movement and capture the
cognitive and psychomotor processes involved in the competent performance
of physical activities. Furthermore, assessment should redress the mind/body
dualism propagated by traditional approaches to assessment, curriculum and
pedagogies in PE, through tasks that acknowledge and bring to the fore the
interrelatedness of knowledge, process (cognitive and motor), skills and the
affective domain
Thus, the assessment task, and the associated use of digital technologies, were
designed to promote integration of conceptual and performance-based learning. It
also reflected that the PES course does not prescribe the physical activity contexts
(sports) through which learning will be demonstrated. The task therefore needed to
be adaptable to the varied sporting contexts that schools may choose to utilise in
offering the PES course.
METHOD
The focus of this study was on the use of digital technologies to ‘capture’ performance
on practical tasks for the purpose of high-stakes summative assessment. The purpose
was to explore this potential so that such performances could be included to a
greater extent in the assessment of senior secondary courses, in order to increase the
authenticity of the assessment in these courses. The study involved case studies for
the four courses. During the three years, a total of at least 82 teachers and
1015 students were involved. The number of students involved in each case study ranged
from 2 to 45. Therefore, caution needs to be taken in interpreting the analysis and
generalising from the results.
Four different fundamental forms of assessment (reflective portfolios, extended
production exams, performance tasks exams, and oral presentations) were
investigated in 81 cases with students from the four courses and with the assessment
task being different in each course. For each course there was a common assessment
task that consisted of a number of sub-tasks. For each case a variety of quantitative
and qualitative data was collected from the students and teachers involved, including
digital representations of the students’ work on the assessment tasks, surveys and
interviews. These data were analysed and used to address the research questions
within a feasibility framework consisting of four dimensions: Manageability (Can
the assessment task be reasonably managed in a typical school?), Technical (Can
existing technologies be adapted for assessment purposes?), Functional (Can the
assessment be marked reliably and validly when compared to traditional forms of
assessment?), and Pedagogic (Does a digital form of assessment support and enrich
students’ learning experiences?).
The evidence of performance generated from the digital assessment tasks was
marked independently by two external assessors using an analytical standards-referenced
method. This method used detailed sets of criteria, represented as rubrics, linked
to the assessment task and to appropriate course content and outcomes. Correlations
were determined for comparison purposes between the two external assessors and also
between the assessors and the classroom teacher. Additionally, the work was marked
using the method of comparative pairs and these results were again compared against
the results from the other forms of marking. This method of marking involved a panel
of between 5 and 20 assessors and was based on Rasch dichotomous modelling.
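As an illustration of the general idea underlying comparative pairs marking, and not of the particular software or data used in the project, the following minimal sketch estimates a relative quality parameter for each piece of work from a set of pairwise judgements using a Rasch-style (Bradley-Terry) pairwise model, in which the probability that work i is preferred to work j is exp(b_i - b_j)/(1 + exp(b_i - b_j)); the scripts, judgements and learning rate below are invented for illustration.

    # Minimal sketch: estimating script quality from comparative pairs judgements
    # with a Rasch-style (Bradley-Terry) pairwise model. All data are hypothetical.
    from math import exp

    # Each tuple is one assessor judgement: (preferred script, other script).
    judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"),
                  ("C", "B"), ("B", "A"), ("A", "C"), ("B", "C")]

    scripts = sorted({s for pair in judgements for s in pair})
    b = {s: 0.0 for s in scripts}  # quality parameter for each script

    for _ in range(500):  # simple gradient ascent on the log-likelihood
        grad = {s: 0.0 for s in scripts}
        for winner, loser in judgements:
            p_win = 1.0 / (1.0 + exp(b[loser] - b[winner]))  # modelled chance the winner is preferred
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for s in scripts:
            b[s] += 0.05 * grad[s]
        mean = sum(b.values()) / len(b)  # centre the scale: only differences matter
        b = {s: v - mean for s, v in b.items()}

    # Scripts ordered from strongest to weakest on the estimated scale.
    print(sorted(b.items(), key=lambda kv: -kv[1]))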
RECOMMENDATIONS
The following recommendations are made with regard to the general application of
findings of the study.
Methods of Marking
– The comparative pairs method of marking typically generates more reliable scores
than analytical marking, but is probably only valid when the assessment task
is fundamentally holistic (i.e., not made up of many required sub-tasks) with
a minimum of scaffolding (typically some is required to ensure students leave
enough information to assess). Where there are a number of components, these
would need to be considered separately if using a pairs comparison method, as
a holistic decision may not give appropriate proportionate weighting to the
components. This is not an issue in analytical marking, as weights are easily
applied to the various components via the marking key (a sketch of this weighted
aggregation follows this list).
– Analytical standards-referenced marking may be used to generate reliable sets of
scores for the range of digital forms of assessment tried, provided that criteria are
developed specifically for the assessment task, assessors agree on an interpretation
of the criteria, the values that can be awarded are limited to as small a range as
possible tied to specific descriptors, and that descriptors only pertain to likely
levels of demonstration of the criteria by the target students.
– It is desirable to implement either method of marking using online tools connected
to a digital repository of student work. Assessors have few problems in accessing
and recording scores whether they are resident locally, interstate or internationally.
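As flagged in the first recommendation above, the following minimal sketch shows one way marking-key weights might be applied to analytical component scores; the components, maxima, weights and scores are hypothetical and are not taken from the project's marking keys.

    # Minimal sketch: combining analytical component scores into a weighted mark.
    # The components, maxima and weights are hypothetical examples only.
    marking_key = {
        "design":     (10, 0.30),  # (maximum raw score, weight in the final mark)
        "production": (20, 0.50),
        "reflection": (10, 0.20),
    }

    def weighted_mark(scores, key, total=100):
        # Scale each component to its weight and sum to a mark out of `total`.
        mark = 0.0
        for component, raw in scores.items():
            maximum, weight = key[component]
            mark += (raw / maximum) * weight * total
        return round(mark, 1)

    student_scores = {"design": 7, "production": 15, "reflection": 9}
    print(weighted_mark(student_scores, marking_key))  # 76.5 for these hypothetical scores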
Digital Forms of Assessment
– In typical WA secondary schools it would be possible to implement, for most
courses, the range of digital forms of assessment tried, even for those that don’t
typically operate in an environment with ICT available, using local workstations
(desktop or portable), with local storage, about 10% of workstations spare, and a
school IT support person on-call.
– If an assessment task is implemented using online technologies, then in many
typical WA secondary schools networks may not be adequate depending on the
technologies used (e.g., Flash, Java) or the bandwidth required. Therefore each
site would need testing under realistic conditions (e.g., normal school day with
number of students accessing concurrently). Further, all data should be stored
locally as a backup to the online storage (could upload at the end of the task).
– Students are highly amenable to digital forms of assessment, even those with
less than average levels of ICT skill, generally preferring them to paper-based
forms provided that they have confidence in the hardware and some experience
with the software and form of assessment. Almost all students are able to quickly
learn how to use simple types of software required in assessment (e.g., Paint for
drawing diagrams, digital audio recording).
– Teachers are amenable to digital forms of assessment provided that benefits to
students are clear, implementation is relatively simple, and any software is easy
to learn.
– Experienced teachers and graduate students can be trained to implement digital
forms of assessment.
– Undergraduate students in IT related courses can be trained to prepare the digital
materials resulting from an assessment, ready for online access by assessors.
– Commercial online assessment systems such as MAPS/e-scape and Willock may
be successfully used in WA schools, but are limited in their effectiveness by the
infrastructure available in each school and by the design of those systems.
CHAPTER 2
PAUL NEWHOUSE
LITERATURE REVIEW AND CONCEPTUAL
FRAMEWORK
The aim of the research was to investigate the feasibility of using digital technologies
to support performance assessment. As such the study connects with two main fields
of research: performance assessment, and computer-supported assessment. However,
clearly these are subsumed within the general field of assessment. While it will be
assumed that the basic constructs within the field of assessment are known and apply,
perhaps it is useful to be reminded of this through a definition of assessment from
Joughin (2009) and a statement of three pillars that Barrett (2005) suggests provide
the foundation for every assessment.
To assess is to make judgements about students’ work, inferring from this what
they have the capacity to do in the assessed domain, and thus what they know,
value, or are capable of doing. (Joughin, 2009, p. 16)
– A model of how students represent knowledge and develop competence in a
content domain.
– Tasks or situations that allow one to observe students’ performance.
– An interpretation method for drawing inferences from performance evidence.
(Barrett, 2005)
PERFORMANCE ASSESSMENT
Research in, and the call to investigate “performance-and-product assessment” is
not new as pointed out by Messick (1994, p. 14), tracing back at least to the 1960s.
However, Messick claims that mainstream schooling showed little interest in this
until an “upsurge of renewed interest” in the 1990s with “positive consequences
for teaching and learning” (p. 13). While Messick does not specifically address
digital forms of performance assessment, his arguments for the need to address
“issues of validity, reliability, comparability and fairness” apply, particularly
to a range of validity criteria. He argues they are social values that require close
attention to the intended and unintended consequences of the assessment through
considerations of the purposes of the assessment, the nature of the assessed domain,
and “construct theories of pertinent skills and knowledge” (p. 14). For example,
he outlines situations under which product assessment should be considered rather
than performance assessment. The issue is their relationship to replicability and
generalisability requirements because these are important when performance is the
“vehicle” of assessment.
Lane (2004) claims that in the USA there has been a decline in the use of
performance assessments due to increased accountability requirements and resource
constraints. She outlines how this has led to a lack of alignment between assessment,
curriculum standards, and instructional practices; particularly with regard to eliciting
complex cognitive thinking. Dochy (2009) calls for new assessment modes that are
characterised by students constructing knowledge, the application of this knowledge
to real life problems, the use of multiple perspectives and context sensitivity, the
active participation of students, and the integration of assessment and the learning
environment. At the same time Pollitt (2004) argues that current methods of summative
assessment that focus on summing scores on “micro-judgements” are “dangerous and
that several harmful consequences are likely to follow” (p. 5). Further, he argues that
it is unlikely that such a process will accurately measure a student’s “performance
or ability” (p. 5), and more holistic judgements of performance are required. Koretz
(1998) analysed the outcomes of four large-scale portfolio assessment systems in the
USA school systems and concluded that overall the programmes varied in reliability
and were resource intensive with “problematic” (p. 309) manageability. This body
of literature clearly presents the assessment of student performance as critically
important but fundamentally difficult with many unanswered questions requiring
research.
Globally interest in performance assessment has increased with the increasing use
of standards-referenced curricula. Standards-referenced curricula have evolved over
the past 20 years, particularly from the UK and more recently in Australian states since
the early 1990s. The key concept in these curricula was that student achievement
was defined in terms of statements describing what students understood, believed or
could do. The term standards-referenced has tended to be used recently to indicate
that student achievement is measured against defined standards. This has reinforced
the need for clear alignment between intended curriculum outcomes and pedagogy,
and assessment (Taylor, 2005). Alignment has typically been poor, particularly in
areas where some form of practical performance is intended.
Koretz (1998), who defines portfolio assessment as the evaluation of performance
by means of a cumulative collection of student work, has figured prominently in
USA debate about education reform. He analysed the outcomes of four large-scale portfolio assessment systems in the USA school systems, in particular, in
terms of their reliability. Each example involved marking student portfolios for the
purpose of comparing students and/or schools across a state, mainly in English and
Mathematics. All of the examples occurred in the 1990s and none involved digital
representations of performance. Koretz concluded that overall the programmes were
resource intensive and did not produce “evidence that the resulting scores provide a
valid basis for the specific inferences users base on them…” (p. 332). Even though
he noted that significant improvements in the implementation and reliable marking
of portfolios had been achieved, at that time he saw portfolio-based assessment as
“problematic” (p. 309). Findings such as this provide a rationale for considering
digital solutions to performance assessment.
Apart from the lack of validity of traditional paper-based assessment methods,
another compelling rationale to consider the efficacy of performance assessment
is that teachers tend to teach to the summative assessment (Dochy, 2009; Lane,
2004; Ridgway, McCusker, & Pead, 2006). McGaw (2006) discussed this in the
light of changes in the needs of society, advances in psychometric methods, and
improvements in digital technologies and believed that there is a “risk that excessive
attention will be given to those aspects of the curriculum that are assessed” and
that “risk-taking is likely to be suppressed” (p. 2). This leads to what Dochy (2009)
refers to as a deprofessionalization of teachers. Further, summative assessment
tends to drive learning with students “adapting their approaches to learning to meet
assessment requirements” (Joughin, 2009, p. 16). Joughin goes on to discuss how
assessment determines what the actual curriculum is as opposed to the intended
curriculum, the inference being that if the intended curriculum is to be implemented
then assessment needs to align with and reinforce it. Worse than this, he explains
how assessment will determine the extent to which students adopt deep approaches
to learning as opposed to surface approaches.
A concern underpinning the argument for computer-based assessment methods
to replace traditional paper-and-pencil methods was presented by the American
National Academy of Sciences (Garmire & Pearson, 2006). They argue that assessing
many performance dimensions is too difficult on paper and too expensive using
“hands-on laboratory exercises” (p. 161) while computer-based assessment has the
potential to increase “flexibility, authenticity, efficiency, and accuracy” but must
be subject to “defensible standards” (p. 162) such as the Standards for Educational
and Psychological Testing (American Educational Research Association, American
Psychological Association, & National Council on Measurement in Education, 1999). The committee cites the use of
computer-based adaptive testing, simulations, computer-based games, electronic
portfolios, and electronic questionnaires as having potential in the assessment of
technological literacy (2006). They concluded that computer-based simulations were
suitable but could be expensive. They also raised a number of questions requiring
research, but noted that electronic portfolios “appear to be excellent tools for documenting and
exploring the process of technological design” (p. 170).
McGaw (2006) also believes that without change to the main high-stakes
assessment strategies currently employed there is a reduced likelihood that
productive use will be made of formative assessment. He is not alone in this
concern, for example, Ridgway et al. (2006, p. 39) states that, “There is a danger that
considerations of cost and ease of assessment will lead to the introduction of ‘cheap’
assessment systems which prove to be very expensive in terms of the damage they
do to students’ educational experiences.” Therefore, from both a consideration of
the need to improve the validity of the assessment of student practical performance,
and the likely negative impact on teaching (through not adequately assessing this
performance using ill-structured tasks) there is a strong rationale for exploring
alternative methods of assessment (Dochy, 2009). However, any approach or
strategy will not be perfect and will require compromises and consideration of the
following questions:
1. What skills or knowledge are best demonstrated through practical performance?
2. What are the critical components of that practical performance?
3. Why can’t those components be demonstrated on paper?
4. What alternative representations other than paper could be used?
5. What level of compromise in reliability, authentication and cost is acceptable in
preference to not assessing the performance at all?
COMPUTER-SUPPORTED ASSESSMENT
Computer-Supported Assessment, sometimes referred to as Computer-Assisted
Assessment, is a broad term encompassing a range of applications from the use
of computers to conduct the whole assessment process, such as with on-screen
testing, to only assisting in one aspect of the task assessment process (e.g., recording
performance or marking) (Bull & Sharp, 2000b). The first area of the task assessment
process that took advantage of computer support was objective-type assessments
that automated the marking process (eliminating the marker) and allowed the results
to be instantly available. Bull and Sharp (2000a) found that the use of computers to
support assessment has many advantages for the assessment process, assessors and
students.
Much of the published research in the field of computer-supported assessment
relates to higher education, particularly in university settings (e.g., Brewer, 2004), with
little specific to school-based education. However, in the school sector assessment of
student creative work in the arts has been addressed for some time with, for example,
Madeja (2004) arguing the case for alternatives to paper-and-pencil testing for the
arts. Further, there has been some research into the use of portfolios for assessment
but most often this is for physical, not digital, portfolios. There has been a limited
amount of research in the area in Australia; typically these have been small-scale
trials in the use of IT to support assessment processes (e.g., Newhouse, 2005). There
have also been reports on the use of online testing in Australia, such as by MacCann
(2006), but these usually do not involve assessing practical performance and merely
replicate paper-and-pen tests in an online environment.
While there has been only limited empirical research into many areas of computer-supported assessment, there are many useful theoretical discussions of the issues
such as Spector’s (2006) outline of a method for assessing learning in “complex and
ill-structured task domains”. While providing useful ideas and rationales these ideas
remain largely untested in the reality of classrooms. What is known is that any use of
ICT involves school change (Lim & Hung, 2003; Newhouse, Clarkson, & Trinidad,
2005) and will require training of teachers, changes in thinking, and pedagogical
understandings that are difficult to take on, even for younger teachers (Newhouse,
Williams, & Pearson, 2006).
There has been increasing interest internationally in the application of computer
support to improve assessment as indicated in the focus of a recent keynote address
by McGaw (2006). The University of Cambridge Local Examinations Syndicate is
conducting over 20 projects to explore the impact of new technologies on assessment
including using online simulations in assessing secondary science investigation skills
(Harding, 2006). Other organisations (e.g., Becta, 2006) or groups of researchers
(e.g., Ridgway et al., 2006) have reported on exploratory projects, particularly the
increasing use of online testing, although rarely for high-stakes assessment and not
without some difficulty (Horkay, Bennett, Allen, Kaplan, & Yan, 2006).
The British Psychological Society has produced a set of guidelines for Computer-Based Assessment. While they mainly focus on online testing, they provide a
conceptual model that includes Assessment Generation, Assessment Delivery,
Assessment Scoring and Interpretation, and Storage, Retrieval and Transmission.
The latter two were relevant to the present study, along with the guidelines for
developers and users.
Recently the Joint Research Centre for the European Commission (Scheuermann
& Björnsson, 2009) brought out a major report titled The Transition to Computer-Based Assessment. Kozma (2009) lays out the rationale in terms of a mismatch
between what is needed in modern society and what is addressed and thus assessed
at school. In particular he draws attention to the differences between standardized
pen-and-paper assessment and “Tasks in the Outside World”. In the latter he explains
how tasks: require cross-discipline knowledge; relate to complex ill-structured
problems; and are completed collaboratively using a wide range of technological
tools to meet needs and standards. These characteristics are at odds with traditional
approaches to assessment. While he does not see assessment reform as only requiring
the use of ICT, he outlines a number of significant advantages including: reduced
costs; increased adaptability to individuals; opportunity to collect process data
on student performance; the provision of tools integral to modern practice; and
better feedback data. Kozma does introduce a number of challenges to using ICT
to support assessment including: start-up costs for systems; the need to choose
between standardized and ‘native’ applications; the need to integrate applications
and systems; the need to choose between ‘stand-alone’ and online implementation;
the need for security of data; the need for tools to make the design of tasks easy
and efficient; and the lack of knowledge and examples of high-quality assessments
supported by ICT. He also highlights methodological challenges including: the
extent of equivalence with pen-and-paper; the design of appropriate complex tasks;
making efficient and reliable high-level professional judgements; scoring students’
processes and strategies; and distinguishing individual contributions to collaborative
work.
A recent research initiative of Cisco, Intel and Microsoft (Cisco, Intel, &
Microsoft, 2009) is the Assessment and Teaching of 21st Century Skills project. The
paper, which was a call to action, clearly argues that changes are required to high-stakes
assessments before needed change will occur in schools.
Reform is particularly needed in education assessment-how it is that education
and society more generally measure the competencies and skills that are needed
for productive, creative workers and citizens. Accountability is an important
component of education reform. But more often than not, accountability efforts
have measured what is easiest to measure, rather than what is most important.
Existing models of assessment typically fail to measure the skills, knowledge,
attitudes and characteristics of self-directed and collaborative learning that are
increasingly important for our global economy and fast changing world. New
assessments are required that measure these skills and provide information
needed by students, teachers, parents, administrators, and policymakers to
improve learning and support systemic education reform. To measure these
skills and provide the needed information, assessments should engage students
in the use of technological tools and digital resources and the application of a
deep understanding of subject knowledge to solve complex, real world tasks
and create new ideas, content, and knowledge. (Cisco et al., 2009, p. 1)
Ripley (2009) defines e-assessment as “the use of technology to digitise, make
more efficient, redesign or transform assessments and tests; assessment includes
the requirements of school, higher education and professional examinations,
qualifications, certifications and school tests, classroom assessment and assessment
for learning; the focus of e-assessment might be any of the participants with the
assessment processes – the learners, the teachers and tutors, managers, assessment and
test providers and examiners”. He presents two ‘drivers’ of e-assessment:
business efficiency and educational transformation. The former leads to migratory
strategies (i.e. replicating traditional assessment in digital form) while the latter leads
to transformational strategies that change the form and design of assessment. An
example of the latter is the recent ICT skills test conducted with 14-year-olds in the
UK in which students completed authentic tasks within a simulated ICT environment.
He raises issues that need to be addressed including: providing accessibility to all
students; the need to maintain standards over time; the use of robust, comprehensible
and publicly acceptable means of scoring students’ work; describing the new skill
domains; overcoming technological perceptions of stakeholders (e.g., unreliability
of IT systems); and responding to the conceptions of stakeholders about assessment.
Lesgold (2009) calls into question the existence of a shared understanding
among the American public on what is wanted out of schools and how this may
have changed with changes in society. He argues that this must go along with changes to
assessment to include 21st century skills, and this will not be served by the traditional
standard approach to testing based on responses to small items that minimises the
need for human judgement in marking. Instead students will need to respond to
tasks representing complex performances, supported by appropriate tools with the
results needing to be judged by experts. He recognises the issues that this would
throw up and provides ‘stealth assessment’ as an example solution. In this example
students complete a portfolio of performance at school over time, supervised by
the teacher. The testing system then selects one or two “additional performances” to
be externally supervised “as a confirmation that the original set was not done with
inappropriate coaching” (p. 20). This is more amenable to ‘learning by doing’ and
project-based learning where bigger, more realistic tasks can be accomplished that
develop attributes such as persistence.
At this stage it is likely that a minority of teachers provide students with
experiences in using ICT to support any forms of assessment. For example, in a
survey reported by Becta (2010) in Messages from the Evidence: Assessment using
Technology, it was found that at best 4 out of 10 teachers reported using ICT to
‘create or administer assessment’. This lack of experience for students and teachers
is likely to be a constraint in using ICT to support summative assessment, particularly
where the stakes are high.
DIGITAL FORMS OF PERFORMANCE ASSESSMENT
Many educational researchers argue that traditional assessment fails to assess
learning processes and higher-order thinking skills, and go on to explain how digital
technologies may address this problem (Lane, 2004; Lin & Dwyer, 2006). This
argument centres around the validity of the assessment in terms of the intended
learning outcomes, where there is a need to improve the criterion-related validity,
construct validity and consequential validity of high-stakes assessment (McGaw,
2006). Further, in some school courses students learn with technologies and this
dictates that students should be assessed making use of those technologies. Dede
(2003) suggests that traditionally educational assessment has been “based on
mandating performance without providing appropriate resources, then using a ‘drive
by’ summative test to determine achievement” (p. 6). He goes on to explain how
digital technologies may address this problem and claims that “the fundamental
barriers to employing these technologies effectively for learning are not technical
or economic, but psychological, organizational, political and cultural” (p. 9). Taylor
(2005) optimistically suggests that, “as technology becomes an integral component
of what and how students learn, its use as an essential tool for student assessment is
inevitable” (p. 9).
Lin and Dwyer (2006) argue that to date computer technology has really only
been used substantially in assessment to automate routine procedures such as for
multiple-choice tests and collating marks. They suggest that the focus should be
on capturing “more complex performances” (p. 29) that assess a learner’s higher-order skills (decision-making, reflection, reasoning and problem solving) and cite
examples such as the use of simulations and the SMART (Special Multimedia Areas
for Refining Thinking) model but suggest that this is seldom done due to “technical
complexity and logistical problems” (p. 28). A recent review of assessment methods
in medical education (Norcini & McKinley, 2007) outlines performance-based
assessment of clinical, communications and professional skills using observations,
recordings and computer-based simulations.
Design and Development of Digital Assessments
A major aim of the study was to increase the validity of assessment using a variety
of forms of assessment supported by digital technologies. Clearly the design of the
tasks for the form of assessment was critically important. Dochy (2009, p. 105)
discusses the manner in which “new assessment modes” may improve the validity
of the tasks, the scoring, generalisability, and consequential validity. He explains
that initially construct validity “judges how well assessment matches the content
and cognitive specifications of the construct being measured”. In the study this was
achieved using course teams, a situation analysis, and seeking the perceptions of
teachers and students. If this is done then he claims the authenticity and “complex
problem characteristics” of the task improves its validity. Secondly he explains
that criteria to judge student performances need to be fair and allow demonstration
of ability. In the study this was addressed through the use of standards-referenced
analytical marking and holistic comparative pairs marking, and through correlation
analyses between methods or marking and teacher generated scores. Thirdly, he
explains how generalisability can be improved through greater authenticity through
a consideration of reliability. In the study this was addressed through a combination
of Rasch model analysis, and inter and intra-rater correlation analysis. Finally,
Dochy discusses potential intended and unintended consequences of new forms of
assessment such as improvements in teaching methods, higher performances and
increased feelings of ownership and motivation.
For the purposes of the study four particular forms of assessment were defined
that employed digital technologies to represent the output of student performance.
These forms were an Oral Presentation/Interview, an Extended Production Exam, a
Focussed Performance Tasks Exam and a Reflective Digital Portfolio; they were not
intended to provide an exhaustive list but rather to define major forms that appeared
to be relevant to the courses involved in the study. Sadler (2009) and Dochy (2009)
provide longer lists of common forms appropriate for the assessment of complex
learning.
A Focussed Performance Tasks Exam was considered to be the completion,
under ‘exam conditions’, of a range of practical tasks that are not necessarily
logically connected and typically focus on the demonstration of practical skills.
However, in reality the Exams created in the study for the AIT course provided
some connection between the tasks and associated these with a scenario. Thus they had
characteristics of an Extended Production Exam but without incorporating a full
set of processes due to time constraints. The most comprehensive example of this
type of assessment is that of Kimbell et al. (2007) in the UK where students spent
two consecutive mornings of three hours duration each working on a structured
design activity for the production of a pill dispenser. All student work output was
collected digitally using a networked Personal Digital Assistant (PDA) device and
local server.
A Reflective Process Digital Portfolio was considered to be a collection of digital
artefacts of work output with some reflective commentary (journaling) by the student,
organised according to specified parameters such as form, structure, and range of
samples required. There are many types of digital portfolios used internationally
(Taylor, 2005). For this study the portfolios were repositories of previous work output, annotated by the student to explain the inclusion of the artefact and describe
its characteristics relevant to assessment criteria. In a review of e-assessment the
digital portfolio is recommended as a “way forward” in the high-stakes assessment
of “practical” work in that ICT “provides an opportunity to introduce manageable,
high quality coursework as part of the summative assessment process” (Ridgway et
al., 2006). Three uses of portfolios are suggested, one of which is “to provide a
stimulus for reflective activity”. Thus, the use of portfolios is not new, particularly
in areas such as the visual arts and design and technology but typically these have
been paper-based (Garmire & Pearson, 2006). The exercise of assembling a portfolio
is often seen as much a “learning tool” as an “assessment tool”, but the results
are typically limited by physical storage space and methods of access (Garmire &
Pearson, 2006).
An Extended Production Exam was considered to be the completion, under
‘exam conditions’, of one practical assessment task that incorporated a full set
of processes (e.g., design process, scientific investigation) and centred on one
major scenario. Examples were found locally, nationally and internationally of
performance on practical tasks being assessed through an extended production,
or small project, under exam conditions. However, most did not involve the use
of digital technologies. The most comprehensive example was that of Kimbell
et al. (2007) in the UK where students spent two consecutive mornings of three
hours duration each working on a structured design activity for the production of
a pill dispenser. All student work output was collected digitally using a networked
PDA device and local server. In WA the final Drama assessment has involved a
short individual ‘performance’, that is, face-to-face assessment, which is also usually
videotaped, although again the recording is not typically assessed in a digital form. On a
number of occasions over the past decade, samples of Year 3, 5, 7 and 9 students
have been assessed in the Monitoring Standards of Education (MSE) programme
that has involved completing a short (2 hours in two parts) design brief including
prototype production.
An Oral presentation/interview was considered to be an audio or video recording of
a student interview or oral presentation, made under controlled circumstances and
following a pre-determined script of prompts and/or questions. Clearly the quality of
the audio recording is critical, so it is likely to require the use of a radio microphone
attached to the student or placed directly in front of the student.
Digital Representations of Assessment Tasks
In order to judge student performance, that performance needs either to be viewed
directly or to be represented in some form. This may involve the assessor viewing a student
performing, such as in a musical recital, or viewing the results of a student performing,
such as in an art exhibition. Most often the latter occurs because this is either more
appropriate or more cost-effective. In places such as WA the inclusion of either type
of assessment for high-stakes purposes has been rare due to the costs and logistics
involved. For example, student performance in conducting science experiments has
not been included because of the difficulty in supervising the students and viewing
their work, and production in design and technology, or home economics related
areas, has not been included because the products are bulky and therefore difficult to
access by assessors. However, many forms of student performance can be recorded
in digital representations using video, audio, photographic or scanned documents,
and some student work is created in digital format using computer software. In
these cases the representations of student work can be made available to assessors
relatively easily using digital repositories and computer networks.
As in most areas of education, and particularly for assessment, authorities and/
or researchers in many localities have developed guidelines for the use of digital
technologies with assessment processes. For example, the British Psychological
Society published a set of general guidelines for the use of “Computer-Based
Assessments” through its Psychological Testing Centre (The British Psychological
Society, 2002). These guidelines include the use of digital technologies in Assessment
Generation, Assessment Delivery, Assessment Scoring and Interpretation, Storage,
Retrieval and Transmission. These guidelines are defined from a developer and user
perspective. Similarly, The Council of the International Test Commission developed
international guidelines for good practice in computer-based and Internet delivered
testing (The Council of the International Test Commission, 2005). These were
focussed on four issues: the technology, the quality of the testing, the control of
the test environment, and the security of the testing. The contexts considered all
involved students sitting at a computer to complete a test.
Irrespective of whether digital technologies are used, the quality of the assessment
task itself is vital and therefore the design of digital forms of assessment needs to
start with the task itself. Boud (2009) suggests ten principles pertinent to a ‘practice’
perspective of assessment; these provide a valuable backdrop to this project, although
some are less applicable to purely summative assessment.
1. Locating assessment tasks in authentic contexts.
2. Establishing holistic tasks.
3. Focusing on the processes required for a task.
4. Learning from the task.
5. Having consciousness of the need for refining the judgements of students.
6. Involving others in assessment activities.
7. Using standards appropriate to the task.
8. Linking activities from different courses.
9. Acknowledging student agency.
10. Building an awareness of co-production.
The first three and the seventh and ninth guided the development of the assessment
tasks in all four courses in the project. There was an attempt to ensure the fourth
and fifth were incorporated in tasks in all courses and to some extent the sixth and
tenth were represented in the PES and Engineering tasks. Dochy (2009) presents
five characteristics of new assessment tasks: students construct knowledge; the
application of knowledge; multiple perspectives and context sensitivity; the active
involvement of students; and integration with the learning process.
All assessment items are required to be valid, educative, explicit, fair and
comprehensive, and should allow for reliable marking. The descriptions of the digital
assessment tasks below assume this but focus on any areas that present a particular
challenge to that assessment type.
Guidelines Specific to Computer-Based Exams
Computer-based exams involve students sitting at computer workstations completing
tasks, including typing answers to questions. They may be required to use various
pieces of software to create digital products or may simply use a browser to complete
response-type assessments. In AIT, while both types of assessment activities may be
involved, it is likely that the focus would be on the former. Taylor (2005)
discusses three delivery methods: stand-alone, LAN, and web-based. Both the stand-alone
model (using USB flash drives) and the web-based model were considered suitable in AIT.
The International Test Commission has provided detailed guidelines for computer-based exams (The Council of the International Test Commission, 2005). These
guidelines were specific to test developers, test publishers and users and mainly
related to response type assessment. An array of specific guidelines was presented
according to the following structure.
1. Give due regard to technological issues in Computer Based Testing (CBT) and
Internet testing
a. Give consideration to hardware and software requirements
b. Take account of the robustness of the CBT/Internet test
c. Consider human factor issues in the presentation of material via computer or
the Internet
d. Consider reasonable adjustments to the technical features of the test for those
with disabilities
e. Provide help, information, and practice items within the CBT/Internet test
2. Attend to quality issues in CBT and Internet testing
a. Ensure knowledge, competence and appropriate use of CBT/Internet testing
b. Consider the psychometric qualities of the CBT/Internet test
c. Where the CBT/Internet test has been developed from a paper and pencil
version, ensure that there is evidence of equivalence
d. Score and analyse CBT/Internet testing results accurately
e. Interpret results appropriately and provide appropriate feedback
f. Consider equality of access for all groups
3. Provide appropriate levels of control over CBT and Internet testing
a. Detail the level of control over the test conditions
b. Detail the appropriate control over the supervision of the testing
c. Give due consideration to controlling prior practice and item exposure
d. Give consideration to control over test-taker’s authenticity and cheating
4. Make appropriate provision for security and safeguarding privacy in CBT and
Internet testing
a. Take account of the security of test materials
b. Consider the security of test-taker’s data transferred over the Internet
c. Maintain the confidentiality of test-taker results
Clearly many of the guidelines apply generally to any test-taking context (e.g., 2d,
2e and 2f), whether on computer or not. Many of the other guidelines were not
applicable to the current project (e.g., 4a, b and c) because only single classes and
their teachers in particular schools were involved. However, many of the guidelines
in the first three areas were relevant to one or more of the cases in the project. For
example, some of the guidelines associated with 1a, 1b, 2a and 2b were relevant, and
to some extent some guidelines associated with 3a, 3b and 3d were relevant. Even so,
they were mainly relevant to the implementation of large-scale online testing.
More recently there has been increased international interest in computer-based
testing to assess ICT capability that is more relevant to the AIT course. For example,
over the past year an international research project, the Assessment and Teaching
of 21st Century Skills project, has commenced, supported by Cisco, Intel and
Microsoft. There have also been trials of such tests in a number of countries including
the UK, Norway, Denmark, the USA and Australia (MCEETYA, 2007). In Australia the
ACER used a computer-based test to assess the ICT literacy of Year 6 and 10 students.
They developed the test around a simulated ICT environment and implemented the
test using sets of networked laptop computers. While they successfully implemented
the test with over 7000 students, this was over a long period of time and would not
be scalable for an AIT examination. Also the use of a simulated environment would
be expensive and not scalable to provide a great enough variety of activities each
year. The trial in the UK also involved a multi-million pound simulated system but
was accessed by students through their school computers. In the Norwegian example
students used their own government-provided notebook computers. In the USA a
decision has been made to include an ICT literacy test in national testing in 2012,
but a number of states already have such tests.
Performance tasks and Production exams are not necessarily computer-based. It
is generally recommended that the tasks be clearly defined and limited, the work
environment be narrowly prescribed (e.g., access to prescribed information or
tools), and the required work output be well defined. The areas of concern are:
ensuring that the tasks are fair to all students in terms of access to information, materials
and tools; that they are valid in assessing what is intended; and that they provide for reliable
marking given the usually varied types of student work output. Therefore it is often
recommended that the assessment task be well bounded, the work environment be
limited (e.g., access to a limited set of information or tools), the time available
be controlled, student work be invigilated, and the required work output be well
defined.
Guidelines Specific to Digital Portfolios
The main concerns with the use of digital portfolios for assessment are:
– The authentication of student work given the period of time within which work
is completed
– Ensuring that they are fair to all students in terms of access to information,
materials and tools
– That they can be marked reliably given the usually varied types of student work
output.
Therefore it is often recommended that the portfolio require a particular structure
and limit the contents in type and size, the time available be controlled, and the work
be authenticated by a teacher and the students. In a review of e-assessment it was
suggested that a digital portfolio may involve three sections: student self-awareness;
student interaction; and thinking about futures and informed decisions (Ridgway et
al., 2006). In British Columbia, Canada, students complete a graduation portfolio.
They are provided with a number of guides as Word documents that act as templates
to construct their portfolios.
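To make the structural and size constraints described above concrete, the following minimal sketch shows one hypothetical way a digital portfolio manifest could be specified and checked. The field names, permitted file types and limits are illustrative assumptions, not a specification from the study or from the British Columbia programme.

# A hypothetical portfolio manifest checker: each artefact carries a reflective
# annotation, and the whole portfolio is limited in count, file type and size.
from dataclasses import dataclass

ALLOWED_TYPES = {"pdf", "png", "jpg", "mp4", "mp3"}   # assumed permitted formats
MAX_ARTEFACTS = 8                                      # assumed limit on samples
MAX_TOTAL_MB = 200                                     # assumed total size limit

@dataclass
class Artefact:
    filename: str
    file_type: str
    size_mb: float
    student_annotation: str   # reflective commentary explaining the inclusion

def validate_portfolio(artefacts: list[Artefact]) -> list[str]:
    """Return a list of problems; an empty list means the portfolio conforms."""
    problems = []
    if len(artefacts) > MAX_ARTEFACTS:
        problems.append(f"Too many artefacts ({len(artefacts)} > {MAX_ARTEFACTS})")
    if sum(a.size_mb for a in artefacts) > MAX_TOTAL_MB:
        problems.append("Total size exceeds the limit")
    for a in artefacts:
        if a.file_type not in ALLOWED_TYPES:
            problems.append(f"{a.filename}: file type '{a.file_type}' not permitted")
        if not a.student_annotation.strip():
            problems.append(f"{a.filename}: missing reflective annotation")
    return problems

problems = validate_portfolio([
    Artefact("design_brief.pdf", "pdf", 1.2, "Shows my initial design thinking."),
    Artefact("final_product.mp4", "mp4", 85.0, "Demonstrates the finished prototype."),
])
print(problems or "Portfolio conforms to the specified structure")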
Carney (2004) developed a set of critical dimensions of variation for digital
portfolios:
1. Purpose(s) of the portfolio;
2. Control (who determines what goes into the portfolio and the degree to which this
is specified);
3. Mode of presentation (portfolio organisation and format; the technology chosen
for authoring);
4. Social Interaction (the nature and quality of the social interaction throughout the
portfolio process);
5. Involvement (Zeichner & Wray identify the degree of involvement by the cooperating
teacher as important for preservice portfolios; when considered more broadly, other
important portfolio participants might include university teachers, P-12 students
and parents, and others); and
6. Use (can range from low-stakes celebration to high-stakes assessment).
The study considered the following suggestions by Barrett (2005):
Identify tasks or situations that allow one to assess students’ knowledge
and skills through both products and performance. Create rubrics that
clearly differentiate levels of proficiency. Create a record keeping system
to keep track of the rubric/evaluation data based on multiple measures/
methods. (p. 10)
She goes on to suggest that for “Portfolios used for Assessment of Learning”, that is,
for summative assessment, the following are defining characteristics:
– Purpose of portfolio prescribed by institution
– Artefacts mandated by institution to determine outcomes of instruction
– Portfolio usually developed at the end of a class, term or program - time limited
– Portfolio and/or artefacts usually “scored” based on a rubric and quantitative data
is collected for external audiences
– Portfolio is usually structured around a set of outcomes, goals or standards
– Requires extrinsic motivation
– Audience: external - little choice
Beetham (n.d.) finds that e-portfolios are “less intimidating for some learners than
a traditional examination” and provide evidence that gives a “much richer picture
of learners’ strengths and achievements than, for example, a test score” (p. 4).
She points to the need for web-based relational database systems to implement
portfolios. While she points out that in the past e-portfolios have been found to
take longer to moderate and mark, this has become more streamlined where it is
part of an “integrated assessment facility”; she provides five commercial examples
of such systems. She also provides a list of “issues relating to the use of e-portfolios
for summative assessment” (p. 5). Seven of the nine issues are technical and most
are addressed by the use of a good assessment management system. The remaining
issues are:
– acceptability and credibility of data authenticated by Awarding Bodies
– designing assessment strategies to make effective use of the new tools and
systems
– ensuring enhanced outcomes for learners, for example, higher motivation, greater
choice over evidence, assessment around capabilities and strengths.
She also raises some issues for teachers and learners:
– fit with existing practices and expectations
– access and ICT capability of teachers and learners
– acceptability and appropriateness of e-portfolio use. (p. 16)
On most of these issues it is easy to argue that, for courses such as AIT, they are
not a concern, since portfolio-style work has been normal practice in school-based
assessment for many years, provided there is a good assessment management system.
Guidelines Specific to Production Exams
Production exams would not necessarily be computer-based, for example, production
exams in design and technology need only be represented digitally through records
of the performance (e.g., video, photograph, scanned document). The areas of
concern with production exams are: ensuring that they are fair to all students in
terms of access to information, materials and tools; that they are valid in assessing
what is intended; and that they provide for reliable marking given the usually varied types
of student work output. Therefore it is often recommended that the assessment task
be well bounded, the work environment be limited (e.g., access to a limited set of
information or tools), the time available be controlled, student work be invigilated,
and the required work output be well defined.
Guidelines Specific to Recorded Interviews or Oral Presentations
The main concerns are for the quality of the audio recording and for the comfort of the
student. Clearly the quality of the audio recording is critical so it is likely to require
the use of a radio microphone attached to the student or directly in front of the student.
If the student is to perform as close as possible to their best then it is important
that the student feels comfortable and confident in the immediate environment. This
could be supported by providing the student with opportunities to practise under
similar conditions and by conducting the recording in an environment that is familiar
and supportive, such as the regular classroom or an interview room at the school.
METHODS OF MARKING
Task assessment is what is commonly referred to as ‘marking’. Once students have
completed the assessment task the output needs to be judged by some method to
determine a score, grade or ranking. Three methods of marking are considered here:
‘traditional’ true score marking, judgements using standards-based frameworks, and
comparative pairs judgements.
Traditionally summative assessment has tended to involve students ‘sitting’
paper-based exams that are scored by allocating a number to items in the exam
and then summing these numbers. This is sometimes called true score marking
or cumulative marking. Pollitt (2004) argues that the current focus in summative
assessment on summing scores on “micro-judgements” is “dangerous and
that several harmful consequences are likely to follow” (p. 5). Further, he argues that
it is unlikely that such a process will accurately measure a student’s “performance or
ability” (p. 5). He claims that this has been tolerated because assessment validity has
been overshadowed by reliability due to the difficulty and expense in addressing the
former compared with the latter.
Standards-referenced (or standards-based) frameworks and rubrics have been used for many
years by teachers in Western Australia and other localities to mark student work
but have less often been used for summative high-stakes marking. This involves
the definition of standards of achievement against which to compare the work of
students. Typically this is operationalised for a particular assessment task through
a rubric that describes these standards according to components of the task. The
results may be represented as a set of levels of achievement or may be combined
by converting these to numbers and adding them. However, using Rasch Modelling
they may be combined to create an interval scale score. This report will refer to this
approach as analytical marking.
Comparative pairs marking involves a number of assessors making judgements
on achievement through comparing each student’s work with that of other students,
considering a pair of students at a time and indicating the better of the two. This is
sometimes referred to as pairwise comparisons or cumulative comparisons.
Sadler (2009) suggests that to assess complex learning requires tasks calling
for divergent responses that require marking based on qualitative judgement.
Such judgements may be facilitated by either a holistic or analytical approach
with the difference being in granularity. He claims there has been a gradual swing
towards analytical approaches in the pursuit of objectivity (i.e., reliability of
measure).
Standards Referenced Analytical Marking
In a report for the Curriculum Council of WA, Prof Jim Tognolini states that
“One of the main advantages of a standards-referenced assessment system is that
the results can indicate what it is students have achieved during the course” and
that “at the same time, use the same scores for university entrance purposes”.
Further he explains that this provides students with “a meaningful record of their
achievements” and this will “facilitate smoother entry through different pathways
into higher education and the workforce”. He points out that all Australian states and
many international systems including the Baccalaureate and PISA have a standards-referenced curriculum. He defines it as “where educational outcomes are clearly and
unambiguously specified” and claims this has “significant power and appeal in more
globalised contexts” providing a “mechanism for tracking and comparing outcomes
over time and across jurisdictions”. In Western Australia this is now sometimes also
referred to as ‘analytical’ marking.
Sadler (2009) explains that there are two analytic assessment schemes: analytical
rating scales and analytic rubrics. The latter was employed in this project. The word
“rubric” is a derivative of the Latin word ruber meaning “red”. In literary history,
rubrics are margin notes in texts giving description, or common examples for, or
about, the passage (Wiggins, 1998). The current research literature on marking keys
promotes the use of criterion or rubric based marking keys to enhance transparency,
reliability and, when the task is aligned with the learning outcomes, also validity
(Andrade, 2005; Tierney & Marielle, 2004). In current usage, a rubric is a guide
listing criteria used for rating performance (Wiggins, 1998).
Marking using a rubric based on a standards framework requires assessors to
compare a student’s work against a set of theoretical standards separated into criteria
(Sadler, 2009). Standards are described using either quantifiers or sub-attributes of
the criterion. Marking using such descriptions is difficult and requires considerable
depth of knowledge and experience and can still result in different assessors judging
the same work differently because they have different standards in mind. This leads
to a problem of reliability that is typically overcome by using more than one assessor
for each piece of work and then having a consensus process. This may be costly and
still somewhat unreliable.
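As a concrete illustration of analytical marking against a standards-referenced rubric, the following minimal sketch represents a rubric as a set of criteria with described standard levels and combines one assessor's level judgements into a raw score. The criteria, level descriptors and the simple summing rule are illustrative assumptions only; as noted above, the levels can instead be combined through Rasch modelling to produce an interval scale score rather than a simple sum.

# A minimal sketch of rubric-based analytical marking: a hypothetical task is
# marked on three criteria, each with standards described at levels 0-2.
rubric = {
    "design_process": {
        0: "No evidence of a design process",
        1: "Partial design process with some justification",
        2: "Coherent, fully justified design process",
    },
    "technical_skill": {
        0: "Tools used incorrectly or not at all",
        1: "Tools used correctly for routine operations",
        2: "Tools used fluently, including advanced features",
    },
    "communication": {
        0: "Output cannot be interpreted by an audience",
        1: "Output communicates the main ideas",
        2: "Output communicates clearly and persuasively",
    },
}

def analytical_score(judgements: dict[str, int]) -> int:
    """Combine the level awarded on each criterion into a raw analytical score."""
    total = 0
    for criterion, level in judgements.items():
        if level not in rubric[criterion]:
            raise ValueError(f"{criterion}: level {level} is not described in the rubric")
        total += level
    return total

# One assessor's judgements for one piece of student work.
print(analytical_score({"design_process": 2, "technical_skill": 1, "communication": 1}))  # 4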
Assessment based on a standards framework is not new and has been used in a
number of countries for many decades. The best known example of an assessment
based around a standards framework is the testing associated with the National
Curriculum in the United Kingdom in the 1980s and 1990s. At the school level
Schagen and Hutchison (1994) found that there was a “… variety of different
methods used to award Levels based on marks obtained or performance on
Statements of Attainment (SoA)”. However, at the national level there are a number
of National Curriculum Assessment tests that must be completed by all students
of selected ages in the UK. Reliability studies of these tests have found that in
some of the National Curriculum Assessment tests “pupils of similar ability could
be assigned Levels two or more apart” due to statistical error or other factors such as
context, test construction, etc. (Schagen & Hutchison, 1994).
In a study titled “Assessing Expressive Learning” that involved nearly 2000 art
portfolios and the use of rubrics it was found that “qualitative instructional outcomes
can be assessed quantitatively, yielding score values that can be manipulated
statistically, and that produce measures that are both valid and reliable estimates of
student art performance” (Madeja, 2004).
Comparative Pairs Method of Marking
The comparative pairs judgement method of marking involves Rasch Modelling and
was used by Kimbell, Wheeler, Miller and Pollitt (2007) in the e-scape (e-solutions
for creative assessment in portfolio environments) project, delivering high assessor
reliability. Pollitt (2004) describes the comparative pairs method of marking applied
to performance assessment in his paper, “Let’s stop marking exams”. He claims the
method he and his colleagues have developed is “intrinsically more valid” and is
“rooted in the psychophysics of the 1920s” (p. 2). He goes on to explain that while
the system is better than the traditional system to this stage, it has not been feasible
to apply, due to time and cost constraints, however, with the use of ICT to support
this system these constraints are removed and “Thurstone’s methods” that “have
waited 80 years … are at last … feasible” (p. 21). He quotes Laming that there is
“no absolute judgement. All judgements are comparisons of one thing with another”
and explains that it is more reliable to compare performances or products between
students than with “descriptions of standards” (p. 6). He claims that they have more
than ten years’ experience in applying the method in a variety of contexts and that
with expert application about 20 comparisons per student are required. However, he
does suggest that the method should not be used with every type of assessment, with
research required to determine the appropriateness and whether “sufficient precision
can be achieved without excessive cost” (p. 16). A description of the mathematics
behind the method, and how it was implemented in the online system developed for
the e-scape project, is provided by Pollitt (2012).
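The following minimal sketch illustrates the idea underlying the comparative pairs method: each piece of student work is given a latent quality estimate, and the probability of one piece winning a comparison depends on the difference between the estimates (a Bradley-Terry style model closely related to the Rasch formulation). The judgement data, the simple gradient-ascent fit and the parameter names are illustrative assumptions; this is not the e-scape software or the procedure used in the study.

import math

# Toy comparative pairs scaling: P(i beats j) = 1 / (1 + exp(-(theta_i - theta_j))).
# Each (winner, loser) tuple records one assessor judgement; the data are invented.
judgements = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
    ("C", "D"), ("B", "D"), ("C", "A"), ("D", "B"),
]

students = sorted({s for pair in judgements for s in pair})
theta = {s: 0.0 for s in students}   # latent quality estimates, start at zero

def p_win(a: str, b: str) -> float:
    return 1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))

# Simple gradient ascent on the log-likelihood of the observed judgements.
learning_rate = 0.1
for _ in range(500):
    grad = {s: 0.0 for s in students}
    for winner, loser in judgements:
        p = p_win(winner, loser)
        grad[winner] += 1.0 - p
        grad[loser] -= 1.0 - p
    for s in students:
        theta[s] += learning_rate * grad[s]

# Centre the scale (only differences are identified) and report a ranking.
mean = sum(theta.values()) / len(theta)
for s in sorted(students, key=lambda s: theta[s] - mean, reverse=True):
    print(f"{s}: {theta[s] - mean:+.2f}")

In practice, as Pollitt notes, around 20 comparisons per student are gathered from a pool of assessors and the scale and its reliability are estimated with Rasch-family software; the sketch only shows how a set of binary better/worse judgements can be turned into a common scale.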
McGaw (2006) also believes that the comparative pairs method of marking
provides an opportunity to improve the validity of high-stakes assessment in
separating the “calibration of a scale and its use in the measurement of individuals”
(p. 6). He claims that while the “deficiency” of norm-referenced assessment has been
understood for many years, it was seen that there was no alternative. Now he believes
that, with methods involving comparisons being supported by digital technologies,
there is an alternative that should be explored.
An important question is whether the advances in psychometrics that permit
calibration of scales and measurement of individuals that allows interpretation
of performance in terms of scales can be applied in public examination.
(McGaw, 2006, p. 7)
The comparative pairs method of marking necessarily combines the judgements of
a group of assessors that could be seen as what Boud (2009, p. 30) refers to as a
“community of judgement”. He also explains that if assessment is to be shaped by
practice, as related to everyday world activity, or situated action, then a holistic
conception must be applied. Although he calls for a move away from “measurement-oriented views of assessment” (p. 35), in fact the comparative pairs method of marking
provides holistic judgement, on situated activity, by a community of practice, while
still meeting stringent measurement requirements.
CONCEPTUAL FRAMEWORK FOR THE STUDY
In order to investigate the use of digital representations to deliver authentic and
reliable assessments of performance this study brought together three key innovations:
1. The representation in digital files of the performance of students doing practical
work.
2. The presentation of digital representations of student performance in an online
repository so that they are easily accessible to markers.
3. Assessing the digital representations of student performance using both analytical
standards-referenced judgement and the paired comparison judgement methods
with holistic and component criteria-based judgements.
While none of these innovations is new in itself, their combination applied
at the secondary level of education is new. Apart from Kimbell’s (2007) work at the
University of London there was no known precedent.
Fundamentally this study investigated the use of digital forms of representation
of student practical performance for summative assessment, whether the student
created digital files or their performance was recorded in digital format by filming,
photographing, audio recording or scanning.
The digital representations of student performance were combined within
an online repository. The use of online repositories for student work output is
increasingly common, often referred to as online portfolios, with many products
available to facilitate their creation and access (Richardson & Ward, 2005). The key
feature is that the portfolios can be accessed from anywhere and thus markers from
different jurisdictions can be involved, enhancing consistency of standards.
The paired comparison judgement method of marking, involving Rasch
Modelling, was implemented using holistic judgements. While Pollitt (2004)
describes the method as “intrinsically more valid” and better than the traditional
system, he believes that without some ICT support it has not been feasible to apply
due to time and cost constraints, and he does suggest that further research is required
to determine the appropriateness and whether “sufficient precision can be achieved
without excessive cost” (p. 16). McGaw (2006) believes that such methods being
supported by digital technologies should be applied in public examinations.
The diagram in Figure 2.1 represents the main concepts involved in assessment
with the study focussing initially on the Assessment Task and thereby the Method
of Assessment and the Student Work itself. However, to investigate the achievement
of the desired performance indicators that relate to Kimbell’s feasibility framework
the study was also involved with Task Assessment, in particular marking activities
using standards frameworks and the use of the comparative pairs marking approach.
[Figure 2.1 (diagram): the Assessment task (what the student does) produces the Student work (task or object), which is judged through Task assessment (what the assessor does), drawing on marking criteria (e.g., marking schemes and guides), marking activities (moderation, marking and grading/reporting) and assessor skills/knowledge (e.g., content, standards), all within a chosen Method of assessment (the means of assessing learning). These elements are framed by desired learning outcomes and institutional goals; desired performance indicators (valid, reliable, authentic, transparent, fair); considerations of manageability, technology adequacy, functional acceptability and pedagogic value; quality assurance, training and performance support at all levels and stages of the assessment process; and management and administration across all aspects of assessment for all stakeholders.]
Figure 2.1. A diagrammatic representation of a conceptual framework for the assessment of
performance (based on the work of Campbell, 2008).