Digital Representations of Student Performance for Assessment

Edited by P. John Williams, University of Waikato, New Zealand, and C. Paul Newhouse, Edith Cowan University, Australia

It was the belief that assessment is the driving force of curriculum that motivated the authors of this monograph to embark on a program of research and development into the use of digital technologies to support more authentic forms of assessment. They perceived that in responding to the educational needs of children in the 21st Century, curriculum needed to become more relevant and engaging, but that change was unlikely without commensurate change in methods and forms of assessment. This was particularly true for the high-stakes assessment typically conducted at the conclusion of schooling, as this tended to become the focus of the implemented curriculum throughout the years of school. Therefore the authors chose to focus on this area of assessment with the understanding that this would inform assessment policy and practices generally in schools. This book provides a conceptual framework and outlines a project in which digital methods of representing students' performance were developed and tested in the subject areas of Applied Information Technology, Engineering, Italian and Physical Education. The methodology and data collection processes are discussed, and the data is analysed, providing the basis for conclusions and recommendations.

A C.I.P. record for this book is available from the Library of Congress.

ISBN: 978-94-6209-339-3 (paperback)
ISBN: 978-94-6209-340-9 (hardback)
ISBN: 978-94-6209-341-6 (e-book)

Published by: Sense Publishers, P.O. Box 21858, 3001 AW Rotterdam, The Netherlands
https://www.sensepublishers.com/

Printed on acid-free paper

All Rights Reserved
© 2013 Sense Publishers

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
TABLE OF CONTENTS

Preface

Introduction and Background (John Williams)
  Significance and Rationale
  Statement of Problem and Research Question
  Method
  Recommendations

Literature Review and Conceptual Framework (Paul Newhouse)
  Performance Assessment
  Computer-Supported Assessment
  Digital Forms of Performance Assessment
  Methods of Marking
  Conceptual Framework for the Study

Method and Analysis (John Williams and Alistair Campbell)
  Samples
  Data Collection and Analysis
  Methodology Framework
  Developing the Assessment Tasks

Applied Information Technology (Paul Newhouse)
  The Nature of the AIT Course
  Implementation and Technologies
  Online Repository
  Analytical Marking and Analysis
  Comparative Pairs Marking
  Conclusions About Marking Processes
  Student and Teacher Perceptions and Attitudes
  Comparison Between Classes
  Conclusions from the AIT Course
  Summary of Findings for AIT
  Recommendations from the AIT Data

Engineering Studies (John Williams)
  Implementation and Technologies
  Engineering Case Studies
  Online Repository
  Analytical Marking and Analysis
  Comparative Pairs Marking and Analysis
  Conclusions About Marking Processes
  Conclusions from Student and Teacher Data
  Comparison Between Classes
  Conclusions from Engineering Course
  Summary of Findings from Engineering Studies Case Studies
  Recommendations for Engineering

Italian Studies (Martin Cooper)
  Implementation and Technologies
  Italian Case Studies
  Online Repository
  Analytical Marking and Analysis
  Comparative Pairs Marking and Analysis
  Conclusions About Marking Processes
  Conclusions from Student and Teacher Data
  Overall Conclusions from Italian Course
  Summary of Findings for Italian Studies

Physical Education Studies (Dawn Penney and Andrew Jones)
  Implementation and Technologies
  Case Studies
  Online Repository
  Analytical Marking and Analysis
  Comparative Pairs Marking and Analysis
  Conclusions About Marking
  Conclusions from Student and Teacher Data
  Overall Conclusions from PES Course
  Summary of Findings from PES

Findings and Conclusions (Jeremy Pagram)
  Findings
  General Conclusions

References

PREFACE

It was the belief that assessment is the driving force of curriculum that motivated the authors of this monograph to embark on a program of research and development into the use of digital technologies to support more authentic forms of assessment. They perceived that in responding to the educational needs of children in the 21st Century, curriculum needed to become more relevant and engaging, but that change was unlikely without commensurate change in methods and forms of assessment. This was particularly true for the high-stakes assessment typically conducted at the conclusion of schooling as this tended to become the focus of the implemented curriculum throughout the years of school. Therefore the authors chose to focus on this area of assessment with the understanding that this would inform assessment policy and practices generally in schools. It is gratifying when a project which is researching at the cutting edge of educational development leads to real change in educational practice, as was the case in this project. A number of the recommendations made were implemented around the time of the conclusion of the project.
The recognition of the need for valid and reliable high stakes assessment, and the coinciding development of technologies which can feasibly capture the performance of students in school, will help ensure that the outcomes of this research continue to inform educational assessment decision making. We would like to thank all the chapter authors for their willingness to develop their chapters, and also Cathy Buntting for her expertise in reviewing the manuscript and then formatting it to such a high standard. This monograph is the outcome of a three-year research project that was managed by the Centre for Schooling and Learning Technologies (CSaLT) at Edith Cowan University, and funded by the Australian Research Council Linkage Scheme and the Curriculum Council of Western Australia. The research was conducted under the leadership of Paul Newhouse and John Williams, and the authors of the chapters in this book were the Investigators in the project. A broader team of consultants, managers, advisors, research assistants, postgraduate students, assessors and teachers all helped to ensure the project’s successful conclusion. A number of conference and journal outcomes have accompanied this project and supported this book. They are listed after the References at the end of the book. John Williams and Paul Newhouse April, 2013 vii CHAPTER 1 JOHN WILLIAMS INTRODUCTION AND BACKGROUND This research was conducted in Western Australia (WA) over a period of three years, concluding in 2011. This report of the research focuses on the findings, conclusions and recommendations of the study, but contextualizes that within a rationale, literature review and description of the methodology. The study set out to investigate the use of digital forms of assessment in four upper secondary school courses. It built on concerns that the assessment of student achievement should, in many areas of the curriculum, include practical performance and that this will only occur in a high-stakes context if the assessment can be shown to validly and reliably measure the performance and be manageable in terms of cost and school environment. The assessment examined in this research is summative in nature (i.e. it is principally designed to determine the achievement of a student at the end of a learning sequence rather than inform the planning of that sequence for the student) with reliability referring to the extent to which results are repeatable, and validity referring to the extent to which the results measure the targeted learning outcomes. The research specifically addressed a critical problem for the school systems in Western Australia, which also has national and international significance. At the same time the research advanced the knowledge base concerning the assessment of practical performance by developing techniques to represent practical performance in digital forms, collate these in online repositories, and judge their quality using a standards-based marking method and trialling a comparative pairs marking method. SIGNIFICANCE AND RATIONALE From the 1990s, significant developments in computer technology have been the emergence of low-cost, high-powered portable computers, and improvements in the capabilities and operation of computer networks (e.g., intranets and the accessibility of the Internet). These technologies have appeared in schools at an escalating rate. 
During that same period school systems in Australia were moving towards a more standards-based curriculum and investigating methods of efficiently and effectively assessing students from this perspective. P. J. Williams and C. P. Newhouse (Eds.), Digital Representations of Student Performance for Assessment, 1–8. © 2013 Sense Publishers. All rights reserved. J. WILLIAMS In Western Australia this became critical with the development of high-stakes senior secondary courses to be implemented over the latter half of the decade. In some courses developments in technology dictated that students should be assessed making use of that technology, while in many courses it was likely that at least some of the intended learning outcomes were not able to be adequately assessed using paper-based methods. Therefore it was important that a range of forms of assessment were considered along with the potential for digital technologies to support them. There is a critical need for research into the use of digital forms of representation of student performance on complex tasks for the purposes of summative assessment that are feasible within the constraints of school contexts. Internationally the need for better forms of assessment is increasingly being seen as a critical component in improving schooling, and is often discussed under the banner of ‘twenty-first century skills’ (Kozma, 2009). Recently (March 2011), even the American President spoke at length on the need to measure performance in other ways to traditional exams in order to support students “learning about the world” and so that “education would not be boring for kids” (eSchool News, 2011, p. 15). However, it is still necessary that these alternative forms of assessment generate defensible measurements; that is, are valid and reliable measures of the intended performance, particularly for summative assessment. An assessment needs to possess content, criterion and construct validity (Dochy, 2009), the first being the extent to which the assessment addresses the relevant knowledge domain, its authenticity. Dochy (2009) sees the identification of criterion and construct validity as being more problematic for new modes of assessment focused on complex problems. Criterion validity is the extent to which the assessment correlates with another assessment designed to measure the same construct. Construct validity is the extent to which the assessment measures a ‘construct’ within a ‘conceptual network’ usually through estimating relationships with other constructs. The value of the validity of an assessment is dependent on the reliability of measurement that may be interpreted as the degree of agreement between assessors (inter-rater reliability) or degree of consistency between assessments (e.g., test-retest). Dochy questions this classical theory and argues for the use of generalisability theory that seeks to include judgements from multiple perspectives and from multiple assessments to generalise the behaviour of a student. In essence this theory seeks to identify and explain sources of error in measurement rather than minimise it. This research investigated authentic digital forms of assessment with high levels of reliability and manageability, which were capable of being scaled-up for statewide implementation in a cost effective manner. The findings of this research provide guidelines for educators and administrators that reflect successful practice in using Information and Communications Technology (ICT) to support standards based courses. 
The findings also provide significant benefit to the wider educational community, particularly in terms of the development and provision of a nationally consistent schooling system with accountability to standards in senior schooling systems. 2 INTRODUCTION AND BACKGROUND STATEMENT OF PROBLEM AND RESEARCH QUESTION The general aim of the study was to explore the potential of various digitally-based forms for external assessment for senior secondary courses in Western Australia. Specifically the study set out to determine the feasibility of four digital-assessment forms in terms of manageability, cost, validity and reliability, and the need to support a standards-based curriculum framework for students in schools across the state. The problem being addressed was the need to provide students with assessment opportunities in new courses, which are on one hand authentic, where many outcomes do not lend themselves to being assessed using pen and paper over a three hour period, while on the other hand being able to be reliably and manageably assessed by external examiners. That is, the external assessment for a course needs to accurately and reliably assess the outcomes without a huge increase in the cost of assessment. The main research question was: How are digitally based representations of student work output on authentic tasks most effectively used to support highly reliable summative assessments of student performances for courses with a substantial practical component? The study addresses this question by considering a number of subsidiary questions. 1. What are the benefits and constraints of each digitally based form to support the summative assessment of student practical performance in senior secondary courses in typical settings? 2. What is the feasibility of each digital form of assessment in terms of the four dimensions: technical, pedagogic, manageability, and functional? 3. Does the paired comparison judgments method deliver reliable results when applied to student practical performance across different courses? The courses selected for the research were Applied Information Technology, Engineering Studies, Italian Studies and Physical Education Studies. Following is a summary of the specific issues in each of these courses. Discussion of the Problem for Applied Information Technology (AIT) In contrast to the other three courses in the project, in Applied Information Technology, digital technologies provide the content for study as well as pedagogical support. Therefore performance relates to using the technologies to demonstrate capability. The syllabus states that the AIT course “provides opportunities for students to develop knowledge and skills relevant to the use of ICT to meet everyday challenges”. As such in the course students should “consider a variety of computer applications for use in their own lives, business and the wider community”. In the course students spend the majority of their time in class using digital technologies to develop information solutions. It should therefore be surprising that currently the external assessment consists of a three-hour paper-based exam. This is despite 3 J. WILLIAMS the fact that the syllabus stipulates that around 50% of the weighting of assessment should be on production. In early 2008 courses like AIT were changed with the decision that all senior students were to sit an external examination. 
The ramifications of this decision were likely to be widespread including that the ‘exam’ would have to be appropriate for lower achieving students, it would dominate the course delivery and would involve a lot more students, increasing the cost considerably. Originally it had been assumed that because only higher achieving students were likely to be involved, the extra time needed to collate a portfolio was reasonable and would only include higher quality work that would be easier to mark. Another confounding change was the requirement for the course to be packaged in a syllabus format with details of specific content for each unit rather than what had been a definition of the boundaries of the content with the opportunity to address the content to varying depths and across a range of relevant contexts for the students and teacher. This also led to a shift of focus away from outcomes towards content that immediately highlighted the issue of the variety of relevant contexts that could be involved in the course and the issue of the rapidly changing content of these areas of technology. This had not been such an issue with the focus on outcomes because they could be applied to the range of contexts and did not specify particular content that could quickly date. This has since led to the focus for assessment being on assessment type rather than outcomes. While students can include study in AIT towards University entry this would be of no value if the external assessment propels the course towards becoming mainly ‘book work’ rather than creative digital work. We are living in a society where almost every avenue of work and life requires the use of digital tools and resources. Whether a senior student is aiming to be a mechanic, doctor, accountant or travel agent, study in AIT could begin to give them the skills, attitudes and understanding that will support them in being more successful in work and life. Therefore the research problem for the AIT course becomes that to align with the aims, rationale, outcomes, content and preferred pedagogy, assessment must include students using digital technologies but there are a number of ways in which that may be achieved. The research question therefore becomes, which method of assessment, portfolio or computer-based exam or combination, is most feasible for the course at this time? Discussion of the Problem for Engineering Studies In 2007 a new senior secondary subject, Engineering Studies, was introduced in Western Australia. As a result, for the first time in Western Australia, achievements in Engineering Studies could contribute to gaining tertiary entrance. Thus, an assessment structure had to be designed and implemented that would measure achievement in Engineering. The course was structured with a design core, and then students could study one of three options: materials, structures and mechanical systems; systems and control or electrical/electronics. 4 INTRODUCTION AND BACKGROUND The assessment structure had an internal and an external component. The teacher submitted a mark for each student, representing design, production and response activities throughout the year and worth 50% of the student’s final mark. A 3-hour external written examination measured student knowledge on both the core and specialization areas through a series of multiple choice and short answer questions, and was combined and moderated with the school based assessment mark. 
For a practical and performance based subject the examination did not reflect that essential nature of the subject. Consequently pedagogies were too theoretical as teachers taught for the exam and had difficulty effectively connecting theory and practice. The examination was therefore limited in relation to the course content, concepts and outcomes that it embraced. The practical examination developed in this project reaffirmed the need for research to explore the possibilities that new technologies may open up to extend the practical assessment in the course. Discussion of the Problem for Italian Studies In general, this research project has sought to explore digital assessment tasks that are targeted at the measurement of student performance in the area being investigated. However, Italian studies already had a tradition of assessing Italian oral performance through a face-to-face examination where two markers assess each student’s performance in real time. This examination is undertaken at a central location away from the students’ school. Therefore the Italian component of the study has focused on the exploration of different ways of digitally assessing oral performance that may have advantages in terms of validity, reliability and logistics. Ultimately techniques were trialled that both simulated a conversation using digital technologies and were capable of being carried out within a typical school which is teaching Italian. Throughout the research process the usefulness of digital technologies to the daily pedagogical practices of Italian teachers was also investigated and demonstrated. In the final year of the project the scope of the research was expanded to cover Listening and Responding, Viewing, Reading and Responding in addition to Oral Communication. The final formal assessment task had components designed to address these areas such as visual stimuli, and Italian audio for the students to respond to. Discussion of the Problem for Physical Education Studies In 2007 a new senior secondary, Physical Education Studies, was introduced in WA. The development of the course meant that for the first time in WA, student achievements in Physical Education Studies could contribute to gaining tertiary entrance. A challenge and dilemma for the course developers was to determine the nature of the achievements that could be encompassed in assessment and specifically, an external examination. Differences in current practice across Australasia reflect an 5 J. WILLIAMS ongoing lack of consensus about the examination requirements and arrangements for senior physical education that can effectively address concerns to ensure validity, reliability, equity and feasibility. More particularly, the differences centre on firstly, whether and in what ways the skills, knowledge and understandings inherent in practical performance can feasibly and reliably be assessed so as to align with examination requirements; and secondly, how any such assessment can align with an intent embedded in the new WA and a number of other course developments, to seek to better integrate theoretical and practical dimensions of knowledge (see for example, Macdonald & Brooker, 1997; Penney & Hay, 2008; Thorburn, 2007). In these respects, the research sought to acknowledge and respond to Hay’s (2006, p. 317) contention that: …authentic assessment in PE should be based in movement and capture the cognitive and psychomotor processes involved in the competent performance of physical activities. 
Furthermore, assessment should redress the mind/body dualism propagated by traditional approaches to assessment, curriculum and pedagogies in PE, through tasks that acknowledge and bring to the fore the interrelatedness of knowledge, process (cognitive and motor), skills and the affective domain Thus, the assessment task and the associated use of digital technologies, was designed to promote integration of conceptual and performance-based learning. It also reflected that the PES course does not prescribe the physical activity contexts (sports) through which learning will be demonstrated. The task therefore needed to be adaptable to the varied sporting contexts that schools may choose to utilise in offering the PES course. METHOD The focus of this study was on the use of digital technologies to ‘capture’ performance on practical tasks for the purpose of high stakes summative assessment. The purpose was to explore this potential so that such performances could be included to a greater extent in the assessment of senior secondary courses, in order to increase the authenticity of the assessment in these courses. The study involved case studies for the four courses. During the three years there was a total of at least 82 teachers and 1015 students involved. The number of students involved in each case study ranged from 2 to 45. Therefore, caution needs to be taken in interpreting the analysis and generalising from the results. Four different fundamental forms of assessment (reflective portfolios, extended production exams, performance tasks exams, and oral presentations) were investigated in 81 cases with students from the four courses and with the assessment task being different in each course. For each course there was a common assessment task that consisted of a number of sub-tasks. For each case a variety of quantitative and qualitative data was collected from the students and teachers involved, including 6 INTRODUCTION AND BACKGROUND digital representations of the students’ work on the assessment tasks, surveys and interviews. These data were analysed and used to address the research questions within a feasibility framework consisting of four dimensions: Manageability (Can the assessment task be reasonably managed in a typical school?), Technical (Can existing technologies be adapted for assessment purposes?), Functional (Can the assessment be marked reliably and validly when compared to traditional forms of assessment?), and Pedagogic (Does a digital form of assessment support and enrich students’ learning experiences?). The evidence of performance generated from the digital assessment tasks were marked independently by two external assessors using an analytical standardsreferenced method. This method used detailed sets of criteria, represented as rubrics, and linked to the assessment task, appropriate course content and outcomes. Correlations were determined for comparison purposes between the two external assessors and also between the assessors and the classroom teacher. Additionally, the work was marked using the method of comparative pairs and these results were again compared against the results from the other forms of marking. This method of marking involved a panel of between 5 and 20 assessors and is based on Rasch dichotomous modelling. RECOMMENDATIONS The following recommendations are made with regard to the general application of findings of the study. 
Methods of Marking

– The comparative pairs method of marking typically generates more reliable scores than analytical marking, but is probably only valid when the assessment task is fundamentally holistic (i.e., not made up of many required sub-tasks) with a minimum of scaffolding (typically some is required to ensure students leave enough information to assess). Where there are a number of components, these would need to be considered separately if using a pairs comparison method as an holistic decision may not give appropriate proportionate weighting to the components. This is not an issue in analytical marking as weights are easily applied to the various components via the marking key (the contrast is illustrated in the sketch following these recommendations).
– Analytical standards-referenced marking may be used to generate reliable sets of scores for the range of digital forms of assessment tried, provided that criteria are developed specifically for the assessment task, assessors agree on an interpretation of the criteria, the values that can be awarded are limited to as small a range as possible tied to specific descriptors, and that descriptors only pertain to likely levels of demonstration of the criteria by the target students.
– It is desirable to implement either method of marking using online tools connected to a digital repository of student work. Assessors have few problems in accessing and recording scores whether they are resident locally, interstate or internationally.

Digital Forms of Assessment

– In typical WA secondary schools it would be possible to implement, for most courses, the range of digital forms of assessment tried, even for those that don't typically operate in an environment with ICT available, using local workstations (desktop or portable), with local storage, about 10% of workstations spare, and a school IT support person on-call.
– If an assessment task is implemented using online technologies, then in many typical WA secondary schools networks may not be adequate depending on the technologies used (e.g., Flash, Java) or the bandwidth required. Therefore each site would need testing under realistic conditions (e.g., a normal school day with the number of students accessing concurrently). Further, all data should be stored locally as a backup to the online storage (could upload at the end of the task).
– Students are highly amenable to digital forms of assessment, even those with less than average levels of ICT skill, generally preferring them to paper-based forms provided that they have confidence in the hardware and some experience with the software and form of assessment. Almost all students are able to quickly learn how to use simple types of software required in assessment (e.g., Paint for drawing diagrams, digital audio recording).
– Teachers are amenable to digital forms of assessment provided that benefits to students are clear, implementation is relatively simple, and any software is easy to learn.
– Experienced teachers and graduate students can be trained to implement digital forms of assessment.
– Undergraduate students in IT related courses can be trained to prepare the digital materials resulting from an assessment, ready for online access by assessors.
– Commercial online assessment systems such as MAPS/e-scape and Willock may be successfully used in WA schools, but are limited in their effectiveness by the infrastructure available in each school and by the design of those systems.
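To illustrate the contrast drawn in the marking recommendations above, the sketch below shows, first, how a weighted analytical marking key aggregates component scores and, second, how a set of holistic pairwise judgements can be scaled to a score for each portfolio. It is a minimal illustration only: the criteria, weights and judgement data are hypothetical, and the pairwise scaling is a simple Bradley-Terry style approximation of the Rasch pairwise approach referred to in this study, not the marking software actually used.

```python
# A minimal sketch (hypothetical data) contrasting the two marking approaches.

from math import exp

# --- Analytical standards-referenced marking: weights applied via the marking key ---
marking_key = {"design": 0.4, "production": 0.4, "evaluation": 0.2}   # hypothetical weights
student_scores = {"design": 7, "production": 5, "evaluation": 3}      # marks awarded per criterion
max_scores = {"design": 10, "production": 10, "evaluation": 5}        # maximum per criterion

analytical_total = sum(
    marking_key[c] * student_scores[c] / max_scores[c] for c in marking_key
)
print(f"Weighted analytical score: {analytical_total:.2f} (proportion of maximum)")

# --- Comparative pairs (holistic) marking: pairwise judgements scaled to a logit score ---
# Each tuple records (winner, loser) from one assessor's holistic judgement of two portfolios.
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "A"), ("A", "C"), ("C", "B")]
portfolios = {p for pair in judgements for p in pair}
ability = {p: 0.0 for p in portfolios}  # logit-scale estimate per portfolio

# Simple gradient ascent on the Bradley-Terry / Rasch pairwise likelihood.
for _ in range(300):
    for winner, loser in judgements:
        p_win = 1 / (1 + exp(-(ability[winner] - ability[loser])))
        ability[winner] += 0.05 * (1 - p_win)
        ability[loser] -= 0.05 * (1 - p_win)

for p, theta in sorted(ability.items(), key=lambda kv: -kv[1]):
    print(f"Portfolio {p}: estimated ability {theta:+.2f} logits")
```

The point of the contrast is only that analytical marking weights each component explicitly through the marking key, whereas comparative pairs marking produces a single holistic scale from many paired judgements; in the study itself the judgements were collected from a panel of assessors using online tools.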
8 CHAPTER 2 PAUL NEWHOUSE LITERATURE REVIEW AND CONCEPTUAL FRAMEWORK The aim of the research was to investigate the feasibility of using digital technologies to support performance assessment. As such the study connects with two main fields of research: performance assessment, and computer-supported assessment. However, clearly these are subsumed within the general field of assessment. While it will be assumed that the basic constructs within the field of assessment are known and apply perhaps it is useful to be reminded of this through a definition of assessment from Joughin (2009) and a statement of three pillars that Barrett (2005) suggests provide the foundation for every assessment. To assess is to make judgements about students’ work, inferring from this what they have the capacity to do in the assessed domain, and thus what they know, value, or are capable of doing. (Joughin, 2009, p. 16) – A model of how students represent knowledge and develop competence in a content domain. – Tasks or situations that allow one to observe students’ performance. – An interpretation method for drawing inferences from performance evidence. (Barrett, 2005) PERFORMANCE ASSESSMENT Research in, and the call to investigate “performance-and-product assessment” is not new as pointed out by Messick (1994, p. 14), tracing back at least to the 1960s. However, Messick claims that mainstream schooling showed little interest in this in until an “upsurge of renewed interest” in the 1990s with “positive consequences for teaching and learning” (p. 13). While Messick does not specifically address digital forms of performance assessment, his arguments for the need to address “issues of validity, reliability, comparability and fairness” apply, particularly to a range of validity criteria. He argues they are social values that require close attention to the intended and unintended consequences of the assessment through considerations of the purposes of the assessment, the nature of the assessed domain, P. J. Williams and C. P. Newhouse (Eds.), Digital Representations of Student Performance for Assessment, 9–28. © 2013 Sense Publishers. All rights reserved. P. NEWHOUSE and “construct theories of pertinent skills and knowledge” (p. 14). For example, he outlines situations under which product assessment should be considered rather than performance assessment. The issue is their relationship to replicability and generalisability requirements because these are important when performance is the “vehicle” of assessment. Lane (2004) claims that in the USA there has been a decline in the use of performance assessments due to increased accountability requirements and resource constraints. She outlines how this has led to a lack of alignment between assessment, curriculum standards, and instructional practices; particularly with regard to eliciting complex cognitive thinking. Dochy (2009) calls for new assessment modes that are characterised by students constructing knowledge, the application of this knowledge to real life problems, the use of multiple perspectives and context sensitivity, the active participation of students, and the integration of assessment and the learning environment. At the same time Pollitt (2004) argues that current methods of summative assessment that focus on summing scores on “micro-judgements” is “dangerous and that several harmful consequences are likely to follow” (p. 5). Further, he argues that it is unlikely that such a process will accurately measure a student’s “performance or ability” (p. 
5), and more holistic judgements of performance are required. Koretz (1998) analysed the outcomes of four large-scale portfolio assessment systems in the USA school systems and concluded that overall the programmes varied in reliability and were resource intensive with “problematic” (p. 309) manageability. This body of literature clearly presents the assessment of student performance as critically important but fundamentally difficult with many unanswered questions requiring research. Globally interest in performance assessment has increased with the increasing use of standards-referenced curricula. Standards-referenced curricula have evolved over the past 20 years particularly from the UK and more recently in Australian states since the early 1990s. The key concept in these curricula was that student achievement was defined in terms of statements describing what students understood, believed or could do. The term standards-referenced has tended to be used recently to indicate that student achievement is measured against defined standards. This has reinforced the need for clear alignment between intended curriculum outcomes and pedagogy, and assessment (Taylor, 2005). Alignment has typically been poor, particularly in areas where some form of practical performance is intended. Koretz (1998), who defines portfolio assessment as the evaluation of performance by means of a cumulative collection of student work, has figured prominently in USA debate about education reform. He analysed the outcomes of four largescale portfolio assessment systems in the USA school systems, in particular, in terms of their reliability. Each example involved marking student portfolios for the purpose of comparing students and/or schools across a state, mainly in English and Mathematics. All of the examples occurred in the 1990s and none involved digital representations of performance. Koretz concluded that overall the programmes were resource intensive and did not produce “evidence that the resulting scores provide a 10 LITERATURE REVIEW AND CONCEPTUAL FRAMEWORK valid basis for the specific inferences users base on them…” (p. 332). Even though he noted that significant improvements in the implementation and reliable marking of portfolios had been achieved, at that time he saw portfolio-based assessment as “problematic” (p. 309). Findings such as this provide a rationale for considering digital solutions to performance assessment. Apart from the lack of validity of traditional paper-based assessment methods another compelling rationale to consider the efficacy of performance assessment is that teachers tend to teach to the summative assessment (Dochy, 2009; Lane, 2004; Ridgway, McCusker, & Pead, 2006). McGaw (2006) discussed this in the light of changes in the needs of the society, advances in psychometric methods, and improvements in digital technologies and believed that there is a “risk that excessive attention will be given to those aspects of the curriculum that are assessed” and that “risk-taking is likely to be suppressed” (p. 2). This leads to what Dochy (2009) refers to as a deprofessionalization of teachers. Further, summative assessment tends to drive learning with students “adapting their approaches to learning to meet assessment requirements” (Joughin, 2009, p. 16). 
Joughin goes on to discuss how assessment determines what the actual curriculum is as opposed to the intended curriculum, the inference being that if the intended curriculum is to be implemented then assessment needs to align with and reinforce it. Worse than this, he explains how assessment will determine the extent to which students adopt deep approaches to learning as opposed to surface approaches. A concern underpinning the argument for computer-based assessment methods to replace traditional paper-and-pencil methods was presented by the American National Academy of Sciences (Garmire & Pearson, 2006). They argue that assessing many performance dimensions is too difficult on paper and too expensive using "hands-on laboratory exercises" (p. 161) while computer-based assessment has the potential to increase "flexibility, authenticity, efficiency, and accuracy" but must be subject to "defensible standards" (p. 162) such as the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). The committee cites the use of computer-based adaptive testing, simulations, computer-based games, electronic portfolios, and electronic questionnaires as having potential in the assessment of technological literacy (2006). They concluded that computer-based simulations were suitable but could be expensive. They also noted that electronic portfolios "appear to be excellent tools for documenting and exploring the process of technological design" (p. 170), while raising a number of questions requiring research. McGaw (2006) also believes that without change to the main high-stakes assessment strategies currently employed there is a reduced likelihood that productive use will be made of formative assessment. He is not alone in this concern; for example, Ridgway et al. (2006, p. 39) state that, "There is a danger that considerations of cost and ease of assessment will lead to the introduction of 'cheap' assessment systems which prove to be very expensive in terms of the damage they do to students' educational experiences." Therefore, from both a consideration of the need to improve the validity of the assessment of student practical performance, and the likely negative impact on teaching (through not adequately assessing this performance using ill-structured tasks) there is a strong rationale for exploring alternative methods of assessment (Dochy, 2009). However, any approach or strategy will not be perfect and will require compromises and consideration of the following questions:

1. What skills or knowledge are best demonstrated through practical performance?
2. What are the critical components of that practical performance?
3. Why can't those components be demonstrated on paper?
4. What alternative representations other than paper could be used?
5. What level of compromise in reliability, authentication and cost is acceptable in preference to not assessing the performance at all?

COMPUTER-SUPPORTED ASSESSMENT

Computer-Supported Assessment, sometimes referred to as Computer-Assisted Assessment, is a broad term encompassing a range of applications from the use of computers to conduct the whole assessment process such as with on-screen testing, to only assisting in one aspect of the task assessment process (e.g., recording performance or marking) (Bull & Sharp, 2000b).
The first area of the task assessment process that took advantage of computer-support was objective type assessments that automated the marking process (eliminating the marker) and allowed the results to be instantly available. Bull and Sharp (2000a) found that the use of computers to support assessment has many advantages for the assessment process, assessors and students. Much of the published research in the field of computer-supported assessment relates to higher education, particularly in university settings (e.g., Brewer, 2004), with little specific to school-based education. However, in the school sector assessment of student creative work in the arts has been addressed for some time with, for example, Madeja (2004) arguing the case for alternatives to paper-and-pencil testing for the arts. Further, there has been some research into the use of portfolios for assessment but most often this is for physical, not digital, portfolios. There has been a limited amount of research in the area in Australia, typically these have been small-scale trials in the use of IT to support assessment processes (e.g., Newhouse, 2005). There have also been reports on the use of online testing in Australia, such as by MacCann (2006), but these usually do not involve assessing practical performance and merely replicate paper-and-pen tests in an online environment. While there has been only limited empirical research into many areas of computersupported assessment there are many useful theoretical discussions of the issues such as Spector’s (2006) outline of a method for assessing learning in “complex and ill-structured task domains”. While providing useful ideas and rationales these ideas remain largely untested in the reality of classrooms. What is known is that any use of 12 LITERATURE REVIEW AND CONCEPTUAL FRAMEWORK ICT involves school change (Lim & Hung, 2003; Newhouse, Clarkson, & Trinidad, 2005) and will require training of teachers, changes in thinking, and pedagogical understandings that are difficult to take on, even for younger teachers (Newhouse, Williams, & Pearson, 2006). There has been increasing interest internationally in the application of computer support to improve assessment as indicated in the focus of a recent keynote address by McGaw (2006). The University of Cambridge Local Examinations Syndicate is conducting over 20 projects to explore the impact of new technologies on assessment including using online simulations in assessing secondary science investigation skills (Harding, 2006). Other organisations (e.g., Becta, 2006) or groups of researchers (e.g., Ridgway et al., 2006) have reported on exploratory projects, particularly the increasing use of online testing, although rarely for high-stakes assessment and not without some difficulty (Horkay, Bennett, Allen, Kaplan, & Yan, 2006). The British Psychological Society has produced a set of guidelines for ComputerBased Assessment. While they mainly focus on online testing they provide a conceptual model that includes Assessment Generation, Assessment Delivery, Assessment Scoring and Interpretation, and Storage, Retrieval and Transmission. The latter two were relevant to the present study with the guidelines for developers and users. Recently the Joint Research Centre for the European Commission (Scheuermann & Bojornsson, 2009) brought out a major report titled, The Transition to ComputerBased Assessment. 
Kozma (2009) lays out the rationale in terms of a mismatch between what is needed in modern society and what is addressed and thus assessed at school. In particular he draws attention to the differences between standardized pen-and-paper assessment and “Tasks in the Outside World”. In the latter he explains how tasks: require cross-discipline knowledge; relate to complex ill-structured problems; and are completed collaboratively using a wide range of technological tools to meet needs and standards. These characteristics are at odds with traditional approaches to assessment. While he does not see assessment reform only requiring the use of ICT he outlines a number of significant advantages including: reduced costs; increased adaptability to individuals; opportunity to collect process data on student performance; the provision to tools integral to modern practice; and better feedback data. Kozma does introduce a number of challenges to using ICT to support assessment including: start-up costs for systems; the need to choose between standardized and ‘native’ applications; the need to integrate applications and systems; the need to choose between ‘stand-alone’ and online implementation; the need for security of data; the need for tools to make the design of tasks easy and efficient; and the lack of knowledge and examples of high-quality assessments supported by ICT. He also highlights methodological challenges including: the extent of equivalence with pen-and- paper; the design of appropriate complex tasks; making efficient and reliable high-level professional judgements; scoring students’ processes and strategies; and distinguishing individual contributions to collaborative work. 13 P. NEWHOUSE A recent research initiative of Cisco, Intel and Microsoft (Cisco, Intel, & Microsoft, 2009) is the Assessment and Teaching of 21st Century Skills project. The paper that was a call to action clearly argues that changes are required to high stakes assessments before needed change will occur in schools. Reform is particularly needed in education assessment-how it is that education and society more generally measure the competencies and skills that are needed for productive, creative workers and citizens. Accountability is an important component of education reform. But more often than not, accountability efforts have measured what is easiest to measure, rather than what is most important. Existing models of assessment typically fail to measure the skills, knowledge, attitudes and characteristics of self-directed and collaborative learning that are increasingly important for our global economy and fast changing world. New assessments are required that measure these skills and provide information needed by students, teachers, parents, administrators, and policymakers to improve learning and support systemic education reform. To measure these skills and provide the needed information, assessments should engage students in the use of technological tools and digital resources and the application of a deep understanding of subject knowledge to solve complex, real world tasks and create new ideas, content, and knowledge. (Cisco et al., 2009, p. 
1) Ripley (2009) defines e-assessment as "the use of technology to digitise, make more efficient, redesign or transform assessments and tests; assessment includes the requirements of school, higher education and professional examinations, qualifications, certifications and school tests, classroom assessment and assessment for learning; the focus of e-assessment might be any of the participants with the assessment processes – the learners, the teachers and tutors, managers, assessment and test providers and examiners". He presents two 'drivers' of e-assessment: business efficiency and educational transformation. The former leads to migratory strategies (i.e. replicating traditional assessment in digital form) while the latter leads to transformational strategies that change the form and design of assessment. An example of the latter is the recent ICT skills test conducted with 14-year olds in the UK in which students completed authentic tasks within a simulated ICT environment. He raises issues that need to be addressed including: providing accessibility to all students; the need to maintain standards over time; the use of robust, comprehensible and publicly acceptable means of scoring students' work; describing the new skill domains; overcoming technological perceptions of stakeholders (e.g., unreliability of IT systems); and responding to the conceptions of stakeholders about assessment. Lesgold (2009) calls into question the existence of a shared understanding among the American public on what is wanted out of schools and how this may have changed with changes in society. He argues that this must go with changes to assessment to include 21st century skills and that this will not be served by the traditional standard approach to testing based on responses to small items that minimises the need for human judgement in marking. Instead students will need to respond to tasks representing complex performances, supported by appropriate tools, with the results needing to be judged by experts. He recognises the issues that this would throw up and provides 'stealth assessment' as an example solution. In this example students complete a portfolio of performance at school over time, supervised by the teacher. The testing system then selects one or two "additional performances" to be externally supervised "as a confirmation that the original set was not done with inappropriate coaching" (p. 20). This is more amenable to 'learning by doing' and project-based learning where bigger, more realistic tasks can be accomplished that develop attributes such as persistence. At this stage it is likely that only a minority of teachers provide students with experiences in using ICT to support any forms of assessment. For example, in a survey reported by Becta (2010) in Messages from the Evidence: Assessment using Technology, it was found that at best 4 out of 10 teachers reported using ICT to 'create or administer assessment'. This lack of experience for students and teachers is likely to be a constraint in using ICT to support summative assessment, particularly where the stakes are high.

DIGITAL FORMS OF PERFORMANCE ASSESSMENT

Many educational researchers argue that traditional assessment fails to assess learning processes and higher-order thinking skills, and go on to explain how digital technologies may address this problem (Lane, 2004; Lin & Dwyer, 2006).
This argument centres around the validity of the assessment in terms of the intended learning outcomes, where there is a need to improve the criterion-related validity, construct validity and consequential validity of high-stakes assessment (McGaw, 2006). Further, in some school courses students learn with technologies and this dictates that students should be assessed making use of those technologies. Dede (2003) suggests that traditionally educational assessment has been "based on mandating performance without providing appropriate resources, then using a 'drive by' summative test to determine achievement" (p. 6). He goes on to explain how digital technologies may address this problem and claims that "the fundamental barriers to employing these technologies effectively for learning are not technical or economic, but psychological, organizational, political and cultural" (p. 9). Taylor (2005) optimistically suggests that, "as technology becomes an integral component of what and how students learn, its use as an essential tool for student assessment is inevitable" (p. 9). Lin and Dwyer (2006) argue that to date computer technology has really only been used substantially in assessment to automate routine procedures such as for multiple-choice tests and collating marks. They suggest that the focus should be on capturing "more complex performances" (p. 29) that assess a learner's higher-order skills (decision-making, reflection, reasoning and problem solving) and cite examples such as the use of simulations and the SMART (Special Multimedia Areas for Refining Thinking) model, but suggest that this is seldom done due to "technical complexity and logistical problems" (p. 28). A recent review of assessment methods in medical education (Norcini & McKinley, 2007) outlines performance-based assessment of clinical, communications and professional skills using observations, recordings and computer-based simulations.

Design and Development of Digital Assessments

A major aim of the study was to increase the validity of assessment using a variety of forms of assessment supported by digital technologies. Clearly the design of the tasks for the form of assessment was critically important. Dochy (2009, p. 105) discusses the manner in which "new assessment modes" may improve the validity of the tasks, the scoring, generalisability, and consequential validity. He explains that initially construct validity "judges how well assessment matches the content and cognitive specifications of the construct being measured". In the study this was achieved using course teams, a situation analysis, and seeking the perceptions of teachers and students. If this is done then he claims the authenticity and "complex problem characteristics" of the task improve its validity. Secondly he explains that criteria to judge student performances need to be fair and allow demonstration of ability. In the study this was addressed through the use of standards-referenced analytical marking and holistic comparative pairs marking, and through correlation analyses between methods of marking and teacher-generated scores. Thirdly, he explains how generalisability can be improved through greater authenticity and a consideration of reliability. In the study this was addressed through a combination of Rasch model analysis, and inter- and intra-rater correlation analysis.
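To make the reliability checks referred to above concrete, the sketch below computes an inter-rater correlation between two assessors' analytical scores for the same set of student submissions. The scores are hypothetical, and the Pearson correlation shown is only one of the checks mentioned; the study's Rasch model analyses and intra-rater comparisons are not reproduced here.

```python
# A minimal sketch (hypothetical scores) of an inter-rater reliability check:
# the Pearson correlation between two assessors marking the same submissions.

from math import sqrt

assessor_1 = [14, 11, 18, 9, 16, 12, 15, 7]   # analytical scores from assessor 1
assessor_2 = [13, 12, 17, 8, 17, 10, 14, 9]   # analytical scores from assessor 2

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r = pearson(assessor_1, assessor_2)
print(f"Inter-rater correlation: r = {r:.2f}")  # values near 1 indicate strong agreement
```

A coefficient close to 1 indicates that the two assessors rank and space the submissions similarly; markedly lower values would signal disagreement over the interpretation of the marking criteria.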
Finally, Dochy discusses potential intended and unintended consequences of new forms of assessment such as improvements in teaching methods, higher performances and increased feelings of ownership and motivitation. For the purposes of the study four particular forms of assessment were defined that employed digital technologies to represent the output of student performance. These forms were an Oral presentation/interview, an Extended Production Exam, a Focussed Performance Tasks Exam and a Reflective Digital Portfolio and were not intended to provide an exhaustive list but rather define major forms that appeared to be relevant to the courses involved in the study. Sadler (2009) and Dochy (2009) provide longer lists of common forms appropriate for the assessment of complex learning. A Focussed Performance Tasks Exam was considered to be the completion, under ‘exam conditions’, of a range of practical tasks that are not necessarily logically connected and typically focus on the demonstration of practical skills. However, in reality the Exams created in the study for the AIT course provided some connection between the tasks and associated these with a scenario. Thus it had characteristics of an Extended Production Exam but without incorporating a full set of processes due to time constraints. The most comprehensive example of this type of assessment is that of Kimbell et al. (2007) in the UK where students spent 16 LITERATURE REVIEW AND CONCEPTUAL FRAMEWORK two consecutive mornings of three hours duration each working on a structured design activity for the production of a pill dispenser. All student work output was collected digitally using a networked Personal Digital Assistant (PDA) device and local server. A Reflective Process Digital Portfolio was considered to be a collection of digital artefacts of work output with some reflective commentary (journaling) by the student, organised according to specified parameters such as form, structure, and range of samples required. There are many types of digital portfolios used internationally (Taylor, 2005). For this study the portfolios were repositories of previous workoutput annotated by the student to explain the inclusion of the artefact and describe its characteristics relevant to assessment criteria. In a review of e-assessment the digital portfolio is recommended as a “way forward” in the high-stakes assessment of “practical” work in that ICT “provides an opportunity to introduce manageable, high quality coursework as part of the summative assessment process (Ridgway et al., 2006). Three uses of portfolios are suggested, one of which is “to provide a stimulus for reflective activity”. Thus, the use of portfolios is not new, particularly in areas such as the visual arts and design and technology but typically these have been paper-based (Garmire & Pearson, 2006). The exercise of assembling a portfolio is often seen as much as a “learning tool” as an “assessment tool” but the results are typically limited by physical storage space and methods of access (Garmire & Pearson, 2006). An Extended Production Exam was considered to be the completion, under ‘exam conditions’, of one practical assessment task that incorporated a full set of processes (e.g., design process, scientific investigation) and centred on one major scenario. Examples were found locally, nationally and internationally of performance on practical tasks being assessed through an extended production, or small project, under exam conditions. 
However, most did not involve the use of digital technologies. The most comprehensive example was again that of Kimbell et al. (2007) in the UK, described above, in which all student work output was collected digitally using a networked PDA device and local server. In WA the final Drama assessment has involved a short individual 'performance', that is, face-to-face assessment, which is also usually videotaped, although again it is not typically assessed in a digital form. On a number of occasions over the past decade, samples of Year 3, 5, 7 and 9 students have been assessed in the Monitoring Standards of Education (MSE) programme, which has involved completing a short (two hours, in two parts) design brief including prototype production. For the final form, the Oral presentation/interview, an audio or video interview or oral presentation with a student is digitally recorded under controlled circumstances, following a pre-determined script of prompts and/or questions. Clearly the quality of the audio recording is critical, so it is likely to require the use of a radio microphone attached to the student or placed directly in front of the student.

Digital Representations of Assessment Tasks

In order to judge student performance, that performance needs either to be viewed or to be represented in some form. This may involve the assessor viewing a student performing, such as in a musical recital, or viewing the results of a student performing, such as in an art exhibition. Most often the latter occurs because this is either more appropriate or more cost-effective. In places such as WA the inclusion of either type of assessment for high-stakes purposes has been rare due to the costs and logistics involved. For example, student performance in conducting science experiments has not been included because of the difficulty in supervising the students and viewing their work, and production in design and technology, or home economics-related areas, has not been included because the products are bulky and therefore difficult for assessors to access. However, many forms of student performance can be recorded in digital representations using video, audio, photographs or scanned documents, and some student work is created in digital format using computer software. In these cases the representations of student work can be made available to assessors relatively easily using digital repositories and computer networks. As in most areas of education, and particularly for assessment, authorities and/or researchers in many localities have developed guidelines for the use of digital technologies with assessment processes. For example, the British Psychological Society published a set of general guidelines for the use of "Computer-Based Assessments" through its Psychological Testing Centre (The British Psychological Society, 2002). These guidelines cover the use of digital technologies in Assessment Generation, Assessment Delivery, Assessment Scoring and Interpretation, and Storage, Retrieval and Transmission, and are defined from a developer and user perspective. Similarly, the Council of the International Test Commission developed international guidelines for good practice in computer-based and Internet-delivered testing (The Council of the International Test Commission, 2005). These were focussed on four issues: the technology, the quality of the testing, the control of the test environment, and the security of the testing.
The contexts considered all involved students sitting at a computer to complete a test. Irrespective of whether digital technologies are used, the quality of the assessment task itself is vital and therefore the design of digital forms of assessment needs to start with the task itself. Boud (2009) suggests ten principles pertinent to a 'practice' perspective of assessment; these provide a valuable backdrop to this project although some have reduced potential for purely summative assessment.

1. Locating assessment tasks in authentic contexts.
2. Establishing holistic tasks.
3. Focusing on the processes required for a task.
4. Learning from the task.
5. Having consciousness of the need for refining the judgements of students.
6. Involving others in assessment activities.
7. Using standards appropriate to the task.
8. Linking activities from different courses.
9. Acknowledging student agency.
10. Building an awareness of co-production.

The first three and the seventh and ninth guided the development of the assessment tasks in all four courses in the project. There was an attempt to ensure the fourth and fifth were incorporated in tasks in all courses, and to some extent the sixth and tenth were represented in the PES and Engineering tasks. Dochy (2009) presents five characteristics of new assessment tasks: the construction of knowledge; the application of knowledge; multiple perspectives and context sensitivity; the active involvement of students; and integration with the learning process. All assessment items are required to be valid, educative, explicit, fair and comprehensive, and should allow for reliable marking. The descriptions of the digital assessment tasks below assume this but focus on any areas that pose a particular challenge for that assessment type.

Guidelines Specific to Computer-Based Exams

Computer-based exams involve students sitting at computer workstations completing tasks, including typing answers to questions. They may be required to use various pieces of software to create digital products or may simply use a browser to complete response-type assessment. In AIT, while both types of assessment activity may be involved, it is likely that the focus would be on the former. Taylor (2005) discusses three delivery methods: stand-alone, LAN, and web-based. Both stand-alone (using USB flash drives) and web-based models were considered suitable in AIT. The International Test Commission has provided detailed guidelines for computer-based exams (The Council of the International Test Commission, 2005). These guidelines were specific to test developers, test publishers and users, and mainly related to response-type assessment. An array of specific guidelines was presented according to the following structure.

1. Give due regard to technological issues in Computer Based Testing (CBT) and Internet testing
   a. Give consideration to hardware and software requirements
   b. Take account of the robustness of the CBT/Internet test
   c. Consider human factor issues in the presentation of material via computer or the Internet
   d. Consider reasonable adjustments to the technical features of the test for those with disabilities
   e. Provide help, information, and practice items within the CBT/Internet test
2. Attend to quality issues in CBT and Internet testing
   a. Ensure knowledge, competence and appropriate use of CBT/Internet testing
   b. Consider the psychometric qualities of the CBT/Internet test
   c. Where the CBT/Internet test has been developed from a paper and pencil version, ensure that there is evidence of equivalence
   d. Score and analyse CBT/Internet testing results accurately
   e. Interpret results appropriately and provide appropriate feedback
   f. Consider equality of access for all groups
3. Provide appropriate levels of control over CBT and Internet testing
   a. Detail the level of control over the test conditions
   b. Detail the appropriate control over the supervision of the testing
   c. Give due consideration to controlling prior practice and item exposure
   d. Give consideration to control over test-taker's authenticity and cheating
4. Make appropriate provision for security and safeguarding privacy in CBT and Internet testing
   a. Take account of the security of test materials
   b. Consider the security of test-taker's data transferred over the Internet
   c. Maintain the confidentiality of test-taker results

Clearly many of the guidelines apply generally to any test-taking context (e.g., 2d, 2e and 2f), whether on computer or not. Many of the other guidelines were not applicable to the current project (e.g., 4a, b and c) because only single classes and their teachers in particular schools were involved. However, many of the guidelines in the first three areas were relevant to one or more of the cases in the project. For example, some of the guidelines associated with 1a, 1b, 2a and 2b were relevant, and to some extent some guidelines associated with 3a, 3b and 3d were relevant. Even so, they were mainly relevant to the implementation of large-scale online testing. More recently there has been increased international interest in computer-based testing to assess ICT capability, which is more relevant to the AIT course. For example, over the past year an international research project, the Assessment and Teaching of 21st Century Skills project, has commenced, supported by Cisco, Intel and Microsoft. There have also been trials of such tests in a number of countries including the UK, Norway, Denmark, the USA and Australia (MCEETYA, 2007). In Australia ACER used a computer-based test to assess the ICT literacy of Year 6 and 10 students. They developed the test around a simulated ICT environment and implemented the test using sets of networked laptop computers. While they successfully implemented the test with over 7000 students, this took place over a long period of time and would not be scalable for an AIT examination. Also, the use of a simulated environment would be expensive and could not be scaled to provide a great enough variety of activities each year. The trial in the UK also involved a multi-million-pound simulated system but was accessed by students through their school computers. In the Norwegian example students used their own government-provided notebook computers. In the USA a decision has been made to include an ICT literacy test in national testing in 2012, but in a number of states there are already such tests. Performance tasks and Production exams are not necessarily computer-based. It is generally recommended that the tasks be clearly defined and limited, the work environment be narrowly prescribed (e.g., access to prescribed information or tools), and the required work output be well defined.
The areas of concern are: ensuring that they are fair to all students in terms of access to information, materials and tools; that they are valid in assessing what is intended; and that they provide for reliable marking given the usually varied types of student work output. Therefore it is often recommended that the assessment task be well bounded, the work environment be limited (e.g., access to a limited set of information or tools), the time available be controlled, student work be invigilated, and the required work output be well defined.

Guidelines Specific to Digital Portfolios

The main concerns with the use of digital portfolios for assessment are:
– The authentication of student work, given the period of time within which work is completed
– Ensuring that they are fair to all students in terms of access to information, materials and tools
– That they can be marked reliably given the usually varied types of student work output.

Therefore it is often recommended that the portfolio require a particular structure and limit the contents in type and size, the time available be controlled, and the work be authenticated by a teacher and the students. In a review of e-assessment it was suggested that a digital portfolio may involve three sections: student self-awareness; student interaction; and thinking about futures and informed decisions (Ridgway et al., 2006). In British Columbia, Canada, students complete a graduation portfolio. They are provided with a number of guides as Word documents that act as templates to construct their portfolios. Carney (2004) developed a set of critical dimensions of variation for digital portfolios:

1. Purpose(s) of the portfolio;
2. Control (who determines what goes into the portfolio and the degree to which this is specified);
3. Mode of presentation (portfolio organisation and format; the technology chosen for authoring);
4. Social Interaction (the nature and quality of the social interaction throughout the portfolio process);
5. Involvement (Zeichner & Wray identify the degree of involvement by the cooperating teacher as important for preservice portfolios; when considered more broadly, other important portfolio participants might include university teachers, p-12 students and parents, and others); and
6. Use (can range from low-stakes celebration to high-stakes assessment).

The study considered the following suggestions by Barrett (2005):

Identify tasks or situations that allow one to assess students' knowledge and skills through both products and performance. Create rubrics that clearly differentiate levels of proficiency. Create a record keeping system to keep track of the rubric/evaluation data based on multiple measures/methods. (p. 10)

She goes on to suggest that for "Portfolios used for Assessment of Learning", that is, for summative assessment, the following are defining characteristics:

– Purpose of portfolio prescribed by institution
– Artefacts mandated by institution to determine outcomes of instruction
– Portfolio usually developed at the end of a class, term or program – time limited
– Portfolio and/or artefacts usually "scored" based on a rubric and quantitative data is collected for external audiences
– Portfolio is usually structured around a set of outcomes, goals or standards
– Requires extrinsic motivation
– Audience: external – little choice
Beetham (n.d.) finds that e-portfolios are "less intimidating for some learners than a traditional examination" and provide evidence that gives a "much richer picture of learners' strengths and achievements than, for example, a test score" (p. 4). She points to the need for web-based relational database systems to implement portfolios. While she points out that in the past e-portfolios have been found to take longer to moderate and mark, this has become more streamlined where they are part of an "integrated assessment facility"; she provides five commercial examples of such systems. She provides a list of "issues relating to the use of e-portfolios for summative assessment" (p. 5). Seven of the nine issues are technical and most are addressed by the use of a good assessment management system. The remaining issues are:

– acceptability and credibility of data authenticated by Awarding Bodies
– designing assessment strategies to make effective use of the new tools and systems
– ensuring enhanced outcomes for learners, for example, higher motivation, greater choice over evidence, assessment around capabilities and strengths.

She also raises some issues for teachers and learners:

– fit with existing practices and expectations
– access and ICT capability of teachers and learners
– acceptability and appropriateness of e-portfolio use. (p. 16)

For courses such as AIT it is easy to argue that most of these are not issues, as such practices have been normal in school-based assessment for many years, provided there is a good assessment management system.

Guidelines Specific to Production Exams

Production exams would not necessarily be computer-based; for example, production exams in design and technology need only be represented digitally through records of the performance (e.g., video, photograph, scanned document). The areas of concern with production exams are: ensuring that they are fair to all students in terms of access to information, materials and tools; that they are valid in assessing what is intended; and that they provide for reliable marking given the usually varied types of student work output. Therefore it is often recommended that the assessment task be well bounded, the work environment be limited (e.g., access to a limited set of information or tools), the time available be controlled, student work be invigilated, and the required work output be well defined.

Guidelines Specific to Recorded Interviews or Oral Presentations

The main concerns are for the quality of the audio recording and for the comfort of the student. Clearly the quality of the audio recording is critical, so it is likely to require the use of a radio microphone attached to the student or placed directly in front of the student. If the student is to perform as close as possible to their best, it is important that they feel comfortable and confident in the immediate environment. This could be supported by providing the student with opportunities to practise under similar conditions and by using an environment that is familiar and supportive of the student, such as the regular classroom or an interview room at the school.

METHODS OF MARKING

Task assessment is what is commonly referred to as 'marking'. Once students have completed the assessment task, the output needs to be judged by some method to determine a score, grade or ranking.
Three methods of marking are considered here: 'traditional' true score marking, judgements using standards-based frameworks, and comparative pairs judgements. Traditionally, summative assessment has tended to involve students 'sitting' paper-based exams that are scored by allocating a number to items in the exam and then summing these numbers. This is sometimes called true score marking or cumulative marking. Pollitt (2004) argues that current methods of summative assessment that focus on summing scores on "micro-judgements" are "dangerous and that several harmful consequences are likely to follow" (p. 5). Further, he argues that it is unlikely that such a process will accurately measure a student's "performance or ability" (p. 5). He claims that this has been tolerated because assessment validity has been overshadowed by reliability due to the difficulty and expense in addressing the former compared with the latter. Standards-referenced or standards-based frameworks and rubrics have been used for many years by teachers in Western Australia and other localities to mark student work, but have less often been used for summative high-stakes marking. This involves the definition of standards of achievement against which to compare the work of students. Typically this is operationalised for a particular assessment task through a rubric that describes these standards according to components of the task. The results may be represented as a set of levels of achievement or may be combined by converting these to numbers and adding them. However, using Rasch Modelling they may be combined to create an interval scale score. This report will refer to this approach as analytical marking. Comparative pairs marking involves a number of assessors making judgements on achievement by comparing each student's work with that of other students, considering a pair of students at a time and indicating the better of the two. This is sometimes referred to as pairwise comparisons or cumulative comparisons. Sadler (2009) suggests that assessing complex learning requires tasks calling for divergent responses, which in turn require marking based on qualitative judgement. Such judgements may be facilitated by either a holistic or an analytical approach, with the difference being one of granularity. He claims there has been a gradual swing towards analytical approaches in the pursuit of objectivity (i.e., reliability of measure).

Standards Referenced Analytical Marking

In a report for the Curriculum Council of WA, Prof Jim Tognolini states that "One of the main advantages of a standards-referenced assessment system is that the results can indicate what it is students have achieved during the course" and that the system can "at the same time, use the same scores for university entrance purposes". Further, he explains that this provides students with "a meaningful record of their achievements" and this will "facilitate smoother entry through different pathways into higher education and the workforce". He points out that all Australian states and many international systems, including the Baccalaureate and PISA, have a standards-referenced curriculum. He defines it as "where educational outcomes are clearly and unambiguously specified" and claims this has "significant power and appeal in more globalised contexts", providing a "mechanism for tracking and comparing outcomes over time and across jurisdictions". In Western Australia this is now sometimes also referred to as 'analytical' marking.
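To make the mechanics of analytical marking concrete, the following sketch converts the level awarded on each rubric criterion to a number and combines the results as a weighted sum. The criteria, level labels and weights are invented for illustration and are not the rubrics used in the study; note also that a simple sum of this kind yields only a raw (ordinal) score, whereas Rasch Modelling would be required to produce the interval-scale measure referred to above.

```python
# Illustrative sketch only: combining standards-referenced rubric judgements into a
# raw analytical mark. Criteria, level labels and weights are hypothetical; a weighted
# sum gives a raw score, not the interval-scale measure Rasch Modelling would produce.
RUBRIC = {
    # criterion: achievement levels ordered from lowest to highest
    "investigation": ["limited", "adequate", "proficient", "excellent"],
    "design":        ["limited", "adequate", "proficient", "excellent"],
    "production":    ["limited", "adequate", "proficient", "excellent"],
    "evaluation":    ["limited", "adequate", "proficient", "excellent"],
}
WEIGHTS = {"investigation": 1.0, "design": 1.5, "production": 1.5, "evaluation": 1.0}

def raw_analytical_mark(judgements):
    """Convert each criterion's awarded level to its position on the standards
    scale and combine the positions as a weighted sum."""
    total = 0.0
    for criterion, level in judgements.items():
        level_score = RUBRIC[criterion].index(level)  # 0..3 on the standards scale
        total += WEIGHTS[criterion] * level_score
    return total

# Example: one assessor's judgements for a single student's work output
print(raw_analytical_mark({
    "investigation": "proficient",
    "design": "excellent",
    "production": "adequate",
    "evaluation": "proficient",
}))  # prints 10.0 with the weights above
```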
Sadler (2009) explains that there are two analytic assessment schemes: analytical rating scales and analytic rubrics. The latter was employed in this project. The word "rubric" is a derivative of the Latin word ruber, meaning "red". In literary history, rubrics are margin notes in texts giving a description of, or common examples for or about, the passage (Wiggins, 1998). The current research literature on marking keys promotes the use of criterion- or rubric-based marking keys to enhance transparency, reliability and, when the task is aligned with the learning outcomes, also validity (Andrade, 2005; Tierney & Marielle, 2004). In current usage, a rubric is a guide listing criteria used for rating performance (Wiggins, 1998). Marking using a rubric based on a standards framework requires assessors to compare a student's work against a set of theoretical standards separated into criteria (Sadler, 2009). Standards are described using either quantifiers or sub-attributes of the criterion. Marking using such descriptions is difficult, requires considerable depth of knowledge and experience, and can still result in different assessors judging the same work differently because they have different standards in mind. This leads to a problem of reliability that is typically overcome by using more than one assessor for each piece of work and then having a consensus process. This may be costly and still somewhat unreliable. Assessment based on a standards framework is not new and has been used in a number of countries for many decades. The best-known example of an assessment based around a standards framework is the testing associated with the National Curriculum in the United Kingdom in the 1980s and 1990s. At the school level, Schagen and Hutchison (1994) found that there were a "… variety of different methods used to award Levels based on marks obtained or performance on Statements of Attainment (SoA)". However, at the national level there are a number of National Curriculum Assessment tests that must be completed by all students of selected ages in the UK. Studies of the reliability of these tests have found that in some of the National Curriculum Assessment tests "pupils of similar ability could be assigned Levels two or more apart" due to statistical error or other factors such as context, test construction, etc. (Schagen & Hutchison, 1994). In a study titled "Assessing Expressive Learning", which involved nearly 2000 art portfolios and the use of rubrics, it was found that "qualitative instructional outcomes can be assessed quantitatively, yielding score values that can be manipulated statistically, and that produce measures that are both valid and reliable estimates of student art performance" (Madeja, 2004).

Comparative Pairs Method of Marking

The comparative pairs judgement method of marking involves Rasch Modelling and was used by Kimbell, Wheeler, Miller and Pollitt (2007) in the e-scape (e-solutions for creative assessment in portfolio environments) project, delivering high assessor reliability. Pollitt (2004) describes the comparative pairs method of marking applied to performance assessment in his paper, "Let's stop marking exams". He claims the method he and his colleagues have developed is "intrinsically more valid" and is "rooted in the psychophysics of the 1920s" (p. 2).
He goes on to explain that while the system is better than the traditional system, to this stage it has not been feasible to apply due to time and cost constraints; however, with the use of ICT to support the system these constraints are removed and "Thurstone's methods" that "have waited 80 years … are at last … feasible" (p. 21). He quotes Laming that there is "no absolute judgement. All judgements are comparisons of one thing with another" and explains that it is more reliable to compare performances or products between students than with "descriptions of standards" (p. 6). He claims that they have more than ten years' experience in applying the method in a variety of contexts and that with expert application about 20 comparisons per student are required. However, he does suggest that the method should not be used with every type of assessment, with research required to determine the appropriateness and whether "sufficient precision can be achieved without excessive cost" (p. 16). A description of the mathematics behind the method, and how it was implemented in the online system developed for the e-scape project, is provided by Pollitt (2012). McGaw (2006) also believes that the comparative pairs method of marking provides an opportunity to improve the validity of high-stakes assessment by separating the "calibration of a scale and its use in the measurement of individuals" (p. 6). He claims that while the "deficiency" of norm-referenced assessment has been understood for many years it was seen that there was no alternative. Now he believes that with methods involving comparisons being supported by digital technologies there is an alternative that should be explored.

An important question is whether the advances in psychometrics that permit calibration of scales and measurement of individuals that allows interpretation of performance in terms of scales can be applied in public examination. (McGaw, 2006, p. 7)

The comparative pairs method of marking necessarily combines the judgements of a group of assessors, which could be seen as what Boud (2009, p. 30) refers to as a "community of judgement". He also explains that if assessment is to be shaped by practice, as related to everyday world activity, or situated action, then a holistic conception must be applied. Although he calls for a move away from "measurement-oriented views of assessment" (p. 35), in fact the comparative pairs method of marking provides holistic judgement, on situated activity, by a community of practice, while still meeting stringent measurement requirements.

CONCEPTUAL FRAMEWORK FOR THE STUDY

In order to investigate the use of digital representations to deliver authentic and reliable assessments of performance, this study brought together three key innovations:

1. The representation in digital files of the performance of students doing practical work.
2. The presentation of digital representations of student performance in an online repository so that they are easily accessible to markers.
3. Assessing the digital representations of student performance using both analytical standards-referenced judgement and the paired comparison judgement methods, with holistic and component criteria-based judgements.

While none of these innovations is new in itself, their combination applied at the secondary level of education is new. Apart from Kimbell's (2007) work at the University of London there was no known precedent.
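To give a concrete sense of how the holistic paired comparisons described above can be turned into a measurement scale, the sketch below estimates a relative quality parameter for each student from a set of 'better than' judgements using a simple Bradley–Terry style iteration, a close relative of the Thurstone and Rasch approaches referred to earlier. The judgement data and student names are invented, and the study itself used purpose-built marking software and Rasch analysis rather than this code.

```python
# Illustrative sketch only: estimating relative quality parameters from pairwise
# "better than" judgements using a Bradley-Terry style iteration. The judgements
# are invented; the study relied on Rasch analysis within purpose-built tools.
from collections import defaultdict

# Each tuple records (winner, loser) from one assessor's holistic comparison.
judgements = [
    ("alice", "ben"), ("alice", "cara"), ("cara", "ben"), ("dan", "ben"),
    ("alice", "dan"), ("cara", "dan"), ("ben", "dan"),
]

students = sorted({s for pair in judgements for s in pair})
wins = defaultdict(int)          # comparisons won by each student
pair_counts = defaultdict(int)   # comparisons made between each unordered pair
for winner, loser in judgements:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

quality = {s: 1.0 for s in students}   # initial quality estimates
for _ in range(100):                   # simple minorisation-maximisation iterations
    new_quality = {}
    for s in students:
        denom = sum(
            pair_counts[frozenset((s, t))] / (quality[s] + quality[t])
            for t in students
            if t != s and frozenset((s, t)) in pair_counts
        )
        new_quality[s] = wins[s] / denom if denom > 0 else quality[s]
    mean = sum(new_quality.values()) / len(new_quality)
    quality = {s: q / mean for s, q in new_quality.items()}  # keep the scale stable

for s in sorted(students, key=quality.get, reverse=True):
    print(f"{s}: {quality[s]:.2f}")   # higher values indicate work judged better
```

In practice many more comparisons per student (Pollitt suggests about 20) and judgements from several assessors would be combined before the resulting scale was interpreted.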
Fundamentally this study investigated the use of digital forms of representation of student practical performance for summative assessment, whether the student created digital files or their performance was recorded in digital format by filming, photographing, audio recording or scanning. The digital representations of student performance were combined within an online repository. The use of online repositories for student work output is increasingly common, often referred to as online portfolios, with many products available to facilitate their creation and access (Richardson & Ward, 2005). The key feature is that the portfolios can be accessed from anywhere and thus markers from different jurisdictions can be involved, enhancing consistency of standards. The paired comparison judgement method of marking, involving Rasch Modelling, was implemented using holistic judgements. While Pollitt (2004) describes the method as "intrinsically more valid" and better than the traditional system, he believes that without some ICT support it has not been feasible to apply due to time and cost constraints, and he does suggest that further research is required to determine the appropriateness and whether "sufficient precision can be achieved without excessive cost" (p. 16). McGaw (2006) believes that such methods being supported by digital technologies should be applied in public examinations. The diagram in Figure 2.1 represents the main concepts involved in assessment, with the study focussing initially on the Assessment Task and thereby the Method of Assessment and the Student Work itself. However, to investigate the achievement of the desired performance indicators that relate to Kimbell's feasibility framework the study was also involved with Task Assessment, in particular marking activities using standards frameworks and the use of the comparative pairs marking approach.

[Figure 2.1 places ASSESSMENT at the centre, linking: the assessment task (what the student does); the student work (task or object); task assessment (what the assessor does); the method of assessment (the means of assessing learning); desired learning outcomes and institutional goals; marking activities (moderation, marking and grading/reporting); marking criteria (e.g. marking schemes and guides); assessor skills/knowledge (e.g. content, standards); desired performance indicators (valid, reliable, authentic, transparent, fair); considerations of manageability, technology adequacy, functional acceptability and pedagogic value; quality assurance, training and performance support at all levels and stages of the assessment process; and management and administration, required in all aspects of assessment and between assessments for all stakeholders.]

Figure 2.1. A diagrammatic representation of a conceptual framework for the assessment of performance. (Based on the work of Campbell, 2008.)