HMEF5053 Measurement and Evaluation in Education

Prof Dr John Arul Phillip
Yap Yee Khiong

Project Directors: Prof Dr Widad Othman and Dr Aliza Ali, Open University Malaysia
Module Writers: Prof Dr John Arul Phillip and Yap Yee Khiong
Moderator: Prof Dr Kuldip Kaur, Open University Malaysia
Enhancer: Assoc Prof Dr Chung Han Tek
Developed by: Centre for Instructional Design and Technology, Open University Malaysia

First Edition, May 2006
Fourth Edition, August 2016 (rs)
Sixth Edition, December 2019 (MREP)

Copyright © Open University Malaysia (OUM), December 2019, HMEF5053. All rights reserved. No part of this work may be reproduced in any form or by any means without the written permission of the President, Open University Malaysia (OUM).

Table of Contents

Course Guide

Topic 1: The Role of Assessment in Teaching and Learning
  1.1 Test, Measurement and Assessment
    1.1.1 Test
    1.1.2 Measurement
    1.1.3 Assessment or Evaluation
  1.2 The Why, What and How of Educational Assessment
  1.3 Purposes of Assessment
    1.3.1 To Help Learning
    1.3.2 To Improve Teaching
  1.4 General Principles of Assessment
  1.5 Types of Assessment
    1.5.1 Formative versus Summative Assessments
    1.5.2 Norm-referenced versus Criterion-referenced Tests
  1.6 Trends in Assessment
  Summary
  Key Terms
  References

Topic 2: Foundation of Assessment: What to Assess?
  2.1 Identifying What to Assess
    2.1.1 Three Types of Learning Outcomes
  2.2 Assessing Cognitive Learning Outcomes or Behaviour
    2.2.1 Bloom's Taxonomy
    2.2.2 The Helpful Hundred
  2.3 Assessing Affective Learning Outcomes or Behaviour
  2.4 Assessing Psychomotor Learning Outcomes or Behaviour
  2.5 Important Trends in What to Assess
  Summary
  Key Terms
  References

Topic 3: Planning Classroom Tests
  3.1 Purposes of Classroom Testing
  3.2 Planning a Classroom Test
    3.2.1 Deciding Its Purposes
    3.2.2 Specifying the Intended Learning Outcomes
    3.2.3 Selecting Best Item Types
    3.2.4 Developing a Table of Specifications
    3.2.5 Constructing Test Items
    3.2.6 Preparing Marking Schemes
  3.3 Assessing Teacher's Own Test
  Summary
  Key Terms

Topic 4: How to Assess? – Objective Tests
  4.1 What is an Objective Test?
  4.2 Multiple-choice Questions (MCQs)
    4.2.1 What is a Multiple-choice Question?
    4.2.2 Construction of Multiple-choice Questions
    4.2.3 Advantages of Multiple-choice Questions
    4.2.4 Limitations of Multiple-choice Questions
  4.3 True-False Questions
    4.3.1 What are True-False Questions?
    4.3.2 Advantages of True-False Questions
    4.3.3 Limitations of True-False Questions
    4.3.4 Suggestions for Constructing True-False Questions
  4.4 Matching Questions
    4.4.1 Construction of Matching Questions
    4.4.2 Advantages of Matching Questions
    4.4.3 Limitations of Matching Questions
    4.4.4 Suggestions for Constructing Good Matching Questions
  4.5 Short-answer Questions
    4.5.1 Strengths and Weaknesses of Short-answer Questions
    4.5.2 Guidelines on Constructing Short-answer Questions
  Summary
  Key Terms
  References

Topic 5: How to Assess? – Essay Tests
  5.1 What is an Essay Question?
  5.2 Formats of Essay Tests
  5.3 Advantages of Essay Questions
  5.4 Deciding Whether to Use Essay Questions or Objective Questions
  5.5 Limitations of Essay Questions
  5.6 Misconceptions About Essay Questions in Examinations
  5.7 Guidelines on Constructing Essay Questions
  5.8 Verbs Describing Various Kinds of Mental Tasks
  5.9 Marking an Essay
  5.10 Suggestions for Marking Essays
  Summary
  Key Terms
  References

Topic 6: Authentic Assessment
  6.1 What is Authentic Assessment in the Classroom?
  6.2 Alternative Names for Authentic Assessment
  6.3 How to Use Authentic Assessment?
  6.4 Advantages of Authentic Assessment
  6.5 Disadvantages of Authentic Assessment
  6.6 Characteristics of Authentic Assessment
  6.7 Differences between Authentic and Traditional Assessments
  Summary
  Key Terms
  References

Topic 7: Project and Portfolio Assessments
  7.1 Project Assessment
    7.1.1 What is Assessed Using Projects?
    7.1.2 Designing Effective Projects
    7.1.3 Possible Problems with Project Work
    7.1.4 Group Work in Projects
    7.1.5 Assessing Project Work
    7.1.6 Evaluating Process in a Project
    7.1.7 Self-assessment in Project Work
  7.2 What is a Portfolio?
    7.2.1 What is Portfolio Assessment?
    7.2.2 Types of Portfolios
    7.2.3 Developing a Portfolio
    7.2.4 Advantages of Portfolio Assessment
    7.2.5 Disadvantages of Portfolio Assessment
    7.2.6 How and When Should Portfolios be Assessed?
  Summary
  Key Terms
  References

Topic 8: Reliability and Validity of Assessment Techniques
  8.1 What is Reliability?
  8.2 The Reliability Coefficient
  8.3 Methods to Estimate the Reliability of a Test
  8.4 Inter-rater and Intra-rater Reliability
  8.5 Types of Validity
  8.6 Factors Affecting Reliability and Validity
  8.7 Relationship between Reliability and Validity
  Summary
  Key Terms
  References

Topic 9: Item Analysis
  9.1 What is Item Analysis?
  9.2 Steps in Item Analysis
  9.3 Difficulty Index
  9.4 Discrimination Index
  9.5 Application of Item Analysis on Essay-type Questions
  9.6 Relationship between Difficulty Index and Discrimination Index
  9.7 Distractor Analysis
  9.8 Practical Approach in Item Analysis
  9.9 Usefulness of Item Analysis to Teachers
  9.10 Caution in Interpreting Item Analysis Results
  9.11 Item Bank
  9.12 Psychometric Software
  Summary
  Key Terms
  References

Topic 10: Analysis of Test Scores
  10.1 Why Use Statistics?
  10.2 Describing Test Scores
    10.2.1 Central Tendency
    10.2.2 Dispersion
  10.3 Standard Scores
    10.3.1 Z-score
    10.3.2 Example of Using the Z-score to Make Decisions
    10.3.3 T-score
  10.4 The Normal Curve
  10.5 Norms
  Summary
  Key Terms

COURSE GUIDE

DESCRIPTION

You must read this Course Guide carefully from the beginning to the end. It tells you briefly what the course is about and how you can work your way through the course material. It also suggests the amount of time you are likely to spend in order to complete the course successfully.
Please refer to the Course Guide from time to time as you go through the course material, as it will help you to clarify important study components or points that you might miss or overlook.

INTRODUCTION

HMEF5053 Measurement and Evaluation in Education is one of the courses offered at Open University Malaysia (OUM). This course is worth three credit hours and should be covered over eight to 15 weeks.

COURSE AUDIENCE

This course is a core subject for learners undertaking the Master of Education (MEd) programme. Its main aim is to provide you with a foundation in the principles and theories of educational testing and assessment as well as their applications in the classroom.

The course introduces the differences between testing, measurement and assessment. The focus is on the role of assessment in teaching and learning, followed by discussions on "what to assess" and "how to assess". Regarding the "what", the emphasis is on the cognitive, affective and psychomotor learning outcomes. Next, the "how" of assessment is discussed with emphasis on the various assessment techniques that teachers can adopt. Besides the usual traditional assessment techniques such as objective and essay tests, authentic assessment techniques such as projects and portfolios are presented. There is also a general discussion on how authentic assessment is similar to and different from traditional assessment. Also discussed are techniques to determine the effectiveness of various assessment approaches, focusing on reliability, validity and item analysis. Finally, various statistical procedures are presented in the analysis of assessment results and their interpretation.

As an open and distance learner, you should be acquainted with learning independently and being able to optimise the learning modes and environment available to you. Before you begin this course, please ensure that you have the right course material, and understand the course requirements as well as how the course is conducted.

STUDY SCHEDULE

It is a standard OUM practice that learners accumulate 40 study hours for every credit hour. As such, for a three-credit hour course, you are expected to spend 120 study hours. Table 1 gives an estimation of how the 120 study hours could be accumulated.

Table 1: Estimation of Time Accumulation of Study Hours

Study Activities (Study Hours)
Briefly go through the course content and participate in initial discussions: 4
Study the module: 60
Attend five tutorial sessions: 15
Online participation: 11
Revision: 15
Assignment(s), Test(s) and Examination(s): 15
TOTAL STUDY HOURS ACCUMULATED: 120

COURSE LEARNING OUTCOMES

By the end of this course, you should be able to:
1. Identify the different principles and theories of educational testing and assessment;
2. Compare the different principles, theories and procedures of educational testing and assessment;
3. Apply the different principles and theories in the development of assessment techniques for use in the classroom; and
4. Critically evaluate the principles and theories in educational testing and assessment.

COURSE SYNOPSIS

This course is divided into 10 topics. The synopsis for each topic is as follows:

Topic 1 discusses the differences between testing, measurement, evaluation and assessment, the role of assessment in teaching and learning, and some general principles of assessment.
Also explored is the difference between formative and summative assessments, as well as the differences between criterion-referenced and norm-referenced tests. The topic concludes with a brief discussion of the current trends in assessment.

Topic 2 discusses the behaviours to be tested, focusing on cognitive, affective and psychomotor learning outcomes and the reasons why assessments of the latter two outcomes are often ignored.

Topic 3 provides some useful guidelines to help teachers plan valid, reliable and useful classroom tests. It discusses the steps involved in planning and designing a test. These steps are deciding the purpose, specifying the intended learning outcomes, selecting the best item types, developing a table of specifications, constructing test items and preparing marking schemes. The topic also includes a subtopic on how teachers can assess their own tests.

Topic 4 discusses the design and development of objective tests in the assessment of various kinds of behaviours, with emphasis on the limitations and advantages of using this type of assessment tool.

Topic 5 examines the role of essay tests in assessing various kinds of learning outcomes, their limitations and strengths, and the procedures involved in the design of good essay questions.

Topic 6 introduces a form of assessment in which learners are assigned to perform real-world tasks that demonstrate meaningful application of essential knowledge and skills. Teachers will be able to understand how authentic assessment is similar to or different from traditional assessment. Emphasis is also given to scoring rubrics.

Topic 7 discusses in detail two examples of authentic assessment, namely portfolio and project assessments. Guidelines for portfolio entries and project work, as well as evaluation criteria, are discussed in detail.

Topic 8 focuses on the basic concepts of test reliability and validity. The topic also includes methods to estimate the reliability of a test and factors to increase the reliability and validity of a test.

Topic 9 examines the concept of item analysis and the different procedures for establishing the effectiveness of objective and essay-type tests, focusing on item difficulty and item discrimination. The topic concludes with a brief discussion on the usefulness of item analysis and the cautions in interpreting item analysis results.

Topic 10 focuses on the analysis and interpretation of the data collected by tests. For quantitative analysis of data, various statistical procedures are used. Some of the statistical procedures used in the interpretation and analysis of assessment results are measures of central tendency and correlation coefficients. There is also a brief discussion on the use of standard scores.

TEXT ARRANGEMENT GUIDE

Before you go through this module, it is important that you note the text arrangement. Understanding the text arrangement will help you to organise your study of this course in a more objective and effective way. Generally, the text arrangement for each topic is as follows:

Learning Outcomes: This section refers to what you should achieve after you have completely covered a topic. As you go through each topic, you should frequently refer to these learning outcomes. By doing this, you can continuously gauge your understanding of the topic.

Self-Check: This component of the module is inserted at strategic locations throughout the module. It may be inserted after one sub-section or a few sub-sections. It usually comes in the form of a question.
When you come across this component, try to reflect on what you have already learnt thus far. By attempting to answer the question, you should be able to gauge how well you have understood the sub-section(s). Most of the time, the answers to the questions can be found directly from the module itself.

Activity: Like Self-Check, the Activity component is also placed at various locations or junctures throughout the module. This component may require you to solve questions, explore short case studies, or conduct an observation or research. It may even require you to evaluate a given scenario. When you come across an Activity, you should try to reflect on what you have gathered from the module and apply it to real situations. You should, at the same time, engage yourself in higher order thinking where you might be required to analyse, synthesise and evaluate instead of only having to recall and define.

Summary: You will find this component at the end of each topic. This component helps you to recap the whole topic. By going through the summary, you should be able to gauge your knowledge retention level. Should you find points in the summary that you do not fully understand, it would be a good idea for you to revisit the details in the module.

Key Terms: This component can be found at the end of each topic. You should go through this component to remind yourself of important terms or jargon used throughout the module. Should you find terms here that you are not able to explain, you should look for the terms in the module.

References: The References section is where a list of relevant and useful textbooks, journals, articles, electronic contents or sources can be found. The list can appear in a few locations such as in the Course Guide (at the References section), at the end of every topic or at the back of the module. You are encouraged to read or refer to the suggested sources to obtain the additional information needed and to enhance your overall understanding of the course.

PRIOR KNOWLEDGE

Although this course assumes no previous knowledge of educational assessment and measurement, you are encouraged to tap into your experiences as a teacher, instructor, lecturer or trainer and relate them to the principles of assessment and measurement discussed.

ASSESSMENT METHOD

Please refer to myINSPIRE.

TAN SRI DR ABDULLAH SANUSI (TSDAS) DIGITAL LIBRARY

The TSDAS Digital Library has a wide range of print and online resources for the use of its learners. This comprehensive digital library, which is accessible through the OUM portal, provides access to more than 30 online databases comprising e-journals, e-theses, e-books and more. Examples of databases available are EBSCOhost, ProQuest, SpringerLink, Books247, InfoSci Books, Emerald Management Plus and Ebrary Electronic Books. As an OUM learner, you are encouraged to make full use of the resources available through this library.

Topic 1: The Role of Assessment in Teaching and Learning

LEARNING OUTCOMES

By the end of the topic, you should be able to:
1. Differentiate between test, measurement and assessment;
2. Explain the purposes of assessment;
3. Discuss the general principles of assessment; and
4. Identify four types of assessment and examine their differences.
INTRODUCTION

This topic discusses the differences between test, measurement and evaluation, the purposes of assessment and some general principles of assessment. Also explored are the differences between formative assessment and summative assessment, as well as the differences between criterion-referenced tests and norm-referenced tests.

1.1 TEST, MEASUREMENT AND ASSESSMENT

Many people are confused about the fundamental differences between test, measurement and assessment as they are all used in education. Do you know what they entail? Let us find out the answer in this subtopic.

1.1.1 Test

Most of us are familiar with tests because at some point in our lives, we are required to sit for tests. In school, tests are given to measure our academic aptitude and evaluate whether we have gained any understanding from our learning. In the workplace, tests are conducted to select people for specific jobs, for promotion and to encourage re-learning. Physicians, lawyers, insurance consultants, real estate agents, engineers, civil servants and many other professionals are required to take tests to demonstrate their competence in specific areas and, in some cases, to be granted a licence to practise their profession or trade.

Throughout their professional careers, teachers, counsellors and school administrators are required to give, score and interpret a wide variety of tests. For example, school administrators rate the performance of individual teachers and school counsellors record the performance of their clients. It is possible that a teacher may construct, administer and mark thousands of tests during his or her career!

According to the joint committee of the American Psychological Association (APA), the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), a test may be thought of as a set of tasks or questions intended to elicit particular types of behaviour when presented under standardised conditions to yield a score that has desirable psychometric properties. Psychometrics is concerned with the objective measurement of skills and knowledge, abilities, attitudes, personality traits and educational achievement. So, when a teacher assigns a set of questions to determine students' achievement in Mathematics, he or she is conducting a Mathematics test.

While most people know what a test is, many have difficulty differentiating it from measurement, evaluation and assessment. Some have even argued that they are the same!

1.1.2 Measurement

Generally, measurement is the act of assigning numbers to a phenomenon. In education, it is the process by which the attributes of a person are measured and assigned numbers. Remember, it is a process, which indicates that there are certain steps involved. As educators, we frequently measure human attributes such as attitudes, academic achievements, aptitudes, interests, personalities and so forth. Hence, to measure these attributes, we have to use certain instruments so that we can conclude that, for example, Ahmad is better in Mathematics than Kumar, while Tajang is more inclined towards Science than Kong Beng. We measure to obtain information about "what is". Such information may or may not be useful, depending on the accuracy of the instruments we use and our skill at using them.
For example, we measure temperature using a thermometer, so the thermometer is an instrument. How do you measure performance in Mathematics? We use a Mathematics test, which is an instrument containing questions and problems to be solved by students. The number of right responses obtained by a student is an indication of his or her performance in Mathematics. Note that we are only collecting information. The information collected is a numerical description of the degree to which an individual possesses an attribute. Measurement answers the question "How much of a particular attribute does an individual possess?" Note that we are not assessing! Assessment is therefore quite different from measurement.

1.1.3 Assessment or Evaluation

The literature has used the terms "assessment" and "evaluation" in education as two different concepts, and yet the two terms are used interchangeably, i.e. they are regarded as similar. For example, some authors use the term "formative evaluation" while others use the term "formative assessment". We will use the two terms interchangeably because there is too much overlap in the interpretations of the two concepts. In this module, we will use the term "assessment".

Generally, assessment is viewed as the process of collecting information with the purpose of making decisions about students. We may collect information using various tests, observations of students and interviews. Rowntree (1974) views assessment as a human encounter in which one person interacts with another, directly or indirectly, with the purpose of obtaining and interpreting information about the knowledge, understanding, abilities and attitudes possessed by that person. For example, based on assessment information, we can determine whether Chee Keong needs special classes to assist him in developing reading skills or whether Khairul, who was identified as dyslexic, needs special attention.

The key words in the definition of assessment are collecting data and making decisions. However, to make decisions, one has to evaluate, which is the process of making a judgement about a given situation. When we evaluate, we are saying that something is good, appropriate, valid, positive and so forth. To make an evaluation, we need information, and it is obtained by measuring using a reliable instrument. For example, you measure the temperature in the classroom and it is 30°C, which is simply information. Some students may find the temperature in the room too warm for learning, while others may say that it is ideal for learning.

Educators are constantly evaluating students, and it is usually done in comparison with some standards. For example, if the objective of the lesson was for students to apply Boyle's Law to the solution of a problem and 80 per cent of learners were able to solve the problem, then the teacher may conclude that his or her teaching of the principle was quite successful. So, evaluation is the comparison of what is measured against some defined criteria, to determine whether the criteria have been achieved and whether it is appropriate, good, reasonable, valid and so forth.

The three terms "test", "measurement" and "assessment" are easily confused because all may be involved in a single process. For example, to determine a student's performance in Mathematics, a teacher may assign him or her a task or a set of questions, which is a test, to obtain a numeric score, which is a measurement.
Based on the score, the teacher decides whether this particular student is good, average or poor in Mathematics, which is an assessment.

SELF-CHECK 1.1

Explain the differences between test, measurement and assessment.

1.2 THE WHY, WHAT AND HOW OF EDUCATIONAL ASSESSMENT

The practice among many educators who want to assess students is to either use ready-made tests, construct their own tests, or use a combination of both. In the United States, it is common for teachers to use various standardised tests in assessing their students. These tests have national norms, against which teachers can compare the performance of their students. For example, a student scores 78 in a Mathematics test; the score, when compared with the norm, may put the student in the 85th percentile. It means that 15 per cent of students tested earlier scored higher than him or her. It also means that 85 per cent of students tested earlier scored lower than him or her.
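To make the percentile idea concrete, here is a minimal sketch (not part of the original module) of the chain described in this subtopic: a test yields a raw score (measurement), the score is compared against a norm group (the percentile rank), and a decision is then made about the student (assessment). The answer key, the responses, the norm-group scores and the decision thresholds below are all invented purely for illustration.

```python
def raw_score(responses, answer_key):
    # Measurement: count the number of correct responses.
    return sum(1 for given, correct in zip(responses, answer_key) if given == correct)

def percentile_rank(score, norm_scores):
    # Percentage of the norm group who scored below the given score.
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

def decide(rank):
    # Assessment: a deliberately simple, hypothetical decision rule.
    if rank >= 85:
        return "consider an enrichment programme"
    if rank <= 15:
        return "consider remedial support"
    return "progressing as expected"

# Hypothetical data: a 10-item test and a 15-student norm group.
answer_key = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
responses  = ["A", "C", "B", "D", "A", "B", "C", "A", "A", "C"]
norm_scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10]

score = raw_score(responses, answer_key)    # 8 out of 10
rank = percentile_rank(score, norm_scores)  # about the 73rd percentile
print(score, round(rank), decide(rank))     # 8 73 progressing as expected
```

Real norm-referenced tests derive their norms from large, carefully sampled groups and use more refined statistics (see Topic 10); the point here is only the logic of comparing one score against many.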
Figure 1.1 shows the why, what and how of assessment.

Figure 1.1: The why, what and how of assessment

The focus of this course will be on how teachers can build their own assessment instruments, how to ensure their instruments are effective, how to interpret assessment results and how to report these assessment results. In this topic, we will discuss the role of assessment, which is "Why do we assess?" In Topic 2, we will focus on the foundation of assessment, which is "What to assess?" and involves determining the behaviours of students to be assessed. Topic 3 will focus on planning classroom tests. Topics 4 to 7 will address the issue of "How to assess?" In Topics 8 and 9, we will attempt to answer the question "How do we know our assessment is effective?" Finally, in Topic 10, we will deal with the question "How do we interpret assessment results?"

1.3 PURPOSES OF ASSESSMENT

Let us begin this subtopic by asking the question, "Why do we as educators assess students?" Some of us may find the question rather strange. The following may be a likely scenario:

Question: Why do you assess?
Answer: Well, I assess to find out whether my students understand what has been taught.

Question: What do you mean by "understand"?
Answer: Whether they can remember what I taught them and how to solve problems.

Question: What do you do with the test results?
Answer: Well, I give students the right answers and point out their mistakes in answering the questions.

Educators may give the above reasons when asked about the purpose of assessment. In the context of education, assessment is performed to gain an understanding of an individual's strengths and weaknesses in order to make appropriate educational decisions. The best educational decisions are based on information, and better decisions are usually based on more information (Salvia & Ysseldyke, 1995). Based on the reasons for assessment provided by Harlen (1978) and Deale (1975), two main purposes of assessment may be identified (refer to Figure 1.2).

Figure 1.2: Purposes of assessment

These two purposes are further explained in the next subtopics.

1.3.1 To Help Learning

With regard to learning, assessment is aimed at providing information that will help make decisions concerning remediation or enrichment, placement, exceptionality and certification. It also aims at providing information to parents so that they are kept informed of their children's learning progress in school. Likewise, the school administration collects assessment information to determine how the school is performing and for student counselling purposes (refer to Table 1.2).

Table 1.2: Why We Assess: To Help Learning

Diagnosis for remedial action: Should the student be sent for remedial classes so that difficulty in learning can be overcome?
Diagnosis for enrichment: Should the student be provided with enrichment activities?
Exceptionality: Does the student have special learning needs that require special education assistance?
Placement: Should the student be streamed to X or Y?
Progress: To what extent is the student making progress towards specific instructional goals?
Communication to parents: How is the child doing in school and how can parents help?
Certification: What are the strengths and weaknesses in the overall performance of a student in the specific areas assessed?
Administration and counselling: How is the school performing in comparison with other schools? Why should students be referred for counselling?

(a) Diagnosis
Diagnostic assessment is performed at the beginning of a lesson or unit for a particular subject area to assess students' readiness and background for what is about to be taught. This pre-instructional assessment is done when you decide that you need information on a student, a group of students or a whole class before you can proceed with the most effective form of instruction. For example, you can administer a reading test to Year One students to assess their reading level. Based on the information, you may want to assign weak readers to a special intervention or remedial action programme. Alternatively, the tests may reveal that some students are reading at an extremely high level and you may want to recommend that they be assigned to an enrichment programme.

(b) Exceptionality
Assessment is also conducted to make decisions on exceptionality. Based on the information obtained from assessment, teachers may make decisions as to whether a particular student needs to be assigned to a class for exceptional students. Exceptional students are those who are physically, mentally, emotionally or behaviourally different from the normal population. For example, based on assessment information, a child may be discovered to be dyslexic and may be assigned to a special treatment programme. In another example, a student who has been diagnosed as having a learning disability may be assigned to a special education programme.

(c) Certification
Certification is perhaps the most important reason for assessment. For example, Sijil Pelajaran Malaysia (SPM) is an examination aimed at providing students with certificates. The marks obtained are converted into letter grades signifying performance in different subject areas and used as a basis for comparison between students. The certificate obtained is further used in selecting students for further studies, scholarships or jobs.
(d) Placement
Besides certification, assessment is conducted for the purpose of placement. Students are endowed with varying abilities, and one of the tasks of the school is to place them in classes according to their aptitude and interest. For example, performance in the SPM is used as the basis for placing students in the arts or science stream in Form Six. Assessment is also used to stream students according to their academic performance. It has been the tradition that the A and B classes will consist of high achievers from the end-of-semester or end-of-year examinations. Placement tests have even been used in preschools to stream children according to their literacy levels! The practice of placing students according to academic achievement has been debated for decades, with some educationists arguing against it and others supporting it.

(e) Communication to Parents
Families want to know how their children are doing in school, and family members appreciate specific indicators of students' progress. Showing examples of a child's work over time enables parents to personally assess the growth and progress of their child. It is essential to tell the whole story when reporting information about performance progress. Talking with families about standards, sharing student work samples, using rubrics in conferences and differentiating between performance and progress are some ways to ensure that families are given an accurate picture of student learning.

(f) School Administration and Counselling
A school collects assessment information in order to determine how the school is performing in relation to other schools for a particular semester or year. Assessment results are also used to compare performance over the years for the same school. Based on the results, school administrators may institute measures to remedy weaknesses, such as channelling more resources into teaching students who are performing poorly in their studies. This kind of measure is pertinent in view of the increasing number of students who are unable to read and write at a satisfactory level. Assessment results (especially those relating to socio-emotional development) may be used by school administrators and counsellors in planning intervention strategies for at-risk students. Assessment by counsellors will enable them to identify students presenting certain socio-emotional problems that require counselling services or referral to specialists such as psychiatrists, legal counsellors and law enforcement authorities.

1.3.2 To Improve Teaching

With regard to teaching, assessment provides information regarding the achievement of intended learning outcomes and the effectiveness of teaching methods and learning materials. If 70 per cent of your students fail a test, do you investigate whether your teaching and learning strategy is appropriate, or do you attribute it to your students being academically weak or not having revised their work? Most teachers would attribute the poor performance to the latter. This is not a fair judgement about your students' performance and abilities. The problem might lie with the teachers. Assessment information is valuable in indicating which of the learning outcomes have been successfully achieved and which concepts students have the most difficulty with and require special attention.
Assessment results are also valuable in providing clues to the effectiveness of the teaching strategy implemented and the teaching materials used. Besides, assessment information might indicate whether students have the required prior knowledge to grasp the concepts and principles discussed. All this assessment information will indicate to teachers what they should do to improve their teaching. They should reflect on the information and examine their approaches, methods and techniques of teaching. Finally, assessment data may also provide insight into why some teachers are more successful in teaching a particular group of students as compared to others (refer to Table 1.3).

Table 1.3: Why We Assess: To Improve Teaching

Objectives: Were the desired learning outcomes achieved?
Teaching method: Were the teaching methods effective?
Prior knowledge: Did students have the relevant prior knowledge?
Teaching materials: Were the teaching materials effective?
Teacher differences: Were particular teachers more effective than others?

ACTIVITY 1.1

In the myINSPIRE online forum, discuss the following:
(a) "Streaming according to academic abilities should be discouraged in schools."
(b) To what extent have you used assessment data to review your teaching and learning strategies?

1.4 GENERAL PRINCIPLES OF ASSESSMENT

There are literally hundreds of guiding principles of assessment generated by various sources such as educational institutions and individual scholars. The following are a few principles which can be applied at every level of assessment:

(a) What is to be assessed has to be clearly specified. The specification of the characteristics to be measured should precede the selection or development of assessment procedures. In other words, in assessing student learning, the intended learning outcomes should be clearly specified. Only with a clear specification of the intended learning outcomes to be measured can appropriate assessment procedures or methods be selected. The following is an example of an intended learning outcome for an assessment course:

"By the end of the lesson, the student will be able to write effective learning outcomes that include lower-order and higher-order cognitive skills for a one-semester course."

A clear statement of a learning outcome normally consists of three components: a verb, a condition and a standard. The verb describes what the student will be doing (the behaviour), the condition refers to the context under which the behaviour is to occur, and the standard indicates the criteria for an acceptable level of performance. The three components of the intended learning outcome above are as follows (refer to Table 1.4):

Table 1.4: Components in a Learning Outcome

Verb: write
Condition: effective learning outcomes that include lower-order and higher-order cognitive skills
Standard: for a one-semester course

The verb used in the statement of a learning outcome should be specific, measurable, achievable and realistic. Avoid words such as "understand", "appreciate", "know" and "learn". Note that it is not mandatory for every learning outcome to have all three components of verb, condition and standard.
However, it must at least have the verb and the condition, while the standard may be optional, as exemplified as follows:

"By the end of the lesson, the students will be able to write effective learning outcomes that include lower-order and higher-order cognitive skills."

(b) An assessment procedure or method should be selected based on its relevance to the characteristics or performance to be measured. When selecting an assessment procedure to measure a specific learning outcome, teachers should always ask themselves whether the procedure is the most effective method for measuring the learning or development to be assessed. There must be a close match between the intended learning outcomes and the types of assessment tasks to be used. For example, if the development of the ability to organise ideas is being measured, the use of a multiple-choice test would be a poor choice. Instead, the appropriate assessment method to use would be essay questions.

(c) Different assessment procedures are required to provide a complete picture of student achievement and development. This is because no single assessment procedure can assess all the different learning outcomes in a school curriculum. Different assessment procedures differ in their usefulness. For example, multiple-choice questions are useful for measuring knowledge, understanding and application outcomes, while essay tests are appropriate for measuring the ability to organise and express ideas. Projects that require conducting library research are needed to measure certain skills in formulating and solving problems. Observational techniques are needed to assess performance skills and various aspects of student behaviour.

(d) The assessment must be aligned to instruction. What is assessed in the classroom must be consistent with what has been taught, and vice versa. For example, it would not be fair to assess students' higher-order thinking skills when what was taught involved only lower-order thinking skills. Of course, what is taught in class must be in line with what has been planned, as indicated by the learning outcomes for the course. According to Biggs and Tang (2007), the relationship among assessment, instruction and learning outcomes is referred to as constructive alignment (refer to Figure 1.3).

Figure 1.3: Constructive alignment

1.5 TYPES OF ASSESSMENT

Before we proceed to discuss assessment in detail, you need to be clear about these often-used concepts in assessment:
(a) Formative and summative assessments (or evaluation); and
(b) Criterion-referenced and norm-referenced tests.

1.5.1 Formative versus Summative Assessments

Assessment can be done at various times throughout the school year. A comprehensive assessment plan will include both formative and summative assessments. The point at which assessment occurs and the aim of assessing distinguish these two types of assessment.

(a) Formative Assessment
Formative assessment is often done at the beginning of or during the school year, thus providing the opportunity for obtaining timely information about student learning in a particular subject area or at a particular point in a programme. Classroom assessment is one of the most common formative assessment techniques. The purpose of this technique is to improve the quality of student learning.
It should not be evaluative or involve grading students. In formative assessment, the teacher compares the performance of a student with that of other students in the class, and not with all students in the same year (or form). Usually, a small section of the content is tested to determine whether the learning outcomes have been achieved. Formative assessment is action-oriented and forms the basis for the improvement of instructional methods (Scriven, 1996). For example, if a teacher observes that some students have not grasped a concept, he or she may design a review activity or use a different instructional strategy. Likewise, students can monitor their progress with periodic quizzes and performance tasks. The results of formative assessments are used to modify and validate instruction. In short, formative assessments are ongoing and include reviews and observations of what is happening in the classroom. Some examples of formative assessment are monthly tests, weekly quizzes, class exercises and homework.

(b) Summative Assessment

"When the cook tastes the soup, that's formative evaluation; when the guests taste the soup, that's summative evaluation." (Robert Stake)

Summative assessment is comprehensive in nature, provides accountability and is used to check the level of learning at the end of a programme (which may be at the end of the semester, at the end of the year or after two years). For example, after five years in secondary school, students take the Sijil Pelajaran Malaysia (SPM), which is summative in nature since it is based on the cumulative learning experiences of students. Summative assessments are typically used to evaluate the effectiveness of an instructional programme at the end of an academic year or at a pre-determined time. The goal of summative assessment is to make a judgement on a student's competency after an instructional phase is completed. For example, national examinations are administered in Malaysia each year. This is summative assessment used to determine each student's acquisition of knowledge in several subject areas during a period of five years. Summative evaluations are used to determine whether students have mastered specific competencies, and letter grades are assigned to assess student achievement. Besides national examinations such as the Sijil Pelajaran Malaysia (SPM) and Sijil Tinggi Pelajaran Malaysia (STPM), end-of-year or end-of-semester examinations in schools, colleges and universities can also be considered examples of summative assessment.

The question that arises is whether summative assessment data can be used formatively. The answer is positive, and it can be done for the following purposes:

(i) To Improve Learning among Students
Based on summative assessment data, poor and good students may be identified and given different attention in the subsequent year or semester.

(ii) To Improve Teaching Methods
Based on summative assessment data, teachers are able to find out if their teaching methods or strategies are appropriate and effective.

(iii) To Plan and Improve the Curriculum
Based on summative assessment data, teachers and administrators can identify if the curriculum designed is appropriate for the students' ability levels and the needs of the industry.
Let us look at Table 1.5, which summarises the differences between formative and summative assessments.

Table 1.5: Differences between Formative and Summative Assessments

Timing
- Formative: Conducted throughout the teaching-learning process (monthly, weekly or even daily).
- Summative: Conducted at the end of a teaching-learning phase (such as the end of a semester or year).

Method
- Formative: Paper-and-pencil tests, observations, quizzes, exercises and practical sessions, administered to the group and individually.
- Summative: Paper-and-pencil tests and oral tests, administered to the group.

Aim
- Formative: To assess learning progress; to identify needs for remediation or enrichment.
- Summative: To assess achievement of the instructional goals of a course or programme (i.e. a terminal examination); to certify students and improve the curriculum.

Example
- Formative: Monthly tests, weekly quizzes, daily reports, etc.
- Summative: Final examinations, qualifying tests, national examinations (UPSR, SPM, STPM, etc.).

1.5.2 Norm-referenced versus Criterion-referenced Tests

The main difference between norm-referenced and criterion-referenced tests lies in the purpose or aim of assessing students, the way in which content is selected and the scoring processes which define how the test results are interpreted.

(a) Norm-referenced Tests
The major reason for norm-referenced tests is to classify students. They are designed to highlight achievement differences between and among students to produce a dependable rank order of students across a continuum of achievement, from high achievers to low achievers (Stiggins, 1994). With norm-referenced tests, a representative group of students is given a test and their scores form the norm after having gone through a complex administration and analysis. Anyone taking the norm-referenced test can compare his or her score against the norm. For example, a score of 70 on a norm-referenced test will not mean much until it is compared with the norm. When compared with the norm, if the student's score is in the 80th percentile, this means that he or she performed as well as or better than 80 per cent of the students in the norm group (in other words, only 20 per cent scored higher). This type of information can be useful for deciding whether or not the student needs remedial assistance or is a candidate for the gifted programme. However, the score gives little information about what the student actually knows or can do. A major criticism of norm-referenced tests is that they tend to focus on assessing low-level, basic skills (Romberg, Zarinnia & Williams, 1989).

(b) Criterion-referenced Tests
Criterion-referenced tests determine what students can or cannot do, and not how they compare with others (Anastasi, 1988). Criterion-referenced tests report how well students are doing relative to a pre-determined performance level on a specified set of educational goals or outcomes included in the curriculum. Criterion-referenced tests are used when teachers wish to know how well students have learnt the content and skills which they are expected to have mastered. This information may be used to determine how well the students are learning the desired curriculum and how well the school is teaching that curriculum. Criterion-referenced tests give detailed information about how well each student has performed on each of the educational goals or outcomes included in that test.
For instance, a criterion-referenced test score might describe which arithmetic operations a student could perform or the level of reading difficulty he or she experienced.

Let us look at Table 1.6, which summarises the differences between these two types of tests.

Table 1.6: Differences between Norm-referenced and Criterion-referenced Tests

Aim
- Norm-referenced test: Compares a student's performance with that of other students; selects students for certification.
- Criterion-referenced test: Compares a student's performance against some criteria; determines the extent to which a student has acquired the knowledge or skill; improves teaching and learning.

Types of questions
- Norm-referenced test: Questions ranging from simple to difficult.
- Criterion-referenced test: Questions of nearly similar difficulty relating to the criteria.

Reporting of results
- Norm-referenced test: Grades are assigned.
- Criterion-referenced test: No grades are assigned (only whether the skill or knowledge has been achieved or not).

Content coverage
- Norm-referenced test: Wide content coverage.
- Criterion-referenced test: Specific aspects of the content.

Examples
- Norm-referenced test: UPSR, SPM and STPM national examinations, end-of-semester examinations, end-of-year examinations.
- Criterion-referenced test: Class tests, exercises and assignments.

SELF-CHECK 1.2

1. List the main differences between formative and summative assessments.
2. Explain the differences between norm-referenced and criterion-referenced tests.

1.6 TRENDS IN ASSESSMENT

In the past two decades, major changes have occurred in assessment practices in many parts of the world. The following trends in educational assessment have been identified:

(a) Written examinations are gradually being replaced by more continuous assessment and coursework;

(b) There is a move towards more student involvement and choice in assessment;

(c) Group assessment is becoming more popular in an effort to emphasise collaborative learning among students and to reduce excessive competition;

(d) Subject areas and courses state more explicitly the expectations in assessment. Students are clearer about the kinds of performance required of them when they are assessed. This is unlike earlier practice, where assessment was so secretive that students had to figure out for themselves what was required of them;

(e) An understanding of the process is now seen as equally important as knowledge of facts. This is in line with the general shift from product-based assessment towards process-based assessment; and

(f) Student-focused "learning outcomes" have begun to replace teacher-oriented "objectives". The focus is more on what the students will learn rather than what the teacher plans to teach. This is in line with the principle of outcomes-based teaching and learning.

ACTIVITY 1.2

To what extent do you agree with the current trends in assessment? Discuss this issue with your coursemates in the myINSPIRE online forum.

SUMMARY

- A test may be thought of as a set of tasks or questions intended to elicit particular types of behaviours when presented under standardised conditions to yield scores that have desirable psychometric properties.
- Measurement in education is the process by which the attributes of a person are measured and assigned numbers.
- Assessment is the process of collecting information with the purpose of making decisions about students.
- Assessment is aimed at helping learning and improving teaching.
- What is to be assessed must be clearly specified in the intended learning outcomes.
- An assessment procedure or method should be selected based on its relevance to the characteristics or performance to be measured.
- The assessment must be aligned to instruction and learning outcomes. The relationship among assessment, instruction and learning outcomes is referred to as constructive alignment.
- Different assessment procedures are required to provide a complete picture of student achievement and development.
- There are four types of assessment, namely formative, summative, norm-referenced and criterion-referenced.
- Formative assessment is often done at the beginning of or during the school year, thus providing the opportunity to obtain timely information about students' learning in a particular subject area or at a particular point in a programme.
- Summative assessment is comprehensive in nature, provides accountability and is used to check the level of learning at the end of a programme.
- The major reason for norm-referenced tests is to classify students. These tests are designed to highlight achievement differences between and among students to produce a dependable rank order of students.
- Criterion-referenced tests determine what students can or cannot do, and not how they compare with others.
- Among the trends in assessment are more continuous assessment and coursework as well as more choices of assessment.

KEY TERMS

Assessment
Constructive alignment
Criterion-referenced
Formative assessment
Learning outcome
Measurement
Norm-referenced
Psychometrics
Summative assessment
Test

REFERENCES

Anastasi, A. (1988). Psychological testing. New York, NY: MacMillan.
Biggs, J., & Tang, C. (2007). Teaching for quality learning at university. Berkshire, England: McGraw Hill.
Deale, R. N. (1975). Assessment and testing in secondary school. Chicago, IL: Evans Bros.
Flanagan, D. P., Genshaft, J., & Harrison, P. L. (1997). Contemporary intellectual assessment: Theories, tests and issues. New York, NY: Guilford Press.
Harlen, W. (1978). Does content matter in primary science? School Science Review, 59(Jun), 614–625.
Irvine, P. (1986). Sir Francis Galton (1822–1911). Journal of Special Education, 20(1), 6–7.
Romberg, T. A., Zarinnia, E. A., & Williams, S. R. (1989). The influence of mandated testing on mathematics instruction: Grade eight teachers' perceptions. Madison, WI: National Center for Research in Mathematical Sciences Education.
Rowntree, D. (1974). Educational technology in curriculum development. New York, NY: Harper & Row.
Salvia, J., & Ysseldyke, J. E. (1995). Assessment (6th ed.). Boston, MA: Houghton Mifflin.
Scriven, M. (1996). Types of evaluation and types of evaluator. American Journal of Evaluation, 17(2), 151–161.
Stiggins, R. J. (1994). Student-centered classroom assessment. New York, NY: Merrill.

Topic 2: Foundation of Assessment: What to Assess?

LEARNING OUTCOMES

By the end of the topic, you should be able to:
1. Justify the behaviours to be measured to present a holistic assessment of students;
2. Describe the various levels of cognitive learning outcomes to be assessed;
3. Explain the various levels of affective learning outcomes to be assessed; and
4. Describe the various levels of psychomotor learning outcomes to be assessed.
INTRODUCTION

If you were to ask a teacher about the things that should be assessed in the classroom, the immediate response would most probably be "the facts and concepts taught". The facts and concepts may be in Science, History, Geography, Language, Arts, Religious Education and other similar subjects. The Malaysian Philosophy of Education states that education should aim towards the holistic development of the individual.
These are behaviours that psychometricians and psychologists have attempted to assess, and they align closely with the goals of the National Philosophy of Malaysian Education.

2.1.1 Three Types of Learning Outcomes

Few people will dispute that the purpose of schooling is the development of a holistic person. In the late 1950s and early 1960s, a group of psychologists and psychometricians proposed that schools should seek to assess three domains of learning outcomes, which are:

(a) Cognitive learning outcomes (knowledge or mental skills);
(b) Affective learning outcomes (growth in feelings or emotional areas); and
(c) Psychomotor learning outcomes (manual or physical skills).

These three domains are closely interrelated, as shown in Figure 2.2.

Figure 2.2: Holistic assessment of students

Domains can be thought of as categories. Educators often refer to these three domains as KSA (knowledge, skills and attitude). Each domain consists of subdivisions, starting from the simplest behaviours and moving to the most complex, thus forming a taxonomy of learning outcomes. Each taxonomy of learning behaviour can be thought of as "the goals of the schooling process". That is, after schooling, students should have acquired new skills, knowledge and/or attitudes. However, the levels of each division outlined are not absolutes. While other systems and hierarchies have been devised in the educational world, these three taxonomies are easily understood and are probably the most widely used today.

To assess these three domains, one has to identify and isolate those behaviours that represent them. When we assess, we evaluate some aspect of the student's behaviour, such as his or her ability to compare, explain, analyse, solve, draw, pronounce, feel, reflect and so forth. The term "behaviour" is used broadly to include the student's ability to think (cognitive), feel (affective) and perform a skill (psychomotor). For example, suppose you have just taught the topic "The Rainforests of Malaysia" and now want to assess your students in the following ways:

(a) Their Thinking
You may ask them to list the characteristics of the Malaysian rainforest and compare them with those of the coniferous forests of Canada.

(b) Their Feelings (Emotions, Attitudes)
You may ask them to design an exhibition on how students can contribute towards conserving the rainforest.

(c) Their Skills
You may ask them to prepare satellite maps of the changing Malaysian rainforest using websites from the Internet.

ACTIVITY 2.1
To what extent are affective and psychomotor behaviours assessed in Malaysian schools? Discuss this with your coursemates in the myINSPIRE online forum.

2.2 ASSESSING COGNITIVE LEARNING OUTCOMES OR BEHAVIOUR

When we evaluate or assess a human being, we are assessing the behaviour of a person. This might be a bit confusing for some people. Are we not assessing a person's understanding of the facts, concepts and principles of a subject area? Every subject, be it History, Science, Geography, Economics or Mathematics, has its unique repertoire of facts, concepts, principles, generalisations, theories, laws, procedures and methods to be transmitted to students. This concept is illustrated in Figure 2.3.
Figure 2.3: Contents of a subject assessed

When we assess, we do not assess students' storage of the facts, concepts or principles of a subject, but rather what students are able to do with them. For example, we evaluate students' ability to compare facts, explain concepts, analyse a generalisation (or statement) or solve a problem based on a given principle. In other words, we assess understanding or mastery of a body of knowledge based on what students are able to do with the contents of the subject.

2.2.1 Bloom's Taxonomy

In 1956, Benjamin Bloom headed a group of educational psychologists who developed a classification of levels of intellectual behaviour important to learning. They found that over 95 per cent of the test questions students encountered required them to think only at the lowest possible level, i.e. the recall of information. Bloom and his colleagues developed a widely accepted taxonomy (a method of classification into differing levels) for cognitive learning outcomes. This is referred to as Bloom's taxonomy (refer to Figure 2.4).

Figure 2.4: Bloom's taxonomy of cognitive learning outcomes

There are six levels in Bloom's classification, with the lowest level termed knowledge. The knowledge level is followed by five increasingly complex levels of mental abilities, namely comprehension, application, analysis, synthesis and evaluation.

(a) Knowledge (C1)
The behaviours at the knowledge level require students to recall specific information. The knowledge level is the lowest cognitive level. Examples of verbs describing behaviours at the knowledge level include the ability to list, define, name, state, recall, match, identify, tell, label, underline, locate and recognise.
For example, students are able to recite the factors leading to World War II, quote the formulae for density and force, and list laboratory safety rules.

(b) Comprehension (C2)
The behaviours at the comprehension level, which is a higher level of mental ability than the knowledge level, require an understanding of the meaning of concepts and principles; translation of words and phrases into one's own words; interpolation, which involves filling in missing information; and interpretation, which involves inferring and going beyond the given information. Examples of verbs describing behaviours at the comprehension level are explain, distinguish, infer, interpret, convert, generalise, defend, estimate, extend, paraphrase, retell in one's own words, rewrite and translate. For example, students are able to rewrite Newton's three laws of motion, explain in their own words the steps for performing a complex task, and translate an equation into a computer spreadsheet.

(c) Application (C3)
The behaviours at the application level require students to apply a rule or principle learnt in the classroom to novel or new situations, such as in the workplace, or to make unprompted use of an abstraction. Examples of verbs describing behaviours at the application level are apply, change, compute, demonstrate, discover, manipulate, modify, give an example, operate, predict, prepare, produce, relate, show, solve and use. For example, students are able to use the formula for projectile motion to calculate the maximum distance a long jumper can jump, and apply the principles of statistics to evaluate the reliability of a written test.

(d) Analysis (C4)
The behaviours at the analysis level require students to identify component parts and describe their relationships, separate material or concepts into component parts so that the organisational structure may be understood, and distinguish between facts and inferences. Examples of verbs describing behaviours at the analysis level are analyse, break down, compare, contrast, diagnose, deconstruct, examine, dissect, differentiate, discriminate, distinguish, illustrate, infer, outline, relate, select and separate. For example, students are able to troubleshoot a piece of equipment by using logical deduction, recognise logical fallacies in reasoning, and gather information from a company to determine its training needs.

(e) Synthesis (C5)
The behaviours at the synthesis level require students to build a structure or pattern from diverse elements, putting parts together to form a whole with emphasis on creating a new meaning or structure. Examples of verbs describing behaviours at the synthesis level are categorise, combine, compile, compose, create, devise, design, generate, modify, organise, plan, rearrange, reconstruct, relate, reorganise, find an unusual way, formulate, revise, rewrite, tell and write. For example, students are able to write a creative short story, design a method to perform a specific task, integrate ideas from several sources to solve a problem, and revise a process to improve the outcome.

(f) Evaluation (C6)
The behaviours at the evaluation level require students to make judgements about materials and methods, as well as the value of ideas or materials. Examples of verbs describing behaviours at the evaluation level are appraise, conclude, criticise, critique, defend, rank, give your own opinion, discriminate, evaluate, value, justify, relate and support. For example, students are able to evaluate and decide on the most effective solution to a problem and justify the choice of a new procedure or course of action.

2.2.2 The Helpful Hundred

Heinich, Molenda, Russell and Smaldino (2001) suggested 100 verbs which highlight performance or behaviours that are observable and measurable. These 100 verbs are not the only ones, but they are a great reference for educators. Table 2.1 displays the verbs that would be appropriate to use when writing learning outcomes at each level of Bloom's taxonomy.
Table 2.1: The Helpful Hundred

add, alphabetise, analyse, apply, arrange, assemble, attend, bisect, build, carve, categorise, choose, classify, colour, compare, complete, compose, compute, conduct, construct, contrast, convert, correct, cut, deduce, defend, define, demonstrate, derive, describe, design, designate, diagram, distinguish, drill, estimate, evaluate, explain, extrapolate, fit, generate, graph, grasp, grind, hit, hold, identify, illustrate, indicate, install, kick, label, locate, make, manipulate, match, measure, modify, multiply, name, operate, order, organise, outline, pack, paint, plot, position, predict, prepare, present, produce, pronounce, read, reconstruct, reduce, remove, revise, select, sketch, ski, solve, sort, specify, square, state, subtract, suggest, swing, tabulate, throw, time, translate, type, underline, verbalise, verify, weave, weigh, write

Do note that there is a lot of overlap in the use of verbs to describe behaviours. The same verb may be used to describe different behaviours. For example, the verb "explain" may be used to describe behaviours at the evaluation (C6), analysis (C4) or comprehension (C2) level, depending on the context in which it is used, as shown below:

(a) Students are able to explain how effective essay questions are in assessing students' critical thinking ability. (C6)
(b) Students are able to explain how essay questions differ from multiple-choice questions in assessing students' performance. (C4)
(c) Students are able to explain, in their own words, the criteria that they would consider when formulating an essay question as a test item. (C2)

Likewise, different verbs may be used to describe the same behaviour. For example, the behaviour of "analysis" may be expressed using verbs such as "compare and contrast", "explain" and "distinguish", as follows:

(a) Students are able to compare and contrast formative assessment with summative assessment. (C4)
(b) Students are able to explain the difference between formative assessment and summative assessment. (C4)
(c) Students are able to distinguish formative assessment from summative assessment. (C4)

In 2001, Anderson and Krathwohl modified the original Bloom's taxonomy, identifying and isolating the following list of behaviours that an assessment system should address (refer to Table 2.2).

Table 2.2: Revised Version of Bloom's Taxonomy (category and cognitive process, with alternative names)

1. Remember
(a) Recognising – Identifying
(b) Recalling – Retrieving

2. Understand
(a) Interpreting – Clarifying, paraphrasing, representing, translating
(b) Exemplifying – Illustrating, instantiating
(c) Classifying – Categorising, subsuming
(d) Summarising – Abstracting, generalising
(e) Inferring – Concluding, extrapolating, interpolating, predicting
(f) Comparing – Contrasting, mapping, matching
(g) Explaining – Constructing models

3. Apply
(a) Executing – Carrying out
(b) Implementing – Using

4. Analyse
(a) Differentiating – Discriminating, distinguishing, focusing, selecting
(b) Organising – Finding coherence, integrating, outlining, structuring
(c) Attributing – Deconstructing
5. Evaluate
(a) Checking – Coordinating, detecting, monitoring, testing
(b) Critiquing – Judging

6. Create
(a) Generating – Hypothesising
(b) Planning – Designing
(c) Producing – Constructing

Source: Anderson and Krathwohl (2001)

Note that the first two original levels of "knowledge" and "comprehension" were replaced with "remember" and "understand" respectively, and that the "synthesis" level was renamed "create". Note also that in the original taxonomy, the sequence was "synthesis" followed by "evaluation"; in the modified taxonomy, the sequence was rearranged to "evaluate" followed by "create".

As you can see, the primary differences between the original and the revised taxonomy are not in the listings, the rewording from nouns to verbs, the renaming of some of the components, or even the re-positioning of the last two categories. The major differences lie in the more useful and comprehensive additions of how the taxonomy intersects with and acts upon different types and levels of knowledge: factual, conceptual, procedural and metacognitive.

(a) Factual Knowledge
This refers to the essential facts, terminology, details or elements students must know or be familiar with in order to understand a discipline or solve a problem in it.

(b) Conceptual Knowledge
This is the knowledge of classifications, principles, generalisations, theories, models or structures pertinent to a particular disciplinary area.

(c) Procedural Knowledge
This refers to information or knowledge that helps students do something specific to a discipline, subject or area of study. It also refers to methods of inquiry, very specific or finite skills, algorithms, techniques and particular methodologies.

(d) Metacognitive Knowledge
Metacognition is, simply, thinking about one's thinking. More precisely, it refers to the processes used to plan, monitor and assess one's understanding and performance. Activities such as planning how to approach a given learning task, monitoring comprehension and evaluating progress towards the completion of a task are metacognitive in nature.

SELF-CHECK 2.1
1. Explain the differences between analysis and synthesis according to Bloom's taxonomy.
2. How is the revised version of Bloom's taxonomy different from the original version?

ACTIVITY 2.2
Discuss in the myINSPIRE online forum:
(a) Do you agree that Bloom's taxonomy is a hierarchy of cognitive abilities? Why?
(b) Do you agree that you need to be able to "analyse" before being able to "evaluate"? Why?

2.3 ASSESSING AFFECTIVE LEARNING OUTCOMES OR BEHAVIOUR

Affective characteristics involve the feelings or emotions of a person. Attitudes, values, self-esteem, locus of control, self-efficacy, interests, aspirations and anxiety are all examples of affective characteristics. Unfortunately, affective outcomes have not been a central part of schooling, even though they are arguably as important as, or even more important than, any cognitive or psychomotor learning outcomes targeted by schools. Some possible reasons for the historical lack of emphasis on affective outcomes include the following:

(a) The belief that the development of appropriate feelings is the task of family and religion.

(b) The belief that appropriate feelings develop automatically from knowledge of and experience with content, and do not require any special pedagogical attention.
(c) The belief that attitudinal and value-oriented instruction is difficult to develop and assess because:
(i) Affective goals are intangible;
(ii) Affective learning outcomes cannot be attained in the typical periods of instruction offered in schools;
(iii) Affective characteristics are considered to be private rather than public matters; and
(iv) There are no sound methods to gather information about affective characteristics.

However, affective goals are no more intangible than cognitive ones. Some have claimed that affective behaviours develop automatically when specific knowledge is taught, while others argue that affective behaviours have to be explicitly developed in schools. Affective goals do not necessarily take longer to achieve in the classroom than cognitive goals. All that is required is to state a goal more concretely, in behavioural terms, so that it can be assessed and monitored.

There is also the belief that affective characteristics are private and should not be made public. While people value their privacy, the public also has the right to information. If the information gathered is needed to make a decision, then gathering such information is not generally considered an invasion of privacy. For example, if the assessment is used to determine whether a student needs further attention, such as special education, then gathering such information is not an invasion of privacy. On the other hand, if the information being sought is not relevant to the stated purpose, then gathering it is likely to be an invasion of privacy.

Similarly, information about affect can be used either for good or for ill. For example, if a Mathematics teacher discovers that a student has a negative attitude towards Mathematics and ridicules that student in front of the class, then the information has been misused. However, if the teacher uses the information to change his or her instructional methods so as to help the student develop a more positive attitude towards Mathematics, then the information has been used wisely.

Krathwohl, Bloom and Bertram (1973) developed the affective domain, which deals with things related to emotion, such as feelings, values, appreciation, enthusiasm, motivation and attitudes. The five major categories, listed from the simplest behaviour to the most complex, are receiving, responding, valuing, organisation and characterisation (refer to Figure 2.5).

Figure 2.5: Krathwohl, Bloom and Bertram's taxonomy of affective learning outcomes
Source: Krathwohl et al. (1973)

These categories are further explained as follows:

(a) Receiving (A1)
The behaviours at the receiving level require the student to be aware, willing to hear, and focused or attentive. Verbs describing behaviours at the receiving level include ask, listen, choose, describe, follow, give, hold, locate, name, point to, select and reply. For example, the student:
(i) Listens to others with respect; and
(ii) Listens for and remembers the names of other students.

(b) Responding (A2)
The behaviours at the responding level require the student to be an active participant who attends and reacts to a particular phenomenon, is willing to respond, and gains satisfaction from responding (motivation). Verbs describing behaviours at the responding level include answer, assist, aid, comply, conform, discuss, greet, help, label, perform, practise, present, read, recite, report, select, tell and write.
For example, the student:
(i) Participates in class discussions;
(ii) Gives a presentation; and
(iii) Questions new ideas, concepts or models in order to fully understand them.

(c) Valuing (A3)
This level relates to the worth or value a person attaches to a particular object, phenomenon or behaviour. It ranges from simple acceptance to the more complex state of commitment. Valuing is based on the internalisation of a set of specified values, and clues to these values are expressed in the student's overt behaviour and are often identifiable. Verbs describing behaviours at the valuing level include demonstrate, differentiate, follow, form, initiate, invite, join, justify, propose, read, report, select, share, study and work. For example, the student:
(i) Demonstrates belief in the democratic process;
(ii) Is sensitive towards individual and cultural differences (values diversity);
(iii) Shows the ability to solve problems;
(iv) Proposes a plan for social improvement; and
(v) Follows through with commitment.

(d) Organisation (A4)
At this level, people organise values into priorities by contrasting different values, resolving conflicts between them and creating a unique value system. The emphasis is on comparing, relating and synthesising values. Verbs describing behaviours at the organisation level are adhere, alter, arrange, combine, compare, complete, defend, explain, formulate, generalise, identify, integrate, modify, order, organise, prepare, relate and synthesise. For example, the student:
(i) Recognises the need for balance between freedom and responsible behaviour;
(ii) Accepts responsibility for his or her behaviour;
(iii) Explains the role of systematic planning in solving problems;
(iv) Accepts professional ethical standards;
(v) Creates a life plan in harmony with abilities, interests and beliefs; and
(vi) Prioritises time effectively to meet the needs of the organisation, family and self.

(e) Characterisation (A5)
At this level, a person's value system controls his or her behaviour. The behaviour is pervasive, consistent, predictable and, most importantly, characteristic of the student. Verbs describing behaviours at this level include act, discriminate, display, influence, listen, modify, perform, practise, propose, qualify, question, revise, serve, solve and verify. For example, the student:
(i) Shows self-reliance when working independently;
(ii) Cooperates in group activities (displays teamwork);
(iii) Uses an objective approach to problem solving;
(iv) Displays a professional commitment to ethical practice on a daily basis;
(v) Revises judgements and changes behaviour in the light of new evidence; and
(vi) Values people for what they are, not how they look.

Table 2.3 shows how the affective taxonomy may be applied to a value such as honesty. It traces the development of an affective attribute such as honesty from the "receiving" level to the "characterisation" level, where the value becomes a part of the individual's character.
Table 2.3: Affective Taxonomy for Honesty

Level – Explanation
Receiving (attending) – Aware that certain things are honest or dishonest
Responding – Saying honesty is better and behaving accordingly
Valuing – Consistently (but not always) telling the truth
Organisation – Being honest in a variety of situations
Characterisation by a value or value complex – Honest in most situations; expects others to be honest and interacts with others fully and honestly

SELF-CHECK 2.2
1. Explain the differences between characterisation and valuing according to the affective taxonomy of learning outcomes.
2. "A student is operating at the responding level." What does this mean?

ACTIVITY 2.3
The Role of Affect in Education
"Some say schools should be concerned only with content."
"It is impossible to teach content without also teaching affect."
"To what extent, if at all, should we be concerned with the assessment of affective learning outcomes?"
In the myINSPIRE online forum, discuss the three statements in the context of the Malaysian education system.

2.4 ASSESSING PSYCHOMOTOR LEARNING OUTCOMES OR BEHAVIOUR

The psychomotor domain includes physical movement, coordination and the use of motor skills. Development of these skills requires practice and is measured in terms of speed, precision, distance, procedures and techniques in execution. There are seven major categories in this domain, listed from the simplest to the most complex behaviour, as shown in Figure 2.6.

Figure 2.6: Taxonomy of psychomotor learning outcomes

These learning outcomes are further explained as follows:

(a) Perception (P1)
This is the ability to use sensory cues to guide motor activity. It ranges from sensory stimulation, through cue selection, to translation. Verbs describing these types of behaviours include choose, describe, detect, differentiate, distinguish, identify, isolate, relate and select. For example, the student:
(i) Detects non-verbal communication cues from the coach;
(ii) Estimates where a ball will land after it is thrown and then moves to the correct location to catch it;
(iii) Adjusts the heat of the stove to the correct temperature by the smell and taste of the food; and
(iv) Adjusts the height of the ladder in relation to a point on the wall.

(b) Set (P2)
This includes mental, physical and emotional sets. These three sets are dispositions that predetermine a person's response to different situations (sometimes called mindsets). Verbs describing "set" include begin, display, explain, move, proceed, react, show, state and volunteer. For example, the student:
(i) Knows and acts upon a sequence of steps in a manufacturing process;
(ii) Recognises his or her abilities and limitations; and
(iii) Shows the desire to learn a new process (motivation).
Note: This subdivision of the psychomotor domain is closely related to the "responding" subdivision of the affective domain.

(c) Guided Response (P3)
This covers the early stages in learning a complex skill, which include imitation and trial and error. Adequacy of performance is achieved through practice. Verbs describing "guided response" include copy, trace, follow, react, reproduce and respond.
For example, the student:
(i) Performs a mathematical equation as demonstrated;
(ii) Follows instructions when building a model of a kampung house; and
(iii) Responds to the hand signals of the coach while learning gymnastics.

(d) Mechanism (P4)
This is the intermediate stage in learning a complex skill. Learned responses have become habitual, and the movements can be performed with some confidence and proficiency. Verbs describing "mechanism" include assemble, calibrate, construct, dismantle, display, fasten, fix, grind, heat, manipulate, measure, mend, mix and organise. For example, the student:
(i) Uses a computer;
(ii) Repairs a leaking tap;
(iii) Fixes a three-pin electrical plug; and
(iv) Rides a motorbike.

(e) Complex Overt Response (P5)
This is the skilful performance of motor acts that involve complex movement patterns. Proficiency is indicated by quick, accurate and highly coordinated performance requiring a minimum of energy. This category includes performing without hesitation and automatic performance. For example, players often utter sounds of satisfaction or expletives as soon as they hit a tennis ball (like world-famous tennis players Maria Sharapova and Serena Williams) or a golf ball (golfers will immediately know they have hit a bad shot!) because they can tell, by the feel of the act, what the result will be. Verbs describing "complex overt response" include assemble, build, calibrate, construct, dismantle, display, fasten, fix, grind, heat, manipulate, measure, mend, mix, organise and sketch. For example, the student:
(i) Manoeuvres a car into a tight parallel parking spot;
(ii) Operates a computer quickly and accurately; and
(iii) Displays competence while playing the piano.
Note: Many of the verbs are the same as those for "mechanism", but adverbs or adjectives indicate that the performance is quicker, better and more accurate.

(f) Adaptation (P6)
At this level, skills are well developed and the individual can modify movement patterns to fit special requirements. Verbs describing "adaptation" include adapt, alter, change, rearrange, reorganise, revise and vary. For example, the student:
(i) Responds effectively to unexpected experiences;
(ii) Modifies instruction to meet the needs of the learners; and
(iii) Performs a task with a machine that the machine was not originally intended to do (the machine is not damaged and there is no danger in performing the new task).

(g) Origination (P7)
This involves creating new movements or patterns to fit a particular situation or specific problem. Learning outcomes at this level emphasise creativity based on highly developed skills. Verbs describing "origination" include arrange, build, combine, compose, construct, create, design, initiate, make and originate. For example, the student:
(i) Constructs a new theory;
(ii) Develops a new technique for goalkeeping; and
(iii) Creates a new gymnastic routine.

Table 2.4 shows how the psychomotor taxonomy may be applied to kicking a football. It traces the development of the psychomotor skill of kicking a football from the "perception" level to the "origination" level.
Table 2.4: Psychomotor Taxonomy for Kicking a Football

Level – Explanation
Perception – Able to estimate where the ball would land after it was kicked
Set – Shows the desire to learn and perform a kicking technique
Guided response – Able to kick the ball under guidance, through trial and error or imitation
Mechanism – Able to kick the ball mechanically with some confidence and proficiency
Complex overt response – Able to kick the ball skilfully using a proper technique learnt
Adaptation – Able to modify the kicking technique to suit different situations
Origination – Able to create a new kicking technique

SELF-CHECK 2.3
1. Explain the differences between adaptation and guided response according to the taxonomy of psychomotor learning outcomes.
2. "A student is operating at the origination level." What does this mean?

2.5 IMPORTANT TRENDS IN WHAT TO ASSESS

Since the influence of testing on curriculum and instruction is now widely acknowledged, educators, policymakers and others are turning to alternative assessment methods as a tool for educational reform. The call is to move away from traditional objective and essay tests towards alternative assessments focusing on authentic assessment and performance assessment (we will discuss these assessment methods in Topics 5 and 6). Various techniques have been proposed to assess learners more holistically, focusing on both the product and the process of learning (refer to Figure 2.7).

Figure 2.7: Trends in what to assess
Source: Dietel, Herman and Knuth (1991)

Summary

• Assessment of cognitive outcomes has remained the focus of most assessment systems all over the world because cognitive outcomes are relatively easier to observe and measure.

• Each domain of learning consists of subdivisions, starting from the simplest behaviours and moving to the most complex, thus forming a taxonomy of learning outcomes.

• When we evaluate or assess a human being, we are assessing the behaviour of a person.

• Every subject area has its unique repertoire of facts, concepts, principles, generalisations, theories, laws, procedures and methods to be transmitted to learners.

• There are six levels in Bloom's taxonomy of cognitive learning outcomes, with the lowest level termed knowledge, followed by five increasingly complex levels of mental abilities: comprehension, application, analysis, synthesis and evaluation. The six levels in the revised version are remembering, understanding, applying, analysing, evaluating and creating.

• Affective characteristics involve the feelings or emotions of a person. Attitudes, values, self-esteem, locus of control, self-efficacy, interests, aspirations and anxiety are all examples of affective characteristics. The five major categories of the affective domain, from the simplest behaviour to the most complex, are receiving, responding, valuing, organisation and characterisation.

• The psychomotor domain includes physical movement, coordination and the use of motor skills. The seven major categories of the psychomotor domain, from the simplest behaviour to the most complex, are perception, set, guided response, mechanism, complex overt response, adaptation and origination.

• The ideal situation is an alignment of objectives, instruction and assessment.
• The trend in assessment is to move away from traditional objective and essay tests towards alternative assessments focusing on authentic assessment and performance assessment.

Key Terms

Affective outcomes, Authentic assessment, Behaviour, Behavioural view, Bloom's taxonomy, Cognitive-constructivist, Cognitive outcomes, Holistic assessment, Psychomotor outcomes, The Helpful Hundred

References

Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. Boston, MA: Allyn & Bacon.

Dietel, R., Herman, J., & Knuth, R. (1991). What does research say about assessment? Retrieved from https://bit.ly/2ECzOXP

Dwyer, F. M. (1991). A paradigm for generating curriculum design oriented research questions in distance education. Second American Symposium on Research in Distance Education. University Park, PA: Pennsylvania State University.

Heinich, R., Molenda, M., Russell, J. D., & Smaldino, S. E. (2001). Instructional media and technologies for learning (7th ed.). Englewood Cliffs, NJ: Prentice Hall.

Krathwohl, D., Bloom, B., & Bertram, B. (1973). Taxonomy of educational objectives, the classification of educational goals. Handbook II: Affective domain. New York, NY: David McKay.

Topic 3 Planning Classroom Tests

LEARNING OUTCOMES

By the end of the topic, you should be able to:
1. Describe the process of planning a classroom test;
2. Explain the purposes of a test and their importance in test planning;
3. Describe how the learning outcomes to be assessed affect test planning;
4. Select the best item types for a test in line with learning outcomes;
5. Develop a table of specifications for a test;
6. Identify appropriate marking schemes for an essay test; and
7. Explain the general principles of constructing relevant test items.

INTRODUCTION

In this topic, we will focus on the process of planning classroom tests. Testing is part of the teaching and learning process, and the importance of planning and writing a reliable, valid and fair test cannot be overstated. Designing tests is an important part of assessing students' understanding of course content and their level of competency in applying what they have learnt. Whether you use low-stakes quizzes or high-stakes mid-semester and final examinations, careful design will help provide more calibrated results. Assessments should reveal how well students have learnt what teachers want them to learn, while instruction ensures that they learn it.

Thus, thinking about summative assessment at the end of a programme of teaching is not enough. It is also helpful to think about assessment at every stage of the planning process, because identifying the ways in which teachers will assess their students helps clarify what it is that teachers want them to learn, and this in turn helps determine the most suitable learning activities. This topic discusses the general guidelines applicable to most assessment tools when planning a test. Topics 4 and 5 will discuss objective tests and essay tests in detail, while authentic assessment tools such as projects and portfolios will be discussed in their respective topics.

3.1 PURPOSES OF CLASSROOM TESTING

Tests can refer to traditional paper-and-pencil or computer-based tests, such as multiple-choice, short-answer and essay tests.
Tests provide teachers with objective feedback on how much students are learning and how well they have understood what they have learnt. Commercially published achievement tests can, to some extent, provide an evaluation of the knowledge levels of individual students, but they provide only limited instructional guidance in assessing the wide range of skills taught in any given classroom. Teachers know their students, and they are the best assessors of their students. Tests developed by individual teachers for use with their own classes are the most instructionally relevant. Teachers can tailor tests to emphasise the information they consider important and to match the ability levels of their students. If carefully constructed, classroom tests can provide teachers with accurate and useful information about the knowledge retained by their students.

The key to this process is the test questions that are used to elicit evidence of learning. Test questions and tasks are not just planning tools; they also form an essential part of the teaching sequence. Incorporating the tasks into teaching, and using the evidence about student learning to determine what happens next in the lesson, is truly embedded formative assessment.

3.2 PLANNING A CLASSROOM TEST

A well-constructed test must have high-quality items. A well-constructed test is an instrument that provides an accurate measure of the test taker's ability within a particular domain. It is worth spending time writing high-quality items, and to produce high-quality questions, test construction has to be properly planned. Let us look at the following steps of planning a test (refer to Figure 3.1).

Figure 3.1: Planning a test

3.2.1 Deciding Its Purposes

The first step in test planning is to decide on the purpose of the test. Tests can be used for many different purposes. If a test is to be used formatively, it should indicate precisely what the student needs to study, and to what level. The purpose of formative tests is to assess progress and to direct the learning process. These tests sample a limited range of content and learning outcomes. Teachers must prepare a sufficient mix of easy and difficult items. These items are used to make corrective prescriptions, such as practice exercises for students who do not perform satisfactorily in the tests.

If a test is to be used summatively, the coverage of content and learning outcomes is different from that of formative tests. Summative tests are normally conducted at the end of a teaching and learning phase, for example, at the end of a course. They are used to determine students' mastery of the course and to help teachers decide whether a particular student can proceed to the next level of his or her studies. Summative tests should therefore cover the whole content area and learning outcomes of the course, or should at least cover a representative sample of them. The test items are also varied in their levels of difficulty and complexity, as defined by the learning outcomes.

Tests can also serve a diagnostic purpose. Diagnostic tests are used to find out what students know and do not know, and their strengths and weaknesses. They typically happen at the start of a new phase of education, such as when students begin a new course.
These tests normally cover topics (content as well as learning outcomes) that students will study in the upcoming course, and the items included are usually simple. Diagnostic tests are also used to "diagnose" the learning difficulties encountered by students. When used for this purpose, the tests cover specific content areas and learning outcomes in the hope of unravelling the causes of the learning problems so that remediation can be implemented.

3.2.2 Specifying the Intended Learning Outcomes

The focus of instruction in a course of study is not the mere acquisition of knowledge by students but, more importantly, how they can use and apply the acquired knowledge in different and meaningful situations. The latter is expressed as course learning outcomes (CLOs), which should cover the cognitive, affective and psychomotor domains explained in Topic 2. In other words, the emphasis in instruction should be on the mastery of CLOs when teachers deliver the content covered in the topics of the course. The syllabus of a course should therefore present not only the relevant content areas in the form of topics but also indicate the CLOs to be achieved. A course of study might have a number of topics but only three to five CLOs. For instance, an Educational Assessment course may have 10 topics and four CLOs, which are spread across the 10 topics as shown in Table 3.1.

Table 3.1: Mapping of Course Learning Outcomes Across Topics

The four CLOs in this example are:
(a) Explain the different principles and theories of educational testing and assessment (C2);
(b) Compare the different methods of educational testing and assessment (C4);
(c) Develop different assessment methods for use in the classroom (C3); and
(d) Critically evaluate the suitability of different assessment methods for use in the classroom (C6).

In the table, each of the 10 topics is marked against the one or more CLOs it addresses. The parentheses indicate the levels of complexity according to Bloom's taxonomy.

In line with the principle of constructive alignment, the assessment of a course should also focus on the mastery of CLOs. CLOs are normally written in general terms. Under each topic, the learning outcome is more specific and is often referred to as an intended learning outcome (ILO). In assessing a topic of a course, it is imperative that its ILO is clearly specified. Table 3.2 gives an example of a CLO and a related ILO for a specific topic in the Educational Assessment course (i.e. Portfolio Assessment).

Table 3.2: Example of Course Learning Outcome (CLO) and Intended Learning Outcome (ILO)

CLO: Critically evaluate the suitability of different assessment methods for use in the classroom (C6)
ILO: Critically evaluate the usefulness of portfolios as an assessment tool (C6)
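Although the module prescribes no software, the bookkeeping behind constructive alignment is simple enough to sketch in code. The following Python fragment is purely illustrative (the class and function names are our own, and the data is taken from Table 3.2); it records a CLO with its Bloom level and checks that every ILO derived from it sits at or below that level:

from dataclasses import dataclass, field

@dataclass
class CLO:
    text: str
    bloom: str                                  # Bloom level code, e.g. "C6"
    ilos: list = field(default_factory=list)    # (text, bloom) pairs

clo4 = CLO(
    text="Critically evaluate the suitability of different assessment "
         "methods for use in the classroom",
    bloom="C6",
)
clo4.ilos.append(("Critically evaluate the usefulness of portfolios "
                  "as an assessment tool", "C6"))

def aligned(clo: CLO) -> bool:
    # Constructive alignment check: an ILO should not demand a higher
    # cognitive level than the CLO from which it is derived.
    level = lambda code: int(code[1:])          # "C6" -> 6
    return all(level(b) <= level(clo.bloom) for _, b in clo.ilos)

print(aligned(clo4))    # True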
A word of caution. Remember, not all ILOs can be assessed by tests. Tests are only appropriate for assessing cognitive learning outcomes. For example, of the following three intended learning outcomes, only ILO 1 can be assessed by a test using an essay question. ILO 2, which belongs to the psychomotor domain, is more appropriately assessed by practical work via teacher observation, while ILO 3, which belongs to the affective domain, may be assessed during the implementation of the class project via peer evaluation.

ILO 1: Explain the differences among the cognitive, affective and psychomotor domains of learning outcomes.
ILO 2: Demonstrate the proper technique of executing a table tennis top-spin serve.
ILO 3: Work collaboratively with other students in the team to complete the class project.

SELF-CHECK 3.1
1. What type of learning outcome in Bloom's taxonomy can be assessed by tests? Why?
2. How is an intended learning outcome (ILO) different from a course learning outcome (CLO)?

3.2.3 Selecting Best Item Types

Once the intended learning outcomes (ILOs) for the topics to be assessed have been specified, the next step in planning a test is to select the best item types. Different item types serve different purposes and differ in their usefulness. Table 3.3 shows two common item types used in a test, multiple-choice and essay questions, and their respective purposes and usefulness. Refer to Topics 4 and 5 for more details.

Table 3.3: Item Types and Their Respective Purposes and Usefulness

1. Multiple-choice
• Tests for factual knowledge
• Assesses a large number of items
• Can be scored rapidly, accurately and objectively

2. Essay
• Requires candidates to write an extended piece on a certain topic
• Assesses higher-order thinking skills, such as analysing, synthesising and evaluating in Bloom's taxonomy

It is thus imperative that the item types selected for assessment are relevant to the ILO to be assessed. There must be a close match between the ILOs and the types of items to be used. For example, if the ILO is to develop the ability to organise ideas, a multiple-choice test would be a poor choice; the best item type would be an essay question. Table 3.4 shows two intended learning outcomes. Can you select the best item types to assess them?

Table 3.4: Examples of Intended Learning Outcomes

ILO 1: Discuss the usefulness of portfolios as an assessment tool in education.
ILO 2: Define what a portfolio is.

ILO 1 requires students to present a discussion. They need to thoroughly review, examine, debate or argue the pros and cons of a subject. To do this, they need to write an extended response. ILO 1 can only be assessed by an essay test. However, ILO 2 merely requires students to identify a definition, and a multiple-choice question (MCQ) is good enough to perform this assessment task. A rough verb-based heuristic for this kind of matching is sketched after the following activity.

ACTIVITY 3.1
The following is a list of learning outcomes. Identify the best item type (MCQ or essay) to assess each of them.
1. Name the levels of Bloom's taxonomy and identify the intellectual behaviour each refers to.
2. Devise a table of specifications, complete with information on what to assess and how to assess.
3. Discuss the strengths and weaknesses of using essay questions as an assessment tool.
4. Defend the use of portfolios for classroom assessment.
5. Define norm-referenced and criterion-referenced assessments.
6. Explain the purposes of assessment in education.
7. Describe the process involved in planning a test.
8. Illustrate the use of item analysis in assessing the quality of an MCQ.
9. Differentiate between formative and summative assessments.
10. Develop appropriate scoring rubrics as marking schemes for essay questions.
11. State the advantages and disadvantages of multiple-choice items as an assessment tool.
12. Examine the usefulness of project work as an assessment tool.
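As a rough companion to Section 3.2.3 (and a way to sanity-check your answers to Activity 3.1), the sketch below maps the leading verb of a learning outcome to a suggested item type. The verb groupings are our own simplifications rather than a rule from the module; as noted in Topic 2, verbs such as "explain" can sit at several Bloom levels, so they are deliberately left to manual review:

# Illustrative heuristic only; the verb groupings are assumptions.
LOWER_ORDER = {"name", "define", "state", "list", "identify", "describe"}
HIGHER_ORDER = {"discuss", "defend", "devise", "develop", "examine",
                "differentiate", "evaluate", "justify"}

def suggest_item_type(outcome: str) -> str:
    # MCQs suit recall/comprehension outcomes; essays suit outcomes
    # demanding an extended, higher-order response (see Table 3.3).
    first_verb = outcome.lower().split()[0]
    if first_verb in LOWER_ORDER:
        return "MCQ"
    if first_verb in HIGHER_ORDER:
        return "Essay"
    return "Review manually"    # context-dependent verbs, e.g. "explain"

print(suggest_item_type("Define what a portfolio is"))            # MCQ
print(suggest_item_type("Discuss the usefulness of portfolios"))  # Essay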
3.2.4 Developing a Table of Specifications

Making a test blueprint, or table of specifications, is the next important step that teachers should take. The table presents the topics of the course, the cognitive complexity levels of the test items according to Bloom's taxonomy, and the number of test items corresponding to the number of hours devoted to the topics and course learning outcomes in class. The decision on exactly how many test items to include in a test is based on the importance of the topics and learning outcomes as indicated by student learning time, the item types used, and the amount of time available for testing.

A table of specifications is a two-way table with the cognitive complexity levels across the top, and the topics and course learning outcomes to be covered by the test, together with the hours of interaction, down one side. The item numbers associated with each topic are presented under the complexity level determined by the CLO. Table 3.5 presents an example of a table of specifications with MCQs as the item type. For ease of understanding, let us assume that the test covers only the first three complexity levels of Bloom's taxonomy, namely Knowledge (C1), Comprehension (C2) and Application (C3).

Table 3.5: Table of Specifications: MCQ Item Type

In this example, the vertical columns on the left of the two-way table show the list of topics covered in class and the amount of time spent on those topics. The amount of time, shown in the column "Hours of Interaction", is used as the basis to compute the weightage or percentage (% hours) and the marks allocated. For a test with MCQs, the marks allocated also indicate the number of test items for each topic. In this hypothetical case, the teacher has spent 20 hours teaching the three topics, of which 4 hours are allotted to Topic 1. Thus, 4 hours out of a total of 20 hours amount to 20 per cent, or six items out of the total of 30 items planned by the teacher. The weightages and item counts for Topics 2 and 3 are computed in the same manner: 30 per cent and nine items for Topic 2, and 50 per cent and 15 items for Topic 3.

Based on the cognitive complexity level of the CLO for each topic, the teacher then decides on the number of items to be included under each level. This information is presented in the cells of the "Item No." column. For example, since the cognitive complexity level of CLO 1 for Topic 1 is C2, the teacher has decided to have two items at C1 (items 1 and 2) and four items at C2 (items 10, 11, 12 and 13). Of course, he or she could decide to have all six items framed at C2, but not at C3. For Topic 2, nine items are required at C2; again, the teacher has decided to have some items at C1 (four items) and the rest at C2 (five items) to make up the required number. Topic 3 seems to be the most important topic and requires 15 items, i.e. half of the total in the test; the teacher has decided to have three items at C1, six items at C2 and another six items at C3. Overall, of the 30 items in the test, 30 per cent are at C1, 50 per cent at C2 and 20 per cent at C3.
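The arithmetic in this example is mechanical enough to automate. Here is a minimal Python sketch, using only the hypothetical figures from the text (20 hours of interaction, 30 MCQ items); the function itself is our own illustration, and with less convenient numbers the rounded counts might need manual adjustment so that they still sum to the intended total:

def items_per_topic(hours: dict, total_items: int) -> dict:
    # Allocate test items in proportion to hours of interaction.
    total_hours = sum(hours.values())
    return {topic: round(total_items * h / total_hours)
            for topic, h in hours.items()}

hours = {"Topic 1": 4, "Topic 2": 6, "Topic 3": 10}    # 20 hours in all
print(items_per_topic(hours, total_items=30))
# {'Topic 1': 6, 'Topic 2': 9, 'Topic 3': 15}, i.e. 20%, 30% and 50%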
The teacher, of course, might have a reason for such a distribution of items across complexity levels. Perhaps he or she feels that this is the beginning of the course and wants to focus on the understanding of the key concepts of the course. Whatever it is, the decision is the prerogative of the teacher, who knows best what and how he or she wants to assess the students. Table 3.6 is another example of a table of specifications; this table focuses on essay items.

Table 3.6: Table of Specifications: Essay Questions

The first vertical column on the left presents the five topics identified for assessment, followed by the hours of interaction for each topic in the second column. Based on this information, the teacher can work out the weightage in terms of % hours and marks for each topic. In this hypothetical case, the teacher has spent 50 hours teaching the five topics, of which 5 hours each are allotted to Topics 1 and 2. Thus, 5 hours out of a total of 50 hours amount to 10 per cent, or 10 marks out of the total of 100 marks planned by the teacher. The weightages and marks allotted to the other topics are computed in the same manner: the marks allotted to Topics 2, 3, 4 and 5 are 10, 20, 20 and 40 respectively.

Based on the marks allotted and the cognitive complexity levels of the CLOs, the teacher then decides how to distribute the marks according to the levels of complexity. For example, for Topic 1, he or she can have one essay item carrying 10 marks at C3, or two essay items, one at C2 and the other at C3, each carrying 5 marks. In this hypothetical case, the teacher has decided to have two items for Topic 1 and another two for Topic 2, each carrying 5 marks; these four items make up Section A of the test. For Topic 3, he or she has decided to distribute the 20 marks between two items, each carrying 10 marks, and has done the same for Topic 4; this makes up Section B. Section C is allotted to Topic 5, with two items each carrying 20 marks. This is just one example of how the marks for each topic may be distributed and the number of items decided; there can, of course, be many other variations.

So far, we have looked at the table of specifications in the form of a two-way table. A table of specifications can also take the form of a three-way table, with item types as an additional dimension. Whatever the format, the table of specifications is a very useful document in assessment. It ensures that a fair and representative sample of items or questions appears in the test. Teachers cannot measure every piece of content in the syllabus and cannot ask every question they might wish to ask. A table of specifications allows the teacher to construct a test which focuses on the key contents, as defined by the percentage weights given to them.

A table of specifications provides the teacher with evidence that a test has content validity, that is, that it covers what should be covered. The table also allows the teacher to view the test as a whole. The teacher, especially a newly trained one, is advised to have this table of specifications, together with the subject syllabus, reviewed by a subject expert or the subject head of department to check whether the test plan includes what it is supposed to measure. In other words, it is important that the table of specifications has content validity. To ensure this, students should ideally not be given choices in a test. Without choices, all students are assessed equally.
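Against such a table, a drafted paper can also be checked numerically. The short sketch below (again our own illustration, echoing the essay example above: 50 hours and 100 marks over five topics) verifies that each topic's share of marks mirrors its share of teaching time, which is the narrow numerical sense of content validity discussed here:

import math

hours = {"Topic 1": 5, "Topic 2": 5, "Topic 3": 10,
         "Topic 4": 10, "Topic 5": 20}
marks = {"Topic 1": 10, "Topic 2": 10, "Topic 3": 20,
         "Topic 4": 20, "Topic 5": 40}

def matches_specification(hours: dict, marks: dict) -> bool:
    # The proportion of marks per topic should mirror the proportion
    # of teaching time spent on that topic.
    total_h, total_m = sum(hours.values()), sum(marks.values())
    return all(math.isclose(hours[t] / total_h, marks[t] / total_m)
               for t in hours)

print(matches_specification(hours, marks))    # True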
In other words, it is important that the table of specifications must have content validity. To ensure this, the students should ideally not be given choices in a test. Without choices, all students are thus assessed equally. Copyright © Open University Malaysia (OUM) 58 TOPIC 3 PLANNING CLASSROOM TESTS SELF-CHECK 3.2 What is a table of specifications? ACTIVITY 3.2 1. Have you used a table of specifications? 2. Identify a course of your choice, prepare a table of specifications for a test. Share your answers with your coursemates in the myINSPIRE online forum. 3.2.5 Constructing Test Items Once a valid table of specifications has been prepared, the next step is constructing the test items. While the different item types such as multiple choice, short answer, true-false, matching and essay items are constructed differently, the following principles apply to constructing test items in general. (a) Make the instructions for each type of item simple and brief; (b) Use simple and clear language in the questions; (c) Write items that are appropriate for the learning outcomes to be measured; (d) Do not provide clue or suggest the answer to one question in the body of another question; (e) Avoid writing questions in the negative. If you must use negatives, highlight them, as they may mislead students into answering incorrectly; (f) Specify the precision of answers; (g) Try as far as possible to write your own questions. Check to make sure the questions fit the learning objectives and requirements in the table of specifications if you need to use questions from other sources; and (h) If the item was revised, recheck its relevance. Copyright © Open University Malaysia (OUM) TOPIC 3 PLANNING CLASSROOM TESTS 59 In writing test items, you must also consider the length of the test as well as the reading level of your students. You do not want students to feel rushed and frustrated because they are not able to demonstrate their knowledge of the material in the allotted time. Some general guidelines regarding time requirements for secondary school student test takers are shown in Table 3.7. Table 3.7: Allotment of Time for Each Type of Question Task Approximate Time per Item True-false items 20ă30 seconds Multiple choice (factual) 40ă60 seconds Multiple choice (complex) 70ă90 seconds Matching (five stems/six choices) 2ă4 minutes Short answer 2ă4 minutes Multiple choice (with calculations) 2ă5 minutes Word problems (simple math) 5ă10 minutes Short essays 15ă20 minutes Data analysis/graphing 15ă25 minutes Extended essays 35ă50 minutes If you are combining multiple choice and essay items, these estimates may help you decide how many of each type of items to include. One mistake often made by many educators is having too many questions for the time allowed. Once your items are developed, make sure that you include clear directions to the students. For the objective items, specify that they should select one answer for each item and indicate the point value of each question, especially if you are weighting sections of the test differently. For essay items, indicate the point value and suggested time to be spent on the item (we will discuss essay questions in more detail in Topic 5). If you are teaching a large class with close seating arrangements and are giving an objective test, you may want to consider administering several versions of your test to decrease the opportunities for cheating. You can create versions of your test with different arrangements of the items. 
More detailed guidelines for preparing and writing multiple-choice, short-answer, true-false and matching questions, essays, portfolios and projects will be discussed in Topics 4, 5, 6 and 7 respectively.

ACTIVITY 3.3
To what extent do you agree with the allotment of time for each type of question shown in Table 3.7? Justify your answer by discussing it with your coursemates in the myINSPIRE online forum.

3.2.6 Preparing Marking Schemes

Preparing a marking scheme well in advance of the testing date gives teachers ample time to review their questions and make changes to the answers when necessary. The teacher should make it a habit to write a model answer which can be easily understood by others. This model answer can be used by other teachers who act as external examiners, if need be.

For objective test items, the model answers are simple: the marking scheme is just a list of answers with the marks allotted for each. For essay items, however, the marking scheme can be more complicated and requires special skills and knowledge to prepare. It may take the form of a checklist, a rubric or a combination of both. Refer to Topic 5 for a detailed explanation of marking schemes.

Coordination on the use of marking schemes should be done once the test answer scripts are collected. Teachers should read the answers from a sample of scripts and review the correct answers in the marking scheme. Teachers may sometimes find that students have interpreted a test question in a way that is different from what was intended, and students may come up with excellent answers that fall slightly outside what has been asked; consider awarding these students marks accordingly. Likewise, teachers should make a note in the marking scheme of any error made early in an answer but carried through the rest of it; marks should not be deducted again if the rest of the response is sound.

SELF-CHECK 3.3
Why is it necessary for a test to be accompanied by a marking scheme?

3.3 ASSESSING TEACHER'S OWN TEST

Regardless of the kind of tests teachers use, they can assess their effectiveness by asking the following questions:

(a) Did I Test for What I Thought I Was Testing for? If you wanted to know whether students could apply a concept to a new situation, but mostly asked questions determining whether they could label parts or define terms, then you tested for recall rather than application.

(b) Did I Test What I Taught? For example, your questions may have tested the students' understanding of surface features or procedures, while you had been lecturing on causation or relation; not so much what the names of the bones of the foot are, but how they work together when we walk.

(c) Did I Test for What I Emphasised in Class? Make sure that you have asked most of the questions about the material you feel is the most important, especially if you have emphasised it in class. Avoid questions on obscure material that are weighted the same as questions on crucial material.

(d) Is the Material I Tested for Really What I Wanted Students to Learn? For example, if you wanted students to use analytical skills such as the ability to recognise patterns or draw inferences, but only used true-false questions requiring non-inferential recall, you might try writing more complex true-false questions or MCQs.

Students should know what is expected of them.
They should be able to identify the characteristics of a satisfactory answer and understand the relative importance of those characteristics. This can be achieved in many ways: you can provide feedback on tests, describe your expectations in class, or post model solutions on a class blog. Teachers are also encouraged to make notes on the scripts; when exams are returned to the students, these notes will help them understand their mistakes and correct them.

SELF-CHECK 3.4
Describe the steps involved in planning a test.

Summary

The first step in test planning is to decide on the purpose of the test. Tests can be used for many different purposes.

The next step is to consider the learning outcomes and their complexity levels as defined by Bloom's taxonomy. Teachers will have to select the appropriate knowledge and skills to be assessed and include more questions about the most important learning outcomes. The learning outcomes that teachers want to emphasise determine not only what material to include in the test, but also the specific form the test will take.

Making a test blueprint or table of specifications is the next important step. The table describes the topics, the expected behaviour of the students, and the number of questions on the test corresponding to the number of hours devoted to each topic in class. The table of specifications helps ensure that there is a match between what is taught and what is tested.

Classroom assessment is driven by classroom teaching, which itself is driven by learning outcomes. The test format used is one of the main driving factors in students' learning behaviour.

Key Terms

Checklist
Complexity levels of Bloom's taxonomy
Course learning outcome (CLO)
Hours of interaction
Intended learning outcome (ILO)
Marking schemes
Rubrics
Table of specifications

Topic 4: How to Assess? – Objective Tests

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Define an objective test and list the different types of objective tests;
2. Construct short-answer questions;
3. Construct multiple-choice questions;
4. Develop true-false questions; and
5. Prepare matching questions.

INTRODUCTION

In Topic 2, we discussed the need to assess students holistically based on cognitive, affective and psychomotor learning outcomes, and in Topic 3, we looked at the steps involved in planning a class test. In this topic, we will focus on using objective tests to assess various kinds of behaviour in the classroom. Four types of objective tests are examined, the guidelines for constructing each type are discussed, and their advantages and limitations are explained.

4.1 WHAT IS AN OBJECTIVE TEST?

When objective tests were first used in 1845 by George Fisher in the United States, they were not well received by society. Over the years, however, they have gained acceptance and are now widely used in schools, industries, businesses, professional organisations, universities and colleges. In fact, they have become the most popular format for assessing various types of human abilities, competencies and socio-emotional attributes. What, then, is an objective test?
An objective test is a written test consisting of items or questions which require the respondent to answer either by supplying a word, phrase or symbol, or by selecting from a list of possible answers. The former are referred to as supply-type items, the latter as selection-type items. The common supply-type items are short-answer questions; the common selection-type items are multiple-choice questions, true-false questions and matching questions.

Here the word objective means "accurate". An objective item or question is "accurate" because there is only one correct answer, and the marking cannot be influenced by the personal preferences and prejudices of the marker. In other words, it is not "subjective" and not open to varying interpretations. This is one of the reasons why the objective test is popular for measuring human abilities, competencies and many other psychological attributes such as personality, interest and attitude.

Figure 4.1 describes how objective tests have been used in Malaysian schools since their inception and how they are used today.

Figure 4.1: Objective tests in Malaysian schools

Objective tests vary depending on how the questions are presented. The four common types of questions used in most objective tests are multiple-choice questions, true-false questions, matching questions and short-answer questions (refer to Figure 4.2).

Figure 4.2: Common formats of objective tests

4.2 MULTIPLE-CHOICE QUESTIONS (MCQs)

Let us take a look at one of the most popular objective test formats, the multiple-choice question.

4.2.1 What is a Multiple-choice Question?

Multiple-choice questions or MCQs are widely used in many different settings because they can measure low-level cognitive outcomes as well as more complex cognitive outcomes. It is challenging to write test items that tap into higher-order thinking, and all the demands of good item writing can only be met when test writers have been well trained. Above all, test writers need expertise in the subject area being tested so that they can gauge the difficulty and content coverage of the test items.

Multiple-choice questions are the most difficult objective items to prepare. These questions have two parts:

(a) A stem that contains the question; and
(b) Four or five options, one of which contains the correct answer, called the key response (three-option multiple-choice questions are also gaining acceptance). The incorrect options are called distractors.

The stem may be presented as a question or a statement, while the options can be words, phrases, numbers, symbols and so forth. The role of the distractors is to attract respondents who are not sure of the correct answer. A traditional multiple-choice question (or item) is one in which a student chooses one answer from a number of choices supplied (as illustrated in Figure 4.3).

Figure 4.3: Multiple-choice question

(a) The stem should:
(i) Be in the form of a question or a statement to be completed;
(ii) Be expressed clearly and concisely, avoiding poor grammar, complex syntax, ambiguity and double negatives;
(iii) Generally present a positive question (if a negative is used, it should be emphasised with italics or underlining);
(iv) Generally ask for one answer only (the correct or the best answer); and
(v) Include as many of the words common to all alternatives as possible.
(b) The options or alternatives should:
(i) Number either three, four or five per item, all mutually exclusive and not too long;
(ii) All follow grammatically from the stem and be parallel in grammatical form;
(iii) Be unambiguous and expressed simply enough to make clear the essential differences between them; and
(iv) Include an intended answer or key that is clearly correct to the informed, while the distractors should be definitely incorrect, but plausible.

SELF-CHECK 4.1
1. What is an objective test?
2. Why are multiple-choice questions (MCQs) a popular form of objective test?

4.2.2 Construction of Multiple-choice Questions

Did you know that MCQ test writing is a profession? By that, we mean that good test writers are professionally trained in designing test items. Test writers have knowledge of the rules of constructing items, but at the same time they have the creativity to construct items that capture students' attention. Test items need to be succinct but clear in meaning. McKenna and Bull (1999) offered some guidelines for constructing stems for multiple-choice questions. All the options in multiple-choice items need to be plausible, but they also need to separate students of different ability levels. Let us take a look at these guidelines.

(a) When writing stems, present a single, definite statement to be completed or answered by one of the several given choices (see Example 4.1).

Example 4.1:

Weak Question:
World War II was:
A. The result of the failure of the League of Nations
B. Horrible
C. Fought in Europe, Asia and Africa
D. Fought during the period of 1939–1945

Improved Question:
In which of these time periods was World War II fought?
A. 1914–1917
B. 1929–1934
C. 1939–1945
D. 1951–1955

Note: In the weak question, the stem gives no clue to what the question is asking. The improved version identifies the question more clearly and offers the student a set of homogeneous choices.

(b) When writing stems, avoid unnecessary and irrelevant material (see Example 4.2).

Example 4.2:

Weak Question:
For almost a century, the Rhine river has been used by Europeans for a variety of purposes. However, in recent years, the increased river traffic has resulted in increased levels of diesel pollution in the waterway. Which of the following would be the most dramatic result if, because of the pollution, the Council of Ministers of the European Community decided to close the Rhine to all shipping?
A. Increased prices for Ruhr products
B. Shortage of water for Italian industries
C. Reduced competitiveness of the French Aerospace Industry
D. Closure of the busy river Rhine ports of Rotterdam, Marseilles and Genoa

Improved Question:
Which of the following would be the most dramatic result if, because of diesel pollution from ships, the river Rhine was closed to all shipping?
A. Increased prices for Ruhr products
B. Shortage of water for Italian industries
C. Reduced competitiveness of the French Aerospace Industry
D. Closure of the busy river Rhine ports of Rotterdam, Marseilles and Genoa

Note: The weak question is too wordy and contains unnecessary material.

(c) When writing stems, use clear, straightforward language.
Questions constructed with complex wording may become a test of reading comprehension rather than an assessment of the student's performance on a specific learning outcome (see Example 4.3).

Example 4.3:

Weak Question:
As the level of fertility approaches its nadir, what is the most likely ramification for the citizenry of a developing nation?
A. A decrease in the labour force participation rate of women
B. A downward trend in the youth dependency ratio
C. A broader base in the population pyramid
D. An increased infant mortality rate

Improved Question:
A major decline in fertility in a developing nation is likely to produce
A. A decrease in the labour force participation rate of women
B. A downward trend in the youth dependency ratio
C. A broader base in the population pyramid
D. An increased infant mortality rate

Note: In the improved question, the word "nadir" is replaced with "decline" and "ramification" with "produce". These are more straightforward words.

(d) When writing stems, use negatives sparingly. If negatives must be used, capitalise, underscore or bold them (see Example 4.4).

Example 4.4:

Weak Question:
Which of the following is not a symptom of osteoporosis?
A. Decreased bone density
B. Frequent bone fractures
C. Raised body temperature
D. Lower back pain

Improved Question:
Which of the following is a symptom of osteoporosis?
A. Decreased bone density
B. Raised body temperature
C. Painful joints
D. Hair loss

Note: The improved question is stated in the positive so as to avoid the use of the negative "not".

(e) When writing stems, put as much of the question in the stem as possible, rather than duplicating material in each of the options (see Example 4.5).

Example 4.5:

Weak Question:
Theorists of pluralism have asserted which of the following?
A. The maintenance of democracy requires a large middle class
B. The maintenance of democracy requires autonomous centres of countervailing power
C. The maintenance of democracy requires the existence of a multiplicity of religious groups
D. The maintenance of democracy requires the separation of governmental powers

Improved Question:
Theorists of pluralism have asserted that the maintenance of democracy requires
A. A large middle class
B. The separation of governmental powers
C. Autonomous centres of countervailing power
D. The existence of a multiplicity of religious groups

Note: In the improved question, the phrase "the maintenance of democracy" is included in the stem so that it is not duplicated in each option.

(f) When writing stems, avoid giving away the answer through grammatical cues (see Example 4.6).

Example 4.6:

Weak Question:
A fertile area in the desert in which the water table reaches the ground surface is called an
A. Mirage
B. Oasis
C. Lake
D. Polder

Improved Question:
A fertile area in the desert in which the water table reaches the ground surface is called a/an
A. Lake
B. Mirage
C. Oasis
D. Polder

Note: The weak question uses the article "an", which identifies choice B as the correct response. Ending the stem with "a/an" improves the question.

(g) When writing stems, avoid asking for an opinion as much as possible.

(h) Avoid using the words "always" and "never" in the stem, as test-wise students are likely to rule such universal statements out of consideration.
(i) When writing distractors for single-response MCQs, make sure that there is only one correct response (see Example 4.7).

Example 4.7:

Weak Question:
What is the main source of pollution of Malaysian rivers?
A. Land clearing
B. Open burning
C. Solid waste dumping
D. Coastal erosion

Improved Question:
What is the main source of pollution of Malaysian rivers?
A. Open burning
B. Coastal erosion
C. Solid waste dumping
D. Carbon dioxide emission

Note: In the weak question, both options A and C can be considered correct.

(j) When writing distractors, use only plausible and attractive alternatives (see Example 4.8).

Example 4.8:

Weak Question:
Who was the third Prime Minister of Malaysia?
A. Hussein Onn
B. Ghafar Baba
C. Mahathir Mohamad
D. Musa Hitam

Improved Question:
Who was the third Prime Minister of Malaysia?
A. Hussein Onn
B. Abdullah Badawi
C. Mahathir Mohamad
D. Abdul Razak Hussein

Note: In the weak question, B and D are not serious distractors.

(k) When writing distractors, avoid the choices "All of the above" and "None of the above" if possible. If you do include them, make sure that they appear as correct answers some of the time.

It is tempting to resort to these alternatives, but their use can be flawed. To begin with, they often appear as an alternative that is not the correct response. If you do use them, be sure that they constitute the correct answer part of the time. An "All of the above" alternative can be exploited by a test-wise student, who will recognise it as the correct choice after identifying only two correct alternatives. Similarly, a student who can identify one wrong alternative can then rule this response out. Clearly, the student's chance of guessing the correct answer improves as he or she employs these techniques. Although a similar process of elimination is not possible with "None of the above", when this option is used as the correct answer the question only tests the student's ability to rule out the wrong answers, which does not guarantee that the student knows the correct one (Gronlund, 1988).

(l) Distractors based on common student errors or misconceptions are very effective. One technique for compiling distractors is to ask students to respond to open-ended short-answer questions, perhaps as formative assessments. Identify which incorrect responses appear most frequently and use them as distractors for a multiple-choice version of the question.

(m) Do not create distractors that are so close to the correct answer that they may confuse students who really know the answer to the question. Distractors should differ from the key in a substantial way, not just in some minor nuance of phrasing or emphasis.

(n) Provide a sufficient number of distractors. You will probably choose to use three, four or five alternatives in a multiple-choice question. Until recently, it was thought that three or four distractors were necessary for an item to be suitably difficult; however, a study by Owen and Freeman (1987) suggests that three choices are sufficient. Clearly, the higher the number of distractors, the less likely it is for the correct answer to be chosen through guessing, provided that all alternatives are of equal difficulty.

ACTIVITY 4.1
1. Do you agree that you should not use negatives in the stems of MCQs? Why?
2. Do you agree that you should not use "All of the above" and "None of the above" as distractors in MCQs? Why?
3. Select 10 multiple-choice questions in your subject area and analyse the distractors of each item using the guidelines mentioned earlier.
4. Suggest how you would improve weak distractors.
Share your answers with your coursemates in the myINSPIRE online forum.

4.2.3 Advantages of Multiple-choice Questions

Multiple-choice questions are widely used to measure knowledge outcomes and various other types of learning outcomes. They are popular for the following reasons:

(a) Learning outcomes from simple to complex can be measured;
(b) Highly structured and clear tasks are provided;
(c) A broad sample of achievement can be measured;
(d) Incorrect alternatives or options provide diagnostic information;
(e) Scores are less influenced by guessing than in true-false items;
(f) Scores are more reliable than those of subjectively scored items (such as essays);
(g) Scoring is easy, objective and reliable;
(h) Item analysis can reveal how difficult each item was and how well it discriminated between the stronger and weaker students in the class;
(i) Performance can be compared from class to class and year to year;
(j) They can cover a lot of material very efficiently (about one item per minute of testing time); and
(k) Items can be written so that students must discriminate among options that vary in degree of correctness.

4.2.4 Limitations of Multiple-choice Questions

While there are many advantages to using multiple-choice questions, there are also limitations:

(a) Constructing good items is time-consuming;
(b) It is frequently difficult to find plausible distractors;
(c) MCQs are not as effective for measuring some types of problem-solving skills or the ability to organise and express ideas;
(d) Scores can be influenced by students' reading ability;
(e) There is a lack of feedback on individual thought processes; it is difficult to determine why individual students selected incorrect responses;
(f) Students can sometimes read more into the question than was intended;
(g) They often focus on testing factual information and fail to test higher levels of cognitive thinking;
(h) Sometimes there is more than one defensible "correct" answer;
(i) They place a high degree of dependence on the student's reading ability and the constructor's writing ability;
(j) They do not provide a measure of writing ability; and
(k) They may encourage guessing.

Last but not least, let us look at Figure 4.4, which highlights some procedural rules for constructing multiple-choice questions.

Figure 4.4: Procedural rules for the construction of multiple-choice questions

SELF-CHECK 4.2
1. What are some advantages of using multiple-choice questions?
2. List some limitations or weaknesses of multiple-choice questions.

4.3 TRUE-FALSE QUESTIONS

The next type of objective test is the true-false question. Here, we will discuss the rationale for its use as well as its limitations.

4.3.1 What are True-False Questions?

In the most basic format, true-false questions are those in which a statement is presented and the student indicates in some manner whether the statement is true or false.
In other words, there are only two possible responses for each item and the student chooses between them. A true-false question is a specialised form of the multiple-choice format in which there are only two possible alternatives. These questions can be used when the test designer wishes to measure a student's ability to identify whether statements of fact are accurate or not.

True-false questions can be used to test knowledge and judgement in many subjects. When grouped together, a series of true-false questions on a specific topic or scenario can test a more complex understanding of an issue. They can be structured to lead a student through a logical pathway, and can reveal part of the thinking process employed by the student to solve a given problem. Let us see Example 4.9.

Example 4.9:
A whale is a mammal because it gives birth to its young. (True / False)

4.3.2 Advantages of True-False Questions

True-false questions can be written quickly and can cover a lot of content. They are well suited to testing students' recall or comprehension, and students can generally respond to many questions covering a lot of content in a fairly short amount of time. From the teacher's perspective, these questions can be written quickly and are easy to score. Since they can be objectively scored, the scores are more reliable than for items that are at least partially dependent on the teacher's judgement. They are also generally easier to construct than multiple-choice questions because there is no need to develop distractors, making them less time-consuming to prepare.

4.3.3 Limitations of True-False Questions

However, true-false questions have a number of limitations, notably:

(a) Guessing
A student has a one-in-two chance of guessing the correct answer to a question. Scores on true-false items therefore tend to be high because of the ease of guessing correct answers when the answers are not known. With only two choices (true or false), the student can expect to guess correctly on half of the items for which the correct answers are not known; the expected score is the number of answers known plus half the number guessed. Thus, if a student knows the correct answers to 10 questions out of 20 and guesses on the other 10, the student can expect a score of 15 (10 + 10/2). The teacher can anticipate scores ranging from approximately 50 per cent for a student who did nothing but guess on all items, to 100 per cent for a student who knew the material.

(b) Tendency to Use the Original Text
Since these items are in the form of statements, there is sometimes a tendency to take quotations from the text, expecting the student to recognise a correct quotation or note a (sometimes minor) change in wording. There may also be a tendency to include trivial or inconsequential material from the text. Both of these practices are discouraged.

(c) Difficult to Set
It can be difficult to write a statement which is unambiguously true or false, particularly for complex material.

(d) Unable to Discriminate Different Abilities
The format does not discriminate among students of different abilities as well as other question types do.

4.3.4 Suggestions for Constructing True-False Questions

Here are some suggestions for constructing true-false questions:

(a) Include only one main idea in each item (see Example 4.10).
Example 4.10:
Poor Item: The study of biology helps us understand living organisms and predict the weather.
Better Item: The study of biology helps us understand living organisms.

(b) As in multiple-choice questions, use negatives sparingly. Also avoid double negatives, as they tend to contribute to the ambiguity of a statement. Words like none, no and not should be avoided as far as possible (see Example 4.11).

Example 4.11:
Poor Item: None of the steps in the experiment were unnecessary.
Better Item: All the steps in the experiment were necessary.

(c) Avoid broad, general statements. Most such statements are false unless qualified (see Example 4.12).

Example 4.12:
Poor Item: Short-answer questions are more favourable than essay questions in testing.
Better Item: Short-answer questions are more favourable than essay questions in testing factual information.

(d) Avoid long, complex sentences. Such sentences also test reading comprehension besides the achievement to be measured (see Example 4.13).

Example 4.13:
Poor Item: Despite the theoretical and experimental difficulties of determining the exact pH value of a solution, it is possible to determine whether a solution is acidic by the red colour formed on litmus paper when it is inserted into the solution.
Better Item: Litmus paper turns red in an acidic solution.

(e) Try using true-false questions in combination with other materials, such as graphs, maps and written material. This combination allows for the testing of more advanced learning.

(f) Avoid lifting statements directly from assigned readings, notes or other course materials, so that recall alone will not lead to a correct answer.

(g) In general, avoid the use of words which would signal the correct response to the test-wise student. Absolutes such as "none", "never", "always", "all" and "impossible" tend to be false, while qualifiers such as "usually", "generally", "sometimes" and "often" are likely to be true.

(h) A similar situation occurs with the use of "can" in a true-false statement. If the student knows of a single case in which something "can" be done, the statement would be true.

(i) Ambiguous or vague terms, such as "largely", "long time", "regularly", "some" and "usually", are best avoided in the interest of clarity. Some terms have more than one meaning and may be interpreted differently by different individuals.

(j) True statements should be about the same length as false statements. There is a tendency to add details to true statements to make them more precise.

(k) Word the statement so precisely that it can be judged unmistakably as either true or false.

(l) Statements of opinion should be attributed to some source.

(m) Keep statements short and use simple language structure.

(n) Avoid verbal clues (specific determiners) that can indicate the answer.

(o) Test important ideas rather than trivia.

(p) Do not present items in an easily learned pattern.

4.4 MATCHING QUESTIONS

Matching questions are used to measure a student's ability to identify the relationship between two lists of terms, phrases, statements, definitions, dates, events, people and so forth. In addition, one matching question can replace several true-false questions.

4.4.1 Construction of Matching Questions

In developing matching questions, you have to identify two columns of material listed vertically.
The items in Column A (or I) are usually called premises and are assigned numbers (1, 2, 3 and so on), while the items in Column B (or II) are called responses and are designated capital letters (A, B, C and so on). The student reads a premise in Column A and finds the correct response from among those in Column B, then writes the letter of the correct response in the blank beside the premise in Column A. An alternative is to have the student draw a line from the correct response to the premise, but this is more time-consuming to score. One way to reduce the possibility of guessing correct answers is to list a larger number of responses (Column B) than premises (Column A), as shown in Example 4.14.

Example 4.14:

Directions: Column A contains statements describing selected Asian cities. For each description, find the appropriate city in Column B. Each city in Column B can be used only once.

Column A:
1. Ancient capital of Thailand: ___________
2. Largest city in Sumatera: ___________
3. Largest city in Myanmar: ___________
4. Formerly known as Saigon: ___________
5. Former capital of Pakistan: ___________

Column B:
A. Ayutthaya
B. Ho Chi Minh City
C. Karachi
D. Medan
E. Yangon
F. Hanoi
G. Surabaya

Another way to decrease the possibility of guessing is to allow responses to be used more than once. Directions to the students should be very clear about how responses may be used. Some psychometricians suggest giving no more than five to eight premises (Column A) in one set. For each premise, the student has to read through the entire list of responses (or those still unused) to find the matching response. For this reason, the shorter elements should be in Column B rather than Column A, to minimise the amount of reading needed for each item. Responses (Column B) should be listed in logical order if there is one (chronological, by size and so on); if there is no apparent order, the responses should be listed alphabetically. Premises (Column A) should not be listed in the same order as the responses. Care must be taken to ensure that the association keyed as the correct response is unquestionably correct, and that the numbered item could not rightly be associated with any other choice.

4.4.2 Advantages of Matching Questions

Like other types of assessment, matching questions have both advantages and disadvantages. Let us go through the advantages first.

(a) Matching questions are particularly good at assessing a student's understanding of relationships. They can test recall by requiring a student to match the following elements (McBeath, 1992):
(i) Definitions – Terms;
(ii) Historical events – Dates;
(iii) Achievements – People;
(iv) Statements – Postulates; and
(v) Descriptions – Principles.

(b) They can also assess a student's ability to apply knowledge by requiring the test-taker to match the following:
(i) Examples – Terms;
(ii) Functions – Parts;
(iii) Classifications – Structures;
(iv) Applications – Postulates; and
(v) Problems – Principles.

(c) The matching-question format is really a variation of the multiple-choice format. If you find that you are writing MCQs which share the same answer choices, you may consider grouping the questions into a matching item.

(d) Matching questions are generally easy to write and score when the content being tested and the objectives are suitable for this format.
(e) Matching questions are highly efficient, as a large amount of knowledge can be sampled in a short amount of time.

4.4.3 Limitations of Matching Questions

There are also limitations to this type of assessment:

(a) Matching questions are limited to material that can be listed in two columns, and there may not be much material that lends itself to such a format;
(b) If there are four items in a matching question and the student knows the answers to three of them, the fourth is a giveaway through elimination;
(c) It is difficult to differentiate between effective and ineffective items;
(d) They often lead to the testing of trivial facts or bits of information; and
(e) They are often criticised for encouraging rote memorisation.

4.4.4 Suggestions for Constructing Good Matching Questions

When assessing students, we must prepare quality questions. Here are some suggestions for constructing good matching questions:

(a) Provide clear directions, explaining how many times responses can be used;
(b) Keep the information in each column as homogeneous as possible;
(c) Include more responses than premises, or allow the responses to be used more than once;
(d) Put the items with more words in Column A;
(e) Correct answers should not be obvious to those who have not learnt the content being tested;
(f) Keywords should not appear in both the premise and the response, providing clues to the correct answer; and
(g) All of the responses and premises for a matching item should appear on the same page.

SELF-CHECK 4.3
1. What are some advantages of matching questions?
2. List some limitations of matching questions.

4.5 SHORT-ANSWER QUESTIONS

A short-answer question is basically a supply-type item. It exists in two formats, namely the direct question format and the completion question format. Table 4.1 shows examples of both.

Table 4.1: Direct Question versus Completion Question

Direct Question: Who was the first Prime Minister of Malaysia? (Answer: Tunku Abdul Rahman)
Completion Question: The first Prime Minister of Malaysia was ___________. (Answer: Tunku Abdul Rahman)

Direct Question: What is the value of x in the equation 2x + 5 = 9? (Answer: 2)
Completion Question: In the equation 2x + 5 = 9, x = __________. (Answer: 2)

You may refer to Nitko (2004) for more examples.

4.5.1 Strengths and Weaknesses of Short-answer Questions

Short-answer questions are generally used to measure simple learning outcomes. They are used almost exclusively to measure memorised information (except for learning outcomes on problem-solving in Mathematics and Science). This has partly made the short-answer question one of the easiest to construct.

Another strength of short-answer questions is that the possibility of guessing, which often arises with selection-type items, is reduced. Learners must supply the correct answer when they respond to the question: they must either recall the information asked for or make the necessary computations to obtain the answer. They cannot rely on partial knowledge to choose the correct answer from a list of alternatives. Many short-answer questions can also be set for a given testing period, so a test paper of short-answer questions can cover a fairly wide range of the course content to be assessed. This enhances the content validity of the test. The sketch below illustrates why guessing matters so much less for supply-type items than for selection-type items.
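As a small probability sketch (an illustration, not part of the module's own material), the expected score from blind guessing on selection-type items is simply the number of items divided by the number of options per item, while for supply-type items it is effectively zero:

# Expected number of correct answers from blind guessing on
# selection-type items with n_options choices per item.
def expected_guess_score(n_items, n_options):
    return n_items / n_options

print(expected_guess_score(20, 2))   # 10.0 on a 20-item true-false test
print(expected_guess_score(20, 4))   # 5.0 on a 20-item four-option MCQ test
# For supply-type items (e.g. short answer), the chance of guessing the
# exact word or number is effectively zero, so the expected guess score is ~0.

This matches the earlier true-false example: guessing on 10 unknown items adds an expected 10/2 = 5 marks to the 10 marks for known answers, giving 15.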
One major weakness of short-answer questions is that they cannot be used to measure complex learning outcomes such as organising ideas, presenting an argument or evaluating information; all that is required of learners is to provide a word, phrase or symbol.

Scoring answers to short-answer questions can also pose a problem. Unless the question is carefully phrased, learners can provide answers of varying degrees of correctness. For example, the answer to a question such as "When was Malaysia formed?" could be either "In 1963" or "On 16 September 1963". The teacher has to decide whether learners who give the partial answer have the same level of knowledge as those who provide the complete answer. Besides this, learners' answers can be contaminated by spelling errors. If spelling is taken into consideration, the test scores will reflect learners' spelling ability as well as their knowledge of the content assessed. If spelling is not considered in the scoring, the teacher has to decide whether a misspelled word actually represents the correct answer.

4.5.2 Guidelines on Constructing Short-answer Questions

Although the construction of short-answer questions is comparatively easier than that of other types of objective items, they have a variety of defects which should be avoided to ensure that they function as intended. The following are some guidelines for constructing short-answer questions.

(a) Word the question so that the intended answer is brief and specific. As far as possible, the question should be phrased in such a way that only one answer is correct (see Example 4.15).

Example 4.15:
Poor Item: An animal that eats the flesh of other animals is ____________. (Possible answers: a wolf, a lion, hungry, etc.)
Better Item: An animal that eats the flesh of other animals is classified as ______________. (One specific answer: carnivorous)

(b) Use direct questions instead of incomplete statements. The meaning of an item is often clearer when it is phrased as a direct question (see Example 4.16).

Example 4.16:
Poor Item: The author of Alice in Wonderland was ______________. (Possible answers: a story writer, a mathematician, an Englishman, buried in 1898)
Better Item: What is the pen name of the author of Alice in Wonderland? (Answer: Lewis Carroll)

(c) If the question requires a numerical answer, indicate the units in which the answer is to be expressed (see Example 4.17).

Example 4.17:
Poor Item: When did Columbus discover America? (Possible answers: in the 15th century, 1492)
Better Item: In what year did Columbus discover America? (Answer: 1492)

(d) For an incomplete-statement type of question, put the blank towards the end of the sentence (see Example 4.18).

Example 4.18:
Poor Item: ____________ is the capital of Malaysia.
Better Item: The capital of Malaysia is _____________. (Answer: Kuala Lumpur)

(e) For an incomplete-statement type of question, limit the blanks to one or two. If there are more than two blanks in a statement, the question becomes unintelligible or ambiguous (see Example 4.19).

Example 4.19:
Poor Item: _________ and __________ are two methods of scoring _________.
Better Item: Two different methods of scoring essay tests are the __________ and ___________ methods. (Answers: analytic, holistic)
(f) Avoid irrelevant clues (see Example 4.20).

Example 4.20:
Poor Item: A specialist in urban planning is called an ___________. (Answer: urbanist)
Better Item: A specialist in city planning is called a(n) ____________. (Answer: urbanist)

(g) Do not copy statements verbatim from textbooks. When you copy material, you encourage students to memorise by rote.

(h) A completion item should omit important words, not trivial words. Use the item to assess a student's knowledge of an important fact or concept.

(i) Keep all the blanks of completion items the same length so as not to cue the students to the possible answer.

SELF-CHECK 4.4
1. What are the strengths of short-answer questions?
2. Elaborate on some weaknesses of short-answer questions.

ACTIVITY 4.2
1. Select five true-false questions in your subject area and analyse each item using the guidelines mentioned earlier.
2. Select five matching questions in your subject area and analyse each item using the guidelines mentioned earlier.
3. Suggest how you would improve the weak items for each type of question.
Share your answers with your coursemates in the myINSPIRE online forum.

Summary

An objective test is a written test consisting of items or questions which require the respondent to supply a word, phrase or symbol, or to select from a list of possible answers. An objective item or question is "accurate" because the marking cannot be influenced by the personal preferences and prejudices of the marker.

Objective tests vary depending on how the questions are presented. The four common types of questions used in most objective tests are multiple-choice questions, true-false questions, matching questions and short-answer questions.

Multiple-choice questions have two parts: a stem that contains the question, and three, four or five options, one of which contains the correct answer. The correct option is called the key response, and the incorrect options are called distractors.

Multiple-choice questions are widely used because they can measure learning outcomes from simple to complex. They are highly structured, and clear tasks are provided to test a broad sample of what has been learnt. However, they are difficult to construct, tend to measure low-level learning outcomes, lend themselves to guessing and do not measure students' writing ability.

True-false questions are those in which a statement is presented and the student indicates whether the statement is true or false. They can be written quickly and are easy to score, and since they can be objectively scored, the scores are more reliable than for items that are at least partially dependent on the teacher's judgement. Avoid lifting statements directly from assigned readings, notes or other course materials, so that recall alone will not lead to a correct answer.

Matching questions are used to measure a student's ability to identify the relationship between two lists of terms, phrases, statements, definitions, dates, events, people and so forth. To reduce the possibility of guessing correct answers in matching questions, list a larger number of responses than premises, or allow responses to be used more than once.

In writing test items, you must consider the length of the test or examination as well as the reading level of your students.

The two types of short-answer questions are direct questions and completion questions.
Key Terms

Allotment of time
Alternatives
Distractors
Guessing
Matching questions
Multiple-choice questions
Objective tests
Premises
Responses
Short-answer questions
Stem
True-false questions

References

Gronlund, N. E. (1988). How to construct achievement tests. Englewood Cliffs, NJ: Prentice Hall.
McBeath, R. (1992). Instructing and evaluating in higher education: A guidebook for planning learning outcomes. Englewood Cliffs, NJ: Educational Technology.
McKenna, C., & Bull, J. (1999). Designing effective objective test questions: An introductory workshop. Retrieved from https://bit.ly/2It9v8K
Nitko, A. J. (2004). Educational assessment of students (4th ed.). Upper Saddle River, NJ: Pearson.
Owen, S. V., & Freeman, R. D. (1987). What's wrong with three-option multiple items? Educational & Psychological Measurement, 47, 513–522.

Topic 5: How to Assess? – Essay Tests

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Define and list the criteria for an essay question;
2. Explain the formats of essay tests;
3. List the advantages and limitations of essay questions;
4. Construct well-written essay questions that assess given learning outcomes; and
5. Describe different types of marking schemes for essays.

INTRODUCTION

In Topic 4, we discussed in detail the use of objective tests in assessing students. In this topic, we will examine a different type of test: the essay test. The essay test is a popular technique for assessing learning and is used extensively at all levels of education. It is also widely used in assessing learning outcomes in business and professional examinations. Essay questions are used because they challenge students to create their own responses rather than simply selecting a response. They have the potential to reveal students' abilities to reason, create, analyse and synthesise, which may not be effectively assessed using objective tests.

5.1 WHAT IS AN ESSAY QUESTION?

According to Stalnaker (1951), an essay is "a test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject." Though this definition was provided a long time ago, it remains comprehensive. Elaborating on it, Reiner, Bothell, Sudweeks and Wood (2002) argued that to qualify as an essay question, an item should meet the following four criteria:

(a) The learner has to compose rather than select a response. In essay questions, students have to construct their own answer and decide what material to include in their response. Objective test questions (MCQ, true-false, matching), on the other hand, require students to select the answer from a list of possibilities.

(b) The response will consist of one or more sentences. Students do not respond with a "yes" or "no" but instead respond in the form of sentences. In theory, there is no limit to the length of the answer; in most cases, however, its length is predetermined by the demands of the question and the time limit allotted for the test question.

(c) There is no single correct response. In other words, the question should be composed so that it does not ask for one single correct response.
For example, the question "Who killed J.W.W. Birch?" assesses verbatim recall or memory and not the ability to think; hence, it cannot qualify as an essay question. You can modify it to: "Who killed J.W.W. Birch? Explain the factors that led to the killing." This is now an essay question that assesses students' ability to think and to give reasons for the killing, supported with relevant evidence.

(d) The accuracy and quality of students' responses to essay questions must be judged subjectively by a specialist in the subject. The nature of essay questions is such that only specialists in the subject can judge to what degree responses to an essay question are complete, accurate and relevant. Good essay questions encourage students to think deeply, and their answers can be judged only by someone with appropriate experience and expertise in the content area. Thus, content expertise is essential for both writing and grading essay tests. For example, the question "List three reasons for the opening of Penang by the British in 1786" requires students to recall a set list of items; the person marking the answer does not have to be a subject matter expert to know whether the student has listed the three reasons correctly, as long as the list of three reasons is available as an answer key. For the question "To what extent was commerce the main reason for the opening of Penang by the British in 1786?", however, a subject matter expert is needed to grade the answer.

5.2 FORMATS OF ESSAY TESTS

Essay formats are usually classified into two groups: restricted response essay questions and extended response essay questions. Both types are useful tools, but for different purposes.

(a) Restricted Response Essay Questions
Restricted response essay questions restrict or limit both the content and the form of students' answers. The following are three examples:
(i) Discuss two advantages and two disadvantages of essay questions in measuring students' performance.
(ii) List five guidelines for writing good essay items. For each guideline, write a short statement explaining why it is useful in improving the validity of essay assessment.
(iii) Distinguish formative assessment from summative assessment in terms of their aims, the timing of their implementation and their content coverage.

As shown in the examples, students are specifically informed what they should respond to and how: the questions indicate the number of points required and/or the scope of the responses. Students' responses can also be restricted by including interpretative material (e.g. a graph, a paragraph describing a particular problem or an extract from a literary work) and asking students to respond to one or two questions based on it.

Restricted response questions are more structured and are useful for measuring learning outcomes requiring the interpretation and application of knowledge in a specific area. They narrow the focus of the assessment task to a specific and well-defined performance, which makes it more likely that students will interpret each question the way it is intended. The teacher is also in a better position to assess the correctness of students' answers when a question is focused and all students interpret it in the same way.
When the teacher is clear about what makes up a correct answer, scoring reliability and the validity of the scores improve. Although restricting students' responses makes it possible to measure more specific learning outcomes, these same restrictions make such questions less valuable as measures of learning outcomes emphasising integration, organisation and originality. For higher-order learning outcomes, greater freedom of response is needed.

(b) Extended Response Essay Questions
Extended response essay questions provide less structure, and this promotes greater creativity, integration and organisation of material. The following are three examples:
(i) Examine to what extent essay questions are effective in measuring students' performance.
(ii) Evaluate the usefulness of multiple-choice questions as an assessment tool in education.
(iii) "Research without theory is blind." Discuss.

In responding to extended response essay questions, students are free to select any information that they think pertinent, to organise the answer in accordance with their best judgement, and to integrate and evaluate ideas as they deem appropriate. This freedom enables them to demonstrate their ability to analyse problems, organise their ideas, describe things in their own words and develop a coherent argument. Extended response essay questions are therefore useful in assessing higher-order thinking skills. They can also be used to assess writing skills.

The freedom students have in responding to extended response essay questions can, however, cause some problems. First, there is usually no single correct answer to the question. Students are free to choose how to respond, and the degree of correctness or merit of their answers can only be judged by a skilled subject-matter expert. A large number of examiners is required if the assessment involves a big student population, and inter-rater reliability in scoring can be an issue. Second, the same freedom that enables the demonstration of creative expression and other higher-order thinking skills makes the extended response essay question inefficient for measuring more specific learning outcomes. Third, extended response essay questions require good writing skills on the part of the students; this type of question thus disadvantages students whose writing skills are poor. Because of these limitations, it is often recommended that restricted response essay questions be used in place of extended response essay questions.

ACTIVITY 5.1
Select a few essay questions that have been used in tests or examinations. To what extent do these questions meet the criteria of an essay question as defined by Stalnaker (1951) and elaborated by Reiner et al. (2002)? Discuss with your coursemates in the myINSPIRE online forum.

5.3 ADVANTAGES OF ESSAY QUESTIONS

Essay questions are used to assess learning for the following reasons:

(a) Essay questions provide an effective way of assessing complex learning outcomes. They allow one to assess students' ability to synthesise, organise and express ideas, and to evaluate the worth of ideas. These abilities cannot be effectively assessed with other paper-and-pencil test items.

(b) Essay questions allow students to demonstrate their reasoning.
These questions not only allow students to present an answer to a question but also to explain how they arrived at their conclusions. This allows teachers to gain insight into a student's way of viewing and solving problems. With such insight, teachers can detect problems students may have with their reasoning process and help them overcome these problems.

(c) Essay questions provide authentic experiences. Constructing responses is closer to real life than selecting responses, as in objective tests. Problem solving and decision making are vital life competencies which require the ability to construct a solution or decision rather than select one from a limited set of possibilities. In the work environment, it is unlikely that an employer will give a worker a list of "four options" to choose from when asking the worker to solve a problem; in most cases, the worker will be required to construct a response.

5.4 DECIDING WHETHER TO USE ESSAY QUESTIONS OR OBJECTIVE QUESTIONS

Keep in mind that essay questions should strive to assess higher-order thinking skills. The decision whether to use essay questions or objective questions in examinations can therefore be problematic for some educators. In such a situation, one has to go back to the objectives of the assessment: what kinds of learning outcomes do you intend to assess? Essay questions are generally suitable for assessing:

(a) Students' understanding of subject matter or content; and
(b) Thinking skills that require more than simple verbatim recall of information, by challenging students to reason with their knowledge.

It is challenging to write test items that tap into higher-order thinking. However, students' understanding of subject matter or content, and many other higher-order thinking skills, can also be assessed through objective items. When in doubt about whether to use an essay question or an objective question, remember that essay questions are used to assess students' ability to construct rather than select answers.

To determine what type of test (essay or objective) to use, it is helpful to examine the verb(s) that best describe the desired ability to be assessed (refer to Topic 2). These verbs indicate what students are expected to do and how they should respond; they focus the students' responses and channel them towards the performance of specific tasks. Some verbs clearly indicate that students need to construct rather than select their answer (such as "explain"). Other verbs indicate that the intended learning outcome is focused on students' ability to recall information (such as "list"); recall is perhaps best assessed through objectively scored items. Verbs that test for understanding of subject matter or other forms of higher-order thinking, but do not specify whether the student is to construct or select the response (such as "interpret"), can be assessed either by essay questions or by objective items. An illustrative sketch of this verb-based decision appears after the activity below.

ACTIVITY 5.2
Compare, explain, arrange, apply, state, classify, design, illustrate, describe, name, complete, choose and defend. Decide which of the verbs in this list are best assessed by essay questions, by objective tests, or by both. Post your answer on the myINSPIRE online forum.
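Here is a minimal sketch of the verb-based decision described above, expressed as a small lookup table. The groupings are illustrative assumptions drawn from the guidance in this section, not a definitive classification:

# Illustrative mapping of learning-outcome verbs to item formats,
# following the guidance above: construct-type verbs suit essays,
# recall-type verbs suit objective items, and some verbs suit either.
ESSAY = {"explain", "discuss", "evaluate", "design", "defend", "compare"}
OBJECTIVE = {"list", "state", "name", "choose", "complete"}
EITHER = {"interpret", "classify", "illustrate", "apply", "describe", "arrange"}

def suggest_format(verb):
    v = verb.lower()
    if v in ESSAY:
        return "essay question"
    if v in OBJECTIVE:
        return "objective item"
    if v in EITHER:
        return "essay or objective item"
    return "unclassified: judge against the intended learning outcome"

print(suggest_format("evaluate"))   # essay question
print(suggest_format("list"))       # objective item
print(suggest_format("interpret"))  # essay or objective item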
5.5 LIMITATIONS OF ESSAY QUESTIONS

While essay questions are popular because they enable the assessment of higher-order learning outcomes, this format of evaluating students in examinations has a number of limitations which should be kept in mind.

(a) One purpose of testing is to assess a student's mastery of subject matter. In most cases, it is not possible to assess the student's mastery of the complete subject matter domain with just a few questions. Because of the time it takes for students to respond to essay questions and for markers to mark students' responses, the number of essay questions that can be included in a test is limited. Therefore, using essay questions will limit the degree to which the test is representative of the subject matter domain, thereby reducing content validity. For instance, a test of 80 multiple-choice questions will most likely cover more of the content domain than a test of three to four essay questions.

(b) Essay questions have limitations in reliability. While essay questions allow students some flexibility in formulating their responses, the reliability of marking or grading is questionable. Different markers or graders may vary in their marking or grading of the same or similar responses (inter-scorer reliability), and one marker can vary significantly in his or her marking or grading consistency across questions depending on many factors (intra-scorer reliability). Therefore, essay answers of similar quality may receive notably different scores. Characteristics of the learner, length and legibility of responses, and personal preferences of the marker or grader with regard to the content and structure of the response are some of the factors that may lead to unreliable marking or grading.

(c) Essay questions require more time for marking student responses. Teachers need to invest a large amount of time to read and mark students' responses to essay questions. On the other hand, relatively little or no time is required for teachers to score objective test items like multiple-choice items and matching exercises.

(d) As mentioned earlier, one of the strengths of essay questions is that they provide students with authentic experiences because students are challenged to construct rather than select their responses. To what extent does the short time normally allotted to a test affect student responses? Students have relatively little time to construct their responses, and this time limit does not allow them to give appropriate attention to the complex process of organising, writing and reviewing their responses. In fact, in responding to essay questions, students use a writing process that is quite different from the typical process that produces excellent writing (draft, review, revise and evaluate). In addition, students usually have no resources to aid their writing when answering essay questions (such as a dictionary or thesaurus). This disadvantage may offset whatever advantage accrues from the fact that responses to essay questions are more authentic than responses to multiple-choice items.

5.6 MISCONCEPTIONS ABOUT ESSAY QUESTIONS IN EXAMINATIONS

Other than the limitations of essay questions discussed earlier, there are also some misconceptions about this form of assessment.
These misconceptions are:

(a) By Their Very Nature, Essay Questions Assess Higher-order Thinking
Whether or not an essay item assesses higher-order thinking depends on the design of the question and how students' responses are scored. Not all essay questions can assess higher-order thinking skills. Indeed, it is possible to write essay questions that simply assess recall. Also, if a teacher designs an essay question meant to assess higher-order thinking but then scores students' responses in a way that only rewards recall ability, that teacher is not assessing higher-order thinking. Therefore, teachers must be well-trained to design and write higher-order thinking questions.

(b) Essay Questions are Easy to Construct
Essay questions are easier to construct than multiple-choice items because teachers do not have to create effective distractors. However, that does not mean that good essay questions are easy to construct. They may be easier to construct in a relative sense, but they still require a lot of effort and time. Essay questions that are hastily constructed without much thought and review usually function poorly.

(c) The Use of Essay Questions Eliminates the Problem of Guessing
One of the drawbacks of objective test items is that students sometimes get the right answer by guessing which of the presented options is correct. This problem does not exist with essay questions because students need to generate the answer rather than identify it from a set of options provided. At the same time, the use of essay questions introduces bluffing, another form of guessing. Some students are "good" at using various methods of bluffing (vague generalities, padding, name-dropping) to add credibility to an otherwise weak answer. Thus, the use of essay questions changes the nature of the guessing that occurs, but does not eliminate it.

(d) Essay Questions Benefit All Students by Placing Emphasis on the Importance of Written Communication Skills
Written communication is a life competency that is required for effective and successful performance in many vocations. Essay questions challenge students to organise and express subject matter and problem solutions in their own words, thereby giving them a chance to practise written communication skills that will be helpful to them in future vocational responsibilities. At the same time, the focus on written communication skills is also a serious disadvantage for students who have marginal writing skills but know the subject matter being assessed. If students who are knowledgeable in the subject obtain low scores because of their inability to write well, the validity of the test scores will be diminished.

(e) Essay Questions Encourage Students to Prepare More Thoroughly
Some research seems to indicate that students are more thorough in their preparation for examinations using essay questions than in their preparation for objective examinations such as those using multiple-choice questions. However, after an extensive review of existing literature and research on this topic, Crooks (1988) concluded that students' extent of preparation is based more on the expectations teachers set upon them (higher-order thinking and breadth and depth of content) than on the type of test questions they expect to be given in examinations.

SELF-CHECK 5.1
1. What are some limitations in the use of essay questions?
2. List some of the misconceptions about essay questions.

ACTIVITY 5.3
Compare the following two essay questions and decide which one assesses higher-order thinking skills.
(a) "What are the major advantages and limitations of solar energy?"
(b) "Given its advantages and limitations, should governments spend money developing solar energy?"
Post your answer on the myINSPIRE online forum.

5.7 GUIDELINES ON CONSTRUCTING ESSAY QUESTIONS

When constructing essay questions, whether they are for coursework assessments or examinations, the most important thing is to ensure that students have a clear idea of what they are expected to do after they have read the question or problem presented. Here are specific guidelines that can help you improve existing essay questions and create new ones.

(a) Clearly Define the Intended Learning Outcome to be Assessed by the Question
Knowing the intended learning outcome is crucial for designing essay questions. In specifying the intended learning outcome, teachers clarify the performance that students should be able to demonstrate as a result of what they have learnt. The intended learning outcome typically begins with a verb that describes an observable behaviour or action that students should demonstrate. The focus is on what students should and should not be able to do in the learning or teaching process. Reviewing a list of verbs can help to clarify what ability students should demonstrate, thereby defining the intended learning outcome to be assessed (refer to subtopic 5.8).

(b) Avoid Using Essay Questions for Intended Learning Outcomes that are Better Assessed with Other Kinds of Assessment
Some types of learning outcomes can be more efficiently and more reliably assessed with objective tests than with essay questions. Since essay questions sample a limited range of subject matter or content, are more time-consuming to score and involve greater subjectivity in scoring, the use of essay questions should be reserved for learning outcomes that cannot be better assessed by some other means. Let us look at Example 5.1.

Example 5.1:
Learning Outcome: To be able to differentiate the reproductive habits of birds and amphibians.
Essay Question: What are the differences in egg-laying characteristics between birds and amphibians?
Note: This learning outcome can be better assessed by an objective test.
Objective Item: Which of the following differences between birds and amphibians is correct?

    Birds                       Amphibians
A   Lay a few eggs at a time    Lay many eggs at a time
B   Lay eggs                    Give birth
C   Do not incubate eggs        Incubate eggs
D   Lay eggs in nest            Lay eggs on land

(c) Clarity About the Task and Scope
Essay questions have two variable elements – the degree to which the task is structured and the degree to which the scope of the content is focused. There is still confusion among educators as to whether more structure (of the task required) and more focus (on the content) are better than less structure and less focus. When the task is more structured and the scope of content is more focused, two problems are reduced:
(i) The problem of student responses containing ideas that were not meant to be assessed; and
(ii) The problem of extreme subjectivity when scoring student answers or responses.
Although more structure helps to avoid these problems, how much and what kind of structure and focus to provide are dependent on the intended learning outcome that is to be assessed by the essay question. The process of writing effective essay questions involves defining the task and delimiting the scope of the content in an effort to create an effective question that is aligned with the intended learning outcome to be assessed by it (as illustrated in Figure 5.1).

Figure 5.1: Alignment between content, learning activities and assessment tasks
Source: Phillips, Ansary Ahmed and Kuldip Kaur (2005)

This alignment is absolutely necessary for obtaining students' responses that can be accepted as evidence that a student has achieved the intended learning outcome. Hence, the essay question must be carefully and thoughtfully written in such a way that it elicits student responses that provide the teacher with valid and reliable evidence about the students' achievement of the intended learning outcome. Failure to establish adequate and effective limits for students' answers to the question may result in students setting their own boundaries for their responses. This means that students might provide answers that are outside the intended task or address only a part of the intended task. If this happens, then the teacher is left with unreliable and invalid information about the students' achievement of the intended learning outcome. Also, there is no basis for marking or grading students' answers. Therefore, it is the responsibility of the teacher to write essay questions in such a way that they provide students with clear boundaries for their answers or responses. Let us look at Example 5.2.

Example 5.2: Improving Clarity of Task and Scope of Essay Questions

Weak Essay Question: Evaluate the impact of the Industrial Revolution on England.
The verb is "evaluate", which is the task the student is supposed to do. The scope of the question is the impact of the Industrial Revolution on England. Very little guidance is given to students about the task of evaluating and the scope of the task. A student reading the question may ask:
(i) The impact on what in England? The economy? Foreign trade? A particular group of people? (The scope is not clear.)
(ii) Evaluate based on what criteria? The significance of the revolution? The quality of life in England? Progress in technological advancements? (The task is not clear.)
(iii) What exactly do you want me to do in my evaluation? (The task is not clear.)

Improved Essay Question: Evaluate the impact of the Industrial Revolution on the quality of family life in England. Explain whether families were able to provide for the education of their children.
The improved question determines the task for students by specifying a particular unit of society in England affected by the Industrial Revolution (the family). The task is also determined by giving students a criterion for evaluating the impact of the Industrial Revolution (whether or not families were able to provide for their children's education). Students are clearer about what must be done to "evaluate". They need to explain how family life has changed and judge whether or not the changes are an improvement for the children.

SELF-CHECK 5.2
1. When would you decide to use an objective item rather than an essay question to assess learning?
2. What is the difference between the task and the scope of an essay question?
(d) Questions that are Fair
One of the challenges that teachers face in composing essay questions is that, because of their extensive experience with the subject matter, they may be tempted to demand unreasonable content expertise on the part of the students. Hence, teachers need to make sure that their students can "be expected to have adequate material with which to answer the question" (Stalnaker, 1951). In addition, teachers should ask themselves if students can be expected to adequately perform the thought processes required of them in the task. For assessment to be fair, teachers need to provide their students with sufficient instruction and practice in the subject matter required for the thought processes to be assessed.

Another important element is to avoid using indeterminate questions. A question is indeterminate if it is so unstructured that students can redefine the problem and focus on some aspect of it with which they are thoroughly familiar, or if experts in the subject matter cannot agree that one answer is better than another. One way to avoid indeterminate questions is to stay away from vocabulary that is ambiguous. For example, teachers should avoid using the verb "discuss" in an essay question. This verb is simply too broad and vague. Moreover, teachers should also avoid including vocabulary that is too advanced for students.

(e) Specify the Approximate Time Limit and Marks Allotted to Each Question
Specifying the approximate time limit helps students allocate their time in answering several essay questions. Without such guidelines, students may feel at a loss as to how much time to spend on a question. When deciding the guidelines for how much time should be spent on a question, keep the slower students and students with certain disabilities in mind. Also make sure that students can realistically be expected to provide an adequate answer in the given and/or suggested time. Similarly, state the marks allotted to each question so that students can estimate how much they should write to answer the question.

(f) Use Several Relatively Short Essay Questions Rather than One Long Question
Only a very limited number of essay questions can be included in a test because of the time it takes for students to respond to them and the time it takes for teachers to grade the students' responses. This creates a challenge with regard to designing valid essay questions. Shorter essay questions are better suited to assess the depth of student learning within a subject, whereas longer essay questions are better suited to assess the breadth of student learning within a subject. Hence, there is a trade-off when choosing between several short essay questions and one long question: a focus on assessing depth limits the assessment of breadth within the same subject, and a focus on assessing breadth limits the assessment of depth. When choosing between using several short essay questions or one long question, also keep in mind that short essays are generally easier to mark than long essays.

(g) Avoid the Use of Optional Questions
Students should not be permitted to choose one essay question to answer from two or more optional questions.
The use of optional questions should be avoided for the following reasons:
(i) Students may waste time deciding on an option; and
(ii) Some questions are likely to be harder, which could make the comparative assessment of students' abilities unfair.

The issue of the use of optional questions is debatable. It is often practised, especially in higher education, and students often demand that they be given choices. The practice is acceptable if it can be assured that the questions have equivalent difficulty levels and that the tasks as well as the scope required by the questions are equivalent.

Last but not least, let us improve the essay questions through preview and review.

Improving Essay Questions Through Preview and Review
The following steps can help you improve the essay item before and after you administer it to your students.

PREVIEW (before handing out the essay question to the students)

Predict Students' Responses
Try to respond to the question from the perspective of a typical student. Evaluate whether students have the content knowledge and the skills necessary to adequately respond to the question. After detecting possible weaknesses of the essay questions, repair them before handing them out in the exam.

Write a Model Answer
Before using a question, write model answer(s) or at least an outline of major points that should be included in an answer. Writing the model answer allows reflection on the clarity of the essay question. Furthermore, the model answer serves as a basis for the grading of student responses. Once the model answer has been written, compare its alignment with the question and the intended learning outcome, and make changes as needed to assure that the intended learning outcome, the question and the model answer are aligned with one another. Before using the question in a test, ask a knowledgeable person in the subject to critically review the essay question, the model answer and the intended learning outcome to determine how well they are aligned with each other.

REVIEW (after receiving the student responses)

Review Students' Responses to the Essay Question
After students have answered the questions, carefully review the range of answers given and the manner in which students seem to have interpreted the question. Make revisions based on the findings. Writing good essay questions is a process that requires time and practice. Carefully studying the students' responses can help to evaluate students' understanding of the question as well as the effectiveness of the question in assessing the intended learning outcomes.

In addition, you can use a checklist as shown in Figure 5.2 to check your essay questions.

Figure 5.2: A checklist for writing essay questions

SELF-CHECK 5.3
1. Why should you specify the time allotted for answering each question?
2. Why should you avoid optional questions?
3. What is meant when it is said that questions should be "fair"?
4. What should you do before and after administering a test?

5.8 VERBS DESCRIBING VARIOUS KINDS OF MENTAL TASKS

Using the list suggested by Moss and Holder (1988), and Anderson and Krathwohl (2001), Reiner et al. (2002) proposed the following list of verbs that describe mental tasks to be performed (refer to Table 5.1).
Table 5.1: Verbs, Definitions and Examples

Analyse: Break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose. Example: Analyse the meaning of the line "He saw a dead crow, in a drain, near the post office" in the poem The Dead Crow.

Apply: Decide which abstractions (concepts, principles, rules, laws, theories, generalisations) are relevant in a problem situation. Example: Apply the principles of supply and demand to explain why the consumer price index (CPI) in Malaysia has increased in the last three months.

Attribute: Determine a point of view, bias, value or intent underlying the presented material. Example: Determine the point of view of the author in the article about her political perspective.

Classify: Determine which category something belongs to. Example: Classify the organisms into vertebrates and invertebrates.

Compare: Identify and describe points of similarity. Example: Compare the role of the Dewan Rakyat and Dewan Negara.

Compose: Make or form by combining things, parts or elements. Example: Compose an effective plan for solving flooding problems in Kuala Lumpur.

Contrast: Bring out the points of difference. Example: Contrast the contribution of Tun Hussein Onn and Tun Abdul Razak Hussein to the political stability of Malaysia.

Create: Put elements together to form a coherent or functional whole; reorganise elements into a new pattern or structure. Example: Create a comprehensive solution for the traffic problems in Kuala Lumpur.

Critique: Detect consistencies and inconsistencies between a product and relevant external criteria; detect the appropriateness of a procedure for a given problem. Example: Judge which of the two methods is the best way for reducing high absenteeism in the workplace.

Defend: Develop and present an argument to support a recommendation, to maintain or revise a policy or programme, or to propose a course of action. Example: Defend the decision by the government to raise fuel prices.

Define: Give the meaning of a word or concept; place it in the class to which it belongs and distinguish it from other items in the same class. Example: Define the term "chemical weathering".

Describe: Give an account of; tell or depict in words; represent or delineate by a word picture. Example: Describe the contribution of Za'ba to the development of Bahasa Melayu.

Design: Devise a procedure for accomplishing some task. Example: Design an experiment to prove that 21 per cent of air is composed of oxygen.

Differentiate: Distinguish relevant from irrelevant parts or important from unimportant parts of presented material. Example: Distinguish between supply and demand in determining price.

Explain: Make clear the cause or reason of something; construct a cause-and-effect model of a system; tell "how" to do; tell the meaning of. Example: Explain the causes of the First World War.

Evaluate: Make judgements based on criteria and standards; determine the significance, value, quality or relevance of; give the good points and the bad ones; identify and describe the advantages and limitations. Example: Evaluate the contribution of the microchip in telecommunications.

Generate: Come up with alternative hypotheses, examples, solutions or proposals based on criteria. Example: Generate hypotheses to account for an observed phenomenon.

Identify: Recognise as being a particular person or thing. Example: Identify the characteristics of the Mediterranean climate.

Illustrate: Use a word picture, a diagram, a chart or a concrete example to clarify a point. Example: Illustrate the use of catapults in the amphibious warfare of Alexander.

Infer: Draw a logical conclusion from presented information. Example: What can you infer happened in the experiment?

Interpret: Give the meaning of; change from one form of representation (such as numerical) to another (such as verbal). Example: Interpret the poetic line, "The sound of a cobweb snapping is the noise of my life."

Justify: Show good reasons for; give your evidence; present facts to support your position. Example: Justify the American entry into the Second World War.

List: Create a series of names or other items. Example: List the major functions of the human heart.

Predict: Know or tell beforehand, with precision of calculation, knowledge or shrewd inference from facts or experience, what will happen. Example: Predict the outcome of a chemical reaction.

Propose: Offer for consideration, acceptance or action; suggest. Example: Propose a solution for landslides along the North-South Highway.

Recognise: Locate knowledge in long-term memory that is consistent with presented material. Example: Recognise the important events in the road to independence in Malaysia.

Recall: Retrieve relevant knowledge from long-term memory. Example: Recall the dates of important events in Islamic history.

Summarise: Sum up; give the main points briefly. Example: Summarise the ways in which man preserves food.

Trace: Follow the course of; follow the trail of; give a description of progress. Example: Trace the development of television in school instruction.

The definitions specify the thought processes a person must perform to complete the mental tasks. Note that this list is not exhaustive, and local examples have been introduced to illustrate the mental tasks required in each essay question.

ACTIVITY 5.4
Discuss the following with your coursemates in the myINSPIRE online forum:
(a) Select some essay questions in your subject area and examine whether the verbs used are similar to those in the list given in Table 5.1. Do you think the tasks required by the verbs used are appropriate? Justify.
(b) Do you think students are able to differentiate between the tasks required in the verbs listed? Justify.
(c) Are teachers able to describe to students the tasks required by using these verbs? Explain.

5.9 MARKING AN ESSAY

Marking or grading of essays is a notoriously unreliable activity. If we read an essay at two different times, the chances are high that we will give the essay a different grade each time. If two or more of us read the essay, our grades will likely differ, often dramatically so. We all like to think we are exceptions, but study after study of well-meaning and conscientious teachers shows that essay grading is unreliable (Ebel, 1972; McKeachie, 1987). Eliminating the problem is unlikely, but we can take steps to improve grading reliability. Using a scoring guide or marking scheme helps control the shifting of standards that inevitably takes place as we read a collection of essays and papers. The common types of marking scheme used in scoring students' responses to essay questions are presented diagrammatically in Figure 5.3.

Figure 5.3: Types of marking scheme

A marking scheme may take the form of a checklist, a rubric or a combination of both.
(a) Checklist
In a checklist, a score is awarded for every correct or relevant point in a response. The sum of these individual scores provides the final score of the response. Table 5.2 is an example of a checklist.

Table 5.2: Sample of a Checklist

Reference: Topic 5, Section 5.7, p. 74
Suggested answers (Strengths):
- Essay questions provide an effective way of assessing complex learning outcomes.
- Essay questions allow students to demonstrate their reasoning and creativity.
- Essay questions provide authentic experiences because students are given the opportunity to organise, write and review their responses.
- Guessing is very much reduced.
(Accept any other appropriate answers.)
Marks allocation: Award 1 mark for each point. (1 mark × 4 = 4 marks)

This marking scheme can be used to assess students' responses to an essay question that asks for the strengths of essay questions as an assessment tool. A checklist is easy to use: the teacher just needs to read through the student's response and check the number of points for the calculation of marks. A checklist is useful for assessing factual content, and it is relatively easy to construct. The teacher just needs to present a list of points required in the response and decide on the marks for each point. However, a checklist with a list of points does not provide for the assessment of intangible learning outcomes such as "to discuss", "to evaluate" or "to explain" and other complexity levels of Bloom's taxonomy. It also provides limited feedback for formative purposes, and students cannot use it as a guide for writing assignments.

(b) Rubric
The two most common approaches used in scoring rubrics are the holistic and the analytic methods.

(i) Holistic Method (Global or Impressionistic Marking)
The holistic approach to scoring essay questions involves reading an entire response and assigning it to a category identified by a score or grade. This method involves considering the student's answer as a whole and judging the total quality of the answer relative to other students' responses, or the total quality of the answer based on certain criteria that have been developed. Think of it as sorting into bins. You read the answer to a particular question and assign it to the appropriate bin. The best answers go into the "exemplary" bin, the good ones go into the "good" bin and the weak answers go into the "poor" bin (refer to Table 5.3).

Table 5.3: Sample of a Marking Scheme Using the Holistic Method

7–8 (Exemplary): Addresses the question; states a relevant argument; presents arguments in a logical order; uses acceptable style and grammar (no errors).
5–6 (Good): Combination of the above traits, but less consistently represented (few errors).
3–4 (Adequate): Does not address the question explicitly, though does so tangentially; states a somewhat relevant argument; presents some arguments in a logical order; uses adequate style and grammar (some errors).
1–2 (Poor): Does not address the question; states no relevant arguments; is not clearly or logically organised; fails to use acceptable style and grammar.
0: Irrelevant response or no answer.

Then, points are written on each paper appropriate to the bin it is in. Scoring is based on an overall impression. The holistic method is also referred to as global or impressionistic marking.
One of the strengths of the holistic rubric is that students' responses can be scored quite quickly. The teacher needs only to read through the student's response and decide in which band of scores the response lies. This rubric can provide an overview of student performance, but it does not provide detailed information about a student's performance, and it may be difficult to settle on an overall score for the student's response.

How best can a teacher use the holistic method in scoring students' responses? Before he or she starts marking, the teacher can develop a description of the type of response that would illustrate each category, and then try out this draft version using several actual papers. After reading and categorising all of the papers, it is a good idea to re-examine the papers within a category to see if they are similar enough in quality to receive the same points or grade. It may be faster to read essays holistically and provide only an overall score or grade, but students do not receive much feedback about their strengths and weaknesses. Some instructors who use holistic scoring also write brief comments on each paper to point out one or two strengths and/or weaknesses so students will have a better idea of why their responses received the scores they did.

(ii) Analytic Method
The analytic method of marking is the system most frequently used in large-scale public examinations and also by teachers in the classroom. Its basic tool is a two-dimensional table with the performance criteria down the vertical column on the left and the performance levels across the top row. The cells then present the performance descriptors, as shown in Table 5.4.

Table 5.4: Sample of a Marking Scheme Using the Analytic Method

Holistic scoring gives students a single, overall assessment score for the response as a whole. Analytic scoring provides students with at least a rating score for each criterion. For example, based on the rubric, a student's response may get 3 points for focus/organisation, 2 points for elaboration and 4 points for mechanics, giving a total of 9 marks.

Alternatively, an analytic rubric may take the form of a weighted rubric, whereby different weights (values) are assigned to different criteria, and an overall achievement is obtained by totalling the criteria. Refer to Table 5.5 for a sample of a weighted analytic rubric.

Table 5.5: Sample of a Marking Scheme Using the Weighted Analytic Method

To use the rubric, the performance level achieved by the student is multiplied by the weight to give a score for each criterion. For example, for focus/organisation, the score is 3 × 1.25 = 3.75; for elaboration, the score is 2 × 1.25 = 2.5; and for mechanics, the score is 4 × 0.5 = 2.0. This gives the student a total of 8.25 marks out of 12.

The analytic rubric provides more detailed feedback on areas of strength and weakness because the performance criteria are given and each criterion can be weighted to reflect its relative importance in the student's response. Generic rubrics which are not task specific can also be a useful aid to learning. Students can use them too as a guide to doing the assignments.
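The arithmetic behind the checklist and the weighted analytic rubric is simple enough to capture in a few lines of code. The following Python sketch is illustrative only: the criterion names, weights and performance levels mirror the worked figures above (weights of 1.25, 1.25 and 0.5), not a prescribed scheme.

```python
# A minimal sketch of the two marking-scheme computations described above:
# checklist tallying (Table 5.2) and weighted analytic scoring (Table 5.5).
# The criteria, weights and levels mirror the worked figures in the text;
# they are illustrative, not a prescribed scheme.

def checklist_score(points_found: int, mark_per_point: float = 1.0) -> float:
    """Checklist method: a fixed mark for every correct or relevant point."""
    return points_found * mark_per_point

def weighted_analytic_score(levels: dict, weights: dict) -> float:
    """Weighted analytic method: performance level x weight, summed over criteria."""
    return sum(levels[criterion] * weights[criterion] for criterion in weights)

weights = {"focus/organisation": 1.25, "elaboration": 1.25, "mechanics": 0.5}
levels = {"focus/organisation": 3, "elaboration": 2, "mechanics": 4}

print(checklist_score(4))                        # 4.0 (1 mark x 4 points)
print(weighted_analytic_score(levels, weights))  # 8.25, as in the worked example
```

Note that the maximum obtainable total, with every criterion at the top level of 4, is 4 × 1.25 + 4 × 1.25 + 4 × 0.5 = 12, which is why the worked example is reported as 8.25 marks out of 12.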
As shown in Table 5.5, the performance descriptors are stated in general terms and do not give away the answers. However, an analytic rubric takes more time to create and use than a holistic rubric. Moreover, it is important that each point for each criterion is well-defined; otherwise, different raters may not arrive at the same score.

5.10 SUGGESTIONS FOR MARKING ESSAYS

Here are some suggestions for marking or scoring essays:

(a) Grade the papers anonymously. This will help control the influence of our expectations of the student on the evaluation of the answer.

(b) Read and score the answers to one question before going on to the next question. In other words, score all the students' responses to Question 1 before looking at Question 2. This helps to keep one frame of reference and one set of criteria in mind through all the papers, which results in more consistent grading. It also prevents an impression that we form in reading one question from carrying over to our reading of the student's next answer.

(c) If a student has not done a good job on the first question, we may let this impression influence our evaluation of the student's second answer. However, if other students' papers come in between, we are less likely to be influenced by the original impression.

(d) If possible, try to grade all the answers to one particular question without interruption. Our standards might vary from morning to night, or from one day to the next.

(e) Shuffle all the papers after each item is scored. Changing the order of papers this way reduces the context effect and the possibility that a student's score may be the result of the location of the paper in relation to other papers. If Rakesh's "B" work always follows Jamal's "A" work, then it might look more like "C" work, and his grade would be lower than if his paper were somewhere else in the stack.

(f) Decide in advance how you are going to handle extraneous factors and be consistent in applying the rule. Students should be informed about how you treat such things as misspelled words, neatness, handwriting, grammar and so on.

(g) Be on the alert for bluffing. Some students who do not know the answer may write a well-organised, coherent essay but one containing material irrelevant to the question. Decide how to treat irrelevant or inaccurate information contained in the students' answers. We should not give credit for irrelevant material. It is not fair to other students who may also have preferred to write on another topic, but instead wrote on the required question.

(h) Write comments on the students' answers. Teacher comments make essay tests a good learning experience for students. They also serve to refresh your memory of your evaluation should the student question the grade given.

(i) Be aware of the order in which papers are marked, which can have an impact on the grades awarded. A marker may grow more critical (or more lenient) after having read several papers; thus the early papers may receive lower (or higher) marks than papers of similar quality that are scored later.

(j) When students are directed to take a stand on a controversial issue, the marker must be careful to ensure that the evidence and the way it is presented are evaluated, not the position taken by the student. If the student takes a position which differs from that of the marker, the marker must be aware of his or her own possible bias in marking the essay.
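Several of these suggestions lend themselves to a simple marking routine. The Python sketch below is offered only as an illustration of combining anonymous marking (a), question-by-question scoring (b) and (d), and shuffling between items (e); the paper structure and the mark_response placeholder are assumptions for the sketch, not part of the module.

```python
import random

# A minimal sketch (assumed data structures) of suggestions (a), (b), (d)
# and (e) above: hide names behind anonymous IDs, finish one question
# across all papers before moving to the next, and reshuffle the pile
# between questions. mark_response is a placeholder for applying the
# marking scheme to a single response.

papers = [
    {"student": "Jamal",  "answers": {1: "...", 2: "..."}},
    {"student": "Rakesh", "answers": {1: "...", 2: "..."}},
]

def mark_response(question: int, response: str) -> int:
    """Placeholder: the teacher's judgement, guided by the marking scheme."""
    return 0

anonymous = dict(enumerate(papers))   # suggestion (a): mark by ID, not by name
scores = {paper_id: {} for paper_id in anonymous}

for question in (1, 2):               # suggestions (b) and (d): one question at a time
    order = list(anonymous)
    random.shuffle(order)             # suggestion (e): reshuffle the pile per question
    for paper_id in order:
        response = anonymous[paper_id]["answers"][question]
        scores[paper_id][question] = mark_response(question, response)

totals = {anonymous[pid]["student"]: sum(q.values()) for pid, q in scores.items()}
print(totals)
```

The point of the routine is not automation of judgement, which remains the teacher's, but enforcing the order-of-marking discipline that reduces context and halo effects.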
ACTIVITY 5.5
1. Compare the analytic method and holistic method of marking essays.
2. Which method is widely practised in your institution? Why?
3. Do you think there would be a difference in marking an answer using the two methods? Justify your answer.
Post your answers on the myINSPIRE online forum.

Summary

- An essay question is a test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject matter.
- There are two types of essays based on their function: restricted response and extended response essay questions.
- Essay questions provide an effective way of assessing complex learning outcomes.
- Essay questions provide authentic experiences because constructing responses is closer to real life than selecting responses.
- It is not possible to assess a student's mastery of the complete subject matter domain with just a few questions.
- Essay questions have two variable elements – the degree to which the task is structured and the degree to which the scope of the content is focused.
- Whether or not an essay item assesses higher-order thinking depends on the design of the question and how students' responses are scored.
- Specifying the approximate time limit helps students allocate their time in answering several essay questions.
- Avoid using essay questions for intended learning outcomes that are better assessed with other kinds of assessment.
- Analytic marking is the system most frequently used in large-scale public examinations and also by teachers in the classroom. Its basic tool is the marking scheme with proper mark allocations for elements in the answer.
- The holistic approach to scoring essay questions involves reading an entire response and assigning it to one of several categories, each given a score or grade.

Key Terms

Analytic method, Checklist, Complex learning outcomes, Constructed responses, Essay, Grading, Holistic method, Marking scheme, Mental tasks, Model answer, Rubric, Time consuming

References

Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. Boston, MA: Allyn & Bacon.
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58(4), 438–481.
Ebel, R. L. (1972). Essentials of educational measurement. Oxford, England: Prentice-Hall.
McKeachie, W. J. (1987). Can evaluating instruction improve teaching? New Directions for Teaching and Learning, 31(1987), 3–7.
Moss, A., & Holder, C. (1988). Improving student learning: A guidebook for faculty in all disciplines. Dubuque, IA: Kendall/Hunt.
Phillips, J. A., Ansary Ahmed, & Kuldip Kaur. (2005). Instructional design principles in the development of an e-learning graduate course. Paper presented at The International Conference in E-Learning, Bangkok, Thailand.
Reiner, C. M., Bothell, T. W., Sudweeks, R. R., & Wood, B. (2002). Preparing effective essay questions. Stillwater, OK: New Forums Press.
Stalnaker, J. M. (1951). The essay type examination. In E. F. Lindquist (Ed.), Educational measurement (pp. 495–530). Menasha, WI: George Banta.
Topic 6: Authentic Assessment

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Define authentic assessment;
2. Explain how to use authentic assessment;
3. Explain the advantages and disadvantages of authentic assessment;
4. Describe the characteristics of authentic assessment; and
5. Compare authentic assessment with traditional assessment.

INTRODUCTION

Many teachers use traditional assessment tools such as multiple-choice tests and essay-type tests to assess their students. How well do these multiple-choice or essay tests really evaluate students' understanding and achievement? These traditional assessment tools do serve a role in the assessment of student outcomes. However, assessment does not always have to involve paper and pencil; it can instead take the form of a project, an observation or a task that shows a student has learnt the material. Are these alternative assessments more effective than the traditional ones?

Some classroom teachers are using testing strategies that do not focus entirely on recalling facts. Instead, they ask students to demonstrate the skills and concepts they have learnt. Teachers may want to ask the students to learn how to apply their skills to authentic tasks and projects, or to have students demonstrate the application of their knowledge in real life. The students must then be trained to perform meaningful tasks that replicate real-world challenges. In other words, students are asked to perform a task rather than select an answer from a ready-made list. This strategy of asking students to perform real-world tasks that demonstrate meaningful application of essential knowledge and skills is called authentic assessment. Let us learn more about authentic assessment in the following subtopics.

ACTIVITY 6.1
The following are two assessment procedures, A and B. Which is an authentic assessment and which is a traditional assessment?
Assessment A: Students are asked to take a paper-and-pencil test on how to prepare MCQs for an examination paper.
Assessment B: Students are asked to prepare MCQs for an examination paper, administer them to a class of 30 students and then write a report.
Justify your answer in the myINSPIRE online forum.

6.1 WHAT IS AUTHENTIC ASSESSMENT IN THE CLASSROOM?

Authentic assessment, in contrast to the more traditional assessment, encourages the integration of teaching, learning and assessing. In the "traditional assessment model", teaching and learning are often separated from assessment: a test is administered after knowledge or skills have been acquired. Authentic assessment usually includes a task for students to perform and a rubric by which their performance on the task will be assessed. Thus, doing science experiments, writing stories and reports, and solving mathematical problems that have real-world applications can all be considered examples of authentic assessment. Useful achievement data can be obtained via authentic assessment.

Teachers can teach students to do mathematics, history and science, not just to know them. Then, to assess what the students have learnt, teachers can ask students to perform tasks that "replicate the challenges" faced by those using mathematics and history or conducting a scientific investigation.
Well-designed traditional classroom assessments such as tests and quizzes can effectively determine whether or not students have acquired a body of knowledge. In contrast, authentic assessments ask students to demonstrate understanding by performing a more complex task, usually representative of more meaningful application. These tasks involve asking students to analyse, synthesise and apply what they have learnt in a substantial manner, and students create new meaning in the process as well. In short, authentic assessment helps answer the question, "How well can you use what you know?", while traditional testing helps answer the question, "Do you know it?"

The usual or traditional classroom assessments such as multiple-choice tests and short-answer tests are just as important as authentic assessment. In fact, authentic assessment complements traditional assessment. Authentic assessment has been gaining acceptance among early childhood and primary school teachers, where traditional assessment may not be appropriate.

6.2 ALTERNATIVE NAMES FOR AUTHENTIC ASSESSMENT

Did you know that authentic assessment is sometimes referred to as performance assessment, alternative assessment and direct assessment?

It is called performance assessment or performance-based assessment because students are asked to perform meaningful tasks. Performance assessment is "a test in which the test taker actually demonstrates the skills the test is intended to measure by doing real-world tasks that require those skills, rather than by answering questions asking how to do them" (Vander Ark, 2013). Project-based learning (PBL) and portfolio assignments are examples of performance assessment. With performance assessment, teachers observe students while they are performing in the classroom, and judge the level of proficiency demonstrated. As authentic tasks are rooted in the curriculum, teachers can develop tasks based on what already works for them. Through this process, evidence-based assignments such as portfolios become more authentic and more meaningful to students.

The term alternative assessment is sometimes used because authentic assessment is an alternative to traditional assessments. Using checklists and rubrics in self and peer evaluation, students participate actively in evaluating themselves and one another. Alternative assessments measure performance in ways other than traditional paper-and-pencil and short-answer tests. For example, a Klang Valley Science teacher may ask the students to identify the different pollutants in the Klang River and make a report to the local environmental council.

Direct assessment is so called because authentic assessments are direct measures that provide more direct evidence of meaningful application of knowledge and skills. If a student does well on a multiple-choice test, we might infer indirectly that the student could apply that knowledge in real-world contexts as well; but we would be more comfortable making that inference from a direct demonstration of that application, such as the river pollutants example mentioned earlier. We do not just want students to know the content of the disciplines when they leave school; we want them to apply the knowledge and skills they have learnt. Direct evidence of student learning is tangible, visible and measurable, and tends to be more compelling evidence of exactly what students have and have not learnt.
Teachers can directly look at students' work or performances to determine what they have learnt.

6.3 HOW TO USE AUTHENTIC ASSESSMENT?

Authentic assessments focus on the learning process, sound instructional practices, and the high-level thinking skills and proficiencies needed for success in the real world; therefore, they may offer students who have been exposed to them huge advantages over those who have not. This helps students see themselves as active participants who are working on a task of relevance, rather than passive recipients of obscure facts. It helps teachers by encouraging them to reflect on the relevance of what they teach, and provides results that are useful for improving instruction.

The following lists the steps which you can take to create your own authentic assessment:
(a) Identify which standards you want your students to meet through this assessment;
(b) Choose a relevant task for this standard or set of standards, so that students can demonstrate how they have or have not met the standards;
(c) Define the characteristics of good performance on this task. This will provide useful information regarding how well students have met the standards; and
(d) Create a rubric or set of guidelines for students to follow so that they are able to assess their work as they perform the assigned task.

Brady (2012) suggested some examples of authentic assessment strategies, which include the following:
(a) Exhibit an athletic skill;
(b) Produce a short musical, dance or drama;
(c) Publish a class brochure;
(d) Perform a role, an oral presentation or an artistic display;
(e) Plan or draw conceptual mind maps or flow charts;
(f) Demonstrate the use of ICT tools such as webpage creation or video editing;
(g) Construct models;
(h) Produce creative writing;
(i) Peer teaching, evaluating teacher-student feedback; and
(j) Attempt unstructured tasks like problem solving, open-ended questions, and formal and informal observations.

6.4 ADVANTAGES OF AUTHENTIC ASSESSMENT

According to Wiggins (1990), while standardised multiple-choice tests can be valid indicators of academic performance, tests often mislead students into believing that learning requires cramming, and mislead teachers into believing that tests are after-the-fact, contrived and irrelevant. A move towards more authentic tasks and outcomes improves teaching and learning. In this respect, authentic assessment has many benefits, the main ones being as follows:

(a) Authentic assessment provides parents and community members with directly observable products and understandable evidence concerning their children's performance. The quality of a student's work is more discernible to laypeople than when we must rely on abstract statistical figures.

(b) Authentic assessment uses tasks that reflect normal classroom activities or real-life learning as means for improving instruction, thus allowing teachers to plan a comprehensive, developmentally-oriented curriculum based on their knowledge of each child.

(c) Authentic assessment is consistent with the constructivist approach to learning. This approach emphasises that students should use their previous knowledge to build new knowledge structures, be actively involved in exploration and inquiry through task-like activities, and construct meaning from educational experience.
Most authentic assessments engage students and actively involve them with complex tasks that require exploration and inquiry.

(d) Authentic assessment tasks assess how well students can apply what they have learnt in real-life situations. An important school outcome is the ability of students to solve problems and lead a useful life, rather than simply to answer questions about facts, principles and theories they have learnt. In other words, authentic assessments require students to demonstrate their ability to complete a task using their knowledge and skills from several areas, rather than simply recalling information or saying how to do a task.

(e) Authentic assessment tasks require an integration of knowledge, skills and abilities. Complex tasks, especially those that span longer periods, require students to use different skills and abilities. Portfolios and projects, two common tools in authentic assessment, require a student to use knowledge from several different areas and many different abilities.

(f) Authentic assessment focuses on higher-order thinking skills such as applying, analysing, evaluating and creating, which are found in Bloom's taxonomy. Authentic assessment evaluates thinking skills such as analysis, synthesis, evaluation and interpretation of facts and ideas – skills which standardised tests generally avoid.

(g) Embedding authentic assessment in the classroom allows for a wide range of assessment strategies. It involves teacher-and-student collaboration in determining assessment (student-structured tasks).

(h) Authentic assessment broadens the approach to student assessment. Introducing authentic assessment along with traditional assessment broadens the types of learning outcomes that a teacher can assess. It also offers students a variety of ways of expressing their learning, thus enhancing the validity of student evaluation.

(i) Authentic assessment focuses on students' progress, rather than on identifying their weaknesses. Authentic assessment lets teachers assess the processes students use as well as the products they produce. Many authentic tasks offer teachers the opportunity to watch the way a student goes about solving a problem or completing a task. Appropriate scoring rubrics help teachers collect information about the quality of the processes and strategies students use, as well as assess the quality of the finished product.

6.5 DISADVANTAGES OF AUTHENTIC ASSESSMENT

Despite the usefulness of authentic assessment as an assessment tool, it has some drawbacks as well. Some of the criticisms are as follows:

(a) High-quality Authentic Assessment Tasks are Difficult to Develop
First, they must match the complex learning outcomes that are being assessed. Teachers may decide that more than one learning outcome be assessed by the same complex task. They must also be aware that not every learning outcome can and should be assessed by authentic assessments; they should select only those that can and should. In crafting the tasks for assessment, teachers also have to decide if they want to assess the process, the product or both. Most important of all, the tasks developed must allow for predetermined performance criteria. For that, the tasks must possess special characteristics (refer to subtopic 6.6 for details).
(b) High-quality Scoring Rubrics are Difficult to Develop
This is especially true when teachers want to assess complex cognitive and intangible affective learning outcomes, or to permit multiple answers and products. Failure to develop a high-quality rubric will affect the validity and reliability of the assessment.

(c) Completing Authentic Assessment Tasks Takes a Lot of Time
Most authentic tasks take days, weeks or months to complete. For instance, a research project might take a few weeks, and this might reduce the amount of instructional time.

(d) Scoring Authentic Assessment Tasks Takes a Lot of Time
The more complex the tasks, the more time teachers can expect to spend on scoring. Complex tasks normally allow for many diverse outputs from the students, and it is time-consuming to score this type of output. Besides, assessment that focuses on the process requires that teachers monitor and score the output at different stages in the implementation of the tasks.

(e) Scores from Tasks for Authentic Assessment May Have Lower Scorer Reliability
With complex tasks and multiple outputs and answers, scoring depends on teachers' own competence. If two teachers are doing the assessment, they may mark the same output or answer of a student quite differently. This is not only frustrating to the student but also lowers the reliability and validity of the assessment results. However, this problem can be addressed by having well-defined rubrics and well-trained scorers to mark the students' output.

(f) Authentic Assessments Have Low Reliability from the Content-sampling Point of View
Normally, each authentic assessment task will only focus on specific subject-matter content. As the task requires an extended period of time to complete, it is not possible to have wide content coverage, unlike the traditional objective assessment formats which allow broader content coverage in less time.

(g) Completing Authentic Assessment Tasks May Be Discouraging to Less Able Students
Complex tasks such as projects require students to sustain their interest and intensity over a long period of time. They may be overwhelmed by the high demands of the authentic assessment. Though group work may help by permitting peers to share the work and use each other's differential knowledge and skills to complete the task, group work has its limitations in assessment.

In sum, criticism of authentic assessments generally involves both the informal development of the assessments and the difficulty of ensuring test validity and reliability, given the subjective nature of human scoring with rubrics as compared to computers scoring multiple-choice test items. Many teachers shy away from authentic assessments because these methodologies are time-intensive to manage, grade, monitor and coordinate. Teachers find it hard to provide a consistent grading scheme, and the subjective method of grading may lead to bias. Teachers also find that this method is not practical for a big group of students.

Nevertheless, based on the value of authentic assessments to student outcomes, the advantages of authentic assessments outweigh these concerns. For example, once the assessment guidelines and grading rubric are created, they can be filed away and used year after year. As Lindquist (1951) noted, there is nothing new about this authentic assessment methodology.
It is not some kind of radical invention recently fabricated by the opponents of traditional tests to challenge the testing industry. Rather, it is a proven method of evaluating human characteristics that has been in use for decades.

ACTIVITY 6.2
Is authentic assessment practised in your institution? How is it done? If it is not being practised, explain why. Share your experience with your coursemates in the myINSPIRE online forum.

6.6 CHARACTERISTICS OF AUTHENTIC ASSESSMENT

The main characteristics of authentic assessment have been summed up by Reeves, Herrington and Oliver (2002), who then contrasted its methodology with that of traditional assessment. According to Reeves et al. (2002), authentic assessment is characterised by the following:

(a) Has Real-world Relevance
The assessment is meant to focus on the impact of one's work in real or realistic contexts.

(b) Requires Students to Define the Tasks and Sub-tasks Needed to Complete the Activity
Problems inherent in the activities are open to multiple interpretations rather than easily solved by the application of existing algorithms.

(c) Comprises Complex Tasks to Be Investigated by Students over a Sustained Period of Time
Activities are completed in days, weeks and months rather than minutes or hours. They require significant investment of time and intellectual resources.

(d) Provides the Opportunity for Students to Examine the Task from Different Perspectives, Using a Variety of Resources
The use of a variety of resources, rather than a limited number of pre-selected references, requires students to distinguish relevant information from irrelevant data.

(e) Provides the Opportunity to Collaborate
Collaboration is integral to the task, both within the course and in the real world, rather than achievable by the individual learner.

(f) Provides the Opportunity to Reflect
Assessments need to enable learners to make choices and reflect on their learning, both individually and socially.

(g) Can Be Integrated and Applied across Different Subject Areas and Lead beyond Domain-specific Outcomes
Assessments encourage interdisciplinary perspectives and enable students to play diverse roles, thus building robust expertise rather than knowledge limited to a single well-defined field or domain.

(h) Authentic Activities Are Seamlessly Integrated with Assessment
Assessment of activities is seamlessly integrated with the major task in a manner that reflects real-world assessment, rather than being a separate, artificial assessment removed from the nature of the task.

(i) Creates Value
The product, outcome or result of an assessment is polished and is valued by the student in its own right, rather than being treated as preparation for something else.

(j) Allows Competing Solutions and Diversity of Outcomes
Assessments allow a range and diversity of outcomes open to multiple solutions of an original nature, rather than a single correct response obtained by the application of rules and procedures.

6.7 DIFFERENCES BETWEEN AUTHENTIC AND TRADITIONAL ASSESSMENTS

Assessment is authentic when we directly examine students' performance on worthy intellectual tasks. Traditional assessment, by contrast, relies on indirect or "proxy" items that, though efficient, are simplistic substitutes from which we think valid inferences can be made about students' performance at those valued challenges (Wiggins, 1990).
The differences can be summed up as in Table 6.1.

Table 6.1: Comparisons between Authentic and Traditional Assessments

Reasoning and practice
Authentic Assessment: Schools must help students become proficient at performing the tasks they will encounter when they leave school. To determine if teaching is successful, the school must ask students to perform meaningful tasks that replicate real-world challenges, to see if students are capable of doing so.
Traditional Assessment: Schools must teach a given body of knowledge and skills. To determine if teaching is successful, the school must test students to see if they acquired the knowledge and skills.

Assessment and curriculum
Authentic Assessment: Assessment drives the curriculum. That is, teachers first determine the tasks that students will perform to demonstrate their mastery, and then a curriculum is developed that will enable students to perform those tasks well, including the acquisition of essential knowledge and skills. This has been referred to as planning backwards.
Traditional Assessment: The curriculum drives assessment. The body of knowledge is determined first and becomes the curriculum that is delivered. Subsequently, the assessments are developed and administered to determine whether acquisition of the curriculum occurred.

Types of assessment tasks
Authentic Assessment: Students are required to demonstrate understanding by performing a more complex task, usually representative of more meaningful applications, such as carrying out a class project or keeping a portfolio.
Traditional Assessment: Students are required to take tests, usually of the selection type, in which they are asked to select the correct answer from the choices provided.

Nature of assessment tasks
Authentic Assessment: Real-life tasks are assigned for learners to perform in order to demonstrate their proficiency or competency.
Traditional Assessment: Contrived tests, e.g. MCQs, are used to assess learners' proficiency or understanding in a short period of time.

Focus of assessment
Authentic Assessment: Construction or application of knowledge. Assessment requires learners to be effective performers with the acquired knowledge; during assessment, learners are asked to analyse, synthesise and apply what they have learnt, creating new meaning in the process.
Traditional Assessment: Recall or recognition of knowledge. In assessment, learners are only required to show whether they can recognise and recall, normally facts that they have learnt out of context.

Learners' responses in assessment
Authentic Assessment: Learner structured. Authentic assessments allow more student choice and construction in determining what is presented as evidence of proficiency. Even when students cannot choose their own topics or formats, there are usually multiple acceptable routes towards constructing a product or performance.
Traditional Assessment: Teacher structured. What a student can and will demonstrate has been carefully structured by the person(s) who developed the test. A student's attention will understandably be focused on, and limited to, what is on the test.

Evidence of learners' proficiency or competency
Authentic Assessment: Authentic assessments offer more direct evidence of application and construction of knowledge. For example, asking a student to write a critique should provide more direct evidence of that skill than asking the student a series of multiple-choice analytical questions about a passage.
Traditional Assessment: The evidence is very indirect, particularly for claims of meaningful application in complex, real-world situations. For example, with MCQs a student cannot effectively critique the arguments someone else has presented (an important skill often required in the real world).

Reliability and validity
Authentic Assessment: Validity depends in part upon whether the assessment simulates real-world tests of ability. It is difficult to ensure reliability because of the subjective nature of the scoring method (rubric) and the presence of varied but acceptable learner responses.
Traditional Assessment: Validity is normally determined by matching test items to the curriculum content. It is possible to have high scoring reliability as the learners' responses are fixed; for example, there is only one right answer to a multiple-choice item.

Source: Adapted from Mueller (2005)

SELF-CHECK 6.1
1. What is authentic assessment?
2. State the other names used to describe authentic assessment.
3. Highlight three differences between authentic and traditional assessments.

ACTIVITY 6.3
1. State the reasons why authentic assessment is a good replacement for traditional assessment.
2. Give an example of authentic assessment.
Post your answers on the myINSPIRE online forum.

Summary

The strategy of asking students to perform real-world tasks that demonstrate meaningful application of essential knowledge and skills is called authentic assessment.

Authentic assessment is sometimes called performance assessment, alternative assessment or direct assessment.

Authentic assessment has many advantages, and traditional assessment complements it.

An authentic assessment usually includes a task for students to perform and a rubric by which their performance on the task will be evaluated.

Authentic assessment is a proven method of assessing human characteristics that has been in use for decades.

Key Terms

Alternative assessment; Backwards design; Contrived to real life; Direct assessment; Direct evidence; Indirect evidence; Kinaesthetic; Performance assessment; Student structured

References

Brady, L. (2012). Assessment and reporting: Celebrating student achievement (4th ed.). Melbourne, Australia: Pearson.

Kohn, A. (2006). The trouble with rubrics. English Journal, 95(4), 12–15.

Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F. Lindquist (Ed.), Educational measurement (pp. 4–22). Washington, DC: American Council on Education.

Mueller, J. (2005). The authentic assessment toolbox: Enhancing student learning through online faculty development. Journal of Online Learning and Teaching, 1(1), 1–7.

Reeves, T. C., Herrington, J., & Oliver, R. (2002). Authentic activity as a model for web-based learning. Presentation at the Annual Meeting of the American Educational Research Association, New Orleans, LA.

Vander Ark, T. (2013). What is performance assessment? Retrieved from http://gettingsmart.com/2013/12/performance-assessment/

Wiggins, G. (1990). The case for authentic assessment. Practical Assessment, Research & Evaluation, 2(2), 1–3.

Wiggins, G., & McTighe, J. (1998). Understanding by design. Alexandria, VA: Association for Supervision and Curriculum Development (ASCD).

Topic 7 Project and Portfolio Assessments

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Explain how to design an effective project for assessment;
2. Use different methods to assess group project work;
3. Discuss the usefulness of using projects as an assessment tool;
4. Describe the development of a portfolio; and
5. Discuss to what extent portfolios are useful as an assessment tool.

INTRODUCTION

Besides objective and essay tests, there are other methods of assessing students that you can use. In this topic, we will focus on two other assessment methods: project assessment and portfolio assessment. Both are examples of authentic assessment, which you learnt about in Topic 6. We will discuss project assessment first, followed by portfolio assessment. Since both project and portfolio assessments are examples of authentic assessment, whatever we discussed under authentic assessment applies to them, except that the points discussed in this topic are more specific.

7.1 PROJECT ASSESSMENT

Most of us have done some form of project work in school or university and know what a project is. However, when asked to define it, one will see varying interpretations of the project and its purpose. "Projects" can represent a range of tasks that can be done at home or in the classroom, by parents or groups of students, quickly or over time. While project-based learning (PBL) also features projects, in PBL the focus is more on the process of learning and learner-peer-content interaction than on the end product itself.

A project is an activity in which time constraints have been largely removed; it can be undertaken individually or by a group and usually involves a significant element of work being done at home or out of school. Project work has its roots in the constructivist approach, which evolved from the work of psychologists and educators such as Lev Vygotsky, Jerome Bruner, Jean Piaget and John Dewey. Constructivism views learning as the result of mental construction, wherein students learn by constructing new ideas or concepts based on their current and previous knowledge.

Most projects have the following common defining features (Katz & Chard, 1989), which can also be considered the strengths of using projects as an assessment tool:

(a) It is a student-centred process. Students have the liberty to decide and plan what to do and how to complete the project assigned, though the selection of a project may be determined by the teacher. If the choice is left to the students, it will probably require the approval of the teacher. What is significant is that students are involved in the beginning, middle and end of the project. They play an active role in the entire process and take ownership of their project. They are actively involved as problem solvers, decision makers, investigators, documenters and researchers. They should therefore find projects fun, motivating and challenging.

(b) The content of a project and its work process are meaningful to students and directly observable in their environment. This is because projects normally involve real-life problems and first-hand investigations. For instance, in working on a project, students have to choose a knowledge area, delimit it and formulate a problem or put forward questions. Then, they are required to solve the problems and answer the questions through further work, collection of materials and knowledge. Both the content gathered and the work process involved are purposeful and reflective of real-life situations. Project work allows for connections among school, life and work skills.
(c) Projects can be planned with specific goals related to the curriculum. Normally, work is planned in such a manner that it draws from knowledge areas and skills in the current curriculum. It encourages students to break away from the compartmentalisation of knowledge and instead involves drawing upon different aspects of knowledge. It provides students with opportunities to explore the interrelationships and interconnectedness of topics within a subject and between subjects. For instance, the making of an object requires handicraft skills, knowledge of materials, working methods and uses of the object, while writing the project report requires language skills. Technological supports will also enhance students' learning. Thinking skills are integral to project work. Project work thus involves drawing upon different aspects of knowledge and skills from the curriculum and provides students with an integrated learning experience.

(d) The product or output of a project is tangible and visible, and can be shared with the intended audience. It provides direct evidence of meaningful application of knowledge and skills by the students. Teachers can look directly at the project output or product to determine what students have learnt, while parents do not have to grapple with statistical results to know the performance of their children.

(f) Project work provides opportunities for reflective thinking and student self-assessment. Students can reflect on and self-assess what they have done not only at the end, but throughout the entire process, allowing continuous learning to take place.

(g) Project work allows for multiple types of authentic assessment. For example, in doing a project, students are required to use journals and diaries to document the work process, portfolios to compile the project products, and reports to explain the work procedures. All these outputs are useful authentic evidence of students' performance that can be assessed.

(h) Project work provides an opportunity for students to explore different approaches to solving problems. In project work, a teacher follows, discusses and assesses the work in all its different phases; the teacher is the student's supervisor. When working on a project, the whole work process is as important as the final result or product.

Generally, there are two types of projects:

(a) Research-based Project
This is more theoretical in nature and may consist of posing a question, formulating a problem or setting up some hypotheses. In order to answer the question, solve the problem or confirm the assumptions, information must be found, evaluated and used. This information can either be a result of the students' own investigations or may be obtained from public sources without being a pure reproduction. Such project work is usually presented as a research report.

(b) Product-based Project
This can be the production of a concrete object, a service, a dance performance, a film, an exhibition, a play, a computer programme and so forth.

There are many types of effective projects.
The following are some ideas for projects:

(a) Survey of historical buildings in the student's community;
(b) Study of the economic activities of people in the local community;
(c) Study of the transportation system in the district;
(d) Recreate a historical event;
(e) Develop a newsletter or website on a specific issue relevant to the school or community (school safety, recycling, how businesses can save energy and reduce waste);
(f) Compile oral histories of the local area by interviewing community elders;
(g) Produce a website as a "virtual tour" of the history of the community;
(h) Create a video of students graduating from a primary or secondary school;
(i) Create a wildlife or botanical guide for a local wildlife area;
(j) Create an exhibition on local products, local history or local personalities using audiotapes, videotapes and photographs; and
(k) Investigate pollution of local rivers, lakes and ponds.

The possibilities are endless. The key ingredient for any project idea is that it is student-driven, challenging and meaningful. It is important to realise that project-based instruction complements the structured curriculum; it builds on and enhances what students learn through systematic instruction. Teachers do not let students become the sole decision makers about what project to do, nor do teachers sit back and wait for the students to figure out how to go about the process, which may be very challenging (Bryson, 1994). This is where the teacher's ability to facilitate and act as a coach plays an important role in the success of the project. The teacher will brainstorm ideas with the student to come up with a number of project possibilities, discuss these possibilities and options, help the students form a guiding question and be ready to help them throughout the implementation process, such as by setting guidelines, due dates and resource selection (Bryson, 1994).

SELF-CHECK 7.1
1. What is a project?
2. State the differences between a research-based project and a product-based project.

ACTIVITY 7.1
Give examples of the two types of projects in your subject area or any subject area. Share your answer with your coursemates in the myINSPIRE online forum.

7.1.1 What is Assessed Using Projects?

Project-oriented work is becoming increasingly common in working life. Project competence, the ability to work together with others, personal initiative and entrepreneurship are skills often required by employers. These competences can be developed during project work, which thus prepares students for working life.

Project work makes schooling more like the real world. In real life, we seldom spend several hours listening to authorities who know more than we do and tell us exactly what to do and how to do things. We ask questions of the person we are learning from. We try to link what the person is telling us with what we already know. We bring our experiences and what we already know that is relevant to the issue or task, and say something about it. You can see this with a class of young learners: when the teacher tells a story, little kindergarten children raise their hands, eager to share their experiences of something related to the story. They want to be able to apply their natural tendencies to the learning process. This is how life is much of the time!
By giving project work, we open up areas in schooling where students can speak about what they already know. Project work is a learning experience which enables the development of certain knowledge, skills and attitudes that prepare students for lifelong learning and the challenges ahead (refer to Table 7.1).

Table 7.1: The Knowledge, Skills and Attitudes Achieved with Projects

Knowledge and skills application: The ability to apply the knowledge and skills acquired in the project task. Examples: be able to choose a knowledge area and delimit a task or problem; be able to choose relevant resources to complete the project; be able to draw up a project plan, implement it and, if necessary, revise it; be able to apply creative and critical thinking skills in solving problems.

Communication: The ability to communicate effectively by presenting ideas clearly and coherently to specific audiences, in both written and oral forms. Examples: be able to discuss with the supervising teacher how the work is developing; be able to provide a written report of the project describing the progress of work from the initial idea to the final product; be able to produce a final product which is an independent solution to the task or problem chosen.

Collaboration: The skills of working with others and in a team to achieve common goals. Examples: be able to participate actively in group discussion; be able to listen actively to the concerns of team members; be able to display a willingness to be a team player; be able to assess the strengths and weaknesses of partners; be able to recognise the contributions of team members; be able to play the roles assigned effectively and successfully.

Independent learning: The ability to learn on one's own, self-reflect and take appropriate actions to improve. Examples: be able to document the progress of the work and regularly report the process; be able to assess, either in writing or verbally, the work process and results; be able to manage time for learning efficiently.

Source: Adapted from Harwell and Blank (1997)

SELF-CHECK 7.2
What are the knowledge, skills and attitudes evaluated using a project?

ACTIVITY 7.2
To what extent has project work been used as an assessment strategy in Malaysian schools? Discuss this matter with your coursemates in the myINSPIRE online forum.

7.1.2 Designing Effective Projects

There are many types of projects, and there is no one correct way to design and implement a project, but there are some questions and considerations for designing effective projects. You may be surprised that many teachers are not sure why they use projects to assess their students. It is very important for everyone involved to be clear about the learning goals of the project. Herman, Aschbacher and Winters (1992) have identified five questions to consider when determining learning goals:

(a) What important cognitive skills do I want my students to develop? (For example, to use algebra to solve everyday problems, to write persuasively);

(b) What social and affective skills do I want my students to develop? (For example, develop teamwork skills);

(c) What metacognitive skills do I want my students to develop?
(For example, reflect on the research process they use, evaluate its effectiveness and determine methods of improvement);

(d) What types of problems do I want my students to be able to solve? (For example, know how to do research, apply a scientific method); and

(e) What concepts and principles do I want my students to be able to apply? (For example, apply basic principles of biology and geography in their lives, understand cause-and-effect relationships).

In designing project work for assessment, the teacher should also develop an outline that explains the project's essential elements and his or her expectations for each project. Although the outline can take various forms, it should contain the following elements (Bottoms & Webb, 1998):

(a) Situation or Problem
A sentence or two describing the issue or problem that the project is trying to address. For example: pollution levels in rivers, transportation problems in urban centres, increasing prices of essential items, the crime rate in squatter areas, youths loitering in shopping complexes, or students in Internet cafes during school hours.

(b) Project Description and Purpose
A concise explanation of the project's ultimate purpose and how it addresses the situation or problem. For example, students will research, conduct surveys and make recommendations on how students can help reduce pollution of rivers. Results will be presented in a newsletter, information brochure, exhibition or website.

(c) Performance Specifications
A list of criteria or quality standards the project must meet.

(d) Rules
Guidelines for carrying out the project, including a timeline and short-term goals, such as having interviews and research completed by a certain date.

(e) List of Project Participants with Roles Assigned
The roles of team members and, if members of the community are involved, their roles as well.

(f) Assessment
How the student's performance will be evaluated. In project work, the learning process is evaluated as well as the final product.

Steinberg (1998) provides a checklist, called the Six A's Project Checklist, for the design of effective projects (refer to Table 7.2). The checklist can be used throughout the process to help both teacher and student plan and develop a project, as well as to assess whether the project is successful in meeting instructional goals.

Table 7.2: The Six A's Project Checklist

Authenticity
Does the project stem from a problem or question that is meaningful to the student? Is the project similar to one undertaken by an adult in the community or workplace? Does the project give the student the opportunity to produce something that has value or meaning to the student beyond the school setting?

Academic rigour
Does the project enable the student to acquire and apply knowledge central to one or more discipline areas? Does the project challenge the student to use methods of inquiry from one or more disciplines (such as to think like a scientist)? Does the student develop higher-order thinking skills (such as searching for evidence, using different perspectives)?
Applied learning
Does the student solve a problem that is grounded in real life and/or work (such as design a project or organise an event)? Does the student need to acquire and use skills expected in high-performance work environments (such as teamwork, problem-solving, communication or technology)? Does the project require the student to develop organisational and self-management skills?

Active exploration
Does the student spend a significant amount of time doing work in the field, outside school? Does the project require the student to engage in real investigative work, using a variety of methods, media and sources? Is the student expected to explain what he or she learnt through a presentation or performance?

Adult relationships
Does the student meet and observe adults with relevant experience and expertise? Is the student able to work closely with at least one adult? Do adults and students collaborate on the design and assessment of the project?

Assessment practices
Does the student reflect regularly on his or her learning, using clear project criteria that he or she has helped to set? Do adults from outside the community help the student develop a sense of the real-world standards for this type of work? Is the student's work regularly assessed through a variety of methods, including portfolios and exhibitions?

Source: Steinberg (1998)

In implementing the project, it is also important to ensure that the following questions are addressed:

(a) Do the students have easy access to the resources they need? This is especially important if a student is using specific technology or subject-matter expertise from the community;

(b) Do the students know how to use the resources? Students who have minimal experience with the computer, for example, may need extra assistance in utilising it;

(c) Do the students have mentors or coaches to support them in their work? These can be in-school or out-of-school mentors; and

(d) Are students clear on the roles and responsibilities of each person in the group?

ACTIVITY 7.3
1. What are some of the factors you should consider when designing project work for students in your subject area?
2. Give examples of projects you have included or can include in the teaching and evaluation of your subject area.
Post your answers on the myINSPIRE online forum.

7.1.3 Possible Problems with Project Work

Teachers intending to use projects as both an instructional and an assessment tool should be aware of certain problem areas. Be as specific as possible in determining outcomes so that both the student and the teacher understand exactly what is to be learnt. In addition, be aware of the following problems when undertaking project-based instruction:

(a) Aligning project goals with curriculum goals can be difficult and requires careful planning. For example, if one of the curriculum goals is to develop the reading skills of learners, the project planned should enable learners to acquire those skills in the process. Very often, the skills are assessed but the scope for learning is not made available.

(b) Parents tend to be exam-oriented. To parents, the best way to assess learning is through tests and examinations; they cannot see how doing a project is related to the overall assessment of learning.
They are thus not supportive of projects and consider doing projects a waste of student learning time and resources.

(c) Students are not clear as to what is required of them. This happens when the assigned projects do not have adequate and clear structure and guidelines, and students are not given proper guidance on how to carry out the projects.

(d) Projects often take a long time to complete, and teachers need a lot of time to prepare good authentic projects and to manage and monitor their implementation.

(e) Teachers are not traditionally prepared to integrate curriculum content into real-world activities. Besides, they may not be familiar with how they should assess projects. There is thus a need for intensive staff development to prepare them for the job.

(f) Resources needed for project work are not confined to the usual classroom materials such as paper and pencil. For instance, a project to develop a website requires a computer, specialised software and Internet facilities, which may not be readily available to the learners. Such resources also involve cost, so support from the school administration is needed.

(g) Scoring students' project work can be a daunting task. Project work normally involves the assessment of diverse competencies and allows for the production of many diverse outputs by the students, so assessing project work is time-consuming. Besides, teachers need to be specially trained to carry out this type of assessment, especially when the project is undertaken by a group of students. Fairness in assessment among group members is an issue that needs special attention.

7.1.4 Group Work in Projects

A group project requires two or more students to work together on a longer project. Working in groups has become an accepted part of learning as a consequence of the widely recognised benefits of collaborative group work for student learning. When groups work well, students learn more and produce higher-quality learning outcomes. What are some benefits of group work in projects?

(a) Group Work Can Enhance the Overall Quality of Student Learning
Groups that work well together can achieve much more than an individual working alone. A broader range of skills can be applied to practical activities, and sharing and discussing ideas can play a pivotal role in deepening an individual student's understanding of a particular subject area. This is because working in a group enables him to examine topics from the perspectives of others. When an individual is required to discuss a topic and negotiate how to address it, he is forced to listen to other people's ideas. Their ideas will then influence his own thinking and broaden his horizons. His group members are not just fellow learners; they are also his teachers. Besides, being part of a team will help him develop his interpersonal skills, such as speaking and listening. Group work will also help him find his own strengths and weaknesses (for example, he may be a better leader than listener, or he might be good at coming up with the "big ideas" but not so good at putting them into action). Enhanced self-awareness will help his approach to learning.

(b) Group Work Can Improve Peer Learning
Group work enhances peer learning. Students learn from each other and benefit from activities that require them to articulate and test their knowledge.
Group work provides an opportunity for students to clarify and refine their understanding of concepts through discussion and rehearsal with peers. Many, but not all, students recognise the value of group work to their personal development and of being assessed as members of a group. Working with a group and for the benefit of the group also motivates some students. Group assessment helps some students develop a sense of responsibility. A student working in a group on a project may think, "I felt that because one is working in a group, it is not possible to slack off or to put things off. I have to keep working, otherwise I would be letting other people down."

(c) Group Work Can Help Develop Generic Skills Sought by Employers
As a direct response to the objective of preparing graduates with the capacity to function successfully as team members in the workplace, there has been a trend in recent years to incorporate generic skills alongside traditional subject-specific knowledge in the expected learning outcomes in higher education. Group work can facilitate the development of skills which include:

(i) Teamwork skills (skills in working within team dynamics and leadership skills);
(ii) Analytical and cognitive skills (analysing task requirements, questioning, critically interpreting material and evaluating the work of others);
(iii) Collaborative skills (conflict management and resolution, accepting intellectual criticism, flexibility, and negotiation and compromise); and
(iv) Organisational and time management skills.

A student might say, "Having to do group work has changed the way I worked. I could not do it all the night before. I had to be more organised and efficient."

(d) Group Work May Reduce the Workload Involved in Assessing, Grading and Providing Feedback to Students
Group work, and group assessment in particular, is sometimes implemented in the hope of streamlining assessment and grading tasks. In simple terms, if students submit group assignments, then the number of pieces of work to be assessed can be vastly reduced. This prospect might be particularly attractive for staff teaching large classes.

SELF-CHECK 7.3
1. What are some problems in the implementation of project work and how would you solve them?
2. What are the benefits of group work in projects?

7.1.5 Assessing Project Work

Assessing student performance on project work is quite different from an examination using objective tests and essay questions. Students might be working on different projects; some may be working in groups while others are working alone. This makes the task of assessing student progress more complex than with a paper-and-pencil test, where everyone is evaluated using one marking scheme. Table 7.3 illustrates a general marking scheme for projects.

Table 7.3: General Marking Scheme for Projects

100–90%: Exceptional and distinguished work of a professional standard. Outstanding technical and expressive skills. Work demonstrating exceptional creativity and imagination. Work displaying great flair and originality.

89–80%: Excellent and highly developed work of a professional standard. Extremely good technical and expressive skills. Work demonstrating a high level of creativity and imagination. Work displaying flair and originality.

79–70%: Very good work which approaches professional standard. Very good technical and expressive skills.
Work demonstrating good creativity and imagination. Work displaying originality.

69–60%: A good standard of work. Good technical and expressive skills. Work displaying creativity and imagination. Work displaying some originality.

59–50%: A reasonable standard of work. Adequate technical and expressive skills. Work displaying competence in the criteria assessed, but which may be lacking some creativity or originality.

49–40%: A limited, but adequate standard of work. Limited technical and expressive skills. Work displaying some weaknesses in the criteria assessed and lacking creativity or originality.

39–30%: Limited work which fails to meet the required standard. Weak technical and expressive skills. Work displaying significant weaknesses in the criteria assessed.

29–20%: Poor work. Unsatisfactory technical or expressive skills. Work displaying significant or fundamental weaknesses in the criteria assessed.

19–10%: Very poor work, or work where very little attempt has been made. A lack of technical or expressive skills. Work displaying fundamental weaknesses in the criteria assessed.

9–1%: Extremely poor work, or work where no serious attempt has been made.

Source: Chard (1992)

Product, Process or Both?

According to Bonthron and Gordon (1999), from the outset you should be clear about the following:

(a) Whether you are going to assess only the product of the group work, or both product and process;

(b) If you intend to assess process, what proportion of the marks you are going to allocate for process, based on what criteria, and how you are going to use the criteria to assess process; and

(c) What criteria you are planning to use to assess the product and how the marks will be distributed.

Some educators believe there is a need to assess the processes within groups as well as the products or outcomes. What exactly does process mean? Both teachers and students must be clear about this. For example, if you want to assess "the level of interaction" among students in the group, they should know what "high" or "low" interaction means. Should the teacher be involved in the working of each group, or should assessment rely on self- or peer assessment? Obviously, being involved in so many groups would be physically impossible for the teacher. So, how do you measure "process"? Some educators may say, "I don't care what they do in their groups. All I'm interested in is the final product, and how they arrive at their results is their business." However, to provide a more balanced assessment, there is growing interest in both the process and the product of group work, and the issue that arises is: what proportion of assessment should focus on product and what proportion on process?

The criteria for the evaluation of group work can be determined by teachers alone, or by teachers and students together, with group members consulted on what should be assessed in a project. Obviously, you have to be clear about the intended learning outcomes of the project in your subject area; they are a useful starting point for determining the criteria for assessing the project. Once these broader learning outcomes are understood, you can establish the criteria for marking the project.
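Once criteria like those in Table 7.3 are agreed, applying the banding consistently is a mechanical lookup, which can even be automated. The following minimal Python sketch maps a percentage mark to the band descriptors of Chard's scheme; the function and variable names are illustrative assumptions of this sketch, not part of the module:

```python
# Illustrative only: look up the Table 7.3 (Chard, 1992) band
# descriptor for a percentage mark. All names are this sketch's own.

BANDS = [
    (90, "Exceptional and distinguished work of a professional standard"),
    (80, "Excellent and highly developed work of a professional standard"),
    (70, "Very good work which approaches professional standard"),
    (60, "A good standard of work"),
    (50, "A reasonable standard of work"),
    (40, "A limited, but adequate standard of work"),
    (30, "Limited work which fails to meet the required standard"),
    (20, "Poor work"),
    (10, "Very poor work or work where very little attempt has been made"),
    (1,  "Extremely poor work or work where no serious attempt has been made"),
]

def band_for(mark: int) -> str:
    """Return the Table 7.3 band descriptor for a mark between 1 and 100."""
    for floor, descriptor in BANDS:
        if mark >= floor:
            return descriptor
    raise ValueError("mark must be between 1 and 100")

print(band_for(84))  # Excellent and highly developed work ...
```

Of course, a lookup like this only applies the scheme; deciding which band a piece of work deserves remains the professional judgement discussed in this subtopic.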
Generally, it is easier to establish criteria for measuring the product of project work and much more difficult to measure the processes involved. However, it is suggested that evaluation of product and process be done separately rather than attempting to do both at once.

Who Gets the Marks – Individuals or the Group?

Most projects involve more than one student, and the benefits of group work have been discussed earlier. A major problem of evaluating projects involving group work is how to allocate marks fairly among group members. As one student exclaimed, "I would like my teacher to tell me what amount of work and effort will get what mark." Other concerns would be, "Do all students get the same mark even though not all students put in the same effort?" and "Are marks given for the individual contribution of team members?" These are questions that bother teachers, especially when it is common to find freeloaders or sleeping partners in group projects. The following are some suggestions for how group work may be assessed:

(a) Shared Group Mark
All group members receive the same mark for the work submitted, regardless of individual contribution. It is a straightforward method that encourages group work, where group members sink or swim together. However, it may be perceived as unfair by better students, who may complain that they are disadvantaged by weaker students, and the likelihood of "sleeping partners" is very high.

(b) Share-out Marks
The students in the group decide how the total number of marks should be shared among them. For example, a score of 40 is given by the teacher for the project submitted. There are five members in the group, so the total marks available are 5 × 40 = 200. The students then share the 200 marks based on the contribution of each of the five students, which may be 35, 45, 42, 38 and 40. This is an effective method if group members are fair, honest and do not have ill feelings towards each other. However, there is a likelihood that the marks will simply be distributed equally to avoid ill feelings among group members.

(c) Individual Mark
Each student in the group submits an individual report based on the task allocated or on the whole project.

(i) Allocated Task
From the beginning, the project is divided into different parts or tasks, and each student in the group completes his or her allocated task that contributes to the final group product and gets the marks for that task. This method is a relatively objective way of ensuring individual participation and may motivate students to work hard on their own task or part. The problem is breaking up the project into tasks that are exactly equal in size or complexity. Also, the method may not encourage group collaboration, and some members may slow down progress.

(ii) Individual Report
Each student writes and submits an individual report based on the whole project. The method ensures individual effort and may be perceived as fair by students. However, it is difficult to determine how the individual reports should differ, and students may unintentionally commit plagiarism.

(d) Individual Mark (Examination)
Use examination questions that specifically target the group projects and can only be answered by students who have been thoroughly involved in the project. This method may motivate students to learn from the group project, including learning from the other members of the group.
However, it may not be effective because students may be able to answer the questions by reading the group reports. In the Malaysian context, a national examination may not be able to include such questions as it involves hundreds of thousands of students.

(e) Combination of Group Average and Individual Marks
The group mark is awarded to each member, with a mechanism for adjusting for individual contributions. This method may be perceived to be fairer than a shared group mark. However, it means additional work for teachers trying to establish individual contribution. (A brief code sketch of such an adjustment is given at the end of subtopic 7.1.6.)

ACTIVITY 7.4
Which of the five methods of assessing group work would you use in evaluating project work in your subject area? Give reasons for your choice. Post your answer on the myINSPIRE online forum.

7.1.6 Evaluating Process in a Project

The assessment of a group product is rarely the only assessment taking place in group activities. The process of group work is increasingly recognised as an important element in the assessment of group work. Moreover, where group work is marked solely on the basis of product and not process, there can be differences in individual grading that are unfair and unacceptable. The following are the elements which are considered in evaluating process:

(a) Peer/Self Evaluation of Roles
Students rate themselves as well as other group members on specific criteria, such as responsibility, contributing ideas and finishing tasks. This can be done through various grading forms (refer to Figure 7.1) or by having students write a brief essay on the group members' strengths and weaknesses.

Figure 7.1: Checklist for evaluating processes involved in project work
Source: Sutherland (2003)

(b) Individual Journals
Students keep a journal of events that occur in each group meeting, including who attended, what was discussed and plans for future meetings. These can be collected and periodically read by the instructor, who comments on progress. The instructor can thus provide guidance for the group without directing it.

(c) Minutes of Group Meetings
Similar to journals are minutes for each group meeting, which are periodically read by the instructor. These include who attended, tasks completed, tasks planned and the contributors to various tasks. This provides the instructor with a way of monitoring individual contributions to the group.

(d) Group and Individual Contribution Grades
Instructors can divide the project grade into percentages for individual and group contribution. This is especially beneficial if peer and self-evaluations are used.

Logs can potentially provide plenty of information to form the basis of assessment, while keeping minutes helps members to focus on the process, which is a learning experience in itself. These techniques may be perceived as a fair way to deal with "shirkers" and to recognise outstanding contributions. However, reviewing logs can be time consuming for teachers, and students may need a lot of training and experience in keeping records. Also, emphasis on second-hand evidence may not be reliable.
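To make the mark-allocation arithmetic of subtopics 7.1.5 and 7.1.6 concrete, here is a minimal Python sketch of two of the methods described above: share-out marks, and a group mark combined with peer-rated individual contribution (the sketch promised under method (e)). The 80/20 weighting, the 0–100 rating scale and all names are illustrative assumptions of this sketch, not prescriptions from the module:

```python
# Illustrative only: two ways of allocating group-project marks.
# The weighting and all names are assumptions of this sketch.

def share_out(group_mark: float, shares: list[float]) -> list[float]:
    """Share-out marks: the group divides a pool of
    group_mark x number_of_members among its members."""
    pool = group_mark * len(shares)
    if abs(sum(shares) - pool) > 1e-9:
        raise ValueError(f"shares must sum to the pool of {pool}")
    return shares

def combined_mark(group_mark: float, peer_ratings: list[float],
                  group_weight: float = 0.8) -> list[float]:
    """Combination method: each member receives the group mark,
    adjusted by a peer rating of contribution on a 0-100 scale."""
    return [round(group_weight * group_mark + (1 - group_weight) * r, 1)
            for r in peer_ratings]

# The worked example from subtopic 7.1.5: a project mark of 40 for a
# five-member group gives a pool of 200, shared as 35, 45, 42, 38 and 40.
print(share_out(40, [35, 45, 42, 38, 40]))  # [35, 45, 42, 38, 40]
print(combined_mark(70, [85, 70, 55]))      # [73.0, 70.0, 67.0]
```

Whatever formula is used, it should be announced to students before the project begins, since perceived fairness depends on knowing in advance how contribution will be rewarded.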
7.1.7 Self-assessment in Project Work

Self-assessment is a process by which students learn about themselves; for example, what they have learnt from the project, how they have learnt it and how they have reacted in certain situations while carrying out the project. Involving students in the assessment process is an essential part of balanced assessment. When students become partners in the learning process, they gain a better sense of themselves as readers, writers and thinkers. Some teachers may be uncomfortable with self-assessment because traditionally teachers are responsible for all forms of assessment in the classroom, and here we are asking students to assess themselves. Self-assessment can take many forms, such as:

(a) Discussion involving the whole class or small groups;
(b) Reflection logs;
(c) Self-assessment checklists or inventories; and
(d) Teacher-student interviews.

These types of self-assessment share a common theme: they ask students to review their work to determine what they have learnt and whether areas of confusion still exist. Although each method may differ slightly, all include enough time for students to consider thoughtfully and evaluate their own progress.

Since project learning is student-driven, assessment should be student-driven as well. Students can keep journals and logs to continually assess their progress. A final reflective essay or log can allow students and teachers to understand thinking processes, the reasoning behind decisions, the ability to arrive at conclusions and the ability to communicate what has been learnt. According to Edwards (2000), the following are some questions a student can ask himself or herself while self-assessing:

(a) What were the project's successes?
(b) What might I do to improve the project?
(c) How well did I meet my learning goals? What was most difficult about meeting the goals?
(d) What surprised me most about working on the project?
(e) What was my group's best team effort? Worst team effort?
(f) How do I think other people involved with the project felt it went?
(g) What were the skills I used during this project? How can I practise these skills in the future?

SELF-CHECK 7.4
1. Explain how process can be measured in group project work.
2. List some of the problems with the evaluation of process.

ACTIVITY 7.5
Do you think process should be assessed? Justify your answer in the myINSPIRE online forum.

7.2 WHAT IS A PORTFOLIO?

A portfolio is a collection of pieces of student work presented for assessment. However, it does not contain all the work a student does. It may contain examples of "best" works or examples from each of several categories of work; for example, a book review, a letter to a friend, a creative short story and a persuasive essay. A student's portfolio may have one or several goals, and the student will select and submit works to meet these goals. The works submitted should provide evidence of the student's progress towards the goals and reflect both student production and process. A portfolio is therefore not a pile of student work that accumulates over a semester or year. Rather, a portfolio contains a purposefully selected subset of student work which reflects the student's efforts, progress and achievements in different areas of the curriculum.
Some people may associate portfolios with the stock market, where a person or organisation keeps a portfolio of the stocks and shares owned. A portfolio can be defined as a container that holds evidence of an individual's skills, ideas, interests and accomplishments. The organised collection of contents such as text, files, photos, videos and more that tell this story are generically referred to as artefacts: evidence of what students have learnt. These artefacts are usually accompanied by students' reflections. The particular purposes of the portfolio determine the number and type of items to be included, the process for selecting the items, and how and whether students respond to the items selected.

Portfolios offer a way of assessing student learning that is different from traditional methods. Portfolio assessment provides the teacher and students an opportunity to observe students in a broader context: taking risks, developing creative solutions and learning to make judgements about their own performances (Paulson, Paulson & Meyer, 1991).

Portfolios typically are created for one of three purposes: to show growth, to showcase current abilities, or to evaluate cumulative achievement. Many educators who work with portfolios consider the reflection component the most critical element of a good portfolio. Simply selecting samples of work can produce meaningful stories about students, and others can benefit from "reading" these stories.

The students themselves will miss significant benefits of the portfolio process if they are not asked to reflect upon the quality and growth of their work. As Paulson et al. (1991) stated, "The portfolio is something that is done by the student, not to the student." Most importantly, it is something done for the student. The student needs to be directly involved in each phase of the portfolio development to learn the most from it, and the reflection phase holds the most promise for promoting student growth. Portfolios are sometimes described as portraits of a person's accomplishments. Using this metaphor, we can consider a student portfolio a self-portrait, but one that has benefited from guidance and feedback from a teacher and sometimes from other students.

7.2.1 What is Portfolio Assessment?

Increasingly, portfolio assessment is gaining acceptance as an assessment strategy that seeks to present a more holistic view of the learner. The collections of works by students are assessed, hence the term portfolio assessment. However, some suggest that portfolios are not really assessments at all because they are just collections of previously completed assessments. In a portfolio assignment, students are in fact performing authentic tasks which capture meaningful application of knowledge and skills. Their portfolios often tell compelling stories of the growth of the students' talents and showcase their skills through a collection of authentic performances. The portfolio provides for continuous and ongoing assessment (i.e. formative assessment) as well as assessment at the end of a semester or a year (i.e. summative assessment). Emphasis is more on monitoring students' progress towards achieving the learning outcomes of a particular subject, course or programme.
Portfolio assessment has been described as multidimensional because it allows students to include different aspects of their work, such as essays, project reports, performance on objective tests, objects or artefacts they have produced, poems, laboratory reports and so forth. In other words, the portfolio contains samples of work over an entire semester, term or year, rather than from single points in time such as examination week only. Using portfolios introduces students to an evaluation format with which they may need to become familiar as more schools adopt portfolio assessment.

Although many portfolios reflect long-term projects completed over a period of time, they do not have to be that way. Teachers can have students create portfolios of their work for a particular unit; that portfolio might count as a project for that particular topic of study. Though portfolio assessment is currently quite popular in our school system, there are still teachers who are uncomfortable using it as an assessment tool. These teachers may think that the portfolio is a very subjective form of assessment. They may be unsure of the purpose of a portfolio and its uses in the classroom. To them, there is also the question of how the portfolio can be most effectively used to assess student learning. The situation can be overcome if these teachers understand the purpose of portfolios, how portfolios can be used to evaluate their students' work and how grades will be determined.

Portfolio assessment represents a significant shift in thinking about the role of assessment in education. Teachers who use this strategy in the classroom have shifted their philosophy of assessment from merely comparing achievement (based on grades, test scores and percentile rankings) towards improving students' achievement through feedback and self-reflection. Teachers should convey to students the purpose of the portfolio, what constitutes quality work and how the portfolio is graded.

7.2.2 Types of Portfolios

There are two main types of portfolios, namely process-oriented and product-oriented portfolios (refer to Table 7.4).

Table 7.4: Types of Portfolios

Process-oriented Portfolios: These portfolios tell a story about the student and how the learner has grown. They include earlier drafts and show how these drafts have been improved upon. For example, the first draft of a poem written by a Year Three student is reworked based on the comments of the teacher and the student's reflection on his or her work. All the drafts and changes made are kept in the portfolio. In this manner, student works can be compared, providing evidence of how the student's skills have improved.

Product-oriented Portfolios: These portfolios contain the works of a student which he or she considers the best. The aim is to document and reflect on the quality of the final products rather than the process that produced them. The student is required to collect all his or her work at the end of the semester, at which time he or she must select those works which are of the highest quality. Students could be left to make the decision themselves, or the teacher can set the criteria on what a portfolio must contain and the quality of the works to be included.

SELF-CHECK 7.5
1. What is portfolio assessment?
2. Describe two main types of portfolios.
7.2.3 Developing a Portfolio

According to Epstein (2006), the design and development of a portfolio involves four main steps as follows:

(a) Collection
This step simply requires students to collect and store all of their work. Students have to get used to the idea of documenting and saving their work, which they may not have done before. Questions involved in this step are:
(i) How should the work be organised? By subject or by themes?
(ii) How should the work be recorded and stored?
(iii) How do we get students to form the habit of documenting evidence?

(b) Selection
This will depend on whether it is a process or product portfolio and on the criteria set by the teacher. Students will go through the work collected and select certain works for their portfolio. This might include examination papers and quizzes, audio and video recordings, project reports, journals, computer work, essays, poems, artwork and so forth. Questions related to this step are:
(i) How does one select? What is the basis of selection?
(ii) Who should be involved in the selection process?
(iii) What are the consequences of not completing the portfolio?

(c) Reflection
This is the most important step in the portfolio process. It is reflection that differentiates the portfolio from a mere collection of student work. Reflection is often done in writing but it can also be done orally. Students are asked why they have chosen a particular product or work (such as an essay), how it compares with other work, what particular skills and knowledge were used to produce it and how it can be further improved. Questions related to this step are:
(i) Should students reflect on how or why they chose certain works?
(ii) How should students go about the reflection process?

(d) Connection
As a result of reflection, students begin to ask themselves, "Why are we doing this?" They are encouraged to make connections between their schoolwork and the value of what they are learning. They are also encouraged to make connections between the work included in their portfolio and the world outside the classroom. They learn to exhibit what they have done in school to the community. Questions to ask are:
(i) How is the cumulative effect of the portfolio evaluated?
(ii) Should students exhibit their works?

7.2.4 Advantages of Portfolio Assessment

Portfolio assessments have been gaining importance as an assessment strategy in educational institutions because of their benefits to teaching and learning. However, like other assessment methods, portfolio assessments bring some problems along with their benefits. Let us first look at the advantages of using portfolios as an assessment tool.

(a) Allows Assessment of Creativity and Higher-level Cognitive Skills
It has frequently been suggested that paper-and-pencil tests (objective and essay tests) are not able to assess all the learning outcomes in a particular subject area. For example, many higher-level cognitive skills and the affective domain (feelings, emotions, attitudes and values) are not adequately assessed using traditional assessment methods. However, portfolio assessments allow for the assessment of students' higher-level cognitive skills such as critical and creative thinking. For instance, students can be assessed on how critical they are in their individual reflections.
Likewise, they can be assessed on how creative they are in the selection and presentation of their works for portfolio assessment. Materials compiled by a student in the portfolio development process also provide evidence about his or her growth in the affective domain, such as self-confidence, diligence, attention to detail and a positive attitude towards learning.

(b) Continuous, Ongoing Process
Portfolio assessments are an ongoing process. Hence, they not only provide an opportunity for the teacher to trace or monitor change and growth over a period of time, but also provide an opportunity for students to reflect on their own learning and thinking. Teachers have an opportunity to monitor students' understanding and approaches to solving problems and making decisions (Paulson et al., 1991), while upon reflection, students can identify where they have gone wrong or how they can improve. Emphasis in portfolio assessment is on improving students' achievement rather than ranking students according to their performance on tests. Portfolio assessments are both formative and summative. Since the assessment is a continuous and ongoing process, it allows students to reflect on their own learning and thinking, and allows teachers to monitor students' progress and provide feedback. Through teacher feedback and self-reflection, students improve their achievement. The assessment of this learning process is formative, while the assessment of the documents being compiled, as well as of the final products at the end of the semester or year, is summative.

(c) Multidimensional
Portfolio assessments are multidimensional in that they allow for the inclusion of different aspects of students' work and also samples of work over time, i.e. over a semester or year. Portfolios are thus a rich source of evidence on student learning and development, providing a more complete and holistic picture of students' achievement.

(d) Encourages Self-assessment
Portfolio assessments involve students in the assessment process. Students self-assess their own work and decide what to include as evidence of their growth and performance. They judge their work using explicit criteria to identify strengths and weaknesses and to monitor their own progress. By evaluating their own work, students become more accountable and more responsible for their own learning. Student learning also becomes more meaningful. For self-evaluation to take place, teachers should constantly invite students to reflect on their growth and performance. The teachers should convey to students the purpose of the portfolio, what constitutes quality work and how the portfolio is graded. Feedback enables learners to reflect on what they are learning and why.

(e) Tangible Outcomes
In portfolio assessments, the portfolios give parents and teachers concrete examples of students' development over time as well as their current skills and abilities. The assessment outcomes are also tangible and more meaningful than numeric statistical results. For example, through the compositions written and compiled by a student in the portfolio, parents and teachers can not only see how he or she has mastered writing skills but also how he or she has improved over time.
(f) Individualised
Portfolio assessments are individualised, meaning that students' portfolios are assessed separately. This allows teachers to see their students as individuals, each with his or her own unique characteristics, needs and strengths. With this understanding, teachers can adapt their instruction to the learning needs and styles of the students.

7.2.5 Disadvantages of Portfolio Assessment

There are also disadvantages in portfolio assessment.

(a) Time Consuming
Extra time is needed to plan an assessment system, as the assessment involves multiple learning outcomes. The problem is further compounded if a large group of students needs to be assessed. The portfolio outputs can be varied and require expert judgement. Thus, assessing portfolios is time consuming for teachers, and the data from portfolios can be difficult to analyse.

(b) Need for Constant Feedback and Mentoring
For portfolio assessments to be beneficial to students, teachers need to provide constructive feedback on the work included in the students' portfolios and on the portfolios as a whole. Teachers also need to provide guidance, through portfolio conferences, about how best to construct a portfolio for a specific purpose. Scheduling individual portfolio conferences is difficult, and the length of each conference may interfere with other instructional activities. There is thus a need to ensure that the benefits of portfolio assessments justify the investment of time by the teachers.

(c) Poor Reliability
Scoring portfolios involves extensive use of subjective evaluation procedures and is thus open to the question of reliability. Assessment is useless if the data obtained are unreliable. According to Nitko (2001), the reliability of portfolio results is typically in the 0.4 to 0.6 range. This indicates that as much as 60 per cent of the variability in portfolio scores is the result of measurement error. This should give all educators reason to be cautious when using the results of portfolio assessments in assigning course grades, certifying achievement or making high-stakes decisions. Part of the poor reliability comes from the difficulty of establishing clear scoring criteria for the large and diverse sets of materials that are included. Poor reliability is also due to a lack of standardisation, which leads to incomparability of the portfolio entries that different students choose to include.

(d) Energy, Skills and Resources
Society is still strongly oriented towards grades and test scores and, in addition, most universities and colleges still use test scores and grades as the main admission criteria. Parents who are exam-oriented cannot see how keeping portfolios is related to the overall assessment of learning. They are thus usually not supportive of portfolio assessments and consider them a waste of student learning time and resources. To them, the best way to assess learning is through tests and examinations. An important aspect of portfolio assessment is reflection. In fact, this is the most important step in the portfolio process. As mentioned earlier in subtopic 7.2.3, it is reflection that differentiates the portfolio from a mere collection of student work. Reflection is often done in writing but it can also be done orally. Students will be asked to reflect on what they are learning and why. To be able to do this, students must possess good metacognitive skills: they must be able to think about their own thinking.
This can be a problem for students, especially those who are underachievers.

In summary, portfolio assessments have significant strengths and weaknesses. On the positive side, they provide a broad framework for examining a student's progress, encourage student participation in the assessment process and strengthen the relationship between instruction and assessment. On the down side, they demand considerable time, energy and a certain degree of expertise on the part of teachers as well as students, and they have questionable reliability.

7.2.6 How and When Should Portfolios be Assessed?

If the purpose of the assessment is to demonstrate progress, the teacher could make judgements about the evidence of progress and provide those judgements as feedback to the student. The student could self-assess progress to check whether the goals have been met.

The portfolio is more than just a collection of student work. The teacher may assess and assign grades to the process of assembling and reflecting upon the portfolio of a student's work. The students might also have included reflections on growth, on strengths and weaknesses, on goals that were or are to be set, on why certain samples tell a certain story about them or on why the contents reflect sufficient progress to indicate completion of designated standards. Some of the process skills may also be part of the teacher's, school's or district's standards, so the portfolio provides some evidence of attainment of those standards. Any or all of these elements can be evaluated and/or graded.

Portfolio assignments can be assessed or graded with a rubric. A rubric is useful in avoiding personal judgement when assessing a complex product such as a portfolio. Clear criteria for assessment, including what must be included in the portfolio, and rubrics are vital to a successful portfolio assessment. A rubric can provide some clarity and consistency in assessing and judging the quality of the content and the elements making up that content. Moreover, the application of a rubric increases the likelihood of consistency among the teachers who are assessing the portfolios. Table 7.5 is a sample portfolio rubric that may be used for self-assessment and peer feedback. Each criterion is rated at one of four levels: Unsatisfactory, Emerging, Proficient or Exemplary.

Table 7.5: Portfolio Rubric

Selection of artefacts
- Unsatisfactory: The artefacts and work samples do not relate to the purpose of the portfolio.
- Emerging: Some of the artefacts and work samples are related to the purpose of the portfolio.
- Proficient: Most artefacts and work samples are related to the purpose of the portfolio.
- Exemplary: All artefacts and work samples are clearly and directly related to the purpose of the portfolio. A wide variety of artefacts is included.

Descriptive text
- Unsatisfactory: No artefacts are accompanied by a caption that clearly explains the importance of the item, including title, author and date.
- Emerging: Some of the artefacts are accompanied by a caption that clearly explains the importance of the item, including title, author and date.
- Proficient: Most of the artefacts are accompanied by a caption that clearly explains the importance of the item, including title, author and date.
- Exemplary: All artefacts are accompanied by a caption that clearly explains the importance of the item, including title, author and date.

Reflection (growth and goals)
- Unsatisfactory: The reflections do not explain growth or include goals for continued learning.
- Emerging: A few of the reflections explain growth and include goals for continued learning.
- Proficient: Most of the reflections explain growth and include goals for continued learning.
- Exemplary: All reflections clearly explain how the artefacts demonstrate the student's growth, competencies and accomplishments, and include goals for continued learning.

Reflection (critique)
- Unsatisfactory: The reflections do not illustrate the ability to effectively critique work or provide suggestions for constructive practical alternatives.
- Emerging: A few reflections illustrate the ability to effectively critique work and provide suggestions for constructive practical alternatives.
- Proficient: Most of the reflections illustrate the ability to effectively critique work and provide suggestions for constructive practical alternatives.
- Exemplary: All reflections illustrate the ability to effectively critique work and provide suggestions for constructive practical alternatives.

Citations
- Unsatisfactory: No images, media or text created by others are cited with accurate, properly formatted citations.
- Emerging: Some of the images, media or texts created by others are not cited with accurate, properly formatted citations.
- Proficient: Most images, media or text created by others are cited with accurate, properly formatted citations.
- Exemplary: All images, media or text created by others are cited with accurate, properly formatted citations.

Usability and layout
- Unsatisfactory: The portfolio is difficult to read due to inappropriate use of fonts, type size for headings, subheadings and text, and font styles (italic, bold, underline). Many formatting tools are under- or over-utilised and decrease the reader's accessibility to the content.
- Emerging: The portfolio is often difficult to read due to inappropriate use of fonts and type size for headings, subheadings, text or long paragraphs. Some formatting tools are under- or over-utilised and decrease the reader's accessibility to the content.
- Proficient: The portfolio is generally easy to read. Fonts and type size vary appropriately for headings, subheadings and text. Use of font styles (italic, bold, underline) is generally consistent.
- Exemplary: The portfolio is easy to read. Fonts and type size vary appropriately for headings, subheadings and text. Use of font styles is consistent and improves readability.

Writing convention
- Unsatisfactory: There are more than six errors in grammar, capitalisation, punctuation and spelling, requiring major editing and revision.
- Emerging: There are four or more errors in grammar, capitalisation, punctuation and spelling, requiring editing and revision.
- Proficient: There are a few errors in grammar, capitalisation, punctuation and spelling, requiring minor editing and revision.
- Exemplary: There are no errors in grammar, capitalisation, punctuation and spelling.

Source: Vandervelde (2018)

SELF-CHECK 7.6
1. Describe the four main steps in developing a portfolio. Which do you think is the most important step?
2. Examine to what extent portfolio assessments are useful as an assessment tool.
3. Justify how and when portfolios should be assessed.

ACTIVITY 7.6
Discuss in the myINSPIRE online forum:
(a) To what extent is portfolio assessment used in Malaysian classrooms?
(b) Do you think portfolio assessment can be used as an assessment technique in your subject area? Justify your answer.

Summary

- A project is an activity in which time constraints have been largely removed. It can be undertaken individually or by a group, and usually involves a significant element of work done at home or out of school.
- A research-based project is more theoretical in nature and may consist of putting a question, formulating a problem or setting up some hypotheses. A product-based project results in the production of a concrete object, a service, a dance performance, a film, an exhibition, a play, a computer program and so forth.
- Project work is a learning experience which enables the development of certain knowledge, skills and attitudes that prepare students for lifelong learning and the challenges ahead: knowledge application, collaboration, communication and independent learning.
- An effective project should contain the following elements: situation or problem, project description and purpose, performance specifications, rules, roles of members and assessment.
- The Six A's of a project comprise academic rigour, applied learning, authenticity, active exploration, adult relationships and assessment practices.
- Working in groups has become an accepted part of learning as a consequence of the widely recognised benefits of collaborative group work for student learning.
- Ways of allocating marks in project work include shared group marks, shared-out marks, individual marks, individual marks (examination) and a combination of group average and individual marks.
- A portfolio is a purposeful collection of the works produced by students which reflects their efforts, progress and achievements in different areas of the curriculum. Teachers need to know the benefits and weaknesses of portfolios and use them to help in students' learning.
- The portfolio provides for continuous and ongoing assessment (i.e. formative assessment) as well as assessment at the end of a semester or a year (i.e. summative assessment).
- Portfolio assignments can be assessed or graded with a rubric.
- As a formative assessment tool, student portfolios can be used by teachers as informal diagnostic techniques or feedback.

Key Terms

Artefacts; Formative assessment; Group work; Peer evaluation; Portfolio assessment; Portfolios; Process-oriented portfolio; Product-based project; Product-oriented portfolio; Project assessment; Research-based projects; Rubrics; Self-assessment; Six A's of effective projects; Summative assessment

References

Bonthron, S., & Gordon, R. (1999). Service learning and assessment: A field guide for teachers. Evaluation/Reflection, Paper 45. Retrieved from http://digitalcommons.unomaha.edu/

Bottoms, G., & Webb, L. D. (1998). Connecting the curriculum to "real life." Breaking ranks: Making it happen. Reston, VA: National Association of Secondary School Principals.

Bryson, E. (1994). Will a project approach to learning provide children opportunities to do purposeful reading and writing, as well as provide opportunities for authentic learning in other curriculum areas? Descriptive report. ERIC Document No. ED392513.

Chard, S. C. (1992). The project approach: A practical guide for teachers. Edmonton, Canada: University of Alberta Printing Services.

Edwards, K. M. (2000). Everyone's guide to successful project planning: Tools for youth. Portland, OR: Northwest Regional Educational Laboratory.

Epstein, A. (2006). Introduction to portfolios. Retrieved from http://www.teachervision.com
Harwell, S., & Blank, W. (1997). Connecting high school with the real world. ERIC Document No. ED407586.

Herman, J. L., Aschbacher, P. R., & Winters, L. (1992). A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Katz, L. G., & Chard, S. C. (1989). Engaging the minds of young children: The project approach. Norwood, NJ: Ablex.

Nitko, A. J. (2001). Educational assessment of students. New Jersey: Pearson.

Paulson, F. L., Paulson, P. R., & Meyer, C. (1991). What makes a portfolio a portfolio? Educational Leadership, 48(1), 60–63.

Steinberg, A. (1998). Real learning, real work: School-to-work as high school reform. New York, NY: Routledge.

Sutherland, M. (2003). Peer evaluation checklist for the Biotechnology Academy at Andrew P. Hill High School. San Jose, CA: East Side Union High School District.

Vandervelde, J. (2018). Eportfolio (digital portfolio) rubric. Retrieved from http://www2.uwstout.edu

Topic 8 Reliability and Validity of Assessment Techniques

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Explain the concept of a true score and the reliability coefficient;
2. Apply the different methods of estimating the reliability of a test;
3. Compare the different techniques of establishing the validity of a test;
4. Identify the factors affecting reliability and validity; and
5. Discuss the relationship between reliability and validity.

INTRODUCTION

We have discussed the various methods of assessing student performance using objective tests, essay tests, authentic assessments, project assessments and portfolio assessments. In this topic, we will address two important issues, namely the reliability and validity of these assessment methods. How do we ensure that the techniques we use for assessing the knowledge, skills and values of students are reliable and valid? We are making important decisions about the abilities and capabilities of the future generation, and obviously we want to ensure that we are making the right decisions.

8.1 WHAT IS RELIABILITY?

What is reliability? Reliability is the consistency of a measurement. Let us say you gave a Geometry test to a group of Form Five students, and one of your students, Swee Leong, obtained a score of 66 per cent. How sure are you that this is the score Swee Leong should actually receive? Is it his true score? When you develop a test and administer it to your students, you are attempting to measure, as far as possible, the true score of each student. The true score is a hypothetical concept referring to the actual ability, competency and capacity of an individual. A test attempts to measure the true score of a person. When measuring human abilities, it is practically impossible to develop an error-free test; there will always be some error. However, the presence of error does not mean that the test is not good; what matters is the size of the error. Formally, an observed test score, X, is conceived as the sum of a true score, T, and an error term, E. The true score is defined as the average of the test scores if the test were repeatedly administered to a student (and the student could be made to forget the content of the test between repeated administrations).
Given that the true score is defined as the average of the observed scores, in each administration of a test the observed score departs from the true score; the difference is called measurement error. This relationship can be written as follows:

Observed score (X) = True score (T) + Error (E)

This departure is not caused by blatant mistakes made by the test writers; it is caused by chance elements in students' performance on a test. Measurement error mostly comes from the fact that we have only sampled a small portion of a student's capabilities. Ambiguous questions and incorrect marking can contribute to measurement error, but they are only a small part of it. Imagine that there are 10,000 items and a student would obtain 60 per cent if all 10,000 items were administered (which is not practically feasible). Then, 60 per cent is the true score. Now, assume that you sample only 40 items to put in a test. The expected score for the student is 24 items. However, the student may get 20, 26, 30 and so on, depending on which items are in the test. This is the main source of measurement error. That is, measurement error is due to the sampling of items, rather than to poorly written items.

Generally, the smaller the error, the closer you are likely to be to measuring the true score of a student. If you are confident that your Geometry test (observed score) has a small error, then you can confidently infer that Swee Leong's score of 66 per cent is close to his true score, or his actual ability in solving geometry problems, i.e. what he actually knows. To reduce the error in a test, you must ensure that your test is both reliable and valid. The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true scores of your students.

We will first examine the reliability of a test. Would your students get the same scores if they took your test on two different occasions? Would they get approximately the same scores if they took two different forms of your test? These questions have to do with the consistency of your classroom tests in measuring students' abilities, skills and attitudes or values. The generic name for consistency is reliability. Reliability is an essential characteristic of a good test because if a test does not measure consistently (reliably), then you cannot count on the scores resulting from the administration of the test.

8.2 THE RELIABILITY COEFFICIENT

Reliability is quantified as a reliability coefficient. The symbol used to denote a reliability coefficient is r with two identical subscripts (for example, r_xx). The reliability coefficient is generally defined as the variance of the true scores divided by the variance of the observed scores:

r_xx = (variance of the true scores) / (variance of the observed scores) = σ²_T / σ²_X
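Because true scores can never be observed directly, one way to build intuition for this ratio is a small simulation. The following Python sketch uses made-up numbers (illustrative assumptions, not data from this module) to generate true scores, add random measurement error and compute the resulting reliability coefficient.

```python
import random

random.seed(1)

# Hypothetical true scores for 200 students (mean 60, SD 10).
true_scores = [random.gauss(60, 10) for _ in range(200)]

# Observed score = true score + random error (X = T + E), error SD = 5.
observed_scores = [t + random.gauss(0, 5) for t in true_scores]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Reliability coefficient: true-score variance over observed-score variance.
r_xx = variance(true_scores) / variance(observed_scores)
print(round(r_xx, 2))  # close to 10**2 / (10**2 + 5**2) = 0.80
```

With an error spread half the size of the spread of true scores, the reliability works out to about 0.80; shrinking the error pushes the ratio towards 1.00.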
If there is a relatively large amount of error, the ratio of the true score variance to the observed score variance approaches 0.00, which is total unreliability (refer to Figure 8.1): Figure 8.1: Reliability coefficient High reliability means that the questions of a test tended to „pull together‰. Students who answered a given question correctly were more likely to answer other questions correctly as well. If an equivalent or parallel test was developed by using similar items, the relative scores of students would show little change. Meanwhile, low reliability indicates that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect that something is wrong with either the items or the testing situation rather than the studentsÊ knowledge of the subject matter. The following guidelines may be used to interpret reliability coefficients for classroom tests as shown in Table 8.1. Table 8.1: Interpretation of Reliability Coefficients Reliability Interpretation 0.90 and above Excellent reliability (comparable to the best standardised tests). 0.80ă0.90 Very good for a classroom test. 0.70ă0.80 Good for a classroom test but there are probably a few items which could be improved. 0.60ă0.70 Somewhat low. There are probably some items which could be removed or improved. 0.50ă0.60 The test needs to be revised. 0.50 and below Questionable reliability and the test should be replaced or needs major revision. Copyright © Open University Malaysia (OUM) TOPIC 8 RELIABILITY AND VALIDITY OF ASSESSMENT TECHNIQUES 177 If you know the reliability coefficient of a test, can you estimate the true score of a student on a test? In testing, we use the standard error of measurement to estimate the true score. The standard error of measurement = Standard deviation 1 r Note: „r‰ is the reliability of the test. Using the normal curve, you can estimate a studentÊs true score with some degree of certainty based on the observed score and standard error of measurement. Example 8.1: You gave a History test to group of 40 students. Khairul obtained a score of 75 in the test, which is his observed score. The standard deviation of your test is 2.0. Earlier, you had established that your History test had a reliability coefficient of 0.7. You are interested in finding out KhairulÊs true score. The standard error of measurement = Standard deviation 1 r = 2.0 1 0.7 =2.0 0.55 = 1.1 Therefore, based on the normal distribution curve (refer to Figure 8.2), KhairulÊs true score should be: (a) Between 75 ă 1.1 and 75 + 1.1 or between 73.9 and 76.1 for 68 per cent of the time. (b) Between 75 ă 2.2 and 75 + 2.2 or between 72.8 and 77.1 for 95 per cent of the time. (c) Between 75 ă 3.3 and 75 + 3.3 or between 71.7 and 78.3 for 99 per cent of the time. Copyright © Open University Malaysia (OUM) 178 TOPIC 8 RELIABILITY AND VALIDITY OF ASSESSMENT TECHNIQUES Figure 8.2: Determining KhairulÊs true score based on a normal distribution SELF-CHECK 8.1 1. Define the reliability of a test. 2. What does the reliability coefficient indicate? 3. Explain the concept of a true score. ACTIVITY 8.1 Shalin obtains a score of 70 in a Biology test. Given the reliability of the test is 0.65 and the standard deviation of the test is 1.5. The teacher was planning to select students who had scored 70 and above to take part in a Biology competition. The teacher was not sure whether he should select Shalin since there could be an error in her score. Should he select Shalin? Why? 
SELF-CHECK 8.1
1. Define the reliability of a test.
2. What does the reliability coefficient indicate?
3. Explain the concept of a true score.

ACTIVITY 8.1
Shalin obtains a score of 70 in a Biology test. The reliability of the test is 0.65 and the standard deviation of the test is 1.5. The teacher was planning to select students who had scored 70 and above to take part in a Biology competition. However, he was not sure whether he should select Shalin, since there could be an error in her score. Should he select Shalin? Why? Post your answer on the myINSPIRE online forum. (Use the standard error of measurement.)

8.3 METHODS TO ESTIMATE THE RELIABILITY OF A TEST

Let us now discuss how we estimate the reliability of a test. It is not possible to calculate reliability exactly, so we have to estimate it. Figure 8.3 lists three common methods of estimating the reliability of a test.

Figure 8.3: Methods for estimating reliability

These three methods are explained as follows:

(a) Test-retest
Using the test-retest technique, the same test is administered twice to the same group of students. The scores obtained in the first administration of the test are correlated with the scores obtained in the second administration. If the correlation between the two sets of scores is high, then the test can be considered to have high reliability. However, a test-retest situation is somewhat difficult to arrange, as it is unlikely that students will be prepared to take the same test twice. There is also the effect of practice and memory that may influence the correlation: the shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

(b) Parallel or Equivalent Forms
For this technique, two equivalent tests (or forms) are administered to the same group of students. The two tests are not identical but are equivalent. In other words, they may have different questions, but they measure the same knowledge, skills or attitudes. This gives you two sets of scores which can be correlated, and reliability can thus be established. Unlike the test-retest technique, the parallel or equivalent forms reliability measure is not affected by the influence of memory. One major problem with this approach is that you have to be able to generate a lot of items that reflect the same construct. This is often not an easy feat.
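Both the test-retest and the parallel-forms estimates reduce to a Pearson correlation between two sets of scores. The sketch below illustrates this with hypothetical marks for six students (the numbers are invented for illustration); in practice you would substitute your own two score columns.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores of six students on two administrations (or two forms).
first_administration = [55, 62, 70, 48, 81, 66]
second_administration = [58, 60, 73, 50, 79, 68]

# This correlation is the test-retest (or parallel-forms) reliability estimate.
print(round(pearson(first_administration, second_administration), 2))
```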
(c) Internal Consistency
Internal consistency is determined using only one test administered once to the students. Internal consistency refers to how the individual items or questions behave in relation to each other and to the overall test. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are across different items for the same construct within the measure. The following are two common internal consistency measures.

(i) Split-half
To solve the problem of having to administer the same test twice, the split-half technique is used. In this technique, a test is administered once to a group of students. The test is divided into two equal halves after the students have completed it. This technique is most appropriate for tests which include multiple-choice items, true-false items and perhaps short-answer essays. The items are usually split by the odd-even method, whereby one half of the test consists of the odd-numbered items while the other half consists of the even-numbered items. Then, the scores obtained on the two halves are correlated, and the reliability of the whole test is determined using the Spearman-Brown formula:

r_sb = 2 r_xy / (1 + r_xy)

In this formula, r_sb is the split-half reliability coefficient and r_xy represents the correlation between the two halves. Say, for example, you have established that the correlation coefficient between the two halves is 0.65. What is the reliability of the whole test?

r_sb = (2 × 0.65) / (1 + 0.65) = 1.3 / 1.65 = 0.79

(ii) Cronbach's Alpha
Cronbach's coefficient alpha can be used for both binary-type items (1 = correct, 0 = incorrect; or 1 = true, 0 = false) and scale items (1 = strongly agree, 2 = agree, 3 = disagree, 4 = strongly disagree). Reliability is estimated by computing the correlation between the individual questions and the extent to which individual questions correlate with the total test. This is what is meant by internal consistency. The key word is "internal", unlike test-retest and parallel or equivalent forms, which require another test as an external reference. The stronger the items are inter-related, the more likely the test is consistent. The higher the alpha, the more reliable the test. There is no generally agreed cut-off point; usually, 0.7 and above is acceptable (Nunnally, 1978). The formula for Cronbach's alpha for binary items is as follows:

α = [k / (k − 1)] × [1 − (Σ p_i (1 − p_i)) / σ²_x]

where k is the number of items in the test; p_i refers to the item difficulty, which is the proportion of students who answered item i correctly; and σ²_x is the sample variance for the total score.

Example 8.2: Suppose that in a multiple-choice test consisting of five items or questions, the following difficulty indices were observed: p1 = 0.4, p2 = 0.5, p3 = 0.6, p4 = 0.75 and p5 = 0.85, with sample variance σ²_x = 1.84. Cronbach's alpha would be calculated as follows:

α = [5 / (5 − 1)] × [1 − 1.045 / 1.840] = 1.25 × 0.432 ≈ 0.54

Professionally developed standardised tests should have an internal consistency coefficient of at least 0.85. High reliability coefficients are required for standardised tests because they are administered only once, and the score on that one test is used to draw conclusions about each student's ability level on the construct measured. Perhaps the closest to standardised tests in the Malaysian context would be the tests for different subjects conducted at the national level in the PMR and SPM. According to Wells and Wollack (2003), it is acceptable for classroom tests to have reliability coefficients of 0.70 and higher because a student's score on any one test does not determine the student's entire grade in the subject or course. Usually, grades are based on several other measures such as project work, oral presentations, practical tests, class participation and so forth. To what extent is this true in the context of the Malaysian classroom?
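Before moving on, here is a small Python sketch that reproduces the two worked examples above. The helper functions are illustrative, not part of any standard library.

```python
def spearman_brown(r_half):
    """Step the half-test correlation up to a full-test reliability estimate."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.65), 2))  # 0.79, as in the split-half example

def cronbach_alpha_binary(difficulties, total_variance):
    """Cronbach's alpha for binary items, from the item difficulties p_i
    and the variance of the total scores."""
    k = len(difficulties)
    item_variance_sum = sum(p * (1 - p) for p in difficulties)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Example 8.2: five items with the given difficulty indices, variance 1.84.
alpha = cronbach_alpha_binary([0.4, 0.5, 0.6, 0.75, 0.85], 1.84)
print(round(alpha, 2))  # about 0.54
```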
A Word of Caution!
When you get a low alpha, be careful not to immediately conclude that the test is a bad test. Instead, check whether the test measures several attributes or dimensions rather than one. If it does, Cronbach's alpha is likely to be deflated. For example, an aptitude test may measure three attributes or dimensions, such as quantitative ability, language ability and analytical ability. Hence, it is not surprising that the Cronbach's alpha for the whole test may be low, as the questions may not correlate with each other. Why? This is because the items are measuring three different types of human abilities. The solution is to compute three different Cronbach's alphas: one for quantitative ability, one for language ability and one for analytical ability.

SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the test-retest technique in determining the reliability of a test?
2. Explain the parallel or equivalent forms technique in determining the reliability of a test.
3. Explain the concept of internal consistency reliability.

8.4 INTER-RATER AND INTRA-RATER RELIABILITY

Whenever you use humans as part of your measurement procedure, you have to be concerned about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distracted. We get tired of doing repetitive tasks. We daydream. We misinterpret. Therefore, how do we determine whether:
(a) Two observers are being consistent in their observations?
(b) Two examiners are being consistent in their marking of an essay?
(c) Two examiners are being consistent in their marking of a project?

Let us analyse these problems from two perspectives:

(a) Inter-rater Reliability
When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability (refer to Figure 8.4). The greater the agreement, the higher the inter-rater reliability.

Figure 8.4: Examiner A versus Examiner B

Inter-rater reliability can be low for the following reasons:
(i) Examiners are subconsciously influenced by their knowledge of the students whose scripts are being marked;
(ii) Consistency in marking is affected after marking a set of either very good or very weak scripts;
(iii) When there is an interruption during the marking of a batch of scripts, different standards may be applied after the break; and
(iv) The marking scheme is poorly developed, resulting in examiners making their own interpretations of the answers.

Inter-rater reliability can be enhanced if the criteria for marking or the marking scheme:
(i) Contain suggested answers related to the question;
(ii) Make provision for acceptable alternative answers;
(iii) Allocate appropriate time for the work required;
(iv) Are sufficiently broken down to allow the marking to be as objective as possible and the totalling of marks to be correct; and
(v) Allocate marks according to the degree of difficulty of the question.

(b) Intra-rater Reliability
While inter-rater reliability involves two or more individuals, intra-rater reliability refers to the consistency of grading by a single rater. Scores on a test are rated by a single rater at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. For example, some papers that are graded during the day may get our full attention, while others that are graded towards the end of the day may be quickly glossed over. Similarly, changes in our mood may affect the grading of papers. In these situations, the lack of consistency affects intra-rater reliability in the grading of student answers.
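A simple first check on inter-rater reliability is to correlate the marks two examiners give to the same set of scripts, and to look at how often their marks land close together. The sketch below uses invented marks for ten essay scripts; both the data and the one-mark agreement threshold are assumptions for illustration only.

```python
# Hypothetical marks awarded by two examiners to the same ten essay scripts.
examiner_a = [12, 15, 9, 18, 14, 11, 16, 10, 13, 17]
examiner_b = [11, 16, 10, 17, 13, 12, 15, 9, 14, 18]

def pearson(x, y):
    """Pearson correlation between two equal-length mark lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# A high correlation suggests the two examiners rank the scripts consistently.
print(round(pearson(examiner_a, examiner_b), 2))

# Proportion of scripts on which the two marks differ by at most one mark.
close = sum(abs(a - b) <= 1 for a, b in zip(examiner_a, examiner_b))
print(close / len(examiner_a))
```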
SELF-CHECK 8.3
List the steps that may be taken to enhance inter-rater reliability in the grading of essay answer scripts.

ACTIVITY 8.2
In the myINSPIRE online forum, suggest other steps you would take to enhance intra-rater reliability in the grading of projects.

8.5 TYPES OF VALIDITY

Validity is often defined as the extent to which a test measures what it was designed to measure (Nuttall, 1987). While reliability relates to the consistency of the test, validity relates to the relevance of the test. If a test does not measure what it sets out to measure, then its use is misleading, and any interpretation based on the test is not valid or relevant. For example, if a test that is supposed to measure the "spelling ability of eight-year-old children" does not measure "spelling ability", then the test is not a valid test. It would be disastrous if you made claims about what a student can or cannot do based on a test that is actually measuring something else. It is for this reason that many educators argue that validity is the most important aspect of a test. However, validity will vary from test to test depending on what the test is used for. For example, a test may have high validity in testing the recall of facts in economics, but that same test may be low in validity with regard to testing the application of concepts in economics.

Messick (1989) was most concerned about the inferences a teacher draws from the test score, the interpretation the teacher makes about his or her students, and the consequences of such inferences and interpretations. You can imagine the power an educator holds in his or her hands when designing a test. Your test could determine the future of many thousands of students. Inferences based on a test of low validity could give a completely different picture of the actual abilities and competencies of students.

Three types of validity have been identified: construct validity, content validity and criterion-related validity, the last of which is made up of predictive and concurrent validity (refer to Figure 8.5).

Figure 8.5: Types of validity

These various types of validity are explained as follows:

(a) Construct Validity
Construct validity relates to whether the test is an adequate measure of the underlying construct. A construct could be any phenomenon, such as mathematics achievement, map skills, reading comprehension, attitude towards school, inductive reasoning, environmental awareness, spelling ability and so forth. You might think of construct validity as the correct "labelling" of something. For example, when you measure what you term "critical thinking", is that what you are really measuring? Thus, to ensure high construct validity, you must be clear about the definition of the construct you intend to measure. For example, a construct such as reading comprehension would include vocabulary development, reading for literal meaning and reading for inferential meaning. Some experts in educational measurement have argued that construct validity is the most critical type of validity. You could establish the construct validity of an instrument by correlating it with another test that measures the same construct.
For example, you could compare the scores obtained on your reading comprehension test with the scores obtained on another well-known reading comprehension test administered to the same sample of students. If the scores for the two tests are highly correlated, then you may conclude that your reading comprehension test has high construct validity.

A construct is determined by referring to theory. For example, if you are interested in measuring the construct "self-esteem", you need to be clear about what self-esteem is. Perhaps you need to refer to the literature in the field describing the attributes of self-esteem. You may find that, theoretically, self-esteem is made up of the following attributes: physical self-esteem, academic self-esteem and social self-esteem. Based on this theoretical perspective, you can build items or questions to measure self-esteem covering these three types. Through such a process, you are more likely to ensure high construct validity.

(b) Content Validity
Content validity is more straightforward and is likely to be related to construct validity. It concerns the coverage of appropriate and necessary content, i.e. does the test cover the skills necessary for good performance, or all the aspects of the subject taught? It is concerned with sample-population representativeness, i.e. the facts, concepts and principles covered by the test items should be representative of the larger domain (e.g. the syllabus) of facts, concepts and principles. For example, the science unit on "energy and forces" may include facts, concepts, principles and skills on light, sound, heat, magnetism and electricity. However, it is difficult, if not impossible, to administer a two- to three-hour paper that tests all aspects of the syllabus on "energy and forces" (refer to Figure 8.6).

Figure 8.6: Sample of content tested for the unit on "energy and forces"

Therefore, only selected facts, concepts, principles and skills from the syllabus (or domain) are sampled. The content selected will be determined by content experts, who will judge the relevance of the content in the test to the content in the syllabus or a particular domain. Content validity will be low if the test includes questions testing content not included in the domain or syllabus. To ensure content validity and coverage, most teachers use a table of specifications (as discussed in Topic 3). Table 8.2 is an example of a table of specifications which specifies the knowledge and skills to be measured and the topics covered for the unit on "energy and forces".

Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Number of items per topic (understanding of concepts / application of concepts / total):
- Light: 7 / 4 / 11 (22%)
- Sound: 7 / 4 / 11 (22%)
- Heat: 7 / 4 / 11 (22%)
- Magnetism: 3 / 3 / 6 (12%)
- Electricity: 8 / 3 / 11 (22%)
- TOTAL: 32 (64%) / 18 (36%) / 50 (100%)

Since you cannot measure all the content of a topic, you will have to focus on the key areas and give due weighting to those areas that are important. For example, the teacher has decided that 64 per cent of the questions will emphasise the understanding of concepts, while the remaining 36 per cent will focus on the application of concepts across the five topics. A table of specifications provides teachers with evidence that a test has high content validity, that is, that it covers what should be covered.
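A table of specifications is also easy to represent and sanity-check in code. The following sketch encodes Table 8.2 as a small data structure and recomputes the totals and weightings; the structure and names are illustrative choices, not a prescribed format.

```python
# Table 8.2 as a blueprint: items per topic at two cognitive levels.
blueprint = {
    "Light":       {"understanding": 7, "application": 4},
    "Sound":       {"understanding": 7, "application": 4},
    "Heat":        {"understanding": 7, "application": 4},
    "Magnetism":   {"understanding": 3, "application": 3},
    "Electricity": {"understanding": 8, "application": 3},
}

total_items = sum(sum(cells.values()) for cells in blueprint.values())
print(total_items)  # 50 items in all

# Weighting of each topic relative to the whole test.
for topic, cells in blueprint.items():
    n = sum(cells.values())
    print(f"{topic}: {n} items ({100 * n / total_items:.0f}%)")

# Weighting of each cognitive level (should come out as 64% and 36%).
for level in ("understanding", "application"):
    n = sum(cells[level] for cells in blueprint.values())
    print(f"{level}: {n} items ({100 * n / total_items:.0f}%)")
```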
Content validity is different from face validity, which refers not to what the test actually measures, but to what it superficially appears to measure. Face validity assesses whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use and other technically untrained observers. Face validity is a weak measure of validity, but that does not mean it is incorrect, only that caution is necessary. Its importance, however, should not be underestimated.

(c) Criterion-related Validity
Criterion-related validity of a test is established by relating the scores obtained to some other criterion or to the scores of other tests. There are two types of criterion-related validity:

(i) Predictive validity relates to whether the test accurately predicts some future performance or ability. Is the STPM a good predictor of performance in a university? One difficulty in calculating the predictive validity of the STPM is that only those who pass the exam will go on to university (generally speaking), and we do not know how well students who did not pass might have done. Also, only a small proportion of the population takes the STPM, and the correlation between STPM grades and performance at the degree level would be quite high.

(ii) Concurrent validity is concerned with whether the test correlates with, or gives substantially the same results as, another test of the same skill. For example, does your end-of-year language test correlate with the Malaysian University English Test (MUET)? In other words, if your language test correlates highly with MUET, then your language test has high concurrent validity.

8.6 FACTORS AFFECTING RELIABILITY AND VALIDITY

To prepare tests which are acceptably valid and reliable, the following factors should be taken into account:

(a) Construction of Test Items
The quality of test items has a significant effect on the validity and reliability of a test. If the test items are poorly constructed, ambiguous and open to different interpretations, the reliability of the test will be affected because the test results will not reflect the true abilities of the students being assessed. If the items do not assess the right content and do not match the intended learning outcomes, then the test is not measuring what it is supposed to measure, thus affecting the test's validity.

(b) Length of the Test
Generally, the longer the test, the more reliable and valid it is. A short test would not adequately cover a year's work. The syllabus needs to be sampled: the test should consist of enough questions that are representative of the knowledge, skills and competencies in the syllabus. However, there is also a problem with tests that are too long. A lengthy test may be valid, but it will take too much time; fatigue may set in, which may affect performance and hence the reliability of the test.

(c) Selection of Topics
The topics selected and the test questions prepared should reflect the way the topics were treated during teaching and learning. It is necessary to be clear about the learning outcomes and to design items that measure these learning outcomes. Suppose that, in your teaching, students were not given an opportunity to think critically and solve problems, yet your test consists of items requiring students to think critically and solve problems. In such a situation, the reliability and validity of the test will be affected.
The test is not reliable because it will not produce consistent results. It is also not valid because it does not measure the right intended learning outcomes. There is no constructive alignment between instruction and assessment.

(d) Choice of Testing Techniques
The testing techniques selected will also affect reliability and validity. For example, if you choose to use essay questions, validity may be high but reliability may be low. Essay questions tend to have high validity because they are capable of assessing both simple and complex learning outcomes, but they tend to be less reliable because of the subjective manner in which students' responses are scored. On the other hand, if objective test items such as MCQs, true-false questions, matching questions and short-answer questions are selected, the reliability of the test can be high because the scoring of students' responses is not influenced by the subjective judgement of the assessors. The validity, however, can be low because not all intended learning outcomes can be appropriately assessed by objective test items alone. For instance, multiple-choice questions are not suitable for assessing learning outcomes that require students to organise ideas.

(e) Method of Test Administration
Test administration is also an important step in the measurement process. This includes the arrangement of items in a test, the monitoring of test taking and the preparation of data files from the test booklets. Poor test administration procedures can lead to problems in the data collected and affect the validity of the test results. For instance, if the results of the students taking a test are not accurately recorded, the test scores become invalid. Adequate time must also be allowed for the majority of students to finish the test. This reduces wild guessing and instead encourages students to think carefully about their answers. Instructions need to be clear to reduce the effects of confusion on reliability and validity. The physical conditions under which the test is taken must be favourable for the students. There must be adequate space and lighting, and the temperature must be conducive. Students must be able to work independently, and the possibility of distractions in the form of movement and noise must be minimised. If such measures are not taken, students' performance may be affected because they are handicapped in demonstrating their true abilities.

(f) Method of Marking
The marking should be as objective as possible. Marking which depends on the exercise of human judgement, such as in essays, projects and portfolios, is subject to the variations of human fallibility (refer to inter-rater reliability discussed earlier). Besides, poorly designed or inappropriate marking schemes can affect validity. For example, if an essay test is intended to assess students' ability to discuss an issue, but a checklist is used to assess content (knowledge), the validity of the test is questionable. If unqualified or incompetent examiners are engaged to mark responses to essay questions, they will not be consistent in their scoring, thus affecting test reliability. It is quite easy to mark objective items quickly, but it is also surprisingly easy to make careless errors. This is especially true where large numbers of scripts are being marked.
A system of checks is strongly advised. One method is through the comments of the students themselves when their marked papers are returned to them.

8.7 RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY

Some people may think of reliability and validity as two separate concepts. In reality, reliability and validity are related. Figure 8.7 shows the analogy.

Figure 8.7: Graphical representations of the relationship between reliability and validity

The centre, or the bull's-eye, is the concept that we are trying to measure. Say, for example, in trying to measure the concept of "inductive reasoning", you are likely to hit the centre (the bull's-eye) if your inductive reasoning test is both reliable and valid, which is what all test developers aim to achieve (refer to Figure 8.7(d)). On the other hand, your inductive reasoning test can be "reliable but not valid". How is that possible? Your test may not measure inductive reasoning, yet the scores you obtain each time you administer the test are approximately the same (refer to Figure 8.7(b)). In other words, the test is consistently and systematically measuring the wrong construct (i.e. something other than inductive reasoning). Imagine the consequences of making judgements about the inductive reasoning of students using such a test!

However, in the context of psychological testing, if an instrument does not have satisfactory reliability, one typically cannot claim validity. That is, validity requires that instruments are sufficiently reliable. So, Figure 8.7(c) does not show high validity even though the target is hit twice. The validity is low, and the test does not have reliability because the hits are not concentrated. In other words, you are not getting a valid estimate of the inductive reasoning ability of your students, and the scores are inconsistent. The worst-case scenario is when the test is neither reliable nor valid (refer to Figure 8.7(a)). In this scenario, the scores obtained by students tend to concentrate at the top left of the target, consistently missing the centre. Your measure in this case is neither reliable nor valid, and the test should be rejected or improved.

Summary

- The true score is a hypothetical concept referring to the actual ability, competency and capacity of an individual.
- The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true scores of your students.
- Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly.
- Validity requires that instruments are sufficiently reliable.
- Face validity is a weak measure of validity.
- Using the test-retest technique, the same test is administered again to the same group of students.
- For the parallel or equivalent forms technique, two equivalent tests (or forms) are administered to the same group of students.
- Internal consistency is determined using only one test administered once to the students.
- When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability.
- While inter-rater reliability involves two or more individuals, intra-rater reliability is the consistency of grading by a single rater.
- Validity is the extent to which a test measures what it claims to measure.
• It is vital for a test to be valid in order for the results to be accurately applied and interpreted.
• Construct validity relates to whether the test is an adequate measure of the underlying construct.
• Content validity is more straightforward and likely to be related to construct validity; it is related to the coverage of appropriate and necessary content.
• Some people may think of reliability and validity as two separate concepts. In reality, reliability and validity are related.

Construct validity
Content and face validity
Criterion-related validity
Internal consistency
Parallel-form
Predictive validity
Reliability
Reliability and validity relationship
Reliable and not valid
Test-retest
True score
Valid and reliable
Validity

Topic 9  Item Analysis

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Describe what item analysis is and the steps in item analysis;
2. Calculate the difficulty index and discrimination index;
3. Apply item analysis on essay-type questions;
4. Discuss the relationship between the difficulty index and discrimination index of an item;
5. Do distractor analysis; and
6. Explain the role of an item bank in the development of tests.

INTRODUCTION
When you develop a test, it is important to identify the strengths and weaknesses of each item. To determine how well the items in a test perform, some statistical procedures need to be used. In this topic, we will discuss item analysis, which involves the use of three procedures: item difficulty, item discrimination and distractor analysis. These help the test developer decide whether the items in a test should be accepted, modified or rejected. The procedures are quite straightforward and easy to use, but the educator needs to understand the logic underlying the analyses in order to use them properly and effectively.

9.1 WHAT IS ITEM ANALYSIS?

After having administered a test and marked it, most teachers would discuss the answers with their students. Discussion would usually focus on the right answers and the common errors made by students. Some teachers may focus on the questions most students performed poorly on and the questions they did very well on. However, there is much more information available about a test that is often ignored by teachers. This information will only be available if item analysis is done. What is item analysis?

Item analysis is a process which examines the responses to individual test items or questions in order to assess the quality of those items and the test as a whole.

Item analysis is especially valuable in improving items or questions that will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration.
Specifically, in classical test theory (CTT), the statistics produced from analysing test scores include the difficulty index and the discrimination index. Analysing the effectiveness of distractors is also part of the process (which we will discuss in detail later in the topic).

The quality of a test is determined by the quality of each item or question in the test. The teacher who constructs a test can only roughly estimate its quality. This estimate is based on the fact that the teacher has followed all the rules and conditions of test construction. However, it is possible that this estimation is not accurate and that certain important aspects have been ignored. Hence, to obtain a more comprehensive understanding of the test, item analysis should be conducted on the responses of students.

Item analysis is done to obtain information about individual items or questions in a test and how the test can be further improved. It also facilitates the development of an item or question bank which can be used in the construction of a test.

9.2 STEPS IN ITEM ANALYSIS

Both CTT and "modern" test theory such as item response theory (IRT) provide useful statistics to help us analyse test data. For many item analyses, CTT is sufficient to provide the information we need. CTT will be used in this module. Let us take the example of a teacher who has administered a 30-item multiple-choice objective test in geography to 45 students in a secondary school classroom.

Step 1: Upon receiving the answer sheets, the first step is to mark each of them.

Step 2: Arrange the 45 answer sheets from the highest score obtained to the lowest score obtained. The paper with the highest score is on top and the paper with the lowest score is at the bottom.

Step 3: Multiply 45 (the number of answer sheets) by 0.27 (or 27 per cent), which gives 12.15, and round it to 12. The use of the value 0.27 or 27 per cent is not a rigid rule; it is possible to use any percentage from 27 to 35 per cent. However, the 27 per cent rule can be ignored if the class size is too small. Instead of taking a 27 per cent sample, divide the number of answer sheets by two.

Step 4: From the sorted pile of 45 answer sheets, take out 12 answer sheets from the top of the pile and 12 answer sheets from the bottom. Call these two piles the "high marks" students and the "low marks" students respectively. Set aside the middle group of papers (21 papers). Although these could be included in the analysis, using only the high and low groups simplifies the procedure.

Step 5: Refer to Question 1 (refer to Figure 9.1), then:
(a) Count the number of students from the "high marks" group who selected each of the options (A, B, C or D); and
(b) Count the number of students from the "low marks" group who selected option A, B, C or D.

Figure 9.1: Item analysis for one item or question

From the analysis, 11 students from the "high marks" group and two students from the "low marks" group selected "B", which is the correct answer. This means that 13 out of the 24 students selected the correct answer. Also, note that all the distractors (A, C and D) were selected by at least one student.
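These five steps can also be carried out in a few lines of code. The following is only a minimal sketch in Python: the records are invented (though consistent with the tallies in Figure 9.1, with 11 high-group and two low-group students choosing "B"), and the variable names are ours, not the module's.

```python
from collections import Counter

# Each record is (total test score, option chosen for Question 1).
# Only the 12 highest and 12 lowest of the 45 sheets are listed here;
# the middle 21 sheets are set aside, as in Step 4.
records = [
    (29, "B"), (28, "B"), (28, "B"), (27, "B"), (26, "B"), (26, "B"),
    (25, "B"), (24, "B"), (24, "B"), (23, "B"), (23, "B"), (22, "C"),
    (12, "A"), (11, "D"), (11, "A"), (10, "C"), (10, "B"), (9, "A"),
    (9, "D"), (8, "C"), (8, "A"), (7, "D"), (6, "B"), (5, "C"),
]

# Step 2: sort from the highest total score to the lowest.
records.sort(key=lambda r: r[0], reverse=True)

# Step 3: 27 per cent of 45 sheets is 12.15, rounded to 12 per group.
n = round(0.27 * 45)

# Step 4: split off the "high marks" and "low marks" piles.
high, low = records[:n], records[-n:]

# Step 5: tally the options chosen in each group.
print("High:", Counter(option for _, option in high))
print("Low: ", Counter(option for _, option in low))
```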
However, the information provided in Figure 9.1 is insufficient and further analysis has to be conducted.

SELF-CHECK 9.1
1. Define item analysis.
2. Describe the five steps of item analysis.

9.3 DIFFICULTY INDEX

Using the information provided in Figure 9.1, you can compute the difficulty index, a quantitative indicator of the difficulty level of an individual item or question. It can be calculated using the following formula:

Difficulty index (p) = Number of students with the correct answer (R) / Total number of students who attempted the question (T)

p = R/T = 13/24 = 0.54

What does a difficulty index (p) of 0.54 mean? The difficulty index is a coefficient that shows the percentage of students who got the correct answer out of the total number of students who attempted the question. In other words, 54 per cent of students selected the right answer. Although our computation is based on the high and low scoring groups only, it provides a close approximation of the estimate that would be obtained with the total group. Thus, it is proper to say that the index of difficulty for this item is 54 per cent (for this particular group). Note that since "difficulty" refers to the percentage getting the item right, the smaller the percentage figure, the more difficult the item. The meaning of the difficulty index is shown in Figure 9.2.

Figure 9.2: Interpretation of the difficulty index (p)

If a teacher believes that the achievement of 0.54 on the item is too low, he or she can change the way the material is taught to better meet the objective represented by the item. Another interpretation might be that the item was too difficult, confusing or invalid, in which case the teacher can replace or modify the item, perhaps using information from the item's discrimination index or distractor analysis.

Under CTT, the item difficulty measure is simply the proportion correct for an item. For an item with a maximum score of two, there is a slight modification to the computation of the proportion correct. Such an item has possible partial credit scores of 0, 1 and 2. If the total number of students attempting the item is 100, and 23 students scored 0, 60 students scored 1 and 17 students scored 2, then a simple calculation shows that 23 per cent of the students scored 0, 60 per cent scored 1 and 17 per cent scored 2 for this particular item. The average score for this item is 0(0.23) + 1(0.60) + 2(0.17) = 0.94. Thus, the observed average score of this item is 0.94 out of a maximum of 2, so the average proportion correct is 0.94/2 = 0.47 or 47 per cent.

ACTIVITY 9.1
A teacher gave a 20-item Science test to a group of 35 students. The correct answer for Question #20 is "C" and the results are as follows:

Options                      A    B    C    D    Blank
High marks group (n = 12)    0    2    8    2    0
Low marks group (n = 12)     2    4    3    2    1

(a) Calculate the difficulty index (p) for Question #20.
(b) Is Question #20 an easy or difficult question?
(c) Do you think you need to improve Question #20? Why?
Post your answers on the myINSPIRE online forum.

9.4 DISCRIMINATION INDEX

The discrimination index is a basic measure which shows the extent to which a question discriminates or differentiates between students in the "high marks" group and the "low marks" group.
This index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item. What is most crucial for a test item is that whether a student answers the question correctly or not is due to his or her level of knowledge or ability, and not due to something else such as chance or test bias.

In our example in subtopic 9.2, 11 students in the high group and two students in the low group selected the correct answer. This indicates positive discrimination, since the item differentiates between students in the same way that the total test score does. That is, students with high scores on the test (high group) got the item right more frequently than students with low scores on the test (low group). Although analysis by inspection may be all that is necessary for most purposes, an index of discrimination can easily be computed using the following formula:

Discrimination index (D) = (R_H − R_L) / (T/2)

where:
R_H = number of students in the "high marks" group with the correct answer
R_L = number of students in the "low marks" group with the correct answer
T = total number of students in the two groups

Example 9.1: A test was given to a group of 43 students, and 10 out of the 13 "high marks" students got the correct answer, compared with five out of the 13 "low marks" students. The discrimination index is computed as follows:

D = (R_H − R_L) / (T/2) = (10 − 5) / (26/2) = 5/13 = 0.38

What does a discrimination index of 0.38 mean? The discrimination index is a coefficient that shows the extent to which the question discriminates or differentiates between "high marks" students and "low marks" students. Blood and Budd (1972) provide guidelines on the meaning of the discrimination index (refer to Figure 9.3).

Figure 9.3: Interpretation of the discrimination index
Source: Blood and Budd (1972)

A question that has a high discrimination index is able to differentiate between students who know and those who do not know the answer. When we say that a question has a low discrimination index, it is not able to differentiate between students who know and students who do not know. A low discrimination index may mean that many "low marks" students got the correct answer because the question was too simple. It could also indicate that students from both the "high marks" group and the "low marks" group got the answer wrong because the question was too difficult.

The formula for the discrimination index is such that if more students in the "high marks" group chose the correct answer than in the low scoring group, the number will be positive. At a minimum, one would hope for a positive value, as that would indicate that it is knowledge of the question that resulted in the correct answer. The greater the positive value (the closer it is to 1.0), the stronger the relationship between overall test performance and performance on that item. If the discrimination index is negative, that means that for some reason students who scored low on the test were more likely to get the answer correct. This is a strange situation which suggests poor validity for an item.

9.5 APPLICATION OF ITEM ANALYSIS ON ESSAY-TYPE QUESTIONS

The previous subtopics explained the use of item analysis on multiple-choice questions.
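Before moving on, the two formulas above can be wrapped in small helper functions. This is a minimal sketch that assumes the figures from Figure 9.1 and Example 9.1; the function names are ours, not standard terminology.

```python
def difficulty_index(num_correct: int, num_attempted: int) -> float:
    """p = R / T: the proportion of students answering the item correctly."""
    return num_correct / num_attempted

def discrimination_index(high_correct: int, low_correct: int,
                         total_in_groups: int) -> float:
    """D = (RH - RL) / (T/2), where T counts both groups together."""
    return (high_correct - low_correct) / (total_in_groups / 2)

# Figure 9.1: 13 of the 24 students in the two groups answered correctly
# (11 from the high group, 2 from the low group).
print(round(difficulty_index(13, 24), 2))           # 0.54
print(round(discrimination_index(11, 2, 24), 2))    # 0.75

# Example 9.1: 10 of 13 high-group vs 5 of 13 low-group students correct.
print(round(discrimination_index(10, 5, 26), 2))    # 0.38
```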
Item analysis can also be applied to essay-type questions. This subtopic illustrates how this can be done. For ease of understanding, the illustration uses a short-answer essay question as an example.

Let us assume that a group of 20 students have responded to a short-answer essay question with scores ranging from a minimum of 0 to a maximum of 4. Table 9.1 provides the scores obtained by the students.

Table 9.1: Scores Obtained by Students for a Short-answer Essay Question

Item Score    No. of Students Earning Each Score    Total Scores Earned
4             5                                     20
3             6                                     18
2             5                                     10
1             3                                     3
0             1                                     0
Total = 51; Average Score = 51/20 = 2.55

The difficulty index (p) of the item can be computed using the following formula:

p = Average score / Possible range of scores

Using the information from Table 9.1, the difficulty index of the short-answer essay question can easily be computed. The average score obtained by the group of students is 2.55, while the possible range of scores for the item is (4 − 0) = 4. Thus,

p = 2.55/4 = 0.64

The difficulty index (p) of 0.64 means that, on average, students received 64 per cent of the maximum possible score of the item. The difficulty index can be interpreted in the same way as that of the multiple-choice question discussed in subtopic 9.3. The item is of a moderate level of difficulty (refer to Figure 9.2).

Note that in computing the difficulty index in the previous example, the scores of the whole group are used to obtain the average score. However, for a large group of students, it is possible to estimate the difficulty index for an item based on only a sample of students comprising the high marks and low marks groups, as in the case of computing the difficulty index of a multiple-choice question.

To compute the discrimination index (D) of an essay-type question, the following formula is suggested by Nitko (2004):

D = (Difference between upper and lower groups' average scores) / Possible range of scores

Using the information from Table 9.1 but presenting it in the following format as in Table 9.2, we can compute the discrimination index of the short-answer essay question.

Table 9.2: Distribution of Scores Obtained by Students

Score                        0    1    2    3    4    Total    Average Score
High marks group (n = 10)    0    0    1    4    5    34       3.4
Low marks group (n = 10)     1    3    4    2    0    17       1.7
Note: n refers to the number of students.

The average score obtained by the upper group of students is 3.4, while that of the lower group is 1.7. Using the formula suggested by Nitko (2004), we can compute the discrimination index of the short-answer essay question as follows:

D = (3.4 − 1.7) / 4 = 0.43

The discrimination index (D) of 0.43 indicates that the short-answer question does discriminate between the upper and lower groups of students, and at a high level (refer to Figure 9.3). As in the computation of the discrimination index of a multiple-choice question for a large group of students, a sample comprising the top 27 per cent and the bottom 27 per cent may be used to provide a good estimate. (These computations are sketched in code after the following list.)

The following are two possible reasons for poorly discriminating items:
(a) The item tests something different from the majority of items in the test; or
(b) The item is poorly written and confuses the students.
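As flagged above, here is a minimal sketch of the essay-item computations, using the Table 9.2 data; the helper function average() is ours, introduced only for illustration.

```python
# Counts of students earning each score (0 to 4) on the short-answer
# essay item, taken from Table 9.2.
high_counts = {0: 0, 1: 0, 2: 1, 3: 4, 4: 5}    # high marks group, n = 10
low_counts = {0: 1, 1: 3, 2: 4, 3: 2, 4: 0}     # low marks group, n = 10
score_range = 4                                  # maximum 4, minimum 0

def average(counts):
    """Average score for a group, from its score-frequency counts."""
    n = sum(counts.values())
    return sum(score * freq for score, freq in counts.items()) / n

high_avg = average(high_counts)    # 3.4
low_avg = average(low_counts)      # 1.7

# Difficulty: whole-group average divided by the possible range.
# Both groups have 10 students, so the overall average is their midpoint.
p = (high_avg + low_avg) / 2 / score_range
print(p)    # 0.6375, reported as 0.64 in the text

# Discrimination (Nitko, 2004): difference of group averages over the range.
D = (high_avg - low_avg) / score_range
print(D)    # 0.425, reported as 0.43 in the text
```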
Thus, when examining a low-discriminating item, it is advisable to check whether:
(a) The wording and format of the item are problematic; and
(b) The item is testing something different from what was intended for the test.

ACTIVITY 9.2
1. The following is the performance of students in the high marks and low marks groups on a short-answer essay question.

Score                        0    1    2    3    4
High marks group (n = 10)    2    2    3    1    2
Low marks group (n = 10)     3    2    2    3    0

(a) Calculate the difficulty index.
(b) Calculate the discrimination index.
Discuss the findings on the myINSPIRE online forum.

2. A teacher gave a 35-item Economics test to 42 students. For Question 16, 8 out of 11 students from the high marks group got the correct answer, compared with 4 out of 11 from the low marks group.
(a) Calculate the discrimination index for Question 16.
(b) Does Question 16 have a high or low discrimination index?
Post your answers on the myINSPIRE online forum.

9.6 RELATIONSHIP BETWEEN DIFFICULTY INDEX AND DISCRIMINATION INDEX

Theoretically, the more difficult (or the easier) a question or item is, the lower its discrimination index will be. Stanley and Hopkins (1972) provided a theoretical model to explain the relationship between the difficulty index and discrimination index of a particular question or item (refer to Figure 9.4).

Figure 9.4: Theoretical relationship between difficulty index and discrimination index
Source: Stanley and Hopkins (1972)

According to the model, a difficulty index of 0.2 can result in a discrimination index of about 0.3 for a particular item (which may be described as an item of "moderate discrimination"). Note that as the difficulty index increases from 0.1 to 0.5, the discrimination index increases even more. When the difficulty index reaches 0.5 (described as an item of "moderate difficulty"), the discrimination index is positive 1.00 (very high discrimination). Interestingly, a difficulty index of more than 0.5 leads to a decrease in the discrimination index. For example, a difficulty index of 0.9 results in a discrimination index of about 0.2, which is described as low to moderate discrimination. What does this mean? Recall that a high difficulty index means an easy item: the easier a question, the harder it is for that question or item to discriminate between those students who know and those who do not know the answer.

Similarly, when the difficulty index is about 0.1, the discrimination index drops to about 0.2. What does this mean? The more difficult a question, the harder it is for that question or item to discriminate between those students who know and those who do not know the answer.

ACTIVITY 9.3
1. What can you conclude about the relationship between the difficulty index of an item and its discrimination index?
2. Do you take these factors into consideration when giving an objective test to students in your school? Justify.
Share your answers with your coursemates in the myINSPIRE online forum.

9.7 DISTRACTOR ANALYSIS

In addition to examining the performance of an entire test item, teachers are also interested in examining the performance of individual distractors (incorrect answer options) in multiple-choice items.
By calculating the proportion of students who chose each answer option, teachers can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and are not being chosen by many students. To reduce blind guessing, which can produce a correct answer purely by chance (and hurts the validity of a test item), teachers want as many plausible distractors as is feasible. Analyses of response options allow teachers to fine-tune and improve items they may wish to use again with future classes.

Let us examine performance on an item or question (refer to Figure 9.5).

Figure 9.5: Effectiveness of distractors

Generally, a good distractor is able to attract more "low marks" students to select that particular response, or tempt "high marks" students towards selecting it. What determines the effectiveness of distractors? Figure 9.5 shows how 24 students selected options A, B, C and D for a particular question. Option B is a less effective distractor because many "high marks" students (n = 5) selected it. Option D is a relatively good distractor because two students from the "high marks" group and five students from the "low marks" group selected it. The analysis of response options shows that those who missed the item were about equally likely to choose answer B and answer D. No students chose answer C, meaning it does not act as a distractor. Students were not choosing among four answer options on this item; they were really choosing among only three options, as they were not even considering answer C. This makes guessing correctly more likely, which hurts the validity of the item. The discrimination index can be improved by modifying and improving options B and C.

ACTIVITY 9.4
Which British resident was killed by Maharajalela in Pasir Salak?

Options               A (Hugh Low)   B (Birch)   C (Brooke)   D (Gurney)   No Response
High marks (n = 15)   4              7           0            4            0
Low marks (n = 15)    6              3           2            4            0

The answer is B. Analyse the effectiveness of the distractors. Discuss your answer with your coursemates on the myINSPIRE online forum.

9.8 PRACTICAL APPROACH IN ITEM ANALYSIS

Some teachers may find the techniques discussed earlier time consuming, and this cannot be denied, especially when you have a test consisting of 40 items. Imagine that you have administered a 40-item test to a class of 30 students. It will surely take a lot of time to analyse the effectiveness of each item, and this may discourage teachers from analysing each item in a test. However, there is a more practical approach which takes less time. Here is the method:

Step 1: Arrange the 30 answer sheets from the highest score obtained to the lowest score obtained.

Step 2: Select the answer sheet that obtained the middle score. Group all answer sheets above this score as the "high marks" group (mark an "H" on these answer sheets). Group all answer sheets below this score as the "low marks" group (mark an "L" on these answer sheets).

Step 3: Divide the class into two groups (high and low) and distribute the "H" answer sheets to the high group and the "L" answer sheets to the low group. Assign one student in each group to be the counter.

Step 4: The teacher then asks the class, "The answer for Question #1 is C. Those who got it correct, raise your hand."
Counter from the "H" group: "Fourteen for group H."
Counter from the "L" group: "Eight from group L."

Step 5: The teacher records the responses on the whiteboard as follows:

              High   Low   Total of Correct Answers
Question #1   14     8     22
Question #2   12     6     18
Question #3   16     7     23
...
Question #n   n      n     n

Step 6: Calculate the difficulty index for Question #1 as follows:

Difficulty index = (R_H + R_L) / T = (14 + 8) / 30 = 22/30 = 0.73

Step 7: Compute the discrimination index for Question #1 as follows:

Discrimination index = (R_H − R_L) / (T/2) = (14 − 8) / 15 = 6/15 = 0.40

Note that earlier, we took 27 per cent of the answer sheets for the "high marks" group and 27 per cent for the "low marks" group from the total answer sheets. In this approach, however, we divide the total answer sheets into two groups; there is no middle group. The important thing is to use a large enough fraction of the group to provide useful information. Selecting the top and bottom 27 per cent of the group is recommended for a more refined analysis. This method may be less accurate, but it is a "quick and dirty" method.

ACTIVITY 9.5
Compare the difficulty index and discrimination index obtained using this rough method with the theoretical model by Stanley and Hopkins (1972) in Figure 9.4. Are the indices very far off? Share your answer with your coursemates in the myINSPIRE online forum.

9.9 USEFULNESS OF ITEM ANALYSIS TO TEACHERS

After each test or assessment, it is advisable to carry out item analysis of the test items, because the information from the analysis is useful to teachers. Among the benefits they can get from the analysis are the following:

(a) From the discussion in the earlier subtopics, it is obvious that the results of item analysis can provide answers to the following questions:
(i) Did the item function as intended?
(ii) Were the items of appropriate difficulty?
(iii) Were the items free from irrelevant clues and other defects?
(iv) Was each of the distractors effective (in multiple-choice questions)?

Answers to these questions can be used to select or revise test items for future use. This improves the quality of the test items and the test papers to be used in future. It also saves teachers' time in preparing test items for future use, because good items can be stored in an item bank.

(b) Item analysis data can provide a basis for efficient class discussion of the test results. Knowing how effectively each test item measures the achievement of the intended learning outcome, and how students perform on each item, teachers can have a more fruitful discussion with students, giving feedback based on item analysis that is more objective and informative. For example, teachers can highlight the misinformation or misunderstanding reflected in the choice of particular distractors in multiple-choice questions, or frequently repeated errors in essay-type questions, thereby enhancing the instructional value of assessment. If, during the discussion, the item analysis reveals technical defects in the items or in the marking scheme, students' marks can also be rectified to ensure a fairer test.

(c) Item analysis data can be used for remedial work. The analysis will reveal the specific areas in which the students are weak. Teachers can use the information to focus remedial work directly on those particular areas of weakness.
For example, distractor analysis may show that a particular distractor has low discrimination, with a great number of students from both the high marks and low marks groups choosing that option. This could suggest some misunderstanding of a particular concept. Remedial lessons can thus be planned to address the problem.

(d) Item analysis data can reveal weaknesses in teaching and provide useful information to improve teaching. For example, despite an item being properly constructed, it may have a low difficulty index, suggesting that most students failed to answer the item satisfactorily. This might suggest that the students have not mastered the particular syllabus content being assessed. This could be due to weaknesses in instruction, and thus necessitates the implementation of more effective teaching strategies by the teachers. Furthermore, if the item proves difficult administration after administration, there might be a need to revise the curriculum.

(e) Item analysis procedures provide a basis for teachers to improve their skills in test construction. As teachers analyse students' responses to items, they become aware of the defects of the items and what causes them. When revising the items, they gain experience in rewording statements so that they are clear, rewriting distractors so that they are more plausible, and modifying items so that they are at a more appropriate level of difficulty. As a consequence, teachers improve their test construction skills.

9.10 CAUTION IN INTERPRETING ITEM ANALYSIS RESULTS

Despite the usefulness of item analysis, the results from such an analysis are limited in many ways and must be interpreted cautiously. The following are some of the major precautions to observe:

(a) Item discriminating power does not indicate item validity. A high discrimination index merely indicates that students from the high marks group perform relatively better than students from the low marks group. The division into high and low marks groups is based on the total test score obtained by each student, which is an internal criterion. By using the internal criterion of total test score, item analysis offers evidence concerning the internal consistency of the test rather than its validity. The validity of a test needs to be judged by an external criterion, that is, the extent to which the test assesses the intended learning outcomes.

(b) The discrimination index is not always an indicator of item quality. For example, a low index of discriminating power does not necessarily indicate a defective item. If an item does not discriminate but has been found to be free from ambiguity and other technical defects, the item should be retained, especially in a criterion-referenced test. In such a test, a non-discriminating item may suggest that all students have achieved the criterion set by the teacher; as such, the item does not discriminate between the good and poor students. Another possible reason why low discrimination occurs for an item is that the item may be very easy or very difficult. Sometimes, however, such an item is necessary or desirable to retain in order to measure a representative sample of learning outcomes and course content. Moreover, an achievement test is usually designed to measure several different types of learning outcomes (knowledge, comprehension, application and so on).
In such a case, there will be learning outcomes that are assessed by fewer test items, and these items will have low discrimination because they have less representation in the total test score. Removing these items from the test is not advisable, as it will affect the validity of the test.

(c) Traditional item analysis data is tentative. It is not fixed but is influenced by the type and number of students being tested and the instructional procedures employed. The data would thus change with every administration of the same test items. So, if repeated use of items is possible, item analysis should be carried out for each administration of each item. The tentative nature of item analysis should therefore be taken seriously and the results interpreted cautiously.

9.11 ITEM BANK

What is an item bank?

An item bank is a large collection of easily accessible questions or items that have been administered over a period of time.

For achievement tests which assess performance in a body of knowledge such as Geography, History, Chemistry or Mathematics, the questions that can be asked are rather limited. Hence, it is not surprising that previous questions are "recycled" with some minor changes and administered to a different group of students. Making good test items is not a simple task and can be time consuming for teachers. Hence, an item or question bank would be of great assistance to teachers.

An item bank consists of questions that have been analysed and stored because they are good items. Each stored item will have information on its difficulty index and discrimination index. Each item is stored according to what it measures, especially in relation to the topics of the curriculum. These items will be stored in the form of a table of specifications indicating the content being measured as well as the cognitive levels measured. For example, you will be able to draw from the item bank items measuring the application of concepts for the topic of "electricity". You will also be able to draw items from the bank with different difficulty levels. Perhaps you want to arrange easier questions at the beginning of the test so as to build confidence in students, and then gradually introduce questions of increasing difficulty.

With computerised databases, item banks are easy to access. Teachers will have at their disposal hundreds of items from which they can draw when developing classroom tests. This would certainly help them with the tedious and time-consuming task of having to construct items or questions from scratch. Unfortunately, not many educational institutions are equipped with such an item bank. The more common practice is for teachers to select items or questions from commercially prepared workbooks, past examination papers and sample items from textbooks. These sources do not have information about the difficulty index and discrimination index of items, nor about the cognitive levels of the questions or what they aim to measure. Teachers will have to figure out for themselves the characteristics of the items based on their experience in teaching the content.

However, there are certain issues to consider in setting up a question bank. One of the major concerns is how to place different test items collected over time on a common scale. The scale should indicate the difficulty of the items, with one scale per subject matter. Retrieval of items from the bank is made easy when all items are placed on the same scale.
The person in charge must also make every effort to add only quality items to the item pool. Developing and maintaining a good item bank requires a great deal of preparation, planning, expertise and organisation. Though the item response theory (IRT) approach is not a panacea for item banking problems, it can solve many of these issues (IRT is explained further in the next subtopic).

9.12 PSYCHOMETRIC SOFTWARE

Software designed for general statistical analysis, such as SPSS, can often be used for certain types of psychometric analysis. However, there are many software packages on the market designed specially to analyse test data.

Classical test theory (CTT) is an approach to psychometric analysis that has weaker assumptions than item response theory and is more applicable to smaller sample sizes. Under CTT, the student's raw test score is the sum of the scores received on the items in the test. For example, Iteman is a commercial program for classical analysis, while TAP is a free one.

Item response theory (IRT) is a psychometric approach which assumes that the probability of a certain response is a direct function of an underlying trait or traits. Under IRT, the concern is whether the student answered each item correctly or not, rather than the raw test score. The basic concept of IRT relates to the individual items of a test rather than the test scores. Student trait or ability and item characteristics are referenced to the same scale. For example, ConQuest is a computer program for item response and latent regression models, and TAM is an R package for item response models.

ACTIVITY 9.6
In the myINSPIRE forum, discuss:
(a) To what extent do Malaysian schools have item banks?
(b) Do you think teachers should have access to computerised item banks? Justify.

• Item analysis is a process which examines the responses to individual test items or questions in order to assess the quality of those items and the test as a whole.
• Item analysis is conducted to obtain information about individual items or questions in a test and how the test can be improved.
• The difficulty index is a quantitative indicator of the difficulty level of an individual item or question.
• The discrimination index is a basic measure which shows the extent to which a question discriminates or differentiates between students in the "high marks" group and the "low marks" group.
• Theoretically, the more difficult (or the easier) a question or item is, the lower its discrimination index will be.
• By calculating the proportion of students who chose each answer option, teachers can identify which distractors are "working" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and not being chosen by many students.
• Generally, a good distractor is able to attract more "low marks" students to select that particular response, or tempt "high marks" students towards selecting it.
• An item bank is a collection of questions or items that have been administered over a period of time.
• There are many psychometric software programs to help expedite the tedious calculation process.
Computerised data bank
Difficult question
Difficulty index
Discrimination index
Distractor analysis
Easy question
Good distractor
High marks group
Item analysis
Item bank
Low marks group

Blood, D. F., & Budd, W. C. (1972). Educational measurement and evaluation. New York, NY: Harper & Row.
Nitko, A. J. (2004). Educational assessments of students. Englewood Cliffs, NJ: Prentice Hall.
Stanley, G., & Hopkins, D. (1972). Introduction to educational measurement and testing. Boston, MA: Macmillan.

Topic 10  Analysis of Test Scores

LEARNING OUTCOMES
By the end of the topic, you should be able to:
1. Differentiate between descriptive and inferential statistics;
2. Calculate various central tendency measures;
3. Explain the use of standard scores;
4. Calculate Z-scores and T-scores;
5. Describe the characteristics of the normal curve; and
6. Explain the role of norms in standardised tests.

INTRODUCTION
Do you know that all the data you have collected on the performance of students have to be analysed? In this final topic, we will focus on the analysis and interpretation of the data you have collected about the knowledge, skills and attitudes of your students. Information you have collected about your students can be analysed and interpreted quantitatively and qualitatively. For the quantitative analysis of data, various statistical tools are used. For example, statistics are used to show the distribution of scores on a Geography test and the average score obtained by a group of students.

10.1 WHY USE STATISTICS?

When you give a Geography test to your class of 40 students at the end of the semester, you get a score for each student which is a measurement of a sample of the student's ability. The behaviour tested could be the ability to solve problems in Geography, such as reading maps and the globe, and interpreting graphs. For example, student A gets a score of 64 while student B gets 32. Does this mean that the ability of student A is better than that of student B? Does it mean that the ability of student A is twice the ability of student B? Are the scores 64 and 32 percentages? These scores or marks are difficult to interpret because they are raw scores. Raw scores can be confusing if no reference is made to a "unit". So, it is only logical that you convert the scores to a unit such as percentages. In this example, you get 64 per cent and 32 per cent. Even the use of percentages may not be meaningful. For example, getting 64 per cent in the test may be considered "good" if the test was a difficult one. On the other hand, if the test was easy, then 64 per cent may be considered only "average". In other words, to get a more accurate picture of the scores obtained by students in the test, the teacher should find out:

(a) Which student obtained the highest marks in the class and the number of questions correctly answered;
(b) Which student obtained the lowest marks in the class and the number of questions correctly answered; and
(c) The number of questions correctly answered by all students in the class.

This illustrates that the marks obtained by students in a test should be carefully examined. It is not enough to just report the marks obtained. More information should be given about the marks, and to do this you have to use statistics.
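As a first taste of what such an analysis looks like in practice, the three pieces of information listed above can be pulled from a mark list with a few lines of Python. This is only an illustrative sketch; the names and marks below are made up.

```python
# Made-up marks: the number of questions each student answered correctly.
marks = {"Aminah": 42, "Chong": 35, "Devi": 47, "Faizal": 29, "Mei": 35}

top = max(marks, key=marks.get)        # (a) highest scorer
bottom = min(marks, key=marks.get)     # (b) lowest scorer
class_total = sum(marks.values())      # (c) correct answers across the class

print(f"Highest: {top} with {marks[top]} correct")
print(f"Lowest: {bottom} with {marks[bottom]} correct")
print(f"Total correct answers in the class: {class_total}")
```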
Some teachers may be afraid of statistics, while others may regard it as too time consuming. In fact, many of us often use statistics without being aware of it. For example, when we talk about average rainfall, per capita income, interest rates and percentage increases in our daily lives, we are using the language of statistics. What is statistics?

Statistics is a mathematical science pertaining to the analysis, interpretation and presentation of data.

It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities. Statistics has been widely used by researchers in education and by classroom teachers. In applying statistics in education, one begins with a population to be studied. This could be all Form Two students in Malaysia, who number about 450,000, or all secondary school teachers in the country. For practical reasons, rather than compiling data about an entire population, we usually select or draw a subset of the population called a sample. In other words, the 40 Form Two students that you teach are a sample of the population of Form Two students in the country. The data you collect about the students in your class can be subjected to statistical analysis, which serves two related purposes, namely description and inference.

(a) Descriptive Statistics
You use these statistical techniques to describe how your students performed. For example, you use descriptive statistics to summarise data in a useful way, either numerically or graphically. The aim is to present the data collected so that it can be understood by teachers, school administrators, parents, the community and the Ministry of Education. The common descriptive techniques used are the mean (or average) and the standard deviation. Data may also be presented graphically using various kinds of charts and graphs.

(b) Inferential Statistics
You use inferential statistical techniques when you want to make inferences about the population based on your sample. You use inferential statistics when you want to find out the differences between groups of students or the relationships between variables, or when you want to make predictions about student performance. For example, you may want to find out whether the boys did better than the girls, or whether there is a relationship between performance in coursework and in the final examination. The inferential statistics often used are the t-test, ANOVA and linear regression.

10.2 DESCRIBING TEST SCORES

Let us assume that you have just given a test on Bahasa Melayu to a class of 35 students in Form One. After marking the scripts, you have a score for each of the students in the class and you want to find out more about how your students performed. Figure 10.1 shows the distribution of the scores obtained by students in the test.

Figure 10.1: The distribution of Bahasa Melayu marks

The "frequency" column shows how many students obtained each mark, and the corresponding percentage is shown in the "percentage" column. You can describe these scores using two types of measures, namely central tendency and dispersion.

10.2.1 Central Tendency

The term "central tendency" refers to the "middle" value and is measured using the mean, median and mode. It is an indication of the location of the scores.
Each of these three measures is calculated differently, and which one to use depends on the situation and what you want to show.

(a) Mean
The mean is the most commonly used measure of central tendency. When we talk about an "average", we usually refer to the mean. The mean is simply the sum of all the values (marks) divided by the total number of items (students) in the set. The result is referred to as the arithmetic mean. Using the data from Figure 10.1 and applying the formula given, you can calculate the mean:

Mean (x̄) = Σx/N = (35 + 40 + 41 + ... + 75)/35 = 1863/35 = 53.23

(b) Median
The median is determined by sorting the scores obtained from the lowest to the highest value and taking the score that is in the middle of the sequence. For the example in Figure 10.1, the median is 52. There are 17 students with scores less than 52 and 17 students whose scores are greater than 52. If there is an even number of students, there will not be a single point in the middle. In that case, you calculate the median by taking the mean of the two middle points, that is, by dividing the sum of the two middle scores by 2.

(c) Mode
The mode is the most frequently occurring score in the data set. Which number appears most often in your data set? In Figure 10.1, the mode is 57 because seven students obtained that score. However, you can also have more than one mode. If you have two modes, the distribution is bimodal.

Distributions of scores may be graphed to demonstrate visually the relationship among the scores in a group. In such graphs, the horizontal axis (x-axis) is the continuum on which the individuals are measured. The vertical axis (y-axis) is the frequency (or the number) of individuals earning any given score shown on the x-axis. Figure 10.2 shows a histogram representing the scores for the Bahasa Melayu test obtained by the group of 35 students shown earlier in Figure 10.1.

Figure 10.2: Graph showing the distribution of Bahasa Melayu test scores

SELF-CHECK 10.1
1. What is the difference between descriptive statistics and inferential statistics?
2. What is the difference between mean, median and mode?

10.2.2 Dispersion

Although the mean tells us about the group's average performance, it does not tell us how close to the average or mean the students scored. For example, did every student score 80 per cent on the test, or were the scores spread out from 0 to 100 per cent? Dispersion is the spread of the scores, and among the measures used to describe it are the range and the standard deviation.

(a) Range
The range of scores in a test refers to the distance between the lowest and highest scores obtained in the test; it is the distance between the extremes of a distribution.

(b) Standard Deviation
Standard deviation refers to how much the scores (obtained by students) deviate or differ from the mean. Table 10.1 shows the scores obtained by 10 students on a Science test.

Table 10.1: Scores on a Science Test Obtained by 10 Students

Marks (x)   x − x̄           (x − x̄)²
35          35 − 39 = −4    (−4)² = 16
39          39 − 39 = 0     (0)² = 0
45          45 − 39 = 6     (6)² = 36
40          40 − 39 = 1     (1)² = 1
32          32 − 39 = −7    (−7)² = 49
42          42 − 39 = 3     (3)² = 9
37          37 − 39 = −2    (−2)² = 4
44          44 − 39 = 5     (5)² = 25
36          36 − 39 = −3    (−3)² = 9
41          41 − 39 = 2     (2)² = 4
Sum = 390; Mean (x̄) = 39; N = 10; Σ(x − x̄)² = 153

Based on the raw scores, you can calculate the standard deviation of a sample using the formula given.
Standard deviation (s) = √[Σ(x − x̄)² / (N − 1)] = √(153/9) = √17 = 4.12

The steps in calculating the standard deviation are as follows:
(i) The first step is to find the mean, which is 390 divided by 10 (the number of students) = 39;
(ii) Next, subtract the mean from each score, as in the column labelled x − x̄. Each difference is then squared (note that all the squared values are positive) and the squared differences are summed; and
(iii) The standard deviation is the positive square root of 153 divided by 9, which gives 4.12.

To better understand what the standard deviation means, refer to Figure 10.3, which shows the spread of scores with the same mean but different standard deviations.

Figure 10.3: Distribution of scores with varying standard deviations

Based on Figure 10.3:
(i) For Class A, with a standard deviation of 4.12, approximately 68 per cent of students (within 1 standard deviation) scored between 34.88 and 43.12;
(ii) For Class B, with a standard deviation of 2, approximately 68 per cent of students scored between 37 and 41; and
(iii) For Class C, with a standard deviation of 1, approximately 68 per cent of students scored between 38 and 40.

Note that the smaller the standard deviation, the more the scores tend to "bunch" around the mean, and vice versa. Hence, it is not enough to examine the mean alone, because the standard deviation tells us a lot about the spread of the scores around the mean. Which class do you think performed better? The mean does not tell us which class performed better. Class C performed the best because approximately two-thirds of the students scored between 38 and 40.

Skew refers to the symmetry of a distribution. A distribution is skewed if one of its tails is longer than the other. Figure 10.4 shows the distribution of the scores obtained by 38 students on a History test.

Figure 10.4: Distribution of History test scores

There is a negative skew because the distribution has a longer tail in the negative direction. What does it mean? It means that more students were getting high scores on the History test, which may indicate either that the test was too easy or that the teaching methods and materials were successful in bringing about the desired learning outcomes.

Now, let us look at Figure 10.5, which shows the distribution of the scores obtained by 38 students on a Biology test.

Figure 10.5: Distribution of Biology test scores

There is a positive skew because the distribution has a longer tail in the positive direction. What does it mean? It means that more students were getting low scores in the Biology test, which indicates that the test was too difficult. Alternatively, it could imply that the questions were not clear, or that the teaching methods and materials did not bring about the desired learning outcomes.

SELF-CHECK 10.2
What is the difference between range and standard deviation?

ACTIVITY 10.1
1. What is the difference between a standard deviation of 2 and a standard deviation of 5?
2. A teacher administered an English test to 10 students in her class. The students earned the following marks: 14, 28, 48, 52, 77, 63, 84, 87, 90 and 98. For the distribution of marks, find the following:
(a) Mean;
(b) Median;
(c) Range; and
(d) Standard deviation.
Post your answers on the myINSPIRE online forum.
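If you want to check your hand calculations for Activity 10.1, the measures above can be reproduced with Python's standard statistics module; a minimal sketch:

```python
import statistics

# The 10 English marks from Activity 10.1, question 2.
marks = [14, 28, 48, 52, 77, 63, 84, 87, 90, 98]

print("Mean:", statistics.mean(marks))
print("Median:", statistics.median(marks))
print("Range:", max(marks) - min(marks))
# statistics.stdev() divides by N - 1, matching the sample standard
# deviation formula shown earlier in this subtopic.
print("Standard deviation:", statistics.stdev(marks))
```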
10.3 STANDARD SCORES

After having given a test, most teachers report the raw scores obtained by students. For example, Zulinda, a Form Five student, earned the following scores in the end-of-semester examination:
(a) Science: 80;
(b) History: 72; and
(c) English: 40.

With these raw scores alone, what can you say about Zulinda's performance in these tests or her standing in the class? Actually, you cannot say very much. Without knowing how these raw scores compare with the total distribution of raw scores for each subject, it is difficult to draw any meaningful conclusion regarding her relative performance in each of these tests. How do you make these raw scores meaningful? Let us assume that the scores of all three tests are approximately normally distributed. The means and standard deviations of the three tests are shown in Table 10.2.

Table 10.2: Mean and Standard Deviation for the Three Tests

Subject   Mean   Standard Deviation
Science   90     10
History   60     12
English   40     15

Based on this additional information, what statements can you make regarding Zulinda's relative performance in each of these three tests? The following are some conclusions you can make:
(a) Zulinda did best in the History test; her raw score of 72 falls at a point one standard deviation above the mean;
(b) Her next best score is English, where her raw score of 40 falls exactly on the mean of the distribution of scores; and
(c) Finally, even though her raw score for Science was 80, it falls one standard deviation below the mean.

Converting Zulinda's raw scores into Z-scores, we can say that she achieved a:
(i) Z-score of +1 for History;
(ii) Z-score of 0 for English; and
(iii) Z-score of −1 for Science.

10.3.1 Z-score

What is a Z-score? How do you calculate it? A Z-score is a type of standard score. The term "standard score" is the general name for a raw score that has been converted to another scale using a predetermined mean and a predetermined standard deviation. Z-scores tell how many standard deviations away from the mean the score is located. Z-scores can be positive or negative. A positive Z-score indicates that the value is above the mean, while a negative Z-score indicates that the value falls below the mean. A Z-score is a raw score that has been transformed or converted to a scale with a predetermined mean of 0 and a predetermined standard deviation of 1. For instance, a Z-score of −6 means that the score is 6 standard deviations below the mean.

The formula for transforming a raw score into a Z-score involves subtracting the mean from the raw score and then dividing by the standard deviation:

Z = (x − x̄) / SD

For example, let us use this formula to convert Kumar's mark of 52 obtained in a Geography test. The mean for the test is 70 and the standard deviation is 7.5.

Z = (52 − 70) / 7.5 = −18/7.5 = −2.4

The Z-score calculated for the raw score of 52 is −2.4, which means that Kumar's score for the Geography test is located 2.4 standard deviations below the mean.

10.3.2 Example of Using the Z-score to Make Decisions

A teacher administered two Bahasa Melayu tests to students in Form Four A, Form Four B and Form Four C. The two top students in Form Four C were Seng Huat and Mei Ling. The teacher was planning to give a prize to the best Bahasa Melayu student in Form Four C but was not sure who the better student was.
                     Test 1   Test 2
Seng Huat            30       50
Mei Ling             45       35
Mean                 42       47
Standard deviation   7        8

The teacher could use the mean to determine who was better. However, both students have the same mean. How does the teacher decide? A Z-score can tell the teacher how far from the mean the scores of the two students are, and thus who performed better. Using the formula, the teacher calculates the Z-scores as follows:

Seng Huat: Test 1: (30 − 42)/7 = −1.71; Test 2: (50 − 47)/8 = 0.375; Total = −1.34
Mei Ling:  Test 1: (45 − 42)/7 = 0.43;  Test 2: (35 − 47)/8 = −1.50; Total = −1.07

Upon examining the calculation, the teacher finds that both Seng Huat and Mei Ling have negative total Z-scores for the two tests. However, Mei Ling has a higher total Z-score (−1.07) compared with Seng Huat's total Z-score (−1.34). In other words, Mei Ling's total score was closer to the mean, and therefore the teacher concludes that Mei Ling did better than Seng Huat.

Z-scores are relatively simple to use, but many educators are reluctant to use them, especially when test scores are reported as negative numbers. How would you like to have your mathematics score reported as −4? For this reason, alternative standard score methods such as the T-score are used.

10.3.3 T-score

The T-score was developed by W. McCall in the 1920s and is one of the many standard scores currently in use. T-scores are widely used in the fields of psychology and education, especially when reporting performance in standardised tests. The T-score is a standardised score with a mean of 50 and a standard deviation of 10. The formula for calculating the T-score is:

T = 10(z) + 50

For example, a student has a Z-score of −1.0. After converting it to a T-score, you get the following:

T = 10(z) + 50 = 10(−1.0) + 50 = (−10) + 50 = 40

When converting Z-scores to T-scores, you should be careful not to drop the negatives. Dropping the negatives will result in a completely different score.

ACTIVITY 10.2
1. Convert the following Z-scores to T-scores.

Z-score   T-score
+1.0
−2.4
+1.8

2. Why would you use T-scores rather than Z-scores when reporting the performance of students in the classroom?
Share your answers with your coursemates on the myINSPIRE online forum.
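The two conversions in this subtopic are easy to script, which also guards against the dropped-negative mistake mentioned above. A minimal sketch, using Kumar's Geography mark from earlier (the function names are ours):

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd

def t_score(z: float) -> float:
    """T = 10z + 50; the sign of z must be kept when converting."""
    return 10 * z + 50

# Kumar's Geography mark of 52, where the mean is 70 and the SD is 7.5.
z = z_score(52, 70, 7.5)
print(z)           # -2.4
print(t_score(z))  # 26.0, well below the T-score mean of 50
```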
10.4 THE NORMAL CURVE

The normal curve (also called the "bell curve") is a hypothetical curve that describes how many naturally occurring characteristics are distributed. In a normal distribution, the mean, median and mode have the same value. Suppose we were to sample a particular characteristic such as the height of Malaysian men and found the average height to be 162.5 cm (5 feet 4 inches). A few men will be relatively shorter and an equal number relatively taller. By plotting the heights of all Malaysian men according to frequency of occurrence, you can expect to obtain something similar to a normal distribution curve.

Besides height, the normal distribution curve can be seen in IQ scores. Figure 10.6 shows a normal distribution curve for IQ based on the Wechsler intelligence scale for children.

Figure 10.6: The normal distribution curve

In a normal distribution, about two-thirds of individuals will have an IQ of between 85 and 115, with a mean of 100. According to the American Association on Intellectual and Developmental Disabilities, individuals who have an IQ of less than 70 may be classified as having an intellectual disability, while those who have an IQ of more than 130 may be considered gifted.

Similarly, test scores that measure a particular characteristic such as language proficiency, quantitative ability or scientific literacy of a specific population can be expected to produce a normal curve.

The normal curve is divided according to standard deviations (i.e. -4s, -3s, ..., +3s and +4s), which are shown on the horizontal axis. The area of the curve between standard deviations is indicated as a percentage on the diagram. For example, the area between the mean and standard deviation +1 is 34.13 per cent. Similarly, the area between the mean and standard deviation -1 is also 34.13 per cent. Hence, the area between standard deviation -1 and standard deviation +1 is 68.26 per cent. It means that in a normal distribution, 68.26 per cent of individuals will score between standard deviations -1 and +1.

In using the normal curve, it is important to make a distinction between standard deviation values and standard deviation scores. A standard deviation value is a constant and is shown on the horizontal axis in Figure 10.6. On the other hand, the standard deviation score is the score obtained when we use the standard deviation formula (which we discussed earlier). For example, if we obtained a standard deviation of 5, then the score for one standard deviation is 5, the score for two standard deviations is 10, the score for three standard deviations is 15 and so forth. Standard deviation values of -1, -2 and -3 will have corresponding negative scores of -5, -10 and -15.

Note that in Figure 10.6, Z-scores are indicated from +1 to +4 and -1 to -4, with the mean as 0. Each interval is equal to one standard deviation. Similarly, T-scores are reported from 10 to 90 (intervals of 10), with the mean set at 50. Each interval of 10 is equal to one standard deviation.
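The percentages quoted above can be verified numerically. The sketch below uses only Python's standard library: the helper phi (our own name) computes the cumulative proportion of a standard normal distribution up to a given Z-score via the error function.

    import math

    def phi(z):
        """Cumulative proportion of the standard normal curve up to z."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    # Proportion of scores lying between -1 and +1 standard deviations:
    print(round(phi(1) - phi(-1), 4))          # 0.6827, i.e. about 68.26 per cent

    # Proportion of IQ scores (mean 100, SD 15) falling between 85 and 115:
    z_low, z_high = (85 - 100) / 15, (115 - 100) / 15
    print(round(phi(z_high) - phi(z_low), 4))  # the same 0.6827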
10.5 NORMS

In norm-referenced assessment, an individual's performance is evaluated in relation to other people's performances. Norm-referenced tests are seldom used in Malaysia, but in the United States standardised tests are widely used. Perhaps because of the decentralised education system in the United States, school-based assessment is extensively practised there. Unlike Malaysia, the United States has no national examinations like the PMR and SPM. Hence, teachers there who want to find out how their students are performing compared with other students in the country rely on norm-referenced tests to compare the performances of their students with the performances of other students in the norm group.

What are norms?

Norms are the characteristics of a population accurately estimated from the characteristics of a representative subset of the population (called the sample or norm sample).

Norms are produced based on the norm sample. For example, if you have norms of reading ability for children of different age groups, you will be able to compare the performance of a seven-year-old in your class on the reading ability test with the rest of the population. In other words, you can determine whether your seven-year-old is reading at the level of other seven-year-olds in the country. In establishing these norms, you have to ensure that the norm sample is representative of the population.

Representativeness

When you compare your students with the rest of the population, you want to ensure that the norm sample is representative. In other words, the individuals tested in the norm sample must consist of the appropriate age group, taking into consideration gender differences, geographic location and cultural differences. For example, the eight-year-olds selected for the norm sample should reflect eight-year-olds in the rest of the country according to gender (male and female), geographic location (urban or rural) and cultural differences.

Suppose the norm sample consists of 3,000 Malaysian primary school children, with 500 students for each age group (seven-year-olds = 500 students, eight-year-olds = 500 and so forth). The norm sample should consist of children from all the states of Malaysia, include all the ethnic groups in the country and be drawn from different socioeconomic backgrounds and geographic locations. Based on this norm sample of 3,000 primary school children, the following hypothetical norms on reading ability in Bahasa Melayu for Malaysian children were produced (refer to Table 10.3).

Table 10.3: Norms for a Reading Ability Test

Reading Ability (Eight-year-olds)
Score     Percentile
 50           96
 49           90
 48           84
 47           78
 46           70
 45           66
 44           58
 43           50
 42           45

Percentile ranks (percentiles) are used in standardised tests to allow teachers to compare the performance of their students with the norm group. An eight-year-old student who obtained a score of 48 on the test has a percentile rank of 84. This means that the student is reading as well as, or better than, 84 per cent of the other eight-year-old students who took the test. Similarly, an eight-year-old who obtains a percentile rank of 45 is reading as well as, or better than, 45 per cent of eight-year-olds in the norm sample.

To use norms effectively, you should be sure that the norm sample is appropriate, both for the purpose of testing and for the person being tested. If you recognise that the test norms are inadequate, you should be cautious because you may obtain misleading information about the abilities of your students. The organisation responsible for developing the norms should clearly state the groups tested because you want to ensure that the norm sample is similar to your students. In other words, the norm sample should consist of the same type of people, in the same proportion, as is found in the population of reference. The norm sample should also be large enough for the norms to be stable over time.
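The idea behind a percentile rank is easy to express in code. The following Python sketch treats the percentile rank as the percentage of the norm sample scoring at or below a given raw score; the function name and the simulated norm sample are our own illustrative assumptions, not data from an actual norming study.

    import random

    def percentile_rank(score, norm_scores):
        """Percentage of the norm sample scoring at or below the given score."""
        at_or_below = sum(1 for s in norm_scores if s <= score)
        return 100 * at_or_below / len(norm_scores)

    # Hypothetical norm sample: reading scores of 500 eight-year-olds,
    # simulated here as roughly normal with mean 45 and SD 3.
    random.seed(1)
    norm_sample = [random.gauss(45, 3) for _ in range(500)]

    print(round(percentile_rank(48, norm_sample)))   # about 84 (cf. Table 10.3)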
SELF-CHECK 10.3

1. List some characteristics of the normal curve.
2. What are norms? How are norms used?
3. Do you think we should have standardised tests with norms for the measurement of different kinds of abilities? Why?

Summary

• Statistics is a mathematical science pertaining to the analysis, interpretation and presentation of data. Data collected about students can be subjected to statistical analysis, which serves two related purposes: descriptive and inferential.

• The term "central tendency" refers to the "middle" value and is measured using the mean, median and mode. It is an indication of the location of scores.

• The mean is simply the sum of all the values (marks) divided by the total number of items (students) in the set.

• The range is the difference between the highest and lowest scores obtained in the test.

• Standard deviation refers to how much the scores obtained by students deviate or differ from the mean.

• Skew refers to the asymmetry of a distribution. A negative skew has a longer tail in the negative direction; a positive skew has a longer tail in the positive direction.

• A standard score is a raw score that has been converted from one scale to another scale using the mean and standard deviation.

• The Z-score tells how many standard deviations away from the mean a score is located.

• The T-score is a standardised score with a mean of 50 and a standard deviation of 10.

• The normal curve (also called the "bell curve") is a hypothetical curve that describes how many naturally occurring characteristics are distributed.

• In norm-referenced assessment, an individual's performance is evaluated in relation to other people's performances.

• Norms are the characteristics of a population accurately estimated from the characteristics of a representative subset of the population, called the sample or norm sample.

Key Terms

Central tendency          Norms
Descriptive statistics    Positive skew
Dispersion                Range
Inferential statistics    Standard deviation
Mean                      Standard score
Median                    T-scores
Negative skew             Z-scores
Normal curve

MODULE FEEDBACK

If you have any comment or feedback, you are welcome to:

1. E-mail your comment or feedback to modulefeedback@oum.edu.my

OR

2. Fill in the Print Module online evaluation form available on myINSPIRE.

Thank you.

Centre for Instructional Design and Technology
(Pusat Reka Bentuk Pengajaran dan Teknologi)
Tel No.: 03-78012140
Fax No.: 03-78875911 / 03-78875966