AMEP Assessment Task Bank Professional Development Kit
General Issues and Principles of Assessment
Notes to accompany the PowerPoint for the AMEP Assessment Task Bank Professional Development Kit. Developed by Marian Hargreaves for NEAS, 2013.

Slide 1: Front page

Slide 2: Workshop aims
To review:
• the key issues in assessment
• the principles of assessment
• how they apply in the context of the AMEP.

Part 1

Slide 3: Key issues
The stakeholders: who has an interest in this activity of assessment?
The different purposes of assessment: why do we assess anyway?
The stages of assessment: when do we assess? How do we assess?
The focus of assessment: what do we assess?

Slide 4: The stakeholders (and the stakes): Who?
For language assessment, stakeholders usually include the participants at every level involved in the process:
• The students, of course – they want to know how they are doing and to be able to prove their ability. For students, the stakes are very high and may determine the course and direction of their lives.
• The teachers – they also want to know how the students are doing, how they can best improve their teaching practice and how they can assist the students in their achievement. For teachers the stakes are not so high; the assessment practice may have more impact on their professional development and teaching style.
• The management – which needs evidence that the program is working and that there is value for money. The stakes here may be quite high, as funds (both from sponsors and from fees) may depend upon good results.
• The administrators – who deal with the daily keeping of records and process the evidence provided by assessment. The stakes are not high for administrators; it's just a part of the job.
• DIAC – the sponsors paying for the program, in this case the Adult Migrant English Program (AMEP), and the development of the Assessment Task Bank (ATB). The personal nature of the stakes does not apply here, but decisions made regarding the future of the program may be affected by the results of assessment.
• The rest of the world – which takes the results of assessment and makes key decisions about employment, funding, development etc. The stakes are even less personal here, but hindsight can trace the effect of assessment.

Slide 5: Different purposes of assessment: (Why?)
• Prior to learning – diagnostic or placement: to see what class a student should be put into; to find out a student's strengths and weaknesses.
• As learning – formative: to see how students are progressing.
• Of learning – summative: to see what students have actually learned and achieved.
• For learning – as a continuous and integral part of the inclusive, student-centred learning process.
The aim of assessment for the Certificates of Spoken and Written English (CSWE) is summative, but assessment of learners within the Adult Migrant English Program also includes diagnostic and formative assessment. It is an area for discussion and development in support of classroom practice.

Slide 6: Stages of assessment: (When?)
The Teaching Learning cycle [diagram]
Talk to the diagram as much as you feel necessary/useful. Give as a handout.

Slide 7: The focus of assessment: (What?)
Always a key question: what exactly is it that we are trying to assess? This is a question that should be given serious consideration and regular review – it can be too easy to make assumptions. For example, a simple reading text may contain cultural norms which, while very familiar to teachers, may be startlingly new to migrants.
A simple listening question may require an answer that actually involves a lot of writing. What have you been teaching, or what do you want to teach? A description for a Cert I Learning Outcome is very different depending on whether you are describing a place or a person. What does your centre want? This is intrinsic to the validity of assessment, so there is more on this later.

Slide 8: How do we assess? Styles
• Self assessment
• Formal tests/exams
• Informal, in-class and/or observation
• Group assessment
• Peer assessment
• Continuous assessment
Different styles suit different types of assessment.

Slide 9: Examples of different Cert modules and suitable types of assessment
However, for the achievement of CSWE learning outcomes, modules and the completed Certificate, group and peer assessment are not appropriate.
What do we assess? See the curriculum and the specifications of DIAC.

Part 2: Underpinning principles

Slide 10: Best practice in assessment
• Identify clear learning outcomes – assessment should be explicit and students should know what is expected of them – and ensure that both student and teacher play a pivotal role in focusing learning and teaching on intended learning outcomes.
• Promote active student engagement in learning.
• Recognise and value student diversity.
• Provide an opportunity for success for all students.
• Embody high-quality, timely feedback.
• Produce grades and reports of student learning achievements that are valid, reliable and accurate.
• Meet expectations and standards of national and international stakeholders, where appropriate.
• Require the involvement of leaders and managers to achieve quality enhancement and continuous improvement.

Slide 11: The Cornerstone Principles
Validity
Reliability
Practicality

Slide 12: Validity
Does the assessment task measure what you want it to measure?

Slide 13
There are several types of validity, but construct validity is often used as the overarching tenet. The construct of a test is the theory that the test is based on; for language tests, this is the theory of language ability. Construct validation is about investigating whether the performance of a test is consistent with the predictions made from these theories. So, for us using the CSWE framework, the relevant theory is that of social, communicative language.
When we talk about validity, and in particular about construct validity, we are usually referring to the macro skills and whether the test is actually assessing those skills. For example, if we are talking about the construct validity of a reading task in a proficiency test, we might start by trying to define the reading skills and strategies that learners use to read a text. The task will then be designed with those skills and strategies in mind and will attempt to target them specifically in order to measure a learner's ability in reading. If the task is designed for a specific curriculum with itemised criteria, then the task must specifically target and measure those criteria. Hence the CSWE Learning Outcomes.
Cognitive-related validity is concerned with the extent to which the cognitive processes employed by candidates are the same as those that will be needed in real-world contexts beyond the test. These contexts are generally known as Target Language Use (TLU). This often crops up in the pursuit of authenticity, both in performance and in the texts used in tests.
Context-related validity is concerned with the conditions under which the test is performed and includes aspects such as the tasks themselves, the rubrics and the topic, as well as the administration conditions. The question of fairness in a test comes under context-related validity.
Face validity is where a "test looks as if it measures what it is supposed to measure" (Hughes, 2003: 33). This is very popular with stakeholders who may not know a great deal about assessment or validity itself.

Slide 14: Reliability
Is the assessment consistent across tasks, raters and assessors? Are the conditions of administration consistent across assessment occasions?
Reliable tests produce the same or similar result on repeated use. Test results should be stable, consistent and free from errors of measurement.
There are two types of reliability: internal and external. I am mostly going to consider internal reliability, and begin with the reliability of the task. This is where criteria become important. Criterion-referenced tests are ones in which candidates are assessed against specific target behaviours. Test scores are then an indication of what a candidate can and cannot do.
Reliability is also affected by the setting and maintaining of standards. Wherever multiple versions of a test are produced, it is important to be able to show that these different versions are comparable and that the standard (in terms of proficiency level) has been maintained. This means ensuring that all the tasks used to assess a particular LO are equivalent. While this is hard to achieve, we need to come as close as possible to equivalence. For example:

Slide 15: Example: Cert II C2 Participate in a spoken transaction for information/goods & services
In a low-level speaking task, if the prompt includes a list of information which structures the task, then the prompt for all tasks for that LO should include the same amount and type of information as that included in the list in the speaking task. For example, see how, without the prompts, the second version of this speaking task would be much more difficult for a learner.
A key role for the ATB is to maintain consistency between tasks, and hence reliability.
The conditions under which the task is administered also need to be equivalent. These include, for example, the amount of time allowed for pre-reading, the use of adequate equipment for playing listening tasks, the availability of dictionaries, etc.
However, any test score will reflect a proportion of other factors, known as error: the day of the test session (the weather, administration etc. might be different); the individual candidate may vary through tiredness, loss of motivation, stress etc.; the markers or the test version may perform differently; and other factors beyond our control (like a traffic accident outside). Reliability is about measuring real ability, so we need to reduce these other factors as much as possible.

Slide 16: Scoring
At this point I would like to clarify the distinction between rating and marking.
Marking applies to the receptive skills, reading and listening. ATB assessment tasks have a marking guide or answer sheet. Theoretically this can be used by almost anybody and should, if correctly designed with all possible acceptable answers included, ensure marking reliability. However, there is enormous diversity among AMEP teachers, both in experience and in expectations.
ATB answer sheets therefore now include, wherever possible, explicit details related to the criteria of exactly which answers are acceptable and which questions, or how many questions, learners have to get correct in order to achieve the Learning Outcome.
Rating applies to the productive skills, writing and speaking, and here it is much more difficult to ensure consistency: scoring a performance is much more subjective. The criteria in the assessment grids help a great deal, and recording is a must, but rating is still problematic. The other type of reliability that we therefore need to take into account is that of the raters themselves, i.e. rater reliability. There are two types: inter-rater reliability and intra-rater reliability. Inter-rater reliability is whether two raters rating the same performance are rating with the same degree of severity. It can be very difficult to know whether there is consistency across ratings. Intra-rater reliability refers to whether the same rater rates all the learner performances with the same degree of severity.

Slide 17: Practicality
How practical is it to develop, administer and mark the tasks? This is a major issue and an integral part of deciding the usefulness of a test (Bachman & Palmer, 1996). Factors that need to be considered include the resources necessary to produce and administer the test (which includes marking, recording and giving feedback).

Slide 18: Impact and fairness: important aspects in current theories of validity
Impact is the consequence of taking the test and the effect of passing or failing, both on the candidates themselves and on the educational system and society more widely. The effects and consequences of a test include the intended (and hopefully positive) outcomes of assessment, as well as the unanticipated and sometimes negative side-effects which tests might have. For example, the washback effect from the introduction of a new test may affect (positively or negatively) the way in which teachers teach.
Many language tests, especially summative tests, are very high-stakes tests: the consequences of passing or failing will affect the candidate's entire life. An example of a high-stakes test is a driving test – passing a driving test changes your whole life. If you fail, you can take it again of course, but you will not be able to do the things that you may have been counting on, e.g. drive your children to school, drive yourself to work (and thus avoid a long and tedious journey by public transport), or go on holiday.
In the AMEP context, time is very limited. Clients want to make the most of their 510 hours of English tuition, pass their tests and get their certificates. If they fail an assessment, especially towards the end of term, it may not be possible for them to do another for some time, especially if it does not fit in with the rest of the class. If a TAFE course, for example, requires a particular LO/assessment, then the client may not be able to enrol in the TAFE course for another six months or even a year.
It is therefore very important that the test be a fair one:
• A fair test should not discriminate against sub-groups of candidates or give an advantage to other groups.
• A test should be kept confidential and secure so that candidates cannot see the questions in advance.
• Results should be clear and easy to understand.
• A test should be fair to those who rely on the results in addition to candidates, such as employers, who need to know that the test is consistent and accurately reflects the ability being tested.
• A fair test should have no nasty surprises for the candidate – the form and structure of the test should be familiar and the content appropriate to what has been taught.
Rubrics should always be both clear and appropriate. Rubrics can be understood to mean the instructions for a task, the scoring standard for that task, or both. Here, we are largely talking about the instructions to teachers and to students. Each assessment task should have clear and consistent instructions to teachers for the administration of the task and for the scoring/marking of the task, including the answer key if relevant. The instructions for the learners should also be clear, appropriate and consistent. The language of the instructions should be unambiguous and at a level of language below the level being tested. Instructions should not be longer than necessary, and the learner should have the opportunity to clarify any possible confusion or misunderstanding before the assessment begins. For example, the instructions for a writing assessment should not constitute a reading assessment in themselves. The questions in a listening assessment should be understood before the assessment begins and the recording is played. If the assessment is to be done in sections, the procedure should be clearly explained. If dictionaries are allowed, the learners should know this, and also know what type of dictionary is allowed – can they use electronic dictionaries, for example?

Slide 19: Task-based language performance assessment (TBLPA)
This approach to assessment looks at the way in which language is used to show competency. It is dependent upon consistency of performance, with more than one event being used to determine ability. There are two approaches in this theory. The first uses construct validity to show that the learner has a language ability, and authenticity/generalisability to extrapolate that ability to the area of target language use (TLU). The second uses content validity to show that the learner can do specific real-life tasks, for example function in the role of an air traffic controller. The significance of TBLPA is that both approaches are relevant for learners in the AMEP: not only do they have to perform specific tasks, but they also have to demonstrate that their language skills apply in a more general sense to Target Language Use.

Slide 20: Authenticity
Situational and interactional
It is generally accepted that assessment tasks and items (i.e. the questions) should represent language activities from real life. This relates to the Target Language Use mentioned earlier (situational authenticity). Interactional authenticity is the naturalness of the interaction between the test taker and the task, and the mental processes which accompany it. That is, is the task relevant for the test taker, or just a meaningless exercise? For example, a task based on listening for specific information can be made more situationally authentic if an everyday context, such as a radio weather forecast, is created. It can be made more interactionally authentic if the test taker is given a reason for listening, e.g. they are planning a picnic that week and must select a suitable day. It may be necessary to adapt material and activities to learners' current level of language proficiency. This will reduce the authenticity of the materials, but the situations that the learners engage in, and their interaction with the texts and with each other, can still be authentic.
We will consider authenticity a bit more when we look at tasks themselves.

Slide 21: Questions
Questions are an essential part of assessment, but the style of questioning varies according to the macroskill being assessed. Question items, their advantages, disadvantages and appropriate use are therefore addressed in a separate module. In brief, question items for reading and listening include:
• Multiple choice questions (MCQs)
• Grid completion
• Summary cloze (gap fill)
• Short answer questions
• Sentence completion
• Matching
• Ordering
• Information transfer
Question items for speaking generally include the how, what, why, when, where, etc. question forms.

Slide 22: Feedback
Arguably the whole point of assessment is in the feedback. All assessment, including summative assessment for achievement, should provide an opportunity for feedback as an important part of the learning process. Feedback should be given following every occasion of assessment. Feedback can be given either in writing or verbally, but should be confidential at all times, except where a whole-class activity is undertaken, in which case anonymous feedback should be given to the whole class. However, as it is very important to keep all documents in a summative assessment task set secure for future assessment purposes, answer keys should not be given out, nor should learners be allowed to keep their corrected response sheets.

Slide 23: And finally
You need to be humble to write assessment tasks! Even the best assessment task writer will write bad items on occasion. Here are some points which are useful to remember when writing items.
• Do the task yourself. It isn't enough just to look it over.
• Ask your peers to do the task and give their feedback – nobody writes good items alone.
• Don't be defensive about your tasks – we all write bad tasks. Accept suggestions for improvement.
• Get feedback from respondents on what they think the task and items are testing.
• Check respondent feedback against the LO specifications.
• Check overall compliance with specifications.
• Make sure that the test method is familiar to learners.
• Pre-test the task on learners.
• Check whether the language of the items is easier than the language of the text.
• Check whether all possible and plausible answers have been included in the answer key.
• Make sure that the item is contextualised.
• Check how true to life the item is and whether it looks like a real-world task.

Evaluation
Absolutely essential: do the task yourself.
When the task has been written, checked, piloted, modified and is finally ready for the Assessment Task Bank, get the proofreader to do the task themselves. They usually turn up something!
Use a checklist to ensure all aspects of the task are covered. Headings in the checklist are:
• the conditions under which the task is administered
• the characteristics of the task
• the features of the task.
For example, for a listening task, these would be:
- the text
- the items
- the text/item relationship
- the answer key.
Going through the evaluation checklist and the specifications for the learning outcome in the curriculum should identify problems in the task that need to be addressed, or indicate that the task is unsalvageable and should be abandoned.
Ask other teachers to look at your new task.

Pilot the task with students
Piloting every task with learners is a vital step in the task development process.
Piloting enables you to check whether the task works and elicits performances that are compatible with the performance criteria. It establishes whether the task requires further modification or should be rejected. Piloting should be done with at least five learners under uniform conditions. Where possible, obtain a range of learner profiles: vary the language background of learners, their age, their level of formal education and so on. Give learners the opportunity to comment on the task. This feedback should be considered in addition to the learners' performance in the task.

Modification
After the task has been done and evaluated, it should be modified. Problems which have been identified during task evaluation should be rectified at this stage according to the feedback received. The revised task is then ready to be done again, either by peers or by learners.

Slide 24: References
Alderson, J. C. (2000). Assessing Reading. Cambridge: Cambridge University Press, p. 69.
Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brindley, G. P. (1995). Language Assessment in Action. NCELTR.
Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press, pp. 59–140.
Khalifa, H. & Weir, C. (2009). General marking: performance management. In Examining Reading: Research and Practice in Assessing Second Language Reading (Studies in Language Testing, vol. 29), pp. 276–280.
Manual for Language Test Development and Examining (2011). Language Policy Division, Council of Europe.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–103). New York: Macmillan.
Mislevy, R. J. et al. (2001). Design and Analysis in Task-Based Language Assessment. Language Assessment.
Shohamy, E. (1985). A Practical Handbook in Language Testing for the Second Language Teacher. Israel: Internal Press.
Weir, C. J. (1997). The testing of reading in a second language. In C. Clapham & D. Corson (Eds.), Encyclopedia of Language and Education (pp. 39–50). Dordrecht: Kluwer Academic Publishers.

© NEAS Ltd 2014