VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF FOREIGN LANGUAGES
DEPARTMENT OF POSTGRADUATE STUDIES

GIÁP THỊ AN

EVALUATING A FINAL ENGLISH READING TEST FOR THE STUDENTS AT HANOI TECHNICAL AND PROFESSIONAL SKILLS TRAINING SCHOOL – HANOI CONSTRUCTION CORPORATION
(ĐÁNH GIÁ BÀI KIỂM TRA HẾT MÔN TIẾNG ANH CHO HỌC SINH TRƯỜNG TRUNG HỌC KỸ THUẬT VÀ NGHIỆP VỤ HÀ NỘI – TỔNG CÔNG TY XÂY DỰNG HÀ NỘI)

M.A. Minor Thesis
Field: Methodology
Code: 60.14.10
Supervisor: Phùng Hà Thanh, M.A.

HANOI – 2008

I certify that this study is my own work. The thesis, wholly or partially, has not been submitted for a higher degree.

ACKNOWLEDGEMENTS
I would like to express my deepest thanks to my supervisor, Ms. Phùng Hà Thanh, M.A., for the invaluable support, guidance and timely encouragement she gave me while I was doing this research. I am truly grateful to her for her advice and suggestions right from the beginning, when this study was only in its formative stage. I would like to send my sincere thanks to the teachers at the English Department, HATECHS, who took part in the discussion and gave insightful comments and suggestions on this paper. My special thanks also go to the students in groups KT1, KT2, KT3 and KT4 – K06 for their participation in the study as its subjects. Without them, this project could not have been so successful.
I owe a great debt of gratitude to my parents, my sisters, my husband and especially my son, who have constantly inspired and encouraged me to complete this research.

ABSTRACT
Test evaluation is a complicated phenomenon which has received much attention from researchers ever since the importance of language tests in assessing students' achievement was recognized. When evaluating a test, the evaluator should concentrate on the criteria of a good test, such as the mean, the difficulty level, discrimination, reliability and validity. In this study, the researcher chose the final reading test for students at HATECHS for evaluation, with the aim of estimating its reliability and checking its validity. This is a new test that followed the PET format and was used in the school year 2006–2007 as a procedure to assess the achievement of students at HATECHS. From the interpretation of the score data, the researcher found that the final reading test is reliable in terms of internal consistency. The face and construct validity were checked as well, and the test was concluded to be valid on the basis of the calculated validity coefficients. However, the study has limitations, which point to the researcher's directions for future studies.

TABLE OF CONTENTS
ACKNOWLEDGEMENTS ii
ABSTRACT iii
TABLE OF CONTENTS iv
LIST OF ABBREVIATIONS vi
LIST OF TABLES vii
GLOSSARY OF TERMS viii
PART ONE: INTRODUCTION 1
1. Rationale 2
2. Objectives of the study 3
3. Scope of the study 3
4. Methodology of the study 3
5. The organization of the study 4
PART TWO: DEVELOPMENT 5
CHAPTER 1: LITERATURE REVIEW 6
1.1. Language testing 6
1.1.1. Approaches to language testing 6
1.1.1.1. The essay translation approach 6
1.1.1.2. The structuralist approach 6
1.1.1.3. The integrative approach 7
1.1.1.4. The communicative approach 8
1.1.2. Classifications of language tests 8
1.2. Testing reading 10
1.3. Criteria in evaluating a test 13
1.3.1. The mean 13
1.3.2. The difficulty level 13
1.3.3. Discrimination 14
1.3.4. Reliability 15
1.3.5. Validity 17
CHAPTER 2: METHODOLOGY AND RESULTS 19
2.1. Research questions 19
2.2. The participants 19
2.3. Instrumentation and data collection 20
2.3.1. Course objectives, syllabus and materials for teaching 20
2.3.1.1. Course objectives 20
2.3.1.2. Syllabus 20
2.3.1.3. Assessment instruments 21
2.3.2. Data collection 23
2.3.3. Data analysis and results 23
2.3.3.1. Test score analysis 24
2.3.3.2. The reliability of the test 25
2.3.3.3. The test validity 28
2.3.3.4. Summary of the results of the study 32
PART THREE: CONCLUSION 33
1. Conclusion 34
2. Limitations 34
3. Future directions 35
REFERENCES 36
APPENDIX 1: THE FINAL READING TEST 38
APPENDIX 2: THE PET 44
APPENDIX 3: QUESTIONS FOR DISCUSSION 50

LIST OF ABBREVIATIONS
HATECHS: Hanoi Technical and Professional Skills Training School
M: Mean
N: Number of students
P: Part
PET: Preliminary English Test
r: Validity coefficient
Rxx: Reliability coefficient
SD: Standard deviation
Ss: Students

LIST OF TABLES
Table 1: Types of language tests 9
Table 2: Types of tests 10
Table 3: Types of reliability 16
Table 4: The syllabus for teaching English – Semester 2 21
Table 5: Components of the PET reading test 22
Table 6: Components of the final reading test 22
Table 7: The raw scores of the final reading test and the PET 24
Table 8: The reliability coefficients 27
Table 9: The validity coefficients 31

GLOSSARY OF TERMS
Discrimination is the spread of scores produced by a test, or the extent to which a test separates students from one another on a range of scores from high to low. It is also used to describe the extent to which an individual multiple-choice item separates the students who do well on the test as a whole from those who do badly.
Difficulty is the extent to which a test or test item is within the ability range of a particular candidate or group of candidates.
Mean is a descriptive statistic measuring central tendency.
The mean is calculated by dividing the sum of a set of scores by the number of scores.
Median is a descriptive statistic measuring central tendency: the middle score or value in a set.
Marker (also scorer) is the judge or observer who operates a rating scale in the measurement of oral and written proficiency. The reliability of markers depends in part on the quality of their training, the purpose of which is to ensure a high degree of comparability, both inter- and intra-rater.
Mode is a descriptive statistic measuring central tendency: the most frequently occurring score or score interval in a distribution.
Raw scores are test data in their original format, not yet transformed statistically in any way (e.g. by conversion into percentages, or by adjustment for level of difficulty of the task or any other contextual factors).
Reading comprehension test is a measure of understanding of text.
Reliability is consistency: the extent to which the scores resulting from a test are similar wherever and whenever it is taken, and whoever marks it.
Score is the numerical index indicating a candidate's overall performance on some measure. The measure may be based on ratings, judgements, grades, or the number of test items correct.
Standard deviation is a property of the normal curve. Mathematically, it is the square root of the variance of a test.
Test analysis is the process in which data from test trials are analyzed during test development to evaluate individual items as well as the reliability and validity of the test as a whole. Test analysis is also carried out after test administration in order to allow the reporting of results, and it may also be conducted for research purposes.
Test item is the part of an objective test which sets the problem to be answered by the student: usually either in multiple-choice form, as a statement followed by several choices of which one is the right answer and the rest are not, or as a true/false statement which the student must judge to be either right or wrong.
Test taker is a term used to refer to any person undertaking a test or examination. Other terms commonly used in language testing are candidate, examinee and testee.
Test-retest is the simplest method of computing test reliability; it involves administering the same test to the same group of subjects on two occasions. The time between administrations is normally limited to no more than two weeks in order to minimize the effect of learning on true scores.
Validity is the extent to which a test measures what it is intended to measure. Test validity consists of content, face and construct validity.

PART ONE: INTRODUCTION

1. Rationale
Testing is necessary in the process of language teaching and learning; therefore it has gained much attention from teachers and learners. Through testing, teachers can evaluate learners' achievements in a certain learning period, assess their own teaching methods and provide input into the process of language teaching (Bachman, 1990, p. 3). Thanks to testing, learners can also self-assess their English ability to examine whether their level of English meets the demands of employment or studying abroad. The important role of tests makes test evaluation necessary: by evaluating tests, test designers can arrive at the best test papers for assessing their students. Despite the importance of testing, in many schools tests are designed without following any rigorous principles or procedures; thus their validity and reliability are open to doubt. At HATECHS, the final English course tests had been designed by teachers at the English Department at the end of the course, and some tests were used repeatedly with no adjustment.
In the school year 2006–2007, there was a change in test design: final tests were designed according to the PET (Preliminary English Test) format. The PET belongs to the Cambridge testing system for English for Speakers of Other Languages. Based on the PET, a new final reading test was developed and used as an instrument to assess students' achievement in reading. The test was delivered to students at the end of the school year 2006–2007, but it was never evaluated. To decide whether the test is reliable and valid, a serious study is needed. The context at HATECHS has inspired the author, a teacher of English, to take this opportunity to undertake the study entitled "Evaluating a Final English Reading Test for the Students at Hanoi Technical and Professional Skills Training School", with the aim of evaluating the test to check its validity and reliability. The author was also eager to have a chance to offer suggestions to help test designers produce better and more effective tests for their students.

2. Objectives of the study
The study is aimed at evaluating the final reading test for the students at Hanoi Technical and Professional Skills Training School. The test takers are non-majors. The results of the test will be analyzed, evaluated and interpreted with the following aims:
- to calculate the internal consistency reliability of the test
- to check the face and construct validity of the test

3. Scope of the study
Test evaluation is a wide concept, and there are many criteria for evaluating a test. Normally, a test evaluator considers four major criteria: item difficulty, discrimination, reliability and validity. However, item difficulty and discrimination are said to be difficult to evaluate and interpret; therefore, within this study the researcher focuses on the reliability and the validity of the test as a whole.
At HATECHS, at the end of Semester 1, there is a reading achievement test, and at the end of the first year, after 120 periods of studying English, there is a final reading test. The researcher chose the final test to evaluate its internal consistency reliability and its face and construct validity.

4. Methodology of the study
In this study, the author evaluated the test by adopting both qualitative and quantitative methods. The research is quantitative in the sense that the data were collected through the analysis of the scores on 30 randomly selected papers of students at the Faculty of Finance and Accounting. To calculate the internal consistency reliability, the researcher used the Kuder-Richardson 21 formula, and the Pearson correlation coefficient formula was adopted to calculate the validity coefficient. The study is qualitative in its use of a semi-structured interview with open questions, which were delivered to teachers at HATECHS at the annual meeting on the teaching syllabus and methodology. The conclusions of that discussion were used as the qualitative data of the research.

5. The organization of the study
The study is divided into three parts:
Part One: Introduction – presents basic information such as the rationale, the scope, the objectives, the methods and the organization of the study.
Part Two: Development – consists of two chapters:
Chapter 1: Literature Review – reviews the literature related to language testing and test evaluation.
Chapter 2: Methodology and Results – is concerned with the methods of the study, the selection of participants, the materials, the methods of data collection and analysis, and the results of the data analysis.
Part Three: Conclusion – provides a summary of the study, its limitations and recommendations for further studies.
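As a rough illustration of the first statistic named in the methodology, the Kuder-Richardson 21 estimate can be sketched in Python as follows. The formula and its terms (number of items k, mean total score M, variance of the totals) follow the standard KR-21 definition; the score list is invented for illustration and is not the study's data.

```python
# Kuder-Richardson 21: r = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
# where k = number of items, M = mean total score, s^2 = variance of totals.
from statistics import mean, pvariance

def kr21(scores, k):
    """KR-21 internal consistency estimate, computed from total scores alone."""
    m = mean(scores)
    var = pvariance(scores)  # population variance of the total scores
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

# Invented total scores for ten students on a 35-item reading test:
scores = [28, 22, 31, 18, 25, 27, 20, 30, 24, 26]
print(round(kr21(scores, k=35), 3))
```

KR-21 is convenient precisely because it needs only the totals, the item count and the variance; KR-20, by contrast, requires the per-item proportions correct, which are not always available to an evaluator working from score sheets.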
The thesis ends with the References and Appendices.

PART TWO: DEVELOPMENT

CHAPTER 1: LITERATURE REVIEW
This chapter attempts to establish the theoretical background for the study. Approaches to language testing and to testing reading, as well as the literature on test evaluation, will be reviewed.

1.1. Language testing
1.1.1. Approaches to language testing
1.1.1.1. The essay translation approach
According to Heaton (1998), this approach is commonly referred to as the pre-scientific stage of language testing. In this approach, no special skill or expertise in testing is required. Tests usually consist of essay writing, translation and grammatical analysis. The tests, for Heaton, also have a heavy literary and cultural bias. He also notes that public examinations (i.e. secondary school leaving examinations) resulting from the essay translation approach sometimes have an aural/oral component at the upper intermediate and advanced levels, though this has sometimes been regarded in the past as something additional and in no way an integral part of the syllabus or examination (p. 15).

1.1.1.2. The structuralist approach
"This approach is characterized by the view that language learning is chiefly concerned with the systematic acquisition of a set of habits. It draws on the work of structural linguistics, in particular the importance of contrastive analysis and the need to identify and measure the learner's mastery of the separate elements of the target language: phonology, vocabulary and grammar. Such mastery is tested using words and sentences completely divorced from any context on the grounds that language forms can be covered in the test in a comparatively short time. The skills of listening, speaking, reading and writing are also separated from one another as much as possible because it is considered essential to test one thing at a time" (Heaton, 1998, p. 15).
According to him, this approach is still valid for certain types of test and for certain purposes, such as the desire to concentrate on the testees' ability to write by attempting to separate a composition test from reading. The psychometric approach to measurement, with its emphasis on reliability and objectivity, forms an integral part of structuralist testing. Psychometrists were able to show early on that such traditional examinations as essay writing are highly subjective and unreliable. As a result, the need for statistical measures of reliability and validity came to be considered of the utmost importance in testing: hence the popularity of the multiple-choice item, a type of item which lends itself admirably to statistical analysis.

1.1.1.3. The integrative approach
Heaton (1998, p. 16) considers this approach the testing of language in context; it is thus concerned primarily with meaning and the total communicative effect of discourse. As a result, integrative tests do not seek to separate language skills into neat divisions in order to improve test reliability; instead, they are often designed to assess the learner's ability to use two or more skills simultaneously. Thus, integrative tests are concerned with a global view of proficiency: an underlying language competence or 'grammar of expectancy' which, it is argued, every learner possesses regardless of the purpose for which the language is being learnt. Integrative testing, according to Heaton (1998), is best characterized by the use of cloze testing and dictation. Besides these, oral interviews, translation and essay writing are also included in many integrative tests, a point frequently overlooked by those who take too narrow a view of integrative testing. Heaton (1998) points out that the cloze procedure, as a measure of reading difficulty and reading comprehension, is treated briefly in the relevant section of his chapter on testing reading comprehension.
Dictation, another major type of integrative test, was previously regarded solely as a means of measuring students' listening comprehension. Thus, the complex elements involved in tests of dictation were largely overlooked until fairly recently. The integrated skills involved in dictation include auditory discrimination, auditory memory span, spelling, the recognition of sound segments, familiarity with the grammatical and lexical patterning of the language, and overall textual comprehension.

1.1.1.4. The communicative approach
According to Heaton (1998, p. 19), "the communicative approach to language testing is sometimes linked to the integrative approaches. However, although both approaches emphasize the importance of the meaning of utterances rather than their form and structure, there are nevertheless fundamental differences between the two approaches". The communicative approach is said to be very humanistic, in the sense that each student's performance is evaluated according to his or her degree of success in performing the language tasks rather than solely in relation to the performance of other students (Heaton, 1998, p. 21). However, the communicative approach to language testing reveals two drawbacks. First, teachers find it difficult to assess students' ability without comparing achievement results among students. Second, the communicative approach is claimed to be somewhat unreliable because of the variety of real-life situations (Hoang, 2005, p. 8). Nevertheless, Heaton (1988) proposes a solution to this problem: in his view, to avoid the lack of reliability, very carefully drawn-up and well-established criteria must be designed, though he does not set out any criteria in detail. In a nutshell, each approach to language testing has its strong points as well as its weak points. Therefore, a good test should incorporate features of all four approaches (Heaton, 1988, p. 15).

1.1.2. Classifications of language tests
Language tests may be of various types, and different scholars hold different views on the types of language tests. Henning (1987), for instance, establishes seven kinds of language test, which can be summarized as follows:
1. Objective vs subjective tests: Objective tests have a clear marking scale and do not require much judgement from markers. Subjective tests are scored based on the raters' judgements or opinions; they are claimed to be unreliable and rater-dependent.
2. Direct vs indirect tests: Direct tests take the form of spoken tests (in real-life situations). Indirect tests take the form of written tests.
3. Discrete vs integrative tests: Discrete tests are used to test knowledge in restricted areas. Integrative tests are used to evaluate general language knowledge.
4. Aptitude, achievement and proficiency tests: Aptitude tests (intelligence tests) are used to select students for a special programme. Achievement tests are designed to assess students' knowledge of already-learnt areas. Proficiency tests (placement tests) are used to select students for a desired field.
5. Criterion-referenced vs norm-referenced tests: In criterion-referenced tests, the instructions are designed after the tests are devised, and the tests follow the teaching objectives closely. In norm-referenced tests, a large number of people from the target population take the test, and standards of achievement such as the mean or average score are established after the course.
6. Speed tests vs power tests: Speed tests consist of items within the candidates' ability, but the time allowed seems insufficient. Power tests contain difficult items, but the time allowed is sufficient.
7. Others.
Table 1: Types of language tests (Source: Henning, 1987, pp. 4-9)

However, Hughes (1989) mentions two categories: kinds of tests and kinds of language testing.
Basically, kinds of language testing consist of direct vs indirect testing, norm-referenced vs criterion-referenced testing, discrete vs integrative testing, and objective vs subjective testing (Hughes, 1989, pp. 14-19). Apart from these, he develops one more type called communicative language testing, described as the assessment of the ability to take part in acts of communication (Hughes, 1989, p. 19). Hughes also discusses kinds of tests, which can be illustrated as follows:
1. Proficiency tests: measure sufficient command of the language for a particular purpose.
2. Achievement tests: final achievement tests are organized at the end of the course; progress achievement tests measure students' progress.
3. Diagnostic tests: find students' strengths and weaknesses, and what further teaching is necessary.
4. Placement tests: classify students into classes at different levels.
Table 2: Types of tests (Source: Hoang, 2005, p. 13, as cited in Hughes, 1990, pp. 9-14)

Language tests are divided into two types by McNamara (2000) based on test methods and test purposes. Regarding test methods, he believes that there exist two basic types: traditional paper-and-pencil language tests, used to assess either separate components of language or receptive understanding, and performance tests. Regarding test purposes, he divides language tests into achievement tests and proficiency tests.

1.2. Testing reading
Reading can be defined as the interaction between the reader and the text (Aebersold & Field, 1997). This dynamic relationship portrays the reader as creating the meaning of the text in relation to his or her prior knowledge (Anderson, 1999). Reading is one of the four main skills and plays a decisive role in the process of acquiring a language. Therefore, testing reading comprehension is also important. Traditionally, the value of testing reading has not been in doubt, both because of the social importance of literacy and because reading tests are considered more reliable than speaking tests.
Alderson (1996) proposes that reading teachers feel uncomfortable testing reading. In his view, although most teachers use a variety of techniques in their reading classes, they do not tend to use the same variety of techniques when they administer reading tests. Despite the variety of testing techniques, none of them is subscribed to as the best one. Alderson (1996, 2000) considers that no single method satisfies reading teachers, since each teacher has different purposes in testing. He lists a number of test techniques or formats often used in reading assessment, such as cloze tests, multiple-choice techniques, alternative objective techniques (e.g. matching techniques, ordering tasks, dichotomous items), editing tests, alternative integrated approaches (e.g. the C-test, the cloze elide test), short-answer tests (e.g. the free-recall test, the summary test, the gapped summary), and information-transfer techniques. Among the many approaches to testing reading comprehension, the three principal methods have been the cloze procedure, multiple-choice questions and short-answer questions (Weir, 1997). The cloze test is now a well-known and widely used integrative language test. Wilson Taylor (1953) first introduced the cloze procedure as a device for estimating the readability of a text. However, what brought the cloze procedure widespread popularity were the investigations of the cloze test as a measure of ESL proficiency (Jonz, 1976, 1990; Bachman, 1982, 1985; Brown, 1983, 1993). The results of the substantial volume of research on cloze tests have been extremely varied. Furthermore, major technical defects have been found with the procedure. Alderson (1979), for instance, showed that changes in the starting point or deletion rate affect reliability and validity coefficients. Other researchers, such as Carroll (1980), Klein-Braley (1983, 1985) and Brown (1993), have questioned the reliability and various aspects of the validity of cloze tests.
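The fixed-ratio deletion that the starting-point and deletion-rate findings refer to is mechanical enough to sketch in a few lines of Python. This is a hypothetical illustration (the function name, the sample sentence and the chosen rate are invented for the example), not a procedure taken from any of the studies cited:

```python
def make_cloze(text, rate=7, lead_in=5):
    """Build a fixed-ratio cloze: leave `lead_in` words intact, then
    delete every `rate`-th word, returning the gapped text and the key."""
    words = text.split()
    gapped, key = [], []
    for i, word in enumerate(words, start=1):
        if i > lead_in and (i - lead_in) % rate == 0:
            key.append(word)  # deleted word goes into the answer key
            gapped.append("(%d) ______" % len(key))
        else:
            gapped.append(word)
    return " ".join(gapped), key

text = "The quick brown fox jumps over the lazy dog near the quiet river bank"
gapped, key = make_cloze(text, rate=4, lead_in=2)
print(gapped)
print(key)
```

Changing `lead_in` (the starting point) or `rate` (the deletion rate) produces a different set of deleted words from the same passage, which is one way to see why Alderson (1979) found that these choices affect reliability and validity coefficients.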
According to Heaton (1998), "cloze test was originally intended to measure the reading difficulty level of the text. Used in this way, it is a reliable means of determining whether or not certain texts are at an appropriate level for particular groups of students" (p. 131). However, for Heaton the most common purpose of the cloze test is to measure reading comprehension. It has long been argued that cloze measures text-level comprehension, involving the interdependence of phrases, sentences and paragraphs within the text. A true cloze is generally said to measure global reading comprehension, although insights can undoubtedly be gained into particular reading difficulties. In contrast, Cohen (1998) concludes that cloze tests do not assess global reading ability but rather local-level reading. Each researcher tends to present evidence to support his or her own argument; however, most agree that the cloze procedure is effective in testing reading comprehension. Another technique that Alderson (1996, 2000), Cohen (1998) and Hughes (2003) discuss is the multiple-choice question, a common device for testing text comprehension. Ur (1996, p. 38) defines multiple-choice questions as consisting "... of a stem and a number of options (usually four), from which the testee has to select the right one". Alderson (2000, p. 211) states that multiple-choice test items are popular because they provide testers with the means to control test-takers' thought processes when responding; they "... allow testers to control the range of possible answers ...". Weir (1993) points out that short-answer tests are extremely useful for testing reading comprehension. According to Alderson (1996, 2000), short-answer tests are seen as "a semi-objective alternative to multiple choice". Cohen (1998) argues that open-ended questions allow test-takers to copy the answer from the text, but one first needs to understand the text in order to write the right answer.
Test-takers are supposed to answer a question briefly by drawing conclusions from the text, not just responding 'yes' or 'no'; they are supposed to infer meaning from the text before answering the question. Such tests are not easy to construct, since the tester needs to foresee all possible answers. Hughes (2003, p. 144) points out that "the best short-answer questions are those with a unique correct response". However, scoring the responses depends on thorough preparation of the answer key. Hughes (2003) proposes that this technique works well when the aim is to test the ability to identify referents. The techniques above are those usually used in testing reading; however, it is difficult to say which is the most effective, because the choice depends on the teacher's purpose in assessing the students.

1.3. Criteria in evaluating a test
Test evaluation is a complicated process which requires the analysis of a number of criteria. There are five main criteria against which most researchers evaluate their tests: the mean, the difficulty level, discrimination, reliability and validity.

1.3.1. The mean
According to the dictionary of language testing by Milanovic and other authors, the mean, also called the arithmetical average, is a descriptive statistic measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores. Like other measures of central tendency, the mean gives an indication of the trend, or the score which is typical of the whole group. In normal distributions, the mean is closely aligned with the median and the mode. This measure is by far the most commonly used, and it is the basis of a number of statistical tests of comparison between groups commonly used in language testing (Milanovic et al, 1999, p. 118). In language test evaluation, the mean is also a criterion that needs evaluating, because the mean score of the test indicates how difficult or easy the test was for the given group.
This is useful for evaluators in making reasonable adjustments to the test as a whole.

1.3.2. The difficulty level
The difficulty level of a test indicates how difficult or easy each item of the test is. Difficulty also reflects the ability range of a particular candidate or group of candidates. "In language testing, most tests are designed in such a way that the majority of items are not too difficult or too easy for the relevant sample of test candidates" (Milanovic et al, 1999, p. 44). Item difficulty requirements vary according to test purpose. In a selection test, for example, there may be no need for finely graded assessment within the 'pass' or 'fail' groups, so the most efficient test design will have a majority of items clustering near the critical cut-score. Information about item difficulty is also useful in determining the order of items on a test. Tests tend to begin with easy items in order to boost confidence and to ensure that weaker candidates do not waste valuable time on items which are too difficult for them. For test evaluators, the difficulty level of a test should be analyzed because of its importance in deciding the sequence of items on a test; it is also one of the factors that affect the test scores of test-takers.

1.3.3. Discrimination
According to Heaton, "the discrimination index of an item indicates the extent to which the item discriminates between the testees, separating the more able testees from the less able" (Heaton, 1998, p. 179). For him, the index of discrimination tells us whether those students who performed well on the whole test tended to do well or badly on each item in the test. In Milanovic's definition, discrimination is understood as a fundamental property of language tests in their attempt to capture the range of individual abilities; on that basis, discrimination is an important indicator of a test's reliability (Milanovic et al, 1999, p. 48). By looking at the test scores, evaluators can check discrimination.
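The statistics described in 1.3.1-1.3.3 can all be computed directly from a table of item responses. The sketch below assumes a hypothetical 0/1 matrix (one row per student, one column per item) and uses one common convention for the discrimination index, D = p(upper group) - p(lower group), comparing the top and bottom thirds of students ranked by total score; the data are invented, not taken from the study.

```python
def mean_score(answers):
    """Mean total score across students."""
    return sum(map(sum, answers)) / len(answers)

def facility(answers, item):
    """Item difficulty (facility value): proportion answering `item` correctly."""
    return sum(row[item] for row in answers) / len(answers)

def discrimination(answers, item, frac=1 / 3):
    """D = p(upper) - p(lower), using the top and bottom `frac` of
    students ranked by total score (one common convention)."""
    ranked = sorted(answers, key=sum, reverse=True)
    n = max(1, int(len(ranked) * frac))
    p_upper = sum(row[item] for row in ranked[:n]) / n
    p_lower = sum(row[item] for row in ranked[-n:]) / n
    return p_upper - p_lower

# Invented responses of six students to four items (1 = correct):
answers = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
print(mean_score(answers))
print(facility(answers, 0), discrimination(answers, 0))
```

A facility value near 1.0 marks a very easy item; an item whose D is near zero or negative fails to separate the stronger students from the weaker ones and is a candidate for revision.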
Because of its decisive role in separating test takers into stronger and weaker groups, the discrimination of a test needs analyzing in the process of evaluating a test.

1.3.4. Reliability
Reliability is another property of a test that should be estimated by the test evaluator. "Reliability is often defined as consistency of measurement" (Bachman & Palmer, 1996, p. 19). A reliable test score will be consistent across different characteristics of the testing situation; thus, reliability can be considered a function of the consistency of scores from one set of test tasks to another. Reliability also means "the consistency with which a test measures the same thing all the time" (Harrison, 1987, p. 24). For test evaluators, reliability can be estimated by several methods, such as "parallel form, split half, rational equivalence, test-retest and inter-rater reliability checks" (Milanovic et al, 1999, p. 168). According to Shohamy (1985), the types of reliability, their descriptions and the ways to calculate them are summarized in the following table:
1. Test-retest: the extent to which test scores are stable from one administration to another, assuming that no learning occurred between the two occasions. Calculated as the correlation between scores on the same test given on two occasions.
2. Parallel form: the extent to which two tests drawn from the same domain measure the same things. Calculated as the correlation between two forms of the same test, on one occasion or on different occasions.
3. Internal consistency: the extent to which the test questions are related to one another and measure the same trait. Calculated with the Kuder-Richardson 21 formula.
4. Intra-rater: the extent to which the same rater is consistent in his or her rating from one occasion to another. Calculated as the correlation between scores given by the same rater on different occasions, or on one occasion but with different test-takers.
5. Inter-rater: the extent to which different raters agree about the assigned score or rating. Calculated as the correlation among ratings provided by different raters.
Table 3: Types of reliability (Source: Hoang, 2005, p. 31, as cited in Shohamy, 1985, p. 71)

However, reliability is said to be a necessary but not a sufficient quality of a test, and the reliability of a test is closely interlocked with its validity. While reliability focuses on the empirical aspects of the measurement process, validity focuses on theoretical aspects and seeks to interweave these concepts with the empirical ones. For this reason it is easier to assess reliability than validity. Test reliability can be analyzed by looking at the test scores: if the scores remain unchanged on the different occasions on which the test is taken, the test is said to be reliable, and vice versa. However, this depends on conditions and situations such as the circumstances in which the test is taken, the way in which it is marked and the uniformity of the assessment it makes; evaluators must therefore take these into account when they estimate the reliability of a test.

1.3.5. Validity
Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness and usefulness of the specific inferences made from test scores. Test evaluation is the process of accumulating evidence to support such inferences. Validity, however, is a unitary concept: although evidence may be accumulated in many ways, validity refers to the degree to which that evidence supports the inferences that are made from scores. It is the inferences regarding specific uses of a test that are validated, not the test itself. Traditionally, validity evidence has been gathered in three distinct categories: content-related, criterion-related and construct-related evidence of validity. More recent writing on validity theory stresses the importance of viewing validity as a 'unitary concept' (Messick, 1989).
Thus, while validity evidence is presented in separate categories, this categorization is principally an organizational technique for the presentation of research. According to Milanovic et al. (1999), content and construct validity are conceptual, whereas concurrent and predictive (criterion-related) validity are statistical. In other words, scores obtained on the test may be used to investigate criterion-related validity, for example by relating them to other test scores or to measures such as teachers' assessments or future predictions (pp. 220-221). Another type of test validity, for Milanovic et al., is face validity, which refers to the degree to which a test appears to measure the knowledge or abilities it claims to measure, as judged by an untrained observer such as the candidate taking the test or the institution which plans to administer it (Milanovic et al., 1999, p. 221). In a book by Alderson et al. (1995), the authors divide validity into other categories: internal, external and construct validity. Internal validity, according to them, consists of three sub-types: face, content and response validity. External validity has two sub-types: concurrent and predictive validity. Construct validity relates to five forms: comparison with theory, internal correlations, comparison of biodata and psychological characteristics, multitrait-multimethod analysis with convergent-divergent validation, and factor analysis (Alderson et al., 1995, pp. 171-186). Since the validity of a test has been paid much attention by a number of researchers, test evaluators should take time to check the validity of a test based on the categories proposed by these authors. Through the test scores, evaluators check whether the test is valid or not, so that they can make good adjustments to the test they evaluate.
Summary: In this chapter, we have attempted to establish the theoretical framework for the thesis. Language testing is one of the most important procedures for language teachers in assessing students. A number of approaches to language testing and to testing reading have been discussed in the first part of the chapter. The second matter explored in the chapter is the theory of test evaluation, which concerns the criteria of a test that need to be analyzed by test evaluators.

CHAPTER 2: METHODOLOGY AND RESULTS

This chapter includes the research questions, the selection of the participants who took part in the study, and the testing materials. The methods of data collection and data analysis as well as the results are presented afterwards.

2.1. Research questions
On the basis of the literature review, this chapter aims at answering two research questions:
1) Is the final reading test for the students at HATECHS reliable?
2) To what extent is the final reading test valid in terms of face and construct validity?

2.2. The participants
The students at HATECHS come from different provinces, cities and towns in the North of Vietnam. They are generally aged between 18 and 21. Thirty participants were chosen randomly from the students at the Faculty of Finance and Accounting in the school year 2006-2007. All of them are first-year students. In addition, seven teachers at the English Department were chosen for the interview. These teachers are all female, and most have more than five years' experience of teaching English. All of them taught the students in the school year 2006-2007. At the school, the students take an English course in the first year. The course is divided into two components, each lasting 60 periods. It is a compulsory subject at the school. After finishing the course, students are required to reach pre-intermediate level. However, students often have varying English levels prior to the course.
Some of them have learnt English for 7 years at high school, while others have learnt it for only 3 years, depending on the part of the country they come from. Some of them have never learnt English at all, because at lower levels of schooling they learned other foreign languages. It is therefore important for teachers to apply appropriate teaching methods to help them become more proficient. It is also critical that teachers give them suitable tests which meet their needs and the requirements of the subject.

2.3. Instrumentation and Data collection
2.3.1. Course objectives, Syllabus and Materials used for the students at HATECHS
In this section, we will discuss the course objectives, the syllabus, the course book for teaching reading and the standard of evaluation for students at HATECHS.

2.3.1.1. Course objectives
The teaching objectives for reading are to help students, after finishing the English course at HATECHS, to:
- be aware of reading skill techniques;
- enrich their vocabulary in various topics;
- reach pre-intermediate proficiency level.

2.3.1.2. Syllabus
During the course, the book that the students use is Let's Study by Do Tuan Minh, National University Publication House, 2005. The book consists of 20 units; in the first semester the students are expected to cover the first 10 units, and in the second semester the last 10 units. The final reading test is based on the contents of the last 10 units. The total time for the whole semester is 60 classes over 10 weeks (each class lasts 45 minutes), and in each class one fourth of the time is for reading. The syllabus is described in the following table:

Unit  Title                            Time (classes)  Pages
11    My hometown                      5               81
12    What's the weather like today?   5               87
13    Traveling                        5               92
14    Holidays and festivals           5               99
15    Future jobs                      5               106
      Stop and check and test 1        2
16    A British Wedding                5               117
17    At school                        5               125
18    City life and country life       6               133
19    Part-time jobs                   6               140
20    Social evils                     6               147
      Stop and check, test 2           2
      General revision                 2
      Final test                       1
      Total                            60

Table 4: The syllabus for teaching English - Semester 2

2.3.1.3. Assessment Instruments
* Standard for the Final English Reading Test
Based on what the students have been taught, teachers of the English Department design the reading test to measure students' achievement according to the course objectives. The Preliminary English Test (PET) is used as a model to construct the final reading test for students at HATECHS; only the reading component of the PET is used. Accordingly, the PET's standard for a reading test is presented below. According to this standard, a reading test should consist of five parts, as follows. The PET (see Appendix 2) was also chosen as the criterion measure to evaluate the final reading test.

Part  Texts                          Items   Questions
1     Five signs                     01-05   5 multiple-choice questions
2     Eight related texts            06-10   5 paragraphs of people descriptions and related texts
3     Text for getting information   11-20   10 True/False questions
4     Text with viewpoints or ideas  21-25   5 multiple-choice questions
5     Text with gap filling          26-35   10 multiple-choice questions

Table 5: Components of the PET reading test

* The Final Reading Test for students at HATECHS
Based on the PET's standard, the components of the final reading test are summarized in the next table (see the Appendix for the details of the test and the key):

Part  Texts                          Items   Weight   Marks
1     Five signs                     5       20 %     5
2     Eight related texts            5       20 %     5
3     Text for getting information   10      20 %     10
4     Text with viewpoints or ideas  5       20 %     5
5     Text with gap filling          10      20 %     10
      Total                          35      100 %    35

Table 6: Components of the final reading test

The test was designed in the PET format, so all the instructions are clear.

2.3.2. Data collection
Step 1: The students took the final reading test as part of the final examination for the course. We then randomly chose 30 test papers of students at the Accounting and Finance Department for the study.
Step 2: The students who were chosen as participants were asked to take the PET. This test took place two weeks after the final exam. The participants had not been told in advance when the test would take place.
Step 3: The papers of the final reading test and the PET were collected and marked by two teachers. The two teachers were also randomly chosen, and they did not know the participants. The tests were marked according to the keys provided by the test designers.
Step 4: The researcher conducted a semi-structured interview with seven teachers at the English Department. The researcher gave the teachers questions related to the final reading test for discussion, noted down their opinions and points of view, and used them for the study.

2.3.3. Data analysis and Results
The reliability of the test was calculated with the Kuder-Richardson Formula 21, based on the test scores. The formula yields a coefficient that shows the reliability of the test. To find out the face validity, the researcher conducted the semi-structured interview with the teachers at the teachers' annual meeting at the beginning of the school year. In this meeting, teachers mostly discuss teaching methods and syllabus improvement. There are 7 teachers at the English Department, and at the meeting of the school year 2007-2008 the main discussion, at the researcher's suggestion, was about the new final test for students at HATECHS. In order to find evidence for the construct validity of the test, the researcher also asked the participants to take a reading test from the sample PET for the 2007 exams by Cambridge ESOL.
Then the test papers were collected and marked by two teachers at the Department. After that, the researcher collected the test scores and interpreted them using statistical instruments. In this study, the researcher used the Pearson Correlation Coefficient formula to check the validity of the test.

2.3.3.1. Test score analysis
First of all, the raw scores of the final reading test and the PET are presented in the table below. These raw scores include the scores on each test as a whole and the detailed scores on each part.

Ss    Final test                      PET
      Whole  P1  P2  P3  P4  P5      Whole  P1  P2  P3  P4  P5
1     32     5   5   8   5   9       28     5   4   8   3   8
2     32     5   5   9   4   9       25     5   3   7   3   7
3     32     5   5   7   5   10      28     5   3   7   3   10
4     28     4   4   8   4   8       28     5   4   8   2   9
5     28     5   4   7   4   8       25     5   3   7   3   7
6     28     5   3   8   5   7       24     5   3   6   3   7
7     26     5   3   6   3   9       24     5   4   7   2   6
8     26     5   4   7   2   8       23     4   2   6   3   8
9     26     5   3   8   5   5       22     4   2   7   2   7
10    26     4   4   7   4   7       23     4   3   5   2   9
11    25     5   4   6   4   6       23     5   4   6   2   6
12    25     5   3   7   5   5       22     4   2   5   3   8
13    25     4   4   7   4   6       23     4   3   5   2   9
14    25     4   3   8   4   6       22     4   2   5   3   8
15    23     4   2   7   4   8       23     5   2   5   3   8
16    23     4   4   6   5   4       20     3   3   4   2   6
17    23     4   3   6   2   8       20     4   2   4   2   6
18    23     4   4   7   4   4       21     4   2   4   2   9
19    21     5   3   6   2   5       20     3   3   5   3   6
20    21     4   4   5   3   5       21     5   2   5   2   7
21    18     3   2   5   5   3       21     4   3   6   2   6
22    18     3   1   5   2   7       19     4   2   5   1   7
23    18     4   3   1   3   7       18     3   1   4   1   9
24    14     3   2   4   2   3       19     4   2   4   2   7
25    14     3   2   4   2   3       18     3   2   5   2   6
26    11     2   1   3   2   3       14     2   2   4   2   4
27    7      2   1   2   1   1       14     4   1   4   1   4
28    7      2   1   1   2   1       4      2   1   0   0   1
29    4      1   0   0   2   1       14     3   1   2   1   7
30    4      1   0   1   0   2       4      2   0   0   1   1

Table 7: The raw scores of the final reading test and the PET
*Note: The total score of each test is 35.

The mean and the standard deviation are calculated by the following formulas:

M = Σfx / N

where M is the mean, Σ the sum, N the number of students, x a raw score and f its frequency; and

SD = sqrt( Σ(x − x̄)² / N )

where x is a raw score, x̄ the mean, Σ the sum and N the number of students. The results are presented in the following table:

      Final test                                    PET
      Whole  P1    P2    P3    P4    P5            Whole  P1    P2    P3    P4    P5
M     21     3.8   2.9   5.5   3.3   5.6           20     4.0   2.5   4.9   2.1   6.8
SD    7.92   1.20  1.40  2.40  1.37  2.60          5.67   0.96  1.10  1.90  0.77  2.08

Table 8: Means and Standard Deviations

2.3.3.2. The reliability of the test
Estimating reliability is difficult, and it is normally impossible to achieve a perfectly reliable test, but test constructors must make their tests as reliable as possible. They do this by reducing the causes of unsystematic variation to a minimum. They should ensure, for example, that the test instructions are clear and that there are no ambiguous items.

The final reading test is an objective test, so the types of reliability related to correlations between raters need not be calculated: the raters mark the test against the keys provided by the test designers, and the scores would therefore be the same for different raters on any occasion. In other words, the parallel-form, intra-rater and inter-rater reliability need not be calculated. In language testing, test-retest reliability is also not an appropriate strategy, since psychologically the students always want to get better results the second time they take a test.

Internal Consistency Reliability
There are several techniques for calculating internal consistency reliability, such as split-half, Kuder-Richardson Formula 20 and Kuder-Richardson Formula 21. In practice, Formula 21 is said to be the easiest to compute: it can be used when the variances of the individual items cannot be calculated, or are very difficult to calculate. The formula is as follows:

r_tt = [n / (n − 1)] × [1 − x̄(n − x̄) / (n · s_t²)]

where r_tt is the Kuder-Richardson reliability, n the number of items in the test, x̄ the mean score on the test, and s_t² the variance of the test scores. Shohamy (1985), on the other hand, presents Kuder-Richardson Formula 21 in another form:

R_xx = 1 − x̄(K − x̄) / (K · SD²)

where x̄ is the mean, SD the standard deviation, and K the number of items on the test. Using Kuder-Richardson Formula 21 in Shohamy's form, the researcher calculated R_xx as shown in the following table:

Test        x̄ (mean)   SD (standard deviation)   R_xx
Final test  21          7.92                      0.87
PET         20          5.67                      0.80

Table 8: The reliability coefficients

In theory, the ideal reliability coefficient is 1. A test with a reliability coefficient of 1 would give precisely the same results for a particular set of candidates regardless of when it is administered. A test with a reliability coefficient of 0 would give sets of results quite unconnected with each other, and would fail to be reliable. For a reading test, according to Lado (1961), a highly reliable reading test usually falls in the 0.90 to 0.99 range of reliability coefficients, while oral tests or essay types may fall in the 0.70 to 0.79 range (cited in Hughes, 1989, p. 32). Table 8 shows the reliability coefficients calculated from the test scores. The coefficient R_xx for the final test is 0.87. From this result, the final reading test is quite good in relation to Lado's (1961) range of reliability coefficients. In other words, the final reading test for students at HATECHS is reliable in terms of internal consistency.

In testing, a valid test must be reliable, but a reliable test may not be valid at all. The final reading test for students at HATECHS is reliable, as shown above. To find out whether it is also valid, evidence for test validity is explored in the next part.

2.3.3.3. The test validity
The primary concern for any test is that the interpretations and uses we make of the test scores are valid. The evidence that we collect in support of the validity of a particular test can be of three general types: content relevance, criterion relatedness and meaningfulness of construct (Rouhani, 2006, as cited in Bachman, 1990). For a teacher setting a test, face validity is also vital (Harrison, 1983, p. 11). Within this thesis, the author wished to examine only the face and construct validity, which are said to be important for test designers.
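The descriptive statistics and the KR-21 estimate reported in Table 8 can be reproduced with a short script. This is only an illustrative sketch: the score list is transcribed from the whole-test column of Table 7, and the population standard deviation is used, matching the SD formula given above.

```python
# Sketch: mean, standard deviation and KR-21 (Shohamy's form) for the
# final reading test, using the whole-test scores from Table 7.
from math import sqrt

final_scores = [32, 32, 32, 28, 28, 28, 26, 26, 26, 26, 25, 25, 25, 25, 23,
                23, 23, 23, 21, 21, 18, 18, 18, 14, 14, 11, 7, 7, 4, 4]

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    m = mean(xs)
    # population SD, as in the thesis: SD = sqrt(sum((x - mean)^2) / N)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def kr21(m, sd, k):
    """Kuder-Richardson Formula 21, Shohamy's form:
    Rxx = 1 - m(K - m) / (K * SD^2), with K the number of items."""
    return 1 - m * (k - m) / (k * sd ** 2)

m = mean(final_scores)      # about 21.1 (reported as 21)
sd = stdev(final_scores)    # about 7.93 (reported as 7.92)
rxx = kr21(m, sd, k=35)     # about 0.87, matching Table 8
```

The small differences from the reported figures come only from rounding the mean and SD before substituting them into the formula.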
Face validity
Face validity, for Harrison (1983), is concerned with what teachers and students think of the test. The question is whether the test appears to teachers and students a reasonable way to assess the students, or whether it seems trivial, too difficult, or unrealistic. Face validity is therefore found out only by asking the teachers and students concerned for their opinions, either formally by means of a questionnaire or informally through discussion in the staff room. At the annual meeting of teachers at the English Department at HATECHS, there were a number of arguments about the new final reading test. On the question of whether it is a reasonable instrument for assessing students, six of the seven teachers agreed that the final reading test was an effective instrument for evaluating students' achievement in reading. This agreed with the result the researcher obtained in calculating the mean of the test scores, which was 21. However, the one teacher who disagreed with the final test suggested assessing students' reading ability with the cloze procedure instead, which is said to be an effective procedure for assessing students' achievement. With the mean of the test scores at 21, and with only a small number of students scoring under 18, most teachers agreed that the test was reasonably difficult; in other words, it was neither too difficult nor too easy. On the question of the realism of the test, most teachers agreed that it is highly realistic: when designing the test, the test designers used a number of texts from the textbook, which helped students feel familiar with what they had been taught. The test designers intended the test to be realistic. To sum up, from the teachers' discussion the researcher found that the final reading test for students at HATECHS has face validity.
Construct validity
According to Hughes (1995), we should randomly choose participants to take two tests (a treatment test and a criterion test) and then compare the results of the two tests. If the comparison between the two sets of scores reveals a high level of agreement, then the treatment test may be considered valid (Hughes, 1995, pp. 23-34). In this study, the two tests are the final test and the PET: the final test is the treatment test and the PET is the criterion. The agreement between the two tests is called the 'validity coefficient', a mathematical measure of similarity. To find the validity coefficient, the researcher used the formula called the Pearson Correlation Coefficient, symbolized as r:

r = Σxy / (N · Sx · Sy)

where r is the validity coefficient; x = X − X̄ (X a score and X̄ the mean on the treatment test); y = Y − Ȳ (Y a score and Ȳ the mean on the criterion test); N is the number of students; Sx is the standard deviation of the treatment test; and Sy is the standard deviation of the criterion test. According to the formula, it is first necessary to find the total of x·y.
This total is based on the scores of the two tests, and the result is shown below:

Ss    X     Y     X − X̄   Y − Ȳ   (X − X̄)(Y − Ȳ)
1     32    28    11       8        88
2     32    25    11       5        55
3     32    28    11       8        88
4     28    28    7        8        56
5     28    25    7        5        35
6     28    24    7        4        28
7     26    24    5        4        20
8     26    23    5        3        15
9     26    22    5        2        10
10    26    23    5        3        15
11    25    23    4        3        12
12    25    22    4        2        8
13    25    23    4        3        12
14    25    22    4        2        8
15    23    23    2        3        6
16    23    20    2        0        0
17    23    20    2        0        0
18    23    21    2        1        2
19    21    20    0        0        0
20    21    21    0        1        0
21    18    21    -3       1        -3
22    18    19    -3       -1       3
23    18    18    -3       -2       6
24    14    19    -7       -1       7
25    14    18    -7       -2       14
26    11    14    -10      -6       60
27    7     14    -14      -6       84
28    7     4     -14      -16      224
29    4     14    -17      -6       102
30    4     4     -17      -16      272

Σ(X − X̄)(Y − Ȳ) = 1227

Calculating in the same way for the five parts of the two tests gives:
Part 1: Σ(X − X̄)(Y − Ȳ) = 23.8
Part 2: Σ(X − X̄)(Y − Ȳ) = 29.1
Part 3: Σ(X − X̄)(Y − Ȳ) = 109.1
Part 4: Σ(X − X̄)(Y − Ȳ) = 18.1
Part 5: Σ(X − X̄)(Y − Ȳ) = 126

After calculating the totals of (X − X̄)(Y − Ȳ), the coefficients are computed with Pearson's formula, r = Σxy / (N · Sx · Sy). The final results are presented in the following table:

      Whole   P1     P2     P3     P4     P5
r     0.91    0.69   0.63   0.80   0.57   0.78

Table 9: The validity coefficients

Table 9 shows the correlation coefficients between the two tests as a whole and between their five parts. Hughes (1995) points out that perfect agreement between two sets of scores results in a validity coefficient of 1, while a total lack of agreement gives a coefficient of zero (Hughes, 1995, p. 24). Looking at Table 9, we can see clearly that the validity coefficient for the whole test is 0.91, which means the agreement between the two tests is at a high level. This is supported by the coefficients of the individual parts, which are 0.69, 0.63, 0.80, 0.57 and 0.78 respectively. All of them show a comparatively high level of agreement between the two tests.
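The whole-test coefficient in Table 9 can be reproduced directly from the raw scores. The sketch below transcribes the two whole-test score lists from Table 7 and implements the Pearson formula used above; it is an illustration, not part of the original study.

```python
# Sketch: Pearson validity coefficient between the final reading test and
# the PET (whole-test scores from Table 7), r = sum(xy) / (N * Sx * Sy).
from math import sqrt

final_test = [32, 32, 32, 28, 28, 28, 26, 26, 26, 26, 25, 25, 25, 25, 23,
              23, 23, 23, 21, 21, 18, 18, 18, 14, 14, 11, 7, 7, 4, 4]
pet = [28, 25, 28, 28, 25, 24, 24, 23, 22, 23, 23, 22, 23, 22, 23,
       20, 20, 21, 20, 21, 21, 19, 18, 19, 18, 14, 14, 4, 14, 4]

def pearson_r(xs, ys):
    """Pearson correlation with x and y taken as deviations from the means
    and Sx, Sy as population standard deviations, as in the thesis."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum_xy / (n * sx * sy)

r = pearson_r(final_test, pet)   # about 0.91, matching Table 9
```

Using unrounded means gives essentially the same coefficient as the hand calculation with the rounded means 21 and 20.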
From what we have calculated, and by Hughes's criterion for the validity correlation coefficient, we can come to the conclusion that the final reading test for students at HATECHS has construct validity.

2.3.3.4. Summary of the results of the study
Chapter 2 has provided the practical context of the study. In this chapter, we have attempted to examine the reliability of the final reading test for students at HATECHS and to find evidence for the validity of the test. This was obtained from the test score analysis. The results are summarized as follows (the total score on the test is 35; each correct answer receives 1 point and each wrong answer 0 points):
1. Mean of the final reading test scores: M = 21
2. Standard deviation of the test scores: SD = 7.92
3. Reliability coefficient: Rxx = 0.87
4. Validity coefficient: r = 0.91
5. The test has face validity.
Based on the results of the data analysis, we conclude that the test is valid. It is reliable, though not highly reliable, since a highly reliable test is said to have a reliability coefficient ranging from 0.90 to 0.99. However, given the conditions of the school where the research was carried out, a coefficient of 0.87 is acceptable for the reading test.

PART THREE: CONCLUSION

1. Conclusion
As the research questions raised at the beginning of the study have been answered, it is time to bring all the issues together. In Part One, the primary concern of the thesis was stated: the practice of evaluating the reliability and validity of the final reading test for students at HATECHS. This practical concern necessarily raised theoretical questions. Therefore, in Chapter 1 of Part Two, we reviewed the theories relating to approaches to language testing and to testing reading. In the section on test evaluation, the criteria for evaluating a test were explored. This established the theoretical background for the actual study in Chapter 2.
In Chapter 2, the main part of the thesis, the researcher analyzed the test scores and found evidence for the reliability and the validity of the test. In this chapter, the researcher came to the conclusion of her study: the final reading test for students at HATECHS is valid and reliable.

2. Limitations
Given the limits of the researcher's knowledge and time, the study cannot avoid certain limitations. Firstly, the study is limited to evaluating the reliability and the validity of the test, and in evaluating validity the researcher found evidence only for face and construct validity. Secondly, the number of participants is 30, which is rather small compared with the number of students at the school; moreover, the participants came only from the Department of Finance and Accounting. Finally, the conditions and circumstances in which the test took place are not discussed, which leaves the test-retest aspect of reliability unexamined. These limitations lead to the future directions in the next section.

3. Future Directions
From the results of and limitations to the study, we wish in the future to evaluate the test in more detail. In other words, the study would continue with test item analysis, through which we could explore the discrimination as well as the item difficulty of the test. Additionally, we wish to interpret the test scores of a larger number of participants. Finally, in future studies we also wish to offer suggestions to test designers for a better final reading test for students at HATECHS.

REFERENCES
Aebersold, J., & Field, M. (1997). From reader to reading teacher. Cambridge: Cambridge University Press.
Alderson, J. C. (1996). The testing of reading. In C. Nuttall (Ed.), Teaching reading skills in a foreign language, 212-228. Oxford: Heinemann.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Alderson, J. C., Clapham, C., and Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press. Alderson, J.C. (1979). The effect on the cloze test of changes in deletion frequency. Journal of Research in Reading, 2, 108-118. Anderson, N. (1999). Exploring second language reading: issues and strategies. Boston: Heinle. Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press. Bachman L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press. Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16, 61-70. Bachman, L.F. (1985). Performance on cloze test with fixed-ratio and rational deletions. TESOL Quarterly, 19, 535-556. Brown, J.D. (1993). What are the characteristics of natural cloze tests? Language Testing, 10, 93 -116. Carol, B. J. (1980). Testing communicative performance: An interim study. London: Pergamon Institute of English. Cohen, A. D. (1998). Strategies and processes in test taking and SLA. In L. F. Bachman and A. D. Cohen (Eds.) Interfaces between second language acquisition and language testing research, 90-111. Cambridge: Cambridge University Press. Harrison, A. (1983). A Language Testing Handbook. London: Mcmillan Press. Heaton, J. B. (1988). Writing English Test. London: Longman. Henning, G. (1987). A Guide to Language Testing. Cambridge: Cambridge University Press. 37 Hoang Van Trang. (2005). Evaluating the reliability of the achievement writing test for the first-year students in the English Department, CFL-VNU and some suggestions for changes. Unpublished MA. College of Foreign Languages, Vietnam National University, Hanoi: Vietnam. Hughes, A. (2003). Testing for Language Teachers. Cambridge: Cambridge University Press. Jonz, J. (1976). Improving the basic egg: The multi-choice cloze. Language Learning, 26, 255-265. Jonz, J. (1990). Another turn in the conversation: What does cloze mean? 
TESOL Quarterly, 24, 61-83. Klein-Braley, C. (1983). A cloze is a question. In J.W. Oller, Jr., (Ed.), Issues in language Testing research (pp. 218 – 228). Rowley, MA: Newbury House. Klein-Braley, C. (1985). A cloze-up on the C-test: A study in the construct validation of authentic tests. Language Testing, 14, 47-84. Lado, R. (1961). Language Testing. London: Longman. Messick, S. (1989). Validity. Linn, R.L. (Ed) Educatioanl Measurement. Third Ed. American Council on Education. Macmillan Publishing Co. N.Y. Milanovic, M. (Ed.) (1999). Dictionary of language Testing. Local Examinations Syndicate: Cambridge University Press. Raatz, U. (1985). Better theory for better tests? Language Testing, 2, 60-75. Shohamy, E. (1985). A practical Handbook in Language Testing for the Second Language Teachers. Tel-Aviv: Tel-Aviv University Press. Ur, P. (1996). A Course in Language Teaching. Cambridge: Cambridge University Press. Weir, C. J. (1990). Communicative Language Testing. New York: Prentice Hall. Weir, C. J. (1993). Understanding and Developing Language Tests. New York: Prentice Hall International. 38 APPENDIX 1: THE FINAL READING TEST TRƯỜNG TRUNG HỌC KỸ THUẬT VÀ NGHIỆP VỤ HÀ NỘI ------------------------------------ FINAL ENGLISH TEST Skill: Reading Time allowed: 60 minutes Mark Marker’s signature: Your name:…………………………… 1…………………………… Group:…………………………………… 2………………………….. Date of birth: …………………………… Date:…………………………………….. -----------------------------------------------------------------Part 1 Question from 1-5 Look at the sign in each question Someone asks you what it means. Mark the letter next to the correct explanation – A, B, C or D on your answer sheet. Example: 0 Silence please Examination in progress A B C D Please be quiet while people are taking their examination Do not talk to the examiner Do not speak during the examination. 
The examiner will tell you when you can talk Example: 0 1 Stand in a queue, here for the tickets A B C D A D PART 1 B C It is difficult to buy the tickets. You can buy the ticket anywhere you like. Tickets are available in a queue. To buy tickets, you must queue here. 39 2 We are closed for staff training until 9.30 3 Please leave the shop before 9 p.m. 4 5 Please hand in your key at the desk Sorry- All tables fully booked this evening A B C D We can train you to work here. We are not open today because of staff training. The shop is run by trained staff. The shop will open at 9.30 today. A B C D The shop opens at 9 p.m. The shop will close at 9 p.m. People can stay at the shop until 9 p.m. People can enter the shop after 9 p.m. A B C D Don’t lock the room. Keep your key safe. Lock your door before leaving. Leave your key at reception. A B C D You can only have a meal if you have booked. You do not need to reserve a table. The restaurant is not open this evening. If you wait you will be given a table. Part 2 Questions 6- 10 The people below all want to learn a new sport. On the next page there are descriptions of eight sports centres. Decided which sports centre would be the most suitable for the following people. For questions 6-10, mark the correct letter (A-H) on your answer sheet. Example: 0 6 7 A G B H PART 2 C D F Dionysis works in the city centre and wants to take up a sport that he can do regularly in his lunch hour. He enjoys activities which are fast and a bit dangerous. John and Betty already play golf at weekends. Now they have retired, they want to learn a new activity they can do together in the mornings in the countryside. 40 8 9 10 In six weeks’ time, Juan is having a holiday on a Caribbean island, where he plans to explore the ocean depths. He has a 9-to-5 job and wants to prepare for this holiday after work. Tomoko and Natalie are 16. They want to do an activity one evening a week and get a certificate at the end. 
They would also like to make new friend. Alice has a well-paid but stressful job. She would like to take up a sport which she can do outside the city each weekend. She also wants go get to know some new people. Sporting Opportunities A Suzanne’s Ridding School B You can start a horse-riding at any age. Choose private or group lessons any weekday between 9 a.m. and 8.30 p.m. (3.30 p.m. or Saturdays). There are 10 kilometres tracks and paths for leisurely rides across farmland and open country. You will need a ridding hat. C Adonis Dive Centre Our Young Sailor’s Course leads to the Stage 1 Sailing qualification. You’ll learn how to sail safely and the course also covers sailing theory and first aid. Have fun with other members afterwards in the clubroom. There are 10 weekly two-hour lessons (Tuesdays 6 p.m. – 8 p.m.). D Our experienced instructors offer one-month courses in deep-sea diving for beginners. There are two evening lessons a week, in which you learn to breath underwater and use the equipment safely. You only need a swimming costume and towel. Reduced rates for couple. E Hilton Ski Centre If you are take our 20-hour course a week or two before your skiing holiday, you’ll enjoy you holiday more. Learn how to use a ski-lift, how to slow down and, most importantly, how to stop! The centre is open from noon to 10 p.m. Skis and boots can be hired. Lackford Sailing Club Windmill Tennis Academy Learn to play tennis in the heart of the city and have fun at our tennis weekends. Arrive on Friday evening, learn the basic strokes on Saturday and play in a competition on Sunday. There’s also a disco and swimming pool. White tennis clothes and a racket are required. F Avon Watersports Club We use a two kilometer length of river for speedboat racing and water-skiing. A beginners’ course consists of ten 20-minute lessons. You will learn to handle bots safely and confidently, but must be convenient central position and is open daily from 9 a.m. 
to 4 p.m., with lessons all through the day.

G  Glenmoreie Golf Club
After a three-hour introduction with a professional golfer, you can join this golf club. The course stretches across beautiful rolling hills and is open from dawn until dusk daily. There are regular social evenings on Saturdays in the club bar. You will need your own golf equipment.

H  Hadlow Aero Club
Enjoy a different view of the countryside from one of our two-seater light aeroplanes. After a 50-hour course with our qualified instructor, you could get your own pilot's licence. Beginners' lessons for over-18s are arranged on weekdays after 4 p.m.

Part 3
Questions 11 - 20
Look at the statements below about wedding traditions. Read the text to decide if each statement is correct or incorrect. If it is correct, mark A on your answer sheet. If it is not correct, mark B on your answer sheet.

Example: 0

PART 3

11. The wedding cake was made of sugar and honey.
12. Until today the wedding cake is the symbol of good luck and fertility.
13. A single woman can place a piece of wedding cake under her pillow.
14. The smell of flowers attracts the evil spirits.
15. To spread her good fortune and luck the bride throws the bouquet of flowers.
16. In early times, a woman wore her wedding dress in her wedding ceremony.
17. The wedding dress is normally white because this is the colour of virginity.
18. The bride should make her own dress for her wedding.
19. In early times, the golden ring was the symbol of love and marriage.
20. The vein in the third finger was believed to run directly to the heart.

Wedding Traditions

Wedding cake
The first wedding cake dated back to the Middle Ages. It was made of sugar icing and decorated with meaningful symbols like doves, horseshoes, etc. Until today the wedding cake is the symbol of good luck and fertility. The bride and the groom cut the wedding cake together and from that moment they share their new life together. All the guests should eat some to ensure good luck.
A single woman can place a piece of wedding cake under her pillow and should dream of the man she is going to marry.

Wedding Dress
In early times, a woman wore her best dress in her wedding. The tradition of wearing a white dress was only started in 1949. White is a sign of virginity and joy. People also believe that the colour can drive away evil spirits. It was believed that the bride should never make her own dress or try it on before the wedding. She shouldn't let her groom see her in her wedding dress before the wedding, either. These were to make sure that the marriage took place.

Bridal Bouquet
Flowers played a very important part in olden times - the smell of the flowers was believed to ward off evil spirits and bring good fortune. The throwing of the bouquet is a way of spreading the bride's good fortune and luck. Whoever catches it will be blessed with good luck and will be the next to marry.

Wedding Ring
In the past, a golden ring was given to the bride's family in payment for the bride. Now it is simply the symbol of love and marriage. The unbroken circle is also an age-old symbol of 'Eternity'. It is a tradition to place the wedding ring on the third finger of the left hand. Perhaps it's because the ancient Romans believed that the vein in the third finger ran directly to the heart, so the wearing of rings on that finger joined the couple's hearts and destinies.

Part 4
Questions 21 - 25
Read the text and questions below. For each question, mark the letter next to the correct answer - A, B, C or D - on your answer sheet.

Example: 0

PART 4

My name is Mandi. Three months ago, I went to a disco where I met a boy called Tom. I guessed he was older than me, but I liked him and thought it didn't matter. We danced a couple of times, then he asked how old I was. I told him I was 16. I thought that if I told him my real age, he wouldn't want to know me, as I'm only 13. After the disco we arranged to meet the following weekend.
The next Saturday we went for a burger and had a real laugh. Afterwards he walked me to my street and kissed me goodnight. Things went really well. We see each other a couple of times a week, but I've had to lie to my parents about where I'm going and who with. I've always got on with them, but I know that if they found out how old Tom was they'd stop me seeing him. Now I really don't know what to do. I can't go on lying to my parents every time we go out, and Tom keeps asking why he can't come round to my house. I'm really worried and I need some advice.

21  Why has Mandi written this?
    A to describe her boyfriend
    B to prove how clever she is
    C to explain a problem
    D to defend her actions

22  Who is she writing to?
    A her boyfriend
    B her parents
    C a teenage magazine
    D a schoolfriend

23  Why is Mandi worried?
    A Tom has been behaving strangely.
    B She's been telling lies.
    C She's not allowed to go to discos.
    D Her parents are angry with her.

24  Why can't Tom come to Mandi's house?
    A She doesn't want her parents to meet him.
    B Her parents don't like him.
    C He's nervous of meeting her parents.
    D She doesn't want him to see where she lives.

25  Which of these answers did Mandi receive?
    A Tell me what you really feel.
    B You must start by being honest with everyone.
    C Everyone's been unfair to you.
    D Don't worry, I'm sure Tom will change his mind.

Part 5
Questions 26 - 35
Read the text below and choose the correct word for each space. For each question, mark the letter next to the correct word - A, B, C or D - on your answer sheet.

Example: 0

PART 5

For many young people sport is (0)……… popular part of school life and (26)……… in one of the school teams and playing in matches is very important. (27)……… someone is in a team it means a lot of extra practice and often spending a Saturday or Sunday away (28)……… home, as many matches are played then.
It (29)……… also involve travelling to other towns to play against other school teams and then (30)……… on after the match for a meal or a drink. Sometimes parents, friends or other students will travel with the team to support (31)……… own side. When a school team wins a match it is the whole school which feels proud, (32)……… only the players. It can also mean that a school (33)……… well-known for being good at certain sports and pupils from that school may end up playing (34)……… national and international teams so that the school has some really (35)……… names associated with it!

(0)  A a        B an       C the      D and
26   A having   B being    C taking   D putting
27   A If       B As       C Then     D So
28   A at       B on       C for      D from
29   A ought    B is       C can      D has
30   A being    B staying  C leaving  D spending
31   A their    B its      C our      D whose
32   A but      B however  C and      D not
33   A turns    B makes    C comes    D becomes
34   A up       B to       C for      D beside
35   A old      B new      C common   D famous

Appendix 2: The PET (criterion measure)
-------------------------------------------------------------------
Reading (Time allowed: 60 minutes)

Part 1
Questions 1-5
Look at the sign in each question. Someone asks you what it means. Mark the letter next to the correct explanation - A, B, C or D - on your answer sheet.

Example: 0  Silence please. Examination in progress
    A Please be quiet while people are taking their examination.
    B Do not talk to the examiner.
    C Do not speak during the examination.
    D The examiner will tell you when you can talk.

PART 1

1  Please keep this entrance clear
   A Only use this entrance in an emergency.
   B Do not park in front of this entrance.
   C Always keep this door open.
   D Permission is needed to park here.

2  Supersaver tickets cannot be used on Fridays
   A You need a special ticket to travel on a Friday.
   B You can save money by travelling on a Friday.
   C Supersaver tickets can be used every day except Fridays.
   D Supersaver tickets cannot be bought before the weekend.
3  Please show the librarian all books when you leave the library
   A Return your books before you leave the library.
   B The librarian needs to see your books before you go.
   C Make sure you take all your books with you.
   D The librarian will show you where to put your books.

4  Machine out of order. Drinks available at bar
   A This machine is not working at the moment.
   B There is a drinks machine in the bar.
   C Drinks cannot be ordered at the bar.
   D Use this machine when the bar is closed.

5  Keep this door locked when room not in use
   A This room cannot be used at present.
   B This door must always be kept locked.
   C Keep the key to this door in the room.
   D Lock the door when it is not being used.

Part 2
Questions 6-10
The people below are looking at the contents pages of magazines. On the next page are parts of the contents pages of eight magazines. Decide which magazine (letter A-H) would be the most suitable for each person (numbers 6-10). For each of these numbers mark the correct letter on your answer sheet.

Example: 0

PART 2

6  Sarah is a keen walker. She lives in an area which is very flat and when she goes on holiday she likes to walk in the hills. She is looking for new places to go.

7  Jane is keen on music. She likes reading about the personal life of famous people to find out what they are really like.

8  Peter is going to France next week on business and has a free weekend which he plans to spend in Paris. He would like to find out what there is to do there.

9  Paul likes visiting other countries. He is also interested in history and likes reading about famous explorers from the past.

10  Mary likes clothes but hasn't got much money so she is looking for ways of dressing smartly without spending too much.

A  MARIA
She conquered the world of opera with the most extraordinary voice of the century - and died miserable and alone. Michael Tonner looks at Callas, the woman behind the opera singer.
B  BUSINESS IN PARIS
John Felbrick goes to Paris to see what facilities it offers for business people planning meetings.
Here and there
Our guide to what is happening in London, and this month we'll also tell you what's on in each of the capital cities of Europe.

C  Explore Africa
Last year Jane Merton joined a trip across Africa, exploring the most cut-off parts of the continent. Read what she has to say.
Read about Neij Ashdown's recent walk along one of Britain's oldest paths. It passes through some of the most beautiful hill country.

D  Enter our competition and win a week for two in Thailand
Don't go into the hills unprepared. If you're a hill walker, we have advice for you on what to take and what to do if something goes wrong.

E  We show pictures of Linda Evangelista, the supermodel from Toronto, wearing next season's clothes for the woman with unlimited pocket money.
Festivals
This is the season for street festivals. We've travelled to three of the big ones in South America and bring you pictures and information.

F  How I got there
Georgina Fay tells us how she became a famous clothes designer overnight.
In the Freezer
We talk to the two men who have just completed a walk across the Antarctic.
Tighten That Belt
Well-known fashion designer, Virginia McBrid, who now lives in Paris, tells us how to make our old clothes look fashionable.

G  Wake up children
Penelope Fine's well-known children's stories are going to be on Sunday morning Children's TV. We talk to this famous author and find out how she feels about seeing her stories on screen.
Flatlands
It may not look like promising walking country - it hardly rises above sea level, but we can show you some amazing walks.

H  My audience with Pavarotti
David Beech talks to the famous singer about his future tour of the Far East.
New light
Julian Smith talks to the granddaughter of one of the men who reached the North Pole for the first time in 1909. She tells us about his interesting life.
Part 3
Questions 11 - 20
Look at the statements below about a student hostel. Read the text to decide if each statement is correct or incorrect. If it is correct, mark A on your answer sheet. If it is not correct, mark B on your answer sheet.

Example: 0

PART 3

11. Every student has a key to the main door.
12. You can borrow your friend's main door card.
13. Insurance companies will pay if someone steals your card and takes things from your room.
14. Spare rooms are least likely to be available in summer.
15. Your brother can stay free of charge if he uses the other bed in your room.
16. Guests must report to Stan when they arrive.
17. The cleaners take away food that they find in bedrooms.
18. If you cook late at night, you should leave the washing-up until the morning.
19. Students who play loud music may have to leave the hostel.
20. You should ask Stan to call a doctor if you are ill.

Hostel rules
To make life in this student hostel as comfortable and safe as possible for everyone, please remember these rules.

Security
You have a special card which operates the electronic lock on your room door and a key for the main door of the hostel. These are your responsibility and should never be lent to anyone, including your fellow students. If you lose them you will be charged 20 pounds for a replacement. Do not leave your room unlocked even for short periods (for example, when making yourself a coffee). Unfortunately, theft from student hostels is very common and insurance companies will not pay for stolen goods unless you prove that your room was broken into by force.

Visitors
There are rarely any rooms available for visitors, except at the end of the summer term. Stan Jenkins, the hostel manager, will be able to tell you and can handle the booking. A small charge is made. Stan also keeps a list of local guesthouses, with some information about what they're like, price, etc.
You are also allowed to use empty beds for up to three nights, with the owner's permission (for example, if the person who shares your room is away for the weekend), but you must inform Stan before your guest arrives, so that he has an exact record of who's in the building if a fire breaks out. Students are not allowed to charge each other for this.

Kitchen
There is a kitchen on each floor where light meals, drinks, etc. may be prepared. Each has a large fridge and a food cupboard. All food should be stored, clearly marked with the owner's name, in one of these two places. Bedrooms are too warm for food to be kept in, and the cleaners have instructions to remove any food found in them. After using the kitchen, please be sure you do all the washing-up immediately and leave it tidy. If you use it late in the evening, please also take care that you do so quietly in order to avoid disturbing people in nearby bedrooms.

Music
If you like your music loud, please use a Walkman! Remember that your neighbours may not share your tastes. Breaking this rule can result in being asked to leave the hostel. Musicians can use the practice rooms in the basement. Book through Stan.

Health
Any serious problems should be taken to the local doctor. The number to ring for an appointment is on the 'Help' list beside the phone on each floor. For first aid, contact Stan or one of the students whose names you will find on that list, who also have some first aid training.

Part 4
Questions 21 - 25
Read the text and questions below. For each question, mark the letter next to the correct answer - A, B, C or D - on your answer sheet.

Example: 0

PART 4

Dear Mr Lander,
I run 'Snip' hairdressing shop above Mr Shah's chemist's shop at 24 High Street. I started the business 20 years ago and it is now very successful. My customers have to walk through the chemist's to the stairs at the back which lead to the hairdresser's. This has never been a problem.
Mr Shah plans to retire later this year, and I have heard from a business acquaintance that you intend to rent the shop space to a hamburger bar. I have thought about trying to rent it myself and make my shop bigger but I cannot persuade anyone to lend me that much money. I don't know what to do. My customers come to the hairdresser's to relax and the noise and smells of a burger bar will surely drive them away. Also, they won't like having to walk through a hot, smelly bar to reach the stairs. I have always paid my rent on time. You have told me in the past that you wish me to continue with my business for as long as possible. I believe you own another empty shop in the High Street. Could the burger bar not go there, where it would not affect other people's businesses?

21  What is the writer's main aim in the letter?
    A to show why her business is successful
    B to explain why her customers are feeling unhappy
    C to avoid problems for her business
    D to complain about the chemist downstairs

22  Who was the letter sent to?
    A the writer's landlord
    B the writer's bank manager
    C the owner of the burger bar
    D the local newspaper

23  What does the writer think about the burger bar?
    A It will make her lose money.
    B It will not be successful.
    C The High Street is not the place for it.
    D Other shopkeepers will complain about it too.

24  Why is the writer worried about her customers?
    A They do not like eating burgers.
    B They may not be allowed to use the stairs.
    C The smells will not be pleasant.
    D The hairdresser's will get too crowded.

25  Which of these is part of a reply to the letter?
    A Thank you for your letter. I am sorry your shop has had to close down because of lack of business.
    B Thank you for your letter. I understand your problem. I will ask them to look at the other shop but I can make no promises at the moment.
    C Thank you for your letter asking me to rent the ground floor shop to you. I will think about it and let you know.
    D Thank you for your letter.
I am sorry that I am not able to lend you the money you ask for.

Part 5
Questions 26 - 35
Read the text below and choose the correct word for each space. For each question, mark the letter next to the correct word - A, B, C or D - on your answer sheet.

Example: 0

PART 5

Sally
After two weeks of worry, a farmer (0)……… the north of England was very happy yesterday. James Tuke, a farmer who (26)……… sheep, lost his dog, Sally, when they were out (27)……… together a fortnight ago. 'Sally was running (28)……… of me,' he said, 'and disappeared over the top of the hill. I whistled and called (29)……… she didn't come. She's young, so I thought perhaps she'd gone back to the farmhouse (30)……… her own. But she wasn't there. Over the next few days I (31)……… as much time as I could looking for her. I was afraid I would never see her (32)………. Then I heard an animal crying while I was out walking near the (33)……… of a cliff. I rushed out and found Sally on a shelf of rock halfway down. She was thin and (34)……… but she had no (35)……… injuries. She was really lucky!'

(0)  A in       B of       C at       D to
26   A goes     B grows    C keeps    D holds
27   A working  B worked   C work     D works
28   A behind   B beside   C ahead    D around
29   A but      B so       C and      D even
30   A by       B on       C with     D of
31   A used     B spent    C gave     D passed
32   A more     B again    C further  D after
33   A edge     B side     C border   D height
34   A poor     B dull     C weak     D broken
35   A strong   B hard     C rough    D serious

APPENDIX 3: QUESTIONS FOR DISCUSSION
(For teachers)
The questions are designed for collecting data for my research. Your assistance in taking part in the discussion is highly appreciated. Thank you very much for your cooperation!
------------------------------------

Questions for discussion
1. Does the test appear to you a reasonable way of assessing the students?
2. Is the test too difficult or easy?
3. Is it realistic?