CHAPTER II
THEORETICAL FRAMEWORK

2.1 Definition of Test
Testing is an unavoidable part of the teaching and learning process: after a certain period of learning, a test has to be conducted. According to Brown (2002), a test is a method of measuring a person's ability, knowledge, or performance in a given domain. A test is one way to measure what someone has gained during the learning process; it shows the learner's progress, ability, and knowledge of the given material, from the moment learning began until the test is conducted. A psychological or educational test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual (Carroll, 1986, p.46). Furthermore, Bachman (2004, p.20), inspired by Carroll's theory, stated that a test is a measurement instrument designed to elicit a specific sample of an individual's behavior. As one type of measurement, a test usually quantifies the characteristics of individuals according to explicit procedures. In language testing, a test is used as a measurement tool to find out someone's abilities in language learning. For test makers, a test can also serve as an evaluation of the teacher's ability in teaching, indicating whether they should change their teaching system or continue with their method. Learning and testing are thus inseparable. Nevertheless, a test can be a horrible and scary thing for students who do not understand its aims and purposes.

2.2 Functions of Tests
There is no science without measurement, and testing, including language testing, is one form of measurement. A useful test, one that can truly measure someone's ability, must provide reliable and valid measurement for a variety of purposes.
Henning (1987) describes the functions of tests as follows:

2.2.1 Diagnosis and Feedback
Language tests and educational tests generally aim to find out the strengths and weaknesses in learners' abilities during the learning process. This kind of test is known as a diagnostic test. The value of a diagnostic test is that it provides critical information to the student, teacher, and administrator that should make the learning process more efficient (Henning, 1987, p.2). This information can then be used to improve learners' abilities through knowledge of their strengths and weaknesses.

2.2.2 Screening and Selection
Another function of a test is to help in deciding whether someone is allowed to participate in an institution. The selection decision is made by determining who will benefit most from instruction, attain mastery of the language, and become a useful practitioner, as far as the tester is able to determine. In the area of language testing, a common screening instrument is the aptitude test, which is used to predict prospective students' success or failure in a language-learning program (Carroll, 1965).

2.2.3 Program Evaluation
Another important use of tests, especially achievement tests, is to find out the effectiveness of institutional programs. If the results of these tests are used to modify the learning program to better meet the needs of the students, the process is termed formative evaluation. As cited from Henning (1987, p.3), a final exam administered as part of this process is part of what is called summative evaluation (Scriven, 1967).

2.2.4 Placement
Here tests are used to classify students by level and place each student at an appropriate level of instruction based on their capability. This kind of test is commonly used in language courses: the institute administers the test to measure test-takers' abilities and place them in the right class according to their level.
2.2.5 Providing Research Criteria
Test scores can also provide information that serves as a standard value for research in another context; the test results then become essential information for other research.

2.3 Types of Tests
As the purposes of tests have developed, so have the types of tests. According to Hughes (1989), different purposes of testing will usually require different kinds of tests.

2.3.1 Criterion-referenced Tests
Tests designed to provide direct information about what someone can actually do in the language are criterion-referenced tests. They are intended to classify people according to whether or not they are able to perform a set of tasks. Students are not compared with the achievement of other students; rather, their own achievement is measured, so the value of the test reflects the degree of their language learning. Tasks are set and performances are evaluated to decide whether the students pass or fail. In applying this kind of measurement, such tests show both strengths and weaknesses. According to Henning (1987, p.7), on the positive side, the process of developing a criterion-referenced test helps clarify objectives; the tests are useful when the objectives are under constant revision, and with small, unique groups for which norms are not available. On the negative side, Henning (1987) explains that the objectives measured are often too limited and restrictive, since they must be specified operationally. Another possible weakness is that scores are not referenced to any norm. An example of a criterion-referenced test is the teacher-made test, which is usually conducted in the classroom. Harris (1969, p.1) characterized the teacher-made test as rather informal.
Classroom tests are generally prepared, administered, and scored by one teacher, so test objectives can be based directly on course objectives. The instructor, the test-writer, and the evaluator are the same person, and the students know well what is expected of them and what standard the tests require.

2.3.2 Norm-referenced Tests
The opposite of criterion-referenced tests is norm-referenced tests. By definition, norm-referenced tests must be prepared and administered to a large sample of people from the target population before being put into general use. Norm-referenced tests relate one sample's performance to another sample's performance. The acceptable standard becomes known only after the tests have been developed and administered; it is determined based on the mean, or average, score of all students from the same population (Henning, 1987). Like criterion-referenced tests, norm-referenced tests have their own strengths and weaknesses. For the strengths, Harris (1969) explained that comparisons between populations' achievements can easily be made. Also, because estimates of reliability and validity are provided, one can judge with more confidence whether an educational institution can be trusted. The standard produced by norm-referenced tests is fairer and less arbitrary than that of criterion-referenced tests, because acceptable standards of achievement are determined with reference to the achievement of other students. Because the results cover a range of performance, a norm-referenced test provides more information about test-takers' abilities than a simple pass-fail judgment. As for weaknesses, such tests are usually valid only with the norm group, typically a large group of individuals who are similar to those for whom the test is designed.
Norms change as the characteristics of the population change, so such tests must be updated periodically. Norm-referenced tests are also developed independently of any particular course of instruction, which is why their results are difficult to match with course objectives. Norm-referenced tests are also known as standardized tests. Gronlund (1985) stated that standardized tests are based on fixed or standard content, which does not vary from one form of the test to another. This content may be based on a theory of language proficiency, as with the Test of English as a Foreign Language, or on a specification of language users' expected needs, as in the English Language Testing Service test. Standardized tests have standard procedures for administering and scoring, which remain the same from one administration of the test to the next. Moreover, standardized tests have been thoroughly tried out through a process of empirical research and development, so their characteristics are well known, including the type of measurement scale they provide. Their reliability and validity can be carefully investigated and demonstrated for the intended uses of the test; score distribution norms have been established with the norm groups; and if there are alternate forms of the test, these are equated statistically to ensure that reported scores on each form indicate the same level of ability.

2.4 Characteristics of a Good Test
A test is said to be good when it has three criterion qualities: validity, reliability, and practicality. This means the test must be appropriate in terms of its objectives, dependable in the evidence it provides, and applicable to the particular situation. Like a cake that is not delicious when one ingredient is left out, a test cannot be called good if one of the three criterion qualities is missing.
In preparing a good test, the teacher must understand these three concepts of test quality and how to apply them (Hughes, 1989).

2.4.1 Validity
The validity of a test concerns whether it measures what the test designer wants to measure. In his book on testing in language programs, Brown (1996) stated that test validity is the degree to which a test measures what it claims, or purports, to be measuring. Gronlund (1981, p.65) said that validity relates to the results of the test: validity is interpreted from the test results, not from the test itself. The result indicates the degree to which the test designer's purpose has been reached. For example, if the purpose is to measure the ability to write a paragraph, the test should be arranged to measure the ability to write a paragraph.

2.4.2 Reliability
The reliability of a test concerns whether the test results are consistent. Reliability refers to the consistency of measurement (Gronlund, 1981, p.94). A reliable test result reflects the students' understanding of the material that was given and learned. A test is said to be consistent, or reliable, if a student's scores are more or less the same across administrations; if the scores differ significantly, the test is unreliable. Differences in score may arise from factors that influence test-taker performance: noise, the test-taker's feelings, the timing of the test, or anything else that disturbs the test-taker's concentration.

2.4.3 Practicality
The practicality of a test concerns its usability: the availability of place and time, and the administration process of the test. The shorter the time needed to administer the test, the more practical the test is.

2.5 Theory of Reliability
The reliability of a test concerns whether the test results are consistent.
The defining characteristic of reliability is consistency (Henning, 1987, p.73). This consistency is reflected in obtaining similar scores when the test, as a measurement tool, is administered repeatedly on different occasions to the same students with the same ability. As Hughes (1989, p.29) puts it, the more similar the scores would have been, the more reliable the test is said to be. A test is consistent, or reliable, if a student's scores are more or less the same; if the scores differ significantly, the test is unreliable. A reliable test result reflects the students' understanding of the material that was given and learned. Furthermore, in his book Testing for Language Teachers, Hughes (1989) notes that the ideal reliability coefficient is 1.0: a test with a reliability coefficient of 1.0 would give precisely the same result for a particular set of candidates regardless of when it happened to be administered. The reliability coefficients of different tests can then be compared. Hughes (1989) adds that a test with a reliability coefficient of zero would give sets of results quite unconnected with each other: the score someone actually obtained would be no help at all in attempting to predict their score on the next test. Moreover, authors differ on how high a reliability coefficient one should expect for different types of language tests. As cited in Hughes (1989, p.32), Lado (1961) suggested that the difficulty of achieving reliability in testing varies with the different abilities being tested. The reliability coefficient of a test can be estimated by various methods; according to Hughes (1989), three methods are commonly used.
The reliability coefficient, together with the standard error of measurement, can then be used to judge whether a test meets the criteria for producing a constant score.

2.5.1 Retesting Method (Test-retest Reliability)
This method requires two sets of scores for comparison, obtained by re-administering the same test to the same group of subjects, so that they take the same test twice (Hughes, 1989). The test-retest method provides an estimate of the stability of test scores over time. Similarly, Henning (1987) says that the same test is re-administered to the same people following an interval of no more than two weeks. The purpose of limiting the interval is to prevent changes in the test-takers' true scores, which would distort the reliability coefficient obtained by this method. If the second administration follows the first too soon, the reliability coefficient will be inflated, because the test-takers are likely to remember the test and answer the second administration just as they did the first, so the test no longer measures their actual abilities. Conversely, if there is a long gap between administrations, additional learning or forgetting will take place, changing the test-takers' true abilities from the first administration, and the coefficient will be lower than it should be.

2.5.2 Alternate-forms Method (Parallel-form Reliability)
In this method, two sets of tests are administered to the same group of students and the scores are correlated to see how consistent they are. The two sets of tests must be equivalent forms. To demonstrate the equivalence of tests, they must show equivalent difficulty, equivalent variance in scoring, and equivalent covariance, that is, equivalent correlation coefficients (Henning, 1987, p.82). The reliability coefficient from this method is expected to be the same as the coefficient obtained by the test-retest method.
However, the memory effect that troubles the test-retest method is controlled in the alternate-forms method: the test-takers do not need to recall items from the previous test for a second, identical administration, because they do not have to perform the same test twice. On the other hand, this method is difficult to carry out, since it is almost impossible to fulfill the criteria for equivalent tests.

2.5.3 Split-half Method (Split-half Reliability)
In Fundamental Considerations in Language Testing, Bachman (1990, p.172) writes that the split-half method calculates the reliability coefficient by dividing the test into two halves and then determining whether the scores on the two halves are consistent with each other. This method requires only one administration of the test to the students. The requirement for this method is that the two halves measure the same ability: the items in the two halves must have equal means and variances, and the two halves must be independent of each other, meaning that an individual's performance on one half does not determine their performance on the other half. The split-half method commonly uses the odd-even approach, grouping all the odd-numbered items into one half and all the even-numbered items into the other. This approach satisfies the requirements mentioned in the previous paragraph. An example of a test with independent items that measure the same ability is a multiple-choice test of grammar or vocabulary.
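The odd-even grouping described above can be sketched in a few lines of code. This is only an illustration, not part of the cited sources; the function name and the item scores are invented, with 1 marking a correct answer and 0 an incorrect one.

```python
# Sketch of the odd-even split: one student's item scores are divided
# into an odd-numbered half and an even-numbered half. Item numbering
# is 1-based, matching how test items are usually counted.

def split_odd_even(item_scores):
    """Return (odd_half_total, even_half_total) for one student."""
    odd = [s for i, s in enumerate(item_scores, start=1) if i % 2 == 1]
    even = [s for i, s in enumerate(item_scores, start=1) if i % 2 == 0]
    return sum(odd), sum(even)

# A ten-item multiple-choice test: items 1, 3, 5, 7, 9 form one half,
# items 2, 4, 6, 8, 10 the other.
scores = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
x, y = split_odd_even(scores)
print(x, y)  # → 3 4
```

Repeating this for every student yields the two columns of half scores (X and Y) that the formulas in the next part operate on.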
According to Lado (1961), the split-half reliability coefficient can be estimated with the following formula:

    r_xy² = (N·ΣXY − ΣX·ΣY)² / [(N·ΣX² − (ΣX)²) · (N·ΣY² − (ΣY)²)]

where:
N = the number of students in the sample
X = odd-item scores
Y = even-item scores
ΣX = the sum of X scores
ΣY = the sum of Y scores
ΣX² = the sum of the squares of X scores
ΣY² = the sum of the squares of Y scores
ΣXY = the sum of the products of X and Y scores for each student
r_xy² = the square of the correlation of the scores on the two halves of the test

This formula calculates the correlation between the two sets of half scores. Walpole (1992), in Pengantar Statistika, classifies the result as follows:

−0.1 to −0.5 = low negative correlation
−0.6 to −1.0 = high negative correlation
0.0 = no correlation
0.1 to 0.5 = low positive correlation
0.6 to 1.0 = high positive correlation

The next formula takes the square root of r_xy² to obtain r_xy:

    r_xy = √(r_xy²)

where:
r_xy = the obtained reliability of half the test
r_xy² = the square of the correlation of the scores on the two halves of the test

Then the Spearman-Brown formula is used to estimate the reliability of the entire test:

    r_tt = 2·r_11 / (1 + r_11)

where:
r_tt = the obtained reliability coefficient of the entire test
r_11 = the obtained reliability of half the test (the r_xy found above)

After the result is obtained, Gronlund (1981) gives the following interpretation:

0.900 to 1.000 = very high correlation
0.700 to 0.899 = high correlation
0.500 to 0.699 = moderate correlation
0.300 to 0.499 = low correlation
0.000 to 0.299 = little if any correlation

2.6 TOEFL as a Good Language Testing Product
Generally, the TOEFL test is intended to evaluate only certain aspects of the English language proficiency of persons whose native language is not English (Duran, Canale, Penfield, Stansfield, Liskin-Gasparro, 1985, p.1). The test is used by colleges and universities in the United States of America and Canada to review and qualify the language ability of incoming foreign students whose native language is not English.
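The chain of calculations above can be sketched in code as a minimal illustration. The function name and the score lists are invented for demonstration; note that the squared-correlation form loses the sign of the correlation, so the sketch assumes the two halves are positively related, as they normally are for a well-constructed test.

```python
# Sketch: squared half-test correlation from the raw sums, its square
# root, then the Spearman-Brown correction for the whole test.
from math import sqrt

def split_half_reliability(x, y):
    """x, y: odd-half and even-half scores, one pair per student.
    Returns (r_xy, r_tt): the half-test correlation and the
    Spearman-Brown estimate of whole-test reliability."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sx2 = sum(v * v for v in x)
    sy2 = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    # squared correlation between the two halves (Lado's formula)
    r2 = (n * sxy - sx * sy) ** 2 / ((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    r_xy = sqrt(r2)  # assumes a positive correlation between halves
    # Spearman-Brown: step up from half-test to full-test reliability
    r_tt = (2 * r_xy) / (1 + r_xy)
    return r_xy, r_tt

# Invented half scores for six students
odd = [12, 15, 9, 14, 11, 13]
even = [11, 14, 10, 13, 12, 14]
r_xy, r_tt = split_half_reliability(odd, even)
print(round(r_xy, 3), round(r_tt, 3))
```

Because the Spearman-Brown step always raises a positive half-test coefficient, r_tt is larger than r_xy, reflecting the fact that the full-length test is more reliable than either half alone.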
2.6.1 Aspects of the Standardized TOEFL
Since 1976, the TOEFL has consisted of three sections, with TOEFL scores ranging from 227 to 627. The three sections are Listening Comprehension, Structure and Written Expression, and Reading Comprehension and Vocabulary (Duran, Canale, Penfield, Stansfield, Liskin-Gasparro, 1985).

Listening comprehension section. This section is designed to measure the test-taker's ability to understand spoken English in the American dialect. It contains three item types: statements, dialogues, and mini-talks/extended conversations.

Structure and written expression section. The structure part is intended to assess the test-taker's control of syntactic features appropriate to various clause and phrase types, negative constructions, comparative forms, word order in sentences, and statements of parallel relationships at the phrase and clause level. Each structure item consists of an incomplete sentence together with the appropriate word and distractor words to complete it; the appropriate word must be grammatically correct given the structure and content of the sentence. The written expression part consists of sentences with four underlined words each; test-takers must choose which of the underlined words is inappropriate in the context of the sentence. It is designed to assess the learner's command of the syntactic and semantic characteristics of function and content words, appropriate usage of function words, appropriate word order, appropriate diction and idiomatic usage, appropriate and complete clause or phrase structure, and maintenance of parallel forms.

Reading comprehension and vocabulary section. This section is created to measure the test-taker's understanding of reading material and of the meaning and use of words.

2.6.2 Bina Nusantara Curriculum of TOEFL
To create uniform teaching and learning, the SAP/MP are needed as guidance.
They are also needed to standardize the material that will be given to all students at Bina Nusantara University.

2.6.2.1 Module Plan
The syllabus for Bahasa Inggris 3, odd semester 2008/2009:
• Introduction to the course
• Sentences with reduced clauses
• Sentences with reduced clauses (continued)
• Reading (Market Leader)
• Sentences with inverted subjects and verbs
• Sentences with inverted subjects and verbs (continued)
• Reading exercises - TOEFL
• Reading exercises - TOEFL (continued)
• Problems with usage
• Writing paragraphs and essays
• Reading exercises - TOEFL
• Reading exercises - TOEFL
• Reading exercises - Ethics (Market Leader)

2.6.2.2 Relationship of Curriculum and Classroom Setting
Harsono (1997, p.21) explained that there is a close correlation between the curriculum and its components (objectives, materials, methodology, evaluation, lesson plan, and classroom situation), as shown in the diagram below (cited from Fanny Wibowo, pp.13-14).

[Figure 1: The Relation of Curriculum and Classroom Setting — a hierarchy running from the Curriculum at the top, through the Lesson Plan, down to the Classroom Situation.]

The curriculum occupies the highest position, while the classroom situation occupies the lowest. This means the curriculum is the most important and essential element of the teaching-learning program. The syllabus of the curriculum is applied in the plan of the teaching-learning program. From the syllabus, the teacher delivers the material to find out the students' achievement of the objectives. How the teaching-learning program is conducted is also important in helping the students achieve the objectives. Then an evaluation is administered to the students to measure how far they achieved the objectives during the teaching-learning process. These four components from the syllabus are expected to guide the teacher in planning the teaching-learning program; the best place to organize this is in the school, through the lesson plan.
With the plan, the teacher can handle the classroom in the right way, under the control of the curriculum.

2.7 Criteria of a Test to Produce a Constant Score
The constant score produced by a standardized test is what makes it trusted and accepted as a standard. The question then arises about non-standardized tests, for example the teacher-made test: can the teacher-made test be trusted, or, beyond that, can it become a standardized test? Meeting the reliability criteria of standardized tests is the key requirement for a teacher-made test to become standardized. According to Hughes (1989, p.36), the ways to make a test more reliable, so that it produces a constant score, are:

2.7.1 Do not allow the test-takers too much freedom
In some kinds of language tests there are parts where test-takers may choose a question and answer it in the way they want. Writing tests are an example: they generally provide several topics for the students to choose from and then allow them to elaborate the topic in their own way. Hughes (1989) proposes that the more freedom that is given, the greater the difference is likely to be between the performance actually elicited and the performance that would have been elicited had the test been taken a day later. Too much freedom harms the reliability of the test, which is why a multiple-choice item is more reliable than an essay item. It is not impossible, however, to make a reliable essay test: restricting the topic as much as possible is the way to make an essay test more reliable (Hughes, 1989).

2.7.2 Provide clear and explicit instructions
Hughes (1989) says that unclear instructions lead test-takers to misinterpret what they are asked to do; they then take the test based on their own understanding of the instructions.
That makes the test inaccurate and unreliable: the test-takers cannot respond to the actual items, under the actual instructions, according to their own ability, so their ability cannot be measured accurately.

2.7.3 Use items that permit objective scoring
The first kind of objective item that comes to mind is the multiple-choice item. However, good multiple-choice items are notoriously difficult to write and always require extensive pre-testing (Hughes, 1989). As an alternative to multiple-choice items, open-ended items can be used, and these may even prove more reliable: an open-ended item has a unique, possibly one-word, correct response which the test-takers produce themselves.

2.7.4 Identify test-takers by number, not name
The results of tests are expected to reflect the test-takers' abilities, so scoring systems have to be as fair and objective as possible. Scoring that is not objective affects the accuracy and reliability of the tests. Hughes (1989) explains that studies have shown that even where the candidates are unknown to the scorers, the name on a script makes a significant difference to the scores given; for example, the gender or nationality suggested by the name will affect the scorer's judgment. Identifying students by number reduces this effect.
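The blind-scoring procedure above can be sketched simply: before scripts reach the scorer, each name is replaced with a number, and the name-to-number key is kept apart until scoring is finished. This is only an illustrative sketch; the function name and the student names are invented.

```python
# Sketch of identifying test-takers by number: scorers receive only
# numbered scripts, while the key linking numbers back to names is
# held separately by the administrator.

def anonymize(names):
    """Return (numbered_scripts, key): the numbers scorers see, and
    the number-to-name key kept by the administrator."""
    key = {i + 1: name for i, name in enumerate(names)}
    numbered = list(key)  # scorers see only [1, 2, 3, ...]
    return numbered, key

names = ["Ani", "Budi", "Citra"]
numbered, key = anonymize(names)
print(numbered)  # → [1, 2, 3]
print(key[2])    # → Budi
```

After scoring, the administrator uses the key to match each numbered score back to the student, so fairness during scoring costs nothing in record-keeping.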