Testing Language Skills: From Theory to Practice (Part One)
Prepared by: Mousazadeh

Evaluation is a major consideration in any educational setting. Teachers have always wanted to know how much their students have learned. The government and the private sector, which pay teachers and later employ the students, want precise information about students’ abilities. Students, teachers, administrators and parents all work toward achieving educational goals, and measurement and evaluation are essential devices that help them make sound educational decisions. Some educational decisions affect a large number of people (e.g., the university entrance exam). Others affect only a single person (e.g., should Ali be placed in an advanced group?). A good decision is one that is based on all relevant information.

Of the three terms (test, measurement, and evaluation), test is usually considered the narrowest. A test connotes the presentation of a set of questions to be answered. From a person’s answers to these questions we obtain a measure, that is, a numerical value, of a characteristic of that person. Measurement is broader: it also covers measuring characteristics by means other than tests. Using observations, rating scales or other devices that yield information in quantitative form is measurement. Stufflebeam et al. (1971) stated that evaluation is “the process of delineating, obtaining and providing useful information for judging decision alternatives”. Evaluation has also been interpreted as the determination of the congruence between performance and objectives, and categorized as professional judgment or as a process that allows one to make a judgment about the value of a measure. Evaluation requires that we have a goal or objective in mind.

Both test givers and test takers benefit from test results.
Testing encourages students and motivates them to learn the subject matter. Teachers should provide positive classroom experiences for their students through giving tests. Appropriate evaluation gives students a sense of accomplishment and alleviates their dissatisfaction with the educational program. Testing helps students prepare themselves and thus learn the materials, and repeated preparation enables them to master the language. Students benefit from the test results and from the discussion of these results. Several tests or quizzes make students better aware of the course objectives. The analysis of the test results reveals the students’ areas of difficulty, which gives them an opportunity to make up for their weaknesses. It is also held that a better awareness of course objectives and personal language needs can help students adjust their personal activities towards the achievement of their goals.

Teachers should provide good instruction and appropriate evaluation. An appropriate test should provide answers to the following questions: Has the instruction been successful? Were the materials for the instruction at the right level? Have all language skills been emphasized equally? What points need reviewing? Should the same materials be used next year, or do they need some modification? Teachers also need testing to explain and justify their activities in class. The analysis of the results should provide answers to the following questions: Were the test instructions clear? Was the allotted time sufficient? How did the students feel when responding to the items? Were the test results a reflection of the students’ performance during the course?

Teacher-made tests are frequently used to evaluate students’ progress in school. Through classroom achievement tests, teachers can measure the efficiency of the instruction.
Teacher-made tests are considered valuable because they measure the students’ progress based on classroom activities, they motivate students, and they give the teacher an opportunity to diagnose students’ weaknesses. They also have shortcomings: the content of the test might be ambiguous, and sometimes the tests are irrelevant to the instructional materials. Any teacher-made test must be based on a predetermined content to measure the students’ knowledge at a given point in time.

Standardized tests are commercially prepared by skilled test makers and measurement experts. They provide samples of behavior under uniform procedures.

Teacher-made and standardized tests differ in the following aspects: directions for administration and scoring, sampling of content, construction, norms, and purposes and use.

Directions for administration and scoring. In teacher-made tests, there are usually no uniform directions specified. In standardized tests, specific instructions and standardized administration and scoring procedures are used.

Sampling of content. In teacher-made tests, both content and sampling are determined by the classroom teacher. In standardized tests, content is determined by curriculum and subject matter experts and involves extensive investigation of existing syllabi, textbooks, and programs. Sampling of content is done systematically.

Construction. In teacher-made tests, the construction may be hurried and random. There are often no test blueprints, item tryouts, item analysis or revision, and the quality of the test may be quite poor. In standardized tests, meticulous construction procedures are used, including constructing objectives and test blueprints and employing item tryouts, item analysis and item revision.

Norms. In teacher-made tests, only local classroom norms are available. In standardized tests, national and district norms are available in addition to local norms.

Purposes and use. Teacher-made tests measure particular objectives set by the teacher and are used for intra-class comparisons. Standardized tests measure broad curriculum objectives and are used for inter-class, school and national comparisons.
Standardized means that the same fixed set of questions is administered with the same set of directions, time restrictions and scoring procedures. Scoring is usually based on objective procedures, although some standardized tests also include essay-type questions. Teacher-made tests usually cover a single unit of work or the work of one term. Standardized tests usually have a wider range of coverage (that is, they cover more material): they assess one year’s learning or more.

Testing is also viewed as a practical teaching strategy that gives learners useful opportunities to discuss language choices. It has been observed that “language testing today reflects current interest in teaching, but it also reflects earlier concerns for scientifically sound tests.”

Three kinds of testing are discussed: traditional tests, multiple-choice tests, and testing communication.

Traditional tests. They are closely related to the Grammar-Translation Method (GTM) in language teaching. The early stage of traditional testing is called the intuitive stage, in which the relationship between language teaching and testing is stronger. Knowing about the language was emphasized: students had to memorize many language rules and lists of words. Traditional tests also include a great deal of writing (composition) and reading comprehension. Examples: (1) convert the following statement into the past tense; (2) write the main parts of these verbs: go, buy; (3) make sentences using each of these words: bashful, diligent; (4) translate the following sentence or passage into Farsi.

Structural linguistics and behavioristic psychology. These two disciplines suggested that “language mastery could be evaluated scientifically bit by bit” (Madsen, 1983). Behaviorist psychologists consider language learning as a set of habits. Structural linguists analyze the components of language (sounds, morphemes, words, syntax). Objective tests were devised to measure different language elements. The main reasons were the emergence of structural linguistics and behavioristic psychology and the unreliability of subjective tests.
Open-ended and multiple-choice tests. They are the most popular types of objective tests. In multiple-choice tests, the students are presented with alternatives or options (one correct answer and several distracters) and are expected to choose the correct alternative. These tests measure only a single, discrete feature of the language. They provide the learner with restricted contexts, usually no wider than the item itself. Constructing good test items with reasonable distracters is very difficult. One suggestion is that “the inexperienced test constructors should first prepare open-ended items and administer them to some students. The wrong answers provided by the students could be used as reasonable distracters later on.” Uncommon and implausible distracters are dangerous instruments in language testing, and many types of multiple-choice tests expose students to a lot of unlikely errors. Many language tests prepared by teachers intend to examine linguistic components separately. These linguistic components constitute language skills.

Other tests move toward global testing and make more comprehensive demands on the learners. Two very popular types of global test are dictation and cloze. The term cloze is taken from Gestalt psychology, and a cloze test is based on a passage from which some words have been deleted. It requires perceptive and productive skills and a sound knowledge of lexical and grammatical systems. Students should take advantage of all linguistic clues and should rely on other contextual clues, too.

Tests are also constructed to enable learners to manipulate language functions and to identify utterances as belonging to a certain function of language. There is a misconception that successful language usage leads to successful language use, whereas the linguistic aspects of language are only one part of the communication process.

Functions of language tests

A test is an instrument for collecting numerical information about an attribute.
The purpose is to determine the degree to which an attribute exists. The function of a test refers to the purpose for which the test is designed. Tests serve two broad functions: prognosis and evaluation of attainment. Prognostic tests include placement, aptitude, and selection tests. They are not directly related to a particular course of study. Evaluation of attainment tests include achievement, proficiency, and knowledge tests. They are based on a clearly specified course of instruction.

Prognostic tests. The scores are used to make decisions about the most appropriate channel of educational or occupational career for the testees. The main goal is to make sound decisions about the future success of the examinees on the basis of their present capabilities.

Selection tests. The purpose is to provide information on which the examinees’ acceptance or non-acceptance into a particular program can be determined. Examples: taking a selection test to obtain a driver’s license, or taking a selection test to demonstrate your capabilities for employment as a typist. There is a criterion for pass or fail in a selection test but not in a placement test. Due to administrative limitations, admitting all applicants who pass a selection test is not always possible. In other words, when the number of applicants passing a test exceeds the capacity of the educational program, the test turns into a competition test. There are two options: to increase the facilities so that more applicants can be admitted, or to modify the passing criteria (i.e., the difficulty level of the test can be increased).

Placement tests. They are used to determine the most appropriate channel of education for the examinees. The purpose is to help those who need more help. Example: taking a placement test in the language department of a university before taking academic courses.

Aptitude tests. They are used to predict applicants’ success in achieving certain objectives in the future. The examinee does not need to have prior knowledge of the subject being tested. They can contribute to decisions on the future career of the applicants. Example: how good a pilot, an engineer, or a teacher can one be?
Developing aptitude tests is a very delicate and time-consuming task. Weak tests may provide invalid and misleading information.

Evaluation of attainment tests deal with the extent to which examinees have learned the materials they have been taught. They are directly related to educational settings, whereas prognostic tests are not.

Achievement tests. They are used to measure the degree of students’ learning from a particular set of materials. They measure the detailed elements of an instructional topic, are used to determine the strengths and weaknesses of the examinees in a particular course of study, and are developed on the basis of the materials being taught. They are also designed to measure students’ overall achievement in a particular language class. Achievement tests can be used for both instruction and evaluation purposes. They focus on measuring students’ achievement of the materials covered within the course.

Proficiency tests. They measure the overall language ability of the learners: the degree of knowledge a learner has gained through language education, the degree of capability in the language components, and the degree to which a person is able to demonstrate knowledge of language use in practice. Many universities use proficiency tests, such as TOEFL, for admission purposes. Constructing proficiency tests is more difficult than constructing other tests, because it is not easy to define proficiency.

Knowledge tests. They are used in situations where the medium of instruction is a language other than the learners’ mother tongue. They measure the examinees’ knowledge in areas other than the language itself.

Forms of Language Tests

Topics: explanation of the concept of the item; different classifications of item formats; the advantages and disadvantages of the item format classifications.

The appearance of a test may put the testee in an unexpected situation. It may disappoint or encourage the testee, and it may be harmonious with or contrary to the testee’s presuppositions about the test.
The form refers to the physical appearance of the test. It depends on the nature and variety of the attributes to be measured, and also on the function of the test.

An item is the smallest unit of a test. An item consists of two parts: the stem and the response. The purpose of the stem is to elicit information from the examinee. The stem can be presented as a question, a statement, an incomplete sentence, or in other forms. The response refers to the information elicited from the examinee. It can range from recognizing a single word to providing a comprehensive essay that discusses or explains a complex issue. In multiple-choice items, the stem is followed by three, four or five responses. These are called alternatives, options or choices. One of them is the correct response and the others are called distracters. An alternative may or may not be the correct response. In other words, alternatives include both the correct response and the distracters, whereas distracters consist only of wrong alternatives.

Item formats are classified in several ways: subjective versus objective items, essay-type versus multiple-choice items, suppletion versus recognition items, and the psycholinguistic classification.

Subjective versus objective items. Translation tests were used as major techniques of testing (translating a passage or a set of sentences from one language into another). The advantage was that the content of translation tests was relevant to the materials to be tested. In other words, the content was a valid representation of the materials. The main shortcoming of translation tests was that the scoring procedures were not systematic. More generally, subjective scoring does not follow any objective criterion, it requires a great amount of time and energy, and fluctuations of scores from one scorer to another create a serious problem for the consistency of test scores. Objective tests were developed for the following purposes: To compensate for the inadequacies of subjective tests.
To apply psychometric principles to language tests. To develop consistently scored tests.

Objectivity or subjectivity refers to the way a test is scored and has little or nothing to do with the form of a test. It is a misunderstanding to assume that all composition tests are subjective or that all multiple-choice tests are objective.

Essay-type versus multiple-choice items. Essay type refers to all kinds of items in which the examinee is required to produce language elements. Multiple choice refers to all kinds of items in which the examinee is required to select the correct response from among given alternatives. There are different varieties of essay-type formats, ranging from producing a single word to producing a comprehensive explanation, and each of them requires a certain type of activity. They cannot be classified under the same category.

Suppletion versus recognition items. Recognition items require the examinees to recognize the correct response from among the alternatives provided for each stem. Suppletion items require the examinee to supply the missing parts of the stem or to complete an incomplete stem. In this classification, the degree of production and the way the items would be scored were not clear.

Psycholinguistic classification. In the new classification, the form of the item is determined by taking theoretical principles of language processing into account. It is called psycholinguistic because it assumes both psychological and linguistic principles as the underlying theoretical assumptions of item formats. The classification distinguishes verbal from non-verbal manifestations, and perception from production of oral or written materials. The processes it names are identification, analysis, recognition and comprehension. Verbal manifestations include oral and written forms. Non-verbal manifestations include all sorts of graphic devices.

Statistics involves collecting numerical information, called data, analyzing it, and making meaningful decisions on the basis of the outcome of the analysis. There are two major areas: descriptive statistics and inferential statistics. According to Hatch and Farhady (1982), in descriptive statistics we describe sample data. Each characteristic of the sample is called a statistic.
Using the methods of inferential statistics, we can make inferences from a statistic about the characteristics of a given population. Statistic: a characteristic of a given sample. Parameter: a characteristic of a given population.

Normally the first step in summarizing the data is to arrange the scores in order of size, usually from the highest to the lowest. For ties, ranks are averaged.

Score   Rank
19      1
17      2
16      3.5
16      3.5
14      5
13      6
12      7.5
12      7.5
8       9

In addition to the time and trouble required to determine the ranks, the list is long and inadequate for making comparisons with other groups or classes that are much larger or much smaller. Example: ranking 19th in a class of 20 is poorer than ranking 19th in a class of 100 students. The status of a score should not be announced simply by the number of scores above or below it (e.g., “Reza got the third highest mark in the class”). The number of scores in the entire distribution would have to be made known.

Frequency is the number of times each score occurs. Example: for the scores 19, 19, 19, 17, 17, 15, 15, 15, 15, 15, 13, 12, 11 the frequencies are f19 = 3, f17 = 2, f15 = 5, f13 = 1, f12 = 1, f11 = 1.

Frequency distribution, percentage and percentile. By using percentiles we can determine the position of a score in a given distribution and report how a given student is doing. Relative frequency refers to the frequency of each score divided by the total number of scores (RF = f/N). The relative frequency multiplied by 100 is called the percentage (Percentage = RF × 100). With percentages we can say what percent of the subjects passed the test or received a particular score. A percentile indicates the standing of any particular score in a group of scores: it shows how many scores fall at or below the given score or point in a distribution. The cumulative frequency is obtained by adding up the frequencies of successive intervals. The cumulative frequency column is constructed from the bottom up.
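The ranking and frequency steps described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `average_ranks` and `frequency_table` are hypothetical, and the ten-score list is made up for the example. Only RF = f/N, Percentage = RF × 100 and the bottom-up cumulative frequency come from the text.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest, averaging the ranks of ties."""
    ordered = sorted(scores, reverse=True)
    positions = {}
    for position, score in enumerate(ordered, start=1):
        positions.setdefault(score, []).append(position)
    # Every tied score receives the mean of the positions it occupies.
    return [(s, sum(positions[s]) / len(positions[s])) for s in ordered]


def frequency_table(scores):
    """Return rows (score, f, RF, percentage, F), highest score first.

    f = absolute frequency, RF = f / N, percentage = RF * 100,
    F = cumulative frequency, accumulated from the lowest score upward.
    """
    n = len(scores)
    counts = Counter(scores)
    rows = []
    cumulative = 0
    for score in sorted(counts):          # build the F column from the bottom up
        f = counts[score]
        cumulative += f
        rows.append((score, f, f / n, 100 * f / n, cumulative))
    rows.reverse()                        # present the highest score first
    return rows


# Illustrative data (an assumption for this sketch): ten scores.
for row in frequency_table([19, 19, 17, 17, 15, 15, 15, 13, 12, 11]):
    print(row)
```

Keeping F in the same row as f makes it easy to derive further columns from the table later without recounting the raw scores.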
The lower-case letter f is used for absolute frequency and the upper-case F for cumulative frequency. To compute the percentile rank of any level or point, the corresponding F is divided by the total number of scores and the result is multiplied by 100 (percentile rank = F/N × 100). The percentile rank of an individual tells us what percent of the students who took the test scored at or below the level in question. The percentile rank of an individual score is often more helpful than the particular score itself.

Score (ranked)   f   RF     Percentage   F    Percentile
19               2   2/10   20%          10   100%
17               2   2/10   20%          8    80%
15               3   3/10   30%          6    60%
13               1   1/10   10%          3    30%
12               1   1/10   10%          2    20%
11               1   1/10   10%          1    10%

Graphs. A graph is a valuable supplement to the summary of the data and the statistical analysis. A graph or chart attracts the reader’s attention and is often an effective method of clarifying a point. It is said that pictures speak for themselves: the picture or graph is a more concrete representation of the data. It has been noted that today our attention is called more and more to the limitless possibilities of visual education. The correct graph reveals the message briefly and simply. Graphs offer better comprehension of the data than is possible with textual matter alone, more analysis of the subject than is possible in a written text, and a check of accuracy.

Bar graph. In bar graphs vertical bars are used. The height of each bar represents the number of members, or the frequency, of that class. First the two axes (the horizontal and vertical lines) are drawn. Then the scores are entered on the horizontal axis and the frequency of each score on the vertical one.

Histogram. The histogram is a series of columns. One class interval is the base of each column, and the height is the number of cases, or frequency, in that class. It is customary to extend the scale one class interval above and below the range.
In the histogram, the top of each column is a horizontal line one class interval long, drawn at the height of the frequency in that class. When the frequencies are instead plotted as points, the points are joined by straight lines (a frequency polygon).

Usually, to describe a set of data, just two or three properties of the set of scores are singled out. Indexes known as summary statistics describe the typical size and spread of the scores. Two properties are described: central tendency and dispersion (variability).

Measures of central tendency: the mode, the median, and the mean.

The mode. The mode is the score that occurs most frequently in a set of scores. In 12, 14, 15, 16, 16, 16, 20 the mode is 16. When all of the scores in a group occur with the same frequency, the group has no mode, as in 11, 11, 12, 12, 13, 13, 14, 14. When two adjacent scores share the highest frequency, the mode is the average of the two adjacent scores: in 12, 13, 14, 14, 15, 15, 16, 17, 18 the mode is 14.5.

The median. The median (Md) is the score at the 50th percentile in a group of scores. It is the score that divides the ranked scores into halves: half of the scores are larger than the median and the other half are smaller. a) If the data include an odd number of scores, the median is the middle score when they are ranked: for 10, 11, 12, 14, 17, 19, 20, Md = 14. b) If the data include an even number of scores, the median is the point halfway between the two central values when the scores are ranked: for 10, 12, 13, 14, 15, 17, 19, 20, Md = 14.5.

The mean. The mean is the arithmetic average. It is computed by dividing the sum of all scores by the number of scores:

$\bar{X} = \dfrac{\sum x}{N}$

where $\bar{X}$ = mean, $\sum$ = sum of the scores, $x$ = any individual score in the distribution, and $N$ = total number of scores.

If we subtract the mean from each score, the resulting difference is a deviation score (D), which can be either positive or negative. The sum of all N deviation scores is zero. The sum of the squared deviations of scores from the mean is less than the sum of the squared deviations around any point other than the mean.
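The three measures of central tendency and the two properties of deviation scores can be checked with a short Python sketch. It relies on the standard `statistics` module for the mean and median. The helper names `mode_of`, `deviation_scores` and `squared_deviation_sum` are hypothetical, and `mode_of` follows the conventions stated in the text (no mode when all frequencies are equal, average of two adjacent tied scores), which differ from what `statistics.mode` returns. The score list 0, 1, 1, 3, 5 is a small illustrative set.

```python
from collections import Counter
from statistics import mean, median


def mode_of(scores):
    """Mode under the conventions used in the text.

    Returns None when every distinct score occurs equally often, the average
    of two adjacent scores when they tie for the highest frequency, and the
    single most frequent score otherwise. Other tie patterns are not covered
    by the text; this sketch simply returns the list of tied scores.
    """
    counts = Counter(scores)
    top = max(counts.values())
    if len(counts) > 1 and all(f == top for f in counts.values()):
        return None
    peaks = sorted(s for s, f in counts.items() if f == top)
    if len(peaks) == 1:
        return peaks[0]
    distinct = sorted(counts)
    if len(peaks) == 2 and distinct.index(peaks[1]) - distinct.index(peaks[0]) == 1:
        return (peaks[0] + peaks[1]) / 2
    return peaks


def deviation_scores(scores):
    """D = x - mean for every score x."""
    m = mean(scores)
    return [x - m for x in scores]


def squared_deviation_sum(scores, point):
    """Sum of squared deviations of the scores around an arbitrary point."""
    return sum((x - point) ** 2 for x in scores)


scores = [0, 1, 1, 3, 5]                      # illustrative data, mean = 2
print(mode_of([12, 13, 14, 14, 15, 15, 16, 17, 18]))
print(median([10, 12, 13, 14, 15, 17, 19, 20]))
print(deviation_scores(scores), sum(deviation_scores(scores)))
print(squared_deviation_sum(scores, mean(scores)), squared_deviation_sum(scores, 3))
```

The last line compares the sum of squared deviations around the mean with the sum around another point (3 here), which illustrates why the mean is the point of least squared deviation.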
Example: deviation scores for the scores 0, 1, 1, 3, 5 (mean = 2).

Score   Mean   Deviation score (D)   D squared
0       2      -2                    4
1       2      -1                    1
1       2      -1                    1
3       2      1                     1
5       2      3                     9

Measures of dispersion: range, variance, and standard deviation. Measures of central tendency only locate the center of the distribution, and the location of the center may not be adequate to provide a logical picture of the data.

Range. It is the difference between the largest number and the smallest number in the distribution. It is the simplest measure of variation to calculate since only two numbers are used, but it doesn’t tell us anything about how the other scores vary. If there is one extreme value in a distribution, the dispersion will appear very large. If we remove the extreme value, the dispersion may become small.

Variance and standard deviation. First the mean of the numbers is calculated. The mean is subtracted from each number. The results of the subtraction are squared. The average of the squared results is computed, which is the variance (the formula below divides by N - 1 rather than N). Standard deviation is the square root of the variance.

$\text{Variance} = \dfrac{\sum (x - \bar{X})^2}{N - 1}$

SD tells us about the degree of dispersion of scores in a distribution. By comparing the SDs of different groups we can tell to what extent they are homogeneous.

Correlation. Topics: linear correlation, the coefficient of correlation (Pearson correlation), rank order correlation, and point biserial correlation. Example paired score lists: 20, 19, 18, 17, 16, 15 with 15, 16, 17, 18, 19, 20 (as one list falls the other rises, a perfect negative relationship), and 20, 19, 18, 17, 16 with 20, 19, 18, 17, 16 (the two lists are identical, a perfect positive relationship).

Test construction. The function refers to the purpose of the test and the form refers to the way an item is presented to the examinees. The form and the function of a test are interrelated: the function of a test can impose certain limitations on the form of its items. The steps are: determining the function and the form; planning (determining the content of the test); preparing the items; reviewing the items; pretesting; validating the test.

Determining the function and the form. Three things are considered: the characteristics of the examinees, the specific purpose of the test, and the scope of the test. The relevant characteristics of the examinees are: The nature of the population to which the test is likely to be administered (age, gender, ...).
Level of intellectual and cognitive abilities. The examinees’ language background (doing contrastive analysis). The examinees’ educational system.

The purpose of the test is to gather quantitative information about the degree of the examinees’ command of a particular area of knowledge.

Planning. Planning involves examining the instructional objectives (e.g., including the major structural points covered during the instruction), dividing the major topics into their specific points (the degree of detail depends on practical factors such as test length and test time), and preparing a table of specifications with two dimensions (on one dimension, topics and subtopics are listed; on the other, the form and number of items are described). The purpose is to assure the test developer that the test includes a representative sample of the materials covered in a particular course.

Preparing items.

For true-false items: Avoid using broad general statements. Avoid using statements which measure trivial points. Avoid using negative statements. Avoid using long and complex sentences. Make true and false statements approximately similar in length, difficulty and distribution.

For matching items: Use homogeneous materials in a single matching item. Include an unequal number of items in each column. Clarify the way the items from the two columns are to be matched. Keep the list brief and place the shorter column to the right.

For multiple-choice items: The stem should be quite clear and state the point to be tested unambiguously. The stem should include as much of the item as possible. Negative statements should be avoided because they are likely to be ignored by the examinees. All of the alternatives should be grammatically correct by themselves and consistent with the stem. Every item should have one correct or clearly best answer. All distracters should be plausible. All distracters should be of similar length and level of difficulty. Using “all of the above” or “none of the above” as an alternative is not recommended.
Correct responses should be distributed approximately equally but randomly among the alternatives. The stem should not provide any grammatical clue which might help the examinee find the correct response without understanding the item. The stem should not start with a blank.

Pretesting. Pretesting means examining or reviewing the test objectively, not subjectively. It has two parts: determining objectively the characteristics of the individual items, including item facility (IF), item discrimination (ID), and choice distribution (CD); and validation, which determines the characteristics of the items together and includes reliability, validity and practicality.

Item facility (IF). It refers to the easiness of an item. It is the proportion of correct responses to the total number of responses: IF = ∑C / N, where IF = item facility, ∑C = sum of the correct responses, and N = total number of responses. IF equals 1 when all examinees answer an item correctly and zero when none of them does. Items with an IF above 0.63 are too easy, and items with an IF below 0.37 are too difficult.

Item difficulty. It is the proportion of wrong responses to the total number of responses: item difficulty = 1 - item facility.

Item discrimination (ID). It refers to the extent to which a test item discriminates more knowledgeable examinees from less knowledgeable ones: ID = (CH - CL) / (½ N), where ID = item discrimination, CH = number of correct responses of the high group, CL = number of correct responses of the low group, and ½ N = total number of responses divided by 2. (An item discrimination index above 0.40 is acceptable.)

Choice distribution (CD). It refers to the frequency with which alternatives are selected by the examinees.
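The item indexes above are simple enough to compute by hand, but a short Python sketch makes the arithmetic explicit. The function names and the ten responses below are hypothetical. The discrimination formula ID = (CH - CL) / (½ N) is the usual high-group/low-group formula suggested by the symbol definitions in the text, so treat it as an assumption here, as is the splitting of examinees into two equal halves by total score.

```python
from collections import Counter


def item_facility(responses, key):
    """IF = number of correct responses / total number of responses."""
    return sum(1 for r in responses if r == key) / len(responses)


def item_difficulty(responses, key):
    """Item difficulty = 1 - item facility (the proportion of wrong responses)."""
    return 1 - item_facility(responses, key)


def item_discrimination(high_group, low_group, key):
    """ID = (CH - CL) / (N / 2), assumed formula.

    high_group and low_group hold the responses of the higher- and
    lower-scoring halves of the examinees to one item.
    """
    ch = sum(1 for r in high_group if r == key)
    cl = sum(1 for r in low_group if r == key)
    half_n = (len(high_group) + len(low_group)) / 2
    return (ch - cl) / half_n


def choice_distribution(responses):
    """CD: how often each alternative was selected."""
    return Counter(responses)


# Hypothetical answers of ten examinees to one item whose key is "b".
high = ["b", "b", "b", "b", "a"]   # the five highest total scores
low = ["b", "c", "a", "d", "b"]    # the five lowest total scores
everyone = high + low

print(item_facility(everyone, "b"))
print(item_difficulty(everyone, "b"))
print(item_discrimination(high, low, "b"))
print(choice_distribution(everyone))
```

In this made-up example the item falls inside the 0.37 to 0.63 facility band and sits right at the 0.40 discrimination threshold, and every distracter attracts at least one examinee.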