LTS Handbook Version 4.0


BUREAU FOR INTERNATIONAL

LANGUAGE CO-ORDINATION

LANGUAGE TESTING SEMINAR

22 November – 3 December 2003


CONTENTS

I.    INTRODUCTION ………………………………………………………………… I-1

II.   FAMILIARIZATION WITH THE SCALE ………………………………………… II-1

III.  EVALUATION OF READING COMPREHENSION ……………………………… III-1

IV.   EVALUATION OF LISTENING COMPREHENSION …………………………… IV-1

V.    TEST CONSTRUCTION, ANALYSIS, AND ADMINISTRATION ………………… V-1

VI.   EVALUATION OF SPEAKING PROFICIENCY ………………………………….. VI-1

VII.  EVALUATION OF WRITING PROFICIENCY …………………………………… VII-1

APPENDIX A – NATO Standardization Agreement 6001 (Edition 2) .………….…. A-1

APPENDIX B – Trisections ……………………..…………………………………… B-1

APPENDIX C – Testing Listening Comprehension: Tips for Teachers ……………… C-1

APPENDIX D – The Concepts of Reliability and Validity ………………………….. D-1

APPENDIX E – Scale-Related Terminology ………………………………………… E-1

APPENDIX F – General Testing Terminology ………………………………………. F-1

APPENDIX G – Selected Bibliography and Web Links ……………………………. G-1


CHAPTER III

EVALUATION OF READING COMPREHENSION


SOME TESTING TERMINOLOGY

ITEM
An entire “question”. It may be considered a miniature test.

STEM
The initial part of the item: either a partial sentence to be completed, a question, or several statements leading to a question or incomplete phrase.

OPTIONS
Choices from which the examinee must select an answer.

KEY
The right answer. The one option that is distinctly correct or more suitable than the others.

DISTRACTORS
The incorrect options.

ORIENTATION, or SETTING
The context presented at the beginning of the item to provide a fuller understanding of the situation presented for testing.

TEXT
The written or spoken material providing the content on which the item is focussed.


MULTIPLE-CHOICE TEST ITEM

ORIENTATION:  A message at the office

TEXT:

John,                                                March 5

Betty called today at 12:15. She said you have a piece of certified mail to pick up. The mail room closes at 3 o’clock today.

Thank you,
N. F.

STEM:  This note tells John to

OPTIONS:

(A) close the mail room at three.
(B) go to get some mail.
(C) mail a letter for Betty.
(D) pick up Betty at the mail room.

KEY:  option B

DISTRACTORS:  options A, C, D
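The parts of an item defined above can also be recorded as a small data structure, which makes later review and answer-key checks easier to automate. The sketch below is one illustrative way to do this in Python; the class and field names are invented for this handbook example and are not part of any BILC tool.

    from dataclasses import dataclass

    @dataclass
    class MultipleChoiceItem:
        """One multiple-choice item, using the terminology defined above."""
        orientation: str    # setting shown before the text
        text: str           # the written material being tested
        stem: str           # the question or sentence to be completed
        options: dict       # option letter -> option wording
        key: str            # letter of the one correct option

        def distractors(self):
            """All option letters except the key."""
            return [letter for letter in self.options if letter != self.key]

    item = MultipleChoiceItem(
        orientation="A message at the office",
        text=("John, March 5. Betty called today at 12:15. She said you have a "
              "piece of certified mail to pick up. The mail room closes at 3 "
              "o'clock today. Thank you, N. F."),
        stem="This note tells John to",
        options={
            "A": "close the mail room at three.",
            "B": "go to get some mail.",
            "C": "mail a letter for Betty.",
            "D": "pick up Betty at the mail room.",
        },
        key="B",
    )

    assert item.distractors() == ["A", "C", "D"]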


CONSTRUCTED-RESPONSE TEST ITEM

ORIENTATION:  A message at the office

TEXT:

John,                                                March 5

Betty called today at 12:15. She said you have a piece of certified mail to pick up. The mail room closes at 3 o’clock today.

Thank you,
N. F.

STEM:  After reading this note, John should

SCORING KEY:  After reading this note, John should go to the mail room to pick up a piece of certified mail before 3 o’clock / go to the mail room to get some mail / get his mail before 3 o’clock.


TWO BODIES FOUND NEAR RANCH LINKED TO GANG

NUEVO LAREDO, Mexico – Mexican authorities working closely with the FBI unearthed two bodies yesterday near a ranch believed to be controlled by a local kidnapping and drug-smuggling gang.

The bodies of Jose Martinez and Arturo Ortiz, both of Nuevo Laredo, were found in shallow graves near Batista Ranch on the outskirts of Nuevo Laredo’s airport, said Jaime Ramirez, an assistant prosecutor for northern Tamaulipas State.

During a subsequent raid of the ranch, police arrested Luis Rivas, a Mexican who had been living across the border in Laredo, Texas, and four other suspects.

Near the ranch, investigators found a Toyota pickup truck they believe was used in the slaying of another man, Jose Hernandez, who was killed in this seedy border city of 275,000 on November 2.

Authorities also seized $81,000 in cash, 10 handguns and three walkie-talkies from inside the ranch.


MOTORCYCLIST INJURED IN COLLISION

A Seaside man was injured Thursday afternoon when his motorcycle collided with a car on Garden Road in Monterey. Police identified the injured motorcyclist as Harold R. Romney of 1284 Buena Vista Street.

He was injured in a collision with a car driven by Tom Lee, 45, of Milpitas, according to a police report. Romney was taken to Community Hospital, where a spokesman said it was expected that he would be released after treatment.

The accident, which occurred just before 3 p.m., was on Garden Road north of Olmsted Road.


U. S. AIR FORCE PILOT KILLED IN CRASH

Major Samuel A. Wainwright, 32 of Bradford, Connecticut, died Thursday when the F-15 fighter he was flying crashed in Germany’s Eifel Mountains, the U. S. Air Force announced Friday.

The Air Force said the single-seat fighter was on a 36th Tactical Fighter Wing training mission. The Air Force said the cause of the crash had not been determined.

The Armed Forces Network reported that witnesses to the crash claimed to have seen the fighter explode before it crashed.


SECOND MAN HELD IN TWINS’ DEATH

A second suspect was arrested yesterday in the shooting deaths Friday of George E. and Gerald E. Hayes, 27-year-old twins, Prince George’s County police reported.

Kenneth L. Osborne, 29, of no fixed address, surrendered at 4 p.m., police said. He was held without bond in the county jail on two first-degree murder charges.

The bodies of the Hayes brothers, who lived on Stanton Road SE, were found in separate locations in Upper Marlboro about 10 miles from the place where they were probably shot – the rear of the “51” Club in Hillcrest Heights. Early Sunday police charged Ralph P. Lucas, 26 of Hillcrest Heights with two counts of murder.


THE ROLE OF TEST SPECIFICATIONS

Participants should read Hughes’s chapters 7 and 8, especially pages 59-62.

Note Hughes’s analysis of test specifications:

STATEMENT OF THE PROBLEM

CONTENT
• Operations
• Types of text
• Addressees
• Topics

FORMAT AND TIMING

CRITERIAL LEVELS OF PERFORMANCE

SCORING PROCEDURES

The page numbers refer to the second edition (2003) of Arthur Hughes, Testing for Language Teachers.


SAMPLE of Test Specifications

STATEMENT OF THE PROBLEM

A proficiency test is needed to measure STANAG Levels 2 and 3 for listening, speaking, reading, and writing. This test should provide a strong indicator that those examinees attaining Level 2 or Level 3 can perform the tasks presented in the STANAG descriptors.

Officials responsible for personnel assignments could use proficiency test results to extrapolate probable success in jobs closely reflective of tasks in the descriptors.

CONTENT

• Operations

These are based on tasks and accuracy requirements found in the descriptors for Level 2 and Level 3.

• Types of text

Texts for listening and reading tests should come from authentic sources written by native speakers for native speakers and not intended for instructional purposes.

Examples would be periodicals and broadcasts.

• Addressees

Addressees for speaking and writing tests are native and near-native speakers of the target language who are experienced in speaking to and reading texts produced by non-natives.

• Topics

Topics should be similar to those mentioned in the Level 2 and Level 3 descriptions. In brief, Level 2 topics should be presented in a factual, concrete manner. Level 3 topics will normally include both concrete and abstract material.

FORMAT AND TIMING

Listening Comprehension:
• 40 authentic Level 2 texts, no longer than 30 words, with comprehension evaluated by 4-option multiple-choice items
• 40 authentic Level 3 texts, no longer than 75 words, with comprehension evaluated by 4-option multiple-choice items
• Time controlled by recording, approximately 75 minutes.

Reading Comprehension:
• 40 authentic Level 2 texts, no longer than 50 words, with comprehension evaluated by 4-option multiple-choice items
• 40 authentic Level 3 texts, no longer than 150 words, with comprehension evaluated by 4-option multiple-choice items
• Power test. All examinees should normally be finished within three hours.
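A blueprint like this is easier to audit if it is also kept in a machine-readable form. The sketch below is one illustrative way to record the FORMAT AND TIMING section above in Python; the field names are invented for this handbook and are not a BILC or STANAG standard.

    # Machine-readable form of the blueprint above (illustrative field names only).
    blueprint = {
        "listening": {
            "timing": "controlled by recording, approximately 75 minutes",
            "sections": [
                {"level": 2, "texts": 40, "max_words": 30, "item_type": "4-option multiple choice"},
                {"level": 3, "texts": 40, "max_words": 75, "item_type": "4-option multiple choice"},
            ],
        },
        "reading": {
            "timing": "power test, normally finished within three hours",
            "sections": [
                {"level": 2, "texts": 40, "max_words": 50, "item_type": "4-option multiple choice"},
                {"level": 3, "texts": 40, "max_words": 150, "item_type": "4-option multiple choice"},
            ],
        },
    }

    # Quick consistency check: each skill should present 80 texts in total.
    for skill, spec in blueprint.items():
        total_texts = sum(section["texts"] for section in spec["sections"])
        assert total_texts == 80, f"{skill}: expected 80 texts, found {total_texts}"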


DEFENSE LANGUAGE PROFICIENCY TEST III
READING TEST (LOWER RANGE) FORMAT

Level 0-1+ – 10 items – Multiple-choice with four English options
Ten authentic signs, notices, headlines, menus, receipts, tickets, printed to look authentic.

Level 1+ – 10 items – Information Identification
Two authentic passages at the 1+ level. Each passage consists of 60 words. Includes notices, notes, announcements, and news summaries. Five sets of four English options to translate the five underlined portions in each passage.

Level 2/2+ – 30 items – Information Identification
Two authentic passages at the 2+ level with underlined portions at the 2 to 2+ level. Each passage consists of 125 words. Includes factual news articles and descriptive passages. Fifteen sets of four English options to translate the fifteen underlined portions of each article.

Level 2+/3 – 25 items – Comprehension questions with four English options (inference questions)
Four to five authentic passages at level 3 with questions at the 2+ to 3 level. Each passage consists of 175-200 words. Includes a variety of authentic prose with unpredictable and unfamiliar text types: news stories and general reports involving hypothesis, argumentation and supported opinions.

Level 3/3+ – 25 items – True, False, Not Addressed (inference questions)
Three to four authentic passages at the 3 to 3+ level. Each passage consists of 200-250 words. Includes passages which require interpretation and depend on sociolinguistic and cultural references. Passages include complex structures, idioms, and uncommon connotative intentions.

Total: 100 items


FINAL LEARNING OBJECTIVES (FLO) SUB-SKILL

Reading: Answer Content Questions on Target Language texts in English (Task # 12)

• Target Language written text equivalent to 500 words of English
• contents: military & security, economic & political, scientific & technological, cultural, geography
• lexical aid: YES

Performance required
• answer content questions on Target Language text
• done in 30 minutes maximum
• 80% of essential elements of information captured in summary

TEST SPECIFICATIONS

Format: Constructed Response

Test Tasks
• 5 written passages in Target Language of roughly 80-100 words in length, each followed by 2 questions
• contents: military & security, economic & political, scientific & technological, cultural, geography

Behavior / Standards
• answer two content questions displayed on screen on each Target Language passage
• use lexical aid to verify spelling and meaning
• type answers into the computer

Scoring method
• by live scorers based on correct essential elements of information provided
• scores generated include both raw and percentage scores
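As a rough illustration of the scoring step above (not an official scoring routine; the function and variable names are invented for this sketch), a live scorer’s tally of correct essential elements of information can be converted into the raw and percentage scores mentioned in the specification as follows.

    def score_passage(correct_elements, total_elements):
        """Return (raw score, percentage score) for one constructed-response passage.

        correct_elements: essential elements of information the examinee captured
        total_elements:   essential elements of information defined in the scoring key
        """
        if total_elements <= 0:
            raise ValueError("the scoring key must define at least one essential element")
        raw = correct_elements
        percentage = 100.0 * correct_elements / total_elements
        return raw, percentage

    # Example: the scoring key lists 5 essential elements; the examinee captured 4.
    raw, pct = score_passage(correct_elements=4, total_elements=5)
    print(raw, pct)   # 4 80.0 – meets the "80% of essential elements" standard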


READING COMPREHENSION

PROFICIENCY ADVANCEMENT TEST (PAT) ONE

1. PART ONE (SIGNS AND EXPRESSIONS)

Stimulus – Printed signs, frozen/memorized phrases.

Options – Four English options.

Number of Items – 10

Level – 0+

2. PART TWO (VOCABULARY IN CONTEXT)

Stimulus – Short printed sentences with underlined word.

Task – Choose translation of underlined words.

Options – Four English options.

Content – High frequency/familiar vocabulary in familiar context.

No. of Items – 15

Level – 0+/1

3. PART THREE (CONTEXTUAL COMPREHENSION)

Stimulus – 4 short sentences or paragraphs with blanks.

Task – Choose correct form to fill blank.

Options – Four target language options.

Content – High frequency/familiar grammar patterns and forms.

No. of Items – 15

Level – 0+/1

4. PART FOUR (QUESTIONS ON PASSAGE/GISTING)

Stimulus – 2/3 short passages (45-70 words).

Task – Answer multiple choice questions (factual summary).

Options – Four English options.

Content – Authentic passages reduced to Level 1/2 wording. Multiple choice options will help furnish context: glosses in English may be used if absolutely necessary to furnish context.

No. of Items – 12

Level – 1/1+

In these documents the word “authentic” refers to published written material in the target language outside the course instruction.


5. PART FIVE (CLOZE PASSAGE)

Stimulus – Short passage (80-120 words) with paraphrase.

Task – Select options to restore passage.

Options – 3-4 symbols, 12-16 deletions, 10-15 options under each symbol.

Content – Simple authentic passage with very little editing; paraphrase will be used to furnish context. Glosses in English may be used if absolutely necessary to furnish context.

No. of Items – 16

Level – 1+/2

6. PART SIX (INFORMATION IDENTIFICATION)

Stimulus – Short passage (125-150 words) with underlined words and phrases.

Task – Select multiple choice option that translates underlined portion.

Options – 4 multiple choice.

Content – Simple authentic passage with very little editing; glosses in English may be used if absolutely necessary to furnish context.

No. of Items – 17

Level – 1+/2

Total No. of Items – 85

Level Range – 0+/2
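As a quick sanity check on a blueprint like this, the per-part item counts can be tallied automatically. The short sketch below (a hypothetical helper, not part of any PAT tooling) confirms that the six parts above add up to the stated total of 85 items.

    # Items per PAT One part, taken from the specification above.
    pat_one_parts = {
        "Part One (Signs and Expressions)": 10,
        "Part Two (Vocabulary in Context)": 15,
        "Part Three (Contextual Comprehension)": 15,
        "Part Four (Questions on Passage/Gisting)": 12,
        "Part Five (Cloze Passage)": 16,
        "Part Six (Information Identification)": 17,
    }

    total = sum(pat_one_parts.values())
    assert total == 85, f"blueprint total mismatch: {total}"
    print(f"Total No. of Items – {total}")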


STEPS TO FOLLOW IN ITEM WRITING ACTIVITY

1. Each group has a portfolio with nine Level 2 and Level 3 texts.

2. Review the texts.

3. Determine where each text would probably fit into the approved Table of Specifications.

4. Plan the item development activities for the next few hours.
   a. Decide how the group will do its work.
   b. Decide on a priority order for working with the texts.

5. For each text used:
   a. Edit if necessary. See the paper on “Editing of Authentic Texts” (page III-15).
   b. Review the Content/Task/Accuracy statement for the level of the text.
   c. Determine how you will test the text:
      (1) Multiple-choice? Constructed response? Other?
      (2) Agree upon the primary testing point. See the paper on “The Item Writing Process/Reading Tests”, page III-16, which refers to Hughes and other testing experts. (Examples of testing points: “Identifying the main idea in the text”; “recognizing key words and phrases”; “making inferences about the writer’s attitude”)
   d. Write options. See the paper on “A Sample Procedure for Item Development” (page III-17).
   e. Review the completed item within the group.
   f. Ask the facilitator to arrange for review by another group.
   g. Revise, if necessary.
   h. Request a review by the facilitator.
   i. Revise, if necessary.


EDITING OF AUTHENTIC TEXTS

The ideal text would require no editing at all. However, some authentic texts will seem promising for testing purposes but require some adjustment to fit test specifications. For example, a longer text may need to be shortened to fit specifications. It is important to preserve the authenticity of the text. If a two-paragraph text can be shortened by totally eliminating the second paragraph, that is a simple solution. However, the text may seem inconclusive or incomplete unless one or two sentences from the second paragraph are included.

A good text may include the names of elected officials or persons in the news; we can anticipate that the text will soon be outdated. The solution is to use the title rather than the name of the elected official. It is also possible to devise a totally fictitious name for a newsmaker.

A text may contain typographical errors that would be distracting to examinees. These should be corrected.

A text may include acronyms or abbreviations that are common in the culture but not widely known by non-natives at the levels you are testing. You can replace them with full titles of the organizations, etc.

Following editing of this type, test writers will want to review their work by asking the following questions:

• Is all spelling correct?
• Is the punctuation correct?
• Does the edited material retain semantic information and cohesion so that it can reasonably be considered a “text”?
• Would native readers/listeners find the edited text awkward or inauthentic?

Note that it may be tempting, particularly when working with spoken texts, to clean up or correct the language – to make it seem more educated or literary or a better reflection of the target culture. Such corrections may very well present a false impression of the target language and actually be non-native. Test writers should try to avoid this type of editing.


THE ITEM WRITING PROCESS

READING TESTS

Read Hughes’s chapter 11, especially pages 142-143 (Selecting Texts) and pages 153-155 (Procedures for Writing Items and Practical Advice on Item Writing).

In Assessing Language Ability in the Classroom, Andrew Cohen (quoting Alderson and Lukmani, Reading in a Foreign Language, 1989) provides a taxonomy of reading comprehension skills that lend themselves to testing. These include:

• word and phrase recognition
• identification or location of information
• discrimination of contextual features in a text (cause, sequence, chronology, hierarchy)
• interpretation of complex ideas, actions, events, relationships
• inference (deriving conclusions and predicting the next steps)
• synthesis of information
• evaluation

Madsen, in Techniques in Testing, adds the skill of paraphrase. This could include recognition or production of an accurate paraphrase.

In deciding how to construct an item, we also need to look at the content and tasks associated with the proficiency level we are testing. These can be found in the reading trisections (Content/Task/Accuracy) printed in Appendix B.


A SAMPLE PROCEDURE FOR ITEM DEVELOPMENT

1. Locate sources for authentic material at the desired levels.

2. Select an authentic text at the correct level.

3. Consider the following questions:
   • Is it likely the text would be familiar to the test taker?
   • Is it free-standing or does it depend on greater context?
   • Does it lend itself to good distractors?
   • Is it representative of the culture?
   • Does it include sensitive material that might offend an examinee?
   • Will it soon be out-dated?
   • Is the language sufficiently contemporary?

4. Edit a text that is too long or one that includes easily dated information. It may also be advisable to spell out abbreviations or acronyms if these alone raise the difficulty level of the text. Occasionally, it may be necessary to remove typographical errors.
   • Does the text still seem authentic after editing?
   • Does the text remain free-standing?
   • Is syntax correct after editing?

5. Examine the text for its potential as an item. For example:
   • Identify the main idea.
   • Identify supporting information.
   • If the text is at Level 3, does it lend itself to a question focussing on
     a. inference
     b. opinion
     c. tone
     d. synthesis
     e. other

6. Review the tasks associated with the proficiency level to be tested.

7. Prepare the key (correct option) and experiment with distractors (incorrect options).

8. Put the item aside and review it later. Revise if necessary.

9. Request a review from a colleague. Revise if necessary.

10. Request another independent review of a group of items from a native speaker not assigned to the project. Revise if necessary.


RESOURCE MATERIALS FOR ITEM REVIEW

A. QUESTIONS TO ASK WHEN EVALUATING ITEMS

Read Hughes, pages 62-64 and page 154.

1. Is the task perfectly clear?

2. Is there more than one possible correct response?

3. Can examinees respond correctly without understanding the text?
   a. That might involve “matching a string of words in the question with a string of words in the text” or
   b. It might involve testing commonly known facts such as “Elizabeth II is the Queen of England” or “Inhaling smoke from other people’s cigarettes can cause ___________.”

4. Is there adequate time to perform the task(s)?

Other considerations:

5. Options should measure a task suitable for the item type and level.

6. One and only one option should be correct: this is the “key”.

7. Distractors should be plausible.

8. At the lower levels, distractors should not be too close in meaning to the key.

9. At the higher levels, examinees should be able to make more subtle distinctions, so distractors can be closer in meaning to the key; however, they must be clearly incorrect.

10. Obvious patterns that inadvertently reveal the key should be avoided in constructing options. For example, avoid consistent use of paraphrase for keys while distractors repeat significant words from the text, or vice versa.

11. The complete test should contain a well-balanced sample of the target language domain that is appropriate for each level. There should be a variety of topics and the authentic texts selected should provide a varied sample of grammar and vocabulary.

12. Content, for the most part, should be contemporary and varied. Texts of a historical nature should be limited in number, if used at all.

13. Slang expressions should be avoided because they tend to become outdated.

14. Regionalisms should be avoided.


B. GUIDELINES FOR REVIEWING YOUR OWN AND COLLEAGUES’ ITEMS

1. Texts
   a. Is the target language correct and authentic? Was the text written by a native speaker for native speakers, and not for instructional purposes?
   b. Does the text reflect contemporary use of the language?
   c. Does the text avoid specialized technical vocabulary? Would it be understood by the average reader in the target country?
   d. Is the text length realistic for the level you want to test?
   e. Is the text at all ambiguous? If so, can this be corrected?
   f. Will the content soon seem outdated? If so, can editing correct the problem?
   g. Is it unlikely the text has been translated into the examinees’ native language and widely circulated?

2. Multiple-Choice Options
   a. Are all the options clearly distinct from one another?
   b. Are all the options approximately the same length? If not, are two longer options balanced by two shorter ones?
   c. Are the options generally parallel in form?
   d. Are all options equally general or equally specific? If not, is there balance?
   e. If two options are related in some way, are the other two similarly related?
   f. Do the options fit the text in terms of proficiency level and style?
   g. Is the language correct and idiomatic?

3. The Key
   a. Is there a correct answer?
   b. Is there one and only one correct answer?
   c. Is there any possibility examinees could determine the key without understanding the text?
   d. Does anything artificially draw attention to the key? Is it longer than the other options? Shorter? More detailed? Less?
   e. Is the key grammatically and stylistically consistent with the text?
   f. Is the key different from other options in more than minor or unimportant details?

4. Distractors
   a. Is each distractor as carefully planned as are all other parts of the item?
   b. Are all of the distractors plausible?
   c. Is each distractor written to appear attractive to examinees who cannot fully understand the text?
   d. Is each distractor based on a specific problem or misconception that an examinee might have with regard to the text?
   e. Does each distractor have plausible form, meaning, sociolinguistic context?
   f. Does each distractor differ from the key and from each other in more than trivial or minor details?
   g. Do distractors avoid tricky or misleading language?
   h. Is each distractor, although plausible, clearly incorrect?
   i. Do distractors assist in determining the difficulty of the item by forcing the examinee to make distinctions in order to arrive at the correct answer?
   j. Do distractors clearly convey their meaning?
   k. Are distractors approximately equal to each other in over-all difficulty?

5. The Stem
   a. Does it clearly state the task for the examinee?
   b. Is it succinct – free of any unnecessary information?
   c. Does it give away the correct answer?
   d. Is the language grammatically correct and idiomatic?
   e. Is it stated so that one and only one correct answer can be selected from the options?
   f. Is it free of unnecessarily high level vocabulary and grammar?
   g. Is it stated positively? If not, are such words as “NOT”, “NEVER”, “EXCEPT” printed in capital or bolded letters to highlight the negative statement?
   h. Is it free of general qualifiers that may reveal or distort the key (e.g., “usually”, “sometimes”, “possibly”, etc.)?
   i. In an item dealing with judgment or controversy, does the stem cite the authority for determining the correct answer (e.g., “The author of this article believes”)?

6. The Complete Item
   a. Does the item measure a language task or a feature of the text that examinees at the targeted proficiency level should control?
   b. Does it avoid testing trivial points?
   c. Is there anything ambiguous or contradictory about the item?
   d. Is there clearly one (and only one) correct answer?
   e. Does the item test only understanding of the authentic text? Does it avoid testing factual or cultural knowledge outside the scope of the text?
   f. Could examinees identify the correct answer on the basis of outside knowledge, without understanding the text?
   g. Does the item contain offensive material about age, gender, race, ethnic characteristics, etc.?

7. The Complete Test
   a. Does the test avoid duplication of items and objectives?
   b. Does the content of one item reveal the correct answer to another item?
   c. Does a distractor appear in more than one item?
   d. Are the items sequenced according to proficiency level?
   e. Have the correct answers been randomly sequenced before final typing to avoid a discernible pattern? (A small example of such a check appears after this checklist.)
   f. Does the test represent a well-balanced sample of the target language domain at the levels tested?
   g. Is the test content fair and unbiased toward examinees from different backgrounds (e.g., gender, age, race, ethnic background, etc.)?
   h. Is the answer key correct?
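Point 7.e is easy to check by machine once items are in electronic form. The sketch below is purely illustrative (the helper and variable names are invented, and it is not part of any BILC tool): it re-letters the options of each finished item in random order and then reports how often each letter position carries the key, so an obvious pattern can be caught before final typing.

    import random
    from collections import Counter

    def shuffle_options(options, key, rng):
        """Return the options re-lettered in random order and the key's new letter."""
        texts = list(options.values())
        key_text = options[key]
        rng.shuffle(texts)
        letters = "ABCD"[:len(texts)]
        relettered = dict(zip(letters, texts))
        new_key = next(letter for letter, text in relettered.items() if text == key_text)
        return relettered, new_key

    # Hypothetical 40-item test whose keys were all typed in position A.
    rng = random.Random(2003)
    items = [({"A": f"key {i}", "B": "distractor 1", "C": "distractor 2", "D": "distractor 3"}, "A")
             for i in range(40)]

    shuffled_keys = [shuffle_options(options, key, rng)[1] for options, key in items]
    print(Counter(shuffled_keys))   # roughly even counts of A, B, C and D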


SAMPLE ITEM WRITING PROCESS

1. Select an authentic text:

   U. S. LIFE EXPECTANCY

   The average American’s life expectancy has increased more than 50 percent during the 20th century, according to the American Council of Life Insurance.

   In the year 1900, a baby boy had an average life expectancy of 46 years and a baby girl could look forward to living 48 years on average. Today, the life expectancy of newborn males is about 70 years, while newborn females can expect to live nearly 78 years.

2. Determine the level:

   The text is a brief and simple report of factual information. It is Level 2.

3. Determine if editing is required:

   This text is less likely to seem outdated if the words “has increased” in the first sentence are changed to “increased”. This does not affect the meaning or difficulty level but removes the indication that the text was written before the 21st century began.

4. Determine which tasks associated with the level would be appropriate for testing this text:

   • Locate the main idea: The life expectancy of Americans has increased since 1900.
   • Locate supporting ideas: Boys born in 1900 could expect, on average, to live 46 years; girls, 48 years. Boys born today can expect, on average, to live about 70 years; girls, almost 78. The text describes an increase of more than 50% (a quick check of this figure follows the list).
   • Locate other facts: The American Council of Life Insurance reported this information.
   • Answer factual questions: For example: How long could an American male born in 1900 expect to live?
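Because any figure a key or distractor relies on must hold up to scrutiny, it is worth verifying the “more than 50 percent” claim against the numbers in the text before writing options around it. The short check below is plain Python arithmetic, added here only as an illustration; it confirms the increase for both figures given.

    # Life expectancy figures quoted in the text: 1900 vs. today.
    for label, born_1900, today in [("males", 46, 70), ("females", 48, 78)]:
        increase = 100 * (today - born_1900) / born_1900
        print(f"{label}: {increase:.1f}% increase")   # males: 52.2%, females: 62.5% – both over 50%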


5. Considering the text and these tasks, develop an acceptable key (correct answer) for a multiple-choice question:

   a. Possibilities:

      Life expectancy in the U. S. has gone up since 1900.
      or
      Americans born today can expect to live longer than those born in 1900.
      (Both are paraphrases of the main idea, with a moderate amount of synthesis of information)
      or
      Women’s life expectancy has increased more dramatically than men’s.
      (This is a paraphrase and synthesis of supporting ideas)

   b. Decision: We will ask only one multiple-choice question, focusing on recognition of the main idea. The key will read:

      (A) Life expectancy in the U. S. has gone up since 1900.

6. Develop some plausible distractors:

   Perhaps:

   (B) American men are now living longer than women.
   (C) Newborn females should live 78 years longer than those born in 1900.
   (D) Fifty percent of the American people are female.
   (E) Fifty percent of the American people are living longer.
   (F) The average American has life insurance.
   (G) Fewer newborn babies die today than in 1900.

   The key vocabulary in these options appears in the text. Each might attract some examinees whose reading ability is below Level 2. All of these options are either contradicted in the text (for example, (B)) or are not addressed in the text (for example, (F)).

7. Develop a stem for organizing the options:

   For example: According to this article

   The stem is an introduction to the options that tells examinees how they should address the text. It will normally be written in the same language as the options. The stem may be in the form of a question, a statement, or an introductory phrase.


8. Now put the stem and options together:

   According to this article

   (A) Life expectancy in the U. S. has gone up since 1900.
   (B) Fifty percent of the American people are living longer.
   (C) Fewer newborn babies die today than in 1900.
   (D) Boys born today can expect to live 78 years.

   Note: Some changes were made to the distractors in the process of completing the item.

9. Now we are ready to show the item to a colleague and get feedback.

CONSTRUCTED RESPONSE ITEM

After completing steps 1-4, above, we might decide to write a Constructed Response Item.

5. Develop a stem:

   For example:

   According to this article, an American born today _______________.

   Today’s baby boys _______________ and baby girls _______________.

   Males born in 1900 _______________ while females _______________.

6. Now we are ready to show the item to a colleague and get feedback.


REVENUE

The property tax levied as part of the state’s 1875 constitution was repealed in 1966. In 1967, Nebraska turned to a general sales tax and a personal income tax to generate state revenue. The state’s annual income is about $2 billion. More than 25 percent of this income comes from the federal government in the form of various programs; the rest comes from state taxes.

The state’s annual expenditures are also about $2 billion.

Education is Nebraska’s greatest expenditure, using about one-third of the state’s revenue. Other major expenditures include highways and public welfare.


Take the Wildlife Pledge!

Follow these 4 easy steps to preserve wildlife in your own backyard!

Send your gift to NWF today to save wildlife in America’s forests, rivers, wetlands, prairies and other wild places. Then take these 4 easy steps to open your backyard to a wonderful world of wildlife.

1. Food. Provide as much vegetation as possible including shrubs, trees and other plants that produce foods such as nectar, pollen, berries, nuts and other seeds.

2. Water. Make water available with a birdbath, a small pond, a recirculating waterfall or a shallow dish. All species of wildlife need water for both drinking and bathing.

3. Cover. Provide protective cover for wildlife with densely branched shrubs, hollow logs, rock piles, brush piles and stone walls.

4. Places to raise young. Build birdhouses and nesting shelves attached to posts, trees or to a building. Dense plantings of shrubbery provide safe areas for many species of nesting birds. Species such as salamanders, frogs and toads require a large body of water as a safe haven for raising their young.

Wildlife-Friendly in 4 Easy Steps!

Please visit our website for additional information at www.nwf.org/habitats


ENERGY AND TRANSPORTATION IN IRAQ

Because of its vast oil reserves, most of Iraq’s energy comes from oil-powered plants. Hydropower stations, which have been built along the Tigris River and its tributaries, also generate electricity. The best sites for dams, however, are located on parts of the Tigris that lie in other countries. In 1986 three dams – including a huge, Russian-built structure at Kadisiya – went into operation on the Euphrates River.

Railways and highways link all of Iraq’s major cities. The government owns and operates the state railway, which has about 1,400 miles of track. Rail connections also take passengers to Syria, Turkey, and Europe. Over 6,000 miles of paved roads, including an all-weather highway, link Iraq to neighboring countries. About 250,000 vehicles for passengers and freight use the roads between major Iraqi cities. In rural areas, camels, donkeys, and horses provide transportation.


SHOOTING CLUB FIRE KILLS 12, INJURES 8

BRUSSELS, Belgium – A fire engulfed a shooting club Sunday and sparked several gunpowder explosions, killing a dozen people and injuring eight others as they tried to flee, authorities said.

Most of the victims were badly burned when they were unable to escape from the 901 Club, said Jean Berneau of the emergency coordination center.


A Soviet scheme to divert the flow of two great Siberian rivers southward to irrigate new agricultural lands “may have significant repercussions for the weather patterns of at least a hemisphere, if not the whole globe,” writes Jason Forbes in the London Magazine (July 16).

Soviet planners are determined to reverse much of the flow of the Pechora River and the Yenisei and Ob rivers. This will provide vital irrigation to millions of acres in Kazakhstan, a region opened to agriculture only in the past thirty years, and triple the grain production in the huge area. Dr. Forbes says, “As far as it is possible to tell, the Soviet planners have dismissed the possibility that the side effects might do more harm than the irrigation does good.”

The diversion will reduce by 20 percent the fresh water flow into the Arctic Ocean. “The ice-covered Arctic is the key factor in establishing the climatic patterns of the Northern Hemisphere.” The diversion will allow the warmer salt water to rise to the surface and melt the ice. “By reducing the fresh water flow and causing the ice to break up, the irrigation schemes may directly reduce the rainfall in exactly the regions they are designed to help.”


THEY CALL THIS PROGRESS?

Paying bills by phone sounds appealing. It takes less effort, and most consumers assume it’s speedier than paying by mail.

Not so, says Consumers Union. If you use your bank’s bill-paying service, the process may actually take longer. That’s because payments are still made by mail after you call. And since the bank service doesn’t enclose the electronically coded bill stub, the creditor may take longer to process the payment. What’s more, many banks use centralized bill-paying services. So while you think the bill is being paid locally, it might be sent from across the country.

Why worry about the delay? Well, for one thing, when you hang up the phone your account is immediately debited by the amount of the payment – so you may lose a few days’ interest unnecessarily. More important, if you’re paying a credit-card bill, the time this process takes could leave you paying penalty charges.


Terrorism, Real and Imagined

Will we learn the right lessons from our failure to protect the U.S. embassies in Nairobi and Dar es-Salaam? It’s worth asking, because there is good reason to assume that the most relevant questions about this painful matter involve the Clinton administration’s habit of worrying so much about terrorism conducted with weapons of mass destruction that it may be neglecting the ever-present risks of conventional terrorism.

No security service in the world can provide its clientele with impenetrable anti-terrorist protection. Security services are reasonably expected, however, to learn from experience, to identify dominant terrorism trends and to prepare for these contingencies.

In both domestic and international terrorism there has been, since 1983, no more visible trend than car bombs – the kind used at the Marine barracks and American Embassy in Beirut, the World Trade Center, the federal building in Oklahoma City, the Israeli Embassy and the Jewish Community Center in Buenos Aires, and the Khobar Towers in Saudi Arabia.

A terrorist pattern has been systematically established: unclaimed car bombs. To believe that anti-American terrorists would refrain from using this tactic against U. S. embassies just because these commonly targeted symbols are not located in Tel Aviv, Riyadh or Kuwait City was naïve and unprofessional.

That the CIA and the State Department were aware of the problem is evident from The Post’s reports on the success of the agency’s operatives in foiling several recent attacks on American embassies, and from Ambassador Prudence Bushnell’s warning letters to her superiors in Washington about the embassy’s security problems.

Yet any elementary examination of America’s counterterrorist policy in recent years reveals a preoccupation with unconventional terrorism and a steadily growing conviction that the next blow to the United States will involve the successful use of chemical, biological or radiological weapons. Three key events seem to have convinced the secretary of defense and his top officials that mass destruction terrorism is almost inevitable: the 1995 nerve gas attack on a crowded Tokyo subway station by the Japanese millenarian cult Aum Shinrikyo; the 1997 disclosure of alarming information about the former Soviet Union’s massive biowarfare program; and the disturbing discoveries about the extent of Saddam Hussein’s hidden chemical and biological arsenal.

A fourth element was the impact on President Clinton of a popular science fiction novel. “The Cobra Event” describes in chilling detail a terrorist attack on New York City with a genetically engineered mix of smallpox and cold viruses. According to a report by William J. Broad and Judith Miller in the New York Times Aug. 7, Clinton became so fixated on the threat that he urged Speaker Newt Gingrich to read the book and make urgent preparations against an unconventional terror attack on the United States his personal project.

Billions of dollars have been sought by the administration since 1995 to prepare America for the shock of mass destruction terrorism, and Congress has been quick to provide the money. More important, the new emphasis has resulted in the replacement of traditional terrorism specialists by biologists and chemists. People who are respected scientists but who have never talked to or studied actual terrorists have become the president’s top advisers on counterterrorism. Thus, while the agencies responsible for protecting U.S. citizens and installations abroad were sinking into monotonous routine work, America’s most creative counterterrorism thinkers were devoting themselves exclusively to answering challenges posed by weapons of mass destruction.

The dual fallacy upon which the current frenzy is based was clear even before the explosions in East Africa. The expectation of a massive chemical or biological attack is not based on actual terrorist incidents, and it ignores preparations for a potential new wave of conventional terrorism. So far (and this includes the famed 1995 Japanese subway attack) the world has not witnessed any mass-casualty event resulting from unconventional terrorism. Most of the funds allocated to countering this threat have been committed on the basis of dubious conjecture and unsubstantiated worst-case scenarios.

In all this time, no serious thinking was devoted to what might happen when ordinary terrorists decided to resume their attacks and do what they know best: identify soft American targets, assemble conventional explosives and kill a large number of unprotected civilians.

There is, in fact, neither empirical evidence nor logical support for the growing conviction that a “post-modern” age of terrorism is about to dawn, an era afflicted by a large number of chemical and biological mass murderers. Terrorism, we must remember, is not about killing. Terrorism is a form of psychological warfare in which the killing of a relatively small number of innocent civilians is used to send a brutal message of hate and fear to hundreds of millions of people. Most known terrorists are unlikely to resort to weapons of mass destruction for the simple reason that they do not need them to accomplish their goals.

There are steps that can be taken now to ensure that there are no more Nairobis and Dar es-Salaams. The most important would be to rediscover conventional terrorism and reallocate the nation’s counterterrorist resources accordingly.


ECUADOR’S NATIVE CURRENCY DIES – DOLLAR DAYS FROM NOW ON

Nation keeps fingers crossed on new cash

QUITO, Ecuador – In the dimly lit Poor Devil Pub, a wall poster speaks volumes about Ecuador’s clash of sentiments in the final hours of the sucre, the local currency that will cease circulation at midnight today, to be replaced by the U. S. dollar.

Advertising a symbolic burial of the sucre by a group of artists in a downtown cemetery, the poster shows a bullet hole through the head of Gen. Antonio Jose de Sucre, the 19th century national hero for whom the currency is named. Scrawled across his face is: “Hasta la vista, baby.”

“Of course we are hoping the government is right, that this is the path to a better life,” said Marisa Berrios, organizer of the burial, which will take place tomorrow – the first day that the sucre, after a six-month phaseout, will no longer be accepted at banks or stores.

“But we’ve sold our national identity on a bet. I have this feeling that what was once a poor country with the sucre may turn into just another poor country with the dollar. We are asking ourselves: Have we done the right thing?” Many people are awaiting the answer in Ecuador, which has become a test case for the extreme tactic of “dollarization”.

Some experts call dollarization the wave of the future for countries seeking a firmer rudder to navigate the new globalized economy. But the full effect won’t be known for years. And in the meantime, for many in this troubled South American nation of 12 million people, “D-day” is bringing mixed results.

After suffering its worst economic crisis in 50 years from damaged crops, social unrest and rampant political corruption, Ecuador decided in January to adopt the dollar as a last ditch effort to avoid an economic collapse.

Economic growth has been modest since then, while inflation has been higher than expected due to a surprisingly fast hike in consumer prices. At the same time, underemployment – people with marginal work, who generally earn less than the $116 a month minimum wage – has risen from 48.2 percent of the work force in March to 58.2 in June.

The central Bank began replacing sucres with greenbacks in March. Today, over 90 percent have been replaced with about $400 million from Ecuador’s bank accounts in the United States.

Those sucres still left can be exchanged at the Central Bank during the next six months. But legal tender for all business here will be the dollar. Next week, Ecuador takes the next step, releasing newly minted coins in the same sizes, denominations and colors as the U. S. penny, nickel, dime, quarter and 50-cent piece.


Ferret Gets Gold Crown on Injured Tooth

One of Wyoming’s rare black-footed ferrets underwent a fairly routine dental procedure – for humans, that is – and now sports a gold crown on one of his teeth, state wildlife officials said.

The State Wild Game Commission officials said they were concerned that the ferret, named Clark, could have contracted a potentially serious infection because of a tooth ailment.

Clark is one of 25 black-footed ferrets housed at the Wild Game Commission’s Mitchell Wildlife Research Center in northern Wyoming. They are the only black-footed ferrets known to exist in North America.

The department called a specialist when biologists realized Clark was suffering from a broken and abscessed canine tooth. Dr. Michael Fountain, who specializes in wild animal dentistry, performed a root canal and capped Clark’s tooth. Clark’s fine now, the agency said.

Officials said Clark is especially valuable to the agency’s ferret breeding program because he is not closely related to the other captive ferrets. Because of the small number of animals, genetic diversity is very important, they said.

An outbreak of canine distemper in the late 1980s in a wild ferret colony near Minihinsee threatened to wipe out the tiny animals. But Wild Game Commission biologists managed to capture the remaining 18 ferrets and establish the captive breeding program.

The effort at the Mitchell Center had been unsuccessful until last spring, when eight kits were born. All but one lived.

Biologists hope to continue the breeding to expand the ferret population and eventually reintroduce the weasels into the wild at several points in North America.


TIPS TO REDUCE RISKS

In brief, the following are specific suggestions to help avoid heart disease:

• If you smoke, quit.
• If you are overweight, lose weight.
• Reduce the amount of cholesterol and saturated fat you eat.
• If you have high blood pressure, lower it. If it is normal, keep it normal.
• Get regular aerobic exercise: do some kind of aerobic exercise three or four times a week for at least 20 minutes each time.
• Learn to manage stress.


Two Salinas men were arrested Wednesday in connection with an April 17 stabbing outside a restaurant, police said.

Gabriel Cadena, 23, and Johnny Cambunga, 22, were being held on charges of attempted murder, assault with a deadly weapon, and battery in connection with stabbings that injured two employees of Rosa’s Café in Salinas, police said.

Witnesses told police two men were fighting with a broken bottle. When a restaurant employee tried to intervene, he was stabbed in the back. Another employee was stabbed in the face and neck. A third employee was shoved against the outside wall of the restaurant.

All three employees were treated and released.


South Africa is shooting pigeons in its diamond producing area because the birds are being used to smuggle gems out of the country.

Diamonds are leaving the country in an extremely worrisome manner: strapped onto the bodies of pigeons and flown out of the country. The law is now to shoot all pigeons on sight. Mineworkers have been implicated in the widespread theft, and diamond producers will need to spend about $8 million to improve security.


HOT FUDGE SUNDAE FOILS PURSE ATTACK

ALBANY, N.Y. – A 56-year-old woman fought off a would-be purse snatcher with a hot fudge sundae, police said.

The woman, who was not identified, was accosted at a downtown intersection about 7:30 p.m., Tuesday, by a young man who tried to grab her purse from her shoulder, said Sgt Howard Busch.

She “let him have it,” Busch said. “She struck him repeatedly with a hot fudge sundae she was carrying.”

The man fled, dripping ice cream, the officer said. The woman suffered a minor cut on her hand and scratched her glasses, he said.

No arrests have been made.


VINEGARS

Vinegars can be used fresh and young as well as aged, round and full, and their flavors are infinite. Good vinegar ages as does wine, carefully tended in casks or laid down in bottles. Some are family secrets, while others are the specialty of large companies whose initial goal was to produce consistent vinegars to be used in pickling, a highly critical preservation process in pre-refrigeration days. Today we sometimes add vinegar to provide the acid that will prevent the development of botulism in canned vegetables. Tomatoes and other fruits are often picked before they are fully ripened and before their natural acidity is naturally developed; the addition of vinegar with at least 5 percent acidity remedies the problem.

Our palates are not sure guides to acid levels. The heavy tastes in balsamic vinegar, for instance, can conceal a high acidity of 7 percent while grain vinegars, offering few additional flavors, will seem more acid than they are. Today we refer to “red wine vinegar” and have indeed taken a step toward specificity from the indiscriminate “vinegar” of 50 years ago, but the differences are enormous between a rich, fruity, joyously young California cabernet vinegar and the thinner, more aristocratic, seemingly more acid product of Dijon. And both are light years away from the unpleasantly thin, barely red, watery liquid that passes as red wine vinegar in many supermarkets.


The Echoica leadership has warned the press to follow the party line and resist the arguments of opponents of the country’s unorthodox self-management system of socialism.

A statement by the nine-man collective presidency last week said “forces of international reaction” are bringing pressure on Echoica in an effort to destabilize it.

Their object is “to bring in doubt and spread despondency and disbelief in Echoica’s capacity to overcome its problems,” the statement maintained.

Echoica’s grave economic problems include debts to the West of $20 billion, high inflation, an acute shortage of hard currency, stagnating industrial production and falling living standards.

The statement said hostile forces, which it did not identify, are trying to penetrate socialist institutions and to influence the media.

Media editorial boards are giving a distorted picture of Echoica reality, it said.

The statement did not specify which articles are in disfavor.


In the technological age, there is a ritual to disaster. Each piece of physical evidence in a plane crash becomes a fetish object, painstakingly located, mapped, tagged, and analysed, with findings submitted to boards of inquiry that then probe, interview, and soberly draw conclusions. It is a ritual of reassurance, based on the principle that experience can be preventative. But what if the assumptions that underlie disaster rituals are suspect? Scholars have made the unsettling argument that public post-mortems are as much exercises in self-deception as they are genuine opportunities for reassurance.

For these revisionists, high-technology accidents may not have clear causes at all. They may be inherent in the complexity of our technological systems.


EXAMINING THE PHYSICS OF THE DEADLY DYNAMIC OF PANIC IN CROWDS

Around the world, wherever large groups gather, people have witnessed the mysterious power of panic.

Twelve people died in July at a soccer match in Zimbabwe. A month earlier, nine fans were crushed to death at a Pearl Jam concert in Denmark. And several times in the last decade, Muslim pilgrims making the holy journey to Mecca have metamorphosed into sprawling, deadly human stampedes – all episodes seemingly without reason.

A team of scientists now say they have built a powerful computer program that, for the first time, solves the mystery of how crowds move under pressure, a program that will help architects and emergency planners prevent panics.

Unlike past studies, which tried to understand the mentality of the mob, the new simulation suggests that human traffic jams are almost as predictable as the motion of the planets. The human tendency to follow the crowd, and to push when stressed, can unleash forces powerful enough to bend steel or topple brick walls, the authors say, unless architects redesign public spaces to create a smooth flow of people even when they panic.

“This is an elegant and important study,” said Gareth Watson, an associate professor of mechanical engineering at the Chicago Polytechnic Institute. “It will even save lives.”

As the world becomes more crowded and connected, scientists are increasingly confronted with complex phenomena – stock market crashes, traffic jams that arise for no clear reason – which emerge when large numbers of individuals interact. The crowd study is part of an ambitious attempt undertaken over the last decade to harness cheap computing power to decipher the forces that drive these systems.

“The biggest surprise is that panic and human behavior can be understood in terms of a physics model,” said Arnold Hoven, one of the researchers who developed the program, which is described in the current issue of Nova. “Before people believed that interactions in crowds could only be understood in psychological terms.”

In one famous example, after 11 fans were crushed to death at a 1979 Who concert in Cincinnati, the incident was blamed on “mob psychology”, on mind-altering drugs, and even on a strong vein of uncaring in the youth of the day.


But the new research argues that the blame lies elsewhere.

Researchers know that, if people are brought close together, they tend to push on each other. They also know that people have a tendency, especially in emergencies, to follow others near them, on the assumption that those around them know more.

When Hoven programmed these kinds of pedestrian forces into a computer, the result is a flowing dance of flickering dots, each representing a person trying to make their way.

In one scenario, as people in a room start to push too hard at an exit, the screen shows a kind of human turbulence develop, and it takes more time for the dots to escape. In another simulation, Hoven reported that the number of people able to leave a “smoky room” dropped from 74 to 58 as the tendency to herd, called the “panic parameter”, increased.

An even more unexpected result, Hoven said, was that a well-placed narrow column in front of an exit allowed more people to escape because people in the back of a crowd could not place as much pressure on those trying to get out. It’s an architectural concept, he said, that he hopes to patent.

One scientist criticized the study for not refining the model with more precise observations of how pedestrians interact. “These parameters need to be calibrated against data,” said Joseph Spolsky, a Chicago Polytechnic professor who has spent years gathering that kind of information on Illinois drivers for a simulation of traffic flows.

Gathering that data, he said, “is the really hard work.”

BILC Language Testing Seminar III-41

Chapter III Evaluation of Reading Comprehension

HOLD THAT TIGER

A wildlife controversy “of international dimensions” is brewing in a remote Indian district bordering Nepal, reports Rini Shahin in the Independent Statesman of New Delhi/Calcutta. It involves a conservationist named “Jacky” Acjan Mehta, his six-year-old “foster daughter” Tiffany, and some angry scientists. Tiffany, a tiger born in a British zoo and now living in the wild – if she is still alive – is charged with having “introduced a European strain” that “genetically polluted the breed of the Royal Bengal Tiger.”

She also is suspected of more heinous crimes. Jacky’s adversaries note that in recent years in the area “where Tiffany opted for independence” twenty-two people have been killed by an unidentified man-eater.

BILC Language Testing Seminar III-42

Chapter III Evaluation of Reading Comprehension

Mexican Lizards First Born in Captivity

DETROIT (AP) – Four Mexican beaded lizards were hatched at the Detroit Zoo, and zoo officials say the births, possibly the first in captivity, were quite a surprise.

The zoo has three adult lizards, but no one was exactly sure if they were males or females, said William A. Austin, curator of education.

Four eggs hatched between Jan. 31 and Feb. 11, but zoo officials waited until last week to announce the births to protect the newborns from too much attention.

BILC Language Testing Seminar III-43

Chapter III Evaluation of Reading Comprehension

VACAVILLE

HELICOPTER CRASHES; PILOT WALKS AWAY

A helicopter dusting crops crashed and burned Saturday, but the pilot walked away. No one else was on board.

The aircraft went down in a field about 100 yards north of Interstate 80, west of Vacaville, at about 3:30 p.m., Solano County Sheriff’s spokesman Gary Faulkner said.

Downed power lines in the area may have played a role, and the Federal Aviation Administration will investigate, Faulkner said.

The pilot, whose name has not been released, was taken to Queen of the Valley Hospital in Napa where he was treated and released.

BILC Language Testing Seminar III-44

Chapter III Evaluation of Reading Comprehension

By following these simple tips we can all do our part to use water wisely

1. Use your automatic dish-washer only for full loads. It uses about 25 gallons a cycle, so don’t turn it on unless it’s full.

2. If you wash dishes by hand, don’t leave the water running for rinsing. Fill your second sink or a large dishpan with clear rinse water as needed.

3. Don’t let the faucet run while you clean fruits and vegetables. Instead, rinse them in a sink of clean water.

4. Use your automatic washing machine only for full loads. Your clothes washer uses about 35 gallons. Make sure you save up for a full load.

5. Don’t run the hose while washing your car or boat. Soap down your car or boat using a pail of soapy water, then hose only to rinse.

6. Put a layer of mulch around trees and plants. Mulch retains moisture and plants require less frequent watering.

7. Water your lawn only when it needs it. Step on the grass. If it springs back when you take your foot off, it doesn’t need water. If it does need water, give it a deep soaking.

8. Drip irrigation ensures that the water you use gets right to the roots of plants – where it will do the most good. Consider installing a drip system if you have extensive plantings.

9. Don’t water the gutter or sidewalks. Position your sprinklers so water lands on your lawn or garden, not on concrete or other paved areas.

10. Check faucets, pipes, hoses and sprinkler heads for leaks. If you find one, have it fixed.

California-American Water Company www.calamwater.com

BILC Language Testing Seminar III-45

Chapter III Evaluation of Reading Comprehension

JURY CONVICTS ARTIST

SANTA CRUZ – A Santa Cruz artist who claimed his bank account grew from less than two dollars to more than $4 million because he meditated for fame and fortune was convicted this week of grand theft.

A Superior Court jury deliberated for two hours Thursday before delivering its guilty verdict in the case of Winston Cheever.

Cheever, 40, was accused of stealing $2,080 from First State Bank of Santa Cruz by manipulating the bank’s automated teller machines last July. Cheever contended the money appeared in his account after he meditated.

BILC Language Testing Seminar III-46

Chapter III Evaluation of Reading Comprehension

AUSTIN (AP) – Bypasses used as safety devices to route natural gas around meters can be used by pipeline companies and others to steal gas, a Houston engineer complained to a Texas Railroad Commission panel on Thursday.

Lee Bergquist said the gas thefts could cost the state up to $100 million in taxes.

“I don’t know of any installation where the bypass line is either needed or justified. If there is, I’d like to see it,” said Bergquist.

The commission has proposed new rules governing gas measurement. Bergquist opposes the rules because he favors a ban on bypasses.

However, witnesses for the Texas Oil and Gas Association testified the bypasses are vital safety precautions used to reroute gas so the meters can be serviced.

Bob Ridley, an Association witness, said, “You could not inspect the meter without this so-called bypass.” Ridley said checking the meter without rerouting the gas could damage wells and facilities on the line.

BILC Language Testing Seminar III-47

Chapter III Evaluation of Reading Comprehension

An Unprecedented Crisis...

An unprecedented crisis is shaping up in the country’s health services. It has been simmering for some time as government hospital doctors engaged in various types of labour actions to protest the Treasury’s refusal to grant them pay raises, in the form of the overtime allotments that were arranged within the Histadrut’s Kupat Holim health fund.

But then the rug was pulled from under the government doctors’ demands by a last-minute piece of smuggled legislation, presumably at the behest of Finance Minister Moshe Nissim, which in effect declared illegal the Kupat Holim agreement. According to this legislation, any organization, like Kupat Holim, which receives government subsidies is barred from concluding wage agreements without the Treasury’s approval.

When the Histadrut Secretary General informed the Treasury that Kupat Holim would observe the law, the health fund’s doctors took immediate counter action, closing all hospital operating theatres to all but emergency cases. Now after some second thoughts, Mr. Kessar has instructed his lawyers to determine whether in fact the Knesset action is binding since it is retroactive legislation. But the doctors are continuing their sanctions and threaten to intensify them...

The Health Minister, Mrs. Arbeli-Almoslino, wholly shunted aside by the Finance Minister, has appealed, justifiably so, to the Prime Minister for his personal intervention to avert catastrophe. He and his Cabinet as a whole should now take responsibility for who lives and who dies in this country as a result of the disarray in health services, she demands.

Mr. Shamir, however, is diverted by other things, including stonewalling Mr. Shultz. But that is not the major reason for this consistent reluctance to enter the fray. More important is the wide latitude he has given Mr. Nissim to run the government’s economic affairs. This makes it all the more difficult for him to step in when Mr. Nissim feels he is battling to save the Treasury’s overall policy...

BILC Language Testing Seminar III-48

Chapter III Evaluation of Reading Comprehension

TUNA DIPLOMACY

The U.S. Senate has passed, unanimously, the South Pacific Tuna Treaty. It was an action that failed to make even the small-type legislative roundup in most American newspapers, but it is a matter of deep concern in a number of small island nations.

In recent decades, as colonialism became an anachronism in the Pacific, a whole series of mini-states have come into being. They are jealous about their independence but lack the strength to protect it.

They also share another common trait; most came to independence with limited capital resources.

These island states do have common access to one source of wealth: the sea. And they have been sorely put upon because foreign tuna fleets have been taking harvests in their waters without permission or compensation. The situation has created a diplomatic opening which Russia has been attempting to exploit, with some limited success.

According to American Samoa congressman Fofo I.F. Sunia, the new treaty will provide $50 million over the next five years to compensate for exploitation of fishing resources within territorial limits of the small South Pacific countries. U.S. fishing vessels will be subject to fees to raise some of this money.

It is a matter of simple equity that these lands receive some compensation when one of their most important resources is extracted by foreigners.

BILC Language Testing Seminar III-49

Chapter III Evaluation of Reading Comprehension

Severely depressed people tend to display the inner pain they feel externally: often through a lack of concern about dress, appearance, and grooming. A recent report suggests, however, that people who are not so severely down may actually try to enhance their appearance to lift their spirits.

Researchers Kay Parrsons and Ida Lowell had 25 women complete questionnaires assessing their daily mood, appearance, and clothing for four weeks. About half of the women had voluntarily sought emotional counseling at a university women’s center; the rest were North Florida State University faculty members and staff.

Since depression is marked, in part, by lowered self-esteem and increased insecurity, Parrsons and Lowell say they expected to find that the women’s ratings of their clothing and appearance would fall as their moods worsened. But in analyzing responses to the questionnaires, they found the opposite: as the women’s level of depression rose, so did their positive feelings about their clothing and appearance. They speculate that for someone who is depressed, whose self-confidence is low, clothes and appearance may take on greater importance than for someone who is not depressed. In fact, they reason, since the women completed the questionnaire after they dressed each day, “it is possible, at least on some days… that depressed persons use clothing as a tool to boost morale.”

BILC Language Testing Seminar III-50

Chapter III Evaluation of Reading Comprehension

NEW YORK (AP) – Parents may harm their children by refusing to let them believe in Santa Claus, a psychologist says.

“Being the only one in the classroom who knows for sure that there’s no such guy could make a child feel very different, very strange,” said Dr. Harold R. Mackey of the Carstairs School of Medicine in New Jersey.

“In some instances, the denial could lead to a sense of strength – ‘This is my family’s way of doing things and I’m proud of it.’ But that would be unusual,” he added in an interview in the December issue of Family Magazine.

Dartmouth psychologist Lloyd Jensen said parents should examine their reasons for refusing to let a child believe in Santa Claus. “Some mothers and fathers are too constricted to feel comfortable with fantasy and may not want their children to be fanciful either,” he said. “You really ought to be able to get down on the floor and make believe with your children.”

Parents who notice their children are growing suspicious about the existence of Santa Claus shouldn’t necessarily try to bolster their belief, the magazine said. Instead, they should recognize that their children’s thinking is becoming more sophisticated.

BILC Language Testing Seminar III-51

Chapter III Evaluation of Reading Comprehension

Germans to Investigate Jail Hanging

HANOVER, Germany (AP) – The state Justice Ministry will investigate the death of a 14-year-old Turkish youth who hanged himself in a jail cell, authorities said Thursday.

The boy killed himself on May 11 in the Vechta penitentiary, near Oldenburg in the state of Lower Saxony, a spokesman for the Justice Ministry in Hanover said.

The Turkish youth was the fifth young person to commit suicide in Lower Saxony penal institutions in the past year, the spokesman said.

A subcommittee in the Hanover parliament is investigating the deaths.

BILC Language Testing Seminar III-52

Chapter III Evaluation of Reading Comprehension

Fighting Raged...

Fighting raged in northern Chad yesterday as a French military mission arrived to assess President Hissene Habre’s needs in his attempt to drive out an estimated 8,000 Libyan troops.

Army Chief of Staff Gen. Jean Saulnier, who heads the French mission, hinted that France’s military presence in its former central African colony could be boosted if necessary.

He told reporters after a two-hour meeting with Habre that Operation Sparrowhawk – consisting of 1,200 men, several Jaguar fighter-bombers and sophisticated radar equipment stationed south of the 16th Parallel – had responded adequately to the current military situation.

But he added that this might not always be the case. Pressed by reporters, Saulnier said only that he would let Habre disclose the result of his talks.

The military high command reported fierce fighting around Yebbi-Bou in the rugged northwestern Tibesti mountain range where it said 15 Libyan soldiers were killed, several captured and seven military vehicles destroyed.

BILC Language Testing Seminar III-53

Chapter III Evaluation of Reading Comprehension

The winter of 1890-91 was very cold in Russia, but snowfall was light. The snowmelt in the spring of 1891 provided little water for farm fields, and the weather was very dry. Western Russia experienced a drought. Ponds and wells dried up, and winds lifted topsoil high into the sky. The grain harvest – particularly of rye, which was the main food source for many peasants – was poor.

Russia did have a system to combat famine. During years of good harvests, farmers were supposed to fill storehouses with grain so that there would be enough food during famine years. In addition, Russia’s government had a plan to distribute money, food, and seeds during famines.

While this system was a good idea, it did not work well. The farmers were too poor to keep the storehouses filled. Russia’s government did not have nearly enough money or grain to combat a major famine. Geography also presented a problem; the large country of Russia lacked a transportation system that could distribute food through the wide area.

BILC Language Testing Seminar III-54

Chapter III Evaluation of Reading Comprehension

VINEGAR’S A WINNER

Articles about using vinegar around the house have been very popular in the past. We now have some new ideas to share with you.

1. The number one job of vinegar these days is ridding your automatic coffee pot of mineral deposits. Run a few cups of vinegar through the coffee-making cycle.

2. Safe, non-toxic vinegar and water make a good cleaner for glass, mirrors, TV screens, and windshields. Mix equal parts of each, use a clean cloth, and dry immediately.

3. Water spots on glasses, as well as tea and coffee stains on cups, disappear when wiped with a vinegar-soaked cloth. Tarnished brass and copper shines again when rubbed with salt and vinegar.

4. Grimy dirt and grease will disappear from the top of the refrigerator and stove when wiped with full strength vinegar.

5. Keep fresh cut flowers blooming longer by adding 2 tablespoons of vinegar plus 1 tablespoon of sugar to each quart of warm water. Keep flowers in 3-4 inches of water to allow constant nourishment.

BILC Language Testing Seminar III-55

Chapter III Evaluation of Reading Comprehension

THAT TALL, THIN ASEXUALLY DRESSED COWORKER MAY BE A WOMAN

Women in traditionally male jobs, at least those in the Washington area, tend to be tall, thin and have short hair. Those in traditionally female roles tend to be just the opposite – shorter, heavier, with lighter colored and longer hair.

Those are the conclusions of two researchers from the Virginia Southern University. Just why they studied the situation is not clear from the AP dispatch from Charlottesville, but they did question 770 women between the ages of 25 and 35 in Virginia, Maryland and the District of Columbia in situations that ranged from conferences to laundromats. They also knocked on the doors of some homes.

Jeffrey Stern, an assistant biology professor, and student, Eleanor Tibbs, found that the taller, thinner women doing what had been considered men’s jobs tend to dress more austerely than those in so-called feminine jobs. Necklines are higher and hemlines are lower.

“If a woman works alongside a man and exudes sensuality, it would hinder the male’s office performance and, consequently, hers,” he said.

Stern acknowledged that the study might come under fire from some feminist groups, saying: “People can do anything they want with it. But this is the data.” Or, as some purists might say, these are the data.

BILC Language Testing Seminar III-56

Chapter III Evaluation of Reading Comprehension

What were the central differences between the world of classical antiquity and that of the barbarians who destroyed and replaced it? We can set out an instructive list of opposites of principle. Thus classical culture was based on the city: not only great cities like Athens and Rome, but the network of quasi-self-governing cities that made up the Roman Empire. The Goths and the Vandals, the Alemanni and the Franks, and the other units that came into existence and went out again, as a successful or charismatic leader came to attract a horde of followers to his victorious standards, were based on the tribe and, within the tribe, on kinship groups. The tribe was led by kings with hereditary right, based on a long genealogy going back into the mists of heroic myth. The classical city was ruled by magistrates, in principle elected: even the Roman emperors had to go through a process of ratification (increasingly formulaic) by Senate and people, and the emperors never succeeded in making the empire a hereditary one. From the beginning they found it convenient to have a bodyguard composed of Germanic tribesmen, whose loyalty was personal, to them alone, not to the Senate. That was a custom which fundamentally contradicted the nature of the classical city.

It followed that where the ancient city had citizens, the Germanic tribe had henchmen, followers, the ancestors of the feudal subordinates of later Europe. The city had elections; the tribe had hereditary right. The elected magistrate was responsible to the electorate, and could be prosecuted for misconduct when he left office; the tribal king was absolute, and (as in much of the third world nowadays) he lost power only by death – which might be by defeat in battle, or by assassination, or by both. The citizen did not go about armed, and the assembly of Greek free men in the agora, or of Roman citizens in the Forum, was unarmed; the assembly of the barbarians was of warriors carrying their weapons, a Wapentake, in the old English word, with which we can compare the German Waffentraeger. Applause or dissent in the classical assembly was by shouts or hisses; in the tribal assembly, men banged on their shields with their weapons.

The city had law courts; the tribe had on the one hand the ordeal and on the other, intimately linked to it, the duel. Classical citizens and aristocrats had long ago abandoned the custom of personal combat with social equals. That was simply impossible. When poor Mark Antony, after losing the Battle of Actium, in his desperation issued a public challenge to Octavian to fight him in single combat, Octavian felt perfectly confident in laughing at him, remarking dryly that he had many other ways to die. You prosecuted an enemy, you stood against him for promotion and honors, but you did not fight him; that was for gladiators. It is indeed a curious reflection that it was out of the question for a pagan gentleman to duel, although his gods expressed no particular view on that matter; but when the Christian religion added its repeated ban on dueling to the prohibition of the criminal law, it became nonetheless, in practice, compulsory for the European – and even for some of the American – upper class.

The classical city strove to eliminate the blood feud; it based its procedures on a written code of laws and a written constitution. The tribe relied on the memory of the elders, and the blood feud was a sacred duty.

To the books which were one of the defining marks of Greco-Roman civilization the tribe opposed orality; to the poems of Homer and Virgil, studied by everyone at school, the tribe opposed the illiterate bard; to written histories, the poetical and cloudy memory of the sagas.

The ancient city was marked by impressive buildings, in marble if the community could afford it, but in any case of clearly defined and classical style. It possessed colonnades for the citizen to take the air without discomfort from the heat; it boasted a theater for the performance of the masterpieces of Greek drama, and an odeum, rather smaller, for musical recitals. Everywhere there were temples, their stone roofs supported by elegant marble columns, the petrified and stylized representatives of the trees which formed the sacred groves in which the barbarians adored their gods – with, so the polished city dweller loved to hear, horrid rituals of human sacrifice.

In the arts, whether visual or verbal, ordered and increasingly standardized forms and conventions marked the city as belonging to the international community of civilized mankind. As the modern tourist still finds, all the way from Portugal to Iraq the ruins of classical cities exhibit a uniformity of style unmatched in any later period, even in our own increasingly standardized age. Those cities, with their massive buildings and calculated display of wealth and sophistication, at once intimidated the barbarians and incited them to desire their treasures. In a later day Prince Blücher, who commanded the Prussians at Waterloo, said when he was shown the sights of London: “Was zu plündern!” “What a place to loot!” It was the age-old response of the Germanic warrior to the peaceful opulence of the city.

The Romans were famous for their straight roads, their self-discipline, their military science; they confronted a world of the limitless forest, inhabited by men of enormous size, violent in emotion and action, drinking themselves into insensibility, rushing into battle with a terrifying attack which, if resisted, might turn as suddenly into panic. The classical world struggled, in part and at certain periods, out of the grip of superstition and into the clearer air of rational procedures and scientific thought.

BILC Language Testing Seminar III-57

Chapter III Evaluation of Reading Comprehension

SHRINE TO THOREAU AT WALDEN

Walden Pond, near Concord, Mass., bears very little resemblance now to the quiet and pastoral place where Henry David Thoreau retreated in 1845 and demonstrated his harmony with nature in a one-room cabin near the water’s edge.

Across the road from the greenish pond is the town dump. A developer is planning 250 housing units nearby. In the summer, up to 5000 visitors a day come for bathing, boating, fishing, and picnicking.

Some environmentalists are fighting an uphill battle to maintain the property as a sequestered shrine, free from trash and pollution.

Confronted with a law granting almost unlimited public access to the area, those who would preserve the pond can only mull over one of the classics of American literature, an essay on passive resistance known as “Civil Disobedience”, by Henry David Thoreau.

BILC Language Testing Seminar III-58

Chapter III Evaluation of Reading Comprehension

NEVER MIND SAVING THE COUNTRY

For Craig Meriweather, born into what he wryly calls the equestrian class, his book “The Class System in America”, is in some measure self-examination as well as social criticism. “I came to imagine,” he writes, “that I was born to ride in triumph and others were born to stand smiling in the streets and wave their hats.”

“There’s an element of exorcism in my examination of the topic,” he said in a telephone interview from his office at Carter’s magazine.

“Like many other people, I’m not immune to the seductions of money and wealth. So if questions of money and celebrity become secondary, become like some superstition, if money isn’t a hard wall of fact, then I’m freed to be more contemplative.”

According to Mr. Meriweather, America’s wealth-equals-worth formula is the root of its social and economic ills. “I don’t know how to get out of this dilemma because I’m not an economist or a politician,” he said, “but I do know… we’re going to need more humor than is now evident.” Americans, Mr. Meriweather suggests, would do well to look at money as a commodity, “like pork bellies”. Beyond this, he has few answers. The writer’s role, he says, “is one of commentary and observation, not trying to save the country.”

BILC Language Testing Seminar III-59

Chapter III Evaluation of Reading Comprehension

ROTTWEILERS NOW ‘DEADLIEST DOG’

It’s not a record anyone would be proud of, but a study released by veterinarians Friday found that rottweilers have passed pit bulls as the deadliest dog breed in the United States. The authors didn’t blame the animals, but people for not knowing how to train their dogs and others for not knowing when to stay away from unfamiliar dogs.

Rottweilers were involved in 33 fatal attacks on humans between 1991 and 1998, the American Veterinarians Association said. Pit bulls, which had been responsible for more deaths than any other breed, were involved in 21 fatal attacks over the same period.

Rottweilers, first bred in Germany, surged in popularity during the 1990s as more people sought them for protection, said Roger Sparks, an epidemiologist with the Organization for Disease Prevention.

“People are more in fear of crime and violence, and this has led to a selection of bigger dogs,” he said. “If you start selecting bigger dogs, you’ll get bigger bites.”

The study’s authors, using data from the Humane Association of North America and media accounts of dog maulings, reported 27 people – 19 of them children – died from dog attacks in 1997 and 1998.

The numbers highlight widespread mistreatment of dogs and a growing public ignorance of how to behave around them, researchers said. They blamed adults for not teaching children to stay away from unfamiliar dogs. “It’s not a Rottweiler problem or a pit bull problem,” said Frances Swinberg, the Humane Association’s vice president for research and educational outreach. “It’s a people problem.”

The annual number of reported fatal attacks has not varied widely in the past 20 years, the study said. But overall attacks are on the rise – likely because families are busier, leaving them less time to train their dogs and watch their children.

“A dog has to have its behavior monitored and consequences put in place,” Sparks said. “People don’t seem to have a lot of time in their lives for that.”

Pit bulls led all breeds for fatal attacks between 1979 and 1998, with at least one pit bull involved in 66 mauling deaths, the study said. Rottweilers were blamed for 37 – most of those in the 1990s – followed by German shepherds with 17 and huskies with 15.

Researchers cautioned the breakdown does not necessarily indicate which dogs provide the highest risk of fatal attacks because incomplete registration of dogs and mixed breeds make it hard to determine how many of each type of dog Americans own.

BILC Language Testing Seminar III-60

CHAPTER IV

EVALUATION OF

LISTENING COMPREHENSION

Chapter IV Evaluation of Listening Comprehension

CHARACTERISTICS OF LISTENING COMPREHENSION

1. The teaching and testing of listening comprehension in a foreign language presents special challenges. Many language students may not be skilled listeners – even in their native language. They may not realize what real listening entails; they may not have been trained to listen to a text for a purpose.

2. We will use the word “text” to discuss listening comprehension, just as we do for reading comprehension. We will think of “texts” as those units of spoken discourse that we have identified as suitable for using as the basis for a test item or items.

3. One major consideration that distinguishes listening tests from reading tests involves memory. Test writers must be careful not to overload examinees’ memory when testing their ability to understand spoken texts. If they want to use longer texts, they will need to consider repeating the text two or more times.

4. Natural spoken language is quite different from a written text. Natural conversation includes hesitations, rephrasing, interruptions, false starts, digressions. Speakers may try to talk at the same time. They may change the subject abruptly. Discourse may seem disorganized and unclear in comparison with a written text. On the other hand, conversation normally includes redundancy. One speaker may emphasize and reemphasize a point. An argument may include enough repeated detail to clarify the meaning. One of the speakers may actually request clarification from another speaker.

5. Test writers will need to make decisions about the naturalness of the spoken texts they use. They may decide to use more natural speech for higher level testing and more contrived, or edited, speech for the lower levels.

6. Decisions about natural speech, repetition, pauses, etc. should become a part of the Table of Specifications for a listening comprehension test.
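Teams that maintain their Table of Specifications electronically may find it convenient to record these delivery decisions alongside each planned text. The short Python sketch below is only a hypothetical illustration of such a record; the field names and values are assumptions made for this example and are not prescribed by STANAG 6001 or by any particular test project.

# Hypothetical entry in a listening-test Table of Specifications.
# Field names and values are illustrative assumptions only.
listening_spec_entry = {
    "level": 2,                           # targeted STANAG 6001 proficiency level
    "text_type": "short news broadcast",  # kind of spoken text
    "delivery": {
        "speech": "edited",               # contrived/edited speech at lower levels, natural at higher
        "rate": "normal",
        "repetitions": 2,                 # longer texts may be played more than once
        "pause_after_text_sec": 15,
    },
    "task": "answer factual WHO/WHAT/WHERE/WHEN questions",
    "item_format": "multiple choice, 4 options",
    "number_of_items": 3,
}

for field, value in listening_spec_entry.items():
    print(f"{field}: {value}")

Keeping such entries in one place makes it easier to check that decisions about repetition, rate, and naturalness are applied consistently across the whole test.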

BILC Language Testing Seminar IV-1

Chapter IV Evaluation of Listening Comprehension

TYPES OF LISTENING

In his book Assessing Language Ability in the Classroom, Andrew Cohen mentions research by O. Inbar related to the continuum from oral to literate spoken texts. For example, a news broadcast is considered to be highly literate (or closely related to the written form of the language); a lecturette may be a mixture of oral and literate texts (with some of the speaker’s remarks prepared in written form and some of them spontaneous); a consultative discussion is highly oral. Inbar’s research concludes that examinees will perform better when tested on the highly oral texts.

Cohen describes four types of communicative listening tasks that might be used in testing.

1. The Lecture Task: Examinees listen to an authentic lecture that includes false starts, filled pauses, and other unpolished discourse features. “After the lecture, tape-recorded multiple-choice, structured, or open-ended questions are presented, with responses to be written on the answer sheet.” He points out that questions could be framed to determine how well examinees followed the organizational cues in the lecture to discern meaning. For example, questions could involve major points made in the lecture or the ways ideas are framed.

(Harold S. Madsen points out in Techniques in Testing that, for testing purposes, these should be “lecturettes”, no more than 3-5 minutes long. If these are not authentic recordings from a lecture hall, then it is important to add natural hesitations, rephrasings, little digressions, plus some redundancy. He also points out the usefulness of allowing examinees to take notes as they would during a real lecture.)

2. Dictation: Pace is determined by the size of phrase groups, the length of pauses, and the speed of reading the phrase groups. Cohen quotes Oller’s conclusion that the most common errors on a dictation test are inversion, incorrect word choice, insertion of extra words, and omission of dictated words. He also mentions the possibility of scoring in terms of meaningful segments, rather than words; a sketch of such segment-based scoring follows this list.

A variation on the traditional type of dictation requires examinees to listen to a spoken text and identify the correct printed version of the text from a printed multiple-choice item. In this format the printed options focus on elements within the spoken text, not on spelling or sound discrimination.

3. Interview Topic Recovery: Examinees are required to get information from an interlocutor on a specific topic. This test could simulate the type of interview necessary to prepare an institutional report on the topic.

4. Verbal Report Data on Listening Comprehension Tasks: Examinees respond orally to a set of questions about the spoken text. The questions may range from discrete point lexical items to broader communication tasks concerning meaning and relationship of ideas in a text.

Cohen points out the importance of avoiding the assessment of trivial facts in a spoken text. That practice would lead to a specialized form of listening that is more closely related to rote memory than to authentic listening comprehension.
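To make the idea of scoring by meaningful segments concrete, the Python sketch below awards one point per pre-defined segment that appears in the examinee’s transcript. It is a minimal illustration under the assumption that the test developer has already divided the dictated text into segments; the normalization rules and the example sentence are invented for this sketch.

import re

def normalize(text: str) -> str:
    """Lower-case the text and strip punctuation so trivial differences are ignored."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def score_dictation(expected_segments, examinee_text: str) -> int:
    """Award one point for each expected segment found in the examinee's transcript.

    Scoring by meaningful segments (rather than word by word) credits understood
    chunks even when other parts of the transcript contain errors or omissions.
    """
    transcript = normalize(examinee_text)
    return sum(1 for segment in expected_segments if normalize(segment) in transcript)

# Invented example: three segments defined in advance for a dictated sentence.
segments = ["the flight leaves", "at seven in the morning", "from gate twelve"]
response = "The flight leaves at seven in the morning from gate 12."
print(score_dictation(segments, response), "of", len(segments))  # prints: 2 of 3

In an operational setting the segment list and the matching rules would of course need to be agreed by the test team and tried out on real responses.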

BILC Language Testing Seminar IV-2

Chapter IV Evaluation of Listening Comprehension

INTERPRETATION OF THE LANGUAGE PROFICIENCY LEVELS

Appendix 1 to Annex A to STANAG 6001 (Edition 2)

LISTENING COMPREHENSION

LEVEL 0 (NO PROFICIENCY)

No practical understanding of the spoken language. Understanding is limited to occasional isolated words. No ability to comprehend communication.

LEVEL 1 (ELEMENTARY)

Can understand common familiar phrases and short simple sentences about everyday needs related to personal and survival areas such as minimum courtesy, travel, and workplace requirements when the communication situation is clear and supported by context. Can understand concrete utterances, simple questions and answers, and very simple conversations. Topics include basic needs such as meals, lodging, transportation, time, simple directions and instructions. Even native speakers used to speaking with non-natives must speak slowly and repeat or reword frequently. There are many misunderstandings of both the main idea and supporting facts. Can only understand spoken language from the media or among native speakers if content is completely unambiguous and predictable.

LEVEL 2 (LIMITED WORKING)

Sufficient comprehension to understand conversations on everyday social and routine job-related topics. Can reliably understand face-to-face speech in a standard dialect, delivered at a normal rate with some repetition and rewording, by a native speaker not used to speaking with non-natives. Can understand a wide variety of concrete topics, such as personal and family news, public matters of personal and general interest, and routine work matters presented through descriptions of persons, places, and things; and narration about current, past, and future events. Shows ability to follow essential points of discussion or speech on topics in his/her special professional field. May not recognize different stylistic levels, but recognizes cohesive devices and organizing signals for more complex speech. Can follow discourse at the paragraph levels even when there is considerable factual detail. Only occasionally understands words and phrases of statements made in unfavorable conditions (for example, through loudspeakers outdoors or in a highly emotional situation). Can usually only comprehend the general meaning of spoken language from the media or among native speakers in situations requiring understanding of specialized or sophisticated language. Understands factual content. Able to understand facts but not subtleties of language surrounding the facts.

BILC Language Testing Seminar IV-3

Chapter IV Evaluation of Listening Comprehension

LEVEL 3 (MINIMUM PROFESSIONAL)

Able to understand most formal and informal speech on practical, social, and professional topics, including particular interests and special fields of competence. Demonstrates, through spoken interaction, the ability to effectively understand face-to-face speech delivered with normal speed and clarity in a standard dialect. Demonstrates clear understanding of language used at interactive meetings, briefings, and other forms of extended discourse, including unfamiliar subjects and situations. Can follow accurately the essentials of conversations among educated native speakers, lectures on general subjects and special fields of competence, reasonably clear telephone calls, and media broadcasts. Can readily understand language that includes such functions as hypothesizing, supporting opinion, stating and defending policy, argumentation, objections, and various types of elaboration. Demonstrates understanding of abstract concepts in discussion of complex topics (which may include economics, culture, science, technology) as well as his/her professional field. Understands both explicit and implicit information in a spoken text. Can generally distinguish between different stylistic levels and often recognizes humor, emotional overtones, and subtleties of speech. Rarely has to request repetition, paraphrase, or explanation. However, may not understand native speakers if they speak very rapidly or use slang, regionalisms, or dialect.

LEVEL 4 (FULL PROFESSIONAL)

Understands all forms and styles of speech used for professional purposes, including language used in representation of official policies or points of view, in lectures, and in negotiations. Understands highly sophisticated language including most matters of interest to well-educated native speakers even on unfamiliar general or professional-specialist topics. Understands language specifically tailored for various types of audiences, including that intended for persuasion, representation, and counseling. Can easily adjust to shifts of subject matter and tone. Can readily follow unpredictable turns of thought in both formal and informal speech on any subject matter directed to the general listener. Understands utterances from a wide spectrum of complex language and readily recognizes nuances of meaning and stylistic levels as well as irony and humor. Demonstrates understanding of highly abstract concepts in discussions of complex topics (which may include economics, culture, science, technology) as well as his/her professional field. Readily understands utterances made in the media and in conversations among native speakers both globally and in detail; generally comprehends regionalisms and dialects.

LEVEL 5 (NATIVE/BILINGUAL)

Comprehension equivalent to that of the well-educated native listener. Able to fully understand all forms and styles of speech intelligible to the well-educated native listener, including a number of regional dialects, highly colloquial speech, and language distorted by marked interference from other noise.

BILC Language Testing Seminar IV-4

Chapter IV Evaluation of Listening Comprehension

LISTENING COMPREHENSION TASKS

Some possible testing techniques for listening comprehension include:

• Examinees listen and take notes.

• Examinees identify specific information from a spoken text. For example, they listen to a text and identify all references to time or dates or weather.

• Examinees listen for specific key words. For example, they listen to a text and determine where the speaker lives.

• Examinees prepare a brief summary of a news story from radio or television.

• Examinees listen to a spoken text and restate the main idea in a few words.

• Examinees listen to a complete spoken text and then write down everything they remember.

• Examinees listen to a spoken text and answer content questions printed in a test booklet (generally, WHO, WHAT, WHERE, WHEN questions).

• Examinees listen to a spoken text and answer multiple-choice questions. They may include recognition of the correct summary.

Test writers should consult the level descriptors and ensure that listening tasks are suitable for the level being tested. For example, examinees at Level 2 should be able to answer factual questions about texts. Examinees at Level 3 should be able to demonstrate their understanding of hypothesis and supported opinion.

BILC Language Testing Seminar IV-5

Chapter IV Evaluation of Listening Comprehension

SAMPLE SOURCES FOR LISTENING TEXTS

Level 1: interactions in survival situations; introductions to TV and radio programs; announcements at public events; emergency announcements; broadcasts of sports scores; weather reports; instructions or orders

Level 2: factual narration; factual descriptions; short news broadcasts

Level 3: broadcast interviews on current issues; broadcast editorials; speeches; debates; recordings of briefings, meetings, conferences

STEPS TO FOLLOW IN LISTENING ITEM WRITING ACTIVITY

1. Each group has three taped texts.

2. Review each text.

3. Determine where the text would probably fit into the Table of Specifications.

4. Plan the item development activity.

5. Review the Content/Task/Accuracy statement for the level of the text.

6. Determine how you will test the text.

7. Agree upon the primary testing point or points. See the handout “Listening Comprehension Tasks” (page IV-5).

8. Write options. See “A Sample Procedure for Item Development” (page III-17).

9. Review the completed item(s) within the group.

10. Ask a facilitator or a member of another group to review the item(s).

11. Revise, if necessary.

BILC Language Testing Seminar IV-6

Chapter IV Evaluation of Listening Comprehension

SAMPLE ITEMS FOR LISTENING COMPREHENSION

TEXTS AND TASKS

LEVEL 1

From a discussion between two men

First Man: What seems to be the matter?

Second Man: I’m not sure. I have a pain in my chest.

First Man: Let’s go into the examining room and check it out.

Multiple-choice:

This discussion is taking place at

(A) an auto mechanic’s shop.

(B) a doctor’s office.

(C) a hardware store.

(D) a dentist’s office.

Constructed response:

Where is this discussion taking place?

or

This discussion is taking place at _____________________ .

or

The second man is consulting _____________ because ____________________.

Task to be performed:

Understand the main idea. (The second man is consulting a doctor about his pain.)
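Where an item bank is kept electronically, an item such as the Level 1 sample above can be stored together with its key and scored automatically. The Python sketch below is a hypothetical illustration only; the class and field names are assumptions and do not reflect any particular item-banking system.

from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    """A single multiple-choice item: stem, options, and the keyed answer."""
    stem: str
    options: dict          # option letter -> option text
    key: str               # letter of the correct option

    def score(self, answer: str) -> int:
        """Return 1 if the examinee chose the keyed option, 0 otherwise."""
        return 1 if answer.strip().upper() == self.key else 0

# Modeled on the Level 1 listening sample above.
item = MultipleChoiceItem(
    stem="This discussion is taking place at",
    options={
        "A": "an auto mechanic's shop.",
        "B": "a doctor's office.",
        "C": "a hardware store.",
        "D": "a dentist's office.",
    },
    key="B",
)

print(item.score("B"))  # prints 1
print(item.score("d"))  # prints 0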


BILC Language Testing Seminar IV-7

Chapter IV Evaluation of Listening Comprehension

TEXTS AND TASKS

LEVEL 2

A radio news story

South Africa is shooting pigeons in its diamond producing area because the birds are being used to smuggle gems out of the country.

Diamonds are leaving the country in an extremely worrisome manner: strapped onto the bodies of pigeons and flown out of the country. The law is now to shoot all pigeons on sight. Mineworkers have been implicated in the widespread theft, and diamond producers will need to spend about $8 million to improve security.

Multiple-choice:

Pigeons are in the news because they are

(A) part of a plan to prevent diamond smuggling.

(B) being used in a criminal activity.

(C) being shot to prevent spread of a disease.

(D) part of a safety program for mineworkers.

Constructed response:

What is happening to pigeons in South Africa?

or

In South Africa pigeons are __________________________________ .

The legal response is __________________________________ . It will also be necessary to __________________________________________ .

Task to be performed:

Answer factual questions. (Pigeons are used to smuggle diamonds. There is now a law to shoot them on sight. It will also be necessary for diamond producers to improve their security systems.)

BILC Language Testing Seminar IV-8

Chapter IV Evaluation of Listening Comprehension

TEXTS AND TASKS

LEVEL 3

From a commentary on South African radio

The politics of desperation emanating from a neglected Third World will be the critical issue of the 21st century. Embodying as we do all elements of the global divide – white versus black, rich versus poor – a successful new South Africa can become an example for the gradual solution of the North-South cleavage. What a stunning irony this prospect represents – South Africa, the polecat of the world, becoming a broker of new relationships between white- and dark-skinned nations everywhere.

Multiple-choice:

According to this speaker

(A) his country’s racism gives it little credibility in the Third World.

(B) his country will negotiate political solutions to economic problems.

(C) his country’s deep divisions will persist into the 21st century.

(D) his country can transform an ignoble past into a valuable object lesson.

Constructed response:

What does this speaker think South Africa could contribute in the 21st century?

or

This speaker thinks that in the 21st century, South Africa could __________________________ because __________________________ .

He admits that this would be ________________________ .

Task to be performed:

Understand argumentation. (South Africa can contribute to the solution of Third World problems by serving as an example. Having surmounted the same serious divisions that these neglected countries are experiencing, South Africa can demonstrate that these problems can be overcome. The speaker finds this ironic because of his country’s earlier reputation as “the polecat of the world”.)


BILC Language Testing Seminar IV-9

Chapter IV Evaluation of Listening Comprehension

TEXTS AND TASKS

LEVEL 4

From a broadcast interview

Man: Dr. Albertson, you have expressed impatience with arguments whether the Democratic party or the Republicans first thought of some position on moral and social issues.

Woman: I genuinely believe the public is bored with this debate. Concerned people are asking whether their representatives will ever, conceivably, get around to discussing the positions themselves. I am interested in a collection of these shared or swiped or imitated positions, the ones concerning the derailed American young. I include in their number the growing support for uniforms in the schools, for curfews for minor children, and for welfare regulations that make staying in school and living at home with parent or guardian a condition of getting a grant for unwed teenage mothers.

Man: Are these issues subject to charges of larceny between Democrats and Republicans in the presidential campaign?

Woman: These positions have more in common than that. Often, for instance, the curfew proposition will be accompanied by sensible additional measures meant to discourage adolescent crime.

Man: And, where states and localities have already tried curfews, haven’t some shown encouraging results?

Woman: Umm. But it strikes me that there is something else these proposals have in common. Even allowing for their better features and the preferable versions already in effect. All rest on the assumption that what we think of as the good old days are still here or at least can be made to seem to be.

Man: Isn’t that an unfair charge? Aren’t many politicians and private citizens genuinely alarmed about juvenile crime?

Woman: There is, in the first place, a kind of cosmetic component to all the proposals. They will make things look different, from the uniformed kids to the relatively safe and quiet nighttime streets to the households where the welfare checks go. But inside the uniform or off the street corners or wherever else they may be, the same kids, with many of the same disorders, will still be somewhere. And if it is unfair to say that this represents a mere façade of political and social progress, it is not unfair, I think, to say that it does represent at least a measure of confusion of nostalgia with reform.

BILC Language Testing Seminar IV-10

Chapter IV Evaluation of Listening Comprehension

Man: Shouldn’t this country be able to retrieve what was best in an earlier incarnation, just as we have discarded much that was wrong and hypocritical in the past?

Woman: Certainly. But that will mean addressing life as it is, not as it used to be or as it didn’t used to be but is romantically reconstructed. Politicians have an obligation to address the real problems in real ways and let the look of things take care of itself. I think we should keep that distinction in mind and try to hold both parties to it.

Multiple-choice:

According to Dr. Albertson

(A) Such mandatory measures as curfews and school uniforms were appropriate in the past, but not today.

(B) Regimenting young people may create apparent public order but lead to violation of civil liberties.

(C) Facile appeals to past values have obscured the political debate on programs involving alienated youth.

(D) Politicians try to disguise their own misconduct by focussing on social problems associated with young people.

Constructed response:

What political proposals are discussed?

What is Dr. Albertson’s objection to these proposals?

Task to be performed:

Follow unpredictable turns of thought. (Democrats and Republicans have argued about which party first thought of certain political proposals. Dr. Albertson is more concerned with the content of the proposals – particularly those related to young people. The interviewer points out that some of these ideas have worked well. However, Dr. Albertson counters that a common feature of the proposals is a desire to recapture the past. The interviewer objects to her charge that politicians and others are insincere. She replies that all the proposed measures will create a façade without solving the problems. She insists that nostalgia is being confused with reform. The interviewer asks whether the country cannot indeed recapture some of the values of the past. Dr. Albertson agrees but states that would require a realistic examination of the problems, not a focus on appearances. She concludes that both political parties should be held accountable.)

BILC Language Testing Seminar IV-11

Chapter IV Evaluation of Listening Comprehension

SPECIAL CONSIDERATIONS FOR A

LISTENING COMPREHENSION TEST

1. The test development team will need to have access to native or near-native speakers for voicing listening comprehension test tapes. Speakers should have clear voices suitable for recording. They should be able to use a standard dialect, and they should not have any distracting speech habits.

2. Speakers should record the tape at a normal rate of speech. It should neither be artificially slowed nor unusually rapid.

3. A team member should monitor each recording session to ensure that the script and plan are followed. Later, another team member should review the master recording to check for sound quality as well as errors. It is usually possible to edit and repair testing tapes when requirements are made clear to recording personnel.

4. The team should use the best recording equipment available for their project. If the organization has a recording studio, that would be the ideal option. However, it is possible these days to record onto a computer. In any case, quality control of test tapes or compact disks is essential to ensure examinees are tested fairly.

5. Ideally, listening comprehension tests will be administered in a lab with a central tape or CD player and with headsets for each examinee. The setting should be quiet, and there should be no distractions. Less than optimum settings can be used, but it must be possible for each examinee to hear the tape/CD and for the test to be uninterrupted.

Please read Hughes’s chapter 16 for more information about testing conditions.

BILC Language Testing Seminar IV-12

Chapter IV Evaluation of Listening Comprehension

INSTRUCTIONS FOR LISTENING COMPREHENSION TEST

These are the instructions for the DLPT IV Listening Comprehension test as found on page 2 of the test booklet:

INSTRUCTIONS

This test measures your listening comprehension in (TL). In taking the test you should do the following:

1. Listen to the number of the item.

2. Listen to the English introduction which precedes the passage.

3. Listen to the (TL) passage.

4. Read the question in the test booklet about the passage. It is in English.

5. Choose the best answer from the four options in the test booklet.

6. Find the question number on your answer sheet and fill in the space that has the same letter as the answer you have chosen.

Since there is no penalty for guessing, it is to your advantage to answer all questions.
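For test developers, the rationale behind this advice can be quantified. The Python sketch below is a simple, hypothetical calculation: with four options per item and no deduction for wrong answers, blind guessing on unanswered items can only raise an examinee’s expected score. The numbers are invented for illustration.

# Expected score gain from blind guessing on unanswered items when there is
# no penalty for wrong answers (hypothetical illustration).
options_per_item = 4
unanswered_items = 10

expected_extra_points = unanswered_items * (1 / options_per_item)
print(expected_extra_points)  # 2.5 expected additional points, with no possible loss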

BILC Language Testing Seminar IV-13

Chapter IV Evaluation of Listening Comprehension

DLPT IV LISTENING COMPREHENSION

Sample Japanese questions

This is what the examinee hears on tape:

Sample # 1 (3 sec). From a TV weather report (5 sec)

Now answer sample question 1 (15 sec)

(5 sec)

The correct answer is A. If you were marking your answer sheet, it would look like this: (3 sec)

This is what the examinee sees in the test booklet:

I. What is the weather forecast for the Tokyo area?

(A) Cloudy with occasional snow.

(B) Cloudy with occasional showers.

(C) Sunny with occasional showers.

(D) Sunny with occasional snow.

The correct answer is A. If you were marking your answer sheet, it would look like this:

I. A B C D

BILC Language Testing Seminar IV-14

Chapter IV Evaluation of Listening Comprehension

This is what the examinee hears on tape:

Sample # 2 (3 sec). From a TV news report (5 sec)

Now answer sample question 2 (15 sec)

The correct answer is C. If you were marking your answer sheet, it would look like this: (3 sec)

This is what the examinee sees in the test booklet:

II. According to the report, what happened at the station?

(A) A subway train plunged into a wall at a station.

(B) A person fell twelve meters down a subway staircase.

(C) A passenger car plunged down a subway staircase.

(D) Two passenger cars collided at the entrance to a subway station.

The correct answer is C. If you were marking your answer sheet, it would look like this:

II. A B C D

BILC Language Testing Seminar IV-15

Chapter IV Evaluation of Listening Comprehension

BILC Language Testing Seminar IV-16

CHAPTER V

TEST CONSTRUCTION,

ANALYSIS, AND ADMINISTRATION

Chapter V Test Construction, Analysis, and Administration

THE ROLE OF MEASUREMENT IN EVALUATION

Evaluation is the process of gathering information for the purpose of making decisions. This process includes the determination of what needs to be known, how the information will be gathered, what importance will be assigned to various pieces of information, and how the information will be used.

Measurement is the quantification of this process.

Measurement and evaluation are intrinsically related. There is no good evaluation without good measurement.

BILC Language Testing Seminar V-1

Chapter V Test Construction, Analysis, and Administration

OVERVIEW OF TERMS

VALIDITY

The extent to which a test measures what it is supposed to be measuring.

CONCURRENT VALIDITY

The extent to which scores on a test correspond to scores obtained on another valid test of the same skill.

CONTENT VALIDITY

The extent to which a test reflects the objectives of the specifications of skills or the specific set of learning materials covered.

CONSTRUCT VALIDITY

The extent to which the test results permit inferences about underlying traits.

FACE VALIDITY

The extent to which a test appears to be measuring what it is supposed to measure.

PREDICTIVE VALIDITY

Refers to the correlation between scores obtained on a test and those obtained by examinees after completing study or beginning work in the language. Examines the extent to which the predictor accurately indicates future learning or job success in the field for which it was prepared.

VALIDATION

The process of determining that a test measures what it is supposed to measure through application of objective measures such as comparing results with other scores or criteria that have already been proven valid.

BILC Language Testing Seminar V-2

Chapter V Test Construction, Analysis, and Administration

THE ROLE OF VALIDITY IN TEST DEVELOPMENT

Evaluation is all about making decisions. High-stakes decisions – tracking, promotion, retention, and certification, for example – have profound implications for the lives of people. Tests used for such high-stakes purposes must therefore meet professional standards of validity, reliability and practicality.

In this chapter, we examine these key concepts to provide a basis for the discussion on measurement and the psychometrics used to describe, analyze and make inferences based on test results.

VALIDITY asks: “What does this test measure?” and “What meaning or conclusion can I draw from the test results?”

Test validation is an empirical evaluation of test meaning and use. A test is valid in terms of its content and intended use.

If the purpose of the test is to assess the ability to communicate in English, then it is valid only if it does actually test the ability to communicate, and not something else (e.g., a discrete-point test of grammar).

If a test is intended to assess reading ability, but also tests writing, then it may not be valid for measuring reading comprehension. However, it may be a valid test of the combined skills of reading and writing.

Having a clear purpose in mind and writing clear test specifications will maximize test validity.

In the discussion below, we will define the major components of validity and outline the major threats to validity.

Notes for further reading:

Participants should plan to read Hughes, chapter 4. Note especially what Hughes writes on page 26 about the importance of specifying the skills to be tested. This applies to proficiency levels as well.

Also note Hughes’s comments on page 34 where he mentions the need to supply validation results to the users of the test.

BILC Language Testing Seminar V-3

Chapter V Test Construction, Analysis, and Administration

TYPES OF VALIDITY

INTERNAL VALIDITY refers to the qualities of the test itself.

a) Face validity – the extent to which a test looks like it will test what it is intended to test. Having face validity helps students feel that they are being fairly assessed (e.g., an interview is a face-valid test of speaking).

b) Content validity – the extent to which test content represents an appropriate sample of the skills, abilities and knowledge that are the goals of instruction (as in an achievement test) or that represent the skills and abilities needed to use the language for authentic purposes and in real-life communication tasks (as in a proficiency test). The opinion of "experts" is often used to establish content validity. However, experts do not always agree, which points again to the importance of establishing clear and explicit test specifications, so as to minimize problems related to divergence in interpretation.

c) Construct validity – a test is valid or invalid in terms of its intended use. For some writers, construct validity is the most important type of validity, subsuming all other validity aspects. It asks the questions: "What is the purpose of the test?", "What abstract skill, attribute or domain of knowledge is being measured?", "Do the test and its items accurately reflect the ability I want to measure?"

A construct refers to abilities, skills and attributes that are internal. Therefore, they can only be observed indirectly through testing, which allows us to make inferences based on test results.

We have to be careful about the inferences we make and the conclusions we draw, because the testing of human abilities involves many factors related to test validity and reliability, as well as a certain amount of measurement error.

Suppose we want to test oral communicative ability in English, which is a broad construct, but we cannot give face-to-face oral interviews for practical reasons. So, we decide to administer a well-known standardized pencil-and-paper test. Is our test valid? In this case, the construct is under-represented. Although we may be testing some sub-skills related to the ability to speak English, we are not capturing important aspects of the construct, which is a global and integrative ability (construct under-representation).

Suppose we want to assess reading ability, but our test also assesses the quality of the students’ written answers. Is our test valid? In this case, our test measures more than the intended construct (construct irrelevance).

BILC Language Testing Seminar V-4

Chapter V Test Construction, Analysis, and Administration

EXTERNAL VALIDITY – the relationship between the test and other measures.

a) Concurrent validity – the degree to which a test correlates with another test which assesses the same thing (content, skills) at the same level.

• It is essential that the criterion (the measure that is being used for comparison) be valid. If the measure is not valid, there is no point in using it to measure another test's validity. For example, a teacher's ranking might be used to establish concurrent validity, but the ranking may be affected by a number of factors that are not related to the students' actual proficiency. One possible solution is to average the rankings of several teachers to compensate for this.

• The measure must be valid for the same purpose as the test whose validity is being considered. A grammar test cannot be used to test the concurrent validity of a reading test. If teachers' rankings are used, one must make sure that they understand on what basis the students are to be ranked – e.g., in the case of a reading test, the teachers must understand that the examinees are to be ranked according to their reading ability, not their grammar proficiency or lack of it.

b) Predictive validity – the extent to which a test can be used to make predictions about future performance (on-the-job use of the language, ability to take courses in the language at university level, etc.).

THREATS TO VALIDITY

a) Invalid application of the test – a test that is highly valid for measuring general language ability may not be valid as a measure of specific classroom achievement. A test that is highly valid and reliable as a measure of achievement for first-year students will probably be invalid if used to measure achievement of third-year students.

b) Inappropriate selection of content – in any achievement test, test content must reflect syllabus content.

c) Inappropriate referent or norming population – the TOEFL, for example, was designed to screen foreign students for entry into American universities. It was normed with applicants from a variety of linguistic backgrounds. If used with a monolingual population, many of the items would have to be rejected because they would not discriminate.

d) Invalid construct – a test is valid insofar as it measures the particular abilities or constructs it is meant to measure. To measure oral communicative ability in a language, we must know what abilities are included in the construct and whether our test accurately reflects that ability.

BILC Language Testing Seminar V-5

Chapter V Test Construction, Analysis, and Administration

THE ROLE OF RELIABILITY IN TEST DEVELOPMENT (1)

Reliability is the extent to which a test produces essentially the same results consistently within the same time-frame if all other conditions remain the same.

Participants should plan to read Hughes’s chapter 5.

Andrew Cohen comments in Assessing Language Ability in the Classroom (Heinle and Heinle, 1994) that reliability concerns the precision of a test in measuring examinee competence. If a test is reliable, it should yield the same results if given to the same examinees. Cohen identifies three factors that contribute to reliability.

These are:

Test Factors: These factors include sampling of objectives in test construction, presence or absence of ambiguity, specificity of required responses, quality of test instructions, examinee familiarity with test format, and test length. Reliability tends to increase with increased test length.

Situational Factors: These factors include the characteristics of the test room including lighting and sound as well as the behavior of the test examiner and his/her presentation of instructions.

Individual Factors: Cohen identifies transient factors such as an examinee’s health and wellbeing, motivation and rapport with the test examiner. There are also more stable factors such as intelligence, mechanical skill, competence in the language of the test instructions, and familiarity with similar tests.

Cohen concludes that reliability is enhanced by clear instructions, more items, better items, a suitable environment, and examinee motivation.

BILC Language Testing Seminar V-6

Chapter V Test Construction, Analysis, and Administration

THE ROLE OF RELIABILITY IN TEST DEVELOPMENT (2)

RELIABILITY refers to the stability or consistency of a test's results. A test is highly reliable if a student taking it on two different occasions will get two very similar if not identical scores. So, here we ask the following questions:

• Does the test produce essentially the same results over time, if test conditions are the same and examinees have not progressed in their language study?

• Does this test and/or set of items measure the same trait or skill?

• How much measurement error do we have (error variance)?

A good test will have a reliability coefficient between .75 and .95.

In order to measure reliability, we need two sets of scores.

ESTIMATING TEST RELIABILITY

a) Test-retest reliability – refers to the reliability of a test over time. Here, one group of students takes the same test twice. However, problems exist with this method: there is a practice effect if the two test administrations are too close together, and a learning effect if they are too far apart. If you use two groups of examinees to establish test reliability, you must ensure they are truly parallel (i.e., at the same proficiency level).

b) Parallel-forms reliability – refers to score consistency. Two equivalent forms of the test are administered to the same group of students. A correlation is then calculated between the two sets of scores to estimate the degree of relationship across the two forms. The problem here is to ensure that the two forms of the test are truly parallel, i.e., that they test the same thing (content, skill, knowledge) in the same way and in the same context.

c) Split-half reliability – refers to the internal consistency of the test. This involves dividing the test very carefully into two equivalent halves. The more nearly equivalent the two halves are, the more reliable the test is. The students take the test in the usual way, but each student is given two scores (one score for the first half of the test, another score for the second half). The two sets of scores are then used to calculate test reliability, as if the test had been taken twice.

It is a practical way of obtaining two scores (in order to check reliability). But for this to be meaningful, you must ensure that:

1. both halves are equivalent, and

2. you calculate test reliability in terms of the whole test, not just each half (otherwise your test reliability will be lower).

BILC Language Testing Seminar V-7

Chapter V Test Construction, Analysis, and Administration

The Spearman-Brown prophecy formula is used to assess the internal consistency of a test. It is based on the reliability coefficient (r) for each half:

   r (whole test) = (2 × r for split halves) / (1 + r for split halves)

If you want to estimate the reliability of a test with only one form and one administration of the test, you can use the Kuder-Richardson Formula 21, which is based on the mean and the standard deviation of the test scores.

If your test is a multiple-choice test, you can also use the Cronbach Alpha coefficient to calculate split-half reliability. It is based on the standard deviations of the odd- and even-numbered items.

(For further details, see Appendix D-3 through 5 and Annex B.)
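For those who want to check these formulas by machine, the short Python sketch below may help. It is illustrative only and not part of the seminar materials; the half-test correlation, item count, mean and SD are hypothetical figures chosen simply to show the arithmetic.

   def spearman_brown(r_half):
       """Prophesy full-test reliability from the correlation between two half-tests."""
       return (2 * r_half) / (1 + r_half)

   def kr21(k, mean, sd):
       """Kuder-Richardson Formula 21: reliability from the number of items (k)
       and the mean and standard deviation of the total scores."""
       variance = sd ** 2
       return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

   # Hypothetical figures for illustration only.
   print(round(spearman_brown(0.70), 2))            # 0.82
   print(round(kr21(k=30, mean=21.0, sd=4.5), 2))   # 0.71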

THREATS TO RELIABILITY

a) Meaningful variance (related to the purpose of the test – it is a validity issue). To minimize problems here, care must be taken to define carefully the concept being tested, so that the items reflect well the purpose for which the test was designed.

b) Measurement error (error variance) – factors not directly related to the test purpose:

Environment – space, light, noise, temperature;

Administration procedures – timing, machines;

Examinees – fatigue, motivation, health, concentration, guessing;

Scoring procedures – subjectivity, evaluator biases, errors in scoring;

Test and test items – clarity of instructions, item types, item number, item quality, test security.

Reliability is directly affected by measurement error.

A test that has a reliability coefficient of .91 has good reliability because it only has 9% measurement error (100 – 91 = 9).

A useful index of reliability is the standard error of measurement, which is related to the unreliability of a test. This index defines a range of likely variation, or uncertainty, around the test score.

BILC Language Testing Seminar V-8

Chapter V Test Construction, Analysis, and Administration

DESCRIBING AND ANALYZING THE TEST RESULTS

You can display the data visually on a graph (bar graph, frequency polygon, histogram). But if you want to represent the data numerically, you need to use some basic descriptive statistics that will allow you to get a summary of how the examinees have done on the test, check the test’s reliability and see how dependable the test scores are.

Measures of central tendency refer to the most typical behavior of a group.

1. The mean (the average) – you add up all the scores and divide by the number of scores.

2. The median – refers to the point that divides the scores 50/50.

3. The mode – refers to the most frequently occurring score(s). There can be more than one (bimodal, tri-modal or multi-modal distributions).

Measures of dispersion (or distribution) – how the individual performances vary from the central tendency.

1. The range – refers to the difference between the highest and lowest score + 1 (one is added because the range should include the scores at both ends). If the highest score is 77 and the lowest score is 61, the range is 17 points (77 - 61 + 1). The range is not a very good measure because it is affected by "outliers" (extreme scores).

2. The standard deviation (SD) is a good indicator because it is the result of an averaging process. The extreme scores do not have as big an impact as with the range. The SD reflects the distance between a score and the mean. It indicates whether the score is typical or exceptional.

3. The variance (S²) is the average of the squared differences of scores from the mean.

Sources of variance: 1) those related to the purpose of the test (does the test reflect well the purpose for which it was designed), 2) those due to extraneous factors (measurement error).

See “threats to reliability”. Test reliability (r) is directly affected by error variance.

4. The normal distribution (bell curve) is obtained with large population samples (40 or more). In a normal distribution, the central tendency indicates the typical behavior of a group. The mean, mode and median are at the center, i.e., the same. The highest and lowest scores should be equidistant from the mean (the same distance on each side of the mean). In a normal distribution, about 2/3 of the scores lie within 1 SD of the mean; for example, between 40 and 60 if the mean is 50 and the SD is 10.

Skewed distribution – When you have smaller groups, there will probably be a skewed distribution with a large percentage of scores falling at one end or the other, rather than in a bell shaped curve. A skewed distribution can indicate that the test was either too hard or too easy. If students have done so well that most scores are near the top of the scale, it will be very difficult to interpret test results.
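These descriptive statistics are easy to compute with Python's standard library. The sketch below is illustrative only (not part of the seminar materials) and uses a small hypothetical set of scores; note that it follows the handbook's definitions, adding 1 to the range and dividing the squared deviations by n.

   import statistics

   scores = [77, 74, 72, 70, 70, 68, 66, 65, 63, 61]   # hypothetical scores

   mean = statistics.mean(scores)                 # the average
   median = statistics.median(scores)             # the point dividing the scores 50/50
   modes = statistics.multimode(scores)           # most frequent score(s); may be more than one
   score_range = max(scores) - min(scores) + 1    # highest - lowest + 1
   variance = statistics.pvariance(scores)        # average of squared deviations from the mean
   sd = variance ** 0.5

   print(mean, median, modes, score_range, round(sd, 2))   # 68.6 69.0 [70] 17 4.74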

BILC Language Testing Seminar V-9

Chapter V Test Construction, Analysis, and Administration

MEASURES OF RELATIONSHIP

Measures of central tendency and dispersion are very important, but they are not enough, in themselves, to measure test validity and reliability. You still need to see the relationship between two sets of test scores. This relationship will show up as a scatter plot along a line.

The direction of the line and the position of the scores will indicate whether the relation is positive or negative and whether two tests measure the same thing. It is assumed that each set of test scores is normally distributed. If one distribution is skewed, the value of the correlation coefficient is unpredictable.

Correlation shows us how well two variables go together. The Pearson Product-Moment coefficient is the statistic of choice for comparing two sets of language test scores. It can be easily calculated using the mean and the SD on each test. (See Appendix D, Annex A.)

BILC Language Testing Seminar V-10

Chapter V Test Construction, Analysis, and Administration

CALCULATING THE VARIANCE AND THE STANDARD DEVIATION

The variance and the standard deviation are easily calculated from the mean and the deviation from the mean of each score. The variance is the average of the squared differences of students’ scores from the mean, and the standard deviation is the square root of the variance.

HOW TO CALCULATE THE STANDARD DEVIATION

   Score (X)    Mean (X̄)    Deviation (X - X̄)    Squared deviation (X - X̄)²
      77           69              8                       64
      75           69              6                       36
      72           69              3                        9
      72           69              3                        9
      70           69              1                        1
      70           69              1                        1
      69           69              0                        0
      69           69              0                        0
      69           69              0                        0
      69           69              0                        0
      68           69             -1                        1
      68           69             -1                        1
      67           69             -2                        4
      64           69             -5                       25
      64           69             -5                       25
      61           69             -8                       64

   n = 16        Total = 1104                     Σ(X - X̄)² = 240

   Mean = 1104 / 16 = 69
   Variance = Σ(X - X̄)² / n = 240 / 16 = 15
   SD = √15 = 3.87

Chart 1
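Chart 1 can be reproduced in a few lines of code. The sketch below is illustrative only (not part of the seminar materials); it follows the same steps as the chart's columns and returns the same figures.

   scores = [77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 68, 68, 67, 64, 64, 61]   # the 16 scores in Chart 1
   n = len(scores)

   mean = sum(scores) / n                      # 1104 / 16 = 69
   deviations = [x - mean for x in scores]     # the (X - X̄) column
   squared = [d ** 2 for d in deviations]      # the (X - X̄)² column
   variance = sum(squared) / n                 # 240 / 16 = 15
   sd = variance ** 0.5                        # √15 ≈ 3.87

   print(mean, variance, round(sd, 2))         # 69.0 15.0 3.87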

BILC Language Testing Seminar V-11

Chapter V Test Construction, Analysis, and Administration

CALCULATING THE RELIABILITY COEFFICIENT

The reliability coefficient (r_xy) is a measure of how well two sets of scores go together. It can easily be calculated from the mean and the SD for each set of scores.

The formula is:

   r_xy = Σ(dx)(dy) / [N × (SDx) × (SDy)]

Where:

   N is the number of students who took both tests
   dx is the deviation (difference of a student's score from the mean) on Test X
   dy is the deviation (difference of a student's score from the mean) on Test Y
   SDx is the standard deviation for Test X
   SDy is the standard deviation for Test Y

(See example below – Chart 2)

CALCULATING THE STANDARD ERROR OF MEASUREMENT

Test conditions can be replicated exactly with physical objects, but not with human beings (see "Threats to Reliability"). So the observed score (raw score) differs from the "true" score, which, according to Classical Test Theory, is the score that reflects an individual's true ability:

   Xo = Xt + Xe     (observed score = true score + error)

This is why we calculate the Standard Error of Measurement (SEM) to compensate.

   SEM = SD × √(1 - r)

The smaller the SEM, the fewer fluctuations there are and therefore the more consistently the raw scores represent the students’ actual abilities.

The SEM is particularly useful in deciding the “fate” of borderline students (in high-stakes tests, for example). A test with a small SEM is more consistent than one with a large SEM.

The more reliable the test is, the closer the obtained scores will cluster around the true score mean, resulting in a smaller SEM; the less reliable the test, the greater the SEM.

The importance given to the SEM depends on the nature of the use that will be made of the test and the importance of the decisions based on test scores.

(See example below – Chart 2)
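As a quick illustration (not part of the handbook), the sketch below computes the SEM from a test's SD and reliability coefficient and uses it to place a band of roughly ±1 SEM around a borderline score; the SD, reliability and observed score shown are hypothetical, and the "about two times out of three" reading of a ±1 SEM band assumes the usual normality of measurement error.

   def sem(sd, reliability):
       """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
       return sd * (1 - reliability) ** 0.5

   sd, r = 6.4, 0.93            # hypothetical test statistics
   e = sem(sd, r)               # about 1.69
   observed = 18                # a borderline examinee's raw score
   band = (observed - e, observed + e)
   print(round(e, 2), band)     # the range within which the true score is likely to fall about two times out of three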

BILC Language Testing Seminar V-12

Chapter V Test Construction, Analysis, and Administration

HOW TO CALCULATE THE RELIABILITY COEFFICIENT

   Student    Test 1 (x)    Test 2 (y)    dx     dy    (dx)²    (dy)²    (dx)(dy)
      1           20            20         6      1      36        1         6
      2           18            25         4      6      16       36        24
      3           15            19         1      0       1        0         0
      4           15            20         1      1       1        1         1
      5           14            19         0      0       0        0         0
      6           14            20         0      1       0        1         0
      7           14            18         0     -1       0        1         0
      8           13            19        -1      0       1        0         0
      9           13            18        -1     -1       1        1         1
     10           10            16        -4     -3      16        9        12
     11            8            15        -6     -4      36       16        24

   TOTAL         154           209                      108       66        68

   Test 1: Mean = 154 / 11 = 14    Variance = 108 / 11 = 9.82    SD = √9.82 = 3.1
   Test 2: Mean = 209 / 11 = 19    Variance = 66 / 11 = 6.0      SD = √6.0 = 2.4

   r = 68 / [(11)(3.1)(2.4)] = .83

Chart 2
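The computation in Chart 2 can be checked with the short sketch below (illustrative only, not part of the seminar materials). Note that the chart rounds the standard deviations to 3.1 and 2.4 before dividing, which yields .83; carrying full precision gives approximately .81.

   test1 = [20, 18, 15, 15, 14, 14, 14, 13, 13, 10, 8]    # Test 1 raw scores (x); mean = 14
   test2 = [20, 25, 19, 20, 19, 20, 18, 19, 18, 16, 15]   # Test 2 raw scores (y); mean = 19
   n = len(test1)

   mean_x, mean_y = sum(test1) / n, sum(test2) / n
   dx = [x - mean_x for x in test1]                        # deviations on Test 1
   dy = [y - mean_y for y in test2]                        # deviations on Test 2
   sd_x = (sum(d * d for d in dx) / n) ** 0.5              # about 3.13 (rounded to 3.1 in Chart 2)
   sd_y = (sum(d * d for d in dy) / n) ** 0.5              # about 2.45 (rounded to 2.4 in Chart 2)

   r_xy = sum(a * b for a, b in zip(dx, dy)) / (n * sd_x * sd_y)
   print(round(r_xy, 2))                                   # 0.81 at full precision; .83 with the rounded SDs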

BILC Language Testing Seminar V-13

Chapter V Test Construction, Analysis, and Administration

SETTING CUT-OFF SCORES

1. Let's assume we have a newly developed test and access to an existing validated test. Both tests have 30 items.

2. To determine the concurrent validity of the experimental version against the validated version, follow this process:

a. Identify a minimum of forty examinees (some known Masters and some Non-Masters).

b. Administer both tests to the same examinees (there should not be more than a few days between the two administrations).

c. Calculate the test reliability and the standard error of measurement based on the raw scores obtained (correlation of the raw score distributions).

d. Set up a mastery classification table based on a few cut-off scores.

e. Calculate the concurrent validity of the test based on the Phi calculation (see page V-18).

f. Select the optimal cut-off score.

g. Examine the validity and reliability of the test periodically.

3. An illustration of this process is at Annex B.

4. In our example we were fortunate to have a sample of 60 examinees. They took a validated test (Test 1) followed by an experimental version (Test 2).

5. Of the 60 students, 34 were considered Masters (M) based on their results for Test 1; the remaining 26 were Non-Masters (NM). After administering Test 2, we compared examinees' raw scores on both tests (see pages V-16 and V-17). This allowed us to calculate test reliability by correlating the results. Since reliability is a necessary condition for validity, establishing test reliability is an important step in validation.

6. Our data show very high reliability, with a correlation coefficient of 0.93 between the two tests. The Standard Error of Measurement (SEM) is also very similar for the two tests. Look at the graph on page V-17 showing the distribution of the raw scores for each test. Given the high correlation, the distributions are, of course, very similar.

7. The next step consists in setting up a mastery classification table in order to determine the optimal cut-off score for the experimental test. In this case we might consider the following:

a. 18/30 (the same cut-off score as for the validated test);

b. 17/30;

c. 16/30;

d. 15/30; and

e. 14/30.

8. In order to determine the concurrent validity of the results at each cut-off score, we must calculate the Phi coefficient. You can find the method for calculating the Phi coefficient in the top right corner of Annex B, page V-15, and also on page V-18.

9. From the calculation, we see that the highest coefficient is obtained with 16/30.

BILC Language Testing Seminar V-14

Chapter V Test Construction, Analysis, and Administration

Annex B CONCURRENT VALIDITY

[Annex B data table: Test 1 and Test 2 raw scores for all 60 examinees, with each examinee's Master (M) / Non-Master (NM) classification on Test 2 at cut-off scores of 18/30, 17/30, 16/30, 15/30 and 14/30. Test 1 raw scores range from 30 down to 6; on Test 1, 34 examinees are Masters and 26 are Non-Masters.]

Calculation of the Phi coefficient (Test 1 classification against Test 2 classification, with cells labelled A, B, C, D):

   Phi coeff. = [(A×D) - (B×C)] / √[(A+B)(C+D)(A+C)(B+D)]

With the cut-off on Test 2 set at 18/30:

                   Test 2 M    Test 2 NM    Total
   Test 1 M           28            6          34
   Test 1 NM           1           25          26
   Total              29           31          60

   Phi coeff. = 0.78

BILC Language Testing Seminar V-15

Chapter V Test Construction, Analysis, and Administration

Summary statistics for the 60 examinees:

                 Test 1    Test 2
   Mean           18.45     17.07
   Std dev         6.44      5.99
   Median         18.00     17.00
   SEM             1.67      1.55

   Correlation between Test 1 and Test 2: 0.93

[Graph: RAW SCORE CORRELATION – Test 1 and Test 2 raw scores (0 to 30) plotted for each of the 60 students.]

BILC Language Testing Seminar V-16

Chapter V Test Construction, Analysis, and Administration

If cut-off score is set at 17/30: Phi = .78

                   Test 2 NM    Test 2 M
   Test 1 M             5           29
   Test 1 NM           24            2

OR, in terms of classification outcomes:

                   Test 2 NM          Test 2 M
   Test 1 M        False Negatives    Confirmed
   Test 1 NM       Confirmed          False Positives

If cut-off score is set at 16/30: Phi = .83

                   Test 2 NM    Test 2 M
   Test 1 M             2           32
   Test 1 NM           23            3

If the two tests were in perfect agreement, results would look like this:

                   Test 2 NM    Test 2 M
   Test 1 M             0           30
   Test 1 NM           30            0

There would be no false positives and no false negatives.

BILC Language Testing Seminar V-17

Chapter V Test Construction, Analysis, and Administration

MASTERY CLASSIFICATION CONSISTENCY

The Phi coefficient (r_φ)

1. Let's consider the following results from two administrations of the same test:

                                  2nd test
                         Non-Master      Master
   1st test  Master         B = 1         A = 4       A + B = 5
             Non-Master     D = 4         C = 1       C + D = 5
                            B + D = 5     A + C = 5

2. The Phi coefficient is calculated as follows:

   r_φ = [(A×D) - (B×C)] / √[(A+B)(C+D)(A+C)(B+D)]
       = [(4×4) - (1×1)] / √[(5)(5)(5)(5)]
       = 15 / 25
       = 0.60

3. In this case, test stability over time is quite low, and the test should therefore not be used for assessing important competencies.
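The Phi formula is easy to code. The sketch below is illustrative only (not part of the seminar materials); it reproduces the worked example above and, assuming the same cell labelling, the 16/30 cut-off table on page V-17.

   from math import sqrt

   def phi(a, b, c, d):
       """Phi coefficient for a 2x2 classification table, using the cell labels A, B, C, D above."""
       return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

   print(round(phi(a=4, b=1, c=1, d=4), 2))      # 0.60, the worked example above
   print(round(phi(a=32, b=2, c=3, d=23), 2))    # 0.83, the 16/30 cut-off table on page V-17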

BILC Language Testing Seminar V-18

Chapter V Test Construction, Analysis, and Administration

ITEM ANALYSIS

Item analysis is an important step in the test development and validation process. With an item analysis you can evaluate the quality of each test item and the test as a whole. You can examine the contribution of each test item and then use the analysis to improve both the items and the over-all test. The item analysis can show how difficult each item is (facility value), whether the items discriminate between high and low scoring examinees (discrimination value), and whether distractors are working as planned (distractor efficiency).

If you do not have a computer program for item analysis, try the following:

1. Score all the items.

2. Rank order the scores from the highest to the lowest.

3. Divide the scores into three equal groups – the highest scores, the lowest scores, and the middle scores. Item analysis usually involves only the upper and the lower groups.

4. For each item, create a record sheet similar to the one below on which you will mark the item facility value, the discrimination value and the distractor efficiency. Indicate the key (i.e. the correct answer).

Example:

ITEM No. ____
                      OPTIONS                              INDEXES
                   A     B     C     D     OMITTED        FV     DI
   UPPER GROUP
   LOWER GROUP

HOW TO CALCULATE

1. Facility value

The facility value of a test item is the percentage of students who answered the item correctly. The larger the percentage, the easier the item. The proportion for the item is usually denoted as p. This refers to the relative frequency with which the examinees chose the correct response. An item answered correctly by 85% of the examinees has a p value of .85, and an item answered correctly by 20% of the examinees has a p value of .20. The higher the p value, the easier the item for the particular population sample.

BILC Language Testing Seminar V-19

Chapter V Test Construction, Analysis, and Administration

SOME EXAMPLES OF CHECKING THE FACILITY VALUE

ITEM No. 10
                   A     B     C     D     OMITTED
   UPPER GROUP     2     0    17     1        0
   LOWER GROUP     5     4     6     3        2
   FV = 58%     DI = 0.55

23 of the 40 examinees answered correctly. That is 58%. That is the facility value or p. The higher the percentage of examinees getting the item correct, the easier the item.

ITEM No. 24
                   A     B     C     D     OMITTED
   UPPER GROUP     3    12     3     2        0
   LOWER GROUP     4     5     4     4        3
   FV = 43%     DI = 0.35

17 of the 40 examinees answered correctly. The p or facility value is .43.

ITEM No. 28
                   A     B     C     D     OMITTED
   UPPER GROUP     9     4     3     3        1
   LOWER GROUP     3     4     5     5        3
   FV = 30%     DI = 0.30

12 of the 40 examinees answered correctly. The p or facility value is .30.

BILC Language Testing Seminar V-20

Chapter V Test Construction, Analysis, and Administration

The facility value of an item is characteristic of both the test item and the proficiency level of the examinees. A test item that is easy for advanced students will probably not be so easy for beginners.

The facility value has a profound effect on both the variability of the test scores and the precision with which the test items discriminate among examinees. If the item is very easy, you might get a p of 1.0 (everybody got it right); if the item is extremely difficult, you might get a p of 0.00 (nobody got it right). In both cases, there is no variability, which means that the item does not discriminate.

2.

Item Discrimination

If the test and a single item measure the same thing, one would expect people who do well on the test to answer that item correctly, and those who do not do well on the test to answer the item incorrectly. A good item discriminates between the people who score high on the whole test and those who score low.

The higher the discrimination value, the better the item because such a value indicates that the item distinguishes the higher group, which should get more items correct, from the lower group. If everybody answers the item correctly or incorrectly, the item does not discriminate and should be removed or rewritten. In this case, the DI is 0.00. If everybody in the lower group and nobody in the upper group answers an item correctly, the item is behaving in the opposite direction as the rest of the test. In this case, the DI value is –1.0. Such an item is probably ambiguous or flawed.

The item discrimination value can be computed in two different ways.

a. Subtract the item facility value for the lower group from the item facility value for the upper group:

   DI = FV(upper) - FV(lower)

b. Subtract the number of examinees in the lower group who selected the correct answer from the number of examinees in the upper group who selected the correct answer, and divide by half the number of examinees who took the test:

   DI = (N upper - N lower) / (0.5 × N)

3. Distractor Efficiency

Analyzing the distractors (i.e. incorrect alternatives in a multiple choice item) is useful in determining the relative usefulness or efficiency of the decoys in each item. If nobody selects a particular distractor, it should be rewritten or changed altogether. Distractor analysis is done by looking at the percentage of students who choose each distractor.

BILC Language Testing Seminar V-21

Chapter V Test Construction, Analysis, and Administration

   D efficiency = Nd / Nt

   Nd = the number of examinees who chose a particular distractor.
   Nt = the number of examinees who tried the item.

Distractor analysis gives interesting clues as to how the distractors behave. If a distractor is chosen by a large percentage of the upper group, the distractor might be ambiguous. You might consider revising it, so that it is more clearly incorrect. If, on the other hand, a distractor attracts nobody, it might be too obviously incorrect. It therefore does not carry its weight in the process of testing.
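For those who prefer to automate the record sheet, the sketch below (illustrative only, not part of the seminar materials) computes the facility value, the discrimination index and the share of examinees choosing each distractor for a single item, and reproduces the figures for ITEM No. 10 on page V-20. Following the definitions above, the distractor shares are computed over the examinees who tried the item (omits excluded), while the facility value uses the whole analyzed group.

   def item_analysis(upper, lower, key):
       """Facility value, discrimination index and distractor shares for one item.
       `upper` and `lower` map each option (and 'OMIT') to the number of examinees choosing it."""
       n_upper = sum(upper.values())
       n_lower = sum(lower.values())
       n_total = n_upper + n_lower
       n_tried = n_total - upper.get("OMIT", 0) - lower.get("OMIT", 0)

       fv = (upper[key] + lower[key]) / n_total             # proportion answering correctly (p)
       di = upper[key] / n_upper - lower[key] / n_lower     # FV(upper) - FV(lower)
       distractors = {
           opt: (upper.get(opt, 0) + lower.get(opt, 0)) / n_tried
           for opt in sorted(set(upper) | set(lower))
           if opt not in (key, "OMIT")
       }
       return fv, di, distractors

   # ITEM No. 10 from page V-20 (key = C)
   upper = {"A": 2, "B": 0, "C": 17, "D": 1, "OMIT": 0}
   lower = {"A": 5, "B": 4, "C": 6, "D": 3, "OMIT": 2}
   fv, di, eff = item_analysis(upper, lower, key="C")
   print(round(fv, 3), round(di, 2), eff)   # 0.575 (i.e. 58%), 0.55, and the share choosing A, B and D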

4. Caution with Item Analysis

a. The discrimination value does not indicate item validity. It indicates internal consistency, that is, how each item is measuring whatever the whole test is measuring. Such item analysis data can be interpreted as evidence of item validity only when the validity of the whole test has been proven or can be legitimately assumed. This is seldom possible with classroom tests, in which case we must be satisfied with a more limited interpretation of our item analysis data.

b. A low discrimination value does not necessarily indicate a defective item. The item should be looked at carefully to see if it is measuring something different from the rest of the test.

c. Item analyses from small samples are highly tentative.

Examples:

ITEM No. 1
                   A     B     C     D     OMITTED
   UPPER GROUP     8     2     0     0        0
   LOWER GROUP     6     3     0     0        1
   FV = 70%     DI = .20

Analysis:

• The item is fairly easy because 70% of the upper and lower groups together answered it correctly (p = .70).

• The item does not discriminate well between the upper and lower group (DI = .20).

• In terms of distractor efficiency, we can see that C and D did not attract anybody, which suggests that these distractors are flawed (too obviously wrong, perhaps).

BILC Language Testing Seminar V-22

Chapter V Test Construction, Analysis, and Administration

ITEM No. 2
                   A     B     C     D     OMITTED
   UPPER GROUP     0     7     1     2        0
   LOWER GROUP     2     2     2     2        2
   FV = 45%     DI = .50

Analysis:

• This is a good item. The facility value is .45, which means that 45% of the whole group answered correctly.

• It discriminates well (DI = .50): 70% of the upper group answered correctly, against only 20% of the lower group.

• In terms of distractor efficiency, it is also an ideal item because the distractors attracted roughly equal numbers of examinees.

ITEM No. 3
                   A     B     C     D     OMITTED
   UPPER GROUP     3     3     2     2        0
   LOWER GROUP     1     2     5     1        1
   FV = 35%     DI = -.30

Analysis:

• Since only 35% of the whole group answered this item correctly, the facility value of this item is not high.

• We can see, however, that the item does not behave like the rest of the test: the lower group chose the right answer more often than the upper group did (DI = -.30).

• In terms of distractor efficiency, all the distractors attracted a similar number of examinees. However, the key (correct answer) did not attract the upper group, which suggests that it may be ambiguous.

BILC Language Testing Seminar V-23

Chapter V Test Construction, Analysis, and Administration

Exercises:

ITEM No. 4
                   A     B     C     D     OMITTED
   UPPER GROUP     0    10     3     2        0
   LOWER GROUP     3     5     2     4        1
   FV = 50%     DI = .33

Analysis:

ITEM No. 5
                   A     B     C     D     OMITTED
   UPPER GROUP     2     2     1    10        0
   LOWER GROUP     2     2     4     5        2
   FV =         DI =

Analysis:

BILC Language Testing Seminar V-24

Chapter V Test Construction, Analysis, and Administration

ITEM No. 6
                   A     B     C     D     OMITTED
   UPPER GROUP     2     1     0    12        0
   LOWER GROUP     3     0     1     8        3
   FV =         DI =

Analysis:

ITEM No. 7
                   A     B     C     D     OMITTED
   UPPER GROUP     1     0     5     3        6
   LOWER GROUP     2     3     5     5        0
   FV =         DI =

Analysis:

BILC Language Testing Seminar V-25

Chapter V Test Construction, Analysis, and Administration

ITEM No. 8
                   A     B     C     D     OMITTED
   UPPER GROUP     0    15     0     0        0
   LOWER GROUP     2     6     1     4        2
   FV =         DI =

Analysis:

ITEM No. 9
                   A     B     C     D     OMITTED
   UPPER GROUP     2     1     7     2        3
   LOWER GROUP     2     0    11     2        0
   FV =         DI =

Analysis:

ITEM No. 10
                   A     B     C     D     OMITTED
   UPPER GROUP     1     1    12     1        0
   LOWER GROUP     2     3     6     3        1
   FV =         DI =

Analysis:

BILC Language Testing Seminar V-26

Chapter V Test Construction, Analysis, and Administration

A SIMPLE WAY TO CONDUCT ITEM ANALYSIS

We assume that those examinees who have been identified as “Masters” by passing the criterion-referenced test under analysis or another validated test will perform well on all items at or below the criterion. However, that does not always happen.

To examine the quality of each item in a newly constructed criterion-referenced test, you may want to try this technique.

Note that: A represents “Masters” who pass;

B represents “Non-Masters” who pass;

C represents “Masters” who fail;

D represents “Non-Masters” who fail.

N represents the total number of examinees taking the test.

                   Masters      Non-Masters
   Pass               A              B
   Fail               C              D

Add A and D. Divide by N:

   (A + D) / N

A perfect result would be 1.0. Any result at or above .7 would be acceptable.
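In code, this agreement check is a one-liner. The sketch below is illustrative only (not part of the seminar materials) and uses hypothetical counts.

   def agreement_index(a, b, c, d):
       """Proportion of consistent classifications, (A + D) / N, using the cell labels above."""
       return (a + d) / (a + b + c + d)

   # Hypothetical counts: 30 Masters who pass, 4 Non-Masters who pass,
   # 3 Masters who fail, 23 Non-Masters who fail.
   print(round(agreement_index(a=30, b=4, c=3, d=23), 2))   # 0.88, above the .7 rule of thumb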

BILC Language Testing Seminar V-27

Chapter V Test Construction, Analysis, and Administration

PREPARATION OF ALTERNATE TEST FORMS

Large scale testing programs require alternate forms of language tests to prevent overexposure of the test. The number of forms needed may vary from two to twenty, depending on the size of the program.

Alternate test forms may be equated through control of the content to ensure similarity and through statistics.

Controlling content begins by careful application of the Table of Specifications to ensure that the parallel items on each form follow the same specifications, objectives, and topic domains.

In addition parallel items should focus on the same level according to the descriptors. Test writers will require training to recognize these concepts.

It is not adequate to rearrange the same set of items and options prepared for one form and use the scrambled set to produce an alternate form. Different items using the same specifications and objectives must be written for each form. However, some testing organizations create an overlap of a small number of items between forms.

Although two parallel forms may follow the same specifications, there may be different statistical results. A pretest of the two forms will reveal any imbalance in the difficulty level of the two forms. This data should be used to select or rearrange items so that the difficulty levels are as close as possible. The forms should be equated so that an examinee will score at the same level on all forms.

After pretesting the forms, discard ineffective items and create new parallel forms using the items that function well. The development team will also want to ensure that the formats are the same, the test instructions are the same, and the testing conditions are the same regardless of which form is administered.

Finally, it should be remembered that reliability is increased with test length.

If pretesting determined that a large number of items functioned poorly, it is not a good idea to shorten the test. Time needs to be allocated to write and pretest additional items.

A complete validation is advised unless conditions do not permit this.

BILC Language Testing Seminar V-28

Chapter V Test Construction, Analysis, and Administration

THE TEST DEVELOPMENT PROCESS

A SUMMARY

1. The goal of pre-testing is to examine how the items work when administered to a sample of the target population and to select the ones that are deemed adequate. In fact, this consists of doing a traditional item analysis of the candidates' answer sheets to obtain estimates of difficulty values (or facility values) per level and indices of discrimination. This analysis also permits examination of how well the distractors function. The items for the final version of the test are selected by using these points of information and by referring to the Test Specification Grid.

2. In this first item analysis, the following statistics are needed:

a. the mean, standard deviation, and variance for sub-tests and the whole test;

b. difficulty index per level for each item;

c. mean difficulty index per level;

d. indices of discrimination for each item;

e. mean discrimination value;

f. efficiency of distractors.

3. The estimates of the difficulty index indicate how difficult the item is for the group considered. For Masters (upper group), an adequate mean difficulty index is between 0.70 and 0.85.

4. The discrimination value is calculated as the difference in difficulty index between Masters and Non-Masters (upper and lower groups). The value is not dependent on item difficulty but yields higher values for intermediate difficulty levels.

5. Based on practical experience, the following guidelines are proposed by Ebel (1965) for the interpretation of the discrimination value (D) when the groups are established with the total score as the criterion (see the short sketch following this list):

a. if D is higher than .40, the item is functioning quite satisfactorily;

b. if D is between .30 and .39, little or no revision is needed;

c. if D is between .20 and .29, the item is marginal and needs revision;

d. if D is below .19, the item should be eliminated or completely revised.
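As an illustration only (not part of the seminar materials), Ebel's guidelines can be turned into a small helper; the exact handling of the boundary values (.19–.20 and .39–.40), which the guidelines leave open, is an assumption here.

   def interpret_discrimination(d):
       """Rule-of-thumb interpretation of an item's discrimination value, after Ebel (1965)."""
       if d >= 0.40:
           return "functioning quite satisfactorily"
       if d >= 0.30:
           return "little or no revision needed"
       if d >= 0.20:
           return "marginal - needs revision"
       return "eliminate or completely revise"

   print(interpret_discrimination(0.55))   # functioning quite satisfactorily
   print(interpret_discrimination(0.25))   # marginal - needs revision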

6. Since we use the groups Masters and Non-Masters (upper and lower groups) as the criterion, the value of D should be .30 or higher; in exceptional cases, when an item is the best one available to meet a requirement in the Test Specification Grid, the accepted value is lowered to .25.

7. The item analysis should show that the options of an item work in such a way that the correct answer is clearly the one chosen by most master candidates (upper group). Ideally, all distractors are chosen approximately equally by the remaining candidates. If these conditions

BILC Language Testing Seminar V-29

Chapter V Test Construction, Analysis, and Administration

are not met, slight modifications to distractors may be required. When major modifications are needed, the item is shelved or discarded altogether.

8. After analysis, a sufficient number of items should be available to form the final version of the test. If additional items are required, they are developed and pre-tested as described in the previous paragraphs.

ASSEMBLING THE FINAL VERSION OF THE TEST

9. Once the items are selected, they must be put together to form the final version of the test. The items are selected to comply with all the requirements of the Test Specification Grid and are grouped in order of increasing difficulty. Sufficient numbers of copies are produced in order to meet all test try-out opportunities.

EVALUATION OF TEST VALIDITY AND RELIABILITY

10. The final version of the test is cross-validated with a sample of 50 Masters and 50 Non-Masters in order to obtain the various estimates of the test statistics. Because of the limited size of the sample used, this analysis only provides an interim test validation. Final validation is conducted later with a new analysis, using a larger number of candidates.

11. In order to establish the coefficient of equivalence (Phi coefficient), the same group of candidates takes the final version of the test and a parallel test which has already been validated. When administered together, both tests must be cross-validated and presented to the candidates as official, so that they answer the two tests with the same care.

12. This analysis allows us to obtain the following estimates of the test statistics:

a. the mean, standard deviation, and variance for sub-tests and for the whole test;

b. difficulty indexes per level for each item;

c. mean difficulty index per level;

d. discrimination value of each item;

e. mean discrimination value;

f. efficiency of distractors;

g. coefficient of equivalence (Phi coefficient);

h. standard error of measurement (SEM);

i. measures of internal consistency (Beta coefficient);

j. estimated probability of misclassification established from the SEM;

k. Cohen's kappa index for decision consistency.

BILC Language Testing Seminar V-30

Chapter V Test Construction, Analysis, and Administration

13. It is important to remember that the coefficient of equivalence (Phi) and the kappa coefficient should be interpreted differently. The coefficient of equivalence is a reliability estimate presented in the form of a Phi correlation coefficient; its value should be at least .80. The kappa coefficient indicates the improvement in decision consistency, over chance matches, that results from using the test. For our tests, the value of kappa should be .60 or higher. The estimated probability of misclassification is based on the probability of false negative and false positive misclassification for each score. The average probability of misclassification established from the SEM should not exceed 5%.
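The handbook does not give the kappa formula, but for a 2 × 2 mastery classification table it is the observed agreement corrected for the agreement expected by chance. The sketch below is illustrative only; the cell counts are hypothetical.

   def cohens_kappa(a, b, c, d):
       """Cohen's kappa for a 2x2 classification table (cells labelled as in the Phi examples)."""
       n = a + b + c + d
       p_observed = (a + d) / n                                        # proportion of consistent decisions
       p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2     # agreement expected by chance
       return (p_observed - p_chance) / (1 - p_chance)

   # Hypothetical table: 28 consistent Masters, 25 consistent Non-Masters, 7 disagreements out of 60.
   print(round(cohens_kappa(a=28, b=6, c=1, d=25), 2))   # 0.77, above the .60 guideline in paragraph 13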

DETERMINING STANDARDS

14. Once sufficient evidence of test validity and test reliability has been obtained, the process of determining the standards applicable to the level and the language skill considered must be carried out. The cut-off scores are determined using the two following methods:

a. Zieky and Livingston's Contrasting Groups Method;

b. Ebel's Standard-Setting Procedure.

15. The Contrasting Groups Method is based on adequate and inadequate performances, while the Standard-Setting Procedure uses judgement of item content. The test developer evaluates the cut-offs provided by these two methods and decides on the final cut-off score.

PREPARING THE TECHNICAL REPORT

16. The technical report of the test recapitulates the different steps followed in the test development and presents evidence of the quality of the test.

17. The technical report should be presented in a clear and concise manner. Typically, the names of the section head and of the test developer appear on the cover page, and the technical report for the test includes the following information:

a. general objectives of the test;

b. test specifications grid with the number of items to be developed in the test task, topic and item type columns;

c. action plan followed for the test development;

d. number and percentage of items accepted by the IRB;

e. index of item-objective congruence of Hambleton and Rovinelli for the items selected for administration to native speakers;

f. summary of modifications brought to items after they have been administered to native speakers;

g. characteristics of the target population;

BILC Language Testing Seminar V-31

Chapter V Test Construction, Analysis, and Administration

h. characteristics and size of the pre-test population sample used for the interim test validation;

i. results of item analysis as per paragraph 2 (with actual final form item numbers);

j. bar chart of combined Masters and Non-Masters test results as per paragraph 10, with mean and SD;

k. results of item analysis as per paragraph 12 (with actual final form item numbers);

l. validity of the test content according to the index of item-objective congruence of Hambleton and Rovinelli, calculated from the items of the final version of the test;

m. reliability of the test:
   (1) coefficient of equivalence (Phi coefficient), with a scatter diagram;
   (2) measure of internal consistency (Beta coefficient);
   (3) estimates of probability of misclassification according to:
       (a) SEM;
       (b) Cohen's Kappa.

n. test cut-off score:
   (1) cut-off score established according to Zieky and Livingston's Contrasting Groups Method;
   (2) cut-off score established according to Ebel's Standard-Setting Procedure;
   (3) final cut-off score;
   (4) comments and recommendations from the test developer.

o. content analysis of the final version of the test, presented in a grid with the following entries:
   (1) analysis of the content of each item according to:
       (a) number of words;
       (b) text type:
           Test A: exchange; briefing/speech/tasking
           Test B: interview; meeting; announcement/advertisement/notice; letter/memo; message/note; comments/summary; news/magazine articles
       (c) test task;
       (d) topic;
       (e) item type.
   (2) analysis of the content of the whole test:
       (a) average number of words per item;
       (b) number of items and percentage for each text type, test task, topic and item type.

p. list of the right answers for the items of the final version of the test, with the percentage of right answers for each letter of the answer sheet;

q. comments or recommendations concerning the development or the quality of the test.

BILC Language Testing Seminar V-32

Chapter V Test Construction, Analysis, and Administration

EXAMPLE OF AN ACTION PLAN – SIDE A

(Blank planning grid. For each step the team records: DATES (Beginning & End), PRODUCT EXPECTED, RESPONSIBILITY, CONTACT PERSONNEL, ATTACHED MATERIALS, RESOURCES, POTENTIAL DIFFICULTIES, COMMENTS.)

STEP #
1  Establish Test General Objectives
2  Complete Test Specifications Grid
3  Develop items
4  Submit items to IRB
5  Report decisions of IRB
6  Edit items
7  Administer test to Native Speakers
8  Edit items
9  Target population trials
10 Analyse items

BILC Language Testing Seminar V-33

Chapter V Test Construction, Analysis, and Administration

EXAMPLE OF AN ACTION PLAN – SIDE B

(Same column headings as Side A.)

STEP #
11 Revise items
12 Arrange test – final format
13 Estimate reliability and validity of test
14 Determine standards
15 Prepare Technical Report of Test
16 Enter test in test file
17 Complete test validation
18 Complete Technical Report of test
19 Produce official version of test

BILC Language Testing Seminar V-34

Chapter V Test Construction, Analysis, and Administration

SAMPLE SOURCES FOR TEXTS

Level 1: newspaper announcements, sales advertisements, bulletin board information, invitations, tourist information

Level 2: factual descriptions, narrative reports, instructions, directions

Level 3: editorials, opinion columns, written arguments over public issues

BILC Language Testing Seminar V-35

Chapter V Test Construction, Analysis, and Administration

INSTRUCTIONS FOR READING COMPREHENSION TEST

These are the instructions for the DLPT IV Reading Test as found on page 2 of the test booklet:

INSTRUCTIONS

This test measures your reading comprehension in (TL). In taking the test you should do the following:

1. Read the orientation. It is the underlined short statement before each item that tells you where the passage came from.

2. Read the (TL) passage.

3. Read the question that follows. It may be either in the form of a question or an incomplete statement.

4. Choose the best answer from the four options.

5. Find the question number on your answer sheet and fill in the space that has the same letter as the answer you have chosen.

Since there is no penalty for guessing, it is to your advantage to answer all questions.

Two examples are provided on the following page. Read the examples carefully before starting the test.

BILC Language Testing Seminar V-36

Chapter V Test Construction, Analysis, and Administration

DLPT IV READING COMPREHENSION

Sample Arabic questions

DO NOT WRITE IN THIS BOOKLET

EXAMPLES

I. From a newspaper

Why would a reader contact the newspaper's office?

(A) To place an ad.

(B) To order a subscription.

(C) To submit an article.

(D) To register a complaint.

The correct answer is A. If you were marking your answer sheet, it would look like this:

I. A B C D

II. From a newspaper

This passage states that the Red Cross

(A) is financing a project in London.

(B) has opened libraries in three different cities.

(C) has supplied Arab children with toys.

(D) will soon complete a service project in Lebanon.

The correct answer is D. If you were marking your answer sheet, it would look like this:

II. A B C D

BILC Language Testing Seminar V-37

Chapter V Test Construction, Analysis, and Administration

THE CONCEPT OF PRACTICALITY IN TEST DEVELOPMENT

The third condition that must be met, after validity and reliability, is practicality. The institution, organization, or other user must find it practical to administer and score the test in terms of time, money, and availability of appropriate personnel.

Examples of some of the practical issues that must be addressed before developing a test can be found in Mary Finocchiaro and Sydney Sako's Foreign Language Testing: A Practical Approach (Regents Publishing Company, 1983). Some of the practical concerns the authors identify are:

Economy: Do examinees write in the test booklet, or can the booklet be reused? Is there a budget for reprinting booklets as needed?

Scorability: Are there native or near-native speakers available in the required numbers at the required time to score essay or short-answer questions? If so, is there adequate time and appropriate personnel to train them? And is there a system available for checking their work and providing on-the-spot refresher training?

If it is planned to use electronic scanning equipment to score answer sheets, is that equipment available at test sites that have large numbers of examinees?

Administrative Ease: Will it be possible to train all the test administrators? Will there be an opportunity for them to practice instructions with a sample population? Is there a mechanism for revising instructions if they are flawed?

Does the test have unusual requirements that place a burden on the test administrator – such as a sequence of timed activities that must be controlled by the administrator?

Are tapes clear and booklets legible?

BILC Language Testing Seminar V-38

Chapter V Test Construction, Analysis, and Administration

PRACTICAL CONSIDERATIONS WHEN BEGINNING

A TEST DEVELOPMENT PROJECT

1. Test developers must clearly understand what kind of test they will develop – a proficiency test, an achievement test, a job-related test, a placement test, a diagnostic test.

Proficiency tests examine unrehearsed, unpredictable language competence, not associated with any specific program. Tests consist of authentic language (as used by native speakers for non-instructional purposes). Proficiency implies the ability to transfer skills from one situation to another.

Job-related tests examine language and communication capability within a defined domain. Tests consist of job-related language or, if appropriate, language needed for academic purposes, as well as job-related settings and scenarios. Successful performance on a job-related language test does not imply the ability to transfer skills.

2. Test Developers and managers of testing programs must determine whether the test will be objectively or subjectively scored. If the test is designed for scoring by human raters, then certain conditions must be met. There must be qualified, trained raters available in adequate numbers when tests are ready for scoring. There also needs to be a quality control system in place to monitor the work of raters and to provide additional training when needed.

If the test is designed for machine scoring or through use of a scoring template, training and quality control of a different type will be needed.

3. If the test is to be based on authentic material, Test Developers will need to ensure that such material is available at the appropriate levels.

4. Test Developers must determine whether tests will be monolingual or bilingual.

5. Test Developers must also determine whether test instructions will be in the target language or the native language of the examinees.

6. Test Developers will need to make decisions about test length. While we know that a longer test is more reliable, there may be limitations on the time available for testing. If there are to be tests of listening, speaking, reading, and writing, are there logistical requirements that all tests be administered the same day? If examinees must be transported to a central testing location, it may be prohibitive for them to stay for two or more days. Unfortunately, reliability is not the only consideration when deciding upon test length.

7. Test Developers must determine if reading and writing tests will be timed or power tests.

8. Managers of testing programs must determine how many levels will be tested – all the levels in a scale, one level, two or three levels.

9. Test Developers and managers of testing programs must determine the number of alternate forms to be developed. Two or more alternate forms are recommended to reduce test compromise and over-familiarity with content. However, time or budget restrictions may prohibit this. The consequences of relying on a single form of the test should at least be discussed frankly by those responsible for the program.

BILC Language Testing Seminar V-39

Chapter V Test Construction, Analysis, and Administration

RESOURCES

Personnel

Time

Liaison with Managers

Computers

Designated Work Space

Furniture

Supplies

Copiers

Computer Training

Internet Access

Statistical Programs/Training

Recording Equipment

Reproduction Facilities

Reference Books

Periodicals

Safekeeping of Tests

The most important resource: The Test Development Team

QUESTIONS:

Will team members be released from other duties?

How will team members be selected?

What is the optimum team size?

What kinds of skills are needed?

SOME ANSWERS:

The Test Developers should be released from other duties. Those who assist them might be able to work part time.

There are a number of ways to create a team. Ideally, supervisory personnel will identify those with ability, training, and interest. At the same time, it is preferable that the assignment be accepted willingly.

Two or three test writers would be workable. However, other personnel will be (or may be) needed. For example:

• a Project Officer

• an Editor (or independent reviewer of format and test appearance)

• additional voices for recording

• an illustrator, if drawings are to be used

A variety of skills are needed: Strong knowledge of both the target language and the examinees’ native language. Ability to learn to use the proficiency scale. Ability to learn testing principles and write appropriate items. Good ability to work as part of a team and to give and accept constructive criticism.


CHAPTER VI

EVALUATION OF

SPEAKING PROFICIENCY

TEST OBSERVATION GRID

[Observation form used while monitoring an OPI. Columns: TOPICS, TASKS, LEVEL OF TASKS, LEVEL CHECK or PROBE, and WORKING LEVEL during the test. The grid also provides rows for the WARM UP and the WIND DOWN, and space to note the examinee's performance (CAN DO / CANNOT DO).]


1. The Oral Proficiency Test Structure

1. Four phases of the OPI and three perspectives

2. Reaching the ceiling and linguistic breakdown

3. Core of the test: Level Checks and Probes

4. The Role-play

5. Tests in 30 minutes or less

6. Efficient elicitation of a ratable sample

7. Conscious structuring of the OPI

8. Tester-Tester interaction

9. Tester stance

10. Common test administration problems

11. The OPI and culture

FOUR PHASES OF THE OPI AND THREE PERSPECTIVES

The four mandatory phases of the OPI are the warm up, the level checks, the probes, and the wind down. If any of these four phases is omitted, the speech sample is not ratable. Each of the four phases has a specific purpose, which needs to be viewed from three different perspectives: the psychological, the linguistic, and the evaluative.

Phase 1: The warm up. The warm up does not take much time, but is an important part of every test. Note: The warm up should reflect the language and culture being tested. For instance, what’s appropriate in American English may not be appropriate in Japanese.

Every OPI begins with a warm up. Testers should begin speaking to the examinee at a normal rate of speech, exchange greetings and initiate a polite, informal conversation. During the warm up phase, the tester should not try to challenge the examinee. The warm up allows the examinee to ease into using the target language. Regardless of their proficiency level, some examinees will need time to get used to using the target language, especially if they do not speak it often. The warm up gives the tester a chance to establish a professional and friendly rapport with the examinee and to set the tone of the test.

From a psychological point of view, the purpose of the warm-up is to make examinees comfortable. From a linguistic point of view it allows the examinees to adjust to speaking the language and to accustom themselves to the testers’ pronunciation and way of speaking.

Finally, from an evaluative perspective, it provides the first tentative evidence for the testers' assessment of the working level. This initial phase is also a crucial first opportunity for the tester to make a mental note of topics that can be developed during subsequent phases of the OPI.


Phase 2: Level checks. From a psychological point of view, the level check is an opportunity for the examinees to demonstrate what they can do with the language. From a linguistic point of view, level checks identify the functions and content areas that examinees can handle. From an evaluative point of view, level checks establish the “floor”. Examinees demonstrate the ability to sustain the performance at that level.

The working level is the level that the testers hypothesize to be the actual proficiency level of the examinee. It is based on the evidence demonstrated during the warm-up. An initial working level will be raised or lowered during the test, depending on the examinee’s response to various probes and level checks. During the OPI, testers are required to continually verify, and change when necessary, the working level. By the end of the level checking, the working level should be the examinee’s score, subject to final verification through the formal rating process.

Phase 3: Probes. The probe is the tester's attempt to raise the level of the examinee's language by posing a higher-level task. From a psychological point of view, probes demonstrate to the tester what the examinees cannot do with the language. From a linguistic point of view, probes identify the tasks and content areas the examinee cannot handle, resulting in linguistic breakdown. From an evaluative point of view, there are three possible results of probing (a schematic sketch of this decision follows the list):

1. The examinee cannot handle any of the linguistic tasks presented in the probes, confirming the level established during the level checks;

2. The examinee handles each probe consistently and accurately, thereby establishing a new floor and working level at the next higher level;

3. The examinee neither completely fails nor fully succeeds in performing the linguistic tasks presented by the probes. The degree to which the performance at the next higher level is sustained will determine whether the tester assigns a plus level.
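The decision logic above can be pictured with a small sketch. This is purely illustrative and not part of the OPI procedure as written; the function name and the numeric threshold for "sustained" performance are hypothetical, since the handbook leaves that judgment to the trained rater.

    # Illustrative sketch only: mapping the three results of probing onto an action.
    # The 0.5 "sustained" threshold is a hypothetical placeholder for rater judgment.
    def resolve_probing(probe_results: list) -> str:
        """probe_results holds True for each probe the examinee handled accurately."""
        if not any(probe_results):
            return "confirm the floor established during the level checks"        # result 1
        if all(probe_results):
            return "raise the floor and working level to the next higher level"   # result 2
        sustained = sum(probe_results) / len(probe_results)                       # result 3
        if sustained >= 0.5:
            return "consider assigning a plus level"
        return "keep the working level; next-level performance not sustained"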

Phase 4: The Wind Down. From a psychological point of view, the wind down returns the examinee to a comfortable level and ends the OPI on a positive note. From a linguistic point of view, it gives the tester the opportunity to ensure that the test process is complete. From the evaluative point of view, the wind down should not add anything to the speech sample. If it does, level checks and probes should recommence. The chart summarizes the psychological, linguistic, and evaluative perspectives of each phase of the OPI.

Four Phases and Three Perspectives

Warm-up
• Psychological: Relaxes examinee.
• Linguistic: Reacquaints examinee with language if necessary.
• Evaluative: Provides testers with preliminary indication of level of speech skills.

Level Check (iterative with Probes)
• Psychological: Proves to examinee what he or she can do.
• Linguistic: Checks for tasks and content which examinee performs with greatest accuracy.
• Evaluative: Finds the examinee's speaking level.

Probes (iterative with Level Checks)
• Psychological: Proves to examinee what he or she cannot do.
• Linguistic: Checks for tasks and content which examinee performs with least accuracy.
• Evaluative: Finds level at which examinee can no longer speak accurately.

Wind Down
• Psychological: Returns examinee to level at which he or she functions most accurately.
• Linguistic: Chance to check that the iterative process is complete.
• Evaluative: Gives global rating; no new information.


2. REACHING THE CEILING AND LINGUISTIC BREAKDOWN

The phenomenon of reaching the ceiling or limitations in performing the required tasks is referred to as linguistic breakdown.

The term “linguistic breakdown” refers to signs in a speech sample that indicate that the examinee is ceasing to produce language appropriate for a particular level. Such breakdown can take several forms:

• Loss of fluency: Frequent pauses occur because the examinee does not have the language resources to perform the linguistic task. (Pausing to organize one's thoughts, on the other hand, is not considered loss of fluency.)

• Deterioration: At levels below 3, morphology and syntax become worse.

• "Dead-ending": Words, phrases, and constructions are left incomplete because lack of language ability prevents their completion.

• Substitution: The speaker uses words or phrases from a language other than the target language.

• Avoidance: The examinee anticipates a linguistic difficulty and evades it by using a construction that is not appropriate to the task, or avoids it altogether.

• Non-verbal indicators: Nervous laughter, loss of eye contact, blushing, or playing with fingers or hair can be physical manifestations of linguistic discomfort.

• Overt admission: The examinee overtly admits inability to perform the assigned task.

3. CORE OF THE TEST: LEVEL CHECKS AND PROBES

An effective tester does not think of the level checks and probes as discrete phases, to be treated separately, but frequently uses a single topic to move back and forth between the level checks and probes. This is known as the iterative process. The tester probes until the linguistic ceiling is evident, then returns to the solid floor of proficiency with another level check. The distribution of time between level checks and probes will vary from test to test. However, level checks and probes need to be interwoven.

Elicitation/Response Cycle

Whether a tester is probing or level checking, it is important to have a clear purpose to each question that is asked or statement that is made. Testers need to evaluate the examinee’s response to each elicitation and to follow up based on this evaluation. The process of selecting the appropriate elicitation technique, listening to and evaluating the response, and formulating a follow-up elicitation can be viewed as a cycle. This process continues until the tester has “built” the ratable sample.


Elicitation/Response Cycle (diagram)

The cycle: choose a topic and a purpose for the statement or question; pose the question or make the statement to elicit a response; allow the examinee to respond; evaluate the response by comparing it to the original purpose of the question/statement; then use that evaluation to choose the next topic and purpose.

Tips keyed to the steps of the cycle:

• Determine what information you need to achieve a ratable sample (functions, tasks, levels of language, etc.).

• Have a clear and exact purpose in mind.

• Pose questions naturally. Avoid "teacher talk." Simplify only when necessary.

• Keep the purpose of your question/statement in mind and give the examinee time to give an adequate response.

• Avoid interrupting the examinee's thought processes during his/her attempt to formulate and give a response.

• Keep the purpose of your question/statement in mind during the examinee's response and your evaluation of it.

• Use your evaluation of the response to determine what your next question/statement will be. Your next question/statement should typically follow up on the examinee's response.

In order to be ratable, every test must contain 2-4 probes, depending on how solidly the examinee is in the range. For example, if the examinee is just across the threshold, two probes are usually sufficient to establish breakdown. On the other hand, if the examinee is solidly in the range, more probes may be necessary to establish whether the examinee qualifies for a plus rating or whether the working level should be raised.

4. THE ROLE-PLAY

The role-play is used to check whether the examinee can carry out tasks that cannot be elicited by means of a conversational exchange. It also shows how the examinee functions in a real life situation in the target culture. It can serve as a level check or a probe. It is generally introduced midway through the test.

• Role-plays are mandatory for Level 1 to Level 4. At Level 4, two role-plays are required to show evidence of tailoring language. At Level 5, if the sample does not show that the examinee commands both formal and informal register, role-plays should be used to establish that.

• For some tests, more than one role-play may be necessary.

5. TESTS IN 30 MINUTES OR LESS

Tests up to level 3 need to be conducted in 30 minutes or less. In rare instances, a test may last a few minutes longer. No test should last more than 45 minutes.


6. EFFICIENT ELICITATION OF A RATABLE SAMPLE

The primary goal of the OPI is the efficient elicitation of a ratable sample. To be ratable, a speech sample must clearly show what examinees can do, as well as what they cannot do with their language. The highest level that they can sustain throughout the OPI is known as the “floor.” The level at which their language breaks down and where they can no longer sustain their performance is known as the “ceiling.”

7. CONSCIOUS STRUCTURING OF THE OPI

The careful and conscious structuring of the OPI is what distinguishes it from a non-evaluative conversation or discussion.

Using various elicitation techniques, testers obtain a speech sample based on topics that arise during the conversation. The test is structured to highlight the patterns of strengths and weaknesses, which define both the linguistic floor and the linguistic ceiling. If the structure of the OPI is not followed, an unratable sample will result.

8. TESTER-TESTER INTERACTION

Both testers are equally responsible for obtaining a ratable sample. During the entire OPI, testers take turns in building the sample. No one tester should predominate.

9. TESTER STANCE

Many testers are also teachers. The skills required by the two activities, teaching and testing, do not necessarily overlap. In particular, the tester's stance toward the examinee (how the tester approaches the examinee in the test) differs. The helpful teacher who corrects students and finishes their sentences for them is out of place in the test. A basic quality of friendliness marks both good teaching and good testing, but testing demands an objective attitude on the part of the tester, an attitude that requires examinees to prove that they can function independently in the target language.

10. COMMON TEST ADMINISTRATION PROBLEMS

Several common problems can occur in administering the OPI. Most are easy to solve, and test time may be saved if testers become aware of them early in the test.

Refusal to Talk

If, for whatever reason, an examinee refuses to talk, remind her that no sample means no rating. Appeal to the examinee's motivation.

Mixing of English into the TL

At Levels 0+ and 1 the examinee may use English words, which the tester notes as evidence of breakdown. At Level 2, an English word offers a perfect opportunity to ask for clarification or a definition in the target language.

Examinees with native languages other than English may mix in words from their languages.


Responses Too Short for Maximum Ratability

At Levels 0+ and 1, examinees will respond briefly because they do not have sufficient vocabulary to speak at length. Above Level 1, short responses may be due to other factors and may affect the rating.

Refusal to Discuss a Particular Topic

An examinee may refuse a particular topic occasionally. Repeated refusal to discuss topics will constrain the quality of the interview.

Refusal to Role Play

Some examinees either dislike or are unable to role-play. Role-plays, however, are required as part of the OPI.

Overuse of One Topic (Hot-house Special)

Obtaining an adequate, ratable sample from a talkative examinee depends on the tester's being in control. Little is gained if the examinee rambles on about one single topic (often a favorite topic, called a "hot-house special"). Testers must be able to shift topics as soon as the task has been performed adequately.

Claiming to Know Too Little

This problem may appear in both low level and high level tests. In both cases the testers need to move on to other topics.

11. THE OPI AND CULTURE

Where the OPI is a test of foreign language ability, it has generally been assumed that most examinees are Americans who have learned the TL. As a result of this cultural assumption, initial test topics (warm up) are assumed to be culturally American, with a gradual shift into the TL culture.

In summary, the tester must include all of the following elements in order for the OPI to produce a ratable sample (a minimal checklist sketch follows the list):

• Four phases: warm up, level checks interwoven with probes, wind down
• All required tasks for the relevant level, with the necessary accuracy
• Two to four probes (except at Level 5)
• Role-play (one at Levels 1-3, two at Level 4)
• A variety of topics
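As a minimal illustration of this checklist (a sketch only: the record format is hypothetical, and the numeric stand-in for "a variety of topics" is purely illustrative, since the handbook does not quantify it):

    # Sketch of the ratable-sample checklist above; field names are hypothetical.
    from dataclasses import dataclass, field

    REQUIRED_PHASES = {"warm up", "level checks", "probes", "wind down"}

    @dataclass
    class OPISample:
        phases: set = field(default_factory=set)
        working_level: int = 1            # level being verified (1-5)
        required_tasks_done: bool = False
        probes: int = 0
        role_plays: int = 0
        topics: int = 0

    def missing_elements(s: OPISample) -> list:
        problems = []
        if s.phases != REQUIRED_PHASES:
            problems.append("all four phases (warm up, level checks, probes, wind down)")
        if not s.required_tasks_done:
            problems.append("all required tasks for the level, with the necessary accuracy")
        if s.working_level < 5 and not (2 <= s.probes <= 4):
            problems.append("two to four probes (except at Level 5)")
        if 1 <= s.working_level <= 3 and s.role_plays < 1:
            problems.append("one role-play (Levels 1-3)")
        if s.working_level == 4 and s.role_plays < 2:
            problems.append("two role-plays (Level 4)")
        if s.topics < 3:                  # illustrative stand-in for "a variety of topics"
            problems.append("a variety of topics")
        return problems                   # an empty list means the sample is ratable

An empty return value corresponds to a ratable sample; each string returned names an element the testers still need to elicit.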


SAMPLE PROMPTS FOR TESTING SPEAKING PROFICIENCY

LEVEL 1

SPEAKING TASKS

Where do you live?

How long have you lived there?

Where did you live before?

Do you have a job? Where do you work? What are your duties?

How do you get to work?

ROLE PLAYS

Your family is coming to visit you for a week. You want to make hotel reservations for them. Call the hotel. I will play the role of the reservations clerk.

Last night while you were eating dinner, you broke a tooth. Call the dentist’s office to make an appointment. I will play the role of the dentist’s receptionist.

You are in a gift shop in New York and would like to take some souvenirs back home to your family. I will play the role of a salesperson. Ask for my assistance.


LEVEL 2

SPEAKING TASKS

Describe your first trip to (the United States, Canada, Spain, etc.)

What will you do after you finish this test?

Describe a typical day for a school child in your country.

ROLE PLAYS

Your teacher has just asked you to describe your language learning program to a military visitor. Provide a clear summary that includes the purpose of the program, the amount of time students spend, and the skills they develop. Answer any questions the visitor may have. I will play the role of the visitor.

You are trying to enroll in a computer class at a local college. You can only attend class on Monday and Wednesday evenings. Call the Department of Computer Science for information about their new course. Find out as much as possible about the course and how it fits into your schedule.

While you are vacationing in a hotel in San Francisco, your child wakes up with a fever and a headache. You go to an all-night drugstore and ask for advice as to what to do or what nonprescription medicine to buy. Explain to the drugstore clerk what is wrong and try to get help with your purchase.


LEVEL 3

SPEAKING TASKS

What do you think the role of the press should be in the 21st century? Why?

Are you concerned about the continuing proliferation of nuclear weapons into Third World countries? Why?

If you could establish a system to categorize television programs so that parents could control their children's viewing, how would you go about it? What would you include? Why?

ROLE PLAYS

You live in an apartment. It is 2:00 a.m. You have just accidentally washed your contact lens down the drain of the bathroom sink. Explain your problem to the apartment house manager and try to get him/her to help you retrieve the contact lens. I will play the role of the apartment house manager.


LEVEL 4

SPEAKING TASKS

Are you concerned about the continuing proliferation of nuclear weapons into Third World countries? Why?

Do you think that human survival can be assumed in a world with the potential for nuclear annihilation? Reasons?

How can we evaluate the human experience on this planet if all the achievements in the arts and sciences over the centuries could be destroyed by a political decision to drop a bomb?

ROLE PLAYS

I will play the role of a colleague with whom you have shared an office for many years. Recently this colleague's heavy smoking in the workplace has begun to bother you. You are also concerned about this person's health. Try to persuade him/her to give up smoking.

You are attending the monthly meeting of an English Language Club you have belonged to for many years. Suddenly, you are completely surprised by an announcement that you have received an award for your outstanding contributions as a volunteer fund-raiser during the last three years. You must stand up, go to the podium, and make a short but formal acceptance speech to the group.

You are a supervisor in a California computer business that has contracts with the Federal Government. An investigator has just informed you that one of the computer specialists has carelessly revealed confidential information to a foreign agent. There is no evidence that the employee did this for financial or political reasons. You are required to explain the gravity of the situation to the employee, dismiss him/her from employment, and offer some advice about future conduct. I will play the role of the employee.


STEPS TO FOLLOW IN PREPARING SPEAKING PROMPTS

1. Each group will develop:
   • three prompts and one role play for Level 1
   • four prompts and one role play for Level 2
   • three prompts and one role play for Level 3

2. Plan the item development activity.

3. Review the descriptors and the Content/Task/Accuracy statement for each level before preparing to write the prompt.

4. Also review "Sample Prompts for Testing Speaking Proficiency" and "Basic Guidelines for Writing Role-Play Situations" found on pages VI-9~12 and VI-20~27.

5. Write each prompt and role play and get agreement on wording within the group.

6. Ask a facilitator or a member of another group to review the prompt.

7. Revise, if necessary.

8. Assign one group member to plan and manage the Warm up and the Wind down.


SAMPLE INSTRUCTIONS FOR A SPEAKING TEST

The general instructions for the speaking test can be found on page 1 of the test booklet.

DO NOT WRITE IN THIS BOOKLET

GENERAL INSTRUCTIONS

This is a test of your ability to speak (TL) fluently and accurately. There are several parts in the test. Directions for each part will be given in English.

Within each part of the test, the questions will range in difficulty from easy to considerably more challenging. You are not expected to be able to answer all questions with equal facility. However, you should try to speak as much and as well as possible in response to each question. The amount of time provided for your response will vary depending on the complexity of the question.

Listen carefully to the instructions and questions given in English, and answer in (TL) when you are asked to do so. Your answers will be recorded on tape, so it will be important for you to speak clearly enough and loudly enough for your voice to be properly recorded.

Do NOT turn to the next page

until you are asked to do so

FOR OFFICIAL USE ONLY

1


DLPT IV SPEAKING TEST, PART I

The following are the instructions for PART I of the speaking test as found in the booklet and voiced on the tape:

PART I

In this part of the test, imagine you are with a native speaker of (TL) whom you have met recently. This person would like to know more about your background, activities, and interests. The speaker you will hear on the tape is this person.

Listen carefully to each question the (TL) speaker asks you and answer in (TL) during the pause immediately following the question. Say as much as you can in response to each question. A short tone signal will alert you that the speaker is about to begin the next question. Remember to respond as soon as possible after each question and to say as much as you can about it. Now here is the (TL) speaker.

There are 12 TL questions in this part. The following are examples of PART 1 questions stated in English:

Example 1: What are you doing this weekend?

Example 2: How long have you been living in this area?

(After each question, the examinee is given 20 seconds to respond)


DLPT IV SPEAKING TEST, PART II

The following are the instructions for PART II of the speaking test as found in the booklet and voiced on the tape:

PART II

In this part of the test, there are five (5) situations in which you will be asked to talk about some pictures. You will be asked to provide certain descriptions, to give directions, and to talk about two different series of events.

For each situation, you will have fifteen seconds to study the pictures. Then, when you hear the tone, start speaking. You will have 75 seconds to complete your answer. You will hear a second tone signal to alert you that you have ten (10) seconds left to finish your response. The word “STOP” will be your cue to stop speaking. Remember to say as much as you can and speak as well as you can about each situation.

The following is an example of situations in PART II:

SAMPLE SITUATION (75 seconds speaking time):

You are at a (TL nationality) lawyer’s office for an important meeting, which was scheduled for 2:30 PM. You are over 30 minutes late because you had some problems with your car on the way here and had to take a taxi. Apologize to the lawyer and, based on the series of pictures on the opposite page, explain to him the reason for your tardiness.

Take fifteen (15) seconds to prepare, then start speaking when you hear the tone.


DLPT IV SPEAKING TEST, PART III

The following are the instructions for PART III of the speaking test as found in the booklet and voiced on the tape:

PART III

In part III of the test, you will be asked to speak on five (5) different topics.

Imagine that you are talking to a group of (TL) friends about these topics. For each topic, you should say as much as you can and speak as well as you can.

Both fluency and accuracy will be evaluated.

For each topic, you have twenty (20) seconds to think about what you want to say. Then, when you hear the tone, start speaking. The amount of time allotted for your response is different depending on the topic and is shown in your booklet.

You will hear a second tone signal to alert you that you have ten (10) seconds left to finish your response. The word “STOP” will be your cue to stop speaking.

Remember, say as much as you can about each topic. Now here is Topic #1.

The following is an example of topics in PART III:

SAMPLE TOPIC (75 seconds speaking time):

A friend of yours from (TL country) is working in the United States. He has been advised to apply for a credit card to facilitate renting cars, paying for meals at restaurants, making purchases, etc. Since credit cards do not exist in his country, try to explain to him the best you can what a credit card is, and about the benefits of having one.

Take twenty (20) seconds to prepare, then start speaking when you hear the tone.


DLPT IV SPEAKING TEST, PART IV

The following are the instructions for PART IV of the speaking test as found in the booklet and voiced on the tape:

PART IV

In this part of the test, you will be asked to play a role in five (5) different settings. In each case, imagine that you are actually in that situation and that you have to respond in (TL) using appropriate language. Say as much as you can and speak as well as you can as you play each role.

For each of these role-plays, you will have some time to think about what you want to say. Then, when you hear the tone, start speaking. The amount of time allotted for your response is different for each role-play, and is shown in your booklet. You will hear a second tone signal to alert you that you have ten (10) seconds left to finish your response. The word “STOP” will be your cue to stop speaking. Remember, say as much as you can in each role-play. Now here is the first role-play.

The following is an example of the role-play in PART IV:

SAMPLE ROLE-PLAY (60 seconds speaking time):

You are working in an office in (TL city). Recently a co-worker, new to the job, has been coming to work late and leaving rather early almost every day. Fortunately, the office manager has been very busy and has not noticed it yet. Talk to your co-worker privately, find out if there are any problems and offer your help. Tell him or her also that the office manager is very strict and that if the situation continues he is bound to find out and take disciplinary action.

Take fifteen (15) seconds to prepare, then start speaking when you hear the tone.
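Pulling the four parts together, the timing parameters quoted above can be restated as a small configuration sketch. The format is hypothetical and merely summarizes the instructions; entries marked None vary by item and are printed in the test booklet.

    # Summary sketch (hypothetical format) of the DLPT IV speaking test timings above.
    # None means the time varies by item and is shown in the test booklet.
    TEN_SECOND_WARNING = 10   # second tone in Parts II-IV: ten seconds remain
    DLPT_IV_SPEAKING_PARTS = {
        "Part I (personal questions)": {"items": 12, "prep_seconds": 0, "response_seconds": 20},
        "Part II (picture situations)": {"items": 5, "prep_seconds": 15, "response_seconds": 75},
        "Part III (topics)": {"items": 5, "prep_seconds": 20, "response_seconds": None},
        "Part IV (role-plays)": {"items": 5, "prep_seconds": None, "response_seconds": None},
    }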


BASIC GUIDELINES FOR WRITING ROLE-PLAY SITUATIONS

1. Write your situations on cards.

2. Write each situation as you intend to present it to the candidate. Since the language required to explain a situation in the target language may be at a higher level than the candidate can understand, write and present Basic Situations (Level 1) in the candidate's native language, if possible. The instructions themselves should not become a listening test. Situations for Levels 3-5 should be written and presented in the target language.

3. Keep the details in each situation to a minimum. A good situation is detailed enough to seem real, but not so detailed that the candidate cannot imagine himself playing that role. A situation that is presented with too much detail will become simply a discrete-point translation exercise.

4. Each role-play situation should present three things:
   a. the candidate's role
   b. the interviewer's role
   c. the task

5. While writing the situation, think carefully about your role. Pick one you feel comfortable with. If you can't talk like a garage mechanic, be a doctor or a waiter. Consider what you expect the candidate to say and have scenarios and necessary information ready: e.g. prices, bus schedules, automobile problems, and so forth.

6. You, the interviewer, should begin the role-play and also make certain it has a definite end.

7. Don't let situations get out of hand during the interview. Maintain control over length and content. Bring the situation to an end whenever it ceases to produce a ratable sample.

8. If the candidate asks for a second to think about the situation, grant it. Almost any candidate needs a moment to grasp what it is you're asking for.

9. The role-play situations illustrated here are given only as examples. Interviewers must write their own role-play situations to make them fit their specific language and culture.


WRITING SITUATIONS FOR TESTING PROFICIENCY

Illustrated Checklist

Situation Structure:

You are a tourist travelling in India and need a hotel room.  [CANDIDATE'S ROLE]
Make arrangements with the hotel clerk.  [TASK]
The interviewer will play the role of the clerk.  [INTERVIEWER'S ROLE]

Situation Checklist:

Go over each of the situations you have written and make sure you have the following (a minimal checking sketch follows this list):

1. Candidate's role.

2. Interviewer's role (this begins with "The interviewer will play…").

3. The task.
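A minimal sketch of this three-part check, under the assumption of a hypothetical dictionary format for a drafted situation:

    # Illustrative only: checking that a drafted situation contains all three parts.
    REQUIRED_PARTS = ("candidate's role", "interviewer's role", "task")

    def missing_parts(situation: dict) -> list:
        """Return the parts still missing from a drafted role-play situation."""
        return [part for part in REQUIRED_PARTS if not situation.get(part)]

    # e.g. missing_parts({"candidate's role": "tourist needing a hotel room",
    #                     "interviewer's role": "hotel clerk",
    #                     "task": "make arrangements for a room"}) returns []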


PART I

Basic Situations – Level 1

EXAMPLES:

You and your family are tourists in Vienna. You need a hotel room. The interviewer will play the role of the hotel clerk. Get a room for yourself and your family.

You are in France and need to extend your tourist visa. The interviewer will play the role of the official who is taking your application. Provide the information requested.

You are visiting Santiago. You want to buy some items to take back home as gifts for several friends. The interviewer will play the role of the sales person in a gift shop. Buy some souvenirs.

a. You are staying in a hotel in Geneva when suddenly you feel sick. The interviewer will play the role of the hotel clerk. Tell him about your problem and ask him to call the doctor.
b. The other interviewer will play the role of the doctor. Tell the doctor about your illness and answer any questions.


PART I

Basic Situations – Level 1 (continued)

You and a companion have just arrived in Warsaw and need something to eat. The interviewer will play the role of the waiter in a small restaurant. Order something for both of you to eat and drink. When you are finished eating pay the bill and leave.

You are at the train station in Cairo. You need to get to Aswan as soon as possible without spending too much money. The interviewer will play the role of the ticket clerk. Get the necessary information and buy your ticket.

The Level 1 speaker should be able to do the following:
• Get a room in a hotel
• Get a taxi; talk with the driver
• Make a reservation for dinner
• Respond appropriately to questions at customs
• Use appropriate greetings and social clichés
• Cash a check and change money
• Find out about opening and closing times (public buildings, museums, offices, theatres, etc.)
• Find out departure and arrival times of city transportation
• Get information about a flight, bus or train trip
• Arrange a meeting with someone at a specific time and place
• Order a meal in a restaurant
• Ask and understand directions to something in a building, city, rural area or on a highway
• Make simple introductions
• Tell someone that he or she is tired, sick or otherwise needs something for his or her personal comfort
• Describe medical needs simply to a pharmacist, doctor or dentist
• Take care of routine car needs at a service station
• Report a car breakdown and request help
• Buy newspapers, magazines, postcards, souvenirs, common drugstore items and so forth
• Buy stamps and cards at the post office, mail packages, send telegrams and so forth
• Buy basic food items at a street market (if appropriate)
• Use common telephone phrases: When do you expect her? May I leave a message? Please have him call me; etc.


PART II

Situations with Complications – Level 2

You can raise a basic (Level 1) situation to Level 2. While acting out the situation, introduce a couple of problems for the candidate to deal with. (NOTE: The interviewer simply introduces these complications as part of the situation. To make it seem as realistic as possible, do not tell the candidate in advance about the complication. This means that you do not have to prepare separate Level 2 situations. You should, however, think in advance of some ways to complicate any basic situation.)

EXAMPLES: (Compare with previous two pages)

1. The type of room the candidate wants in the hotel is not available. He has to deal with the complication by trying to get a different type of room, going to another hotel, etc.

2. The gift shop doesn't have the items the candidate wants to buy as gifts. She has to try to get something else, try to locate another store, etc.

3. The candidate learns that the doctor he wants to see has just been called away on an emergency. He has to visit another doctor some distance away, wait, etc.

4. There has been a problem in the restaurant kitchen and the restaurant can only serve a very limited menu of cold food. The candidate has to find another place, order something cold, etc.

5. Due to a very bad storm in Egypt, trains are being delayed and rerouted. Passengers may have to transfer several times, take a bus part of the way, etc.


PART III

Unfamiliar Situations – Levels 3-5

1. Unfamiliar situations test the candidate's ability to solve real problems by dealing with the unexpected, presenting arguments and ideas, answering objections, defending one's actions, explaining at length, and so forth.

2. The candidate must produce the specific language to deal with non-routine events. It is the responsibility of the interviewer to guide the role-play situation toward that goal.

3. Situations at this level and higher should be written and presented in the target language.

EXAMPLES:

You and your small child are in a large city park. Your child has just thrown your keys into the goldfish pond. The interviewer will play the role of a park policeman. Explain your problem to the policeman and ask for help.

You live in an apartment. You have just accidentally washed your contact lens down the drain of the bathroom sink. The interviewer will play the role of the apartment house manager. Explain your problem to the manager and try to get him to help retrieve your contact lens.

You live in an apartment building. When your upstairs neighbor waters the plants on her balcony the water ends up on your balcony damaging your furniture. The interviewer will play the role of your neighbor. Go to her and discuss the problem.

You have just arrived at the San Francisco airport after a direct flight from London. A customs official has found some herbal tea (not in the original package) in a plastic bag in your suitcase and thinks it might be a "controlled substance". The interviewer will play the role of the customs official. Answer her questions and explain the situation.


PART IV

Persuading/Convincing/Advising – Levels 4-5

One of the tasks for Level 4 and above is persuading/convincing/advising etc. Role-play situations involving this task are used to check the candidate’s ability to perform these demanding tasks while maintaining the required level of language accuracy. These situations involve the ability to talk someone into (or out of) doing (or thinking) something.

NOTE: In presenting role-plays at Levels 4 and 5 (see also TAILORING in the next section), you must in some way indicate the personal relationship between the participants in the situation. In addition to knowing what to do, the candidate must also know how to talk to his/her partner in the conversation.

EXAMPLES:

The interviewer will play the role of a colleague with whom you have worked in the same office for many years. Recently the colleague’s heavy smoking has begun to bother you. You are also concerned about your colleague’s health. Try to persuade her to give up smoking.

The interviewer will play the role of a close friend with a new-born baby. Try to persuade her to stay home for a year with the baby instead of returning to work full time immediately.

The interviewer will play the role of a driver with whom you have just been involved in a minor traffic accident which has caused some damage to both cars. Try to convince him to settle out of court and without involving insurance companies.

The interviewer will play the role of a 19-year old student in your class. Recently his performance has begun to slip badly. You suspect a personal problem. Try to find out what is wrong and advise him how to solve the problem.


PART V

Tailoring – Levels 4-5

One of the tasks for Level 4 and above is tailoring language. Role-play situations involving this task are used to check the candidate’s ability to use the language register appropriate to the relationship between the speakers and the purpose and setting of the conversation.

EXAMPLES:

The interviewer will play the role of the garage mechanic who has worked on your cars for years. Your car has suddenly started to make a strange noise and lose power going up hills. Explain your problem to the mechanic and try to have him repair your car.

You are attending the monthly meeting of the Korean-American Club of which you are a long-time member. Suddenly, and as a complete surprise to you, you are presented with an award for your outstanding contributions as a fundraising volunteer over the past year. You have to make a short impromptu acceptance speech.

The interviewer will play the role of your 5-year-old child whose kitten has just died. Try to offer the child consolation and a solution to the problem.

The interviewer will play the role of a woman for whom you have worked for a number of years. She is older than you and your relationship is friendly but still somewhat formal. You have good reason to believe that one of her oldest, most trusted employees may be taking money from the company. Go to her and alert her to the situation as tactfully as possible.


ORAL PROFICIENCY INTERVIEW RATING SHEET

NAME ________________  CLASS NUMBER ________________
SOCIAL SECURITY NUMBER ________________  LANGUAGE ________________

Consult the ILR Level Descriptions and the Rating Factor Grid on the other side before assigning your rating.

SPEAKING SCALE FINAL RATING ________________

TESTER ________________  DATE ________________

REMARKS

I, the undersigned, hereby certify that I assigned this score individually and independently without any discussion or consultation with my co-tester.

SIGNED by: ________________

SIDE A


RATING FACTOR GRID

GLOBAL TASKS AND FUNCTIONS
0+  Can make statements and ask questions using memorized material.
1   Can create sentences; begin, maintain, and close short conversations by asking and answering simple questions; satisfy simple daily needs.
2   Can describe people, places, and things; narrate current, past, and future activities in full paragraphs; state facts; give instructions or directions; ask and answer questions in the work place; deal with non-routine daily situations.
3   Can converse extensively in formal and informal situations; discuss abstract topics; support opinions; hypothesize; deal with unfamiliar topics and situations; clarify points.
4   Can tailor language to fit the audience; counsel, persuade, represent an official point of view, negotiate, advocate a position at length, interpret informally.
5   Functionally equivalent to a highly articulate, well-educated speaker.

LEXICAL CONTROL
0+  Memorized words and phrases related to immediate survival needs.
1   Very limited. Covers courtesy expressions, introductions, identification, personal and accommodational needs, daily routine.
2   Sufficient to discuss high frequency concrete topics such as work, family, personal background and interests, travel, current events. Imprecise for less common topics.
3   Broad enough for effective formal and informal conversations on practical, social, and professional topics. Can convey abstract concepts.
4   Precise for representational purposes within personal and professional experience. Can elaborate concepts freely; choose appropriate words to convey nuances of meaning.
5   Breadth of vocabulary and idiom equivalent to that of a highly articulate, well-educated native speaker.

STRUCTURAL CONTROL
0+  No control. Can only use memorized structures.
1   Structural accuracy is random or severely limited. Almost every utterance has errors in basic structures. Time concepts are vague. Can formulate some questions.
2   Discourse is minimally cohesive. Grammatical structures are usually not very elaborate and not thoroughly controlled; errors are frequent. Simple structure and basic grammatical relations are typically controlled.
3   Effectively combines structure and vocabulary to convey meaning. Discourse is cohesive. Use of structural devices is flexible and elaborate. Errors occur in low-frequency and highly complex structures, but structural inaccuracy rarely causes misunderstanding.
4   Organizes discourse well, using appropriate rhetorical devices and high level discourse structures.
5   Functionally equivalent to a highly articulate, well-educated native speaker.

SOCIOLINGUISTIC COMPETENCE
0+  Severely limited. Any knowledge of cultural appropriateness has a non-linguistic source.
1   Uses greetings and courtesy expressions. Can interact with native speakers used to dealing with non-natives.
2   Satisfies routine social demands and limited work requirements. Can interact with native speakers not used to dealing with non-natives; native speakers may have to adjust to limitations.
3   Uses cultural references. When errors are made, can easily repair the conversation.
4   Uses and understands details and ramifications of target cultural references. Can set and shift the tone of exchanges with a variety of native speakers.
5   Speech reflects the cultural standards of the country where the language is natively spoken.

DELIVERY
0+  Even in memorized speech, stress, intonation, tone usually quite faulty.
1   Often speaks with great difficulty. Pronunciation, stress, intonation generally poor.
2   Speaks with confidence but not facility. Can usually be understood by those not used to dealing with non-natives.
3   Speaks readily and fills pauses suitably. Pronunciation may be obviously foreign. Flaws in stress, intonation, pitch rarely disturb the native speaker.
4   Speaks effortlessly and smoothly, but would seldom be perceived as a native speaker.
5   Functionally equivalent to a highly articulate, well-educated native speaker of a standard dialect.

TEXTS PRODUCED
0+  Individual words and phrases.
1   Discrete sentences.
2   Full paragraphs.
3   Extended discourse.
4   Speeches, lectures, debates, conference discussions.
5   All texts controlled by a highly articulate, well-educated native speaker.

SIDE B


THE 6 FACTORS OF THE RATING FACTOR GRID

1. Global Tasks and Functions

The terms "global task" and "function" refer to what speakers are able to do with the language. Although tasks are the hallmark of the system, testers must consider all five additional factors (lexical control, structural control, sociolinguistic competence, delivery, and text types produced) when determining whether the task has been performed at the appropriate level.

• The ILR Level Descriptions refer to functions or global tasks ranging from simple ones, such as listing and asking questions, through more complex tasks such as description and narration, to highly complex tasks such as supported opinion, hypothesis, and tailoring. The ability to carry out global tasks constitutes a crucial element of communicative ability.

• When rating linguistic performance in a test, it is tempting to point to specific grammatical or lexical deficiencies as the determining factors in justifying a rating. In most cases, however, failure to do the global tasks occurs because they were not carried out with the accuracy required at that level.

• At each proficiency level, there are specific tasks and functions which speakers must perform in a sustained fashion in order to be considered proficient at that level.

• Tasks are associated with content areas that are specified in the ILR Level Descriptions in a global way. Content areas become OPI-specific based on the examinee’s background, life experience and interests. For that reason, while the tasks at each level stay the same, the content areas will vary.

GLOBAL TASKS AND FUNCTIONS

LEVEL 5 Functions equivalent to a highly articulate, well-educated native speaker

LEVEL 4 Tailors language to fit audience, counsels, persuades, negotiates, represents a point of view

LEVEL 3 Discusses topics extensively, supports opinions and hypothesizes, deals with a linguistically unfamiliar situation

LEVEL 2 Narrates in major time frames, describes, reports facts, gives directions, deals effectively with an unanticipated complication

LEVEL 1 Creates with language, initiates, maintains simple conversations, asks and responds to simple questions, gets through a basic survival situation

LEVEL 0+ Communicates minimally with telegraphic and memorized utterances, lists and phrases


2. Lexical Control

Lexical control is the range of vocabulary and idiomatic phrases the examinee is able to use in the target language and the facility and appropriateness with which he or she uses them.

Higher level examinees should be competent in a professional vocabulary domain in addition to a broad general vocabulary. Lexical control also refers to the use of proverbs, sayings, and idioms found in the target language.

3. Structural Control

Structural control is the examinee’s accuracy and flexibility in using the language’s grammatical structures to produce well-formed and appropriate sentences. This also refers to the examinee’s ability to link sentences together appropriately to form cohesive discourse.

Among the elements included within this factor are: control of word order, grammatical markers such as those for tense, aspect or complementation in some languages, derivational and inflectional affixes, modification, topicalization, and coordinating and subordinating conjunctions. Structural control can be seen in the form and cohesion of sentences and connected discourse, and by the range of structures used by the examinee.

4. Sociolinguistic Competence

Sociolinguistic competence refers to the extent to which the examinee's use of the language is appropriate to the social and cultural context and reflects an understanding of cross-cultural communication. It includes control of language, selection of topics, and choice of words appropriate to the situations presented. Evidence of sociolinguistic competence occurs at all proficiency levels, but it becomes important at Level 3 and crucial at Level 4.

5. Delivery

Delivery is the examinee’s fluency and phonological accuracy in the language. Fluency refers to the ease of flow and naturalness of the examinee’s speech. Phonological accuracy refers to the examinee’s pronunciation of the individual sounds of the language, and to the patterns of intonation, including stress and pitch. Delivery is shown by the extent to which speech sounds native, is smooth flowing, and free of features that interfere with communication of meaning.

6. Text Types Produced

Text types produced refers to the type and length of discourse produced by the examinee. At lower levels discourse will be non-native and not cohesive. Even at Level 2, discourse, though minimally cohesive, will generally be organized in a non-native fashion. At Level 3 and above, discourse is cohesive, extensive and well organized and increasingly resembles native discourse. The text types that are produced by the examinee often indicate the range in which the examinee functions.

Definition of Proficiency Levels for the Language Skills

Bundessprachenamt *

(Skills: L / Listening, S / Speaking, R / Reading, W / Writing)

Level 1: Elementary competence within a limited and familiar general framework

L / Listening
Communicative frame: Common phrases, unambiguous contents and remarks related to everyday matters such as public transportation, shopping or the workplace. The situation is clear and is delimited by such external factors as time and place.
Linguistic ability: The Level 1 listener understands concrete utterances made in short, simple sentences and recognizes simple organizing signals, such as "first" and "finally". Interlocutor must speak slowly and repeat if necessary. Listener generally understands listening texts from the media and conversations between native speakers only if the contents are completely unambiguous.

S / Speaking
Communicative frame: Communication in typical everyday situations which are clearly delimited by external factors such as time and place, for example shopping, routine tasks at the workplace and using public transportation. This may involve asking questions and making statements.
Linguistic ability: The Level 1 speaker can convey his basic intention in short, simple utterances, although frequent errors in pronunciation, vocabulary and grammar may distort meaning. He seldom expresses himself in a natural, fluent way. Repetitions are common.

R / Reading
Communicative frame: Unambiguous texts which are directly related to everyday matters in the reader's private life or work, for example advertisements, signs, application forms or short notes and memos.
Linguistic ability: The Level 1 reader understands the basic meaning of simple texts ("global reading") and can find specific details through thorough reading or selective reading. Can frequently only comprehend texts with the help of a bilingual dictionary.

W / Writing
Communicative frame: Communication for simple general purposes, for example, writing lists, notes, short faxes and postcards and filling out forms or formulating simple requests.
Linguistic ability: The Level 1 writer can convey his basic intention in writing using short, simple sentences, although errors in spelling, vocabulary and grammar are frequent. He seldom expresses himself in a natural, fluent way.

Level 2: Limited competence within a general and professional framework

L / Listening
Communicative frame: Includes utterances made in a dialog or small group on familiar general or professional topics such as the environment, education or job procedures. Although the situation is clear, it is not necessarily delimited by such external factors as time and place.
Linguistic ability: The Level 2 listener understands utterances containing explicit and some implicit information, recognizes organizing signals for more complex trains of thought, for example, "although" and "instead of", but does not always recognize different stylistic levels. Occasionally asks interlocutor to repeat utterances. Media texts on unfamiliar topics and conversations among native speakers can usually only be comprehended globally.

S / Speaking
Communicative frame: Communication in everyday situations and situations at the workplace where what is meant is clear, though not necessarily delimited by external factors. Topics may include, for example, the environment, education or job procedures. In these situations the speaker may describe or explain something, report on something or express a personal opinion.
Linguistic ability: The Level 2 speaker can convey his basic intention using relatively simple sentences and avoiding difficult or unfamiliar grammar structures. Errors in pronunciation, vocabulary, and grammar may distort meaning. He generally expresses himself in a way that is appropriate to the situation, although his command of the spoken language is not always firm.

R / Reading
Communicative frame: Texts on familiar topics from the reader's own field, such as articles from newspapers and professional journals and job-related texts.
Linguistic ability: The Level 2 reader understands texts containing explicit and some implicit information. With the help of a bilingual dictionary he comprehends texts globally, selectively, and in detail, though he reads very slowly compared to a native speaker.

W / Writing
Communicative frame: Communication in familiar general or professional areas such as formulating private letters or office correspondence, brief reports and memos.
Linguistic ability: The Level 2 writer can convey basic intention using relatively simple sentences and avoiding difficult or unfamiliar grammar structures. Errors in spelling, vocabulary and grammar may distort meaning. He generally expresses himself in a way that is appropriate to the occasion, although his command of the written language is not always firm.

* Translation of definitions as approved in August, 1998


L a n g u a g e S k i l l

S / S p e a k i n g Level R e q u i r em e nt L / L i st e ni n g R / R e a d i n g W / W r i ti n g

3 Competence within a general social and professional-specialist range including not entirely familiar subject areas

Communicative frame

Includes utterances made in larger groups or in lectures on general and professional-specialist topics which may not be entirely familiar from such areas as economics, culture, science and technology as well as from the listener’s own field.

Linguistic ability

The Level 3 listener: understands utterances containing explicit and implicit information, can generally distinguish between different levels of style and often recognizes humor and irony. Rarely needs to ask an interlocutor to repeat utterances and understands utterances made in the media or in conversations between native speakers, both globally and – for the most part – in detail. Regionalisms and dialects are not always comprehended, however.

Communicative frame

Communication even in somewhat unfamiliar general social or professionalspecialist situations such as lectures, negotiations, presentations and briefings.

Topics may come from such areas as economics, culture, science and technology, as well as from the speaker’s own field. In these situations the speaker may describe, argue the case or give reasons for something, or explain something in a systematic way.

Linguistic ability

The Level 3 speaker: conveys meaning correctly and effectively in sentences which are generally well-structured.

Rarely makes errors in vocabulary, grammar or pronunciation which are serious enough to distort meaning. He expresses himself fluently and in a way that is appropriate to the situation.

Communicative frame

Texts on not entirely familiar general and professional-specialist topics taken, for example, from newspapers, magazines, and personal or job-related papers.

Topics may come from such areas as economics, culture, science and technology as well as from the reader’s own field.

Linguistic ability

The Level 3 reader: understands texts containing explicit and implicit information, can generally distinguish between different levels of style and often recognizes humor and irony. Understands globally, selectively and in detail, occasionally requiring a dictionary. Still does not read as fast as a comparable native speaker.

Communicative frame

Communication even in somewhat unfamiliar general social or professional-specialist areas such as formulating private letters or job-related texts, reports, position papers and the final draft of other papers. Topics may come from such areas as economics, culture, science and technology as well as from the writer’s own field.

Linguistic ability

The Level 3 writer: conveys meaning correctly and effectively in sentences which are generally well-structured. Occasionally makes errors in spelling, vocabulary and grammar. Expresses himself fluently and in a way that is appropriate to the situation.

Level 4: Firm competence in a general social and professional-specialist range, including unfamiliar subject areas

Communicative frame

Utterances of all kinds – including those made in larger groups, in lectures and during negotiations – even on unfamiliar general or professional-specialist topics from such areas as economics, culture, science and technology as well as from the listener’s own field.

Linguistic ability

The Level 4 listener: understands utterances from a wide spectrum of complex language and recognizes nuances of meaning and stylistic levels as well as irony and humor. Understands utterances made in the media and in conversations among native speakers both globally and in detail and generally comprehends regionalisms and dialects.

Communicative frame

Communication even in unfamiliar general or professional-specialist situations such as lectures, negotiations, presentations and briefings. Topics may come from such areas as economics, culture, science and technology, as well as from the speaker’s own field. In these situations the speaker may describe, argue the case or give reasons for something, or explain something in a systematic way.

Linguistic ability

The Level 4 speaker: conveys meaning correctly, effectively and naturally in well-structured, stylistically appropriate sentences. With his firm grasp of various levels of style he can also express shades of meaning.

Communicative frame

Texts involving complex trains of thought, including texts from unfamiliar general and professional-specialist areas. These texts may be taken from newspapers, magazines and the professional literature written for the educated reader, and may contain topics from such areas as economics, culture, science and technology, as well as from the reader’s own field.

Linguistic ability

The Level 4 reader: understands texts globally, selectively and in detail, with a firm grasp of stylistic nuances and of irony and humor.

Rarely needs a dictionary and reads almost as fast as a native speaker.

Communicative frame

Communication even in unfamiliar general or professional-specialist areas, such as formulating private letters or job-related texts, reports, position papers and the final draft of other papers. Topics may come from such areas as economics, culture, science and technology as well as from the writer’s own field.

Linguistic ability

The Level 4 writer: conveys meaning correctly, effectively and naturally in well-structured sentences. With his firm grasp of various levels of style he can also express shades of meaning.

In the “Standardized Language Profile” (SLP) the four skills Listening, Speaking, Reading and Writing are given in this order as a four-digit number, with the position of the number indicating which skill is meant, and the number itself representing the skill level. An SLP of 3321, for example, means Level 3 in Listening, Level 3 in Speaking, Level 2 in Reading and Level 1 in Writing.
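
As an illustration of this convention, the short sketch below decodes an SLP string into its four skill levels. It is a minimal example written in Python for this handbook; the function name and the error check are illustrative choices, not part of STANAG 6001.

    # Minimal sketch: decode a Standardized Language Profile (SLP) string.
    # The SLP lists the skills in the fixed order Listening, Speaking,
    # Reading, Writing, one digit per skill.

    SKILLS = ("Listening", "Speaking", "Reading", "Writing")

    def decode_slp(slp: str) -> dict:
        """Return a mapping of skill name to proficiency level."""
        if len(slp) != 4 or not slp.isdigit():
            raise ValueError("An SLP must be exactly four digits, e.g. '3321'.")
        return {skill: int(level) for skill, level in zip(SKILLS, slp)}

    # The example from the text: SLP 3321
    # -> {'Listening': 3, 'Speaking': 3, 'Reading': 2, 'Writing': 1}
    print(decode_slp("3321"))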

BILC Language Testing Seminar VI-34

CHAPTER VII

EVALUATION OF

WRITING PROFICIENCY

Chapter VII Evaluation of Writing Proficiency

INTERPRETATION OF THE LANGUAGE PROFICIENCY LEVELS

Appendix 1 to Annex A to STANAG 6001 (Edition 2)

WRITING

LEVEL 0 (NO PROFICIENCY)

No functional writing ability.

LEVEL 1 (ELEMENTARY)

Can write to meet immediate personal needs. Examples include lists, short notes, postcards, short personal letters, phone messages, and invitations, as well as filling out forms and applications.

Writing tends to be a loose collection of sentences (or fragments) on a given topic, with little evidence of conscious organization. Can convey basic intention by writing short, simple sentences, often joined by common linking words. However, errors in spelling, vocabulary, grammar, and punctuation are frequent. Can be understood by native readers used to non-natives’ attempts to write.

LEVEL 2 (LIMITED WORKING)

Can write simple personal and routine workplace correspondence and related documents, such as memoranda, brief reports, and private letters, on everyday topics. Can state facts; give instructions; describe people, places, and things; can narrate current, past, and future activities in complete, but simple paragraphs. Can combine and link sentences into connected prose; paragraphs contrast with and connect to other paragraphs in reports and correspondence. Ideas may be roughly organized according to major points or straightforward sequencing of events. However, relationship of ideas may not always be clear, and transitions may be awkward. Prose can be understood by a native not used to reading material written by non-natives. Simple, high frequency grammatical structures are typically controlled, while more complex structures are used inaccurately or avoided. Vocabulary use is appropriate for high frequency topics, with some circumlocutions. Errors in grammar, vocabulary, spelling, and punctuation may sometimes distort meaning. However, the individual writes in a way that is generally appropriate for the occasion, although command of the written language is not always firm.

BILC Language Testing Seminar VII-1

Chapter VII Evaluation of Writing Proficiency

LEVEL 3 (MINIMUM PROFESSIONAL)

Can write effective formal and informal correspondence and documents on practical, social, and professional topics. Can write about special fields of competence with considerable ease.

Can use the written language for essay-length argumentation, analysis, hypothesis, and extensive explanation, narration, and description. Can convey abstract concepts when writing about complex topics (which may include economics, culture, science, technology) as well as his/her professional field. Although techniques used to organize extended texts may seem somewhat foreign to native readers, the correct meaning is conveyed. The relationship and development of ideas are clear, and major points are coherently ordered to fit the purpose of the text. Transitions are usually successful. Control of structure, vocabulary, spelling, and punctuation is adequate to convey the message accurately. Errors are occasional, do not interfere with comprehension, and rarely disturb the native reader. While writing style may be non-native, it is appropriate for the occasion. When it is necessary for a document to meet full native expectations, some editing will be required.

LEVEL 4 (FULL PROFESSIONAL)

Can write the language precisely and accurately for all professional purposes including the representation of an official policy or point of view. Can prepare highly effective written communication in a variety of prose styles, even in unfamiliar general or professional-specialist areas. Demonstrates strong competence in formulating private letters, job-related texts, reports, position papers, and the final draft of a variety of other papers. Shows the ability to use the written language to persuade others and to elaborate on abstract concepts.

Topics may come from such areas as economics, culture, science, and technology as well as from the writer’s own professional field. Organizes extended texts well, conveys meaning effectively, and uses stylistically appropriate prose. Shows a firm grasp of various levels of style and can express nuances and shades of meaning.

LEVEL 5 (NATIVE/BILINGUAL)

Writing proficiency is functionally equivalent to that of a well-educated native writer. Uses the organizational principles and stylistic devices that reflect the cultural norms of natives when writing formal and informal correspondence, official documents, articles for publication, and material related to a professional specialty. Writing is clear and informative.

BILC Language Testing Seminar VII-2

Chapter VII Evaluation of Writing Proficiency

SAMPLE WRITING TASKS

LEVEL 1

Write a note to your colleagues inviting them to a surprise birthday party for your language teacher. Tell them about the event, including the time and place. Let them know what you want them to bring.

LEVEL 2

1. Assume that you have just returned from a trip and are writing a letter to a close friend. Describe a particularly memorable experience that occurred while you were travelling. This will be one paragraph in a longer letter to your friend. This paragraph should be about 100 words in length.

You will be judged on the style and organization of this paragraph as well as vocabulary and grammar. Remember, the intended reader is a close friend.

2. As a military officer, you were recently reassigned to another command at a different location. To welcome you, one of your new colleagues invited you to dinner Saturday evening at his home. Your colleague’s name is Carl Grant, and he is an Army captain. His wife’s name is Linda, and they have two children – a daughter, Laura, and a son, Robert.

You accepted the invitation and met Captain Grant’s family on Saturday. Write a thank you letter in English to Carl and Linda, expressing your appreciation for the dinner.

This letter should not be longer than 150 words.

You will be judged on the style and organization of this letter as well as vocabulary and grammar.

BILC Language Testing Seminar VII-3

Chapter VII Evaluation of Writing Proficiency

LEVEL 3

A professional organization to which you belong has requested that you write a paper for their quarterly newsletter.

Select one of the topics listed below, and write a paper of approximately 550 words.

You will be judged on the style, organization, logical development, and complexity of your paper as well as the richness and precision of vocabulary, accuracy of grammar and spelling, and the suitability for the intended audience.

TOPICS:

• Teachers’ resistance to change.

• The influence of television on language skills.

• Quality versus equality in higher education.

• The move toward neutralizing gender in American English.

LEVEL 4

It has been said that resting beneath each culture is a complete mythology--the material with which society indoctrinates itself. This involves not just legends and concepts of proper behavior and ideas, but rhetorical response (e.g., propaganda and advertising), social assumptions, and casual conversation.

Apply this point of view to a contrast between two specific countries (for example, the United States and the Czech Republic). Explain the underlying mythology of each culture and analyze the contrasts. Conclude with a discussion of the difficulties that could be predicted for a person moving from one of these cultures to the other.

Write an essay of approximately 1000 words on this topic.

You will be evaluated on the clarity of your presentation, the organization of ideas, and the appropriateness of style, vocabulary, and grammatical features to the subject matter. Your intended audience will be an educated reader of English who is interested in cultural contrasts.

BILC Language Testing Seminar VII-4

Chapter VII Evaluation of Writing Proficiency

STEPS TO FOLLOW IN PREPARING WRITING PROMPTS

1. Each group will receive the same instructions for developing writing prompts.

2. Each group will develop two prompts for Level 2 and two prompts for Level 3.

3. One of the Level 2 prompts should focus on work-related writing tasks and the other should focus on a personal writing task. One of the Level 2 prompts should require the examinee to produce written narration.

4. One of the Level 3 prompts should require a formal style of writing and the other should require an informal style of writing. One of the Level 3 prompts should require the examinee to produce an analysis or hypothesis on an abstract topic.

5. All test instructions should explain who the intended reader will be.

6. All test instructions should make it clear to the examinee how the writing sample will be evaluated. The group will want to refer to the accuracy statement in the level description or trisection when planning the evaluation.

7. Follow these steps:

   a. Plan the prompt development activity.
   b. Review the Content/Task/Accuracy statement for the level before preparing to write the prompt.
   c. Write each prompt and get agreement on wording within the group.
   d. Ask a facilitator or a member of another group to review each prompt.
   e. Revise, if necessary.

BILC Language Testing Seminar VII-5

Chapter VII Evaluation of Writing Proficiency

HUGHES’S MODEL

FOR A TABLE OF SPECIFICATIONS

FOR A WRITING PROFICIENCY TEST

• OPERATIONS

• TYPES OF TEXTS

• ADDRESSEES OF TEXTS

• TOPICS

• DIALECT AND STYLE

• LENGTH OF TEXTS
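
As a sketch of how a completed table of specifications might look, the fragment below fills in Hughes’s six categories for a hypothetical STANAG 6001 Level 2 writing prompt. The category names follow the model above; the example values are drawn loosely from the Level 2 descriptors and sample tasks in this chapter and are illustrative assumptions, not prescribed content.

    # Illustrative sketch only: one possible table of specifications for a
    # hypothetical Level 2 writing prompt, using Hughes's six categories.

    level_2_writing_spec = {
        "operations": [
            "describe people, places and things",
            "narrate current, past and future activities",
            "state facts and give instructions",
        ],
        "types_of_texts": ["personal letter", "memorandum", "brief report"],
        "addressees_of_texts": ["a close friend", "a workplace colleague"],
        "topics": ["everyday and routine workplace topics"],
        "dialect_and_style": "standard language; informal to neutral register",
        "length_of_texts": "about 100-150 words",
    }

    for category, value in level_2_writing_spec.items():
        print(f"{category}: {value}")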

BILC Language Testing Seminar VII-6

Chapter VII Evaluation of Writing Proficiency

THE RATING OF WRITING ABILITY

In Assessing Language Ability in the Classroom, Andrew Cohen provides a review of the four basic types of essay rating: (1) holistic, (2) analytic, (3) primary trait, and (4) multi-trait. The following is a summary of the topic, which is discussed fully in Cohen’s book, pages 314-323.

HOLISTIC SCORING provides a single, integrated score for the essay or writing performance.

Advantages

• Minor aspects of the examinee’s performance are not emphasized

• Strengths are emphasized more than weaknesses

• The approach permits a system that gives greater weight to specific criteria

Disadvantages

• The score will not provide diagnostic information

• It is difficult to train raters

• The approach minimizes and thus disguises uneven abilities

• There may be a tendency for longer essays to get higher ratings

• Reducing an essay score to a single number reduces reliability

ANALYTIC SCORING employs several separate scales. These may include content, organization, vocabulary, grammar, and mechanics.

Advantages

• Raters will give attention to each factor and not minimize any of them

• It is easier to train raters with an analytic scale

Disadvantages

• In practice, raters may not evaluate the factors independently but allow rating of one factor to influence others

• The holistic approach acknowledges that writing is more than the sum of its parts; however, the analytic approach does not take this into account

• There may be a tendency to give higher scores to essays that fit the preconceptions of the scale

• Raters may find some of the qualitative judgements difficult to make, even with training

BILC Language Testing Seminar VII-7

Chapter VII Evaluation of Writing Proficiency

PRIMARY TRAIT SCORING requires the test developer to determine the criteria for successful writing on a particular topic. Each topic will have different criteria.

Advantages

• This approach sharpens and narrows the criteria used for a holistic rating

• The rater and the examinee can focus on one issue at a time. This could be very helpful for diagnostic purposes

• This approach acknowledges that it is difficult to focus on many factors at one time

Disadvantages

• Examinees may have difficulty focusing on one rating factor while writing

• This approach may not be sufficiently integrative

• Some aspects of writing are not significant enough to serve as a single criterion

• An essay that is very well written but not fully responsive to the criteria could receive a lower rating than a weak essay that responds to every point in the criteria

MULTI-TRAIT SCORING allows rating more than one factor (usually 3 or 4) but in less detail than analytic scoring. Multi-trait scoring, like primary trait scoring, takes greater account of context and topic.

Advantages

• Factors considered can be very specific to the requirements of the assignment

• This approach permits flexibility when the examinee responds in an unexpected way

• Validity is stronger because the expectations are clearer

• Ratings may provide diagnostic information

Disadvantages

• It is not always easy to identify and validate context-specific criteria

• Raters may rely on more traditional rating systems despite the stated criteria

BILC Language Testing Seminar VII-8

Chapter VII Evaluation of Writing Proficiency

EXAMPLES

(adapted from Cohen)

HOLISTIC SCORING SCALE

5 The main idea is stated very clearly, and there is a clear statement of the argument. The essay is well organized and coherent. Vocabulary choice is excellent. No grammatical errors. Spelling and punctuation are good.

4 The main idea is fairly clear, and the argument is stated. The essay is moderately well organized and is relatively coherent. Vocabulary is good. There are only minor grammatical errors. There are a few spelling and punctuation errors.

3 The main idea and the argument are indicated but not clearly. The essay is not very well organized, and it is somewhat lacking in coherence. Vocabulary is fair. Some major and minor grammatical errors. There are a large number of spelling and punctuation errors.

2 The main idea and argument are hard to identify in the essay. The essay is poorly organized and relatively incoherent. Use of vocabulary is weak. Grammatical errors are frequent. Spelling and punctuation errors are frequent.

1 The main idea and argument are not expressed. The essay is very poorly organized and generally incoherent. Vocabulary is very weak. Grammatical errors are very frequent. Spelling and punctuation errors are very frequent.

ANALYTIC SCORING SCALE

CONTENT

5 Excellent. Main ideas stated clearly and accurately. Argument very clear.

4 Good. Main ideas stated fairly clearly and accurately. Argument relatively clear.

3 Average. Main ideas somewhat unclear or inaccurate. Argument somewhat weak.

2 Poor. Main ideas not clear or accurate. Argument weak.

1 Very poor. Main ideas not at all clear or accurate. Argument very weak.

BILC Language Testing Seminar VII-9

Chapter VII Evaluation of Writing Proficiency

ORGANIZATION

5 Excellent. Well organized and perfectly coherent.

4 Good. Fairly well organized and generally coherent.

3 Average. Loosely organized but main ideas clear, logical but incomplete sequencing.

2 Poor. Ideas disconnected, lacks logical sequencing.

1 Very poor. No organization, incoherent.

VOCABULARY

5 Excellent. Very effective choice of words and use of idioms and word forms.

4 Good. Effective choice of words and use of idioms and word forms.

3 Average. Adequate choice of words but some misuse of vocabulary, idioms, and word forms.

2 Poor. Limited range, confused use of words, idioms, and word forms.

1 Very poor. Very limited range, very poor knowledge of words, idioms, and word forms.

GRAMMAR

5 Excellent. No errors, full control of complex structure.

4 Good. Almost no errors, good control of structure.

3 Average. Some errors, fair control of structure.

2 Poor. Many errors, poor control of structure.

1 Very poor. Dominated by errors, no control of structure.

MECHANICS

5 Excellent. Mastery of spelling and punctuation.

4 Good. Few errors in spelling and punctuation.

3 Average. Several spelling and punctuation errors.

2 Poor. Frequent errors in spelling and punctuation.

1 Very poor. No control over spelling and punctuation.
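
To show how ratings on the five analytic factors above might be combined into a score report, the sketch below sums the factor ratings and flags any factor rated below a chosen threshold. Cohen’s scales do not prescribe a weighting, a total, or a threshold; the equal weighting and the cut-off of 3 used here are illustrative assumptions.

    # Illustrative sketch: combining analytic ratings (each 1-5) on the five
    # factors above. Equal weighting and the diagnostic threshold are
    # assumptions, not part of the scale itself.

    FACTORS = ("content", "organization", "vocabulary", "grammar", "mechanics")

    def analytic_report(ratings: dict, threshold: int = 3) -> dict:
        """Return the total score (5-25) and the factors rated below threshold."""
        missing = [f for f in FACTORS if f not in ratings]
        if missing:
            raise ValueError(f"Missing ratings for: {missing}")
        total = sum(ratings[f] for f in FACTORS)
        weak = [f for f in FACTORS if ratings[f] < threshold]
        return {"total": total, "needs_work": weak}

    # Example: a fairly strong essay with weak mechanics.
    print(analytic_report({"content": 4, "organization": 4,
                           "vocabulary": 3, "grammar": 4, "mechanics": 2}))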

BILC Language Testing Seminar VII-10

Chapter VII Evaluation of Writing Proficiency

PRIMARY TRAIT SCORING SCALE

(Note: The single criterion was ability to discuss a change of opinion)

5 Makes it very clear what the former position was, what the current position is, and why this occurred.

4 Makes it generally clear what the former position was, what the current position is, and why this occurred.

3 Makes it fairly clear what the former position was, what the current position is, and why this occurred.

2 Does not make it clear what the former position was, does not state a current position explicitly, and there is no clear indication of a change of opinion.

1 This is a fragmented response in which it is difficult to determine any position.

MULTI-TRAIT SCORING SCALE

MAIN IDEA/OPINION

5 The main idea is stated very clearly, and there is a clear statement of the argument.

4 The main idea is fairly clear, and the argument is evident.

3 The main idea and the argument are indicated, but not clearly.

2 The main idea and argument are hard to identify or are lacking.

1 The main idea and argument are lacking.

RHETORICAL FEATURES

5 A well-balanced and unified essay, with excellent use of transitions.

4 Moderately well-balanced and unified essay, relatively good use of transitions.

3 Not so well-balanced or unified essay, somewhat inadequate use of transitions.

2 Lack of balance and unity in essay, poor use of transitions.

1 Total lack of balance and unity in essay, very poor use of transitions.

BILC Language Testing Seminar VII-11

Chapter VII Evaluation of Writing Proficiency

LANGUAGE CONTROL

5 Excellent language control; grammatical structures and vocabulary well chosen.

4 Good language control and reads relatively well; structures and vocabulary generally well chosen.

3 Acceptable language control but lacks fluidity; structures and vocabulary express ideas but are limited.

2 Rather weak language control; readers aware of limited choice of language structures and vocabulary.

1 Little language control; readers are seriously distracted by language errors and restricted choice of forms.

BILC Language Testing Seminar VII-12

Chapter VII Evaluation of Writing Proficiency

WRITING RATING FACTOR GRID

(threshold definitions – minimum requirements for a level)

Rating factors: GLOBAL TASKS AND FUNCTIONS; LEXICAL CONTROL; STRUCTURAL CONTROL; SOCIOLINGUISTIC COMPETENCE/STYLE; ORTHOGRAPHY (spelling, capitalization, punctuation); TEXTS PRODUCED

LEVEL 1

• Global tasks and functions: Limited practical needs: simple phone messages, excuses, notes to service people and simple notes to friends, making statements and asking questions.

• Lexical control: Very familiar topics; e.g. simple biographical and personal data. Continual errors.

• Structural control: Can create sentences although almost every sentence has errors in basic structure. Vague time concepts.

• Sociolinguistic competence/style: Can be understood by a native reader used to dealing with foreigners attempting to write the language. Native reader must employ real world knowledge to understand even a simple message.

• Orthography: Continual errors in spelling, capitalization and punctuation.

• Texts produced: Can generate simple sentences. Attempts to create paragraphs result in a loose connection of sentences or fragments with no conscious organization.

LEVEL 2

• Global tasks and functions: Routine social correspondence, documentary materials for most limited work requirements; writes simply about current events and daily situations.

• Lexical control: Sufficient to simply express oneself with some circumlocutions; limited number of current events and daily situations; concrete topics, personal biographical data.

• Structural control: Good control of morphology and most frequently used syntax. Elementary constructions are typically handled quite accurately, though errors may be frequent. Uses a limited number of cohesive devices.

• Sociolinguistic competence/style: Writing is understandable to a native reader not used to dealing with foreigners. Satisfies routine social demands and limited work requirements. Native reader may have to adjust to non-native style.

• Orthography: Makes common errors in spelling, capitalization and punctuation, but shows some control of most common formats and punctuation.

• Texts produced: Minimally cohesive, full paragraphs.

LEVEL 3

• Global tasks and functions: Able to use the language effectively in most formal and informal written exchanges for professional duties. Can write reports, summaries, research papers on particular areas of interest in order to answer objections, clarify points, justify decisions, and state and defend policy.

• Lexical control: Broad enough for effective formal and informal written exchanges on practical, social, and professional topics. Can express abstract ideas.

• Structural control: Consistent control of compound and complex sentences. Control of grammar good with only sporadic errors in basic structures; occasional errors in the most complex structures. Errors virtually never interfere with comprehension. Relationship of ideas is consistently clear.

• Sociolinguistic competence/style: Style may be obviously foreign, although writer is able to effectively combine vocabulary and structure to convey meaning accurately and naturally to a native reader. Although not native in style, it rarely disturbs a native reader.

• Orthography: Spelling, capitalization and punctuation generally controlled.

• Texts produced: Writing is cohesive (relationship of ideas is consistently clear – consistent control of compound and complex sentences).

LEVEL 4

• Global tasks and functions: Able to write precisely and accurately in a variety of prose styles pertinent to professional/educational needs. Consistently able to tailor to suit an audience.

• Lexical control: Precise for professional and educational needs and social issues of a general nature. Able to express subtleties and nuances. Writing adequate to express all his/her experiences.

• Structural control: Employs full range of structures. Errors in grammar are rare, including those in low frequency structures.

• Expository prose is clearly, consistently and explicitly organized. A variety of organizational patterns is used.

• Errors are rare. Uses a wide variety of cohesive devices such as ellipsis and parallelism and subordinates in a variety of ways.

LEVEL 5

• Global tasks and functions: Can edit both formal and informal correspondence/official reports and documents; professional/educational articles including writing for special purposes (legal, technical, literary).

• No non-native errors. The writing and ideas are imaginative.

• No non-native errors. Writing is clear, explicit and informative. The writer uses a very wide range of stylistic devices.

• No non-native errors. Writing proficiency equal to that of a well educated native.

BILC Language Testing Seminar VII-13

Chapter VII Evaluation of Writing Proficiency

COMPOSITION ASSESSMENT CHECKLIST

Essay Topic:

Exam Level:

Level 2

• describes people, places and things
• narrates past, present and future activities
• uses paragraphs
• uses simple transitions correctly
• high-frequency grammatical structures controlled
• prose is difficult to follow
• vocabulary errors sometimes distort meaning
• frequent errors in spelling, style and writing conventions

Level 3

• discusses abstract topics
• formulates hypotheses
• analyses and provides interpretative comment
• control of grammatical structures is adequate to convey meaning accurately
• vocabulary is appropriate to convey meaning
• complex structures and vocabulary not used consistently
• discourse is cohesive

Level 4

• uses complex structures accurately
• elaborates on abstract concepts
• organizes extended texts well
• uses stylistically appropriate prose
• expresses nuances and shades of meaning

Comments

BILC Language Testing Seminar VII-14

Chapter VII Evaluation of Writing Proficiency

WRITING SAMPLES

BILC Language Testing Seminar VII-15

Chapter VII Evaluation of Writing Proficiency

Write a short essay on the topic of City Living, covering the following points:

• What are the advantages and disadvantages of living in a city?

• Cities should not exceed a population of 3 million people. Do you agree?

• Should governments try to develop country areas as well as cities?

BILC Language Testing Seminar VII-16


Chapter VII Evaluation of Writing Proficiency

Explain how your education and experience have prepared you to teach languages at the Defense Language Institute.

BILC Language Testing Seminar VII-18

Chapter VII Evaluation of Writing Proficiency

Write an article for a local English-language newspaper about how every individual in Germany can contribute to improving the environment. Please include a discussion of whether individual efforts have any impact on global environmental problems.

Mühlenteich 14

Centre

31608 Marlohe

Dear Mr. Alpha,

With this letter I would answer your request for a short article which deals about the German way of recycling. I would try to give a short description about refuse disposal and waste utilization.

But first of all, I would try to explain the situation in general.

If this world didn't learn to cope with environmental problems, we would look forward into a very bad future. And secondly, if this process of pollution of the environment moves on as fast as today, we are faced with an ecological disaster.

Everyone is able to take part in the fight against this environmental problem. One possible way to reduce the ecological damage is to seperate all the different parts of rubbish.

In Germany we have to seperate the rubbish into four different groups. The first group contains all the different kinds of paper. The second group contains all the different kinds of metal. Glass belongs to the third group and plastic belongs to the fourth group. In spite of seperating the rubbish into the four groups there is a remaining stock of rubbish. First group is the so called "special rubbish", which contain poison, paint and so on. The second part is the so called compost.

But, the main point is, if we seperate the rubbish into the 4 groups like I mentioned before, the industry will be able to use this rubbish in their production. For example: Newspapers, bottles or a lot of packaging martireal is nowadays produced out of recycling martireal.

There are a lot of advantages and disadvantages to seperate the rubbish in your own house. The disadvantages are simple and clear. First of all, you need a lot of different baskets to seperate the rubbish. Secondly you need a lot of time to seperate the rubbish and thirdly, it is a "very nice experiance" to work with all your rubbish.

But the advantages are also simple and clear. We are able to save a lot of money. We are able to decrease the pollution of the environment and we are able to increase our environmental awareness.

And this point of view, in my opinion, is the best argument. We have to ensure, that the ecological damage would be as low as possible, because our children want to live on this earth.

I hope I have been able to show my point of view to this environmental problem. I would be very gratefull if you would write or [four illegible words, could be: telephone me your opinion].

Best wishes

BILC Language Testing Seminar VII-19

Chapter VII Evaluation of Writing Proficiency

Write a short essay on the topic of Environment, covering the following points:

• Is there a green movement in your country? If so, what has it achieved?

• Can technology be used to save the environment?

• What can we all do as citizens to save the environment?

BILC Language Testing Seminar VII-20


Chapter VII Evaluation of Writing Proficiency

Explain how your education and experience have prepared you to teach languages at the Defense Language Institute.

BILC Language Testing Seminar VII-22


Chapter VII Evaluation of Writing Proficiency

Write a short essay on the topic of Healthy Eating, covering the following points:

• How important is a healthy diet? What should it consist of?

• Is the Mediterranean diet really the best? What are its benefits and disadvantages?

• Can genetically modified food be used to solve the problem of world hunger?

BILC Language Testing Seminar VII-25


Chapter VII Evaluation of Writing Proficiency

Please take issue with the following statement that a high-ranking Bundeswehr officer has made: “It would be detrimental to the further political development of Europe if defence efforts of the EU countries remain at the national level.”

Recently I read an article in your newspaper about the political situation in Europe. In this article was a statement of a German officer. He claimed that the defence efforts should not be given back to the national level.

According to the fact that I am an officer of the German Army Aviation, I would like to write you my opinion about this issue. Therefore I will take a closer look at this discussion.

One of the strongest arguments for the preservation of the defence efforts on European level is based on the financial aspects. It would be much to expensive, if the costs for new equipment were in the hand of one nation.

The costs for development of a new military product have to be devided between different countries. Every country must be responsible for a special part of the new military product. This is especially important for the German Army Aviation. The new helicopter Tiger is a good example. It is a co-production of the German and French government. The costs for the develpment of the Tiger are for one country too much. In my opinion this co-production is a good example for he preservation of the defence efforts on an European level.

Evidence to support this can be found in the wars of Bosnia and Kosovo.

These wars are typical for the future. In future the wars will be civil wars. Only NATO have been able to cope with this conflicts. I am convinced that the community of the states is very important to cope with such conflicts. If the states want to work together, they have to install common staffs. Therefore it is very important that the states also work together in peace-time. From my point of view the wars in Bosnia and Kosovo are a further example for the preservation of the defence efforts on European level.

Against this could be argued that the different countries have their own military problems. It is not possible to compare the states. Some of them have more money to support military efforts than others. The compa [not legible, could be: comparison] between England and Spain is a good example. The military efforts of England are much higher as from other countries. It will be not fair to give special countries a higher contribution of the defence tasks. In my opinion we have to find a fair solution concerning this financial aspect.

In addition to this, the countries have to look for their own defence. We have to consider the geography of a country. Germany is a typical example for land forces. On the other hand England is an example for sea forces.

However, we can use these different types of forces for a common defence.

The English forces are more responsible for the defence of the sea and the German forces have to cope with the conflicts in Middle Europe. I think that is a further argument for the preservation of the defence efforts on European level.

In my opinion it will be possible to find a satisfactory answer to this issue. It seems to me that the preservation of the defence efforts on European level is very important. The different and difficult tasks of a military action have to be divided between different nations. The wars in Bosnia and Kosovo have shown us the success of such a co-production.

Therefore I strongly agree with the statement of the German officer.

BILC Language Testing Seminar VII-29


Chapter VII Evaluation of Writing Proficiency

Write a short essay on the topic of Third World Countries, covering the following points:

• Should we send money or professionals to help countries in need?

• What are the main causes for the problems in these countries?

• What can be done to solve the problem?

BILC Language Testing Seminar VII-32


Chapter VII Evaluation of Writing Proficiency

Explain how your education and experience have prepared you to teach languages at the Defense Language Institute.

BILC Language Testing Seminar VII-34

