Vol. 6. No. 2 R-8 September 2002

A Focus on Language Test Development: Expanding the Language Proficiency Construct Across a Variety of Tests (Technical Report #21)
Thom Hudson & J. D. Brown (Eds.) (2001)
Honolulu: University of Hawai'i, Second Language Teaching & Curriculum Center.
Pp. xi + 230
ISBN 0-8248-2351-6 (paper)
US $25.00

Eight research studies highlighting alternative approaches to language assessment are offered in this text. Systematically illustrating the entire test development process, these studies underscore some of the difficulties in designing and implementing tests. One strength of the book is that it offers a close-up view of how some non-standard tests were developed. Another is that it provides enough detail for careful reflection on the design problems inherent in each study. If you are looking for perfect model studies which all yield significant and meaningful results, this is not the book for you. However, if you are interested in how innovative studies of alternative forms of language assessment are designed and carried out, this book has much to offer.

1. Evaluating Nonverbal Behavior

After mentioning some of the ways nonverbal behavior affects communication, Nicholas Jungheim described a study assessing the correlation between oral proficiency in English and nonverbal ability. Nonverbal ability, in the context of this study, was operationalized as the extent to which the head movements, hand gestures, and gaze shifts of Japanese EFL students differed from those of a population of native-speaker EFL teachers. Linguistic proficiency was measured by the TOEFL, and oral English ability by the ETS SPEAK test. The results suggest that nonverbal ability is a separate construct from language proficiency as measured by TOEFL and SPEAK: the correlations between the 3-point nonverbal ability scale used in this study and the language test scores were too low to be meaningful.

2. Pragmatic Picture Response Tests

Highlighting the problems with traditional discourse completion tasks, Sayoko Yamashita next described how a picture response test was developed to explore ways that apology strategies differ among (1) native English speakers speaking English (NSE), (2) native Japanese speakers speaking Japanese (NSJ), (3) Americans speaking Japanese (JSL), and (4) Japanese speaking English (EFL). Yamashita found that expressions of dismay were used less frequently by NSEs than by the other groups. The NSE group also made offers of repair when apologizing more frequently than the other groups did. Since perceptions of politeness depend partly on which apology strategies are employed, these findings were meaningful. Yamashita concluded by mentioning ways picture response tests could be used to assess speaking, writing, listening, or cross-cultural proficiency skills.

3. Assessing Cross-cultural Pragmatic Performance

After noting some of the limitations of self-assessment tests in general, Thom Hudson detailed how a three-phase pragmatic performance assessment was developed. In the first phase, 25 Japanese ESL students rated how they thought they would respond to four roleplays, using a 5-point appropriacy scale. In the second phase, three native speakers rated the students' actual performances on the same scale. In the final phase, the students viewed their performances on video and rated themselves again. Surprisingly, the three rating methods showed little variance. Hudson suggested that the test population may have been too advanced and questioned whether self-assessment is the most appropriate way to evaluate pragmatic proficiency. The need for more research on both holistic and analytic self-assessment scales was emphasized at the end of the article.
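Hudson's finding that the three rating methods showed little variance invites a quick illustration of how such convergence might be checked. The Python sketch below is not taken from the study; the ratings, variable names, and choice of statistics (means and pairwise Pearson correlations) are invented assumptions, offered only to make the comparison concrete.

```python
# Illustrative sketch only: hypothetical 5-point appropriacy ratings for ten
# students under three rating methods. Similar means and high pairwise
# correlations would indicate the convergence the study reported.
from statistics import mean, correlation  # correlation() needs Python 3.10+

pre_self  = [4, 3, 5, 4, 3, 4, 5, 4, 3, 4]  # phase 1: predicted self-ratings
raters    = [4, 3, 4, 4, 3, 4, 5, 4, 4, 4]  # phase 2: native-speaker ratings
post_self = [4, 3, 5, 4, 3, 4, 4, 4, 3, 4]  # phase 3: self-ratings after video

methods = {"pre-self": pre_self, "raters": raters, "post-self": post_self}
for name, scores in methods.items():
    print(f"{name:>9}: mean = {mean(scores):.2f}")

names = list(methods)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = correlation(methods[names[i]], methods[names[j]])
        print(f"r({names[i]}, {names[j]}) = {r:.2f}")
```

With data like these, near-identical means and correlations approaching 1 would correspond to the "little variance" across methods that Hudson observed.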
4. Measuring L2 Vocabulary

After mentioning the pros and cons of the leading vocabulary assessment methods, Yoshimitsu Kudo outlined a pilot study to gauge the vocabulary size of Japanese EFL students. A twenty-minute test consisting of 18 definition-matching questions for five different vocabulary levels was administered. To reduce the task load and enhance speed, a translation format was selected. The pilot test yielded low reliability, and Kudo suspected that English loan words may have affected his data. He then revised the test by filtering out English loan words, and the reliability coefficients rose for four of the five tests. Kudo concluded by cautioning vocabulary test developers to consider the impact of foreign loan words, which now permeate many languages.

5. Assessing Collocational Knowledge

After providing a theoretical discussion of what collocations are and how they differ from idioms and formulaic speech, William Bonk described a study to ascertain the relation between collocational knowledge and general language proficiency. He administered a 50-item collocation test and an abridged TOEFL to 98 ESL students at the University of Hawai'i. Bonk found a .73 correlation between collocational knowledge and general language proficiency and emphasized the need for further research on collocational knowledge.

6. Revising Cloze Tests

Next, J. D. Brown, Amy Yamashiro, and Ethel Ogane described three approaches to cloze test development. To see how each method differed, they conducted a two-phase research study. In the first phase, cloze tests with different deletion patterns were administered to 193 Japanese university students. In the second phase, two revised versions of the test were created: one with an exact-answer scoring system and the other with a broader acceptable-answer scoring system. The researchers found the acceptable-answer scoring system superior in most respects to the exact-answer system and concluded with some practical advice for cloze test design.

7. Task-based EAP Assessment

Highlighting some of the ongoing controversies regarding performance assessment, John Norris then detailed how an EAP test at the University of Hawai'i was developed. Three diverse representatives of the university community were recruited to identify criteria for an EAP performance test. An analysis of their notes revealed significant variation in their ideas about benchmark performance criteria and in the overall level of detail. The need for team members to be able to negotiate the final test criteria effectively was underscored. Norris concluded by listing ways to help criteria-identification teams work more effectively and mentioned some topics for further research in performance assessment.

8. Criterion-referenced Tests

In the final chapter, J. D. Brown outlined how he developed a set of criterion-referenced testing materials for a popular textbook series. After creating and piloting a large pool of test items, Brown revised or weeded out those items that seemed problematic. A semester later, different revised forms of the test were administered, and items that did not discriminate well or that lacked validity or reliability were again discarded. The final test was sensitive to the material covered in the textbook series, valid for the populations tested, and fairly dependable. Despite this, Brown cautioned against basing high-stakes educational decisions on any single test score.
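The review does not say which item statistics Brown used to decide that an item "did not discriminate well," so the sketch below shows one common criterion-referenced option, the difference index: the proportion of students answering an item correctly after instruction minus the proportion answering correctly before. The item scores and the 0.30 retention cutoff are invented for illustration, not drawn from the chapter.

```python
# Illustrative sketch only: difference index for criterion-referenced item
# analysis. The data and the 0.30 cutoff are hypothetical, not Brown's.

def difference_index(pre, post):
    """Proportion correct post-instruction minus proportion correct
    pre-instruction, for one item's 0/1 scores. Values near zero flag items
    that fail to separate students who have covered the material from those
    who have not."""
    return sum(post) / len(post) - sum(pre) / len(pre)

# Hypothetical 0/1 scores for ten students before and after the unit.
items = {
    "item_01": ([0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
                [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]),
    "item_02": ([1, 1, 1, 0, 1, 1, 1, 1, 1, 0],   # already easy pre-instruction
                [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]),
}

for name, (pre, post) in items.items():
    di = difference_index(pre, post)
    verdict = "keep" if di >= 0.30 else "revise or discard"
    print(f"{name}: difference index = {di:+.2f} -> {verdict}")
```

Here item_01 (+0.60) would be retained while item_02 (+0.10) would be flagged, mirroring the kind of weeding the chapter describes.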
This volume will appeal to two groups of readers: those with some understanding of research design principles who wish to reflect critically on how those principles play out in actual studies, and those with an interest in non-standard types of language assessment.