Tatsuhiko Matsushita
LALS, Victoria University of Wellington
tatsuhiko.matsushita@vuw.ac.nz

Main findings
• VDRJ is useful for designing curriculum (materials, tests etc.)
• The more domains a word is shared by as an academic word (AW) or limited-academic-domain word (LAD), the more abstract its meaning is.
• Conversation and non-academic texts contain more general words and literary words (LW)
• Academic texts: more AW and LAD but fewer LW in any academic domain
• Wikipedia: more proper nouns and low-frequency words
• Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD.
• Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts
• Origins of academic and literary words are quite clearly separated; about 3/4 of LW originate in Japanese while about 3/4 of AW and LAD originate in Chinese
• LAD contains more Western-origin words (Gairaigo)

Contents
1. Motive for this research
2. Goals of this presentation
3. Vocabulary Database for Reading Japanese
4. Tiers of Japanese vocabulary (Basic words, academic words, limited-academic-domain words, literary words)
5. Text coverage by word tier
6. Proportions of word origin types by word tiers
7. Number of characters required to cover the word tiers / Implications from the findings
8. Conclusion

1. Motive for this research
How efficiently can we learn vocabulary?
• Learning burden is big!
• More effective choice of target words
• More efficient order for learning the words
Effective choice and efficient order: to maximize the coverage of the texts which the learner would encounter in his/her domain
= Reading comprehension and lexical density (Hu & Nation, 2000; Komori et al., 2004)
Q. What words should learners learn first? And second and next?

Studies on EAP vocabulary
• Basic: General Service List (West, 1953)
• Academic: AWL (Coxhead, 2000), UWL (Xue & Nation, 1984)
• EGAP-A/S, EGAP-HM/SS etc.
(Tajino, Dalsky, & Sasao, 2009)
• Science-specific Word List (Coxhead & Hirsh, 2007)
• Technical: e.g. Chung (2003)
• Literary vocabulary?

Studies on JAP vocabulary
• Basic: The former JLPT list, Tamamura (1987) etc.
• Academic: Butler (2010), Matsushita (2011)
• ?
• Technical: Komiya (1995), Oka (1992) etc.
• Others
  • No list for words between academic and technical words
  • Literary vocabulary?

2. Goals of this presentation
To introduce
I. the Vocabulary Database for Reading Japanese
II. extracted domain-specific words such as Academic Words (AW), Limited-Academic-Domain Words (LAD), Literary Words (LW)
To discuss
IV. how the word tiers work in different types of text (register variation)
V. how a learner's language background possibly affects the understanding of texts in different genres

3. Vocabulary Database for Reading Japanese
• Vocabulary Database for Reading Japanese (VDRJ) (Matsushita, 2010; 2011)
• Created from the Balanced Corpus of Contemporary Written Japanese (BCCWJ), 2009 monitor version (NINJAL, 2009)
• 33 million tokens (28 million from books and 5 million from Internet forum sites (Yahoo Chiebukuro))
• 19 million content words and 14 million function words
• Unit of counting: lexeme; considerably inclusive, but less inclusive than the English word family (Level 6 in Bauer & Nation, 1993)
• "Short units of lexemes" are ranked by U (usage coefficient) (Juilland & Chang-Rodriguez, 1964)
• Short unit of lexeme: more inclusive than "lemma", less inclusive than "word family"

Some problems of existing Japanese word frequency lists
• Lack of representativeness
• Too old
• The corpus size is not large enough: low reliability for low-frequency words
• No good sub-frequency data for calculating dispersion, which is needed to downgrade unevenly distributed words

Advantages of word lists
* Various types of word lists can be created from the vocabulary database (VDRJ)
A) Reference for developing vocabulary tests = Checking learners' vocabulary levels
B) Reference for
checking vocabulary level of material = Checking vocabulary levels of materials
C) Specify vocabulary for learners to learn and for teachers to teach
For better choice of material, modification of text
Cf. Nation (2011), Word profiler

How to make VDRJ
A) Method
I. Classify all the texts into sub-corpora to see the range and dispersion (cf. Nippon Decimal Classification; BCCWJ (NINJAL, 2009))
II. Parse (segment into words) all the texts with a morphological analyzer and a dictionary, if the text is not segmented by spaces between words (cf. MeCab, UniDic)
III. Make word lists with AntConc and/or AntWordProfiler

Content and construct of VDRJ
• Vocabulary Database for Reading Japanese
• The list is for reading, as it is made from a written corpus of books and Internet forum sites
• Written and spoken language differ in word frequency, domain and required language processing skills
⇒ A good corpus of spoken language is necessary to develop a good word list for it (but there is no very good corpus of spoken Japanese...)

The Classification of Domains and Fields (Corpus from books and Internet forum sites, BCCWJ 2009 monitor version)
Domain | Field | Field code (28 academic fields) | Ten-domain group (code) | Notes
Literary Works/Imaginative Texts | Literary works | a6_G | Literary works (LW) | All classified as general texts of a6
Humanities and Arts | Languages and Linguistics | a1 | Languages, Linguistics and Philosophy (LP) |
Humanities and Arts | Philosophy and Religion | a2 | Languages, Linguistics and Philosophy (LP) |
Humanities and Arts | History | a3 | History and Ethnology (HE) |
Humanities and Arts | Ethnology | a4 | History and Ethnology (HE) |
Humanities and Arts | Fine Arts | a5 | Arts and Other Humanities (AH) |
Humanities and Arts | Literature (non-imaginative texts e.g. critique) | a6_T | Arts and Other Humanities (AH) | All classified as technical texts of a6
Humanities and Arts | Other Humanities and Arts | a7 | Arts and Other Humanities (AH) |
Social Sciences | Politics | s1 | Politics and Law (PL) |
Social Sciences | Law | s2 | Politics and Law (PL) |
Social Sciences | Economics | s3 | Economics and Commerce (EC) |
Social Sciences | Commerce and Business | s4 | Economics and Commerce (EC) |
Social Sciences | Sociology and Social Issues | s5 | Sociology, Education and Other Social Issues (SE) | Including welfare, labour, gender issues
Social Sciences | Education | s6 | Sociology, Education and Other Social Issues (SE) | Including pedagogy on each subject
Social Sciences | Other Social Matters | s7 | Sociology, Education and Other Social Issues (SE) | Including transportation, media, current issues
Technological Natural Sciences | Mathematics | t1 | Science and Technology (ST) |
Technological Natural Sciences | Physics | t2 | Science and Technology (ST) |
Technological Natural Sciences | Astronomy, Earth and Planetary Science | t3 | Science and Technology (ST) |
Technological Natural Sciences | Chemistry, Metal and Mine | t4 | Science and Technology (ST) |
Technological Natural Sciences | Technology (Architecture, Civil Engineering) | t5 | Science and Technology (ST) |
Technological Natural Sciences | Technology (Mechanics, Electricity, Marine Engineering) | t6 | Science and Technology (ST) |
Technological Natural Sciences | Other Technological Natural Sciences | t7 | Science and Technology (ST) | Including information science, manufacturing, library science, part of domestic science
Biological Natural Sciences | Biology | b1 | Biology and Medicine (BM) |
Biological Natural Sciences | Agriculture | b2 | Biology and Medicine (BM) | Including forestry, fishery, animal husbandry, veterinary
Biological Natural Sciences | Pharmacy | b3 | Biology and Medicine (BM) |
Biological Natural Sciences | Medicine | b4 | Biology and Medicine (BM) |
Biological Natural Sciences | Dentistry | b5 | Biology and Medicine (BM) |
Biological Natural Sciences | Nursing | b6 | Biology and Medicine (BM) |
Biological Natural Sciences | Other Biological Natural Sciences | b7 | Biology and Medicine (BM) | Including sports, hygienics, environmentology, part of domestic science
Internet Q & A Forum (Yahoo Chiebukuro) | | | Internet Q & A Forum (IF) |

Content of the sub-corpora
Types and Tokens by the Ten Domain Classification (Corpus from books and Internet forum sites, BCCWJ 2009 monitor version)
Domain | Number of Tokens | Ratio
Literary Works/Imaginative Texts | 8251999 | 25.1%
Languages, Linguistics and Philosophy | 2134739 | 6.5%
History and Ethnology | 3336818 | 10.2%
Arts and Other Humanities | 3020917 | 9.2%
Politics and Law | 1881012 | 5.7%
Economics and Commerce | 2209107 | 6.7%
Sociology, Education and Other Social Issues | 2996147 | 9.1%
Science and Technology | 1512784 | 4.6%
Biology and Medicine | 2251037 | 6.9%
Internet Q & A Forum | 5224852 | 15.9%
Total | 32819412 | 100.0%

Different word rankings
• The word ranking problem mainly exists in Basic Words
• This is mainly due to the lack of good spoken corpora
• Compromise: frequency weighted to limited domains which seem to
reflect basic daily needs
  • For International Students
  • For General Learners
• Non-weighted (ranking for overall written Japanese)

[Figure: Multidimensional scaling (MDS) of the 10 domains, and of the 10 domains + word familiarity]

4. Tiers of Japanese vocabulary (1) The concept of "word tiers"
• Domain / Level
• Level = general importance = frequency × dispersion
• Some words are frequent only in a particular domain, e.g. 発送 (shipping), 振り込み (paying by bank transfer), 古墳 (tumulus / burial mound)

Assumed word tiers for students
Level
• Basic: Top 1288 = Former JLPT Levels 4 & 3
• Intermediate: Ranked 1289-5000
• Advanced 1: 6K-10K
• Advanced 2: 11K-15K
• Super-Advanced: 16K-20K
• 21K+
• Assumed Known Words (AKW)
Domain
* General / Academic / Literary

4. Tiers of Japanese vocabulary (2) Basic words (BW)
• Feature of the corpus: formal written language, similar to the BNC (Nation, 2004)
• No good spoken corpus for vocabulary studies
• Compromise
  • For the learners' and teachers' lists, the former JLPT Level 4 & 3 vocabulary is put at the top of the list as basic words
To order the basic words
• Identify the domains closest to word familiarity (basic needs) by Multidimensional Scaling (MDS)
• Frequency in literary works and the Internet forum sites (Yahoo Chiebukuro) is weighted

4. Tiers of Japanese vocabulary (3) Academic domain words
Extracting academic domain words
• Log-likelihood ratio (LLR) (Dunning, 1993)
• Target texts: Technical texts
  • Classified into four large academic domains
  • Total number of tokens: approx. 2.9 million
• Reference texts: General texts in BCCWJ 2009
  • Total number of tokens: approx. 29.9 million
• Extract keywords shared by 4 to 1 domains
• Cut-off point: higher for more narrowly distributed words

Number of Shared Academic Domains among the 4 academic domains
[Figure: Venn diagram of the four academic domains; each region is labelled with the number of domains (1 to 4) that share it]
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
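The keyword-extraction step above can be illustrated with a small sketch of the log-likelihood ratio for a single word. This is a minimal example under stated assumptions, not the procedure used to build VDRJ: the slides do not say which variant of the statistic was used, so the full 2x2 contingency form is assumed here, the sign convention (positive when the word is over-represented in the target texts) is an assumption made so that a cut-off such as "LLR > 0" is meaningful, and the function name `signed_llr` and the word frequencies are invented. Only the two corpus sizes come from the slides.

```python
import math


def signed_llr(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood ratio (G2) of one word in a 2x2 table, signed by direction.

    Positive: the word is relatively more frequent in the target corpus.
    Negative: it is relatively more frequent in the reference corpus.
    """
    observed = [
        [freq_target, size_target - freq_target],  # target: word / other tokens
        [freq_ref, size_ref - freq_ref],            # reference: word / other tokens
    ]
    row_totals = [size_target, size_ref]
    col_totals = [freq_target + freq_ref,
                  size_target + size_ref - freq_target - freq_ref]
    grand_total = size_target + size_ref

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / expected)
    g2 *= 2.0

    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2


# Corpus sizes quoted on the slide; the word frequencies are invented.
TARGET_TOKENS = 2_900_000      # technical texts (approx.)
REFERENCE_TOKENS = 29_900_000  # general texts (approx.)

academic_like = signed_llr(1_200, TARGET_TOKENS, 3_000, REFERENCE_TOKENS)
everyday_like = signed_llr(100, TARGET_TOKENS, 9_000, REFERENCE_TOKENS)
print(round(academic_like, 1), round(everyday_like, 1))
```

A word whose relative frequency is identical in both corpora scores 0, so in this sketch any positive score marks a word as more typical of the technical texts than of the general texts.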
(3) Academic domain words
• Academic words (AW): high specificity in 3+ academic domains
  • 4-domain words (cut-off point: LLR > 0)
  • 3-domain words (cut-off point: LLR > 0)
• Limited-academic-domain words (LAD)
  • 2-domain words (cut-off point: LLR > 1)
  • 1-domain words (cut-off point: LLR > average value)
• Eliminate the former JLPT Level 4 vocabulary (Top 700 words)
• Eliminate the words ranked 20001 or lower
• Classify all the AW and LAD by word ranking levels for International Students (U = Usage Coefficient):
  • 5 levels: Basic / Inter. / Adv. 1 / Adv. 2 / Super-adv.
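The cut-offs and elimination rules listed above can be read as a small decision procedure. The sketch below is one possible reading, not the author's actual implementation. It assumes that each word has a signed LLR per academic domain, that a domain counts as "shared" when that LLR is positive, and that a 2-domain word must exceed the cut-off of 1 in both domains. The names `classify`, `level_of` and `one_domain_cutoff` are hypothetical, the lower rank bound of 679 is taken from the level tables rather than from the "Top 700" wording, and all LLR values and cut-off values are invented.

```python
DOMAINS = ("Ah", "Ss", "Tn", "Bn")

# Rank bands of the International Students ranking, as given in the level tables.
LEVELS = [
    (679, 1288, "Basic"),
    (1289, 5000, "Inter."),
    (5001, 10000, "Adv. 1"),
    (10001, 15000, "Adv. 2"),
    (15001, 20000, "Super-adv."),
]


def level_of(rank):
    """Return the level label for a rank, or None if the word is eliminated."""
    for low, high, label in LEVELS:
        if low <= rank <= high:
            return label
    return None  # former JLPT Level 4 words, or words ranked 20001 or lower


def classify(llr, rank, one_domain_cutoff):
    """Return (tier, number of domains, level) for one word, or None.

    llr: dict mapping each domain code to the word's signed LLR there.
    one_domain_cutoff: dict mapping each domain code to the cut-off for
        1-domain words (the slides say "average value"; how the average
        is taken is left to the caller in this sketch).
    """
    level = level_of(rank)
    if level is None:
        return None
    key_domains = [d for d in DOMAINS if llr.get(d, 0.0) > 0]
    n = len(key_domains)
    if n >= 3:
        tier = "AW"   # 3- and 4-domain words: cut-off LLR > 0
    elif n == 2 and all(llr[d] > 1 for d in key_domains):
        tier = "LAD"  # 2-domain words: cut-off LLR > 1 (assumed in both domains)
    elif n == 1 and llr[key_domains[0]] > one_domain_cutoff[key_domains[0]]:
        tier = "LAD"  # 1-domain words: cut-off above the average value
    else:
        return None
    return tier, n, level


cutoffs = {"Ah": 20.0, "Ss": 20.0, "Tn": 20.0, "Bn": 20.0}  # invented values
word_a = classify({"Ah": 35.2, "Ss": 12.8, "Tn": 4.1, "Bn": -6.0}, 3200, cutoffs)
word_b = classify({"Ah": -3.0, "Ss": 15.4, "Tn": 8.9, "Bn": -1.2}, 7400, cutoffs)
word_c = classify({"Ah": 2.5, "Ss": -4.0, "Tn": -9.1, "Bn": -0.3}, 12000, cutoffs)
print(word_a, word_b, word_c)
```

In this reading the cut-off rises as the distribution narrows, which matches the principle stated earlier that the cut-off point is higher for more narrowly distributed words.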
Tiers of Japanese vocabulary (3)-1 Academic words (AW)
• JAWL = Japanese Academic Word List
• High specificity in 3 or 4 academic domains
  • 4-domain words (cut-off point: LLR > 0)
  • 3-domain words (cut-off point: LLR > 0)
• Levels 0 to VIII: 9 levels, 2590 words in total
• JAWL I (Intermediate): most essential for learning
  • Basic words contain far fewer academic words
• JAWL I: 559 words
  Close to the AWL in number and text coverage
  Coverage in the academic corpus used for extracting AW: AWL 10.0%; JAWL I 11.1%

Academic Words: Words which are shared by 3 or 4 main academic domains
[Figure: Venn diagram of the four academic domains; each region is labelled with the number of domains (1 to 4) that share it]
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences

Distribution and examples of JAWL
JAWL label | Word ranking for International Students | Level | High-specificity domains (of the 4 academic domains) | Unique lexemes | Least frequent 6 words | Translation
JAWL 0 | 679-1288 | Basic | 4 | 31 | 科学 規則 割合 生産 産業 講義 | science, rule, proportion, production, industry, lecture
JAWL 0 | 679-1288 | Basic | 3 | 39 | 人口 スクリーン 数学 競争 工業 地理 | population, screen, mathematics, competition, manufacture, geography
JAWL I | 1289-5000 | Inter. | 4 | 559 | 発足 半数 配分 縮小 適正 見直し | inauguration, half the number, allocation, downsize, proper, reconsider
JAWL II | 1289-5000 | Inter. | 3 | 542 | 演説 大小 実情 ステージ ライフ 担保 | speech, size, real situation, stage, life, guarantee
JAWL III | 5001-10000 | Adv. 1 | 4 | 212 | 難問 能動 付随 定型 除 本稿 | difficult problem, active, accompany, standard, except, this article
JAWL IV | 5001-10000 | Adv. 1 | 3 | 452 | 交錯 カウント 精度 一因 箇年 エンド | mixture, count, accuracy, one cause, -year, end
JAWL V | 10001-15000 | Adv. 2 | 4 | 103 | 併存 親和 盛況 散在 補填 関わり合う | coexistence, affinity, prosperity, straggle, compensation, implicated
JAWL VI | 10001-15000 | Adv. 2 | 3 | 328 | 帰着 編著 沿海 拮抗 常套 内情 | come down to, written and edited, coastal, close competition, conventional, internal condition
JAWL VII | 15001-20000 | Super-adv. | 4 | 56 | 閉 増刊 含意 複 活路 所与 | closed, extra edition, implication, double-, way out, given
JAWL VIII | 15001-20000 | Super-adv. | 3 | 269 | 極小 付則 深度 概算 頒布 円錐 | minimal, additional clause, depth, rough estimate, distribution (of goods/paper), cone
Former JLPT level: L3 for JAWL 0; L2, L1 and Other for the higher levels.

4. (3)-1 Academic words (AW): Semantic features of AW (1)
• Highly abstract, essential for operating logic, i.e.
  • Range: 占める (occupy, account for), 特殊 (special, particular)
  • Relation: 属する (belong to), 依存 (rely/reliance)
  • Comparison/Evaluation: 後者 (the latter), 優れる (superior)
  • Quantitative change: 減少 (decrease), 強化 (reinforce)
  • Stage: 当初 (beginning), 現状 (present condition)
  • Development of enunciation: 取り上げる (take up [an issue]), まとめる (summarize)
  • Cause-effect, degree, agent, action, object, direction, goal, instrument, time etc.

4. Tiers of Japanese vocabulary (3)-1 Academic words (AW): Semantic features of AW (2)
The most frequent Kanji used for AW: 合 (combine, together), 定 (fix, certain), 分 (divide, minute), 一 (one), 同 (same), 数 (number), 上 (up), 体 (body), 出 (out), 大 (large)
• 3-domain words: Some words have concrete meanings, e.g. 署名 (signature), 保健 (health, hygiene)
• 4-domain words: Few words have concrete meanings
• The nature of the words is the same at all levels

POS of Japanese AW (1)
• Common noun: 1072 words (41.4 %), e.g. 背景 (background)
• Verbal noun: 882 words (34.0 %), e.g. 連続 (continue/continuation)
  Adding other types of nouns together, 2104 words (81.2 %) can be a noun
• Verb (excluding verbal nouns): 225 words (8.7 %), e.g. 認める (recognize/approve), 述べる (describe/mention)
  Adding other types of verbs together, 1107 words (42.7 %) can be a verb
• Adjectival noun: 95 words (3.7 %), e.g. 詳細 (detail/-ed), 平等 (equal/-ity)
• Adjective: Only 9 words (0.3 %), e.g. 著しい (remarkable)

POS of Japanese AW (2)
• Affix: 106 words (4.1 %), e.g. -期 (period), -種 (type); substantial in Japanese academic words
• Adverb: 34 words (1.3 %), e.g. しばしば (frequently)
• Other (particle, auxiliary verb etc.): 22 words (0.8 %)
• Remarkably many archaic words e.g.
のみ (only), つつ (while doing), べし (ought to), あらゆる (every), いかなる (any), 我が (my), 漠然 (vague)
• れる/られる (passive/potential/spontaneous) is specific to academic texts

4. (3)-2 Limited-academic-domain words (LAD)
• Limited-academic-domain words (LAD)
  • High specificity in 2 or 1 domain(s)
  • 2-domain words (cut-off point: LLR > 1)
  • 1-domain words (cut-off point: LLR > average value)
• Something between "academic" and "technical"
• The "leftovers" from extracting AW?
• Tiers of curriculum, cf. Tajino et al. (2007)
• Words corresponding to the curriculum
  • Basic: all the learners
  • Academic words: prep. to first year
  • Limited-academic-domain words (?): prep. to major
  • Technical words: major to postgrad.

Number of Shared Academic Domains among the four academic domains
Limited-Academic-Domain Words: Words which are shared by only 1 or 2 main academic domain(s)
[Figure: Venn diagram of the four academic domains; each region is labelled with the number of domains (1 to 4) that share it]
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences

4. (3)-2 Limited-academic-domain words (LAD): 2-domain words
Distribution of 2-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): number of unique lexemes
JLAD label | Word ranking for International Students | Level | Ah & Ss | Ah & Tn | Ah & Bn | Ss & Tn | Ss & Bn | Tn & Bn | Total
JLAD 0 | 679-1288 | Basic | 15 | 5 | 4 | 5 | 6 | 10 | 45
JLAD I | 1289-5000 | Inter. | 139 | 27 | 30 | 77 | 57 | 61 | 391
JLAD III | 5001-10000 | Adv. 1 | 138 | 38 | 25 | 86 | 50 | 92 | 429
JLAD V | 10001-15000 | Adv. 2 | 91 | 28 | 22 | 58 | 37 | 60 | 296
JLAD VII | 15001-20000 | Super-adv. | 93 | 23 | 17 | 43 | 16 | 40 | 232
Total | | | 476 | 121 | 98 | 269 | 166 | 263 | 1393
Former JLPT level: L3 for JLAD 0; L2, L1 and Other for the higher levels.
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
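The rank bands used in these tables come from ranking lexemes by the usage coefficient U, which combines frequency with dispersion across sub-corpora. The sketch below shows one common formulation, Juilland's D with U = F × D. It is an illustration under assumptions rather than the documented VDRJ computation: normalizing each sub-frequency by sub-corpus size (because the ten domains differ in size) is an assumption, the function names are hypothetical, and the word frequencies are invented. Only the ten token counts are taken from the sub-corpus table.

```python
import math


def juilland_d(sub_freqs, sub_sizes):
    """Juilland's dispersion D over sub-corpora (1 = perfectly even, 0 = one part only)."""
    rates = [f / s for f, s in zip(sub_freqs, sub_sizes)]  # size-normalized frequencies
    n = len(rates)
    mean = sum(rates) / n
    if mean == 0:
        return 0.0
    variance = sum((r - mean) ** 2 for r in rates) / n  # population variance
    coefficient_of_variation = math.sqrt(variance) / mean
    return 1.0 - coefficient_of_variation / math.sqrt(n - 1)


def usage_coefficient(sub_freqs, sub_sizes):
    """U = total frequency F multiplied by dispersion D."""
    return sum(sub_freqs) * juilland_d(sub_freqs, sub_sizes)


# Token counts of the ten domains (LW, LP, HE, AH, PL, EC, SE, ST, BM, IF).
SIZES = [8251999, 2134739, 3336818, 3020917, 1881012,
         2209107, 2996147, 1512784, 2251037, 5224852]

# Invented frequencies: one word spread evenly, one word confined to a single domain.
even_word = [round(size / 10000) for size in SIZES]
bound_word = [0, 0, sum(even_word), 0, 0, 0, 0, 0, 0, 0]

print(usage_coefficient(even_word, SIZES), usage_coefficient(bound_word, SIZES))
```

Both invented words have the same total frequency, but the evenly spread word keeps almost all of it in U while the domain-bound word drops to about zero. That is the sense in which dispersion downgrades unevenly distributed words.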
(3)-2 Limited-academic-domain words (LAD): 2-domain words
Examples of 2-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): the least frequent 2 words in each domain pair, with translations
JLAD label | Word ranking for International Students | Level | Ah & Ss | Ah & Tn | Ah & Bn | Ss & Tn | Ss & Bn | Tn & Bn
JLAD 0 | 679-1288 | Basic | 貿易 (trade), 輸出 (export) | 砂 (sand), テキスト (text) | 発音 (pronunciation), ステレオ (stereo) | 製 (made (in)), レポート (report) | 以内 (within), パート (part(-timer)) | アルコール (alcohol), テニス (tennis)
JLAD I | 1289-5000 | Inter. | 孤立 (isolation), 融資 (loan) | オール (all), ペーパー (paper) | 静岡 (Shizuoka pref./city), 書簡 (epistle) | ニーズ (need (n.)), 顧客 (customer) | 総務 (general affairs), 性的 (sexual) | スイッチ (switch), 液 (liquid)
JLAD III | 5001-10000 | Adv. 1 | 容れる (compatible), 教義 (doctrine) | 音響 (acoustic), 流布 (circulation) | 発現 (manifestation), 海域 (waters) | 本件 (this matter), セクション (section) | 閉塞 (impasse), 弱める (weaken) | 多用 (frequent use), 部位 (region (of body))
JLAD V | 10001-15000 | Adv. 2 | 払い戻し (refund), ユニバーシティ (university) | 落差 (a drop), コロン (cologne) | 目付け (overseer), 生長 (growth) | VTR (VTR), リハーサル (rehearsal) | 所見 (remark (n.)), 救命 (lifesaving) | 光学 (optics), ペーハー (pH)
JLAD VII | 15001-20000 | Super-adv. | 峻別 (sharp distinction), 公債 (public bond) | 目配り (meticulous care), テクノ (techno) | 太極 (tai ji), 増量 (increase in quantity) | パレット (pallet), 軽微 (slight) | マンガン (manganese), 居宅 (dwelling) | 棒状 (stick-shaped), 雨水 (rainwater)
Former JLPT level: L3 for JLAD 0; L2, L1 and Other for the higher levels.

Examples of 2-domain words: Words which are shared by only 2 main academic domains
[Figure: Venn diagram placing example words from the table above in the regions shared by two domains]
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences

4. (3)-2 Limited-academic-domain words (LAD): 2-domain words
• Semantic features
  • More concrete and specific than academic words
  • Ah & Ss: Social, overlap in history and ethnology
  • Ss & Tn: Industrial
  • Ss & Bn: Social security, medical and nursing service
  • Tn & Bn: Scientific
  • Ah & Tn, Ah & Bn: not clear

4. (3)-2 Limited-academic-domain words (LAD): 1-domain words
• It is merely a trial
• The corpus is not the best for academic purposes, especially for natural sciences
• Extracting something common across domains is much easier, while extracting words from only one target corpus requires a more complete target corpus
• Therefore, AW (4-domain and 3-domain words) will be more reliable than LAD (2-domain and 1-domain words)

Distribution of 1-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): number of unique lexemes
JLAD label | Word ranking for International Students | Level | Ah | Ss | Tn | Bn | Total
JLAD 0 | 679-1288 | Basic | 13 | 6 | 5 | 9 | 33
JLAD I | 1289-5000 | Inter. | 104 | 111 | 46 | 52 | 313
JLAD III | 5001-10000 | Adv. 1 | 104 | 127 | 60 | 68 | 359
JLAD V | 10001-15000 | Adv. 2 | 71 | 74 | 48 | 54 | 247
JLAD VII | 15001-20000 | Super-adv. | 60 | 55 | 29 | 53 | 197
Total | | | 352 | 373 | 188 | 236 | 1149
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences

Examples of 1-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): the least frequent 2 words in each domain, with translations
JLAD label | Word ranking for International Students | Level | Ah | Ss | Tn | Bn
JLAD 0 | 679-1288 | Basic | 辞典 (dictionary), 文法 (grammar) | 工場 (factory), 遊び (play(ing)) | 海岸 (seashore), 汽車 (train) | 退院 (leave hospital), 柔道 (judo)
JLAD I | 1289-5000 | Inter. | 色彩 (coloring), 滋賀 (Shiga (pref.)) | 紛争 (conflict), 犯 (offense) | 原子 (atom), コンクリート (concrete (n.)) | 拳 (fist/martial art), 杉 (cedar)
JLAD III | 5001-10000 | Adv. 1 | 王家 (royal family), 呪術 (incantation) | 超過 (excess), 欠席 (absence) | 硬化 (harden(ing)), ドラッグ (drag/drug) | 臓器 (organ), 左足 (left leg/foot)
JLAD V | 10001-15000 | Adv. 2 | 報国 (patriotic), 遍歴 (itinerancy) | 持ち分 (quota), 受諾 (acceptance) | PM (PM), 蒸留 (distillation) | 卵子 (ovum), 緑茶 (green tea)
JLAD VII | 15001-20000 | Super-adv. | 厳寒 (intense cold), 鼎 (three-legged vessel) | 卸売り (wholesale), 引き当て (reserve fund) | プログラミング (programming), バラック (shanty) | 居合 (iai (martial arts)), 微小 (micro)
Former JLPT level: L3 for JLAD 0; L2, L1 and Other for the higher levels.
• Semantic features are much clearer than those of 2-domain words

Examples of Academic Domain Words: Words which are shared by 1, 2, 3 or 4 main academic domain(s)
[Figure: Venn diagram placing example words from the tables above in the regions shared by 1, 2, 3 or 4 domains]
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences

POS of Japanese LAD (1)
• Common noun: 1605 words (63.1 %); more than AW (41.4 %)
• Verbal noun: 633 words (24.9 %), e.g. 融資 (finance); cf. AW (34.0 %)
  Adding other types of nouns together, 2104 words (87.9 %) can be a noun; more than AW (81.2 %)
• Verb (excl. verbal nouns): 81 words (3.2 %), e.g. 訳す (translate), 向き合う (face (v.)); cf. AW (8.7 %)
  Adding other types of verbs together, 714 words (28.1 %) can be a verb; less than AW (42.7 %)
• Adjectival noun: 88 words (3.5 %), e.g. フル (full), 偉大 (great); cf. AW (3.7 %)
• Adjective: Only 3 words (0.1 %), e.g. 硬い (stiff); cf. AW (0.3 %)

POS of Japanese LAD (2)
• Affix: 109 words (4.3 %), e.g. -犯 (offense); cf. AW (4.1 %); substantial in Japanese academic domain words
• Adverb: 15 words (0.6 %), e.g. 現に (surely); cf. AW (1.3 %)
• Other (particle, auxiliary verb etc.): 9 words (0.4 %); cf. AW (0.8 %)
• Remarkably many archaic words, similar to AW, e.g. なり [affirmative aux.], とも (even though), たり [affirmative aux.], ごとし (as/like), 単なる (mere), しめる (=しむ) [causative aux.], かかる (such)
Tiers of Japanese vocabulary (4) Literary words (LW)
Extracting literary words: Words for reading literary works
• Log-likelihood ratio (Keyness in AntConc)
• Target corpus: literary works (identified by NDC and C-code) in BCCWJ 2009 (NINJAL, 2009); over 8 million tokens
• 4 different reference corpora: technical texts, general texts in arts and humanities, general texts in the other 3 academic domains, Internet forum texts (Yahoo Chiebukuro)
• Extract keywords shared by the four results (cut-off point: average value)
• Eliminate the former JLPT Level 4 vocabulary (Top 700 words)
• Eliminate the words ranked 20001 or lower
• Classify all the LW by word ranking levels for International Students (U = Usage Coefficient)

4. (4) Literary words (LW): Distribution and examples
Distribution and examples of Japanese Literary Words (JLW)
JLW label | Word ranking for International Students | Unique lexemes | Least frequent 2 words | Translation
Basic Lit. | 679-1288 | 142 | ちっとも, 引き出し | (not) at all, drawer
Inter. Lit. | 1289-5000 | 446 | 戸惑う, 吐き出す | puzzled, vent
Adv. 1 Lit. | 5001-10000 | 483 | 不吉, 銀色 | ominous, silver
Adv. 2 Lit. | 10001-15000 | 345 | 敵機, 口笛 | hostile aircraft, whistle
Super-adv. Lit. | 15001-20000 | 200 | 香菜, 樹海 | coriander, sea of trees
Total | | 1616 | |
Former JLPT level: L3 for Basic Lit.; L2, L1 and Other for the higher levels.

4. (4) Literary words (LW): POS of LW
Number of Unique Lexemes of Japanese Literary Words by POS
Level | N. (excl. VN & AN) | VN | V. (excl. VN) | AN (excl. VN) | Adj. | Affix | Adv. | Others | Total
Basic Lit. | 56 | 9 | 49 | 2 | 1 | 3 | 10 | 12 | 142
Inter. Lit. | 168 | 21 | 157 | 20 | 12 | 8 | 28 | 32 | 446
Adv. 1 Lit. | 199 | 23 | 163 | 25 | 13 | 12 | 28 | 20 | 483
Adv. 2 Lit. | 137 | 19 | 122 | 15 | 7 | 2 | 27 | 16 | 345
Super-adv. Lit. | 85 | 14 | 58 | 5 | 8 | 1 | 21 | 8 | 200
Total | 645 | 86 | 549 | 67 | 41 | 26 | 114 | 88 | 1616
VN: Verbal Noun, AN: Adjectival Noun
• More verbs, adverbs and interjections than AW and LAD
• Fewer verbal nouns and adjectival nouns
• This means LW have fewer loan words but more Japanese-origin words.

4. (4) Literary words (LW)
Q. How many LW overlap with AW and LAD?
• Only 27 words (0.5% of academic domain words, 1.7% of LW) overlap
• Most of the overlapping words (24/27) overlap with 1-domain words (17 words overlap with words in biological natural science)
• Many physical words such as words for body parts, e.g. 左手 (left hand), こぶし (fist), 血 (blood), 頭上 (overhead)
• No LW overlap with 4-domain words
• Overlapping words are mainly at the intermediate level
• No overlapping words at the 11K level or beyond
• Some examples of overlapping words: 音 (sound), 光 (light), 棚 (shelf), 組 (class), 岩 (rock), ひざ (knee), 興奮 (excite/-ment), 全身 (whole body), 帝 (emperor), ネズミ (mouse), 帆 (sail)

Word tiers: In what order should students learn them?
• Basic
  • General
  • AW/LAD
  • LW
• Intermediate
  • General
  • AW/LAD
  • LW
• Advanced
  • General
  • AW/LAD
  • LW
• Highly Advanced
  • General
  • AW/LAD
  • LW
• Super-Advanced
  • General
  • AW/LAD
  • LW
• Assumed known words
  • Proper names
  • Fillers, Signs
  • (Transparent compounds *)
  • Others

5. Text coverage by word tier
• The word tier analyser: an Excel sheet in which the word profile of a text can be checked automatically by pasting in the result that AntWordProfiler produces with the word-tier base word lists.
• Text covering efficiency
  High efficiency in vocabulary learning = Fewer unique lexemes cover more texts (Reciprocal Type/Token Ratio = Token/Type Ratio?)
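The two measures described above, text coverage by tier and covering efficiency as a token/type ratio, can be sketched in a few lines. This is not the Excel-based word tier analyser or AntWordProfiler. It is a toy illustration that assumes the text is already segmented into lexemes and that each lexeme belongs to at most one tier (the small AW/LAD and LW overlap is ignored). The function name `tier_coverage`, the fallback label and the tier assignments of the sample words are chosen for the example only.

```python
from collections import Counter


def tier_coverage(tokens, tier_of, other_label="21K+/Others"):
    """Return ({tier: % of tokens}, {tier: tokens per type}) for one segmented text."""
    token_counts = Counter()
    types_seen = {}
    for lexeme in tokens:
        tier = tier_of.get(lexeme, other_label)
        token_counts[tier] += 1
        types_seen.setdefault(tier, set()).add(lexeme)
    total = len(tokens)
    coverage = {tier: 100.0 * count / total for tier, count in token_counts.items()}
    efficiency = {tier: token_counts[tier] / len(types_seen[tier]) for tier in token_counts}
    return coverage, efficiency


# Toy tier list and toy text; the tier assignments are for illustration only.
tier_of = {
    "する": "General", "こと": "General",
    "割合": "Academic", "減少": "Academic",
    "融資": "LAD", "戸惑う": "Literary",
}
tokens = ["する", "こと", "割合", "する", "減少", "割合", "融資", "こと", "する", "古墳"]

coverage, efficiency = tier_coverage(tokens, tier_of)
print(coverage)
print(efficiency)
```

The token/type ratio says how many running words each learned lexeme accounts for on average, which is one way to read "fewer unique lexemes cover more texts". As with any type-based measure, it depends on text length.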
* Comparison should be made between equally-sized texts

Coverage of Japanese texts by word tier
Name of text | Text genre | Total number of tokens for each test corpus (million)
MC | Conversation | 1.13
UPC | Non-academic prose | 2.10
OB | Bestseller books (dominantly novels) | 2.30
BCCWJ | Books & Internet Forum | 32.82
WP | Wikipedia | 5.90
UYN | Newspaper | 5.68
TIS | Humanities & Social Sciences | 0.04
TB | Social Sciences | 0.19
MTC-Ss | Social Sciences | 0.05
MTC-Tn | Technological Natural Sciences | 0.07
MTC-Bn | Biological Natural Sciences | 0.01

% of Tokens (Overlap included)
Word tier (*) | Total # of types in each tier (lexeme) | MC | UPC | OB | BCCWJ | WP | UYN | TIS | TB | MTC-Ss | MTC-Tn | MTC-Bn
AKW (**) (Proper nouns etc.) | (30821) | 1.7 | 1.3 | 2.4 | 2.0 | 3.7 | 1.3 | 0.9 | 0.4 | 0.3 | 0.6 | 0.3
General | 13303 | 81.0 | 77.2 | 78.0 | 74.7 | 64.9 | 63.5 | 66.1 | 66.0 | 67.2 | 61.1 | 61.6
Academic | 2590 | 2.7 | 7.6 | 7.2 | 10.9 | 14.9 | 20.7 | 20.7 | 21.3 | 20.9 | 23.2 | 22.7
Limited-Aca.-Dom. | 2542 | 1.6 | 3.2 | 3.8 | 5.3 | 7.3 | 11.2 | 8.9 | 8.9 | 7.7 | 5.9 | 6.8
Literary | 1616 | 10.8 | 7.4 | 6.5 | 4.6 | 1.8 | 1.7 | 2.0 | 1.6 | 1.6 | 2.3 | 1.4
Overlap | -27 | 0.0 | -0.2 | -0.2 | -0.1 | -0.1 | -0.1 | 0.0 | 0.0 | 0.0 | -0.1 | -0.1
21K+ / Others | - | 2.2 | 3.5 | 2.2 | 2.6 | 7.4 | 1.7 | 1.4 | 1.8 | 2.4 | 7.0 | 7.3
Total | 20024 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
* All words except 'AKW' and '21K+ / Others' are listed in the top 20000 (01K-20K) ranked by the Word List for International Students (Matsushita, 2011)
** AKW (Assumed Known Words): Words such as proper nouns or fillers which are assumed not to require previous learning.

Findings from the text coverage
• Conversation and non-academic texts: more general words and LW
• Wikipedia: more proper nouns and low-frequency words
• Academic items of Wikipedia: 15-20% of the texts are estimated to be covered by JAWL I (559 types); encyclopaedic nature of AW?
  Can be a good resource for learning AW
• Newspapers: similar to academic texts, but contain more LAD and AW at the advanced level
  Can be a good resource for learning AW (esp. in social sci.)
• Academic texts: more AW and LAD but fewer LW in any academic domain
• Academic texts in natural sciences: more academic domain words at lower frequency levels (technical vocabulary) than Ah. and Ss. texts; similar to Coxhead, Stevens, & Tinkle (2010)

6. Proportion of word origin types by word tier
Proportion of Unique Lexemes by Word Origin and Word Tier in 01K-20K (*) (Matsushita, 2011)
Word tier | Japanese (%) | Chinese (%) | Western & Other (%) | Mixed (%) | Proper Nouns (%) | Unknown & Signs (%) | Total
General | 38.4 | 45.3 | 10.8 | 3.2 | 1.5 | 0.8 | 100.0
Academic | 15.0 | 75.2 | 7.0 | 1.9 | 0.4 | 0.5 | 100.0
Limited-academic-domain | 12.4 | 69.1 | 13.7 | 1.7 | 2.2 | 1.0 | 100.0
Literary | 71.7 | 21.8 | 2.5 | 3.1 | 0.3 | 0.6 | 100.0
Overlap | 74.1 | 22.2 | 0.0 | 3.7 | 0.0 | 0.0 | 100.0
Total | 34.7 | 50.3 | 10.0 | 2.8 | 1.4 | 0.8 | 100.0
* Including 24 compound numerals (01K+)

Findings from the proportion of word origin types by word tier
• LW: Japanese-origin words are significantly dominant
• AW and LAD: Chinese-origin words are significantly dominant
• LAD: more Western-origin words (Gairaigo)
  Western-origin words tend to appear more at lower frequency levels in academic domain words
• Origins of academic and literary words are quite clearly separated:
  • Academic: Chinese origin
  • Literary: Japanese origin

7. Implications from the findings
Q. Word Tiers: In what order should students learn them?
• Highly Advanced • Basic • Academic • Academic • LAD • LAD • General • General • Super-Advanced • Intermediate • Academic • Academic • LAD • LAD • General • General • Assumed known words • Advanced • Proper names • Academic • Fillers • Signs • LAD • (Transparent compounds *) • General • Others Implications for teaching and research • A vocabulary conscious curriculum should be designed and incorporated in Japanese programs depending on the learners’ needs and language backgrounds • The gap between Chinese-background learners (CBLs) and non-CBLs will be less in basic conversation and reading literary works than in reading academic texts • Good curriculum for learning academic domain words is particularly desired for non-CBLs of academic Japanese • Autonomous mode for learning vocabulary will be necessary particularly when the learners’ needs and language backgrounds are various 8. Conclusion Limitations of the word lists • Less valid in narrower domain words (2D/1D words) and less reliable in higher frequency levels Need refining by more complete academic corpus • Multi-word units not extracted • Not sensitive to different usages in different domains (polysemy) Remaining issues • Many transparent compounds in Japanese What is Kanji tier? How is it related to word tier? Download sites for VDRJ/JAWL Matsushita Laboratory for Language Learning http://www.wa.commufa.jp/~tatsum/English%20top_T atsu.html (Interface: English) Google it with “matsushita” and “language” 松下言語学習ラボ http://www.wa.commufa.jp/~tatsum/index.html (Interface: Japanese) Google it with “松下” and “言語” Main findings • VDRJ is useful for designing curriculum (material, tests etc.) • The more domains a words is shared as AW or LAD by, the more • • • • • • • abstract the meaning of the word is. 
• Conversation and non-academic texts contain more general words and LW
• Academic texts: more AW and LAD but fewer LW in any academic domain
• Wikipedia: more proper nouns and low frequency words
• Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD.
• Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts
• Origins of academic and literary words are quite clearly separated; 3/4 of LW originate in Japanese while 3/4 of AW and LAD originate in Chinese
• LAD contains more Western origin words (Gairaigo)

References (1)
• Anthony, L. (2007). AntConc Version 3.2.1 (text analysis tool). http://www.antlab.sci.waseda.ac.jp/software.html (Version 1.0 first published in 2002)
• Anthony, L. (2009). AntWordProfiler Version 1.2 w (word profiler). http://www.antlab.sci.waseda.ac.jp/software.html (Version 1.0 first published in 2008)
• Beck, I. L., McKeown, M. G., & Kucan, L. (2002). Bringing Words to Life: Robust Vocabulary Instruction. Solving problems in the teaching of literacy. New York: Guilford Press.
• Butler, Y. G. (バトラー後藤裕子). (2010). 小中学生のための日本語学習語リスト(試案) (A list of Japanese academic vocabulary for elementary and junior high school students in Japan). 母語・継承語・バイリンガル教育(MHB)研究 (Studies in Mother Tongue, Heritage Language, and Bilingual Education), 6, 42-58.

References (2)
• Chung, T. M. (2003). Identifying technical terms. Unpublished PhD dissertation, Victoria University of Wellington.
• Corson, D. J. (1995). Using English Words. Dordrecht: Kluwer Academic Publishers.
• Corson, D. J. (1997). The learning and use of academic English words. Language Learning, 47(4), 671-718.
• Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
• Coxhead, A., & Hirsh, D. (2007). A pilot science-specific word list. Revue Française de Linguistique Appliquée, 12(2), 65-78.
• Coxhead, A., Stevens, L., & Tinkle, J. (2010).
Why might secondary science textbooks be difficult to read? New Zealand Studies in Applied Linguistics, 16(2), 37-52.

References (3)
• Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61-74.
• Eldridge, J. (2008). No, there isn’t an academic vocabulary but ... TESOL Quarterly, 109-113.
• Hyland, K., & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2), 235-253.
• Hu, M. H., & Nation, P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
• Juilland, A., & Chang-Rodrigues, E. (1964). Frequency Dictionary of Spanish Words. London: Mouton & Co.

References (4)
• Komiya, C. (小宮千鶴子). (1995). 専門日本語教育の専門語 -経済の基本的な専門語の特定を目指して- [Technical terms for teaching technical Japanese: Aiming at identifying basic technical terms for economics]. 日本語教育 [Teaching Japanese as a Foreign Language], 86, 81-92.
• Komori, K. (小森和子), Mikuni, J. (三國純子), & Kondo, A. (近藤安月子). (2004). 文章理解を促進する語彙知識の量的側面 ―既知語率の閾値探索の試み― (What percentage of known words in a text facilitates reading comprehension: a case study for exploration of the threshold of known words coverage). 日本語教育 [Teaching Japanese as a Foreign Language], 125, 83-92.

References (5)
• Matsushita, T. (松下達彦). (2010). What words are essential to read Japanese? Making word lists from a large corpus of books and internet forum sites [日本語を読むために必要な語彙とは? -書籍とインターネットの大規模コーパスに基づく語彙リストの作成-]. Proceedings for the Conference of the Society for Teaching Japanese as a Foreign Language, Spring 2010 [2010年度日本語教育学会春季大会予稿集], 335-336.
• Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Database for Reading Japanese). Downloaded from http://www.geocities.jp/tatsum2003/, 22 May 2011.
• Nation, I. S. P. (2004). A study of the most frequent word families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a Second Language: Selection, Acquisition, and Testing (pp. 3-13). Amsterdam: John Benjamins.
References (6)
• Nation, I. S. P. (2011). Making and using word lists. In I. S. P. Nation & S. Webb (Eds.), Researching and analysing vocabulary. Boston: Heinle Cengage Learning.
• Oka, M. (岡 益巳). (1992). 非漢字圏の留学生のための日本経済基本用語表 [Basic terms of the Japanese economy for non-Kanji background students]. 岡山大学経済学会雑誌 (Okayama Economic Review), 23(4), 191-229.

References (7)
• Tajino, A., Terauchi, H., Sasao, Y., & Maswana, S. (田地野 彰・寺内 一・笹尾洋介・マスワナ紗矢子). (2007). 総合研究大学における英語学術語彙リスト開発の意義 -EAPカリキュラム開発の観点から- (The development of academic words lists at a multi-disciplinary university in Japan: A fundamental step in EAP curriculum design). 京都大学高等教育研究 (Kyoto University Researches in Higher Education), 13.
• Tajino, A., Dalsky, D., & Sasao, Y. (2009). Academic vocabulary reconsidered: An EAP curriculum-design perspective. Journal of Teaching English as a Foreign Language and Literature, 1(4), 3-21.

References (8)
• Tamamura, F. (玉村文郎). (1987). 日本語教育基本2570語 [Basic 2570 words for teaching Japanese as a second language]. 日本語の語彙・意味(2) [Japanese Vocabulary and Meaning], NAFL Institute 日本語教師養成通信講座 [Training Course of Teachers of Japanese as a Second Language]. アルク (Alc).
• Townsend, D., & Collins, P. (2008). Academic vocabulary and middle school English learners: an intervention study. Reading and Writing, 22(9), 993-1019. doi:10.1007/s11145-008-9141-y
• Ward, J. (1999). How large a vocabulary do EAP Engineering students need? Reading in a Foreign Language, 12(2), 309-323.
• West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.
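
Supplementary sketch for section 6 (word origin by word tier)
The findings in section 6 are read off the word-origin table, so a short script may help readers check them. The sketch below is an illustration only, not part of VDRJ or its tooling: the names ORIGINS, TIERS, dominant_origin and row_sum are hypothetical, and the figures are simply copied from the table (Matsushita, 2011).

```python
# Illustrative only: names below are hypothetical, not part of VDRJ.
# Percentages of unique lexemes by word origin for each word tier,
# copied from the table in section 6 (Matsushita, 2011).
ORIGINS = ["Japanese", "Chinese", "Western & Other", "Mixed",
           "Proper Nouns", "Unknown & Signs"]

TIERS = {
    "General":                 [38.4, 45.3, 10.8, 3.2, 1.5, 0.8],
    "Academic":                [15.0, 75.2, 7.0, 1.9, 0.4, 0.5],
    "Limited-academic-domain": [12.4, 69.1, 13.7, 1.7, 2.2, 1.0],
    "Literary":                [71.7, 21.8, 2.5, 3.1, 0.3, 0.6],
    "Overlap":                 [74.1, 22.2, 0.0, 3.7, 0.0, 0.0],
    "Total":                   [34.7, 50.3, 10.0, 2.8, 1.4, 0.8],
}


def dominant_origin(tier):
    """Return (origin, percentage) for the largest origin share in a tier."""
    shares = TIERS[tier]
    best = max(range(len(shares)), key=shares.__getitem__)
    return ORIGINS[best], shares[best]


def row_sum(tier):
    """Sum a tier's shares; rounding in the source may give 100.0 +/- 0.1."""
    return round(sum(TIERS[tier]), 1)


if __name__ == "__main__":
    for tier in TIERS:
        origin, share = dominant_origin(tier)
        print(f"{tier}: {origin} {share}% (row sum {row_sum(tier)})")
```

Run as a script, this prints the dominant origin for each tier: Chinese for the Academic and Limited-academic-domain rows and Japanese for the Literary row, in line with the findings above. The row sums are a quick consistency check on the reconstructed table.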