Assignment 3 NA3C0005-楊馨琦(Belle)

1. Analyze the five LiveDVD film subtitles in terms of the GSL 1st 1,000, GSL 2nd 1,000 and AWL 570 word lists.

THE DEVIL WEARS PRADA
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   8850   81.39      81.39   858   48.5      48.5
2      2_gsl_2nd_1000.txt    560    5.15      86.54   232  13.11     61.61
3      3_awl_570.txt         186    1.71      88.25   107   6.05     67.66
0      Non-Level List       1277   11.74      99.99   572  32.33     99.99
TOTAL:                     10873                     1769

The Entitled
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   5643   83.91      83.91   602  61.55     61.55
2      2_gsl_2nd_1000.txt    234    3.48      87.39   123  12.58     74.13
3      3_awl_570.txt          35    0.52      87.91    23   2.35     76.48
0      Non-Level List        813   12.09        100   230  23.52       100
TOTAL:                      6725                      978

Beyond the Blackboard
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   8122   84.26      84.26   803     56        56
2      2_gsl_2nd_1000.txt    502    5.21      89.47   217  15.13     71.13
3      3_awl_570.txt          79    0.82      90.29    50   3.49     74.62
0      Non-Level List        936    9.71        100   364  25.38       100
TOTAL:                      9639                     1434

A Christmas Carol
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   5803   81.54      81.54   715  57.57     57.57
2      2_gsl_2nd_1000.txt    546    7.67      89.21   201  16.18     73.75
3      3_awl_570.txt          67    0.94      90.15    24   1.93     75.68
0      Non-Level List        701    9.85        100   302  24.32       100
TOTAL:                      7117                     1242

November Christmas
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   6047   79.33      79.33   704  56.19     56.19
2      2_gsl_2nd_1000.txt    440    5.77       85.1   192  15.32     71.51
3      3_awl_570.txt          50    0.66      85.76    31   2.47     73.98
0      Non-Level List       1086   14.25     100.01   326  26.02       100
TOTAL:                      7623                     1253

The five LiveDVD films
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt  34465    82.1       82.1  1514  38.67     38.67
2      2_gsl_2nd_1000.txt   2282    5.44      87.54   665  16.99     55.66
3      3_awl_570.txt         417    0.99      88.53   188    4.8     60.46
0      Non-Level List       4813   11.47        100  1548  39.54       100
TOTAL:                     41977                     3915

2. Analyze the five LiveDVD film subtitles in terms of the NGSL and NAWL word lists.

THE DEVIL WEARS PRADA
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   7668   70.52      70.52   815  46.07     46.07
2      NAWL_Headwords.txt               40    0.37      70.89    30    1.7     47.77
0      Non-Level List                 3165   29.11        100   924  52.23       100
TOTAL:                               10873                     1769

The Entitled
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   4642   69.03      69.03   516  52.76     52.76
2      NAWL_Headwords.txt                9    0.13      69.16     6   0.61     53.37
0      Non-Level List                 2074   30.84        100   456  46.63       100
TOTAL:                                6725                      978

Beyond the Blackboard
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   6839   70.95      70.95   713  49.72     49.72
2      NAWL_Headwords.txt               49    0.51      71.46    29   2.02     51.74
0      Non-Level List                 2751   28.54        100   692  48.26       100
TOTAL:                                9639                     1434

A Christmas Carol
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   5184   72.84      72.84   614  49.44     49.44
2      NAWL_Headwords.txt               38    0.53      73.37    17   1.37     50.81
0      Non-Level List                 1895   26.63        100   611  49.19       100
TOTAL:                                7117                     1242

November Christmas
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   4954   64.99      64.99   577  46.05     46.05
2      NAWL_Headwords.txt               27    0.35      65.34    18   1.44     47.49
0      Non-Level List                 2642   34.66        100   658  52.51       100
TOTAL:                                7623                     1253

The five LiveDVD films
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt  29287   69.77      69.77  1396  35.66     35.66
2      NAWL_Headwords.txt              163    0.39      70.16    88   2.25     37.91
0      Non-Level List                12527   29.84        100  2431  62.09       100
TOTAL:                               41977                     3915

3. Tokens are the total number of word forms; types are the total number of different word forms.
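The token/type distinction and the per-level counts can be illustrated with a small Python sketch. This is only a minimal illustration under stated assumptions: the word lists here are tiny, made-up sets (not the real GSL or AWL files), words are matched by exact lowercase form rather than by word family, and the names tokenize and lexical_profile are hypothetical, not part of any profiling tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word forms (tokens)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexical_profile(text, levels):
    """Count tokens and types per level; unmatched words go to 'non_level'."""
    counts = Counter(tokenize(text))  # word form -> number of occurrences
    profile = {name: {"tokens": 0, "types": 0} for name in levels}
    profile["non_level"] = {"tokens": 0, "types": 0}
    for word, freq in counts.items():
        # Assign the word to the first level whose list contains it.
        level = next((name for name, words in levels.items() if word in words),
                     "non_level")
        profile[level]["tokens"] += freq  # every occurrence is a token
        profile[level]["types"] += 1      # each distinct form is one type
    return profile

# Toy word lists: hypothetical and far smaller than the real list files.
levels = {
    "level_1": {"the", "is", "a", "and", "in"},
    "level_2": {"coat"},
}
sample = "The coat is in the closet, and the closet is a mess."
profile = lexical_profile(sample, levels)
total_tokens = sum(row["tokens"] for row in profile.values())
for name, row in profile.items():
    share = round(100 * row["tokens"] / total_tokens, 2)
    print(name, row["tokens"], row["types"], share)
```

In the sample sentence, "the" occurs three times, so it contributes three tokens but only one type.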
The first part analyzes the five LiveDVD film subtitles in terms of the GSL 1st 1,000, GSL 2nd 1,000 and AWL 570 word lists. The second part analyzes the same subtitles in terms of the NGSL and NAWL word lists. All ratios below are token ratios.

The first film is THE DEVIL WEARS PRADA. The ratio of GSL 1st 1,000 is 81.39%; the ratio of GSL 2nd 1,000 is 5.15%; the ratio of AWL 570 is 1.71%; the ratio of NGSL is 70.52%; the ratio of NAWL is 0.37%.

The second film is The Entitled. The ratio of GSL 1st 1,000 is 83.91%; the ratio of GSL 2nd 1,000 is 3.48%; the ratio of AWL 570 is 0.52%; the ratio of NGSL is 69.03%; the ratio of NAWL is 0.13%.

The third film is Beyond the Blackboard. The ratio of GSL 1st 1,000 is 84.26%; the ratio of GSL 2nd 1,000 is 5.21%; the ratio of AWL 570 is 0.82%; the ratio of NGSL is 70.95%; the ratio of NAWL is 0.51%.

The fourth film is A Christmas Carol. The ratio of GSL 1st 1,000 is 81.54%; the ratio of GSL 2nd 1,000 is 7.67%; the ratio of AWL 570 is 0.94%; the ratio of NGSL is 72.84%; the ratio of NAWL is 0.53%.

The fifth film is November Christmas. The ratio of GSL 1st 1,000 is 79.33%; the ratio of GSL 2nd 1,000 is 5.77%; the ratio of AWL 570 is 0.66%; the ratio of NGSL is 64.99%; the ratio of NAWL is 0.35%.

For the five films combined, the ratio of GSL 1st 1,000 is 82.1%; the ratio of GSL 2nd 1,000 is 5.44%; the ratio of AWL 570 is 0.99%; the ratio of NGSL is 69.77%; the ratio of NAWL is 0.39%.

Evidently, the ratio of GSL 1st 1,000 is about 79~84%, and the accumulated ratio of GSL 1st 1,000 and GSL 2nd 1,000 is about 85~89%. This is higher than the ratio of NGSL, which is about 65~73%. It is an interesting finding. GSL was published in 1953 and NGSL was published in 2013. There are 1963 word families in GSL and 2368 in NGSL, and NGSL contains many more up-to-date word families.
However, the comparison results point the opposite way. Even the GSL 1st 1,000 alone covers over 10 percentage points more of the tokens than NGSL for the five films combined (82.1% versus 69.77%). Does NGSL cover so many topics that the list is too broad and its words are not in such general use? This needs a more thorough survey. Otherwise, NGSL is not a good substitute for GSL.

The ratio of AWL is about 0.52~1.71% and the ratio of NAWL is about 0.13~0.53%. Likewise, the ratio of NAWL is a little lower than the ratio of AWL. AWL and NAWL are academic word lists, but our data are the subtitles of five films. Conversations in films are usually colloquial, casual and non-academic, so the low ratios of AWL and NAWL are to be expected.

4. For this part I used my own data, collected in class while the professor was lecturing. The course is Nanomaterials and Synthesis, in chemical engineering. I used only one professor's lecture, and only about one-fifth of that class session was transcribed. The tables below show the comparison results with GSL and AWL, and with NGSL and NAWL.

LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   1487   78.72      78.72   249   66.4      66.4
2      2_gsl_2nd_1000.txt    132    6.99      85.71    32   8.53     74.93
3      3_awl_570.txt          88    4.66      90.37    30      8     82.93
0      Non-Level List        182    9.63        100    64  17.07       100
TOTAL:                      1889                      375

LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   1456   77.08      77.08   252   67.2      67.2
2      NAWL_Headwords.txt               87    4.61      81.69    18    4.8        72
0      Non-Level List                  346   18.32     100.01   105     28       100
TOTAL:                                1889                      375

In the first table, level 1 is GSL first 1000 and its ratio is 78.72%; level 2 is GSL second 1000 and its ratio is 6.99%; level 3 is AWL and its ratio is 4.66%; level 0 is the words outside these lists and its ratio is 9.63%. In the second table, level 1 is NGSL and its ratio is 77.08%; level 2 is NAWL and its ratio is 4.61%; the ratio of level 0 is 18.32%.
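The TOKEN% and CUMTOKEN% columns can be recomputed from the raw token counts. The sketch below is a minimal illustration in Python. It assumes that each percentage is rounded to two decimals before the cumulative column is summed; that is my assumption about how the tables were produced, not something the tables document, although it would account for the 100.01 total in the NGSL/NAWL table. The function name coverage_table is hypothetical.

```python
def coverage_table(token_counts):
    """Return (level, tokens, token %, cumulative token %) rows.

    Each percentage is rounded to two decimals first, and the cumulative
    column adds up the rounded values (an assumption, see the lead-in).
    """
    total = sum(count for _, count in token_counts)
    rows, cumulative = [], 0.0
    for level, count in token_counts:
        percent = round(100 * count / total, 2)
        cumulative = round(cumulative + percent, 2)
        rows.append((level, count, percent, cumulative))
    return rows

# Token counts copied from the NGSL/NAWL table for the lecture transcript.
lecture_ngsl = [("NGSL", 1456), ("NAWL", 87), ("Non-Level", 346)]
rows = coverage_table(lecture_ngsl)
for row in rows:
    print(row)
```

Under this assumption the three rounded percentages are 77.08, 4.61 and 18.32, which add up to 100.01 rather than 100, so a cumulative total slightly off 100 is a rounding effect and not a counting error.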
The ratio of GSL first 1000 (78.72%) is almost the same as the ratio of NGSL (77.08%), even though there are 2368 word families in NGSL. The accumulated ratio of GSL first 1000 and second 1000 is 85.71%, which is much higher than the ratio of NGSL. The researchers claimed that NGSL collects more contemporary words, but its ratio here does not bear that out. Similarly, the ratio of AWL (4.66%) is almost the same as the ratio of NAWL (4.61%). There are 570 word families in AWL and 963 words in NAWL, so the benchmarks are slightly different. Nevertheless, the data contain only a low ratio of academic words.
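The comparisons made in parts 3 and 4 come down to simple differences between token ratios. As a rough sketch, the Python snippet below takes the token percentages reported in the tables above and computes, for each text, how many percentage points GSL (1st and 2nd 1,000 together) exceeds NGSL, and how many AWL exceeds NAWL. The function name coverage_gaps and the layout of the data are my own choices for illustration.

```python
# Token coverage (%) copied from the tables in this assignment.
# Columns: GSL 1st 1,000, GSL 2nd 1,000, AWL 570, NGSL, NAWL.
coverage = {
    "The Devil Wears Prada": (81.39, 5.15, 1.71, 70.52, 0.37),
    "The Entitled": (83.91, 3.48, 0.52, 69.03, 0.13),
    "Beyond the Blackboard": (84.26, 5.21, 0.82, 70.95, 0.51),
    "A Christmas Carol": (81.54, 7.67, 0.94, 72.84, 0.53),
    "November Christmas": (79.33, 5.77, 0.66, 64.99, 0.35),
    "Five films combined": (82.10, 5.44, 0.99, 69.77, 0.39),
    "Lecture transcript": (78.72, 6.99, 4.66, 77.08, 4.61),
}

def coverage_gaps(row):
    """Return (GSL 1st+2nd minus NGSL, AWL minus NAWL) in percentage points."""
    gsl_1st, gsl_2nd, awl, ngsl, nawl = row
    return round(gsl_1st + gsl_2nd - ngsl, 2), round(awl - nawl, 2)

gaps = {name: coverage_gaps(row) for name, row in coverage.items()}
for name, (general_gap, academic_gap) in gaps.items():
    print(f"{name:24s} GSL-NGSL: {general_gap:6.2f}  AWL-NAWL: {academic_gap:5.2f}")
```

With these figures the GSL-over-NGSL gap is 17.77 points for the five films combined and 8.63 points for the lecture transcript, while the AWL-over-NAWL gap shrinks from 0.60 points to 0.05 points.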