Assignment 3
NA3C0005-楊馨琦(Belle)
1. Analyze the five LiveDVD film subtitles in terms of the GSL 1st 1,000,
GSL 2nd 1,000 and AWL 570 word lists.
THE DEVIL WEARS PRADA
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   8850   81.39      81.39   858   48.5      48.5
2      2_gsl_2nd_1000.txt    560    5.15      86.54   232  13.11     61.61
3      3_awl_570.txt         186    1.71      88.25   107   6.05     67.66
0      Non-Level List       1277   11.74      99.99   572  32.33     99.99
TOTAL:                     10873                     1769
The Entitled
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   5643   83.91      83.91   602  61.55     61.55
2      2_gsl_2nd_1000.txt    234    3.48      87.39   123  12.58     74.13
3      3_awl_570.txt          35    0.52      87.91    23   2.35     76.48
0      Non-Level List        813   12.09        100   230  23.52       100
TOTAL:                      6725                      978
Beyond the Blackboard
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   8122   84.26      84.26   803     56        56
2      2_gsl_2nd_1000.txt    502    5.21      89.47   217  15.13     71.13
3      3_awl_570.txt          79    0.82      90.29    50   3.49     74.62
0      Non-Level List        936    9.71        100   364  25.38       100
TOTAL:                      9639                     1434
A Christmas Carol
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   5803   81.54      81.54   715  57.57     57.57
2      2_gsl_2nd_1000.txt    546    7.67      89.21   201  16.18     73.75
3      3_awl_570.txt          67    0.94      90.15    24   1.93     75.68
0      Non-Level List        701    9.85        100   302  24.32       100
TOTAL:                      7117                     1242
November Christmas
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   6047   79.33      79.33   704  56.19     56.19
2      2_gsl_2nd_1000.txt    440    5.77       85.1   192  15.32     71.51
3      3_awl_570.txt          50    0.66      85.76    31   2.47     73.98
0      Non-Level List       1086   14.25     100.01   326  26.02       100
TOTAL:                      7623                     1253
The five LiveDVD films
Lexical Profile Statistics
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt  34465    82.1       82.1  1514  38.67     38.67
2      2_gsl_2nd_1000.txt   2282    5.44      87.54   665  16.99     55.66
3      3_awl_570.txt         417    0.99      88.53   188    4.8     60.46
0      Non-Level List       4813   11.47        100  1548  39.54       100
TOTAL:                     41977                     3915
2. Analyze the five LiveDVD film subtitles in terms of the NGSL and NAWL
word lists.
THE DEVIL WEARS PRADA
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   7668   70.52      70.52   815  46.07     46.07
2      NAWL_Headwords.txt               40    0.37      70.89    30    1.7     47.77
0      Non-Level List                 3165   29.11        100   924  52.23       100
TOTAL:                               10873                     1769
The Entitled
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   4642   69.03      69.03   516  52.76     52.76
2      NAWL_Headwords.txt                9    0.13      69.16     6   0.61     53.37
0      Non-Level List                 2074   30.84        100   456  46.63       100
TOTAL:                                6725                      978
Beyond the Blackboard
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   6839   70.95      70.95   713  49.72     49.72
2      NAWL_Headwords.txt               49    0.51      71.46    29   2.02     51.74
0      Non-Level List                 2751   28.54        100   692  48.26       100
TOTAL:                                9639                     1434
A Christmas Carol
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   5184   72.84      72.84   614  49.44     49.44
2      NAWL_Headwords.txt               38    0.53      73.37    17   1.37     50.81
0      Non-Level List                 1895   26.63        100   611  49.19       100
TOTAL:                                7117                     1242
November Christmas
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   4954   64.99      64.99   577  46.05     46.05
2      NAWL_Headwords.txt               27    0.35      65.34    18   1.44     47.49
0      Non-Level List                 2642   34.66        100   658  52.51       100
TOTAL:                                7623                     1253
The five LiveDVD films
Lexical Profile Statistics
LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt  29287   69.77      69.77  1396  35.66     35.66
2      NAWL_Headwords.txt              163    0.39      70.16    88   2.25     37.91
0      Non-Level List                12527   29.84        100  2431  62.09       100
TOTAL:                               41977                     3915
3. Tokens are the total number of word forms in a text; types are the number of
different word forms. Part 1 analyzes the five LiveDVD film subtitles in terms of
the GSL 1st 1,000, GSL 2nd 1,000 and AWL 570 word lists. Part 2 analyzes the same
subtitles in terms of the NGSL and NAWL word lists. All ratios below are token ratios.
The first film is THE DEVIL WEARS PRADA: the ratio of GSL 1st 1,000 is 81.39%; GSL
2nd 1,000 is 5.15%; AWL 570 is 1.71%; NGSL is 70.52%; NAWL is 0.37%.
The second film is The Entitled: the ratio of GSL 1st 1,000 is 83.91%; GSL 2nd 1,000
is 3.48%; AWL 570 is 0.52%; NGSL is 69.03%; NAWL is 0.13%.
The third film is Beyond the Blackboard: the ratio of GSL 1st 1,000 is 84.26%; GSL
2nd 1,000 is 5.21%; AWL 570 is 0.82%; NGSL is 70.95%; NAWL is 0.51%.
The fourth film is A Christmas Carol: the ratio of GSL 1st 1,000 is 81.54%; GSL 2nd
1,000 is 7.67%; AWL 570 is 0.94%; NGSL is 72.84%; NAWL is 0.53%.
The fifth film is November Christmas: the ratio of GSL 1st 1,000 is 79.33%; GSL 2nd
1,000 is 5.77%; AWL 570 is 0.66%; NGSL is 64.99%; NAWL is 0.35%.
For the five films taken together, the ratio of GSL 1st 1,000 is 82.1%; GSL 2nd 1,000
is 5.44%; AWL 570 is 0.99%; NGSL is 69.77%; NAWL is 0.39%.
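The token and type ratios reported here come from a lexical profiler. As a rough illustration of what such a tool counts, the following is a minimal sketch, assuming each word list is a plain set of word forms and that a simple regular expression is good enough for tokenizing. The function `lexical_profile` and the toy lists are hypothetical; the profiler used for this assignment may tokenize differently and may match whole word families rather than exact forms.

```python
import re
from collections import Counter


def lexical_profile(text, levels):
    """Count tokens and types per word-list level.

    levels: ordered list of (level_name, set_of_word_forms).
    Words found in no list are counted under "Non-Level List".
    """
    # Assumed tokenizer: lowercase alphabetic strings, optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    freq = Counter(tokens)  # type -> number of tokens of that type

    rows = {name: {"token": 0, "type": 0} for name, _ in levels}
    rows["Non-Level List"] = {"token": 0, "type": 0}
    for word, n in freq.items():
        # The first list that contains the word decides its level.
        level = next(
            (name for name, words in levels if word in words),
            "Non-Level List",
        )
        rows[level]["token"] += n
        rows[level]["type"] += 1

    # Percentages are relative to all tokens and all types in the text.
    for row in rows.values():
        row["token_pct"] = round(100 * row["token"] / len(tokens), 2)
        row["type_pct"] = round(100 * row["type"] / len(freq), 2)
    return rows


# Toy lists standing in for the real GSL/AWL files (illustrative only).
toy_levels = [("level 1", {"the", "is", "a"}), ("level 2", {"fashion"})]
profile = lexical_profile("The boss is the boss. Fashion is a job.", toy_levels)
for name, row in profile.items():
    print(name, row)
```

In this toy text "the" and "boss" each occur twice, so they add two tokens but only one type each, which is why type ratios are always computed on a smaller total than token ratios.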
The token ratio of the GSL 1st 1,000 is about 79~84%, and the cumulative ratio of
the GSL 1st 1,000 and GSL 2nd 1,000 is about 85~89%. Both are higher than the ratio
of the NGSL, which is about 65~73%. This is an interesting finding. The GSL was
published in 1953 and the NGSL in 2013. There are 1,963 word families in the GSL
and 2,368 in the NGSL, and the NGSL contains many more up-to-date word families.
However, the comparison results point the other way. Across the five films, the GSL
1st 1,000 alone covers 82.1% of the tokens, about 12 percentage points more than the
NGSL (69.77%), and the two GSL lists together cover 87.54%, about 18 points more.
Does the NGSL span so many topics that the list is too broad and its words are not in
such general use? This needs a more thorough survey. Until then, these results
suggest that the NGSL is not a good substitute for the GSL.
The ratio of the AWL is about 0.52~1.71% and the ratio of the NAWL is about 0.13~0.53%.
Likewise, the ratio of the NAWL is a little lower than that of the AWL. The AWL and
NAWL are academic word lists, whereas our data are the subtitles of five films.
Conversations in films usually contain local, casual and non-academic language, so
the low ratios of the AWL and NAWL are to be expected.
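The size of the gap between the GSL and NGSL coverage can be checked directly from the TOKEN% columns of the tables in parts 1 and 2. The sketch below only repeats that subtraction; the names `token_pct` and `gaps` are illustrative, and the figures are copied from the tables rather than recomputed from the subtitles.

```python
# TOKEN% values copied from the tables:
# (GSL 1st 1,000, GSL 2nd 1,000, NGSL)
token_pct = {
    "THE DEVIL WEARS PRADA": (81.39, 5.15, 70.52),
    "The Entitled": (83.91, 3.48, 69.03),
    "Beyond the Blackboard": (84.26, 5.21, 70.95),
    "A Christmas Carol": (81.54, 7.67, 72.84),
    "November Christmas": (79.33, 5.77, 64.99),
    "The five LiveDVD films": (82.1, 5.44, 69.77),
}

gaps = {}
for film, (gsl1, gsl2, ngsl) in token_pct.items():
    gsl_cum = round(gsl1 + gsl2, 2)  # cumulative GSL coverage
    gaps[film] = (
        round(gsl1 - ngsl, 2),     # GSL 1st 1,000 minus NGSL
        round(gsl_cum - ngsl, 2),  # GSL 1st + 2nd 1,000 minus NGSL
    )
    print(f"{film}: GSL1-NGSL = {gaps[film][0]}, GSL1+2-NGSL = {gaps[film][1]}")
```

The differences are in percentage points of token coverage, so they compare how much of the running text each list accounts for, not how many words the lists share.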
4. For this part I used my own data, collected in class while the professor was
lecturing. The course is Nanomaterials and Synthesis in chemical engineering. I used
only one professor's lecture, and only about one-fifth of the class session was
transcribed. The tables below show the comparison results with the GSL and AWL, and
with the NGSL and NAWL.
LEVEL  FILE                TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      1_gsl_1st_1000.txt   1487   78.72      78.72   249   66.4      66.4
2      2_gsl_2nd_1000.txt    132    6.99      85.71    32   8.53     74.93
3      3_awl_570.txt          88    4.66      90.37    30      8     82.93
0      Non-Level List        182    9.63        100    64  17.07       100
TOTAL:                      1889                      375

LEVEL  FILE                          TOKEN  TOKEN%  CUMTOKEN%  TYPE  TYPE%  CUMTYPE%
1      NGSL-Headwords-Only-mvxw.txt   1456   77.08      77.08   252   67.2      67.2
2      NAWL_Headwords.txt               87    4.61      81.69    18    4.8        72
0      Non-Level List                  346   18.32     100.01   105     28       100
TOTAL:                                1889                      375
In the first table, level 1 is the GSL first 1,000, with a token ratio of 78.72%;
level 2 is the GSL second 1,000, with 6.99%; level 3 is the AWL, with 4.66%; level 0
covers the words outside these lists, with 9.63%. In the second table, level 1 is the
NGSL, with a token ratio of 77.08%; level 2 is the NAWL, with 4.61%; level 0 has
18.32%.
The ratio of the GSL first 1,000 (78.72%) is almost the same as the ratio of the NGSL
(77.08%), even though the NGSL contains 2,368 word families. The cumulative ratio of
the GSL first 1,000 and second 1,000 is 85.71%, which is clearly higher than the ratio
of the NGSL. The researchers behind the NGSL assert that it collects more contemporary
words, but its ratio here does not bear that out.
Similarly, the ratio of the AWL (4.66%) is almost the same as the ratio of the NAWL
(4.61%). There are 570 word families in the AWL and 963 words in the NAWL, so the
benchmarks differ a little. Nevertheless, the data contain only a low ratio of
academic words.