CMU-Statistical Language Modeling & SRILM Toolkits

ASHWIN ACHARYA
ECE 5527
SEARCH AND DECODING
INSTRUCTOR: DR. V. KËPUSKA
Objective
 Use the SRILM and CMU (Carnegie Mellon University) toolkits to build different language models.
 Use the four discounting strategies to build the language models.
 Perform perplexity tests with both LM toolkits to study each one's performance.
CMU-SLM Toolkit
 The CMU SLM Toolkit is a set of Unix software tools
created to aid statistical language modeling.
 The key tools provided are used to process textual data into:
   - Vocabularies and word frequency lists
   - Bigram and trigram counts
   - N-gram related statistics
   - Various N-gram language models
 Once the language models are created, they can be used to compute:
   - Perplexity
   - Out-of-vocabulary rate
   - N-gram hit ratios
   - Back-off distributions
SRILM Toolkit
 The SRILM toolkit is used for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
 SRILM consists of the following components:
   - A set of C++ class libraries implementing language models, supporting data structures, and miscellaneous utility functions.
   - A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, and tagging or segmenting text.
   - A collection of miscellaneous scripts facilitating minor related tasks.
 SRILM runs on UNIX platforms.
Toolkit Environment
 The two toolkits run in a Unix environment such as Linux.
 However, Cygwin works well and can be used to simulate this Unix environment on a Windows platform.
 Download Cygwin for free from the following link:
http://www.cygwin.com/
Cygwin Install
 Download the Cygwin installation file
 Execute setup.exe
 Choose “Install from Internet”
 Select the root install directory “C:\cygwin”
 Choose a download site (mirror)
Cygwin Install
 Make sure all of the following packages are selected during the install process:
   - gcc: the compiler
   - GNU make: a utility that automatically determines which pieces of a large program need to be recompiled and issues the commands to recompile them
   - TCL toolkit: tool command language
   - Tcsh: TENEX C shell
   - gzip: to read/write compressed files
   - GNU awk (gawk): to interpret many of the utility scripts
   - Binutils: GNU assembler, linker, and binary utilities
SRILM Install
 Download the SRILM toolkit, srilm.tgz
 Run Cygwin
 Unzip srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz
 Add the following lines to the Makefile to set the install directory and machine type:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
 In Cygwin, type the following command to install SRILM:
$ make World
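 If the build succeeds, the binaries land in a machine-specific directory, bin/cygwin, under the SRILM root. As a quick sanity check (paths assume the install location above; the test target is only present if the release ships its test suite):
$ /cygdrive/c/cygwin/srilm/bin/cygwin/ngram-count -help
$ make test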
CMU-SLM Install
 Download the CMU toolkit, CMU-Cam_Toolkit_v2.tgz
 Run Cygwin
 Unzip CMU-Cam_Toolkit_v2.tgz with the following commands:
$ cd /cygdrive/c/cygwin/CMU
$ tar zxvf CMU-Cam_Toolkit_v2.tgz
 In Cygwin, type the following command to install the CMU toolkit:
$ make all
Training Corpus
 Use the corpus provided by the Natural Language Toolkit (NLTK) website:
http://en.sourceforge.jp/projects/sfnet_nltk/downloads/nltklite/0.9/nltk-data-0.9.zip/
 It contains a lexicon and two different corpora from the Australian Broadcasting Commission:
   - The first is a rural news corpus
   - The second is a science news corpus
SRILM Procedure
Generate the 3-gram count file with the following command:
$ ./ngram-count -vocab abc/en.vocab \
    -text abc/rural.txt \
    -order 3 \
    -write abc/rural.count \
    -unk
Count File
 ngram-count: counts N-grams and estimates language models
 -vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts or text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
 -text textfile
Generate N-gram counts from textfile. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
 -order n
Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
 -write file
Write total counts to file.
 -unk
Build an “open vocabulary” LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
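 For orientation: the count file written by -write is plain ASCII, one N-gram per line followed by its integer count (see the -read description on the next slide). The words and counts below are made up for illustration:
<s> drought 45
<s> drought relief 7
drought relief funding 3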
SRILM Procedure
Creating the language model:
$ ./ngram-count -read abc/rural.count \
    -order 3 \
    -lm abc/rural.lm
 -read countsfile
Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
 -lm lmfile
Estimate a backoff N-gram model from the total counts, and write it to lmfile.
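 The lmfile produced here uses the standard ARPA backoff format: a \data\ header giving the number of N-grams per order, then one section per order listing a log10 probability, the N-gram, and (for orders below the maximum) a log10 backoff weight. A schematic excerpt, with all names and values made up:
\data\
ngram 1=1251
ngram 2=8742
ngram 3=15063

\1-grams:
-2.8631 drought -0.5410
...
\end\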
Discounting
 Good-Turing Discounting
$ ./ngram-count -read abc/rural.count \
    -order 3 \
    -lm abc/gtrural.lm \
    -gt1min 1 -gt1max 3 \
    -gt2min 1 -gt2max 3 \
    -gt3min 1 -gt3max 3
 -gtnmin count
where n is 1, 2, ..., 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
 -gtnmax count
where n is 1, 2, ..., 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
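 For background (standard Good-Turing, not stated on the slide): Good-Turing replaces a raw count $c$ by the discounted count $c^* = (c+1)\,N_{c+1}/N_c$, where $N_c$ is the number of distinct N-grams occurring exactly $c$ times; -gtnmin and -gtnmax bound the range of counts to which this re-estimate is applied.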
Discounting
 Absolute Discounting
$ ./ngram-count -read abc/rural.count \
    -order 3 \
    -lm abc/cdrural.lm \
    -cdiscount1 0.5 \
    -cdiscount2 0.5 \
    -cdiscount3 0.5
 -cdiscountn discount
where n is 1, 2, ..., 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
 Witten-Bell Discounting
$ ./ngram-count -read abc/rural.count \
    -order 3 \
    -lm abc/wbrural.lm \
    -wbdiscount1 \
    -wbdiscount2 \
    -wbdiscount3
 -wbdiscountn
where n is 1, 2, ..., 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the “unseen” event.)
Discounting
 Kneser-Ney Discounting
$ ./ngram-count -read abc/rural.count \
    -order 3 \
    -lm abc/knrural.lm \
    -kndiscount1 \
    -kndiscount2 \
    -kndiscount3
 -kndiscountn
where n is 1, 2, ..., 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.
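 For background, the interpolated (unmodified) Kneser-Ney estimate for bigrams has the form
$P_{KN}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - D,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)$
where $P_{cont}(w_i)$ is proportional to the number of distinct contexts in which $w_i$ appears. Chen and Goodman's modified variant, selected by -kndiscount, uses three separate discounts $D_1$, $D_2$, and $D_{3+}$ depending on the N-gram count.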
Perplexity
 Perplexity is the common evaluation metric for N-gram language models. It is relatively cheap compared with end-to-end evaluation, which requires testing the recognition results of the language models. Perplexity is defined as follows:
PPW   Pw1w2  wN 

Normalized by
number of words.
N
1
N
1
Pw1w2  wN 
probability of
the test set
 A better language model assigns a higher probability to the test data, which lowers the perplexity. As a result, a lower perplexity means a higher probability of the test set according to that specific language model.
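 For example, if a model assigns a three-word test string the total probability $P(w_1 w_2 w_3) = 1/125$, then $PP(W) = (1/125)^{-1/3} = 5$: on average the model is as uncertain as a uniform choice among five words.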
Test Perplexity
Randomly choose three news articles from the Internet as test data.
Commands for the four different 3-gram language models (a loop covering all the tests is sketched below):
$ ./ngram -ppl abc/test1.txt -order 3 -lm abc/gtrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm abc/cdrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm abc/wbrural.lm
$ ./ngram -ppl abc/test1.txt -order 3 -lm abc/knrural.lm
Repeat these tests for all three test files with the other language models.
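Since the same command is repeated for every model and test file, a small shell loop (file names as above) covers the whole grid:
$ for t in test1 test2 test3; do
>   for lm in gtrural cdrural wbrural knrural; do
>     ./ngram -ppl abc/$t.txt -order 3 -lm abc/$lm.lm
>   done
> done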
Results of Perplexity Test
[Screenshot: SRILM perplexity test of test2.txt against gtrural.lm (rural corpus with Good-Turing discounting)]
Building LM with CMU SLM toolkit
 The following flow chart shows the steps to build a language model using the CMU language model toolkit.
[Flow chart: text2wfreq → wfreq2vocab → text2idngram → idngram2lm → evallm]
Building LM with CMU SLM toolkit
 text2wfreq: compute the word unigram counts.
 wfreq2vocab: convert the word unigram counts into a vocabulary consisting of a preset number of common words.
 text2idngram: convert the text into a binary id N-gram stream, based on the vocabulary.
 idngram2lm: build the language model from the id N-grams.
 evallm: compute the perplexity of the test corpus.
CMU SLM Procedure
 Given a large corpus of text in a file a.text, but no specified vocabulary:
 Compute the word unigram counts:
$ cat a.text | text2wfreq > a.wfreq
 Convert the word unigram counts into a vocabulary consisting of the 20,000 most common words:
$ cat a.wfreq | wfreq2vocab -top 20000 > a.vocab
 Generate a binary id 3-gram of the training text, based on this vocabulary:
$ cat a.text | text2idngram -vocab a.vocab > a.idngram
 Convert the idngram into a binary format language model:
$ idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm
 Compute the perplexity of the language model with respect to some test text b.text:
$ evallm -binary a.binlm
Reading in language model from file a.binlm
Done.
evallm : perplexity -text b.text
Computing perplexity of the language model with respect
to the text b.text
Perplexity = 128.15, Entropy = 7.00 bits
Computation based on 8842804 words.
Number of 3-grams hit = 6806674 (76.97%)
Number of 2-grams hit = 1766798 (19.98%)
Number of 1-grams hit = 269332 (3.05%)
1218322 OOVs (12.11%) and 576763 context cues were removed from the calculation.
evallm : quit
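Putting the steps together, the whole CMU build for one corpus can be run as a short script (file names follow the a.text example above):
#!/bin/sh
# vocabulary of the 20,000 most common words, from unigram counts
cat a.text | text2wfreq | wfreq2vocab -top 20000 > a.vocab
# binary id 3-gram stream over that vocabulary
cat a.text | text2idngram -vocab a.vocab > a.idngram
# estimate the binary-format language model
idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm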
Perplexity Test Results
[Screenshot: CMU perplexity test of test1.txt against ruralgt.lm (rural corpus with Good-Turing discounting)]
Perplexity Results
Rural Corpus
Test     Good-Turing         Absolute            Witten-Bell         Kneser-Ney
         SRILM    CMU-LM     SRILM    CMU-LM     SRILM    CMU-LM     SRILM    CMU-LM
test 1   383.84   978.88     411.84   1011.08    390.46   1156.11    391.48   1123.38
test 2   356.32   666.30     352.89   680.38     344.05   796.11     343.51   784.25
test 3   895.60   1656.68    910.79   1699.01    834.36   2039.16    726.24   1835.67

Science Corpus
Test     Good-Turing         Absolute            Witten-Bell         Kneser-Ney
         SRILM    CMU-LM     SRILM    CMU-LM     SRILM    CMU-LM     SRILM    CMU-LM
test 1   471.83   811.02     475.37   821.47     451.80   976.16     483.13   933.58
test 2   83.82    87.30      68.14    87.14      72.08    78.07      48.77    108.75
test 3   766.38   1501.31    836.04   1512.02    796.51   1885.12    733.17   1743.37
Results
[Bar chart: perplexity of test 1, test 2, and test 3 for SRILM and CMU-LM under Good-Turing, Absolute, Witten-Bell, and Kneser-Ney discounting.]
Perplexity tests based on the language model from the Rural News corpus
Results
[Bar chart: perplexity of test 1, test 2, and test 3 for SRILM and CMU-LM under Good-Turing, Absolute, Witten-Bell, and Kneser-Ney discounting.]
Perplexity tests based on the language model from the Science News corpus
Conclusion
 It can clearly be seen that the SRILM toolkit performed better than the CMU toolkit in terms of the perplexity tests.
 We can also notice that, for specific test corpora, certain discounting strategies were better suited than others. For instance, Kneser-Ney discounting seemed to perform relatively better than the others on most of the tests.
 The CMU language models' perplexity results seemed to be a little on the high side; a fix could be to adjust the discounting ranges.
 Results could have been improved by using some form of word tokenization.
References
 SRI International, “The SRI Language Modeling Toolkit”, http://www.speech.sri.com/projects/srilm/
 Cygwin Information and Installation, “Installing and Updating Cygwin”, http://www.cygwin.com/
 Natural Language Toolkit, http://nltk.sourceforge.net/index.php/Corpora
 Carnegie Mellon University, “CMU Statistical Language Model Toolkit”, http://www.speech.cs.cmu.edu