Integrating Latent Semantic Analysis and Language Model for Character Prediction in a Binary Response Typing Interface
Seminar on Speech and Language Processing for Augmentative and Alternative Communication
Masoud Rouhizadeh
Introduction
• Most word- and character-prediction systems make use of word-based and/or character-based n-gram language models.
• Some work has enriched such language models with further syntactic or semantic information (Wandmacher et al. 2007, 2008).
• This talk explores the predictive power of Latent Semantic Analysis (LSA) for character prediction in the typing interface developed by Brian Roark (Roark 2009).
Roark's binary switch typing interface
• Binary-switch input
• Static/dynamic grid
• Different language model contributions
• Different scanning modes (a Huffman sketch follows the list):
Row-column
RSVP (rapid serial visual presentation)
Huffman
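As a rough illustration of the Huffman scanning mode, the sketch below builds a Huffman code over a toy character distribution so that more probable characters get shorter binary switch sequences. This is a minimal sketch under assumed toy probabilities, not Roark's actual implementation; the helper name huffman_codes is hypothetical.

import heapq

# Minimal sketch: build a Huffman code over a character distribution
# so more probable characters get shorter binary switch sequences.
def huffman_codes(probs):
    # Heap entries: (probability, tiebreak id, {char: code-so-far}).
    heap = [(p, i, {c: ""}) for i, (c, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Prefix the codes of the two cheapest subtrees with 0 and 1.
        merged = {c: "0" + code for c, code in codes1.items()}
        merged.update({c: "1" + code for c, code in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next_id, merged))
        next_id += 1
    return heap[0][2]

print(huffman_codes({"e": 0.5, "a": 0.3, "q": 0.2}))
# {'e': '0', 'q': '10', 'a': '11'}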
Latent Semantic Analysis (LSA)
• A technique for modeling semantic similarity based on the co-occurrence distributions of words
• LSA is able to relate coherent contexts to specific content words
• Good at predicting the occurrence of a content word in the presence of other thematically related terms
LSA: an example set of documents
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
Preprocessing and tokenizing
• Tokenizing
• Removing ignored characters
• Lowercasing everything
• Removing stop words (a minimal sketch follows)
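A minimal sketch of these preprocessing steps, assuming a simple regex tokenizer and a toy stop-word list (both are illustrative, not the pipeline actually used):

import re

# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "for", "on", "that"}

def preprocess(document):
    # Lowercase and keep only alphabetic tokens (this also drops
    # ignored characters such as digits and punctuation).
    tokens = re.findall(r"[a-z]+", document.lower())
    # Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The Little Book of Value Investing"))
# ['little', 'book', 'value', 'investing']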
Term-by-document matrix
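As a sketch of how this matrix feeds into LSA: count how often each term occurs in each document, then apply a truncated SVD to obtain low-dimensional term vectors. The toy documents and the choice k = 2 below are illustrative assumptions, not the actual setup:

import numpy as np

# Toy preprocessed documents (illustrative subset of the book titles).
docs = [["stock", "market", "investing"],
        ["investing", "dummies"],
        ["value", "investing"],
        ["real", "estate", "investing"]]

# Build the term-by-document count matrix: rows = terms, columns = docs.
vocab = sorted({t for d in docs for t in d})
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for term in doc:
        A[vocab.index(term), j] += 1

# Truncated SVD: keep the k largest singular values (the core of LSA).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # each row is a term in the latent space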
Cosine similarity
• Each word is represented as a vector (its row of the term-by-document matrix)
• A = [0, 0, 1, 1, 0, 0, 0, 0, 0]
• B = [1, 0, 1, 0, 0, 0, 0, 1, 0]
• cos(A, B) = (A · B) / (|A| |B|) = 1 / (√2 · √3) ≈ 0.4082
Integrating LSA and the language model
• LSA is a bag-of-words model and has been shown to predict a word reliably within a context
• Interpolating with a bigram model makes it more sensitive to local context
• Pa is estimated from the cosine similarity of w1 and w2
• Pb is estimated from the bigram probability of w2 given w1
• P(w2 | w1) = λ Pa + (1 − λ) Pb
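A minimal sketch of this interpolation, assuming two hypothetical helpers: lsa_prob, which turns cosine similarities into a normalized probability estimate, and bigram_prob from the n-gram model (neither name comes from the slides):

def interpolated_prob(w1, w2, lam, lsa_prob, bigram_prob):
    # Pa: LSA estimate derived from cosine similarity (assumed to be
    # normalized into a probability); Pb: bigram estimate P(w2 | w1).
    p_a = lsa_prob(w1, w2)
    p_b = bigram_prob(w1, w2)
    # Linear interpolation: P(w2 | w1) = lambda * Pa + (1 - lambda) * Pb.
    return lam * p_a + (1 - lam) * p_b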
From word to character prediction
• In Roark's typing interface we are interested in predicting characters rather than words
• Sort the upcoming words by probability; the ranked character list follows from the words' next characters (see the sketch below)
• Evaluated by RSVP simulation
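A sketch of one plausible reading of this step, inferred from the walkthrough that follows (an assumption, not the confirmed method): each candidate word's probability is added to the probability of the character it would contribute next, given the prefix typed so far.

from collections import defaultdict

def next_char_distribution(word_probs, typed_prefix):
    # Sum each candidate word's probability onto its next character.
    char_probs = defaultdict(float)
    for word, p in word_probs.items():
        if word.startswith(typed_prefix) and len(word) > len(typed_prefix):
            char_probs[word[len(typed_prefix)]] += p
    # Renormalize and sort characters by descending probability.
    total = sum(char_probs.values()) or 1.0
    return {c: p / total
            for c, p in sorted(char_probs.items(), key=lambda kv: -kv[1])}

print(next_char_distribution(
    {"backup": 0.4, "bags": 0.3, "batteries": 0.3}, "ba"))
# {'c': 0.4, 'g': 0.3, 't': 0.3}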
From word to character prediction
Context: computer
Candidate words: association, accessories, arts, architecture, ..., bags, backup, backpack, batteries, backgrounds, brands, ...
Ranked characters: _, e, a, i, c, f, o, n, d, g, ,, t, r, h, m, ., ", s, l, p, b, -, u, ", w, k, j, q, $, y, v, x, z, :, ;
Selected: B
From word to character prediction
Context: computer; typed so far: b
Candidate words: bags, backup, backpack, batteries, backgrounds, brands, brain, ...
Ranked characters: a, r, e, a, i, c, f, ...
Selected: A
From word to character prediction
Context: computer; typed so far: ba
Candidate words: bags, backup, backpack, batteries, backgrounds, ...
Ranked characters: c, g, t, e, a, i, c, ...
Selected: C
From word to character prediction
Context: computer; typed so far: bac
Candidate words: backup, backpack, backgrounds, ...
Ranked characters: k, a, e, a, i, c, f, ...
Selected: K
From word to character prediction
Context: computer; typed so far: back
Candidate words: backup, backpack, backgrounds, ...
Ranked characters: g, p, u, e, a, i, c, f, ...
Selected: U
From word to character prediction
Context: computer; typed so far: backu
Candidate words: backup
Ranked characters: p, e, a, i, c, f, ...
Selected: P
Evaluation
• Simulation mode
• Trained and tested on a small part of the NY Times portion of the English Gigaword corpus
• RSVP scanning
Results
[Bar chart: average keystrokes per sentence (y-axis from 1000 to 2400), comparing character-frequency scanning with the LSA+bigram model]
17.79% keystroke savings per sentence
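For reference, keystroke savings is presumably the relative reduction against the character-frequency baseline (an assumed, standard definition; the function below is illustrative, not taken from the slides):

def keystroke_savings(baseline_keystrokes, model_keystrokes):
    # Percentage reduction in average keystrokes per sentence
    # relative to the baseline (assumed standard definition).
    return 100 * (baseline_keystrokes - model_keystrokes) / baseline_keystrokes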
Conclusion
• Word-based language models have been shown to be effective in character prediction
• Integrating LSA with a bigram language model works well for predicting upcoming words
• With larger LSA and bigram models we expect better results
Thank you.