Semantic Vectors
A Scalable Open Source Package and Online
Technology Management Application
LREC Conference
28th May, 2008
Dominic Widdows
Google, Inc.
widdows@google.com
Kathleen Ferraro
University of Pittsburgh
kaf1@pitt.edu
Natural Language Software
Engineering – Three Problems

Software is often hard to use / unreliable


t (fiddling with computers) >> t (analysing data)
Does it scale?
Moore's Law of Data – any algorithm more costly than
linear hurts more every day!


What is it for?
Systems / components
 Interesting (science) / useful (engineering)

Semantic Vector Models
Count how many times words occur in some context



Term–Document matrix
LSA (“Latent Semantic Analysis”)
Or count how many times words cooccur with one
another


HAL (“Hyperspace Analogue to Language”)
Normally we reduce dimensions somehow


SVD, NNMF, LDA.
Many uses


IR, WSD, OL / LA, DS / TDT, OCIM, DC, ... , Acronym
Resolution.
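The counting idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the three toy documents, the `term_doc` dictionary and the `cosine` helper are all invented here. Each term gets a row of per-document counts, and terms are compared by the cosine of the angle between their rows.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
]

# Term-document counts: one row per term, one column per document.
doc_counts = [Counter(d.split()) for d in docs]
vocab = sorted({w for d in docs for w in d.split()})
term_doc = {t: [c[t] for c in doc_counts] for t in vocab}

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(term_doc["cat"], term_doc["mat"]))     # terms sharing a document
print(cosine(term_doc["cat"], term_doc["stocks"]))  # terms sharing no document
```

A word-by-word cooccurrence matrix (the HAL variant) would use the same comparison, with columns indexed by neighbouring words instead of documents.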
Semantic Vectors Package








http://semanticvectors.google.com/
Created by University of Pittsburgh and MAYA Design
All Java (with some Perl / Python / php wrappers)
Maintained by Google 20% project + other contributors
BSD license – you can use it.
Nearly 1000 downloads
Developer group, Wiki, mailing list, ...
“Child of Infomap” with lessons learned
Challenge 1: Make it Easy!



100% Java
Dependencies include Apache Lucene
Installation



User

Download jarfiles

Add to your $CLASSPATH

Assemble a corpus (example provided)

Type “java pitt.search.semanticvectors.BuildModel”
Developer

Install SVN, Ant

Checkout source (Google code helps)

Install JUnit for testing
We have had no reports of difficulty yet!
Challenge 2: Make it Scale!
Dimension reduction and parallelization are key
Random Projection








Geometric alternatives: SVD (orthogonality)
Probabilistic alternatives: PLSA, LDA (generative models)
Sparse Random Vectors, e.g.
[0,0,0,1,0,-1,0,0,0,-1,0,0,0,0,1,0,0]
[0,-1,0,0,0,1,0,0,0,0,1,0,0,0,0,-1,0]
On average, dot products are nearly zero, so vectors are nearly
orthogonal.
Approximate benefits of SVD, with none of the cost!
Believed to be trivially parallelizable and incremental (TODO)
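The near-orthogonality claim can be checked with a small Python sketch. Everything here is an assumption made for illustration: the sizes `DIM` and `NONZERO`, the helper names, and the final projection step, which follows the usual random-indexing recipe (a term vector is the sum of the random vectors of the documents it occurs in). The slides do not spell out that recipe.

```python
import random
from itertools import combinations

def sparse_random_vector(dim, nonzero, rng):
    """Ternary vector: `nonzero` entries set to +1 or -1 (half each), rest 0."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)      # fixed seed so runs are repeatable
DIM, NONZERO = 200, 10      # illustrative sizes, not package defaults

# Near-orthogonality: dot products between distinct vectors stay close to 0,
# while each vector's dot product with itself equals NONZERO.
vectors = [sparse_random_vector(DIM, NONZERO, rng) for _ in range(50)]
pair_dots = [abs(dot(u, v)) for u, v in combinations(vectors, 2)]
mean_abs_dot = sum(pair_dots) / len(pair_dots)
print("self dot:", dot(vectors[0], vectors[0]))
print("mean |dot| between distinct vectors:", mean_abs_dot)

# Projection step (assumed, random-indexing style): add a document's random
# vector to the vector of every word occurrence in that document.
docs = ["the cat sat on the mat", "the dog sat on the log"]
doc_vectors = [sparse_random_vector(DIM, NONZERO, rng) for _ in docs]
term_vectors = {}
for doc, dvec in zip(docs, doc_vectors):
    for word in doc.split():
        tvec = term_vectors.setdefault(word, [0] * DIM)
        for i, x in enumerate(dvec):
            tvec[i] += x
```

Each document contributes independently to the sums, which is why the method is expected to be incremental and easy to parallelize: partial term vectors built on separate document batches can simply be added together.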
Challenge 3: Make it Useful!
Hardest of the three problems
Technology Matching at UPitt






http://real.hsls.pitt.edu/
Matches technology disclosures to
documents harvested from company
websites
Traditionally needs much more than
keywords
Does your data meet your needs?
Features and Demos ...
Negation, Disjunction


“Quantum” / Vector Logic
Translation


Bilingual Vector Models
Semantic Vector Products


Direct, Tensor, Convolution, Subspace
Clustering


kMeans
Context Window Approach (HAL)


Thanks to Trevor Cohen, ASU, Biomedical
Informatics
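The negation feature listed above is usually defined in vector logic as an orthogonal projection: "a NOT b" is the part of a that is orthogonal to b, that is, a minus its projection onto b. The Python sketch below shows only that formula. The function name `negate` and the three-dimensional example vectors are hypothetical and do not come from the package.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def negate(a, b):
    """Return 'a NOT b': a minus its projection onto b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)          # nothing to remove if b is the zero vector
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

# Hypothetical vectors: 'suit' mixes a clothing sense (axis 0)
# and a legal sense (axis 1); 'lawsuit' is purely legal.
suit = [1.0, 1.0, 0.0]
lawsuit = [0.0, 1.0, 0.0]

result = negate(suit, lawsuit)
print(result)                   # the clothing component only
print(dot(result, lawsuit))     # orthogonal to the negated term
```

Because the result has zero dot product with the negated vector, anything similar to the unwanted sense scores low against the query.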
Mathematics, Technology, Cognition

Geometry, Probability, Logic Intersection
“That one term should be included in another as in a whole is the same as for the
other to be predicated of all of the first.”
Prior Analytics (Bk I, Ch 1)

The equations work ... does the method?
“It is the mark of an educated man to look for precision in each class of things just so
far as the nature of the subject admits; it is evidently equally foolish to accept
probable reasoning from a mathematician and to demand from a rhetorician
scientific proofs.”
Nicomachean Ethics (Bk I, Ch 3)

What do people do?
“By nature animals are born with the faculty of sensation, and from sensation memory
is produced in some of them, though not in others ... Now from memory experience
is produced in humans; for the several memories of the same thing produce finally
the capacity for a single experience.”
Metaphysics (Bk I, Ch 1)
Many Thanks ...

ELRA and the LREC conference

Developers of Java, Lucene, Ant, JUnit, ...

Google, University of Pittsburgh

Harris, Firth, Van Rijsbergen, Salton, McGill, Landauer, Deerwester,
Berry, Dumais, Schütze, Lund, Burgess, Sahlgren, Karlgren, Kaufmann,
Dorow, Cederberg, Hofmann, Kanerva, Plate, Papadimitriou, McArthur,
Bruza, ...
And finally ...


http://infomap.stanford.edu/book
Introduction to Vectors,
WordSpace, Quantum Logic,
etc.

A few for sale here ... 150 Moroccan dirhams.

Download the package ...
Google(Semantic Vectors)