Word to Word
Overlaying Multiple Word Similarity Measures
Brent Kievit-Kylar
Indiana University
{bkievitk@indiana.edu}
Abstract
This paper provides an overview of a tool designed
to show multiple word relation metrics over the same
word set. The system is modular and comes with a
large number of built-in semantic space similarity
metrics.
Keywords: Network Visualization, Semantic, Word
1. Introduction
Language learning systems have advanced greatly in
the last few decades. We have seen the rise of statistical
language processors and ever more thorough knowledge-based
systems and data sets. But how do we measure
these advances? What makes one system better than
another? Even more challenging, how can we tune
parameters within a model to determine optimum
settings? Beyond that, words can be considered similar in
many different ways, from how they are ordered in text
to the relation between the actual physical objects that
the words represent (for nouns). What makes the
task of measuring word similarity difficult is the lack of
anything that can be considered a “gold standard”. The
only “correct” model of natural language is the human
mind, or something that exactly mimics the human mind
(since language is a construct of the human mind).
Therefore, the only way to gauge the ability of these language
tools is to compare their results to what we as humans
would do or expect.
There are a number of ways to compare the
results of a language tool to our own minds. There are
some standard tests such as synonyms or analogies that
we have good human data for. We can inspect various
test cases to see whether the word similarities are calculated
similarly to the way we would calculate them ourselves (often
the preferred measure, where a paper will list
target words and result words labeled with similarity
distances). But none of this really shows the gestalt, the
whole picture. By ignoring the full system, we risk
optimizing for a small tested subset while
leaving vast realms of the semantic space untested. It
may be nice to know that when the system is given the
word dog we get reasonable results, but we should not
expect this to translate to a complete recognition of all
words. We believe that the field of computational
linguistics needs a set of tools to be able to judge the
effectiveness of various algorithms. A visualization
appears to be the best choice for the core of this system
as it will allow users to see a large amount of data in the
minimum possible space. This software tool set must
also be bundled with a package of the most common
algorithms available (though it should remain easily
extensible), packaged under one programming framework.
The core of this bundle must be a strong and
easy-to-use visualization that can be dynamically
reconfigured to display relevant words and different
word relations at the same time, so that the results can be
compared. The tool should be easy enough to use that
children can explore high-powered linguistic algorithms
with no background knowledge, but complete enough to
allow experts to use it for research.
To this end, we demonstrate a dynamic
visualization tool to show similarities between words
based on different similarity measures. The dynamic
visualization is composed of three main parts. First, a set
of words, which are the nodes in the visualization; all words
in the English language must be available, though
subsets should be able to be chosen dynamically with a
useful selection method. Second, layout managers for
those nodes, to group or order them in a meaningful way.
Third, relational modules that, given an ordered pair of
words, return a graded value from 0 to 1 indicating the
similarity of those two words under a given metric.
The graphical interface comes in two flavors,
beginner and advanced, to allow users to tune their
experience to their own level of ability. In the following
three sections, we will describe the three main parts of
this visualization: word set selection, layout, and
relations. Then we will discuss some examples of how it
might be used.
2. Data Model
At its core, this problem is reduced to a graph
representation where the nodes represent words and the
edges represent the relations between those words.
Each word node contains the string
representation of that word. The normalization of these
representations will depend on the filters used when
teaching the system (more in the filters section). Each
word node also keeps a count of how many times the word
has occurred; each time a sentence is learned, all word
counts are updated. Each word is assigned a part-of-speech
label from the set {Unknown, Adjective, Adverb, Noun, Verb,
Multiple}. This label is determined by querying
WordNet. Finally, each word also contains a coordinate
that is its location in the visualization.
Each similarity type, which makes up an edge,
has a function for computing word similarity between
any two words (returning 0 if there is no information on the
relation between the words). A color and a name are also
associated with each similarity type.
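To make the data model concrete, the sketch below shows roughly what these structures might look like in Java. The class and field names (WordNode, SimilarityType, and so on) are hypothetical stand-ins for illustration, not the actual implementation.

import java.awt.Color;
import java.awt.geom.Point2D;

// Hypothetical sketch of the data model described above; all names are illustrative.
enum PartOfSpeech { UNKNOWN, ADJECTIVE, ADVERB, NOUN, VERB, MULTIPLE }

class WordNode {
    String word;             // string representation of the word
    int count;               // number of times the word has occurred during learning
    PartOfSpeech pos;        // part-of-speech label, determined by querying WordNet
    Point2D.Double location; // coordinate of the node in the visualization
}

abstract class SimilarityType {
    String name;             // name shown for this similarity type
    Color color;             // color used for edges of this similarity type
    // Returns a value in [0, 1]; 0 when there is no information about the pair.
    abstract double similarity(String word1, String word2);
}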
3. Selecting Words
Words in this model are stored in active and full
word lists. The full word list contains every word that the
system has any information about and the active word
list contains the words which the user is currently
interested in observing. The initial word list is drawn
from a pre-computed file that contains every word with
more than 1000 occurrences in the TASA corpus as well
as the word frequency obtained from TASA. On
creation, each word similarity module (described later) is
responsible for relating the words that it has information
on to the main system. Thus the full word list is
managed without the need for user interaction.
The active word set is managed directly by the user
of the system. Words can be added or removed from the
active word set through a series of tools.
Word – Add or remove a single word from the
active list.
Near – Add or remove all words in a given file from
the active list.
Count – Add or remove all words that have occurred
between min and max number of times.
All – Add or remove all words in the full words list
from the active list.
Relator – Add or remove all words that a particular
similarity metric knows about.
Random – Add or remove a random number of
words from the active list.
Words can also be added and removed by selecting them
in the multi-select lists on the left of the word manager
window. The sketch below illustrates the idea behind the Count tool.
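A rough sketch of the Count selection, assuming the full word list is held as a map from words to their frequencies (all names here are hypothetical):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the Count tool: pick out every word in the full
// word list whose occurrence count lies between min and max (inclusive).
class CountSelector {
    static Set<String> selectByCount(Map<String, Integer> fullWordCounts, int min, int max) {
        Set<String> selected = new HashSet<String>();
        for (Map.Entry<String, Integer> entry : fullWordCounts.entrySet()) {
            int count = entry.getValue();
            if (count >= min && count <= max) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}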
4. Laying Out Words
Each active word has a location specified in the
visualization. This location can be modified in a number
of ways.
The position of individual nodes can be changed by
clicking and dragging the nodes with the mouse. If a user
drags across the screen but has not started on a node,
then the screen scrolls sideways (in a one-to-one movement
with the mouse). The scroll wheel on the mouse will
zoom in or out (depending on scroll direction) with the
zoom centered on the mouse.
There is also a set of layout managers that
facilitate creating a meaningful arrangement of the nodes
(a sketch of the Word layout follows the list). The
layout managers are as follows:
Random – All nodes are randomly laid out on the
screen.
Grid – Nodes are placed in a grid on the screen
(dimensions are chosen to minimize the difference between x
and y spacing).
Word – A word is chosen as the center, and words
are laid out at a distance from the center node
proportional to their difference from that word (as
specified by the similarity module of choice).
MDS – Multi-dimensional scaling is applied to the
nodes, using the similarity module of choice as a
distance metric.
Fit Screen – Nodes are translated and scaled to keep
their proportional distances while filling as much of the
screen as possible.
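As a concrete illustration of the Word layout, the sketch below places the center word at the origin and every other word on a circle whose radius grows as similarity falls. The WordNode and SimilarityType classes are the hypothetical ones from the data model sketch above, not the tool's real classes.

import java.awt.geom.Point2D;
import java.util.List;

// Illustrative Word layout: distance from the center word is proportional to
// (1 - similarity), so more similar words end up closer to the center.
class WordLayout {
    static void layout(String center, List<WordNode> nodes,
                       SimilarityType relator, double maxRadius) {
        int placed = 0;
        for (WordNode node : nodes) {
            if (node.word.equals(center)) {
                node.location = new Point2D.Double(0, 0);
                continue;
            }
            double sim = relator.similarity(center, node.word);   // in [0, 1]
            double radius = (1.0 - sim) * maxRadius;
            double angle = 2 * Math.PI * placed++ / Math.max(1, nodes.size() - 1);
            node.location = new Point2D.Double(radius * Math.cos(angle),
                                               radius * Math.sin(angle));
        }
    }
}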
5. Relational Modules
The core of the system is the set of relational
modules. These are based on the set of computational
language algorithms available today. Users can select an
algorithm from the set of options and configure it as
they see fit. The algorithms can then
be trained on training data specified by the user.
Each relation module is written as a Java class
that extends the WordRelator object in this package.
Each module must then implement the following key
functions.
/**
 * Apply the learning that you have done.
 * Relevant for static algorithms.
 */
public void finalizeSpace() {
  finalized = true;
}

/**
 * See if the space is finalized.
 * @return true if the space has been finalized.
 */
public boolean isFinalized() {
  return finalized;
}

/**
 * The list of words that the relator knows about.
 * Subclasses should override this to report their vocabulary.
 * @return the set of known words.
 */
public Set<String> getWords() {
  return null;
}

/**
 * The core of the relator is a distance metric between two words.
 * @param word1 First word.
 * @param word2 Second word.
 * @return Value in the range of 0 to 1, where 1 is highly similar and 0 is non-similar.
 */
public abstract double getDistance(String word1, String word2);

/**
 * Learn this sentence.
 * @param sentence the sentence to learn, as an array of tokens.
 */
public void learn(String[] sentence) {
}

// These clean incoming sentences.
public transient LinkedList<SentenceCleaner> cleaners = new LinkedList<SentenceCleaner>();
Each module also specifies a name, color and
other relevant information. Word count information is
automatically added when the learn function is called.
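To make the extension point concrete, here is a minimal sketch of what a custom relator might look like: a simple sentence co-occurrence counter. It is purely illustrative, is not one of the bundled metrics, and assumes that WordRelator's default constructor suffices (in practice the name and color would also be set).

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative relator: similarity is the fraction of learned sentences in
// which both words occurred together.
public class CooccurrenceRelator extends WordRelator {

    private final Map<String, Integer> pairCounts = new HashMap<String, Integer>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int sentences = 0;

    @Override
    public void learn(String[] sentence) {
        sentences++;
        for (int i = 0; i < sentence.length; i++) {
            vocabulary.add(sentence[i]);
            for (int j = i + 1; j < sentence.length; j++) {
                String key = pairKey(sentence[i], sentence[j]);
                Integer old = pairCounts.get(key);
                pairCounts.put(key, old == null ? 1 : old + 1);
            }
        }
    }

    @Override
    public Set<String> getWords() {
        return vocabulary;
    }

    @Override
    public double getDistance(String word1, String word2) {
        if (sentences == 0) {
            return 0;  // no information about any relation yet
        }
        Integer together = pairCounts.get(pairKey(word1, word2));
        return together == null ? 0 : (double) together / sentences;
    }

    // Order-independent key for a word pair.
    private static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\u0000" + b : b + "\u0000" + a;
    }
}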
The SentenceCleaners provide pre-processing
of the sentences to be learned. Each sentence is
transformed by the cleaners in the order that they appear in
the list (a sketch of a simple cleaner follows the list). The
cleaner options are as follows:
To Lower Case
Alpha Numeric Only
Remove Web Tags
Remove Excess Whitespace
Remove stoplisted words (taken from a standard
stoplist)
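The cleaners themselves are small text transformations applied in list order. The sketch below assumes, purely for illustration, that SentenceCleaner exposes a single cleanSentence(String) method; the actual interface may differ.

import java.util.LinkedList;

// Hypothetical example cleaner and pipeline; cleanSentence is an assumed
// method name, used here only to illustrate the idea of chained cleaners.
class LowerCaseCleaner extends SentenceCleaner {
    public String cleanSentence(String sentence) {
        return sentence.toLowerCase();
    }
}

class CleanerPipeline {
    static String applyAll(LinkedList<SentenceCleaner> cleaners, String sentence) {
        for (SentenceCleaner cleaner : cleaners) {
            sentence = cleaner.cleanSentence(sentence);  // each cleaner transforms the text in turn
        }
        return sentence;
    }
}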
The RelationLoader class provides a way to
dynamically load in new relators. This means that the
program does not even have to be recompiled in order to
add a new type of word relator. However, a large number
of word relators are provided along with the software.
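The dynamic loading idea can be illustrated with standard Java reflection. This is a hedged sketch only; the actual RelationLoader may work differently (for example, by scanning jar files), and the class and method names below are hypothetical.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative only: load a WordRelator subclass by name from a directory of
// compiled classes, without recompiling the main program.
class RelatorLoaderSketch {
    static WordRelator load(File classDir, String className) throws Exception {
        URLClassLoader loader = new URLClassLoader(new URL[] { classDir.toURI().toURL() });
        Class<?> cls = Class.forName(className, true, loader);
        return (WordRelator) cls.getDeclaredConstructor().newInstance();
    }
}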
The wizard tool in the advanced version shows the
available similarity metrics along with a brief
description of each.
6. The Visualization
The visualization is shown in real time, based on the
directed graph with the words as nodes and the relations as
edges.
Edges are color coded according to which similarity
metric created them, with a thickness proportional to
the strength of the similarity. Edges are rendered as
curved lines between two nodes, where the curvature is
increased for each subsequent similarity metric. The
exact nature of the line can be specified by the user
(arrow at the end or middle, dot at the end, increase in
line width, etc.).
Rendering is real-time for small enough sets of
relations. If the number of relations is judged to be too
great by the visualization tool, then only the node
locations will be rendered. A full rendering can then be
forced by the user through a menu selection.
The user can hover over any node with the mouse
even if a full rendering has not been done, and see the
connections leading out of this node as well as some
information on that particular node. Rendering is done in
a separate thread, using double-buffering, so it can be
interrupted easily.
Nodes are visualized as labeled circles. The circles
can be constant sized or proportional to the word's
frequency. Node coloring can either be static or
representative of that word's properties (number of
connections, part of speech, etc.).
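When node size tracks word frequency, the raw counts typically need to be compressed so that very common words do not dwarf the rest of the display. One plausible mapping, illustrative only (the tool's actual scaling may differ), is logarithmic:

// Illustrative only: map a word's occurrence count to a node radius, using a
// logarithm to compress the range between rare and very frequent words.
// Assumes maxCount > 0.
class NodeSizing {
    static double radiusForCount(int count, int maxCount, double minRadius, double maxRadius) {
        double scaled = Math.log1p(count) / Math.log1p(maxCount);  // normalized to [0, 1]
        return minRadius + scaled * (maxRadius - minRadius);
    }
}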
6.1. Legend
A basic legend shows word color meanings and
word size meanings dynamically as these properties are
selected.
6.2. Other Network Tools
There are a number of tools provided to describe the
relations given by the similarity modules. Different
minimum paths can be computed between words along
all, or a subset of, the similarity modules. Discrepancy tools
show minimum and maximum discrepancy between
multiple similarity modules over given words. Basic
statistics show connection distributions for different
similarity modules.
6.3. The GUI: Simple and Advanced
To make this tool both accessible and
comprehensive, we developed two versions of the
underlying graphical interface. The simple version relies
strongly on graphical depictions of the various user
actions. An options flow is specified, urging the user to
first specify similarity metrics, choose words, lay out the
words, choose visualization options, and then view the
final visualization. Default values are used whenever
possible and are not even presented as options to the user.
In the advanced version, the program is more
dynamic and options can be specified in full. Wizards
help users through the more complex steps in creating
different relations or specifying word sets.
6.4. Training
For most similarity measures, training data can be
selected by the user. All of the WordNet similarity
measures come with built in training (WordNet itself)
but for all others, training is completely flexible. Users
can select text either by entering text into a text box,
inputting a website to scrape, or selecting a file. If the
user selects a file, the system will attempt to convert that
file into readable text. It has interpreters for Word, Open
Office, and PDF files.
Some of the similarity metrics work on a line by
line basis and others on a document by document basis.
Each text entry given will be considered a document and
will be parsed into sentences if required for that metric.
Each line is filtered with a series of cleaners specified by
the user. These cleaners are applied in the order that they
are added.
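Putting these pieces together, training a relator on a document essentially amounts to cleaning each line and handing the resulting tokens to learn(). The sketch below is a simplified illustration; it reuses the hypothetical cleanSentence method from the cleaner sketch above, and the tool's real parsing (sentence splitting, file conversion, and so on) is more involved.

// Simplified illustration of the training flow: run each line through the
// relator's cleaners in order, split it into tokens, and pass it to learn().
class Trainer {
    static void train(WordRelator relator, Iterable<String> documentLines) {
        for (String line : documentLines) {
            for (SentenceCleaner cleaner : relator.cleaners) {  // cleaners run in list order
                line = cleaner.cleanSentence(line);             // assumed method name (see sketch above)
            }
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length > 0 && !tokens[0].isEmpty()) {
                relator.learn(tokens);
            }
        }
    }
}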
7. Hypotheses
It is difficult to define a single hypothesis to be
tested with this tool, as it is instead a means to discover
potential relations between different similarity measures
and word databases. We hope that it will generate
more hypotheses than we started out with. Instead, it
makes sense to detail a few potential cases where this
tool would be useful for uncovering hidden
potential in various language tasks.
This tool has two main uses: the first is comparing
the difference between different similarity metrics, and
the second is comparing the difference between different
training data sets. Both of these tasks will appeal to
different types of researchers and other users.
7.1. Primary Function
The primary function is as an overview tool. A
researcher can manipulate a particular variable within a
comparator and see the results on the main nodes of the
network. In this way, they can more easily determine
whether their change was beneficial or not.
7.2. Lots of Comparators
To our knowledge, this is the largest collection
of word comparison tools available in one place that
work right out of the box, independent of the visualization
system. This will give researchers the ability to quickly
experiment with new semantic space algorithms without
having to set them up and get them working (not an easy
task).
7.3. Different Documents
Probably of more interest to non-researchers,
users can train the system on their own set of
documents. Various similarity metrics can then be used
to discover new things about their own writing. Bigrams
are easily displayed by the nGram similarity metric.
Word relations that were not explicitly entered may
become visible. For example, though you may never
explicitly say that an item is bad in your paper, the
semantic content may still show that.
8. Use Cases
The next few sections describe some specific
example uses of this visualization tool.
8.1. McRae Norms
In this use case, we explore the cosine
similarity between the McRae feature norms (McRae et
al., 2005). The McRae feature norms are a set of concrete
nouns whose features have been manually described by
subjects. A similarity can then be computed over the
objects by comparing the feature sets of different
objects. In this example, the words in the McRae norms
are the only ones selected. Lines are bounded to similarities
of roughly .6 to 1 to show the predominant structure of
the data. The data points have been clustered using MDS.
It is clear that the words are organized into a handful of
related clusters which do not connect strongly with each
other. We see clusters of “plant”, “sea creature”,
“bird”, “insect”, and “human-made object” words.
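For reference, the cosine similarity used here treats each concept as a vector of feature counts; two concepts are similar to the extent that their feature vectors point in the same direction. A straightforward sketch over sparse feature maps:

import java.util.Map;

// Cosine similarity between two sparse feature-count vectors: the dot product
// divided by the product of the vector lengths, giving a value in [0, 1] for
// non-negative counts.
class FeatureCosine {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            double valueA = entry.getValue();
            normA += valueA * valueA;
            Double valueB = b.get(entry.getKey());
            if (valueB != null) {
                dot += valueA * valueB;
            }
        }
        for (double valueB : b.values()) {
            normB += valueB * valueB;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}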
8.2. Semantic Pictionary
Next, we wish to see how the McRae norm
results line up with a new experiment: the Semantic
Pictionary project. In this experiment, subjects were
asked to build representations of words using geometric
shapes. Similarities were then computed over the shapes
that were made for the words.
We can see that this new metric does not match up
very well with the McRae norms, and we can identify some
possible problem words, like dove and chickadee
(something we might expect, as these are probably hard
objects for most people to represent or draw).
8.3. Wikipedia Dog
Here we look at a few different ways that one data
set with one similarity metric can be represented. We
learn the relations between words from the website
http://en.wikipedia.org/wiki/Dog, using the NGram
similarity metric.
In the first visualization we show the word-centered
view for the word “the”. Since the NGram similarity
metric does not have a very strong gradient, the words
appear on two main rings around the word “the”.
In the next visualization we can see the word count
as the radius of the circles. Words are aligned to a grid.
Finally, we show the MDS layout with a different
color scheme. Many things, such as the background, can be
changed through various options in the program. The
MDS clustering is also very interesting with the NGram
similarity, since the distinctions are very discrete.
9. Conclusion
We believe that we have provided a useful tool to
both hobbyists and professionals interested in word
relations. The code and executable programs will be
made freely available at [website to be added later].
References
[1] McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547-559.
[2] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.
[3] Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[4] Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1-37.
Note: the individual word similarity metrics are based on
various papers; their implementations were collected by the
Semantic Spaces project on Google Code.