Word to Word Overlaying Multiple Word Similarity Measures

Brent Kievit-Kylar
Indiana University
{bkievitk@indiana.edu}

Abstract

This paper provides an overview of a tool designed to show multiple word relation metrics over the same word set. The system is modular and comes with a large number of pre-built semantic space similarity metrics.

Keywords--Network Visualization, Semantic, Word

1. Introduction

Language learning systems have advanced greatly in the last few decades. We have seen the rise of statistical language processors and ever more thorough knowledge-based systems and data sets. But how do we measure these advances? What makes one system better than another? Even more challenging, how can we tune parameters within a model to determine optimum settings? Beyond that, words can be considered similar in many different ways, from how they are ordered in text to the relation between the actual physical objects that the words represent (for nouns).

What makes the task of measuring word similarity difficult is the limit on what can be considered a “gold standard”. The only “correct” model of natural language is the human mind, or something exactly mimicking the human mind (since language is a construct of the human mind). Therefore, our only way to gauge the ability of these language tools is to compare their results to what we as humans would do or expect.

There are a number of ways to compare the results of a language tool to our own minds. There are some standard tests, such as synonyms or analogies, for which we have good human data. We can also inspect various test cases to see if the word similarities are calculated the way we would calculate them ourselves (often the preferred measure, where a paper presents a list of target words and result words labeled with similarity distances). But none of this really shows the gestalt, the whole picture.
By ignoring the full system, we risk optimizing for a small tested subset of the language while leaving vast realms of the semantic space untested. It may be nice to know that the system gives reasonable results for the word dog, but we should not expect this to translate into sound handling of all words.

We believe that the field of computational linguistics needs a set of tools for judging the effectiveness of various algorithms. A visualization appears to be the best choice for the core of this system, as it allows users to see a large amount of data in a small space. This tool set must also be bundled with a package of the most common algorithms available (while remaining easily extensible), all under one programming framework. The core of this bundle must be a strong and easy-to-use visualization that can be dynamically reconfigured to display relevant words and different word relations at the same time, so that the results can be compared. The tool should be easy enough to use that children can explore high-powered linguistic algorithms with no background knowledge, but complete enough to allow experts to use it for research.

To this end, we demonstrate a dynamic visualization tool that shows similarities between words based on different similarity measures. The dynamic visualization is composed of three main parts:

First, a set of words, which are the nodes in the visualization. All words in the English language must be available, though subsets should be selectable dynamically with a useful selection method.
Second, layout managers for those nodes, to group or order them in a meaningful way.
Third, relational modules that, given an ordered pair of words, return a graded value from 0 to 1 indicating the similarity of those two words under a given metric.

The graphical interface comes in two flavors, beginner and advanced, to allow users to tune their experience to their own level of ability.
In the following three sections, we describe the three main parts of this visualization: word set selection, layout, and relations. Then we discuss some examples of how the tool might be used.

2. Data Model

At its core, this problem is reduced to a graph representation in which the nodes represent words and the edges represent the relations between those words.

Each word node contains the string representation of that word. The normalization of these representations depends on the filters used when training the system (see the description of sentence cleaners under Relational Modules). Each word also keeps a count of how many times it has occurred; each time a sentence is learned, all word counts are updated. Each word is assigned a part of speech label from the set {Unknown, Adjective, Adverb, Noun, Verb, Multiple}. This label is determined by querying WordNet. Finally, each word also contains a coordinate that is its location in the visualization.

Each similarity type, which makes up an edge, has a function for computing the similarity between any two words (returning 0 if there is no information on the relation between the words). A color and a name are also associated with each similarity type.

2. Selecting Words

Words in this model are stored in an active word list and a full word list. The full word list contains every word that the system has any information about, and the active word list contains the words that the user is currently interested in observing.

The initial word list is drawn from a pre-computed file that contains every word with more than 1000 occurrences in the TASA corpus, along with the word frequency obtained from TASA. On creation, each word similarity module (described later) is responsible for reporting the words that it has information on to the main system. Thus the full word list is managed without any interaction by the user.

The active word set is managed directly by the user of the system. Words can be added to or removed from the active word set through a series of tools.
Word – Add or remove a single word from the active list.
Near – Add or remove all words in a given file from the active list.
Count – Add or remove all words that have occurred between a minimum and a maximum number of times.
All – Add or remove all words in the full word list from the active list.
Relator – Add or remove all words that a particular similarity metric knows about.
Random – Add or remove a random number of words from the active list.

Words can also be added and removed by selection in the multi-select lists shown on the left of the word manager image above.

3. Laying Out Words

Each active word has a location specified in the visualization. This location can be modified in a number of ways. The position of an individual node can be changed by clicking and dragging the node with the mouse. If a user drags across the screen without starting on a node, the screen scrolls sideways (in a one-to-one movement with the mouse). The scroll wheel on the mouse zooms in or out (depending on scroll direction), with the zoom centered on the mouse.

There is also a set of layout managers that facilitate creating a meaningful layout of the nodes. The layout managers are as follows:

Random – All nodes are laid out randomly on the screen.
Grid – Nodes are placed in a grid on the screen (dimensions are chosen to minimize the difference between x and y spacing).
Word – A word is chosen as the center, and the other words are laid out at a distance from the center node proportional to their difference from that word (as specified by the similarity module of choice).
MDS – Multidimensional scaling is applied to the nodes, using the similarity module of choice as a distance metric.
Fit Screen – Nodes are translated and scaled to keep their proportional distances while filling as much of the screen as possible.

4. Relational Modules

The core of the system is the set of relational modules. These are based on the set of computational language algorithms available today.
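As a rough illustration of how a relational module's output can drive the rest of the system, the Word layout from the previous section might be sketched as follows. This is only a sketch under stated assumptions, not the tool's actual code: the class WordCenteredLayout, its layout method, the maxRadius parameter, and the even angular spacing are all hypothetical. The text only specifies that distance from the center grows with the difference between words, taken here as 1 minus the similarity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of a word-centered layout; not the tool's implementation. */
class WordCenteredLayout {

    /**
     * Places the center word at (0, 0) and every other word at a radius of
     * maxRadius * (1 - similarity), spreading words evenly by angle (an assumption).
     * Each returned array holds {x, y}.
     */
    static Map<String, double[]> layout(String center, List<String> others,
            ToDoubleBiFunction<String, String> similarity, double maxRadius) {
        Map<String, double[]> positions = new LinkedHashMap<>();
        positions.put(center, new double[] {0.0, 0.0});
        for (int i = 0; i < others.size(); i++) {
            String word = others.get(i);
            // Clamp to the documented 0..1 range before converting to a distance.
            double sim = Math.max(0.0, Math.min(1.0, similarity.applyAsDouble(center, word)));
            double radius = maxRadius * (1.0 - sim);
            double angle = 2.0 * Math.PI * i / others.size();
            positions.put(word, new double[] {radius * Math.cos(angle), radius * Math.sin(angle)});
        }
        return positions;
    }
}

class WordCenteredLayoutDemo {
    public static void main(String[] args) {
        // Toy similarity values standing in for a real relational module.
        ToDoubleBiFunction<String, String> toySimilarity =
                (a, b) -> b.equals("cat") ? 0.75 : 0.0;
        Map<String, double[]> positions = WordCenteredLayout.layout(
                "dog", Arrays.asList("cat", "rock"), toySimilarity, 100.0);
        for (Map.Entry<String, double[]> entry : positions.entrySet()) {
            System.out.printf("%s -> (%.1f, %.1f)%n",
                    entry.getKey(), entry.getValue()[0], entry.getValue()[1]);
        }
    }
}
```

In this toy run the highly similar word lands close to the center and the unrelated word lands on the outer ring, which is the behavior the Word layout description calls for.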
Users can select an algorithm from the set of options and configure that algorithm as they see fit. The algorithms can then be trained on training data specified by the user. Each relation module is written as a Java class that extends the WordRelator class in this package. Each module must then implement the following key functions.

    /**
     * Apply the learning that you have done.
     * Relevant for static algorithms.
     */
    public void finalizeSpace() {
        finalized = true;
    }

    /**
     * See if the space is finalized.
     * @return true once finalizeSpace() has been called.
     */
    public boolean isFinalized() {
        return finalized;
    }

    /**
     * This is the list of words that the relator knows about.
     * @return The known words (this default implementation returns null).
     */
    public Set<String> getWords() {
        return null;
    }

    /**
     * The core of the relator is a distance metric between two words.
     * @param word1 First word.
     * @param word2 Second word.
     * @return Value in the range of 0 to 1 where 1 is highly similar and 0 is non-similar.
     */
    public abstract double getDistance(String word1, String word2);

    /**
     * Learn this sentence.
     * @param sentence The words of the sentence to learn.
     */
    public void learn(String[] sentence) {
    }

    // These clean incoming sentences.
    public transient LinkedList<SentenceCleaner> cleaners = new LinkedList<SentenceCleaner>();

Each module also specifies a name, a color, and other relevant information. Word count information is added automatically when the learn function is called.

The SentenceCleaners pre-process the sentences to be learned. Each sentence is transformed by the cleaners in the order in which they appear in the list. The cleaner options are as follows:

To Lower Case
Alpha Numeric Only
Remove Web Tags
Remove Excess Whitespace
Remove stoplisted words (taken from a standard stoplist)

The RelationLoader class provides a way to load new relators dynamically. This means that the program does not even have to be recompiled in order to add a new type of word relator. In addition, a large number of word relators are provided along with the software.
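To make the interface above concrete, a new relator might look roughly like the following. This is a minimal sketch, not code from the package: the WordRelator and SentenceCleaner types shown here are simplified stand-ins whose signatures (in particular SentenceCleaner.clean and the applyCleaners helper) are assumed, and SentenceOverlapRelator with its Jaccard-style measure is a hypothetical example rather than one of the bundled metrics.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Assumed cleaner signature; the package's real SentenceCleaner may differ. */
interface SentenceCleaner {
    String[] clean(String[] sentence);
}

/** Simplified stand-in for the package's WordRelator base class. */
abstract class WordRelator {
    protected boolean finalized = false;
    public final List<SentenceCleaner> cleaners = new LinkedList<>();

    public void finalizeSpace() { finalized = true; }
    public boolean isFinalized() { return finalized; }
    public abstract Set<String> getWords();
    public abstract double getDistance(String word1, String word2);
    public abstract void learn(String[] sentence);

    /** Hypothetical helper: runs the cleaners in list order. */
    protected String[] applyCleaners(String[] sentence) {
        String[] result = sentence;
        for (SentenceCleaner cleaner : cleaners) {
            result = cleaner.clean(result);
        }
        return result;
    }
}

/** Illustrative relator: Jaccard overlap of the sentences in which two words occur. */
class SentenceOverlapRelator extends WordRelator {
    private final Map<String, Set<Integer>> sentencesByWord = new HashMap<>();
    private int sentenceCount = 0;

    @Override
    public void learn(String[] sentence) {
        int id = sentenceCount++;
        for (String word : applyCleaners(sentence)) {
            sentencesByWord.computeIfAbsent(word, k -> new HashSet<>()).add(id);
        }
    }

    @Override
    public Set<String> getWords() {
        return sentencesByWord.keySet();
    }

    @Override
    public double getDistance(String word1, String word2) {
        Set<Integer> a = sentencesByWord.get(word1);
        Set<Integer> b = sentencesByWord.get(word2);
        if (a == null || b == null) {
            return 0.0; // No information on at least one of the words.
        }
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union; // Always within 0..1.
    }
}

class RelatorDemo {
    public static void main(String[] args) {
        SentenceOverlapRelator relator = new SentenceOverlapRelator();
        // A "To Lower Case" style cleaner written against the assumed interface.
        relator.cleaners.add(s -> Arrays.stream(s).map(String::toLowerCase).toArray(String[]::new));
        relator.learn(new String[] {"The", "dog", "barks"});
        relator.learn(new String[] {"the", "cat", "sleeps"});
        relator.finalizeSpace();
        System.out.println(relator.getDistance("the", "dog")); // prints 0.5
        System.out.println(relator.getDistance("dog", "cat")); // prints 0.0
    }
}
```

The sketch returns 0 for unknown words, matching the convention stated in the data model, and keeps every value within the 0 to 1 range that the visualization expects.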
Below is the wizard tool in the advanced version, which shows the available similarity metrics and a brief description of each.

4. The Visualization

The visualization is shown in real time, based on the directed graph with the words as nodes and the relations as edges. Edges are color coded according to which similarity metric created them and are drawn with a thickness proportional to the strength of the similarity. Edges are rendered as curved lines between two nodes, where the curvature is increased for each subsequent similarity metric. The exact nature of the line can be specified by the user (arrow at the end or middle, dot at the end, increase in line width, etc.).

Rendering is real-time for small enough sets of relations. If the number of relations is judged to be too great by the visualization tool, then only the node locations will be rendered. A full rendering can then be forced by the user through a menu selection. The user can hover over any node with the mouse, even if a full rendering has not been done, and see the connections leading out of this node as well as some information on that particular node. Rendering is done in a separate thread using double-buffering, so it can be interrupted easily.

Nodes are visualized as labeled circles. The circles can be of constant size or proportional to the word's frequency. Node coloring can either be static or represent that word's properties (number of connections, part of speech, etc.).

4.1. Legend

A basic legend shows word color meanings and word size meanings dynamically as these properties are selected.

4.2. Other Network Tools

A number of tools are provided to describe the relations given by the similarity modules. Different minimum paths can be computed between words along all or a subset of the similarity modules. Discrepancy tools show the minimum and maximum discrepancy between multiple similarity modules over given words. Basic statistics show connection distributions for different similarity modules.

4.3. The GUI: Simple and Advanced

To make this tool both accessible and comprehensive, we developed two versions of the underlying graphical interface. The simple version relies strongly on graphical depictions of the various user actions. An options flow guides the user to first specify similarity metrics, then choose words, lay out the words, choose visualization options, and finally view the visualization. Default values are used whenever possible and are not even offered as options to the user. In the advanced version, the program is more dynamic and options can be specified in full. Wizards help users through the more complex steps of creating different relations or specifying word sets.

4.4. Training

For most similarity measures, training data can be selected by the user. All of the WordNet similarity measures come with built-in training (WordNet itself), but for all others, training is completely flexible. Users can select text by entering it into a text box, providing a website to scrape, or selecting a file. If the user selects a file, the system will attempt to convert that file into readable text. It has interpreters for Word, Open Office, and PDF files.

Some of the similarity metrics work on a line-by-line basis and others on a document-by-document basis. Each text entry given is considered a document and is parsed into sentences if required for that metric. Each line is filtered with a series of cleaners specified by the user. These cleaners are applied in the order in which they are added.

5. Hypotheses

It is difficult to define a single hypothesis for this tool, as it is instead a means to discover potential relations between different similarity measures and word databases. We hope that it will produce more hypotheses than we start out with. Instead, it makes sense to detail a few potential cases where this tool would be useful for uncovering hidden potential in various language tasks.

This tool has two main uses: the first is comparing different similarity metrics, and the second is comparing different training data sets. These tasks will appeal to different types of researchers and other users.

5.1. Primary Function

The primary function is as an overview tool. A researcher can manipulate a particular variable within a comparator and see the results on the main nodes of the network. In this way, they can more easily determine whether their change was beneficial.

5.2. Lots of Comparators

To our knowledge, this is the largest collection of word comparison tools available in one place that work right out of the box, irrespective of the visualization system. This gives researchers the ability to quickly experiment with new semantic space algorithms without having to set them up and get them working (not an easy task).

5.3. Different Documents

Probably of more interest to non-researchers, users can train the system on their own set of documents. Various similarity metrics can then be used to discover new things about their own writing. Bigrams are easily displayed by the nGram similarity metric. Word relations that were not explicitly entered may become visible. For example, though you may never state in your paper that an item is bad, you may find that the semantic content shows it.

6. Use Cases

The next few sections describe some specific example uses of this visualization tool.

6.1. McRae Norms

In this use case, we explore the cosine similarity between the McRae feature norms (McRae et al., 2005). The McRae feature norms are a set of concrete nouns whose features have been manually described by subjects. A similarity can then be computed over the objects by comparing the feature sets of different objects.

In this example, the words in the McRae norms are the only ones selected. Edges are restricted to similarities from around 0.6 to 1 to show the predominant structure of the data. The data points have been laid out using MDS. It is clear that the words are organized into a handful of related clusters that do not connect strongly with each other. We have clusters of “plant”, “sea creature”, “bird”, “insect”, and “human made objects”.

Next, we wish to see how the McRae norm results line up with a new experiment, the semantic Pictionary project. In this experiment, subjects were asked to make representations of words using geometric shapes. The similarities were then computed on the shapes that were made for the words.

We can see that this new metric does not match up very well with the McRae norms, and we can identify some possible problem words like dove and chickadee (something we might expect, as these are probably hard objects for most people to represent or draw).

6.3. Wikipedia Dog

Here we look at a few different ways that one data set with one similarity metric can be represented. We learn the relations between words from the website http://en.wikipedia.org/wiki/Dog . We use the NGram similarity metric.

In the first visualization, we show the word-centered view for the word “the”. Since the NGram similarity metric does not have a very strong gradient, the words appear on two main rings around the word “the”.

In the next visualization, the word count is shown as the radius of the circles. Words are aligned to a grid.

Finally, we show the MDS layout with a different color scheme. Many things, like the background, can be changed through various options in the program. The MDS clustering is also very interesting with the NGram similarity, since the distinctions are very discrete.

7. Conclusion

We believe that we have provided a useful tool to both hobbyists and professionals interested in word relations. The code and executable programs will be made freely available at [website to be added later].

References

[1] McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547-559.
[2] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.
[3] Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[4] Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review.
[5] Reference to the word similarity metrics. Unsure how to do this since they are based on various papers but collected by Google Code's semantic spaces project.