>> Danyel Fisher: I'm Danyel Fisher from the VIBE group and an information visualization
researcher, and I'm pleased to be having a guest here today, Chris Collins, from the University of
Toronto. Chris works with --
>> Chris Collins: Gerald Penn.
>> Danyel Fisher: -- Gerald Penn, and Sheelagh Carpendale from the University of Calgary. He just completed an
internship at IBM Research, working with Martin Wattenberg and Fernanda Viegas' team of information
visualization researchers. And he's coming out here to talk both about some of his past research
and some of his summer work. Please welcome Chris.
[applause]
>> Chris Collins: Thank you very much. I was in the neighborhood. I thought I'd get in touch
with people working on similar things to me. This worked out very well. So thanks for having me.
So what am I going to talk about today? It's going to be a quick overview of my thesis research
that's currently in progress. I'm going to be done in about eight months, I hope.
It's one of those things where you never know, but it's around eight months from now. So some of
the projects that I'm presenting today are still in progress, and I would love to get your advice on
how to complete them.
So, of course, visualization has been around for a long time. People have been using visual
externalizations for many different purposes. So Galileo writing in his
notebook about the progressions of the moons of Jupiter. Alan Turing
using it for cipher analysis during the Second World War.
Today, it's been popularized even more when we're looking at text online. We've got lots of
examples. Things like looking at changes in word usage over the length of a
document from beginning to end.
This is one of my own projects I'll talk about today. And a lot of different uses of tag clouds on the
web where you can upload your own text and look at it.
But if we're just looking at a particular sentence, of course, we can just read this sentence and
understand it. This is one sentence from one of the letters from Emily Dickinson to her
sister-in-law, Susan Gilbert.
And we can understand a single sentence just by reading it for ourselves. But if people want to do
a deep literary analysis of one of these letters, you might want to take some notes. So you might
make a map of the different relationships between the
entities in this letter and how she was expressing different sentiments about each of these things.
And then we can move on to looking at several of the letters. Then what do you do? One of the
traditional techniques that has been used for many years now is to take, for example, a word of
interest and look at the concordance lines for that word. So line up a particular word and look
at the context words on the left and right of the word of interest.
More recently, visualization has been used for analyzing this kind of text. So this is a project from
the MONK Project, Catherine Plaisant's group, looking at understanding the sentiment in
these Emily Dickinson letters, the eroticism in the Victorian writing, using visualization
and computer analysis.
What am I getting at here? I'm talking about the fact that externalization, using both your internal
abilities and some external tools to perform cognitive tasks, helps you do some computational
offloading and actually do more with the text than you would be able to do without
the assistance.
So I've come up with a bit of a categorization for my thesis of how visualization has been used in
linguistic analysis. Starting with nontextual linguistic data, things like voice without
transcribing the text. Using linguistic and nonlinguistic data together, things like looking at
Wikipedia and the social network that goes with Wikipedia, and how those two things relate to one
another.
Using visualization for information exploration of documents and text collections. Actually using it
for deeper text analysis: literary analysis, linguistic analysis, legal analysis. And then also
actually surfacing some of the underpinnings of an NLP algorithm through an interface.
And I think this is a rough categorization, but we have a stronger relationship to the NLP
research as we move more to the right, and a greater sophistication in the linguistic
processing of the data.
Like I said, this is a cloudy thing. But what I'm going to talk about today are some projects I've
been working on in the last three of these areas.
So that's an introduction. We're going to go through the information exploration projects that I've
been working on, a little bit on linguistic analysis, one project on NLP interfaces, and then I'll
finish up.
So information exploration. The first project I want to talk about today is looking at document
visualization. We have so many documents online right now and a lot of our libraries are
becoming digital.
We used to be able to walk through the halls of a library and grab a book off the shelf, flip through
the pages and figure out what it's about. Now we're presented with something like this, where we get
a title, a picture that's relatively not very useful, and perhaps a listing of something like
25,000 search results.
What we're doing now is talking to people who are using these digital libraries and asking
them questions like: What do they dislike and what do they like about a regular library,
how can we bring those things into the digital library realm, and what
could be made easier or more enjoyable with existing interfaces?
One thing we've come up with in this area of document visualization is looking at the growing
number of text documents online, how we can do some analysis of those documents to help
people understand them in context.
So the questions we're trying to answer, then: what is the text particularly about, and what sections of
the text are relevant to you if you have an interest in that long document?
So the project we were working on is called DocuBurst. And we use some relatively simple
linguistic techniques to analyze the text of interest.
Starting with a book, we would take the words, and we would stem them. Take off the endings.
Games becomes game, taken becomes take. Then count the words, only nouns and verbs right now.
The reason we're only looking at nouns and verbs is because we then plug in the WordNet
ontology as a resource. So that's something that categorizes words such as game is a type of
activity. Chair is a type of furniture is a type of household object.
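To make that concrete, here's a minimal sketch of this kind of preprocessing in Python, using
NLTK's WordNet interface. It's an illustrative reconstruction, not the actual DocuBurst code, and
the exact NLTK data package names can vary by version.

    from collections import Counter
    import nltk
    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the tokenizer, tagger, and WordNet data.
    for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
        nltk.download(pkg, quiet=True)

    lemmatizer = WordNetLemmatizer()

    def noun_verb_counts(text):
        """Count stemmed nouns and verbs: 'games' -> 'game', 'taken' -> 'take'."""
        counts = Counter()
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("NN"):
                counts[lemmatizer.lemmatize(word.lower(), wn.NOUN)] += 1
            elif tag.startswith("VB"):
                counts[lemmatizer.lemmatize(word.lower(), wn.VERB)] += 1
        return counts

    # Print the hypernym path from WordNet's root down to 'chair':
    # chair is a type of furniture is a type of ... is an entity.
    chair = wn.synsets("chair", pos=wn.NOUN)[0]
    print(" -> ".join(s.lemma_names()[0] for s in chair.hypernym_paths()[0]))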
Using that hierarchy and those counts, then we create this picture of the context of the text.
And I'll explain a little bit about the details of what's in there. So the DocuBurst graph is a radial
space-filling graph that contains the particular term that you've asked for, the term of interest, in the
center. And then radiating outward, you have the more specific terms related to that.
So here we have insect is a type of arthropod is an invertebrate is an animal. The coloring on this
graph, then, represents the number of times that that word or that concept appeared in the text.
So the darker the color, the more times that concept occurs in the text of interest. Here we're
looking at a general science textbook from high school.
Then we've also got, on the side here -- it's hard to see because it's grayed out, underneath the
orange box -- a browser of the paragraphs of the document. We can do different things with
this. We can look at paragraphs, or we can do some text tiling and look at automatically detected,
roughly semantically similar segments of the document.
Then, of course, there are some picking controls in the user interface. So here's
a little video of how it works.
So if you want to actually select a term of interest, once you click that, you'll actually get a listing
of all the different paragraphs in which that term occurs. And then you can click one of those
particular paragraphs to get to the occurrences of that term.
So here's an example of what that looks like when we look under energy and search for the
word electricity. We can see pretty quickly now that electricity seems to be located in a chapter at
the end of the book, and slightly scattered throughout the rest of the book. There must be some
sort of physics chapter there. If we select one of the paragraphs, we get back the original
text, and we can see there are many occurrences of that term in a particular section of the book.
I just did a quick analysis of the 2008 Presidential Debate with this. I wasn't able to separate it
out into the different speakers which would be really interesting. This is the entire text thrown into
the system all in one piece.
So, of course, we see things like, if we look under substance, oil being really
prominent compared to -- I've actually removed a lot of the zero words here. So all the things that
would be light gray are taken out. So everything here is non-zero in the transcript.
This also reveals one of the problems with this system in that we can't do really good word sense
disambiguation on the level of WordNet senses.
So we've got oil being counted as a lipid, which is fine, but then it's also being counted here as a
synonym of edible fat. So there's a bit of a problem there. In the general
science textbook, we end up with things like hominid being really big, because the book talks about
man's relationship with the environment and things like this.
So one of the things we're looking at doing with some colleagues at another university is, instead
of using WordNet, using a coarser-grained ontology to build this graph.
Here we look at something like the types of conditions that are discussed, the human condition.
Healthcare is the most important one in the text of this transcript. If we look at the word "issue"
and the different senses of the word "issue," of course children are important. People are
saying, what about the children, this kind of thing. You have to say that at the debate or the
parents won't be very happy.
Of course, the key word of change. These are all the different change verbs that were mentioned
in the transcript.
So what we're looking at now is how we can bring this back and do some experiments on it. Can
we put this into something like a library interface where we have these small
glyphs and actually allow people to compare across them?
If you populate the graph with a particular term of interest, how does the coloration look across
different documents? And the impact we're hoping for here is something that will help you search
through large numbers of long documents. We're not talking about web pages
here.
Doing inter-document comparison across those glyphs. And also maybe helping some people do
multi-document analysis of the types of words that are particularly common to a particular author
or particular research paper, genre, whatever. So multi-document visualization.
Leading into that multi-document visualization. I want to talk a little bit about a project that I did
this summer with IBM. I was working with Martin Wattenberg and Fernanda Viegas from
Cambridge, I just came from there this week.
And instead of a particular document, we were looking at visualizing 628,000
documents at once. I know there's some interesting work that's been done here as well on this
kind of thing, the Blues project and other work.
We're looking at large volumes of text. We're not really interested in similarity and clustering.
What we want to do is help discern differences across different facets of text. If we subdivide the
text based on some characteristics of the documents, how do we compare one subset to
another?
So what we were looking at, for an idea, then, is the U.S. federal court decisions. And in that
collection we had a lot of messy data. We were just given raw text, the same raw text that
companies like Westlaw and LexisNexis take and put markup on. We had the
raw text and we did the same thing. There are thousands and thousands of errors, things like
dates being marked as courts, that kind of thing, where we had to try and fix those automatically.
There's obviously a business value to doing this kind of work, but we went with the free resource,
and we're planning to provide the data back to the community again, because ease of access to
these decisions is important for the country.
For me this was a challenge, because I had been looking only at the visualization
side before. Now suddenly we're looking at issues of scale with these 628,000 documents.
We've got the equivalent of something like 48,000 copies of Dante's Inferno, or, maybe more
familiar, 48,000 copies of my own master's thesis.
So the process we undertook here was basically dividing the internship up into two projects. The
first thing to do was to actually do the text analysis and try to come up with a database we were
going to use to drive this visualization. First we met with the legal experts. We worked a little bit
on acquiring the data and analyzing it. We made some initial designs, and we went back to our
colleagues at the Harvard Law School and talked to them about the visualization, did some
revisions, and now we're in the process of starting a more formal evaluation with a group of
lawyers.
For those of you who aren't from law school: I'm from Canada and I understand it, so it's going to be okay.
The U.S. court system is divided up into three levels: the Supreme Court, the Federal Courts of
Appeals, and the District Courts. We're looking in particular at the Supreme Court and the Federal
Courts of Appeals.
And the Courts of Appeals are divided into 11 circuits across the country, multi-state circuits,
plus the Federal Circuit, which deals with things like the patent cases, and the DC Circuit.
So we were then interested in dividing that up across the different circuits. We've got these
decisions going back to the 1750s, and then the Courts of Appeals only from the 1950s. They're
authored by a particular justice or a judge, and within that, others may concur or dissent with a
decision. So within a particular decision you may have multiple opinions.
And then there's also the issue of precedent, upholding or overturning cases. So what are we
interested in finding in this large set of documents? In the legal domain, we were interested
in something called forum shopping. People can bring a particular case -- for example, a
company like Microsoft or IBM that has offices in many different jurisdictions could choose to
bring their case in a jurisdiction that's more favorable to them.
And we were interested in trying to find out whether there were any differences across jurisdictions
in what cases are seen or heard in these areas.
Also, the idea of stare decisis. So what's the precedent. So keeping things the same over time.
And then agenda setting. What issues are happening in a particular jurisdiction and how do they
spread from that jurisdiction to another. So do we see transfer from the Supreme Court
downward or the other way around over time.
So the project that I'm going to show you right now was particularly looking at the forum shopping
issue. This is one of the cases we were looking at. And there are many different types of data
that we annotated out of this case. I don't need to go through all of them, but there are lots of
things we automatically detected from each of the cases.
From this we created a text analysis infrastructure where we can detect and separate the parts
of the text, using the same sort of simple techniques I described for the DocuBurst project, removing
stop words and stemming. We wanted to explore the characteristics of sets of documents while
always retaining quick access to the original text. This is something I really believe in: when
you're doing visualization of a massive text collection or a particular document, you really always need
to have a connection back to the original text.
The visualization is great for showing trends or themes, but people actually need to read the text
in the end. And there's a lot of work that's done that doesn't do this.
So then we have this infrastructure ready now to go ahead and start doing the visualization.
What did we use? We used a standard XML parser and regular expression processing to detect the
different sections of the documents. The dates were a mess. There were 15 or 16 different date
formats used across the courts, and we really needed to know the date of the case. One
of the hardest things was figuring out what date formats were used and
actually getting them into a consistent format.
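Here's a minimal sketch of that kind of date normalization in Python. The format list is invented
for illustration; the real corpus had many more variants than this.

    from datetime import datetime

    # A few hypothetical court date formats; the real data had 15 or 16.
    DATE_FORMATS = [
        "%B %d, %Y",   # "June 5, 1973"
        "%b. %d, %Y",  # "Jun. 5, 1973"
        "%m/%d/%Y",    # "06/05/1973"
        "%d %B %Y",    # "5 June 1973"
    ]

    def normalize_date(raw):
        """Try each known format; return an ISO 8601 date or None."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None  # flag for manual inspection

    print(normalize_date("June 5, 1973"))  # -> 1973-06-05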
And it also had to be fast. So we used Lucene, an industry-standard open source search engine,
to do search over the texts. And we also found some problems with Lucene, so I actually joined
the open source community for the first time and helped fix them. We extended it to allow
capturing stems: while we were doing indexing, we were doing stemming, and we can
reverse the stemming after the fact, because we didn't want to have raw stems appear in the
visualization.
Everything is stored in a Postgres database, and we precompute everything offline using
multithreading, and the result is that we have this infrastructure for handling large text
corpora.
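The reversible stemming idea can be sketched in a few lines of Python. The actual extension was
to Lucene, in Java, so this only shows the gist: index by stem, but remember the most frequent
surface form so the display never shows raw stems.

    from collections import Counter, defaultdict
    from nltk.stem import PorterStemmer

    stem = PorterStemmer().stem
    surface_counts = defaultdict(Counter)  # stem -> counts of original forms

    def index_token(token):
        """Return the stem for the search index, remembering the surface form."""
        s = stem(token.lower())
        surface_counts[s][token.lower()] += 1
        return s

    def unstem(s):
        """Map a stem back to its most common surface form for display."""
        return surface_counts[s].most_common(1)[0][0] if s in surface_counts else s

    for w in ["decided", "decide", "decide", "deciding"]:
        index_token(w)
    print(unstem(stem("decided")))  # -> 'decide', not the raw stem 'decid'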
We can do some analysis of this. Outside of the visualization, now that we have this data, we can start
looking at different statistical characteristics of the court system. This is of interest to the lawyers
we were working with. Things like the average time to make a decision: we have two dates for most of
the cases, the time it was heard and the time the decision was handed down. It was interesting to see
things like the Courts of Appeals having this huge jump in the 1990s and early 2000s in the amount of
time it actually takes to have your case decided. The lawyers were
saying that might reflect an overloading of the court system right now.
Other things that were well known to the legal community were also surfaced in this work. So, for
example, the average number of citations per case for the Supreme Court. It's well known that
during the time that Warren Buffett was the Supreme Court Justice --
>>: Warren Burger.
>> Chris Collins: Sorry, Buffett is the really rich guy. Thank you very much. So Justice Burger
was the chief justice at this time. And under his --
>>: Warren.
>> Chris Collins: Warren, that's what I'm looking for. He was the chief justice during this time. And
under his reign people were citing fewer of the previous cases, because they were changing the law
at the time. Thank you, it was Chief Justice Warren.
For example, taking a particular term of interest and looking at it across all the different courts: we
were interested in the Roe v. Wade case, the classic abortion case in the Supreme Court, which was
considered to have brought the discussion of personal privacy into the legal system. We actually
found there was an increase in the discussion of privacy in the years leading up to the Roe v. Wade
case. So that was interesting to see.
Given the statistical background and these analyses we can do, which were sort of fun, now we're
going to go forward and look at the differences in the corpus. If we divide the corpus across
one of these meta facets -- across the courts, across the authoring judge, over time, over
particular years -- can we see this issue of forum shopping coming up?
The measure we used is called the Dunning log-likelihood ratio. You're probably familiar
with it. It's a standard chi-squared-style measure of word frequency: we're looking at the expected
occurrence of the term in the document, the reference documents, and the overall set of texts, and
how the observed counts differ -- what's the probability that the ratio is different from the expectation?
So that actually gives us a significance value that we can use in our visualization.
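As a sketch of that measure, here's the common two-corpus form of the Dunning log-likelihood
(G-squared) statistic in Python, comparing a term's frequency in one circuit against the rest of
the corpus. The counts in the example are invented.

    from math import log

    def dunning_g2(count_sub, total_sub, count_ref, total_ref):
        """G^2 log-likelihood ratio. Large values mean the term's frequency in
        the subcorpus differs significantly from expectation (approximately
        chi-squared distributed with one degree of freedom)."""
        # Expected counts if the term were spread evenly across both corpora.
        e_sub = total_sub * (count_sub + count_ref) / (total_sub + total_ref)
        e_ref = total_ref * (count_sub + count_ref) / (total_sub + total_ref)
        g2 = 0.0
        if count_sub > 0:
            g2 += count_sub * log(count_sub / e_sub)
        if count_ref > 0:
            g2 += count_ref * log(count_ref / e_ref)
        return 2 * g2

    # e.g. "ostrich": 40 of 1M tokens in one circuit vs 5 of 10M elsewhere.
    print(dunning_g2(40, 1_000_000, 5, 10_000_000))  # counts are made up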
And then we took out stop words dynamically. So we have a short stop word list, but then we
also took out the most common and least common words from the list.
So in our visualization design, we show the top N words of a particular circuit
by the Dunning log-likelihood score. The size is then the significance value:
the probability that the score is different from the expectation, and only higher than expectation.
The order is alphabetic, and there's a reason for that that I'll show in a moment. We also
show edges, which I'll show in the next slide. There's one column per circuit. And the color is random,
to allow for differentiation across the different words. It's random within a color scheme so it
doesn't look too crazy.
Given a particular court, we found, as we would have expected, the
justices' names appearing strongly, and place names. Of course, these are things that
differentiate one circuit from another, right? Because that judge works there, or it's in that
particular district. So we had to do some named entity recognition and remove those things.
We knew we were getting the right thing. We took all those out, and we got something
that was a bit more interesting in terms of its contentfulness.
So this is what it looks like. It is a little bit of a mess, but it's interactive and you can
explore it. On a larger screen it's more usable. What we have here, then, is the 1st to the 11th
Circuit. I've removed the DC and Supreme Court from this view because that works a little bit
better in a wide screen format.
So what did we find? We found things like all these drug terms surfacing. I was surprised,
but people from the U.S. weren't, to see
this interesting trend of the amphetamine-type drugs, pseudoephedrine and
methamphetamine, showing up in cases in the west, and more prevalence of marijuana and
narcotics up in the New York region.
So we're seeing -- this visualization now is a reflection of the cultural differences or behavioral
differences across the country. So this video will show how the system works. And we'll see
some of that. So here we're mousing over the particular circuits, and the edges that are shown
are connecting words that are appearing in more than one circuit.
So they're significantly differentiating in more than one circuit. Copyright is obviously in the New
York circuit, and this is a tool tip that shows the particular score, the score details, for every one
of the particular columns. And in this case copyright was only significant in the Second Circuit.
We can see that the 4th Circuit is strongly connected to the 6th here with many, many lines
connecting across them.
If we go up and explore that, we can see things like the words pulmonary and pneumoconiosis.
They're both strong and have the same profile, here as well, across the 4th
and 6th Circuits. If we select a particular term, then, we actually get back these document
glyphs that show -- pause that for a second -- the documents that the term occurs
in. So the word "coal" appears in many, many documents, and it occurs
more in some documents than in others. The larger the bubble, the more occurrences of that word in the
document.
So then I clicked the 4th circuit here and activated that. Over here, now, all of the terms that
are -- all of the cases that are from the 4th circuit that contain the word "coal" are highlighted in
red.
Okay. So we can see that this one large case is not from the 4th circuit. It's from the 10th circuit
because the circuit lights up when you mouse over it.
If I go then and choose this, which is the black lung disease that coal miners get, we can see that
there are many cases that contain both terms, this little pie chart, but not this large case. But the
cases that do contain both terms like this one are from the 6th or 4th circuit, which are areas
where coal mining is common.
>>: [Inaudible].
>> Chris Collins: Yes. So one of the things we thought about doing was trying to -- well, the first
thing I did was actually order them by the score, so you'd have the strongest things on the top,
going downward.
What we ended up with then is that these edges became a lot more criss-crossed, because we're actually
trying to show differences, so the columns are often different. These edges actually represent
when the difference in each circuit from the full set of documents is the same.
So, for example, this word occurs oddly highly in both of these circuits. Another thing we thought
about doing was trying to reorder them to minimize the edge crossings.
That's something that we didn't get around to doing, but it's certainly
something we should try.
It's also questionable whether or not the edges are actually useful for anything, so we're also
considering just turning them off altogether.
I think what's most interesting with the edges is that all people want to see is when something
occurs in more than one circuit. We have the edges faded out in the background when you're not
interacting with them, because they were so messy, so they're probably not that useful there. But when
you actually mouse over something, it is important to be able to see that edge appear.
So, the last thing I want to show here -- I'm just going to deselect those terms. I was
showing you before that we can look at things like the drugs.
So here are some of the drug terms. If you spend some time, you can read the
detail about what the scores are and the particular normalized term frequency for each of those
terms across the different circuits. We actually computed the scores in two ways:
by occurrence counts, the number of times a particular term occurs in any document; and by
the number of documents that it occurs in at least once.
That gives us a subtle difference. We end up able to discern a word that
occurs really strongly in one document that's really long: it might show up really highly in
the occurrence score, but since it only occurs in one document, it won't show up at all in the
cases score.
We actually found they showed different enough things that we provide both. So the first
row here is the scores based on the number of cases that the term occurs in at least once, and
the second row is based on the total number of occurrences overall.
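Here's a small Python sketch of those two counting schemes, with invented documents, showing
how one long case can dominate the occurrence score without affecting the cases score.

    from collections import Counter

    def occurrence_and_case_counts(documents):
        """documents: list of token lists. Returns (occurrences, cases)."""
        occurrences, cases = Counter(), Counter()
        for tokens in documents:
            occurrences.update(tokens)       # every mention counts
            cases.update(set(tokens))        # each term at most once per case
        return occurrences, cases

    docs = [["coal"] * 50, ["coal", "pulmonary"], ["pulmonary"]]
    occ, cas = occurrence_and_case_counts(docs)
    print(occ["coal"], cas["coal"])  # 51 occurrences, but only 2 cases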
And many times they're very similar. So here we have, for example, in the 8th and 9th circuit and
the 10th circuit there's a significant value for the score for this term methamphetamine in both the
cases and the occurrences.
I've actually selected that now, so we can see how many cases there are.
Methamphetamine is a very popular thing to talk about in the courts.
So we turn that off again. And the last thing I wanted to show you is we can actually change the
time range pretty quickly. This is from 1990 to 2000 that we were looking at right now. I'm going
to change that to 1980 to 1990 and look at that.
So the red here now marks things that are new, that didn't occur before. We have things
like this word homosexual -- I think the mouse is going to go up there -- a social
issue that was discussed more in the '80s than it was in the '90s, or things like marijuana in the
11th Circuit, which doesn't appear in the later time period, but in the earlier time period it was
important. Whereas methamphetamine doesn't occur at all back here in the '80s.
Also, when we showed this to the lawyers, they found some interesting things like this word here
"ostrich", and they were like why is ostrich showing up in the 7th circuit and only the 7th circuit? It
barely occurs at all in the other circuits.
And it turns out that ostrich is a particularly interesting legal term, where they use an argument
called the Ostrich Argument, which means they ask the jury to put their head in the sand and
ignore what they just heard. So we can actually click one of those documents and try to
figure out where the word actually occurs, and we see here it's the ostrich instruction that they're
talking about. Because this is such a widely known legal term, it
was really interesting for our lawyer colleagues to see that it was so strongly used in the 7th
Circuit versus the others. And it turns out that it's not only strongly used in the 7th Circuit, it's
strongly used in the 7th Circuit by all the different judges. It's not just one particular judge who
likes to use that term.
>>: [Inaudible] are you able to correlate all these terms to actual decisions?
>> Chris Collins: Not right now.
>>: Decisions, mistrials.
>> Chris Collins: We don't have that right now. We do have the outcomes of the decisions, and
we do have the terms. So we could, but there's no overt way to do that right now.
So one interesting thing I wanted to do would be to look at the dissenting opinions
versus the main opinions and what kind of
language is used in the dissenting versus the main opinions.
All that would really mean is dividing up the corpus across a different facet of the data:
instead of the columns being the circuits, the columns would be the different types of opinions.
Again, we could also look at things like mistrial versus not a mistrial. So, on to some other work I've
been doing in the area of linguistic analysis; that was my work in information exploration. Many of
you have probably published in the ACL proceedings, the computational linguistics conference.
We actually gave a tutorial there this year, myself and my supervisors, on using
visualization for computational linguistics. And before that I did a little bit of a review. In
the ACL proceedings, we found that 42 percent of the papers we looked at from 2006 had what
we considered to be novel visualizations, beyond just parse trees and matrix diagrams.
These were really interesting things that were made for the presentation of the results.
And an additional 29 percent had parse trees and these standard CL techniques for explaining
data.
And for the most part it seemed like visualization was used for presenting the results and not
actually for doing the research. So here are some examples: some really interesting kinds of
drawings, a standard matrix diagram, and these really crazy antecedent-precedent diagrams.
So we're working right now to continue to understand how the computational linguistics
community is using visualization, by doing an ethnographic study of a group working at the
University of Southern California. We're actually going there to try to understand how they
work with and without information visualization in their existing environment.
This gets a little bit toward the work that I think the VIBE group does here, actually trying to
understand work in context. How does the team work together, and what are the specific nuances
of their information use versus what we would have expected before actually going down there?
So I actually did do that. We're interested here in trying to make a holistic description of their
use of information. So I went down there and followed them around. I actually did a
little bit of training in social science methods before I went, just informally, with some colleagues who
use these techniques more often than I do.
We did some contextual interviews. I actually participated in the types of analysis they do on their
machine translation -- this is a phrase-based machine translation research group -- so they
actually put me to work on some of their things.
I went around and looked at their whiteboards and notebooks and things like
this to try to understand. We transcribed those interviews, and I used a technique called open
coding to come up with a code set that we then used to go through the interviews
and code where we saw a particular thing.
So we were interested in things like what triggered the use of a visualization, if they used one.
Was it only for one person to use or did they share it with their team. Was it used only once or
many times?
If it was only used once, why was it only used once? For these kinds of
questions we were tagging our transcripts for those kinds of indications.
We came up with this interesting, surprising finding, where we saw way more visualization in use in
the research process than we had expected. We're calling this visualization in the wild. And
I don't think the InfoVis community realizes this is actually happening in scientific communities
outside of InfoVis.
Some of the visualizations they had were in common use. They had large binders of
printed visualizations of parse trees that they would go through by hand and annotate, to try to
figure out where the problems were happening that were causing the translation to go off.
They were designing interactive ad hoc visualizations that they were using at their own desks, and
also, rarely, sharing those things with their colleagues.
There was a real periodicity to the analysis process. So there's a lot of coding, a lot of refining of
the algorithms then they'd do a big run of everything and check everything. So, of course, this is
just one particular research group. But I think it gives us a little bit of an interesting insight into
how the computational linguistic research process works.
And then when I would ask them questions like have you ever used an information visualization,
they would say no, I've never -- I don't even -- they don't use that kind of terminology.
So it was really interesting to open my mind to that. I mean, I've come from both
backgrounds. My master's is actually in computational linguistics, but I've been away from it long
enough now that I've been steeped in this other realm.
And it's so interesting for me to see that what's being done in these groups is actually what would
be considered cutting-edge InfoVis research, but it's just ad hoc stuff that they do so they
can do their work.
What we're interested in now is trying to help them do that work with more interactive
visualization. So we saw that idea of external visualization that I brought up in the
beginning.
If they're doing things in a simple way, they're analyzing it themselves; there's some
sketching going on on whiteboards when things become a little bit more complex.
Then, when it becomes really large, they make these printouts for their binders that they
analyze on a periodic basis.
So that's that work. It's ongoing. I'm actually in the process right now of developing a prototype
that will go back down to them to address some of the things that I observed there. And then
we'll go back and do a longitudinal study of how they used it in the end.
The second project related to working with linguists that I have done is something that some of
you may have seen at the InfoVis conference a couple of years ago. I know it's highly related to
Tim's work here in the back, so credit to him for that.
So we were looking at trying to compare word similarity measures. So a particular pair of words
may have a score that says how similar they are. There's many different types of these scores.
They can be based on the WordNet ontology I was talking about earlier, so some form of
structure over language. They can be based on the word usage patterns. So that's the kind of
thing that we were interested in.
From the visualization side, we felt that the use of space was the most powerful visual dimension
we have to work with.
So we came up with an idea to allow for the reuse of the spatial visual dimension. As an
overview: the project is called VisLink, and it links multiple two-dimensional visualizations in a
restricted-interaction 3-D space.
Adjacent planes are connected by these bundles of edges that connect a particular word that occurs on one
plane to a word that occurs on the other.
The idea of working in 3-D was really intimidating, because I've really been
quite negative about 3-D most of the time. I think a lot of the time the mouse is not the
right input device to use for a 3-D display, and that's what we were working with. So we tried to restrict
the view and restrict the interaction to some simple shortcuts that provide the preferred views
of the space. You can look at it from the front, or the top, or the side, but free navigation is
discouraged.
Existing techniques in this field of looking at multiple different layouts of visualizations:
we can do things like looking at them side by side, printed or on the screen. Interactive linking:
if you mouse over a term in a particular visualization, you would see the occurrence of that term on the
other visualization highlighted.
Here we're looking at things like: you can have a social network graph of who is friends with whom in the
company, and on this side you could have the formal hierarchy of the organization.
So the same underlying data, but different structures. And you can also mash them both
together, doing things like taking one of those structures, drawing the graph, and just drawing the
other one on top of it.
The system we developed actually allows us to do all of these things. We have these
planes and we can move them around. We can have two planes side by side, interactively
linked; we can put them on top of each other and mash them together; and we can pull them apart
accordion style and look at the links in between.
To build that, we took pre-existing interactive two-dimensional visualizations of word
similarity that we already had. So this is the WordNet ontology, and this is a force-directed
layout based on a word similarity measure.
We threw those into that 3-D space and connected them up. So what we're interested in
looking at, then, is the pattern of the lines between the two visualizations. If we see a lot of
criss-crossing, we're expecting to see things that occur as neighbors on one side not
occurring as neighbors on the other side -- a difference in the similarity scores.
So the lexical example we looked at is the hierarchy and the similarity measure, and the
differences between them. You can query now across the visualizations. From this side, if
we selected a particular term -- there's no synonymy information here. So if we want that information
from the other side, we have to do an interactive query.
We have alphabetic organization on one side and synonym information on the other; how can we get from one
to the other and provide that information on the other side?
First we would select a term on the clustered side, propagate an edge over to the side that
actually knows which words are synonyms of which other words, and then take all those synonyms
and propagate the edges back to the other side.
So whatever lights up on the first plane now is a
synonym according to the other side.
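Here's a tiny Python sketch of that round-trip query, using NLTK's WordNet as the synonymy
side. The plane data structures in the real VisLink system are richer than a plain word list;
this just shows the propagation idea.

    from nltk.corpus import wordnet as wn

    def propagate_synonyms(term, terms_on_first_plane):
        """Return the terms on the first plane that WordNet lists as synonyms
        of `term`: select on one side, query the other, propagate back."""
        synonyms = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                synonyms.add(lemma.replace("_", " ").lower())
        return synonyms & {t.lower() for t in terms_on_first_plane}

    plane_terms = ["car", "automobile", "truck", "engine"]
    print(propagate_synonyms("car", plane_terms))  # -> {'car', 'automobile'}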
So here's a short video illustrating that technique. If we look at a particular synset and we
select it, the edges propagate back and forth. And this is a little bit
of detail on the view that you can get: we always provide an equivalent 2-D view. This is
the equivalent 2-D view of the first side.
We can see this term is highlighted even though we didn't select it. We selected this one, but the
edges have been propagated over, and everything on this side that matched was propagated
back.
So we're able to sort of query the two visualizations simultaneously by clicking on one and using
the relationships between them.
The work that we're continuing to do on this now is actually with Ted Pedersen at UMD
and his postdoc Saif Mohammad, to try to apply this to a library he's created
that collects a lot of different word similarity measures together: how can we apply that library
to visualize and summarize the differences across those different measures?
And then we've had some interest in applying this to other types of linguistic and nonlinguistic
data, like comparing different parse tree algorithms across the space.
Finally, this is something that I was telling Michael about in the hallway earlier this morning.
We're looking at taking existing NLP algorithms and surfacing what's going on underneath them
to try and help understand how they're working in the end.
So, for example, looking at the uncertainty. So if we know we have a statistical output, how can
we know whether or not we should trust it?
So you're all familiar with examples like this. The avant garde movie the press has spoken about
has been defamed by the critics in spite of an original advertising campaign.
And I am from Canada, but my French is terrible, so we will just look at this. If you know
French, you can see that there's an ambiguous term: the word "press" has been translated
incorrectly here. And then we get the English translation back: we have "the film of avant garde
that the pressure spoke approximately was defamed." So we get this jumbled mess, and how do
we know if this is correct or not? Because most of the time all we see is just this one best output.
So our idea was to build on research by Stacey Scott in the tabletop
community on decision making, and think about how we can put the human into the loop of
this decision making.
Traditional statistical processing systems just give us one single hypothesis quickly. But then
we have this issue of quality. So we're thinking about whether we can put a human back into that
decision-making process, so that a few options are presented to the end user and they can actually
recognize which one is correct using their previous knowledge of the language.
So here's an illustration of this. Of course, many of the people here will be familiar
with these kinds of systems. Some training data, some tuned parameters; you get a particular
input and then you get out one specific output.
What's going on inside that black box? Many of these algorithms use something like a lattice
diagram to track the hypotheses the system has about solutions over the entire
solution space. So how can we expose those internal workings and provide some sort of
uncertainty-annotated output to the end user? What we came up with was literally looking at the
lattices that underlie the translation system.
One of these lattices may encode several different possible hypotheses about
the translation. So, for example, "speculation in Tokyo was the yen could rise because of the
realignment." Then, if you take another path from left to right in the same diagram, you get a
slightly different hypothesis about the translation.
And, again, the same thing. So you have many, many different hypotheses encoded into this diagram.
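Here's a minimal Python sketch of such a lattice as a directed acyclic graph, with invented
nodes and log-probabilities. Each left-to-right path is one hypothesis, and the path scores are
the kind of confidence values the uncertainty rendering draws on.

    from heapq import nlargest

    # lattice[node] = list of (next_node, phrase, log_prob) edges; made-up data.
    lattice = {
        "s": [("a", "speculation", -0.1)],
        "a": [("b", "in", -0.2)],
        "b": [("c", "tokyo", -0.1)],
        "c": [("d", "was", -0.9), ("d", "is", -0.4)],
        "d": [("e", "the yen could rise", -0.5),
              ("e", "the end could rise", -1.2)],
        "e": [],
    }

    def all_paths(node, words=(), score=0.0):
        """Enumerate every hypothesis (word sequence, total log-prob) in the DAG."""
        if not lattice[node]:  # reached the final node
            yield " ".join(words), score
        for nxt, phrase, lp in lattice[node]:
            yield from all_paths(nxt, words + (phrase,), score + lp)

    # Print the three most probable hypotheses, best first.
    for sent, score in nlargest(3, all_paths("s"), key=lambda p: p[1]):
        print(f"{score:7.2f}  {sent}")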
So we looked at different types of uncertainty mappings that we could use to give an intuitive
idea of uncertainty. Two that we came up with were a fuzzy border, where the
fuzzier the border, the more uncertain the algorithm was about the underlying term, and a cloudy
rendering, similar in its design, in that the lighter the color, the more
uncertain, and the tighter and darker, the more certain.
We found that people were saying this one was more intuitive, but there was a readability issue
with the gradient causing some problems with reading, so we ended up going with the top
one.
So there's an example. We embedded this into a translation system, and when
there were out-of-vocabulary words in the translation, we added a little bit extra, where we went out
and searched for photos from the web that represented that term. So here, for example, the word
Banff wasn't in the translation dictionary, so it actually found some pictures of skiing and that kind
of stuff.
So there's that. We trained on the Europarl corpus. It was a very small training set, so
we had a lot of uncertainty: a million sentences in Spanish, French, English and German. And we
used an open source decoder, which we modified just to allow us to actually see those lattice
translation scores.
And there's the chat client. We actually provided this at the CSCW conference in 2006 as a
chat client that people could use to talk to one another across languages. So one person could
write in German and the other in English, and what you would see as the result would be one of
these lattices, where the translation that was most preferred by the algorithm is shown across the
bottom, and alternative translations are shown above.
And you can click any particular path to relocate this green line, which shows what's
preferred. The path that you know is correct, because you speak the language and you
know the other one's ridiculous, is what's placed in the transcript that's recorded
after the fact.
Speech recognition often uses a similar sort of algorithm with the same sort of data
in the background. So we applied this to 213 lattices from a speech recognition corpus, which were
pruned to show only the 50 best paths.
And we removed all the nulls and silences, so this was completely decoupled from the
speech recognition signal; we were given the data and used it with a decoder. And we found
some interesting differences in this example. In the machine translation project, we saw large
segments, large phrases, that were uncertain, with many alternatives for them. In the
speech case, we saw that short words with very little vocal power were really
uncertain, so there were many different options provided for those particular words, and
there were fewer long-distance uncertainties.
We're just finishing this project up now. We're looking at visualizing the
uncertainty not on the nodes themselves but on the edges
between the nodes. So how do you trace a path through the lattice instead of through
particular nodes? I think that may help people make a better decision at a decision point
where there's more than one edge leaving a particular node.
>>: So did you find that users actually display all the apparent -- it should be apparent to the
user. Do they need the confidence of the system to help them?
>> Chris Collins: We didn't do that evaluation, so I can't say with any confidence whether or not
we found that. This was sort of a proof-of-concept kind of idea, and I think it still requires a more
thorough user evaluation. I expect what we'll see is probably that this presentation is
better than using a list of alternatives. But you're right, I'm not sure if actually
showing that uncertainty is necessary. It might be enough to show the graph itself.
>>: Do you have a limit on the number of [inaudible]?
>> Chris Collins: Yeah, 50. 50 paths.
>>: 50?
>> Chris Collins: 50 paths total. Yeah. So 50 paths is often a very small graph.
>>: A graph.
>> Chris Collins: 50 paths.
>>: So a lot of these alternatives aren't always interesting. Did you do any sort of attempt to pull
out just [inaudible] or...
>> Chris Collins: No, we didn't, but that's an interesting idea. Because sometimes you'll have
shake and shook or something like that. Yeah, we didn't do that. But that's cool. I didn't think
about doing that.
So what have I shown you today? We've looked at the types of visualization that I've been
doing in my thesis research, across three of the five areas that I've defined for
linguistic visualization. In particular, the projects that I'm still continuing to work on are here in the
text analysis realm with the group at ISI, and finishing up this project that I've just done with
IBM.
I'd be happy to talk more about this stuff with anybody who wants to, because I am
starting to write my thesis now. So here are some ideas I have for future plans, because I am finishing up
soon.
So I think there are two directions we can go here. First, there's CL expertise that can be leveraged
by the InfoVis community, for example on this document visualization problem. The
InfoVis community is interested in this, and people on the web go crazy for it. One of my colleagues
at IBM created a tag cloud tool called Wordle. It had 100,000 documents visualized within the first
few days, and it got coverage on television. It was just crazy. And all it was was a tag cloud,
word counts.
So we could do things like incorporating word sense disambiguation, or looking at a document
and comparing it to a document in another language, or asking how a document differs not only in
the words that are in it, but in the meanings of those words.
That's the problem with the DocuBurst project: it only looks at the surface
forms of the words and not their meanings.
So this gives us a lot of opportunity for the two communities to work
together. And then the reverse: the InfoVis community can look at things like helping the
computational linguistics and NLP communities with tools for corpus control, and
investigating things like annotator agreement in the training data.
Things like showing, if I turn this parameter in my machine translation algorithm, how much it is
going to change the result, in advance of actually having to do it. And there has been some
work in the InfoVis community on things called scented widgets, where we
have these interactive widgets that give a preview of what might happen if
you actually changed the parameter.
I think if those two things came together, we could do some better exploratory data analysis, which is
actually a growing area within the NLP community.
And then more work on the idea of understanding what's actually happening in these black-box
NLP systems: looking at how an [inaudible] is working when we're doing dialogue system
construction or nondeterministic analysis; the uncertainty in these kinds of parametric models that
we use on a regular basis; how the decisions are made when we're doing chart parsing or a
beam search in a parsing system, things like that.
So I'm actually interested in discussing this, and we've started a page on the InfoVis
wiki to try to open this discussion between the two communities, so we can talk about how
visualization and computational linguistics and natural language processing can benefit from one
another. And through the tutorial we offered this year, we've actually started to have a bit of
discussion with people from the computational linguistics community.
So that's it. Thank you for your attention.
[Applause]