>> Danyel Fisher: I'm Danyel Fisher from the VIBE group and an information visualization
researcher, and I'm pleased to be having a guest here today, Chris Collins, from the University of
Toronto. Chris works with --
>> Chris Collins: Gerald Penn.
>> Danyel Fisher: -- Gerald Penn, and Sheelagh Carpendale from the University of Calgary. He just completed an
internship at IBM Research, working with Martin Wattenberg and Fernanda Viegas' team of information
visualization researchers. And he's coming out here to talk both about some of his past research
and some of his summer work. Please welcome Chris.
[applause]
>> Chris Collins: Thank you very much. I was in the neighborhood. I thought I'd get in touch
with people working on similar things to me. This worked out very well. So thanks for having me.
So what am I going to talk about today? It's going to be a quick overview of my thesis research
that's currently in progress. I'm going to be done in about eight months, I hope.
It's one of those things where you never know, but it's around eight months from now. So some of
the projects that I'm presenting today are still in progress, and I would love to get your advice on
how to complete them.
So, of course, visualization has been around for a long time. People have been using visual
externalizations for many different purposes. So Galileo writing in his
notebook about the progressions of the moons of Jupiter. Alan Turing
using it for cipher analysis during the Second World War.
Today, it's been popularized even more when we're looking at text online. We've got lots of
examples. Things like looking at changes in word usage over the length of a
document from beginning to end.
This is one of my own projects I'll talk about today. And a lot of different uses of tag clouds on the
web where you can upload your own text and look at it.
But if we're just looking at a particular sentence, of course, we can just read this sentence and
understand it. This is one sentence from one of the letters from Emily Dickinson to her
sister-in-law, Susan Gilbert.
And we can understand a single sentence just by reading it for ourselves. But if people want to do
a deep literary analysis of one of these letters, you might want to take some notes. So you might
make a map of the different relationships between the
entities in this letter and how she was expressing different sentiments about each of these things.
And then we can move on to looking at several of the letters. Then what do you do? One of the
traditional techniques that has been used for many years now is to take, for example, a word of
interest and look at the concordance lines for that word. So line up a particular word and look
at the context words on the left and right of the word of interest.
More recently, visualization has been used for analyzing this kind of text. So this is a project from
the MONK Project, Catherine Plaisant's group, looking at understanding the sentiment in
these Emily Dickinson letters, the eroticism in the Victorian writing, using visualization
and computer analysis.
What am I getting at here? I'm talking about the fact that externalization, using both your internal
abilities and some external tools to perform cognitive tasks, helps you do some computational
offloading and actually do more with the text than you would be able to do without
the assistance.
So I've come up with a bit of a categorization for my thesis of how visualization has been used in
linguistic analysis. Starting with nontextual linguistic data, things like voice without
transcribing the text. Using linguistic and nonlinguistic data together, things like looking at
Wikipedia and the social network that goes with Wikipedia, and how those two things relate to one
another.
Using visualization for information exploration of documents and text collections. Actually using it
for deeper text analysis: literary analysis, linguistic analysis, legal analysis. And then also
actually surfacing some of the underpinnings of an NLP algorithm through an interface.
And I think this is a rough categorization, but we have a stronger relationship to the NLP
research as we move more to the right, and a greater sophistication in the linguistic
processing of the data.
Like I said, this is a cloudy thing. But what I'm going to talk about today are some projects I've
been working on in the last three of these areas.
So that's an introduction. We're going to go through the information exploration projects that I've
been working on, a little bit on linguistic analysis, one project on NLP interfaces, and then I'll
finish up.
So information exploration. The first project I want to talk about today is looking at document
visualization. We have so many documents online right now and a lot of our libraries are
becoming digital.
We used to be able to walk through the halls of a library and grab a book off the shelf, flip through
the pages and figure out what it's about. Now we're presented with something like this, where we get
a title, a picture that's relatively not very useful, and perhaps a listing of something like
25,000 search results.
What we're doing now is talking to people who are using these digital libraries and asking
them questions like: What do they dislike and what do they like about a regular library,
how can we bring those things into the digital library realm, and what
could be made easier or more enjoyable with existing interfaces?
One thing we've come up with in this area of document visualization is looking at the growing
number of text documents online, how we can do some analysis of those documents to help
people understand them in context.
So the questions we're trying to answer, then: what is the text particularly about, and what sections of
the text are relevant to you if you have an interest in that long document?
So the project we were working on is called DocuBurst. And we use some relatively simple
linguistic techniques to analyze the text of interest.
Starting with a book, we would take the words, and we would stem them. Take off the endings.
Games becomes game, taken becomes take. Then count the words, only nouns and verbs right now.
The reason we're only looking at nouns and verbs is because we then plug in the WordNet
ontology as a resource. So that's something that categorizes words such as game is a type of
activity. Chair is a type of furniture is a type of household object.
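To make that concrete, here's a minimal sketch of this kind of preprocessing in Python, using
NLTK's WordNet interface. It's an illustrative reconstruction, not the actual DocuBurst code, and
the exact NLTK data package names can vary by version.

    from collections import Counter
    import nltk
    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the tokenizer, tagger, and WordNet data.
    for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
        nltk.download(pkg, quiet=True)

    lemmatizer = WordNetLemmatizer()

    def noun_verb_counts(text):
        """Count stemmed nouns and verbs: 'games' -> 'game', 'taken' -> 'take'."""
        counts = Counter()
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("NN"):
                counts[lemmatizer.lemmatize(word.lower(), wn.NOUN)] += 1
            elif tag.startswith("VB"):
                counts[lemmatizer.lemmatize(word.lower(), wn.VERB)] += 1
        return counts

    # Print the hypernym path from WordNet's root down to 'chair':
    # chair is a type of furniture is a type of ... is an entity.
    chair = wn.synsets("chair", pos=wn.NOUN)[0]
    print(" -> ".join(s.lemma_names()[0] for s in chair.hypernym_paths()[0]))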
Using that hierarchy and those counts, then we create this picture of the context of the text.
And I'll explain a little bit about the details of what's in there. So the DocuBurst graph is a radial
space-filling graph that contains the particular term that you've asked for, the term of interest, in the
center. And then radiating outward, you have the more specific terms related to that.
So here we have insect is a type of arthropod is an invertebrate is an animal. The coloring on this
graph, then, represents the number of times that that word or that concept appeared in the text.
So the darker the color, the more times that concept occurs in the text of interest. Here we're
looking at a general science textbook from high school.
Then we've also got, on the side here -- it's hard to see because it's grayed out, underneath the
orange box -- a browser of the paragraphs of the document. We can do different things with
this. We can look at paragraphs, or we can do some text tiling and look at automatically detected,
roughly semantically similar segments of the document.
Then, of course, there are some picking controls in the user interface. So here's
a little video of how it works.
So if you want to actually select a term of interest, once you click that, you'll actually get a listing
of all the different paragraphs in which that term occurs. And then you can click one of those
particular paragraphs to get to the occurrences of that term.
So here's an example of what that looks like when we look under energy and search for the
word electricity. We can see pretty quickly now that electricity seems to be located in a chapter at
the end of the book, and slightly scattered throughout the rest of the book. There must be some
sort of physics chapter there. If we select one of the paragraphs, we get back the original
text, and we can see there are many occurrences of that term in a particular section of the book.
I just did a quick analysis of the 2008 Presidential Debate with this. I wasn't able to separate it
out into the different speakers which would be really interesting. This is the entire text thrown into
the system all in one piece.
So, of course, we see things like, if we look under substance, oil being really
prominent compared to -- I've actually removed a lot of the zero words here. So all the things that
would be light gray are taken out. So everything here is non-zero in the transcript.
This also reveals one of the problems with this system in that we can't do really good word sense
disambiguation on the level of WordNet senses.
So we've got oil being counted as a lipid, which is fine, but then it's also being counted here as a
synonym of edible fat. So there's a bit of a problem there. In the general
science textbook, we end up with things like hominid being really big, because the book talks about
man's relationship with the environment and things like this.
So one of the things we're looking at doing with some colleagues at another university is, instead
of using WordNet, using a coarser-grained ontology to build this graph.
Here we look at something like the types of conditions that are discussed, the human condition.
Healthcare is the most important one in the text of this transcript. If we look at the word "issue"
and the different senses of the word "issue," of course children are important. People are
saying, what about the children, this kind of thing. You have to say that at the debate or the
parents won't be very happy.
Of course, the key word of change. These are all the different change verbs that were mentioned
in the transcript.
So what we're looking at now is how we can bring this back and do some experiments on it. Can
we put this into something like a library interface where we have these small
glyphs and actually allow people to compare across them?
If you populate the graph with a particular term of interest, how does the coloration look across
different documents? And the impact we're hoping for here is something that will help you search
through large numbers of long documents. We're not talking about web pages
here.
Doing inter-document comparison across those glyphs. And also maybe helping some people do
multi-document analysis of the types of words that are particularly common to a particular author
or particular research paper, genre, whatever. So multi-document visualization.
Leading into that multi-document visualization. I want to talk a little bit about a project that I did
this summer with IBM. I was working with Martin Wattenberg and Fernanda Viegas from
Cambridge, I just came from there this week.
And instead of a particular document, we were looking at visualizing 628,000
documents at once. I know there's some interesting work that's been done here as well on this
kind of thing, the Blues project and other work.
We're looking at large volumes of text. We're not really interested in similarity and clustering.
What we want to do is help discern differences across different facets of text. If we subdivide the
text based on some characteristics of the documents, how do we compare one subset to
another?
So what we were looking at, for an idea, then, is the U.S. federal court decisions. And in that
collection we had a lot of messy data. We were just given raw text, the same raw text that
companies like Westlaw and LexisNexis take and put markup on. We had the
raw text and we did the same thing. There are thousands and thousands of errors, things like
dates being marked as courts, that kind of thing, where we had to try and fix those automatically.
There's obviously a business value to doing this kind of work, but we went with the free resource,
and we're planning to provide the data back to the community again, because ease of access to
these decisions is important for the country.
For me this was a challenge, because I had been looking only at the visualization
side before. Now suddenly we're looking at issues of scale with these 628,000 documents.
We've got the equivalent of something like 48,000 copies of Dante's Inferno, or, maybe more
familiar, 48,000 copies of my own master's thesis.
So the process we undertook here was basically dividing the internship up into two projects. The
first thing to do was to actually do the text analysis and try to come up with a database we were
going to use to drive this visualization. First we met with the legal experts. We worked a little bit
on acquiring the data and analyzing it. We made some initial designs, and we went back to our
colleagues at the Harvard Law School and talked to them about the visualization, did some
revisions, and now we're in the process of starting a more formal evaluation with a group of
lawyers.
For those of you who aren't from law school: I'm from Canada and I understand it, so it's going to be okay.
The U.S. court system is divided up into three levels: the Supreme Court, the Federal Courts of
Appeals, and the District Courts. We're looking in particular at the Supreme Court and the Federal
Courts of Appeals.
And the Courts of Appeals are divided into 11 circuits across the country, multi-state circuits,
plus the Federal Circuit, which deals with things like the patent cases, and the DC Circuit.
So we were then interested in dividing that up across the different circuits. We've got these
decisions going back to the 1750s, and then the Courts of Appeals only from the 1950s. They're
authored by a particular justice or a judge, and within that, others may concur or dissent with a
decision. So within a particular decision you may have multiple opinions.
And then there's also the issue of precedent, upholding or overturning cases. So what are we
interested in finding in this large set of documents? In the legal domain, we were interested
in something called forum shopping. People can bring a particular case -- for example, a
company like Microsoft or IBM that has offices in many different jurisdictions could choose to
bring their case in a jurisdiction that's more favorable to them.
And we were interested in trying to find out whether there were any differences across jurisdictions
in what cases are seen or heard in these areas.
Also, the idea of stare decisis. So what's the precedent. So keeping things the same over time.
And then agenda setting. What issues are happening in a particular jurisdiction and how do they
spread from that jurisdiction to another. So do we see transfer from the Supreme Court
downward or the other way around over time.
So the project that I'm going to show you right now was particularly looking at the forum shopping
issue. This is one of the cases we were looking at. And there are many different types of data
that we annotated out of this case. I don't need to go through all of them, but there are lots of
things we automatically detected from each of the cases.
From this we created a text analysis infrastructure where we can detect and separate the parts
of the text, using the same sort of simple techniques I described for the DocuBurst project, removing
stop words and stemming. We wanted to explore the characteristics of sets of documents while
always retaining quick access to the original text. This is something I really believe in: when
you're doing visualization of a massive text collection or a particular document, you really always need
to have a connection back to the original text.
The visualization is great for showing trends or themes, but people actually need to read the text
in the end. And there's a lot of work that's done that doesn't do this.
So then we have this infrastructure ready now to go ahead and start doing the visualization.
What did we use? We used a standard XML parser and regular expression processing to detect the
different sections of the documents. The dates were a mess. There were 15 or 16 different date
formats used across the courts, and we really needed to know the date of the case. One
of the hardest things was figuring out what date formats were used and
actually getting them into a consistent format.
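Here's a minimal sketch of that kind of date normalization in Python. The format list is invented
for illustration; the real corpus had many more variants than this.

    from datetime import datetime

    # A few hypothetical court date formats; the real data had 15 or 16.
    DATE_FORMATS = [
        "%B %d, %Y",   # "June 5, 1973"
        "%b. %d, %Y",  # "Jun. 5, 1973"
        "%m/%d/%Y",    # "06/05/1973"
        "%d %B %Y",    # "5 June 1973"
    ]

    def normalize_date(raw):
        """Try each known format; return an ISO 8601 date or None."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None  # flag for manual inspection

    print(normalize_date("June 5, 1973"))  # -> 1973-06-05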
And it also had to be fast. So we used Lucene, an industry-standard open source search engine,
to do search over the texts. And we also found some problems with Lucene, so I actually joined
the open source community for the first time and helped fix them. We extended it to allow
capturing stems: while we were doing indexing, we were doing stemming, and we can
reverse the stemming after the fact, because we didn't want to have raw stems appear in the
visualization.
Everything is stored in a Postgres database, and we precompute everything offline using
multithreading, and the result is that we have this infrastructure for handling large text
corpora.
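The reversible stemming idea can be sketched in a few lines of Python. The actual extension was
to Lucene, in Java, so this only shows the gist: index by stem, but remember the most frequent
surface form so the display never shows raw stems.

    from collections import Counter, defaultdict
    from nltk.stem import PorterStemmer

    stem = PorterStemmer().stem
    surface_counts = defaultdict(Counter)  # stem -> counts of original forms

    def index_token(token):
        """Return the stem for the search index, remembering the surface form."""
        s = stem(token.lower())
        surface_counts[s][token.lower()] += 1
        return s

    def unstem(s):
        """Map a stem back to its most common surface form for display."""
        return surface_counts[s].most_common(1)[0][0] if s in surface_counts else s

    for w in ["decided", "decide", "decide", "deciding"]:
        index_token(w)
    print(unstem(stem("decided")))  # -> 'decide', not the raw stem 'decid'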
We can do some analysis of this. Outside of the visualization, now that we have this data, we can start
looking at different statistical characteristics of the court system. This is of interest to the lawyers
we were working with. Things like the average time to make a decision: we have two dates for most of
the cases, the time it was heard and the time the decision was handed down. It was interesting to see
things like the Courts of Appeals having this huge jump in the 1990s and early 2000s in the amount of
time it actually takes to have your case decided. The lawyers were
saying that might reflect an overloading of the court system right now.
Other things that were well known to the legal community were also surfaced in this work. So, for
example, the average number of citations per case for the Supreme Court. It's well known that
during the time that Warren Buffett was the Supreme Court Justice --
>>: Warren Burger.
>> Chris Collins: Sorry, Buffett is the really rich guy. Thank you very much. So Justice Burger
was the chief justice at this time. And under his --
>>: Warren.
>> Chris Collins: Warren, that's what I'm looking for. He was the chief justice during this time. And
under his reign people were citing fewer of the previous cases, because they were changing the law
at the time. Thank you, it was Chief Justice Warren.
For example, taking a particular term of interest and looking at it across all the different courts: we
were interested in the Roe v. Wade case, the classic abortion case in the Supreme Court, which was
considered to have brought the discussion of personal privacy into the legal system. We actually
found there was an increase in the discussion of privacy in the years leading up to the Roe v. Wade
case. So that was interesting to see.
Given the statistical background and these analyses we can do, which were sort of fun, now we're
going to go forward and look at the differences in the corpus. If we divide the corpus across
one of these meta facets -- across the courts, across the authoring judge, over time, over
particular years -- can we see this issue of forum shopping coming up?
The measure we used is called the Dunning log-likelihood ratio. You're probably familiar
with it. It's a standard chi-squared-style measure of word frequency: we're looking at the expected
occurrence of the term in the document, the reference documents, and the overall set of texts, and
how the observed counts differ -- what's the probability that the ratio is different from the expectation?
So that actually gives us a significance value that we can use in our visualization.
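As a sketch of that measure, here's the common two-corpus form of the Dunning log-likelihood
(G-squared) statistic in Python, comparing a term's frequency in one circuit against the rest of
the corpus. The counts in the example are invented.

    from math import log

    def dunning_g2(count_sub, total_sub, count_ref, total_ref):
        """G^2 log-likelihood ratio. Large values mean the term's frequency in
        the subcorpus differs significantly from expectation (approximately
        chi-squared distributed with one degree of freedom)."""
        # Expected counts if the term were spread evenly across both corpora.
        e_sub = total_sub * (count_sub + count_ref) / (total_sub + total_ref)
        e_ref = total_ref * (count_sub + count_ref) / (total_sub + total_ref)
        g2 = 0.0
        if count_sub > 0:
            g2 += count_sub * log(count_sub / e_sub)
        if count_ref > 0:
            g2 += count_ref * log(count_ref / e_ref)
        return 2 * g2

    # e.g. "ostrich": 40 of 1M tokens in one circuit vs 5 of 10M elsewhere.
    print(dunning_g2(40, 1_000_000, 5, 10_000_000))  # counts are made up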
And then we took out stop words dynamically. So we have a short stop word list, but then we
also took out the most common and least common words from the list.
So in our visualization design, we show the top N words of a particular circuit
by the Dunning log-likelihood score. The size is then the significance value:
the probability that the score is different from the expectation, and only higher than expectation.
The order is alphabetic, and there's a reason for that that I'll show in a moment. We also
show edges, which I'll show in the next slide. There's one column per circuit. And the color is random,
to allow for differentiation across the different words. It's random within a color scheme so it
doesn't look too crazy.
Given a particular court, we found, as we would have expected, the
justices' names appearing strongly, and place names. Of course, these are things that
differentiate one circuit from another, right? Because that judge works there, or it's in that
particular district. So we had to do some named entity recognition and remove those things.
We knew we were getting the right thing. We took all those out, and we got something
that was a bit more interesting in terms of its contentfulness.
So this is what it looks like. It is a little bit of a mess, but it's interactive and you can
explore it. On a larger screen it's more usable. What we have here, then, is the 1st to the 11th
Circuit. I've removed the DC and Supreme Court from this view because that works a little bit
better in a wide screen format.
So what did we find? We found things like all these drug terms surfacing. I was surprised,
but people from the U.S. weren't, to see
this interesting trend of the amphetamine-type drugs, pseudoephedrine and
methamphetamine, showing up in cases in the west, and more prevalence of marijuana and
narcotics up in the New York region.
So we're seeing -- this visualization now is a reflection of the cultural differences or behavioral
differences across the country. So this video will show how the system works. And we'll see
some of that. So here we're mousing over the particular circuits, and the edges that are shown
are connecting words that are appearing in more than one circuit.
So they're significantly differentiating in more than one circuit. Copyright is obviously in the New
York circuit, and this is a tool tip that shows the particular score, the score details, for every one
of the particular columns. And in this case copyright was only significant in the Second Circuit.
We can see that the 4th Circuit is strongly connected to the 6th here with many, many lines
connecting across them.
If we go up and explore that, we can see things like the words pulmonary and pneumoconiosis.
They're both strong and have the same profile, here as well, across the 4th
and 6th Circuits. If we select a particular term, then, we actually get back these document
glyphs that show -- pause that for a second -- the documents that the term occurs
in. So the word "coal" appears in many, many documents, and it occurs
more in some documents than in others. The larger the bubble, the more occurrences of that word in the
document.
So then I clicked the 4th circuit here and activated that. Over here, now, all of the terms that
are -- all of the cases that are from the 4th circuit that contain the word "coal" are highlighted in
red.
Okay. So we can see that this one large case is not from the 4th circuit. It's from the 10th circuit
because the circuit lights up when you mouse over it.
If I go then and choose this, which is the black lung disease that coal miners get, we can see that
there are many cases that contain both terms, this little pie chart, but not this large case. But the
cases that do contain both terms like this one are from the 6th or 4th circuit, which are areas
where coal mining is common.
>>: [Inaudible].
>> Chris Collins: Yes. So one of the things we thought about doing was trying to -- well, the first
thing I did was actually order them by the score, so you'd have the strongest things on the top,
going downward.
What we ended up with then is that these edges became a lot more criss-crossed, because we're actually
trying to show differences, so the columns are often different. These edges actually represent
when the difference in each circuit from the full set of documents is the same.
So, for example, this word occurs oddly highly in both of these circuits. Another thing we thought
about doing was trying to reorder them to minimize the edge crossings.
That's something that we didn't get around to doing, but it's certainly
something we should try.
It's also questionable whether or not the edges are actually useful for anything, so we're also
considering just turning them off altogether.
I think what's most interesting with the edges is that all people want to see is when something
occurs in more than one circuit. We have the edges faded out in the background when you're not
interacting with them, because they were so messy, so they're probably not that useful there. But when
you actually mouse over something, it is important to be able to see that edge appear.
So, the last thing I want to show here -- I'm just going to deselect those terms. I was
showing you before that we can look at things like the drugs.
So here are some of the drug terms. If you spend some time, you can read the
detail about what the scores are and the particular normalized term frequency for each of those
terms across the different circuits. We actually computed the scores in two ways:
by occurrence counts, the number of times a particular term occurs in any document; and by
the number of documents that it occurs in at least once.
That gives us a subtle difference. We end up able to discern a word that
occurs really strongly in one document that's really long: it might show up really highly in
the occurrence score, but since it only occurs in one document, it won't show up at all in the
cases score.
We actually found they showed different enough things that we provide both. So the first
row here is the scores based on the number of cases that the term occurs in at least once, and
the second row is based on the total number of occurrences overall.
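Here's a small Python sketch of those two counting schemes, with invented documents, showing
how one long case can dominate the occurrence score without affecting the cases score.

    from collections import Counter

    def occurrence_and_case_counts(documents):
        """documents: list of token lists. Returns (occurrences, cases)."""
        occurrences, cases = Counter(), Counter()
        for tokens in documents:
            occurrences.update(tokens)       # every mention counts
            cases.update(set(tokens))        # each term at most once per case
        return occurrences, cases

    docs = [["coal"] * 50, ["coal", "pulmonary"], ["pulmonary"]]
    occ, cas = occurrence_and_case_counts(docs)
    print(occ["coal"], cas["coal"])  # 51 occurrences, but only 2 cases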
And many times they're very similar. So here we have, for example, in the 8th and 9th circuit and
the 10th circuit there's a significant value for the score for this term methamphetamine in both the
cases and the occurrences.
I've actually selected that now, so we can see how many cases there are.
Methamphetamine is a very popular thing to talk about in the courts.
So we turn that off again. And the last thing I wanted to show you is we can actually change the
time range pretty quickly. This is from 1990 to 2000 that we were looking at right now. I'm going
to change that to 1980 to 1990 and look at that.
So the red here now marks things that are new, that didn't occur before. We have things
like this word homosexual -- I think the mouse is going to go up there -- a social
issue that was discussed more in the '80s than it was in the '90s, or things like marijuana in the
11th Circuit, which doesn't appear in the later time period, but in the earlier time period it was
important. Whereas methamphetamine doesn't occur at all back here in the '80s.
Also, when we showed this to the lawyers, they found some interesting things like this word here
"ostrich", and they were like why is ostrich showing up in the 7th circuit and only the 7th circuit? It
barely occurs at all in the other circuits.
And it turns out that ostrich is a particularly interesting legal term, where they use an argument
called the Ostrich Argument, which means they ask the jury to put their head in the sand and
ignore what they just heard. So we can actually click one of those documents and try to
figure out where the word actually occurs, and we see here it's the ostrich instruction that they're
talking about. Because this is such a widely known legal term, it
was really interesting for our lawyer colleagues to see that it was so strongly used in the 7th
Circuit versus the others. And it turns out that it's not only strongly used in the 7th Circuit, it's
strongly used in the 7th Circuit by all the different judges. It's not just one particular judge who
likes to use that term.
>>: [Inaudible] are you able to correlate all these terms to actual decisions?
>> Chris Collins: Not right now.
>>: Decisions, mistrials.
>> Chris Collins: We don't have that right now. We do have the outcomes of the decisions, and
we do have the terms. So we could, but there's no overt way to do that right now.
So one interesting thing I wanted to do would be to look at the dissenting opinions
versus the main opinions and what kind of
language is used in the dissenting versus the main opinions.
All that would really mean is dividing up the corpus across a different facet of the data:
instead of the columns being the circuits, the columns would be the different types of opinions.
Again, we could also look at things like mistrial versus not a mistrial. So, on to some other work I've
been doing in the area of linguistic analysis; that was my work in information exploration. Many of
you have probably published in the ACL proceedings, the computational linguistics conference.
We actually gave a tutorial there this year, myself and my supervisors, on using
visualization for computational linguistics. And before that I did a little bit of a review. In
the ACL proceedings, we found that 42 percent of the papers we looked at from 2006 had what
we considered to be novel visualizations, beyond just parse trees and matrix diagrams.
These were really interesting things that were made for the presentation of the results.
And an additional 29 percent had parse trees and these standard CL techniques for explaining
data.
And for the most part it seemed like visualization was used for presenting the results and not
actually for doing the research. So here are some examples: some really interesting kinds of
drawings, a standard matrix diagram, and these really crazy antecedent-precedent diagrams.
So we're working right now to continue to understand how the computational linguistics
community is using visualization, by doing an ethnographic study of a group working at the
University of Southern California. We're actually going there to try to understand how they
work with and without information visualization in their existing environment.
This gets a little bit toward the work that I think the VIBE group does here, actually trying to
understand work in context. How does the team work together, and what are the specific nuances
of their information use versus what we would have expected before actually going down there?
So I actually did do that. We're interested here in trying to make a holistic description of their
use of information. So I went down there and followed them around. I actually did a
little bit of training in social science methods before I went, just informally, with some colleagues who
use these techniques more often than I do.
We did some contextual interviews. I actually participated in the types of analysis they do on their
machine translation -- this is a phrase-based machine translation research group -- so they
actually put me to work on some of their things.
I went around and looked at their whiteboards and notebooks and things like
this to try to understand. We transcribed those interviews, and I used a technique called open
coding to come up with a code set that we then used to go through the interviews
and code where we saw a particular thing.
So we were interested in things like what triggered the use of a visualization, if they used one.
Was it only for one person to use or did they share it with their team. Was it used only once or
many times?
If it was only used once, why was it only used once? For these kinds of
questions we were tagging our transcripts for those kinds of indications.
We came up with this interesting, surprising finding, where we saw way more visualization in use in
the research process than we had expected. We're calling this visualization in the wild. And
I don't think the InfoVis community realizes this is actually happening in scientific communities
outside of InfoVis.
Some of the visualizations they had were in common use. They had large binders of
printed visualizations of parse trees that they would go through by hand and annotate, to try to
figure out where the problems were happening that were causing the translation to go off.
They were designing interactive ad hoc visualizations that they were using at their own desks, and
also, rarely, sharing those things with their colleagues.
There was a real periodicity to the analysis process. So there's a lot of coding, a lot of refining of
the algorithms then they'd do a big run of everything and check everything. So, of course, this is
just one particular research group. But I think it gives us a little bit of an interesting insight into
how the computational linguistic research process works.
And then when I would ask them questions like have you ever used an information visualization,
they would say no, I've never -- I don't even -- they don't use that kind of terminology.
So it was really interesting to open my mind to that. I mean, I've come from both
backgrounds. My master's is actually in computational linguistics, but I've been away from it long
enough now that I've been steeped in this other realm.
And it's so interesting for me to see that what's being done in these groups is actually what would
be considered cutting-edge InfoVis research, but it's just ad hoc stuff that they do so they
can do their work.
What we're interested in now is trying to help them do that work with more interactive
visualization. So we saw that idea of external visualization that I brought up in the
beginning.
If they're doing things in a simple way, they're analyzing it themselves; there's some
sketching going on on whiteboards when things become a little bit more complex.
Then, when it becomes really large, they make these printouts for their binders that they
analyze on a periodic basis.
So that's that work. It's ongoing. I'm actually in the process right now of developing a prototype
that will go back down to them to address some of the things that I observed there. And then
we'll go back and do a longitudinal study of how they used it in the end.
The second project related to working with linguists that I have done is something that some of
you may have seen at the InfoVis conference a couple of years ago. I know it's highly related to
Tim's work here in the back, so credit to him for that.
So we were looking at trying to compare word similarity measures. So a particular pair of words
may have a score that says how similar they are. There's many different types of these scores.
They can be based on the WordNet ontology I was talking about earlier, so some form of
structure over language. They can be based on the word usage patterns. So that's the kind of
thing that we were interested in.
From the visualization side, we felt that the use of space was the most powerful visual dimension
we have to work with.
So we came up with an idea to allow for the reuse of the spatial visual dimension. As an
overview: the project is called VisLink, and it links multiple two-dimensional visualizations in a
restricted-interaction 3-D space.
Adjacent planes are connected by these bundles of edges that connect a particular word that occurs on one
plane to a word that occurs on the other.
The idea of working in 3-D was really intimidating, because I've really been
quite negative about 3-D most of the time. I think a lot of the time the mouse is not the
right input device to use for a 3-D display, and that's what we were working with. So we tried to restrict
the view and restrict the interaction to some simple shortcuts that provide the preferred views
of the space. You can look at it from the front, or the top, or the side, but free navigation is
discouraged.
Existing techniques in this field of looking at multiple different layouts of visualizations:
we can do things like looking at them side by side, printed or on the screen. Interactive linking:
if you mouse over a term in a particular visualization, you would see the occurrence of that term on the
other visualization highlighted.
Here we're looking at things like: you can have a social network graph of who is friends with whom in the
company, and on this side you could have the formal hierarchy of the organization.
So the same underlying data, but different structures. And you can also mash them both
together, doing things like taking one of those structures, drawing the graph, and just drawing the
other one on top of it.
The system we developed actually allows us to do all of these things. We have these
planes and we can move them around. We can have two planes side by side, interactively
linked; we can put them on top of each other and mash them together; and we can pull them apart
accordion style and look at the links in between.
To build that, we took pre-existing interactive two-dimensional visualizations of word
similarity that we already had. So this is the WordNet ontology, and this is a force-directed
layout based on a word similarity measure.
We threw those into that 3-D space and connected them up. So what we're interested in
looking at, then, is the pattern of the lines between the two visualizations. If we see a lot of
criss-crossing, we're expecting to see things that occur as neighbors on one side not
occurring as neighbors on the other side -- a difference in the similarity scores.
So the lexical example we looked at is the hierarchy and the similarity measure, and the
differences between them. You can query now across the visualizations. From this side, if
we selected a particular term -- there's no synonymy information here. So if we want that information
from the other side, we have to do an interactive query.
We have alphabetic organization on one side and synonym information on the other; how can we get from one
to the other and provide that information on the other side?
First we would select a term on the clustered side, propagate an edge over to the side that
actually knows which words are synonyms of which other words, and then take all those synonyms
and propagate the edges back to the other side.
So whatever lights up on the first plane now is a
synonym according to the other side.
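Here's a tiny Python sketch of that round-trip query, using NLTK's WordNet as the synonymy
side. The plane data structures in the real VisLink system are richer than a plain word list;
this just shows the propagation idea.

    from nltk.corpus import wordnet as wn

    def propagate_synonyms(term, terms_on_first_plane):
        """Return the terms on the first plane that WordNet lists as synonyms
        of `term`: select on one side, query the other, propagate back."""
        synonyms = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                synonyms.add(lemma.replace("_", " ").lower())
        return synonyms & {t.lower() for t in terms_on_first_plane}

    plane_terms = ["car", "automobile", "truck", "engine"]
    print(propagate_synonyms("car", plane_terms))  # -> {'car', 'automobile'}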
So here's a short video illustrating that technique. If we look at a particular synset and we
select it, the edges propagate back and forth. And this is a little bit
of detail on the view that you can get: we always provide an equivalent 2-D view. This is
the equivalent 2-D view of the first side.
We can see this term is highlighted even though we didn't select it. We selected this one, but the
edges have been propagated over, and everything on this side that matched was propagated
back.
So we're able to sort of query the two visualizations simultaneously by clicking on one and using
the relationships between them.
The work that we're continuing to do on this now is actually with Ted Pedersen at UMD
and his postdoc Saif Mohammad, to try to apply this to a library he's created
that collects a lot of different word similarity measures together: how can we apply that library
to visualize and summarize the differences across those different measures?
And then we've had some interest in applying this to other types of linguistic and nonlinguistic
data, like comparing different parse tree algorithms across the space.
Finally, this is something that I was telling Michael about in the hallway earlier this morning.
We're looking at taking existing NLP algorithms and surfacing what's going on underneath them
to try and help understand how they're working in the end.
So, for example, looking at the uncertainty. So if we know we have a statistical output, how can
we know whether or not we should trust it?
So you're all familiar with examples like this. The avant garde movie the press has spoken about
has been defamed by the critics in spite of an original advertising campaign.
And I am from Canada, but my French is terrible, so we will just look at this. If you know
French, you can see that there's an ambiguous term: the word "press" has been translated
incorrectly here. And then we get the English translation back: we have "the film of avant garde
that the pressure spoke approximately was defamed." So we get this jumbled mess, and how do
we know if this is correct or not? Because most of the time all we see is just this one best output.
So our idea was to build on research by Stacey Scott in the tabletop
community on decision making, and think about how we can put the human into the loop of
this decision making.
Traditional statistical processing systems just give us one single hypothesis quickly. But then
we have this issue of quality. So we're thinking about whether we can put a human back into that
decision-making process, so that a few options are presented to the end user and they can actually
recognize which one is correct using their previous knowledge of the language.
So here's an illustration of this. Of course, many of the people here will be familiar
with these kinds of systems. Some training data, some tuned parameters; you get a particular
input and then you get out one specific output.
What's going on inside that black box? Many of these algorithms use something like a lattice
diagram to track the hypotheses the system has about solutions over the entire
solution space. So how can we expose those internal workings and provide some sort of
uncertainty-annotated output to the end user? What we came up with was literally looking at the
lattices that underlie the translation system.
One of these lattices may encode several different possible hypotheses about
the translation. So, for example, "speculation in Tokyo was the yen could rise because of the
realignment." Then, if you take another path from left to right in the same diagram, you get a
slightly different hypothesis about the translation.
And, again, the same thing. So you have many, many different hypotheses encoded into this diagram.
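Here's a minimal Python sketch of such a lattice as a directed acyclic graph, with invented
nodes and log-probabilities. Each left-to-right path is one hypothesis, and the path scores are
the kind of confidence values the uncertainty rendering draws on.

    from heapq import nlargest

    # lattice[node] = list of (next_node, phrase, log_prob) edges; made-up data.
    lattice = {
        "s": [("a", "speculation", -0.1)],
        "a": [("b", "in", -0.2)],
        "b": [("c", "tokyo", -0.1)],
        "c": [("d", "was", -0.9), ("d", "is", -0.4)],
        "d": [("e", "the yen could rise", -0.5),
              ("e", "the end could rise", -1.2)],
        "e": [],
    }

    def all_paths(node, words=(), score=0.0):
        """Enumerate every hypothesis (word sequence, total log-prob) in the DAG."""
        if not lattice[node]:  # reached the final node
            yield " ".join(words), score
        for nxt, phrase, lp in lattice[node]:
            yield from all_paths(nxt, words + (phrase,), score + lp)

    # Print the three most probable hypotheses, best first.
    for sent, score in nlargest(3, all_paths("s"), key=lambda p: p[1]):
        print(f"{score:7.2f}  {sent}")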
So we looked at different types of uncertainty mappings that we could use to give an intuitive
idea of uncertainty. Two that we came up with were a fuzzy border, where the
fuzzier the border, the more uncertain the algorithm was about the underlying term, and a cloudy
rendering, similar in its design, in that the lighter the color, the more
uncertain, and the tighter and darker, the more certain.
We found that people were saying this one was more intuitive, but there was a readability issue
with the gradient causing some problems with reading, so we ended up going with the top
one.
So there's an example. We embedded this into a translation system, and when
there were out-of-vocabulary words in the translation, we added a little bit extra, where we went out
and searched for photos from the web that represented that term. So here, for example, the word
Banff wasn't in the translation dictionary, so it actually found some pictures of skiing and that kind
of stuff.
So there's that. We trained on the Europarl corpus. It was a very small training set, so
we had a lot of uncertainty: a million sentences in Spanish, French, English and German. And we
used an open source decoder, which we modified just to allow us to actually see those lattice
translation scores.
And there's the chat client. We actually provided this at the CSCW conference in 2006 as a
chat client that people could use to talk to one another across languages. So one person could
write in German and the other in English, and what you would see as the result would be one of
these lattices, where the translation that was most preferred by the algorithm is shown across the
bottom, and alternative translations are shown above.
And you can click any particular path to relocate this green line, which shows what's
preferred. The path that you know is correct, because you speak the language and you
know the other one's ridiculous, is what's placed in the transcript that's recorded
after the fact.
Speech recognition often uses a similar sort of algorithm with the same sort of data
in the background. So we applied this to 213 lattices from a speech recognition corpus, which were
pruned to show only the 50 best paths.
And we removed all the nulls and silences, so this was completely decoupled from the
speech recognition signal; we were given the data and used it with a decoder. And we found
some interesting differences in this example. In the machine translation project, we saw large
segments, large phrases, that were uncertain, with many alternatives for them. In the
speech case, we saw that short words with very little vocal power were really
uncertain, so there were many different options provided for those particular words, and
there were fewer long-distance uncertainties.
We're just finishing this project up now. We're looking at visualizing the
uncertainty not on the nodes themselves but on the edges
between the nodes. So how do you trace a path through the lattice instead of through
particular nodes? I think that may help people make a better decision at a decision point
where there's more than one edge leaving a particular node.
>>: So did you find that users actually display all the apparent -- it should be apparent to the
user. Do they need the confidence of the system to help them?
>> Chris Collins: We didn't do that evaluation, so I can't say with any confidence whether or not
we found that. This was sort of a proof-of-concept kind of idea, and I think it still requires a more
thorough user evaluation. I expect what we'll see is probably that this presentation is
better than using a list of alternatives. But you're right, I'm not sure if actually
showing that uncertainty is necessary. It might be enough to show the graph itself.
>>: Do you have a limit on the number of [inaudible]?
>> Chris Collins: Yeah, 50. 50 paths.
>>: 50?
>> Chris Collins: 50 paths total. Yeah. So 50 paths is often a very small graph.
>>: A graph.
>> Chris Collins: 50 paths.
>>: So a lot of these alternatives aren't always interesting. Did you do any sort of attempt to pull
out just [inaudible] or...
>> Chris Collins: No, we didn't, but that's an interesting idea. Because sometimes you'll have
shake and shook or something like that. Yeah, we didn't do that. But that's cool. I didn't think
about doing that.
So what have I shown you today? We've looked at the types of visualization that I've been
doing in my thesis research, across three of the five areas that I've defined for
linguistic visualization. In particular, the projects that I'm still continuing to work on are here in the
text analysis realm with the group at ISI, and finishing up this project that I've just done with
IBM.
I'd be happy to talk more about this stuff with anybody who wants to, because I am
starting to write my thesis now. So here are some ideas I have for future plans, because I am finishing up
soon.
So I think there are two directions we can go here. First, there's CL expertise that can be leveraged
by the InfoVis community, for example on this document visualization problem. The
InfoVis community is interested in this, and people on the web go crazy for it. One of my colleagues
at IBM created a tag cloud tool called Wordle. It had 100,000 documents visualized within the first
few days, and it got coverage on television. It was just crazy. And all it was was a tag cloud,
word counts.
So we could do things like incorporating word sense disambiguation, or looking at a document
and comparing it to a document in another language, or asking how a document differs not only in
the words that are in it, but in the meanings of those words.
That's the problem with the DocuBurst project: it only looks at the surface
forms of the words and not their meanings.
So this gives us a lot of opportunity for the two communities to work
together. And then the reverse: the InfoVis community can look at things like helping the
computational linguistics and NLP communities with tools for corpus control, and
investigating things like annotator agreement in the training data.
Things like showing, if I turn this parameter in my machine translation algorithm, how much it is
going to change the result, in advance of actually having to do it. And there has been some
work in the InfoVis community on things called scented widgets, where we
have these interactive widgets that give a preview of what might happen if
you actually changed the parameter.
I think if those two things came together, we could do some better exploratory data analysis, which is
actually a growing area within the NLP community.
And then more work on the idea of understanding what's actually happening in these black-box
NLP systems: looking at how an [inaudible] is working when we're doing dialogue system
construction or nondeterministic analysis; the uncertainty in these kinds of parametric models that
we use on a regular basis; how the decisions are made when we're doing chart parsing or a
beam search in a parsing system, things like that.
So I'm actually interested in discussing this, and we've started a page on the InfoVis
wiki to try to open this discussion between the two communities, so we can talk about how
visualization and computational linguistics and natural language processing can benefit from one
another. And through the tutorial we offered this year, we've actually started to have a bit of
discussion with people from the computational linguistics community.
So that's it. Thank you for your attention.
[Applause]