>> Lee Dirks: Thank you, everyone. We're going to reconvene the afternoon
portion of the schedule. Thanks. Hope you guys had a nice lunch and a chat
with your colleagues.
For the next session, I'd like to introduce Rafael Sidi. He's the vice president of product management for ScienceDirect and an expert in research productivity. He's been with Elsevier since 2001, where he's been instrumental in developing Engineering Village and launching Illuminate.
He was also the publisher for the Compendex database. Before joining Elsevier,
Rafael was director of E-Commerce operations at Bolt, a teenage social
networking portal. He holds an MA from Brandeis University and a BS in
electrical engineering from Bosphorus University in Istanbul, Turkey.
He's also joined by his colleague, Dave Marques, who we're very proud to claim here in the Pacific Northwest. Although he works for Elsevier, he sits here in Seattle. We're very glad to have him here.
I'll hand it over to Rafael.
[applause].
>> Rafael Sidi: Thank you, Lee. First of all, we are doing this presentation together with David Marques. And one of the topics that I want to introduce and show to you is a new product that we launched. It's all about visualization.
And I was going to present this product, but David is the guru of this product. He lives here, so we are going to do this as a team.
I'm going to look at most of the product [inaudible], so we are talking about multimedia and visualization. For me, whenever we introduce any multimedia or visualization, one thing that I keep in mind is how this is going to help researchers and scientists in their daily work.
Le Journal des Sçavans started in 1665. This was the earliest scientific journal. Since that time, publishing has changed tremendously. And today, right now, if you go to different publishers, you see graphical abstracts. So we are expressing the journals now in a more visual way. You can see the graphical abstract and get a quick overview of the article.
Then we started playing with text and captions, so you can get insight from the articles.
Now, introducing multimedia is very important for us because we are trying to provide more insight to the researcher and scientist. We are trying to provide more intelligence from the article. So we don't want the articles to be static articles; we want the articles to be interactive articles.
And what we have been doing is asking the authors to submit videos and any supplementary data, so we can show them to our users. And the users who come to our websites can really understand what's happening; they can learn about an operation. Again, it's not something great to watch after lunch, but still, you know, you need to understand that this is how you're going to communicate with the scientific and research community.
We heard about the datasets, the importance of the datasets, this morning. So we want to also provide more context to our users when they are looking at an article. So we partnered with Pangaea, which is in Germany. What we are doing is introducing the datasets: we are connecting the article and then linking to the dataset that is available. This is our way of visualizing the datasets within the article.
Then this morning we heard about [inaudible]. So this is the way that you want the article to be interactive, so people can get the information easily. And we are pleased to have [inaudible] submissions coming with our articles from our authors.
And the beauty of this is that you are connecting the user from the article to the other external relevant sources where they need to go to do more in-depth search and discovery.
We partnered with EMBL. These are researchers in Germany. What they do is extract the protein structures and visualize them within the articles. So we are trying to link the users contextually to relevant information wherever they are in the article.
Another visualization or multimedia feature that we are using: we are asking the authors to submit keywords for the structures. So when we get the structure keywords, we are linking the structures and presenting them to our users so they can go to other databases and run their searches in other databases.
And we are always looking at the search results. Search results, most of the time, have been textual. For engineers and for medical researchers, being able to look at images and to search images is very critical. So instead of just providing search results in a textual format, we are also providing search results where you can do an image search and get the images in your search results.
Now, in 2005 I looked at how search results are presented. And the way that I see it, today's search results, let's say Google Scholar's, look much the same. Are these dumb search results? Where is the intelligence? How can people get insight from the search results?
I think this one is from Microsoft Research down in China. Again, if you compare the Google results with these results, there are more insights in these results. At least you can see some author names, prolific authors. So another way for us to think about visualization is, okay, how can we leverage facets, how can we leverage the content and the keywords that we have, so we can provide more insight to the end users using facets. Again, what Microsoft is doing is taking faceted search to the extreme. This is, in my mind, faceted search with insights. In a visual way, you can really get insights looking at this table. For me, you know, [inaudible] introduced faceted search to the search industry. But thinking about what Microsoft is doing here, they are creating the next version of faceted search, where you can see much, much better information. You can easily interact with the results.
And this, for me, is an excellent way for search and discovery, specifically in science and technical fields. It is a great way to see information.
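To make the faceted-search idea above concrete, here is a minimal Python sketch of how facet counts can be computed over a result set. This is an editor's illustration, not Elsevier's or Microsoft's implementation; the field names and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical search results: each record carries metadata fields (facets).
results = [
    {"title": "Treemaps revisited", "author": "B. Shneiderman", "year": 2009, "subject": "HCI"},
    {"title": "Faceted metadata search", "author": "M. Hearst", "year": 2006, "subject": "IR"},
    {"title": "Network visualization", "author": "B. Shneiderman", "year": 2006, "subject": "HCI"},
]

def facet_counts(records, fields):
    """Count how often each value occurs for each facet field."""
    return {field: Counter(r[field] for r in records) for field in fields}

counts = facet_counts(results, ["author", "year", "subject"])
for field, counter in counts.items():
    print(field, counter.most_common())

# A user clicking a facet value simply filters the result set:
hci_only = [r for r in results if r["subject"] == "HCI"]
```

Clicking a facet value is then just a filter over the same records, which is what makes the interaction cheap and immediate.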
Again, these are great tools that we integrated into our journal articles, so we are trying to bring the data, bring the content, to life. This kind of visualization is helping our researchers, scientists, and students to understand the content much, much better.
Now, we are talking about video. And video is going to be very, very important in the future for scientific and technical research and retrieval. And what we have seen right now, regarding the importance of video, is the introduction of journals based on video. This is a completely different transition from the plain journal article. Now you have video journals.
And not only that, we also want to provide to our users the videos -- the protocols in video form. There are new startup companies who are leveraging this and presenting the protocols in video format. This is an excellent way for our researchers and scientists to learn new things.
Now, recently we launched another product. Here we took two subjects in health sciences and, leveraging the images and the video that we have, we created a brand new product. And we are leveraging here not just image search but also the facets.
And what you see here -- again, sorry, oops -- is a video where you can really learn how to do the search. So basically we are sharing this kind of information with the scientists, users, and researchers so they understand what's happening here.
And you can take this body and you can visualize and see all the figures or all the videos that you have in a book. So we are making the content more interactive, so you can really learn what's happening.
We don't want to just provide text; we want to provide multimedia and visuals so that you can improve what you do in your daily work.
Again, these are some additional examples of how we are presenting the
content.
We have another product where we are leveraging visualization, Geofacets. You can map all the vessels that you have, and then you can do a search in a specific area. What you are doing is getting the images that are in the document, and then you can do a more in-depth search. If you just want the images, you can get the images, or if you want to look at the document, you can access the document. And not only that, we are thinking about the user's workflow: how we can take that map image so you can put it in the other map workflow tools that you are using.
Again, there are great tools, great visuals that you can see. It looks great, it looks very cool, but what can you do with it? That's the important thing. You know, there are great visualizations, but okay, how am I going to use this? What am I going to do with it?
In this case, it's a workflow problem, because they have the map, they see the map in the journal, but they cannot take that map and put it in their workflow solution. Another way that we looked at information is: how can I provide more insight from all the content that we are putting together? How can we categorize it? How can we provide a simple visual image that shows what the trend is?
And how can I compare two companies' output just using visuals? If I want to look at what's happening in electric cars, if I want to compare General Motors and Chrysler, this is easy. I can create a visual diagram. I can take that, put it in my Excel, and present it to my managers in the presentations that I have to give.
You always need to think about multimedia and visualization and how it integrates with the scientists' and researchers' workflow.
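As a small illustration of the kind of comparison chart described above, here is a sketch that plots two companies' output side by side. The counts are invented purely for illustration, and the saved image is the sort of artifact that can be dropped into Excel or a slide deck.

```python
import matplotlib.pyplot as plt

# Hypothetical document counts on electric-car topics; the numbers are
# made up purely to illustrate the comparison described in the talk.
years = [2006, 2007, 2008, 2009, 2010]
gm = [12, 18, 25, 31, 40]
chrysler = [8, 9, 14, 17, 22]

plt.plot(years, gm, marker="o", label="General Motors")
plt.plot(years, chrysler, marker="s", label="Chrysler")
plt.xlabel("Year")
plt.ylabel("Documents on electric cars")
plt.title("Output comparison (illustrative data)")
plt.legend()
plt.savefig("electric_car_comparison.png")  # drop into a slide or workbook
```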
A key factor in the science field is how we measure authors. Again, you can visualize this, and you can see an author's publications by source. In this case, we are looking at Ben Shneiderman. I can see where he's publishing, and I can also see what kind of subjects he's publishing on.
So that gives me intelligence in a very, very quick manner.
Again, one of the best search engines that we have right now in the scholarly [inaudible] is what Microsoft is doing in China with academic search. This is a helpful tool for me to figure out the co-author network of any author. I can look at this and see that Ben Shneiderman is connected to all those people. But if I want to find out whether Ben Shneiderman is connected to Jim Hendler [phonetic], again, I can use this tool to figure out how they are connected and whom to go to to make that connection.
So again, with any visualization, we need to understand how it is helping them solve their problem.
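A minimal sketch of the co-author question above: finding how one author connects to another is a shortest-path search over the co-authorship graph. The graph below is a toy example with made-up links, not data from any real index.

```python
from collections import deque

# Toy co-authorship graph; the names and links are illustrative only.
coauthors = {
    "Ben Shneiderman": ["Catherine Plaisant", "Ben Bederson"],
    "Catherine Plaisant": ["Ben Shneiderman", "Jennifer Golbeck"],
    "Jennifer Golbeck": ["Catherine Plaisant", "Jim Hendler"],
    "Ben Bederson": ["Ben Shneiderman"],
    "Jim Hendler": ["Jennifer Golbeck"],
}

def connection_path(graph, start, goal):
    """Breadth-first search: the shortest chain of co-authors linking two people."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(connection_path(coauthors, "Ben Shneiderman", "Jim Hendler"))
# ['Ben Shneiderman', 'Catherine Plaisant', 'Jennifer Golbeck', 'Jim Hendler']
```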
Journal analyzer. We all try to analyze our journals, and one way that you can do that is to create an image of it and visualize all of the data.
Another important piece of research is at the University of Maryland, with PaperLens. If you look at this, it's giving you insight about what's happening in a topic and who the most prolific authors are, and you can drill down to find more information.
There are some interesting companies that are using visualization tools, and these kinds of visualization tools provide insights to their customers, to researchers. Here you can see the whole landscape in any topic, or you can find unanticipated patterns in any subject. This site, for me, is a visual way of looking at the data, looking at the content.
Or you can identify the early trends.
Another company is working on visualization where you can visualize the whole genome and drill down; you can get much, much more insight, much more intelligence from this kind of visualization. Those are helpful for specific niche areas.
If you want to see NIH funding, this is another tool where you can look at how the whole funding is distributed.
If you want to look at the whole content distribution in terms of keywords, you can look at it that way. Or if you want to see everything in a wheel, you can do that too. But, again, how useful is this stuff? I think there's still a question mark on this. How are we going to use it? That is always the question. How is the researcher going to use this? It's a great visualization, it's cool, it looks nice, but how is it going to be integrated into the workflow?
Again, this is another way to look at the cancer network in terms of content.
Now, there is a guy in Europe who really, really knows what he's doing. Moritz Stefaner -- he really understands information and visualization. The way that he presents it, you can easily understand and easily interact with any content. Here are some samples that he has done on different topics. If you can bring something like this to science and technology, it will really help the scientists and researchers to interact with the content and get more insights from it.
Again, this is the way that he did a citation network for journals. This is the journal citation network, and you can see which journal is citing which other journal, so you can get a quick snapshot of what's happening in an area.
Again, this is Eigenfactor, which shows the journals. You can see which journals are citing which journals and which journals are being cited by a given journal. That kind of visualization is important for administrators, for scientists, and for researchers.
Again, this is another similar tool that shows the whole network area. This is a beautiful image, but what am I going to do with this? How am I going to use this image? So again, from the product side, we always need to be very careful: okay, there's a coolness, but how is it connected to the outcome?
And I think the other beauty that I see lately in the market is that, with the Web, the creativity of the crowd is really coming up. When we look at Many Eyes, IBM's product, people are using Many Eyes and creating their own visualizations on scientific content.
For me, from the product side, as a publisher, this is very, very important. Scientists and researchers are using Many Eyes, they are using available content, and they are creating applications on top of that content using a tool that IBM is providing.
So what we are doing is, okay, we are also providing the tools to the scientists -- the content, the APIs -- so they can create their own applications. As a publisher, we are not going to come up with all the visualizations or all the multimedia solutions. But what we can do is provide the tools and the content to the scientific community, and they can build these tools. So what we have done is launch an application marketplace. In that application marketplace we are providing the APIs, and we are telling the scientists and researchers, hey, build whatever you want to build. You want to build a visualization? Go ahead, build it.
And then one of the applications that we got was from [inaudible] university, and they created an expert search. I didn't create this expert search; they created it. When you do a search on any topic -- in this case we did it for, I think, Semantic Web -- I can see who the most prolific authors are in my search results. So I'm trying to bring the contextual application into my search results.
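A rough sketch of the kind of contextual application described above: call a content API for a topic and aggregate the most prolific authors from the results. The endpoint, parameters, and response fields are placeholders, not the actual marketplace API.

```python
import requests
from collections import Counter

# Placeholder endpoint and parameters; a real content API (e.g. the marketplace
# APIs mentioned above) documents its own URL, auth, and response schema.
SEARCH_URL = "https://api.example-publisher.com/search"

def prolific_authors(query, api_key, top_n=10):
    """Search a content API and count which authors appear most often."""
    resp = requests.get(SEARCH_URL, params={"query": query, "apiKey": api_key, "count": 100})
    resp.raise_for_status()
    records = resp.json().get("results", [])
    counts = Counter(author for rec in records for author in rec.get("authors", []))
    return counts.most_common(top_n)

# Usage (hypothetical key):
# for name, n in prolific_authors("semantic web", api_key="YOUR_KEY"):
#     print(name, n)
```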
So that's the key. If you are going to bring in visualization or anything else, it should be contextual. And in this case, again, this is their application, not my application. We can view Jim Hendler's network, we can really go in depth into Jim Hendler's network, and we can play with it.
So this is not my application. The crowd created this application, and they did it much, much more quickly than I could have. Another example that we have with the applications is prolific author. We partnered with a company, and they created this visual showing the most prolific authors within our content. And we created this application that shows the author's network in the system.
I think what we are pushing is that whatever application, whatever visualization or multimedia you are going to build, it should be contextual and should provide insight to the end users.
And one of the things that we are doing is creating integration tools so that we can integrate with other products, so we can link to other products. One of the products that we are linking to: we created an application called Brain Link. It looks at the content that we have, extracts neurostructures, and then links to the Brain Navigator product that my colleague, David Marques, is going to present to you now. Thank you.
>> David Marques: First I have to put everybody to sleep. Video. That's what I wanted to do, right. Doc cam. I want to pick up on a couple of things that were mentioned this morning, and particularly one of the things that Rafael was just saying now about what people are going to do with the content. So what I'm going to talk about is what people want to do with the content and making that easier.
I'll have to compress a two-hour talk into 10 minutes. So how many people in the room are neuroscientists? That's a good sign. This is all about neuroscience. It's about people who study the brain and study what different parts of the brain do.
To simplify it way down, the key problem is: I have some task that somebody or some animal does, and I want to figure out what parts of the brain are involved in that and how they implement it. So, a connection between the structure of the brain and the function of what those structures, those pieces, do.
We publish, in book form, atlases of a variety of almost 20 species of animal: rat, mouse, monkey, human, and birds, fish, all kinds of things.
So what we've done is take those atlases, which are plate after plate of the histology -- that is, slices through the brain at certain locations -- along with outlines of every one of the named structures that neuroanatomists have called out. If you just look at an anatomy picture, it's sort of fuzzy; you can see some major landmarks, but you can't see the thousand different little structures in each of the brains.
So what we've done is take that atlas, put it online, and help people figure out -- make it easier for them -- to actually do their work with it. I'll talk about two tasks that they do. One is planning the research, that is, planning how to get to the spot in the brain they want to do something with; and the other is, after they've done something, interpreting what they've done: did I get to that spot, and what change did I make at that spot of the brain? Those two tasks are two of the things we'll talk about for the brain.
You saw in ScienceDirect that when you're reading an article and you read about a structure in the brain, you'll get a picture of that structure as it is in the brain of the species that you're looking at, and then you can click over to this page, which is a quick summary of the printed atlas pages that have that structure in them.
The first thing you notice is the structure: everything that's colored here is the structure. I picked out the striatum. Now, that's actually made up of about eight different substructures. So we color them immediately and we show them in a structure hierarchy that is defined by the neuroscientists. There are a lot of things you can go in and do here, but the major thing you can do -- I'm going to talk about interpretation first -- is interpret your research.
The thing on the right here is an image, a slice through a brain, that you, the researcher, might have made from your animal, and you can compare that to the drawing of the standardized brain that we've published in our atlases. Then you can do an overlay of those on top of each other and match up exactly each named structure with each different part of your brain.
And there are all kinds of different ways you can shrink it and do various morphing to get the match exactly right. So this is the interpretation part: I'm interpreting what has happened here.
A lot of researchers will inject some kind of substance, a radioactive substance, whatever, and they'll get little spots in a certain area, and they want to know exactly what nucleus that's in. So they'll do this overlay process to interpret where they are.
It's a very simple visualization but a very powerful tool. And then they can print this out and do all the usual things that you would expect.
The other side is planning research, if you're going to go in and do an intervention. When you're thinking of a small mouse brain, we're talking about 5 centimeters by 9 -- millimeters, sorry, 5 millimeters by 9 millimeters. So you've got to hit an exact spot in there.
So one of the things you do is go to the atlas. This is what the researcher will typically do: they will go to the atlas and -- let me go back to the comparison -- they will say, okay, I want to hit it right here. They have a spot, and it tells them the exact coordinates of that spot. We even have a calculator in there, too: if your animal is bigger or smaller than this, we'll do a calculation for you so you can adjust your instrumentation to hit that exact spot. And then -- I'm not going to do this -- you can save this coordinate as a note.
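The coordinate calculator just mentioned presumably scales standardized atlas coordinates to the individual animal. A minimal sketch of that idea follows; the linear scaling rule and the reference bregma-lambda distance are assumptions for illustration, not the product's actual formula.

```python
# Stereotaxic target in the standardized atlas, relative to bregma, in mm:
# (anterior-posterior, medial-lateral, dorsal-ventral).
atlas_target = (-1.8, 1.5, -3.6)

# Simple linear scaling by skull landmark distance (bregma-to-lambda), one common
# way such adjustments are made; the reference value below is illustrative.
ATLAS_BREGMA_LAMBDA_MM = 4.2

def adjust_coordinates(target, animal_bregma_lambda_mm):
    """Scale atlas coordinates to an animal that is larger or smaller than the atlas brain."""
    factor = animal_bregma_lambda_mm / ATLAS_BREGMA_LAMBDA_MM
    return tuple(round(axis * factor, 2) for axis in target)

print(adjust_coordinates(atlas_target, animal_bregma_lambda_mm=4.6))
# (-1.97, 1.64, -3.94) for a slightly larger animal
```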
And then you go into the three-dimensional viewer. These are those same highlighted structures in the 3D viewer. You've got all your usual 3D: look at it from the top, look at it from the side, look at it from the front, spin it all around, the usual stuff. This, by the way, is built with VTK, the visualization toolkit, in case you're wondering.
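Since the viewer is described as being built with VTK, here is a small, generic VTK sketch of loading one structure mesh and rendering it with adjustable opacity, which is how the transparency control mentioned next becomes useful. The file name is hypothetical, and this is an editor's illustration, not Brain Navigator's code.

```python
import vtk

# Load a surface mesh for one brain structure; the filename is hypothetical.
reader = vtk.vtkSTLReader()
reader.SetFileName("hippocampus.stl")

mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(reader.GetOutputPort())

actor = vtk.vtkActor()
actor.SetMapper(mapper)
actor.GetProperty().SetColor(0.8, 0.3, 0.3)
actor.GetProperty().SetOpacity(0.4)  # semi-transparent, so structures behind stay visible

renderer = vtk.vtkRenderer()
renderer.AddActor(actor)
renderer.SetBackground(0.1, 0.1, 0.1)

window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)

interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(window)
interactor.Initialize()
window.Render()
interactor.Start()  # rotate, zoom, and inspect from any angle
```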
And, you know, you can play with the transparency and you'll see why that's
important coming up. So now you're in that spot. And one of the things, as I
said, you can do is you can plan your research. So let's say I wanted to inject
into there. I go in and I -- hard to read these plans. Oops, wrong one. So I can
say, okay, here are my saved injections. Here's my plan right there. Let's load
this one. We're going to load that one in. Did I not load it? Okay. Let's try that
again. Lower -- there we go. All right. There it is. So now it's loaded in. Let's
show the injection.
So now you see there's an injection cannula coming down. It's in that lower spot. Obviously we have all the usual things -- you can move it around, et cetera. One of the things that's very important is you can show what structures you're going through to get there. That might matter because you might want to not disrupt other parts of the brain. So let's say I want to show the rest of the brain -- this is the whole brain around it, done in much less detail, right? I can highlight my injection site. And if I want to avoid structures -- this list down here is the structures my probe is passing through -- I can change the angle of injection to go around and avoid certain ones of those structures.
This is a process that's routinely done, because they want to avoid certain cortical structures, because that's part of the behavior they're going to study. So they want to figure out what angle to come in at to hit that exact spot they were looking at. So you get the idea. Again, this is part of helping the researcher get the job done in planning their research.
And then, when they do their research, they often want to slice up their brain and compare it to the atlas. Now, what happens when they slice up their own brain is that usually they don't slice it in exactly the same plane that the atlas is in, because nobody does it exactly right. So we provide a slicing tool for them: let's now slice through the brain at a certain place, and now you see I've sliced through one plane. I have two planes; I can slice front and back. And I can now change the angle back and forth to match the angle that I've sliced in my histology, either intentionally or otherwise.
And then -- I'm not going to show it here, because it takes a while -- I can say, okay, now create a custom atlas. Between plane one and plane two, I will save off, at whatever frequency I want, drawings of the entire brain at each section on that plane, and now I've got a custom atlas that I can compare and overlay against my actual results. All right. So again, we're trying to make this as easy to use as possible.
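A small sketch of the custom-atlas idea just described: given two bounding planes and a section frequency, compute the positions at which drawings would be saved. The geometry is reduced to positions along one axis; this is an editor's simplification, not the product's algorithm.

```python
def custom_atlas_sections(plane_one_mm, plane_two_mm, step_mm):
    """Positions (along the slicing axis) of the sections in a custom atlas
    between two user-chosen planes, at the requested frequency."""
    start, end = sorted((plane_one_mm, plane_two_mm))
    positions, pos = [], start
    while pos <= end + 1e-9:          # include the far plane
        positions.append(round(pos, 3))
        pos += step_mm
    return positions

# Example: planes at -2.4 mm and -1.2 mm from bregma, one drawing every 0.2 mm.
print(custom_atlas_sections(-2.4, -1.2, 0.2))
# [-2.4, -2.2, -2.0, -1.8, -1.6, -1.4, -1.2]
```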
One final visualization I want to show. I should have called this out in the beginning, and my apologies for not doing that. The 3D modeling we built in collaboration with the Allen Institute for Brain Science down here in Seattle. They did a lot of the early work on building the toolkit; we redid it for our own purposes and added our own pieces. But one of the things they did was map the entire mouse brain for every one of 20,000 genes and where in the brain each gene is expressed.
That data is available through a public API. And so what we've done -- let me get rid of the slicing tool here -- is add a feature to search for any particular gene that you want to show. I've preloaded some of the genes in here; it only takes a few minutes. But now I can show the gene expression for that particular gene, which is the dopamine gene, and you notice it is very closely matched to these particular structures. You can load any of the 20,000 genes in here, and of course, as you saw this morning, you can set the threshold to get only the ones that are most strongly expressed, and so forth.
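A hedged sketch of pulling expression values for one gene from a public API and applying a threshold, as described above. The URL and response fields are placeholders; the Allen Institute's real API has its own documented query format and schema.

```python
import requests

# Placeholder endpoint; the Allen Institute publishes its own documented API,
# and the real query parameters and response fields will differ.
EXPRESSION_URL = "https://api.example-brain-atlas.org/expression"

def strongly_expressed(gene_symbol, threshold=0.75):
    """Fetch per-structure expression energy for one gene and keep only
    the entries above a user-chosen threshold."""
    resp = requests.get(EXPRESSION_URL, params={"gene": gene_symbol})
    resp.raise_for_status()
    records = resp.json()["records"]   # assumed shape: [{"structure": ..., "energy": ...}]
    return [r for r in records if r["energy"] >= threshold]

# Usage (hypothetical): show where a dopamine-related gene is strongly expressed.
# for rec in strongly_expressed("Drd1", threshold=0.75):
#     print(rec["structure"], rec["energy"])
```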
Again, now you're getting a real visualization of where, and exactly what, is going on here. You can do that in combination with the slicing plane and better isolate -- if you look at it from the front -- exactly where that is on this particular front and back plane, et cetera. All right? So you're really bringing together a good visualization of what you've done, where you've done it, and how that maps.
And finally, because I know I was only allowed a little bit of time here, one of the pieces that people talk about all the time is, okay, the structures are all right, and you can show all of the structures -- let's add in another structure. The hippocampus is a favorite one for learning and memory and Alzheimer's disease. That's the hippocampus, in case you haven't seen it before. There. Brought it up nice and bright. This is the hippocampus back here. You can do all the usual things of clicking on structures, highlighting them, finding out what they are, hovering over them, all that kind of stuff. Anything you can imagine doing with 3D, I think we've tried to throw in here.
But what people really care about is how these structures are connected together. Now, we don't have the actual point-to-point connectivity data loaded in yet, because it's not actually very complete.
What we have done is map out all of the fiber tracts themselves. So anatomically, where are these fibers going when they go from one spot to another? You actually won't find this anywhere else. There are about 100 different fiber pathways loaded up here. Let's bring in this one here; it wraps very nicely right around the hippocampus, because that's the main fiber tract leading out from the hippocampus. Again, you're starting to see all of the white matter. Let me bring them up to high resolution as well.
You see all the white matter as it wraps around the brain structures, and then the fluids, et cetera. And then -- there. People like to look at this as a nice sort of quick little view from the top of the brain, identifying all the cortical regions of the mouse brain.
And then of course we have mouse, rat, and monkey, and one of the things that we've done that nobody else has done is join together the nomenclature for all of those. So if you're looking at one structure in one species, you can hop to the other species and you'll get the same structure. We haven't done that for the human yet -- I mean we, the authors; I have to be careful here. I was a neuroscientist, but that was 30 years ago.
But our authors, who are neuroanatomists, are doing that, and they're joining up the human into that same nomenclature as well, which will be available later.
Finally, we're providing this information as part of an API, working with some of our partners who do MRI research. MRI research -- functional MRI, whatever -- has a lot less resolution, so you can't get down to identifying the thousand structures. You can maybe identify 100 or so, but not the smaller structures, the nucleus-level structures.
So what some of our partners are doing is taking their MRI data and mapping it in 3D to our models. We give them a whole set of models, they match up the major structures, and then you get linear distortions that go in each dimension, right?
So you map it up. Then we have an API: they send in the coordinates and we give them back information. Here's where you are; here's information about where that structure is in the brain. This one I don't happen to have -- I got a bad one here. I'll just randomly pick another one. There. So, randomly pick one.
And then it shows all of the atlas plates in our atlas that contain that structure, for reference, et cetera. Again, we're giving back information for them to use inside their software as well, opening up the data exchange and the overlay.
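A rough sketch of the coordinate-lookup exchange described above: the partner's software sends an atlas-space coordinate and gets back the containing structure plus the atlas plates that show it. The endpoint and JSON fields are placeholders, not the actual Brain Navigator API.

```python
import requests

# Placeholder endpoint; the real API, its authentication, and its response
# schema belong to the product and are not reproduced here.
LOOKUP_URL = "https://api.example-brainnavigator.com/structure-at"

def structure_at(species, x_mm, y_mm, z_mm):
    """Ask the atlas service which named structure contains a coordinate
    (already co-registered into the standardized atlas space)."""
    resp = requests.post(LOOKUP_URL, json={
        "species": species,
        "coordinate_mm": [x_mm, y_mm, z_mm],
    })
    resp.raise_for_status()
    return resp.json()  # assumed shape: {"structure": ..., "abbreviation": ..., "plates": [...]}

# Usage (hypothetical): a partner maps an fMRI voxel into atlas space, then asks:
# info = structure_at("rat", -1.8, 1.5, -3.6)
# print(info["structure"], "appears on atlas plates", info["plates"])
```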
Really quick whirlwind, but --
>> Lee Dirks: Perfect.
>> David Marques: Okay?
>> Lee Dirks: Well, thank you very much. And what we can do is open it up for
questions for Dave or for Rafael. All the way in the back.
>>: Yeah, [inaudible]. Just to clarify, is this primarily for research, what you showed, or is any of this in practice in terms of, specifically, things like linear accelerator, stereotactic radiosurgery, any of the Gamma Knife procedures where you need to take MRI data and have it mapped? Is any of this in practice?
>> David Marques: This project is only neuroanatomy, right? Because that's -- we are linking it back to the neuroanatomical structures. You know, whether the
same things can be done or not, we are working on a paper to publish out exactly
what we've done to create these models and how we map them and so forth.
But honestly, I think we're not the leaders in that kind -- that level of that
technology. As I said, our partners are doing the mapping between the MRI and
our data models. So, you know, again, the same thing can be done elsewhere,
but we're not going in other than basing it on our neuroanatomical stuff.
>>: You have great visualization. I had a question regarding some of the viewing of the gene expression data. As we know, a lot of this data has been accumulated over the last five or ten years. You specifically mentioned that there are 20,000 genes for which the expression can be visualized onto a specific part of the brain.
>> David Marques: So there are 20,000 genes that the Allen Institute for Brain
Science has mapped out specifically for the mouse.
>>: Right.
>> David Marques: Full stop. They are doing more for human now and will, you know -- so we have the API; we'll be able to pull those in as they come out. Many other places -- there are tons of researchers that are mapping out gene expression done in different ways, with different techniques, resolved differently, so they wind up in different structures and have different definitions. I mean, it's really very complex.
And so we're making it available so that we can reach out to those others and
pull those others in as well. It takes time to go through each one of those
collaborations. Most of them don't have the open API that Allen Institute has. So
we just started there.
But you're absolutely right. There's just reams of data, some old, some new,
and, you know, connecting it out and allowing the individual to say I want that
piece or that piece.
We're also going the other way around. We link out -- I didn't show you that, but for any structure you can link out to any of our partner sites, which are mostly university sites that have their own specific research on connections between one structure and another, for example. Computational modeling at USC: we link out to structures that are involved in a computational model of how the structure works in the monkey brain, et cetera.
>>: Yeah, that's fascinating. And you've pretty much preempted my question. But can a user who is using this sort of tool, if he or she has specific gene expression data, send it to one of the places that you mentioned? And how frequently will they --
>> David Marques: As soon as I can make that work. It's a matter of time and
resources, honestly. We're all dying to do that. I would say the thing that's going to come first, before the gene expression, is probably going to be MRI data, because we have people knocking at the door saying, can we send our MRI data into your service and have that service co-register -- that's the big tricky part, right -- co-register our individual animal against your standardized model? Then we can use the API, right?
So that service, I think, is probably where we will put resources sooner rather than later. The gene expression data has so far been a little bit more complex to work with those people -- I don't mean "those people" in a bad sense; I mean the people doing that research. And so it's coming slower. That's all I can say. But we're dying to do all of those things. It's merely a matter of how many bodies.
>> Lee Dirks: Any other questions? Dave, Rafael, thank you very much.
[applause].
>> Lee Dirks: All right. Well, we'll go ahead and move on. I'd like to introduce both Behrooz and Lorrie -- my colleague, Behrooz Chitsaz, from Microsoft Research. The two of them will be teaming up on this presentation around ScienceCinema: multimedia search and retrieval in the sciences.
Behrooz joined Microsoft in 1991. During [inaudible] he was involved in more than a dozen product ships, including the first three versions of Microsoft Exchange and Windows 2000. He joined Microsoft Research in 2002, where he led the program management team responsible for technology transfer from over 800 researchers worldwide to the product and services groups.
In his current role as director of IP strategy for Microsoft Research, Behrooz is responsible for developing and executing on strategies for bringing various Microsoft Research technologies to market.
And I'll also do a quick intro for Lorrie as well, so we can facilitate the transition. Lorrie Johnson is at the U.S. Department of Energy's Office of Scientific and Technical Information. She holds a master of science degree in information sciences from the University of Tennessee, and she completed dual bachelor of science degrees in biochemistry and zoology at North Carolina State University. And I didn't know that. I'm actually a big Carolina fan.
>> Lorrie Johnson: I won't hold that against you.
>> Lee Dirks: Okay. We won't talk any more college basketball, I promise.
I'll hand it over to Behrooz and Lorrie.
>> Lorrie Johnson: Thank you.
>> Behrooz Chitsaz: It's a pleasure to do a dual presentation with Lorrie. We've
been actually collaborating for the past [inaudible].
>> Lorrie Johnson: Two years.
>> Behrooz Chitsaz: [inaudible] specific project. So ScienceCinema -- it's not
on? Is it on? It's not on?
>>: [inaudible].
>> Behrooz Chitsaz: Okay. Is that better? Is that better? I'll just speak louder.
Okay.
So ScienceCinema is a site built in collaboration with Microsoft Research on a project that we call MAVIS internally, the Microsoft Audio Video Indexing Service. And the idea is to allow you to search inside the spoken document.
So essentially, treat spoken documents -- audio and video with speech -- just as you would textual documents. When you think about millions of hours of audio and video being generated on a daily basis, today the only way for us to get access to that is through the textual metadata that surrounds it. And a lot of the time that doesn't really give you the richness of what's inside the actual content.
I think the previous presentation mentioned something like hippo-something. It would be really nice for you to actually be able to search inside audio and video to capture that. Last year I was interested in finding out, for example, what we're doing around volcanoes, or whether we had any presentations about volcanoes, because of the volcano eruptions in Iceland. And I did the search on our videos and I found two talks that were given at Microsoft by people working on sensor networks on volcanoes in Iceland. Which was really interesting -- the kind of thing I wouldn't have found if there wasn't a capability to actually search inside the audio and video.
So, just to put this in context, multimedia is a very rich area in terms of research. Just so you know, GE isn't the only company that's doing face identification and tracking people and tracking objects. We're also doing work in that space. We're doing some really exciting stuff, which I mentioned yesterday in a meeting, around 3D medical imaging, segmentation, and the ability to find anomalies in 3D medical images, which I think is going to have a huge impact. And semantic extraction inside videos. A good analogy is around sports -- that's always a good one. Imagine the computer being able to automatically commentate your favorite sports program, whether it's basketball or soccer or football or whatever it is: for it to be able to understand what a foul is, know what a three-point shot is, know what a strike is. So, the capability to understand the semantics of what's actually happening inside video. Those are some of the things we're working on. Focusing on speech: today it's a huge area, with lots of applications around speech, and we're doing research in many different areas of speech.
Today, though, a lot of the work and applications in speech are around using it as an interface -- using it as an interface to directory services, for example. Using Bing or Google mobile search you can now get directory services, and it works very, very well.
So there's a lot of work done in that space and a lot of applications for accessing services on the back end. Many services today you can access through speech. Also, Windows has actually had speech to text for probably the past 10 years -- for a long time.
Now, what we want to do is take it further: take all the speech content and think of it essentially like textual documents. The ability to search inside it, extract metadata out of it, and, moving forward, create high quality closed captions. And in the future, the ability to create realtime speech-to-speech translation.
We actually demonstrated that at our event last year, where we had somebody speaking German and another person English and had it translate in realtime from one language to the other.
In order to do that, you need more than just being able to understand a single phrase. In these cases you have conversational speech, many different speakers, different accents, different domains they are speaking in. So there are a lot of interesting challenges in this particular space.
In terms of speech recognition, at a very high level, the first process is to analyze the actual audio. There are really two categories of speech recognition. One is phonetic based; the other is large-vocabulary based. In the phonetic case you don't convert the actual audio stream into words, so you're essentially searching inside the actual signals.
In the second case, you have a vocabulary, and that's higher quality -- you get more accuracy because you can apply grammar to it. So at a very high level, you have these acoustic models that take the audio stream, model it statistically, and compare it with phonemes. Then you create words out of those and apply grammar on top of that in order to recognize the words.
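To make that pipeline concrete, here is a toy, heavily simplified sketch: per-frame phoneme scores are combined through a pronunciation lexicon and reweighted by a crude language model to rank word hypotheses. The numbers are invented, and this is an editor's illustration, not MAVIS code.

```python
from math import log

# Toy per-frame acoustic scores: P(phoneme | audio frame). Real systems use
# far richer acoustic models; these numbers are invented for illustration.
frame_scores = [
    {"k": 0.7, "g": 0.3},
    {"r": 0.6, "l": 0.4},
    {"ay": 0.8, "iy": 0.2},
    {"m": 0.9, "n": 0.1},
]

# Pronunciation lexicon: word -> phoneme sequence (one phoneme per frame here).
lexicon = {"crime": ["k", "r", "ay", "m"], "grime": ["g", "r", "ay", "m"]}

# Toy unigram "grammar": how plausible each word is as language, independent of audio.
language_model = {"crime": 0.02, "grime": 0.001}

def word_score(word):
    """Acoustic log-likelihood of the phoneme path plus a language-model bonus."""
    acoustic = sum(log(frame_scores[i].get(ph, 1e-6))
                   for i, ph in enumerate(lexicon[word]))
    return acoustic + log(language_model[word])

hypotheses = sorted(lexicon, key=word_score, reverse=True)
for w in hypotheses:
    print(w, round(word_score(w), 2))  # best-scoring word hypothesis first
```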
Now, speech is very challenging. People have been working on it for 50 years, and it just gets better and better and better. But one of the things we wanted to do was focus on the search side of it. And once you focus on search, there are certain techniques you can apply to improve the search capability.
If you were to just take the speech recognition output, create essentially a transcript, and index that, you would not get a very good result. And the reason is that the accuracy you get out of creating the transcript is somewhere between 50 and 80 percent, depending on the speaker, depending on the domain, depending on a bunch of other things.
So there are a couple of techniques that we use. One is called automatic vocabulary adaptation, which means that you take some of the textual metadata -- the title, the speaker, the abstract, whatever you have -- you do some natural language processing on that, and you extract some keywords. You search on the Web and get some documents related to that particular content. You do more natural language processing on those and extract more keywords. You find out whether those keywords are actually in your vocabulary or not, and if not, you add them to your vocabulary before you actually run the recognition.
And you do that two or three times in order to improve the recognition accuracy.
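A skeletal sketch of that adaptation loop, with the keyword extractor and web search treated as pluggable black boxes; the helper functions are placeholders for whatever NLP and search services are actually used.

```python
def adapt_vocabulary(metadata_text, vocabulary, extract_keywords, web_search, rounds=3):
    """Grow the recognizer's vocabulary from metadata plus related web documents
    before running recognition. Assumes extract_keywords(text) -> set of terms
    and web_search(terms) -> list of document strings as injected services."""
    seed_text = metadata_text
    for _ in range(rounds):
        keywords = extract_keywords(seed_text)
        related_docs = web_search(keywords)
        more_keywords = set()
        for doc in related_docs:
            more_keywords |= extract_keywords(doc)
        new_terms = (keywords | more_keywords) - vocabulary
        vocabulary |= new_terms                 # add out-of-vocabulary terms
        seed_text = " ".join(related_docs)      # next round works from the richer text
    return vocabulary

# Usage sketch:
# vocabulary = adapt_vocabulary(title + " " + abstract, base_vocab,
#                               my_keyword_extractor, my_web_search)
```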
The other thing that we do is keep word alternatives. So if I say something like "Crimean War," the system might think I said "crime in a war," and that might have a higher confidence than "Crimean War."
So in order to improve search, we actually keep all these alternatives. Because the user knows the context of what they're searching for, there's a higher probability of their actually getting what they're looking for.
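A minimal sketch of why keeping alternatives helps search: the index stores every word hypothesis with its confidence, so a term the best-guess transcript got wrong is still findable. The lattice is flattened here to a simple list, which is a simplification of what a real word lattice stores.

```python
# Each recognized segment keeps every word alternative with its confidence,
# not just the single best guess. Times are seconds into the audio (made up).
audio_index = [
    (12.4, [("crime", 0.61), ("crimean", 0.39)]),
    (12.9, [("in", 0.55), ("war", 0.45)]),
    (13.1, [("a", 0.5), ("war", 0.5)]),
    (13.3, [("war", 0.9)]),
]

def search(term):
    """Return the times where the term appears among ANY alternative,
    so 'crimean' is still findable even though it was not the top guess."""
    term = term.lower()
    return [(t, conf) for t, alts in audio_index
            for word, conf in alts if word == term]

print(search("crimean"))  # [(12.4, 0.39)] -- a hit the best-path transcript would miss
```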
The other thing that we do is extraction of keywords, and it's not just extracting keywords out of the actual speech content; we also do this searching on the Web, finding more documents related to the actual speech content and using that to expand the keyword set.
Let's say the document mentions something about Microsoft. Searching on the Web, chances are you're going to get Bill Gates and a bunch of other things -- Silverlight, SharePoint, et cetera, et cetera.
So we can now get more information about that speech content by doing more analysis, by searching the Web and getting more things that are related to that particular speech content. So that's another one of the things that we do. And at the bottom I've got a link to our site.
One thing, though, with all these techniques -- doing the vocabulary adaptation, keeping the word alternatives, doing the actual signal processing -- is that all of this is very compute intensive. So one of the things we've done is integrate it into our Azure-based cloud service. All of that processing, that capability, is integrated in Azure, so the company or organization doesn't have to invest in the infrastructure to do it, and that makes it easier to deploy.
So their interface to that is just an RSS feed. The RSS feed contains links to the content to be processed as well as metadata like the title, abstract, et cetera. That information gets uploaded to Azure. Azure will download the content, do all the processing, and feed back what's called the audio index blob, which contains all the words, the confidence levels, the alternatives, and where they appear in the audio. That is then imported into SQL Server. The right-hand side is essentially what happens in the organization, and any database administrator is very familiar with what happens on the right-hand side, because it's just dealing with normal full-text search, essentially.
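A rough sketch of the kind of RSS item an organization might publish for the indexing service: a link to the media file plus the surrounding metadata. The element set follows ordinary RSS conventions, but the exact fields the service expects are not specified here, so treat this as illustrative.

```python
import xml.etree.ElementTree as ET

def rss_item(title, abstract, media_url, speaker):
    """Build one RSS <item> describing a video to be audio-indexed.
    The element set here is illustrative; the real feed contract is defined
    by the indexing service, not by this sketch."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "description").text = abstract
    ET.SubElement(item, "author").text = speaker
    ET.SubElement(item, "enclosure", url=media_url, type="video/mp4")
    return item

channel = ET.Element("channel")
channel.append(rss_item(
    title="Turning biomass into biofuels",
    abstract="Seminar on biofuel production challenges.",
    media_url="https://media.example.gov/biofuels-seminar.mp4",
    speaker="Example Speaker",
))
print(ET.tostring(channel, encoding="unicode"))
```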
So here I'm going to pass it to Lorrie, my partner here, and she will say more about the Department of Energy as well as give you a demo.
>> Lorrie Johnson: Yes. Can you hear me? Good afternoon. It's a pleasure to be here. Behrooz just described a little bit about the technology. I want to spend just a couple of minutes here talking about why this technology would be so important to a federal agency such as the Department of Energy.
I know most of you probably think of the Department of Energy when you buy a new appliance or a new car or you put a new roof or windows on your home. However, I wanted to point out that the Department of Energy is one of the largest research agencies within the federal government. Here in the US it invests over 10 billion dollars each year in basic science research, clean energy research, renewable energy, energy efficiency, as well as nuclear research, which is actually where we started years ago.
The immediate output from this investment of 10 billion dollars of taxpayer money every year is information, knowledge, R&D results.
So the mission of my organization, which is the Office of Scientific and Technical Information, or OSTI as we call it for short, is to accelerate scientific progress by accelerating access to this information.
We have been doing this at OSTI for a number of years, since the 1940s, for both DOE and its predecessor agencies. We started just after the Manhattan Project, collecting the information that people had produced at that point. Originally, of course, it was all on paper or microfiche, as some of you remember that format as well.
In the 1990s we started transitioning to electronic formats. And today we have a number of specialized websites and databases which are geared towards two goals. One is providing access to the Department of Energy's research information. The other is to enable DOE scientists and other US scientists to gain access to the information they need to do better research here in the US.
We have a few core products that I wanted to mention just briefly. Information Bridge is our full-text report database, and it has full text for over 250,000 documents going back to the early 1990s.
Science Accelerator is what we call a federated search product. It provides access both to the reports within Information Bridge as well as R&D accomplishments, project information, and some other things as well.
And then lastly on this slide we have science.gov, for which we act as the operating agent on behalf of the CENDI group. That one not only contains DOE information but information from 14 federal agencies.
And of course OSTI is also the operating agent for worldwidescience.org on behalf of the WorldWideScience Alliance. For those of you who have been in the meetings the past few days, you've heard about this one already. It provides access now to 400 million pages of scientific and technical information from almost 80 databases representing over 70 countries around the world.
Last June, in partnership with Microsoft Research, we launched a multilingual version of this, which provides translation capabilities for nine languages, giving users the option to enter a query in one of those nine languages, have it bring back results, and then translate the results back into their native language.
All of these products, however, are text based. And as we've been hearing all day today, we've got lots of emerging forms of scientific and technical information: certainly numeric data, multimedia data, social media. I'm not sure if anyone's mentioned Facebook yet today. But those are different forms of information that we all have to deal with. And we do see continued proliferation of multimedia in the sciences.
These forms of multimedia often present special challenges and opportunities, and I'll just briefly talk about a few of these. We have a lack of written transcripts for a lot of these things; there is no full text to search. So for those of us that are used to full-text databases, suddenly we have this multimedia video or whatever that doesn't have a corresponding transcript.
Metadata, if it's available, is often minimal. In many cases you might have the title, the name of the presenter, a date. But for those of us that were trained in library science, there are no thesauri, there are no subject categories. There's sometimes not even a brief abstract or a description.
Another challenge is the scientific, technical, and medical vocabulary that many of us are dealing with. Behrooz gave an excellent example of how the same words in different contexts can mean different things, and certainly we all have words like that in our fields, where depending on the context they have a completely different meaning.
Finally, in the case of videos, these things can be very long. Most of the ones that I'll be talking about here in a second are an hour or more long. For a scientist or a physician who is maybe interested in just one particular experiment or one surgical technique, it would be very difficult to sit there and watch an hour-long video when maybe they're only interested in two minutes of it. So that's yet another challenge that could be a substantial time burden.
So to overcome some of these barriers and challenges, as Behrooz mentioned, we have been partnering with Microsoft Research for about the past two years. We had heard about Behrooz and the MAVIS team through another contact at Microsoft, and this project originally began as an ICSTI Technical Activities Coordinating Committee endeavor.
What we have done specifically is collect video files from our DOE national laboratories and research facilities. We went out to their sites and collected anything that we thought met the criteria of being scientific and technical information. If it was something like a promotional video, we didn't include it.
We created RSS feeds with metadata, and the URLs were sent to Behrooz and the MAVIS team. I've had a couple of people ask me about that process. It was actually very easy; they just told us what we needed, and our folks were able to provide that.
At that point, Behrooz's team took over and performed the audio indexing via MAVIS. Then they sent the audio index blob back to us, and our IT folks were able to integrate it with our SQL servers. So at this point, as far as we're concerned, the product performs exactly the way it would on our other SQL Server-based databases.
So the end result is that the user can now search for a precise term within the video and be directed exactly to the point where that particular word was spoken.
So moving to the next slide. Lee, do you know if this is going to play the sound in a second?
>>: I think it will.
>> Lorrie Johnson: Okay. Okay. So today I'm happy to announce that we are launching the final product out of this two-year collaboration. This is now a public website that we're calling ScienceCinema. It's comprised of about a thousand hours of video from DOE. Of course it does use the MAVIS technology, and it officially launched today, as I mentioned. This does represent a groundbreaking capability among federal agencies in offering public access to audio-indexed video.
So as I mentioned earlier, there are some challenges with multimedia collections, namely that you usually don't have full text to search. But with this technology, as Behrooz mentioned earlier, you can actually conduct a search much as you would on a regular full-text database.
Okay. I'm going to move over here just a second so I can point some things out.
Behrooz helped a lot with this interface. He gave multiple suggestions over the
course of two years. So for a couple of folks that have seen it over this time
period, thank you. As you can see, there are 35 results for the search term
biofuels. You see a few familiar fields here, title, presenters, date.
This is actually a thumbnail of the entire video. So if someone did want to watch
the full thing, I think this one is about 55 minutes long maybe. And it's 342
megabytes for anybody who wants to know. So it is big.
These snippets here are the actual occurrences of the term biofuels. I'm going to
expand that, show more, so you can see all of them. So within this particular
video each time the word biofuels was spoken you have it identified and then
[inaudible] just a little bit.
And then the most recent thing that we've done to this interface was actually add
a timeline to the bottom. So each of these dots represents one of these snippets.
So in this case, you can see there's a cluster of the words -- the word biofuels
here at the beginning, a few more in the middle and then some at the end.
So in some cases, you know, you might see a cluster representing maybe five or
10 minutes of video and decide to just go there and watch the full thing. Let's try
a snippet. There we go.
>>: All of these challenges into turning biomass into biofuels. And right now
we're working both in the Energy Biosciences Institute --
>> Lorrie Johnson: So you heard him say the word biofuels.
>>: [inaudible].
>> Lorrie Johnson: I'm going to play a couple more here.
>>: With biofuels, you heard Jim talk about, we're going to have different crops depending on --
>>: There's a lot of different biofuels and, you know, I mean I hear you speaking, you keep on talking about this alcohol-like --
>> Lorrie Johnson: So the user could basically go through this whole video and
listen to each time the word biofuels is spoken. So that's a big improvement we
think over somebody having to watch a full hour's worth of video and listen that
closely.
We do have -- I do want to do a different search very quickly. Okay. This search
is on the words energy efficiency, so it will do phrases as well. I haven't tried like
any long sentences yet, but just to give you a quick example of a phrase, I'm
going to scroll to the second one here. There we go.
>>: I respect the importance of energy efficiency. [inaudible] because --
>> Lorrie Johnson: Okay. So you could search for, you know, any word, a phrase. I believe it will do Boolean ANDs and ORs as well. I won't do that just in the interest of time. Let me go back to the PowerPoint.
Okay. With the launch of ScienceCinema today, we're already looking towards the future as well. We do expect to receive additional content from our DOE researchers, both in the university and the research laboratory communities. We've actually modified our processes at OSTI to now accept multimedia forms of information. So we do hope that the ScienceCinema product will grow much beyond the thousand hours that we're currently offering.
Second, we do plan to integrate this website into the WorldWideScience project, along with, hopefully, a connection from CERN and a few others. And we'll be showing that at the ICSTI annual conference in Beijing this summer.
One thing that Behrooz and I have just touched upon and need to come back to is the creation of high quality automatic closed captioning. That is very important, especially within US government agencies, that you offer some kind of closed captioning for a lot of these videos.
And then finally, he mentioned the ability to do multilingual translation capabilities at some point on videos. We would hope that that would be a possibility in the future as well.
So finally I want to extend a personal thanks as well as one on behalf of the
Department of Energy to Behrooz, Lee, Tony, the MAVIS team and to Microsoft
Research for partnering with us in this endeavor. It's certainly been a pleasure to
work with Behrooz. And we look forward to continued collaboration.
>> Behrooz Chitsaz: Vice versa. Same here.
>> Lorrie Johnson: Thank you.
[applause].
>> Lee Dirks: Are there any questions? Yes, there is.
>>: Well, it's transformational, that's for sure.
>> Lorrie Johnson: Thank you.
>>: I noticed when you had the energy efficiency sometimes the snippets were
just the words energy efficiency, which aren't particularly helpful, and other times
it had some context around it. And it seems to me if I were looking at -- if I were
searching and using this, I would be using the transcript -- I don't know if I would
even want to listen to it if I'm going to just hear that word at that point. But if
there's enough context around that word, it's the transcript with just the most
valuable part. I almost don't need the -- I mean, at that point I might want to
listen to the whole thing. But if I know there's enough context around it, I'll know
oh, no, this is about something else, I don't want to go after that.
So what explains the differences sometimes?
>> Behrooz Chitsaz: So, yeah, there's a couple of things. One is around the
context. That is -- that's designed by the algorithm so you can actually specify I
want three seconds of context, go back, you know, four or five minutes or
whatever. You can certainly do that.
>>: [inaudible] specified.
>> Behrooz Chitsaz: Well, it's -- it can be specified for the corporation basically.
So we can configure that.
>>: [inaudible] the user.
>> Behrooz Chitsaz: No, not either user.
>> Lorrie Johnson: Right.
>> Behrooz Chitsaz: Yeah. So it can be configured for that particular -- it can
be, in fact, configured for the user as well if we wish to do that. So the
capability's there to do that is what I'm trying to say.
What is -- but there's a difference between an actual transcript and what you're
seeing here. Because what you're seeing here is essentially searching this
lattice which does have all the word alternatives. In order to generate a full
transcript, what you need to do is essentially walk the highest confidence path
which may not be the correct path to walk. Like the example I gave around
Crimean War versus crime in a war.
So my transcript would probably say crime in a war, while I actually said Crimean
War. So that is the difference between search and an actual transcript. But in
the future, there's ways of improving the actual accuracy, and those are things
[inaudible] doing, personalized vocabulary or personalized acoustic models for
the particular person.
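To illustrate the lattice-versus-transcript distinction described above, here is a toy sketch; the edge format, time stamps and confidence values are invented and are not the MAVIS data structures. The one-best path reads "crime in a war", yet a search over the lattice still finds the lower-confidence alternative "crimean war" spanning the same audio.

```python
# Sketch only: a word lattice kept as time-stamped edges with alternatives.
from collections import defaultdict

# (start_sec, end_sec, word, confidence) -- hypothetical values
EDGES = [
    (10.0, 10.5, "crime",   0.61),
    (10.5, 10.7, "in",      0.58),
    (10.7, 10.8, "a",       0.55),
    (10.0, 10.8, "crimean", 0.52),   # alternative hypothesis over the same span
    (10.8, 11.3, "war",     0.90),
]

def one_best_transcript(edges):
    """Greedy walk of the highest-confidence edges left to right (roughly how a transcript is made)."""
    words, t = [], 0.0
    for start, end, word, conf in sorted(edges, key=lambda e: (e[0], -e[3])):
        if start >= t:
            words.append(word)
            t = end
    return " ".join(words)

def lattice_search(edges, query):
    """Return start times where the query words appear as consecutive lattice edges."""
    terms = query.lower().split()
    by_start = defaultdict(list)
    for e in edges:
        by_start[e[0]].append(e)
    hits = []
    for start, end, word, conf in edges:
        if word != terms[0]:
            continue
        t, matched = end, True
        for term in terms[1:]:
            nxt = [e for e in by_start[t] if e[2] == term]
            if not nxt:
                matched = False
                break
            t = nxt[0][1]
        if matched:
            hits.append(start)
    return hits

print(one_best_transcript(EDGES))            # -> "crime in a war"
print(lattice_search(EDGES, "crimean war"))  # -> [10.0], found via the alternative
```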
There are a lot of techniques that can be used in order to actually improve that.
But speech recognition in the general case is still a very, very challenging job.
>>: So when I see just a couple words there, does that mean that it's spoken
slowly?
>> Behrooz Chitsaz: Yes.
>> Lorrie Johnson: Probably.
>> Behrooz Chitsaz: So that's essentially what it is.
>>: The timeframe --
>> Behrooz Chitsaz: Exactly.
>>: [inaudible] catch that.
>> Behrooz Chitsaz: That's exactly right. So it's more related to time is what
you're saying. Exactly.
>>: But I would presume that you could go backwards.
>> Behrooz Chitsaz: Yes.
>>: Grab the arrow backwards [inaudible] get your context --
>> Behrooz Chitsaz: Good point.
>> Lorrie Johnson: Uh-huh.
>> Behrooz Chitsaz: Absolutely.
>> Lorrie Johnson: Uh-huh.
>>: [inaudible].
>> Lorrie Johnson: Right.
>> Behrooz Chitsaz: Thank you for that. That's very important.
>>: What he's saying then is he doesn't want to have to listen to it [inaudible].
>> Behrooz Chitsaz: If he can read it [inaudible].
>>: Without slight wait [inaudible].
>> Behrooz Chitsaz: Yes. Yes. Absolutely. I just remembered now
what the previous presenter mentioned. It was hypothymia, right? Hypothymia?
Who was the previous presenter?
>>: Hippocampus.
>> Behrooz Chitsaz: Hippocampus. Okay.
>> Lorrie Johnson: Yes.
>> Behrooz Chitsaz: So that wasn't a word that was part of the title of the talk or
part of the transcript or part of the description of the talk. However, the
visualization that they showed in that presentation was a beautiful visualization.
So it would be great for me to be able to, later on -- if I was in the medical field
and I wanted to search and find out did anybody in Microsoft ever mention this --
get that particular presentation. And yet today I won't be able to do that
because -- well, actually in Microsoft you can, because we're indexing our
content. But in many other places you won't be able to do that, because you're
searching the text, and what's actually spoken typically contains much more
content than what's actually in the surrounding text.
>>: What about other languages? I would love to see those for let's say
German.
>> Behrooz Chitsaz: Yes. It's actually interesting you mention German, because
the lead researcher is, in fact, originally from Germany. So he's actually done
both languages, and the architecture supports multiple different languages.
In order to introduce a new language you need some training data, close to I
would say starting off about 300 hours to 500 hours of training data in order to
train the system, then add the vocabulary and things like that. So it is -- it is
doable in other languages. Right now we're sort of focused on the English
language, sort of understanding what it means to improve the accuracy, and then
moving to other languages.
>>: Is it part of the [inaudible].
>> Behrooz Chitsaz: Yes. It's definitely -- it's definitely a part of something that
we're thinking about doing for sure, creating -- even creating tools to make it
easier to introduce other languages, absolutely. And I mention -- it's interesting
you mention German. Because we actually did the translation, the realtime
translation was, in fact, between English and German.
>>: So if you wanted to perhaps do a project with the National German Science
Library, that's who you would want to talk to. [laughter].
>> Behrooz Chitsaz: Okay. Great.
>> Lorrie Johnson: I've got her card.
>>: Good advertising.
>>: So I'm curious. I may have missed this. But can you say a little about how
you handled disambiguation and homonyms and other types of -nyms when
people are using, you know, sort of keyword-level searching?
>> Behrooz Chitsaz: So that can -- that can actually be -- there is a level of sort
of the text search, so there's a lot of techniques in order to do that as part of text.
One of the things we can do is in fact integrate with the Bing logs and the search
alternatives and what people have done. So there's the speech recognition and
then there's this search. And we can sort of improve that. We haven't done that
yet. But we can certainly improve that by adding the same techniques that we do
on the Web. So it's essentially taking whatever Bing has done and sort of using
that in order to improve the search experience.
>> Lee Dirks: This will be the last question.
>> Behrooz Chitsaz: Yes.
>> Lorrie Johnson: Yes.
>>: I was curious with the ScienceCinema, when you have Department of
Energy researchers adding new video, is that something they can do dynamically
and automatically, or is that something where you send the video back to
Microsoft again and then add it to the website?
>> Lorrie Johnson: We have a process within our office at the Department of
Energy where the researchers actually submit files, whether they're text files
or, in this case now, metadata and video files. It would then go into our regular
system, which contains, you know, lots of formats. Then if it was identified as a
video, we would flag that as a candidate for ScienceCinema and then at that
point we would probably collect a number of videos before we would then send
them to Microsoft to be audio indexed. So it wouldn't be, you know, a real
instantaneous process. I mean that would be great at some point. But, you
know, it's easier I think to index a number of hours rather than one video at a
time.
>> Behrooz Chitsaz: The interface, just so you know what's happening on the
Microsoft side, what we do is we essentially take that RSS feed that has a link to
the content, we read that content in order to index it in Azure, and then it's
deleted. So there's no -- there's no trace of that in Microsoft. So it just, you
know, that's basically -- we just -- we need to read it in order to be able to index
it. So that's the only thing that we essentially do with the content.
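As a rough sketch of that ingest flow -- the feed layout and the index_audio callback standing in for the audio-indexing service are assumptions, not the actual OSTI or Microsoft interfaces -- the pattern is: read the feed, fetch each linked file just long enough to index it, then delete the local copy so no trace remains.

```python
# Sketch only: read an RSS feed, fetch each linked media file, index it, delete it.
import os
import tempfile
import urllib.request
import xml.etree.ElementTree as ET

def ingest(feed_url, index_audio):
    """index_audio(path, metadata) is a placeholder callback for the indexing service."""
    with urllib.request.urlopen(feed_url) as resp:
        feed = ET.parse(resp)
    for item in feed.iter("item"):
        link = item.findtext("link")
        if not link:
            continue
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            urllib.request.urlretrieve(link, path)                    # read the content...
            index_audio(path, metadata={"title": item.findtext("title")})
        finally:
            os.remove(path)                                           # ...then keep no copy of it
```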
>> Lee Dirks: Well, thank you very much.
>> Behrooz Chitsaz: Thank you.
>> Lorrie Johnson: Thank you.
>> Lee Dirks: Please join me in [inaudible].
[applause].
>> Lee Dirks: I'll let Sebastian get set up and I'll do his introduction.
Dr. Sebastian Stueker is the leader of the research group Multilingual Speech
Recognition at Karlsruhe Institute of Technology. His work currently focuses on
automatic speech recognition for simultaneous and offline speech translation
systems.
He received his doctoral degree from the University of Karlsruhe in 2009, on the
topic of acoustic modeling for under-resourced languages.
In 2003, he received his diploma degree in informatics from Universitat
Karlsruhe. His diploma thesis being on the topic of multilingual articulatory
features and their integration into speech recognition systems.
He has extensive experience in speech translation and has been working in the
field for more than 10 years. He is known for his work in multilingual speech
recognition, particularly on strategies for cases where data is insufficient or costly
to obtain, and on large-scale integrated speech translators.
And we will hand it over to you to tell us more about Speech Processing
Applications in Quaero.
>> Sebastian Stueker: Thank you very much for this kind introduction. Yeah, so
what I will do is probably complement a little bit the previous talk. And since I'm
coming from a university background I will also give you a little bit a glimpse into
the future what Microsoft might be doing in the next couple of years.
The title of the talk is Speech Processing Applications in Quaero, so the first
question I should be answering probably is what is the Quaero project? Well,
first of all it's not a project. Quaero is actually a program. And it's a French
program. It started out as a French-German program but due to some political
difficulties, in the end it turned out to be simply a French program with German
participation. That's the official language.
German participation being the Karlsruhe Institute of Technology, formerly known
as the Universitat Karlsruhe, and the RWTH Aachen University. And then our
French partners, and being funded by the French state.
And speech technologies are one of the very important parts in this Quaero
program. Quaero itself is all about addressing multimedia content, making
multimedia content searchable so that you can search through large databases
of video, audio, and text. And in order to do that, as pointed out in the earlier
talk, it is helpful to be able to deal with the speech that is happening in the
multimedia content.
Another important feature is to be able to deal with the vision part of the
multimedia content and other parts of the Quaero program are actually dealing
with it.
But in this talk, we will concentrate on the speech part of it, and especially I will
concentrate on a speech translation part of it. And when it comes to this
conference or workshop today, one of the questions we should answer is how
can technology actually advance science?
And when I thought about that, if you think about it, one of the important areas
where it can have an impact is actually academic lectures and talks. A lot of
scientific knowledge is disseminated in the form of lectures today, at these
workshops, obviously. And these lectures are now regularly recorded, as today at
this workshop or as in lecture halls at universities all over the world. And in order
to be able to access this scientific content, it is necessary to be able to search
through that.
And then of course besides being able to find the correct scientific content, we
also have to bridge a language barrier. English is not the only language in the
world, English is not the only language chosen to give academic talks in the
world, and most of us will only be able to speak a very limited number of languages. So
there's a lot of scientific content potentially out there that we cannot access
because of the language barrier.
So I will also hint and show how speech technology in the form of speech
translation technology can actually help to bridge this language barrier that we
have.
Another part, besides finding the content -- but I will not talk about that in much
detail -- is, as already said, this huge wave, this tsunami of information being
thrown at us in science.
So in the previous talk you could see how you could simply jump to the part in the
video where a certain keyword is mentioned, so you don't have to watch the whole
video. Another important or interesting part that we've worked on in the past --
but which is currently not my specialty -- is how we can summarize content
automatically, so that you can sort of condense the content of a lecture to the
most important parts and reduce the time that you need to take in that lecture.
So the Quaero program. It's a five year program. It started officially in 2008. It
had a sort of longer preparatory period where all the political issues were sorted
out. And it started out with France and Germany realizing that France and
Germany are not spending enough money when it comes to funding research.
The United States traditionally has been funding research from the government
side with large amounts over a very long time, and in Europe that sort of has
gotten out of fashion.
Japan, for example, suddenly realized that that was the same in Japan, and so
they started to hike up the public funding for research again. And then Europe
realized, hum, maybe we should do something about that as well. So Chancellor
Schroeder and Jacques Chirac, the French president back then, got together and
decided we need to do something about it and started this -- an agency that
was supposed to fund innovative projects. And one of the first programs to be
funded was Quaero. And it's a pretty large project. It has a budget of about
200 million Euros over those five years. And the French state is funding 100
million Euros for that budget.
And as I said, it's a program not a project. Program because it's made up of
multiple projects. And there are two types of projects in this program. One type is
application projects, where industry is heavily involved and where there are
certain projects -- certain applications -- that they want to research and develop
and where they get funding from. And in order to have really innovative new
projects, there are two technology projects.
And the role of the technology projects is to advance the state of the art in all
different kinds of technologies, such as speech recognition, machine translation,
image processing, retrieval technologies, you name it, whatever you need. It's a
real wide spectrum. I wouldn't be able to enumerate all the technologies.
But these technologies are forwarded in the technology projects and then
transferred into the application projects in order to advance those applications.
All the application projects deal with multimedia content in certain ways, and
they have this nice arrow where they sort of have a spectrum. On the one side
you have content providers and the other side you have the end user. So you
want to have tools for content providers in order to organize the media
databases.
Let's say you are a news station with a large archive of broadcast news that you
have produced over the years. You want to be able to search through that
database and to pull up clips from the past, et cetera, so you need tools for
that.
On the other side of the spectrum, you have a user with a mobile device who just
wants to watch a personalized video, his personalized TV program. So he has
sort of the other side of the multimedia content view. And the program wants to
provide technology that actually makes that possible.
So when it comes to speech applications and projects in Quaero, these are
basically the technologies being researched in the basic research project. At the
top there is automatic speech recognition. And the reason is that automatic
speech recognition is still a very challenging problem.
So projects for speech recognition have been out there for a very long time, and it
has been researched for decades and decades, and it's still far from being a
solved problem.
In the Quaero project, we're currently addressing seven languages, the core
languages being English, French and German obviously, since it is a German --
a French program with German participation, and English being one of the major
languages, especially when it comes to communicating with people with
different languages.
And there's a growing group of other languages that seem to be interesting,
especially when it comes to the properties of the languages: you want to select
languages that are somehow different from the main languages such as English.
So currently it's Russian, Spanish, Greek, and Polish. And there are going to be
two more languages over the course of the project.
Four partners are actually involved in doing the automatic speech recognition.
That's the KIT, then LIMSI, part of the CNRS in France, the RWTH in Aachen, and
Vecsys Research, now being called Vocapia. I think they just formed a new
company or renamed themselves. They also do speech recognition on a
commercial basis.
Then another technology that you need and that's being researched is speaker
diarization. Speaker diarization has two facets. One is you have a huge chunk
of audio and you just want to cluster the audio segments into groups that are
homogeneous, with one speaker each. So you have anonymous speaker IDs and
you want to know which speaker spoke when.
And then there is what is called political speaker tracking. Doesn't necessarily
have to be political, they just chose political as an application where you have
speakers that are known beforehand, you know you want to look for speech
coming from that known speaker, you know his name, for example, politicians,
and then you search the whole database and try to find all the clips where that
particular person has been speaking.
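As a rough illustration of those two facets, here is a simplified sketch; the segment embeddings, the cosine threshold and the greedy clustering are stand-ins for the real methods used in Quaero, not a description of them.

```python
# Sketch only: (1) cluster unlabeled segments into anonymous speakers,
#              (2) track a known speaker against an enrolled voiceprint.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def diarize(segment_embeddings, threshold=0.8):
    """Facet 1: greedy clustering -- each segment joins the first existing speaker it resembles."""
    speakers, labels = [], []          # one reference embedding per anonymous speaker
    for emb in segment_embeddings:
        for spk_id, ref in enumerate(speakers):
            if cosine(emb, ref) >= threshold:
                labels.append(spk_id)
                break
        else:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
    return labels                      # e.g. [0, 0, 1] -> "who spoke when"

def track_known_speaker(segment_embeddings, enrolled_embedding, threshold=0.8):
    """Facet 2: speaker tracking -- segments that match a speaker enrolled beforehand."""
    return [i for i, emb in enumerate(segment_embeddings)
            if cosine(emb, enrolled_embedding) >= threshold]

# toy usage with made-up 2-D "embeddings"
segs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
print(diarize(segs))                                     # -> [0, 0, 1]
print(track_known_speaker(segs, np.array([0.0, 1.0])))   # -> [2]
```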
Then language recognition. So you try to decide, for an audio file, which language
it actually is in. It's more complicated than it first seems. The reason being
is humans are so good at it. So it's hard to imagine why a machine should
actually have trouble with it, but it does -- it needs some work.
Then emotion recognition, dialog and vocal interfaces have been part of the work
package, but actually have not been worked on that intensively because
they haven't been asked for by the application project providers.
So the question is why they didn't want it. And first everybody thought it might be
interesting but then it sort of turned out nobody really had an idea how to use it in
the application. So it faded out a little bit.
And then of course we have machine translation, and then when you combine
machine translation and automatic speech recognition you get this new field of
speech-to-speech translation. And in order to measure progress and in order to
really drive people to achieve progress, the project is using something which is
called coopetition. This is an artificial word consisting of cooperation and
competition. So what we do is we have yearly evaluations where you have a
common test site, you evaluate your technologies. We partner one [inaudible]
technology on that common test set and in the end you measure who was the
best, who had the best performance. And then you exchange what you did in
order to be so good. And what were the real tricks, which techniques gave you
improvements. And that is the cooperation part.
So you then later exchange, and next year everybody knows the tricks from
everybody else, and then you have to come up with new tricks in order to be the
best.
Okay. Automatic speech recognition. The speaker before already talked a little
bit about automatic speech recognition, gave a little bit of an overview of how it works. I
will not do that because actually it's too complicated in order to talk about it in two
minutes.
When it comes to applications for automatic speech recognition, we have done a lot
of work on interaction with machines. So if you want to direct humanoid robots,
you want to control computers, appliances, but also human-to-human
interaction. For example, speech translation needs automatic speech
recognition as a part if you want to have people with different -- speaking
different languages interact.
And we've been working in that field now for over 15 years, almost 20 years.
And also we've demonstrated live translation scenarios, among different
languages. And then a different field is if you want to observe the human user
and you want to predict his or her needs.
For example, you've seen this smart room with the -- from GE with the video.
We've had projects where we did something similar. We also had a video
observing the users in the smart room, their actions. But we were also listening
in to them. We were performing speech recognition in order to predict their
needs in addition to observing them visually.
Speech recognition is difficult. And why is it difficult? Well, I was actually once
questioned by a researcher. She asked me, oh, speech recognition seems similar.
I'm working on medical images. It's much harder, it's multidimensional. And
speech recognition is a one-dimensional signal, it should be easy.
Well, actually it's not. Now, the reason why it is not easy is speech is very
variable. When you look at it as a recognition problem and that is the way you
look at it in science, you have the problem that the pattern -- every time
somebody speaks the same thing, the pattern that you record with your
microphone looks different. Even if you have the same person saying the exact
same thing under the exact same conditions, the actual physical recording that
you will make looks different.
And dealing with this variability is very challenging, and that has been tackled using
statistical methods. In real life it is really difficult because now you have different
microphones, you have different recording distances. The environment is
different. You have all sorts of -- sources of noises. People are talking and are
usually cross-talking. So you have several people talking at the same time and
you have to sort of concentrate on one speaker only. And so on and so on.
Speakers are in different emotional states, have different accents, have different
kinds of voices, et cetera.
Humans are good at automatically adapting to that. For machines it's actually
difficult. And as you've heard, speech recognition systems today automatically
learn models from large amounts of annotated corpora. And you need to collect and
annotate these corpora, which is very expensive. So that makes it easier for big
companies because they have a lot of money they can spend on that. But that's
why we've been working also on methods and trying to sort of reduce the
dependency on these large amounts of data.
And then you have the field of machine translation. And machine translation has
also been researched for quite a long time. And if you talk to machine
translation researchers, there are sort of two different kinds of approaches.
The first approach that people used was to do a rule based approach. So what
you basically try to do is you took a sentence, you tried to extract the semantics
into a representation that is independent of the language and then you try to
generate a sentence in the new language that had exactly the same semantic
information.
Works pretty well if you have written the right rules. Because all the rules had to
be written manually. So it's a lot of work. And there are actually companies out
there that have been working in that field for 20, 30 years, and they've started
now to have finally enough rules in order to tackle a language in a very large -- in
a larger domain.
But then IBM pioneered a new technique in the field of statistical machine
translation, where instead of humans writing large amounts of rules by hand, which
is expensive in time and money, you just had machines learn
automatically from parallel corpora. So the machine was learning by itself by
simply looking at examples and constructing statistical models that are very
similar to the ones that you use for automatic speech recognition.
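To illustrate that statistical idea, here is a toy IBM Model 1 style sketch: word translation probabilities are learned purely from a few sentence-aligned example pairs with expectation maximization, with no hand-written rules. The tiny corpus and the number of iterations are illustrative only.

```python
# Sketch only: learn t(e | f) from parallel sentences via EM (IBM Model 1 style).
from collections import defaultdict

corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

src_vocab = {f for src, _ in corpus for f in src.split()}
tgt_vocab = {e for _, tgt in corpus for e in tgt.split()}
t = {(e, f): 1.0 / len(tgt_vocab) for e in tgt_vocab for f in src_vocab}  # uniform start

for _ in range(10):                                  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:
        fs, es = src.split(), tgt.split()
        for e in es:
            norm = sum(t[(e, f)] for f in fs)        # E-step: expected alignment counts
            for f in fs:
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():                  # M-step: re-estimate t(e | f)
        t[(e, f)] = c / total[f]

print(round(t[("house", "haus")], 2))                # rises toward 1.0 over the iterations
print(round(t[("book", "buch")], 2))
```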
So machine translation is also difficult, but for other reasons. Not as much variability,
but things like word order. So different languages have different word orders. So
the machine has to be able to reorder the words, and that can be quite difficult.
For example, if you translate from English into German, the Germans have this
very unfortunate habit that they sometimes tear words apart. So it could
actually happen that what is one word in English needs to be torn
apart into two pieces in German, and they go in two opposite directions
in the sentence. And then it's really hard to tackle.
Also word fertilities. So one word in one language might be multiple words in
another language. Or multiple words in one language might be one word in the
other language.
And then you have ambiguities that you need to resolve. So here comes the
question from semantics that we had in the previous talk. So if you look at the
English word bank, it could be either the financial institution, it could be the edge
of a river.
In German the exact same word spelled exactly the same way could refer to a
financial institution or could actually be a bench. So these are things you have to
deal with. You have to -- in order to be able to translate correctly.
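As a toy illustration of that kind of ambiguity -- the sense inventories and context words below are invented for the example, and real systems use statistical context models rather than hand-made lists -- one can pick the reading whose typical context words overlap most with the sentence.

```python
# Sketch only: resolve the German "Bank" ambiguity by context-word overlap.
SENSES_DE_BANK = {
    "Bank as financial institution": {"geld", "konto", "kredit", "zinsen"},
    "Bank as bench":                 {"park", "sitzen", "holz", "garten"},
}

def disambiguate(sentence_words, senses):
    """Pick the sense whose typical context words overlap most with the sentence."""
    words = {w.lower() for w in sentence_words}
    return max(senses, key=lambda sense: len(senses[sense] & words))

print(disambiguate("ich sitze auf der bank im park".split(), SENSES_DE_BANK))
# -> "Bank as bench" (because of the context word "park")
```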
Okay. And that I will skip for now. Maybe I will get to that later.
So if you now combine automatic speech recognition and machine translation
you get speech translation.
It is very disappointing to see, but it's a fact nowadays speech translation is still a
simple concatenation of automatic speech recognition and machine translation.
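As a bare sketch of that concatenation, the following shows the control flow only; asr() and mt() are placeholders standing in for whatever recognizer and translator are plugged in, not any particular system's API. Note how any recognition error propagates straight into the translation.

```python
# Sketch only: speech translation as ASR followed by MT.

def asr(audio):
    """Placeholder recognizer: return the single best hypothesis for the audio."""
    return "wir arbeiten an der simultanen uebersetzung von vorlesungen"

def mt(text, src="de", tgt="en"):
    """Placeholder translator: return a translation of the recognized text."""
    return "we are working on the simultaneous translation of lectures"

def speech_translation(audio):
    # any error made by asr() is passed on unchanged to mt()
    return mt(asr(audio))

print(speech_translation(b"<audio bytes>"))
```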
In the previous talk you heard about how using these word lattices, which are
nothing else but a compact representation of different alternative recognized
sentences from the automatic speech recognition, instead of the single best
output can improve the search when you are looking for keywords.
People have been trying to do that for speech translation as well. Unfortunately,
they didn't have very much success. They didn't get as much
improvement as they were hoping for. When we talk about speech recognition
and speech translation, we usually distinguish between two different scenarios. I
call that the offline versus the online scenario.
The offline scenario is what you've seen in the last talk. We have a large
database of videos and you want to translate them or you want to search them.
So you can take your time. You're not in a real hurry to transcribe the video or to
translate the video. Also, you've got the whole material that you want to translate
in one big chunk. And that gives you, from a technical point of view, several
advantages. You can do very nifty things: you can go through the whole material
and take the whole available knowledge, the whole information that is in that
one recording, to segment the recording into good sentences, to identify different
speakers in the recordings, to adapt yourself to these different speakers in an
unsupervised manner by first doing a first recognition and then using that
recognition to adapt to the different speakers and then improve your
recognition.
So you can do several passes and you can take time. You can go burn a lot of
CPU time in order to achieve as good a result as humanly possible or as
machinely possible, I should say.
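As a sketch of that multi-pass control flow -- every function here is a dummy placeholder rather than a real recognizer component -- a first pass decodes everything, the hypotheses are grouped by speaker to adapt the models without supervision, and a second pass re-decodes with the adapted models.

```python
# Sketch only: offline two-pass recognition with unsupervised speaker adaptation.

def segment(recording):                 # dummy segmentation into sentence-like chunks
    return recording.split(". ")

def diarize(chunks):                    # dummy diarization: alternate two anonymous speakers
    return {i: i % 2 for i in range(len(chunks))}

def decode(chunk, model=None):          # dummy "recognition"; a real system uses the (adapted) model
    return chunk if model is None else f"{chunk}  [second pass, {model}]"

def adapt(first_pass_hypotheses, speaker):
    # dummy unsupervised adaptation: pretend we built a speaker-specific model
    return f"model adapted to speaker {speaker} on {len(first_pass_hypotheses)} hypotheses"

def offline_transcribe(recording):
    chunks = segment(recording)
    speaker_of = diarize(chunks)
    first_pass = {i: decode(c) for i, c in enumerate(chunks)}              # pass 1
    models = {spk: adapt([h for i, h in first_pass.items() if speaker_of[i] == spk], spk)
              for spk in set(speaker_of.values())}                         # adapt per speaker
    return [decode(c, model=models[speaker_of[i]])                         # pass 2
            for i, c in enumerate(chunks)]

for line in offline_transcribe("hello there. nice to meet you. let us begin"):
    print(line)
```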
So the only thing that actually limits the amount of time you are allowed to spend
is until when does it need to be done? For example some companies operate in
a way that they get the data and they provide the result overnight, or at what pace
is additional data added to your database? For example, if you look at YouTube,
there are insanely large amounts of data added every hour, every minute. So I've
got a timer. It says 19 minutes 11 seconds since I started speaking.
>> Lee Dirks: [inaudible] leave some time for questions. [laughter].
>> Sebastian Stueker: Okay. Okay. So the other case is the online,
simultaneous translation. So you have to keep up with the translation, you have
to keep up with the speech recognition which is very time consuming. You have
to go through a lot of tricks in order to be able to really process your data in a fast
way.
So here comes another science part. Academic lectures. Academic lectures
happen to be online and offline. They happen live in the lecture halls at the
university, and they're being recorded and put into large databases. Here are
some examples of large databases. For example, MIT started to record all their
lectures in the course of OpenCourseWare. Carnegie Mellon University
has the Open Learning Initiative, where they also collect all the data.
And now the latest thing, which seems to be very successful, is the iTunes
U, the university service, which has collected large amounts of data and also
videos of lectures from more than 12 countries. So that's where the translation
part now comes into things.
Besides teaching at universities, academic lectures at conferences, et cetera, are
also now routinely recorded. And one thing, even though we're short on time, I
want to show it. In the talk of the speaker from [inaudible] -- Tim, where are you?
Not here anymore? -- he talked about how he has different multimedia things
and how to visualize the research, et cetera, and how to make things available to
the public. There's something called the TED lectures: technology, entertainment,
design. They have sort of targeted the general public. And they talk about
technological things. And so what he forgot to mention, I think, is basic public
talks in order to bring the [inaudible]. So this is for example one of the TED
lectures, actually on the large [inaudible], trying in a sort of entertaining but still
scientifically sound way to introduce the collider to the general public.
So in order to be able to advance science we can use speech technology to work
with these lectures. We can try to find the relevant information by using
information retrieval. We can try to summarize the content. There are
techniques for that. They are not very advanced yet. It's still a research topic.
But it's something we should look into in the future.
And then of course we need to overcome the language barrier. Currently what
people are doing is they're using broken English at the conferences.
And usually English is -- everybody thinks everybody speaks English, especially
in academia. Actually that's not true. English is not as widely spoken as you
think; even in countries -- areas like the European Union, it's less than half of the
average people who are actually able to communicate fluently in English.
And you should never forget that we are often in a position where we don't have
as much difficulty learning English as other people, because our languages are
very close to English.
Imagine that some cultural revolution takes place and, due to demographics,
tomorrow Mandarin will be the lingua franca, and put yourself in the position of now
having to give a talk in Mandarin at an international conference. And then you
will probably feel what pain they are going through in learning English. So I won't
talk too much about how important it is. Believe me, it is important that we keep
up the language diversity in the world, that not everybody speaks English,
because speaking different languages brings along whole different ways
of thinking.
The language that you speak heavily influences the way that you think. If we get
rid of all languages and leave only one language, or have only one language left, we will
actually get rid of a large diversity of thinking. It will be a large loss to us and to
how science advances.
Okay. The last five minutes that I have now I would like to show you some
things. So Microsoft was hinting that in the future they want to
do multilingual translations. So let me give you an example of what that might
look like when Microsoft is going to do it.
So this is something that came out of a European project that was finished in
2007. So what the project did back then is they recorded or they used the
existing recordings of European parliament lectures and started to translate
them. So this is one of the examples that --
>>: Ladies and gentlemen, I'm delighted to be here --
>> Sebastian Stueker: So this is the then British secretary of state speaking at
the European parliament. And there we have our translation into Spanish. So we
did different translation directions. [inaudible] we also have a German translation
in there.
We've also been working with other data. So this is actually now an offline case
where you have a database [inaudible] processing. We also tried something:
what happens if we apply that through a [inaudible] to a different scenario. So
this is a system actually running in realtime. It's a recording, but there we
[inaudible] take this time of taking a long time translating it. So this is now
from English into -- from German into English, which is a very challenging speech
[inaudible].
Also, what we've done in the past: in 2005 we actually demonstrated, or we
actually did show, a first version of our simultaneous lecture translation system. So
currently we are doing -- personally I'm doing a lot of work on advancing the
simultaneous lecture translation system that we developed, which simultaneously
would translate, or can translate, lectures, in our case from English into Spanish as
shown here.
So this little video shows how that system works. You have a speaker that is
being recorded. It's transferred to a [inaudible] that works on a normal PC, and
then it translates, and then the translation is brought to the audience. And we
have some different kinds of modalities for how to bring that. We have these
[inaudible] devices. We've gotten some prototypes from [inaudible]. What they
do is they produce a very narrow beam of audio. So the idea is that you have
different areas in the audience, the Spanish speaking, the French speaking and
then you have different areas there.
The other things that we're working with are subtitles. So we have these
[inaudible] where you have subtitles displayed to you personally. We're working
with subtitles projected onto screens. In the background you can see some people
wearing these goggles, and we're also working with those [inaudible] in order to
[inaudible]. This is what the audio device sounds like. That's what it looks
like. So it produces this narrow beam of audio.
And then another thing that we've done -- skip that. Another thing that we've
done is -- you heard, you've seen -- it is very time consuming and very, very
intensive to do this speech recognition and translation. You need a lot of
computation power. So we decided to put it all on an iPhone. Actually back then
when we started the research, it wasn't an iPhone yet so it was first a compact
handheld. And most of the translation systems that you see nowadays do need
an Internet connection. They record the audio then send it to a server, have it
processed, translated. You send back the result.
It's sort of unfortunate if you are in a foreign country, the roaming costs will kill
you. [laughter]. It's good for the companies providing the service but for you as
a consumer it's better to have it all in one device. So we actually now have --
[inaudible] now has this spinoff company where he commercialized that
[inaudible]. So he's [inaudible] decided to have fun on vacation -- it's called
Jibbigo -- to have that in one of their commercials.
And it's basically a two-way translator, different languages [inaudible] an English-
Spanish version that all runs on a simple iPhone. And since it has to run on the
iPhone, it's for our tourist domains, so it can't do the whole lecture translation
thing. For that we need at least a laptop-size device. But for tourists travelling
abroad, something like an iPhone -- an iPad also works nowadays -- is already sufficient.
So if you want to play around with that, I have two versions with me, otherwise
you would have to pay I think 25 bucks on the [inaudible] site. [laughter].
The lecture translation system I didn't bring with me today, so I can't show
you that. But I can show you, if you want to look at what the Quaero people
have been doing in the field of information retrieval, a nice website
that you can play around with. It's the Voxalead -- the Voxalead news service,
from a French company, and they've used speech recognition technology in order to
index news videos, news clips. And you can then query the news clips, either
radio shows or TV shows. Or if you look for Microsoft you get, even from the BBC, its
latest one, or this one about Bill Gates at the economic forum in Davos. And it's
available in multiple languages. So currently you have French, English, Chinese,
Arabic, Spanish and Russian. More languages I guess to come, because I know
the company that is providing the speech recognition technology also has
multiple -- many more languages in their toolbox.
So that is just so that you see that there are actually multiple companies
nowadays providing this kind of information retrieval capability. This isn't
a scientific document scenario, but as I said, we're currently working on the
academic lecture scenario. We've been working a lot with the TED lectures. So
there's also work going on in that area.
And that is really something that is useful for science: if you are able to access
language -- if you are able to listen directly, across languages, to an ongoing
presentation, or if you are able to actually access, across languages, lectures that
have been prerecorded in a database. And that is, with respect to this
workshop, I think the most important point: that this is a really worthwhile thing to do,
and that it will advance science significantly if these products get out and the
core technology improves even further, to a point where it is applicable in multiple
languages and is as reliable as humanly possible.
Okay. So I guess my time is up now.
>> Lee Dirks: All right. Thank you. Thank you very much.
[applause]