>> Zhengyou Zhang: Good evening, ladies and gentlemen. Welcome to our lecture. This
lecture is sponsored by the Technical Society. We talk about what's hot, what's best, what's going on in
the industry, and so we are delighted to have Dr. Li Deng with us tonight. Li is our
engineer of the year this year, 2015, for Region 6. You will be amazed how much he has
achieved: he has 60 international patents, four books, and 300 to 400 technical papers, really great
papers. And he has a lot of honors and awards: the 2013 IEEE SPS Best Paper Award; Editor in Chief
of the IEEE Transactions on Audio, Speech, and Language Processing; Editor in Chief of IEEE Signal
Processing Magazine; the Technology Transfer Award; the Gold Star Award 2011; the IEEE SPS
Meritorious Service Award; General Chair of the 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing; Board of Governors of the IEEE Signal Processing Society; Board of
Governors and Vice President of Industrial Relations of the Asia-Pacific Signal and Information
Processing Association; Publication Board of the IEEE Signal Processing Society; keynote speaker at
Interspeech on September 18, 2014; keynote speaker at the 12th China National Conference on
Computational Linguistics -- and so many keynotes, I just don't want to list them all. He is just an
amazing person, and I think you will enjoy this evening.
>> Li Deng: Thank you very much for the introduction -- given earlier, before most of you came, yes. So for today's
topic, I originally titled the talk Recent Advances of Deep Learning, but many of my
colleagues said that, well, I should take the opportunity to talk about not only the recent
advances, but also maybe some history of how Microsoft got into this area of research. And
this is a very hot area now. People talk about artificial intelligence, machine learning, deep
learning. Almost every single day, you get news coming out from the media, so that happened
actually over the last three or four years, maybe two or three years, so the more recent ones, you
get a lot of acquisitions, small company acquisitions. Microsoft doesn't acquire -- hasn't
acquired any companies. Only Google acquired a lot of companies in this area, so going through
this lecture, you are going to see why. So I'm going to provide a selective overview of a whole
bunch of projects done at Microsoft for deep learning, and I'm going to show you exactly how,
historically, how deep learning got started from here and how it was advanced and then how the
whole industry actually is working in this area now in terms of state of the art. So I first of all
want to acknowledge many of my colleagues at our Deep Learning Center at Microsoft Research
and also with the collaborating universities and many of Microsoft engineering groups, some of
whom are probably sitting here, who are collaborating with us to create impact in some
projects. So almost all the material that I will talk about today is public. It has either appeared
on CNN or in the New York Times. All this is a summary of a whole bunch of things that we put
together, and I'm going to point you to some of the new activities we are planning to do -- not
much detail, because they are not public yet, but I'm going to say whatever I feel comfortable
about saying. Okay, so first of all, the first real topic is how and when and why Microsoft
got into deep learning, and I'll start with deep speech recognition. So this is actually very early,
early work that we started doing with academic pioneers. So -- there's no laser pointer here.
Okay, but it's okay; I can just talk through this. So officially, Microsoft
got into deep learning research approximately towards the end of 2009. So we worked with
Professor Geoff Hinton at University of Toronto. Actually, the collaboration started earlier than
this. We actually worked together right before this, and he traveled to this building, actually to my
office. We worked together on what's called deep learning for speech recognition.
And that -- actually, there's a long story about how it led to this workshop. So during the
workshop, there was an extremely high interest in doing deep learning for speech recognition,
and then we consulted with university academics, with researchers coming over here, and we learned a
great deal from them. And then he sent a couple of students to work with us, so we
devoted a fair amount of resources to the collaboration, and then we actually made the whole
thing work well. So now, every single voice interaction system in the world, fair to say,
including Microsoft, Google's, Amazon's, is using this technology. So I'm going to tell you how
this whole thing started as part of the presentation. So to track it, we started somewhere
around 2009, and we worked quite a number of years in collaboration with academics, and most
of the work was done right in this building. And we created this technology called deep
learning speech recognition. It was demoed for the first time by my boss, Zhengyou's boss, our
boss, Rick Rashid. Now he has retired -- soon after he gave this talk. He was
actually in China giving his talk, and then that story was featured in the New York Times in
2012, and John Markoff was the reporter, and he interviewed me right in this building, as well as
with Geoff Hinton, and the story is very, very amazing, and this is the real-time demo system,
where Rick Rashid spoke English. He doesn't speak Chinese -- and then his speech was automatically
recognized in English, translated into Chinese, and spoken aloud using a Chinese speech synthesizer,
and it sounds like him. So many of you, my colleagues way over there and then
people saw that, students -- this is a big audience of about 3,000 students, and they shed tears. They said,
wow, now they can talk with their mother-in-law easily, and this is a joke, right? A lot of
Western people who marry Japanese spouses couldn't speak with their in-laws -- and now, they were
able to speak without any difficulty. And that happened in 2012. So now the underlying
technology used in this demo actually is what's called deep neural networks, so we call it
context-dependent deep neural network hidden Markov model. It's all technical, and they were
invented about two years ago through our collaboration with University of Toronto. And this is
the very important figure here, so this axis is the year. So if you sit in the back, you won't be
able to see that. This is 1999, and this is 2012, so this is about the time span. And this is the speech
recognition error rate. So in 1993, when people started evaluating the word error rate for
spontaneous speech recognition, just like the speech I'm producing now, the error was almost 100%.
Everything is wrong. That's about 20 years ago. And then people are working very hard. So
every single year, the error drops. Look at how much progress people made -- so this kind of evaluation was
sponsored by DARPA. They put tons of money into individual groups, including companies. IBM
is one of them. Many universities participated. And then whoever got the best result got funding
the next year. It's a very stressful test. So if they're not doing well, they get cut off, so
that's why it drops so much. So every single year, it comes down here. Now, roughly around the
year 2000, things stopped changing, so the error stays the same, at 20-some percent. It actually
stayed about the same for 10 years. So in 2009 and 2010, when we first started doing
deep learning in collaboration with Toronto, the error rate dropped from 20-some percent to 15-some
percent. And then for Rick Rashid's demo, which was two years after that, the error dropped to
7%, and every time, most of the things he said got recognized correctly. Look at how much it
moved from here to here, 10 years, it's just dropping down here. Essentially, error dropped more
than half for this, so this became usable. And the underlying technologies, I don't have time to
go through all the technical details about how this deep learning actually works here, except to
let you know that deep learning essentially extracts the speech information and puts it into a
layer-by-layer representation in the neural network. In the end, we used up to about 12 or 13
layers.
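To make the idea of a layer-by-layer representation concrete, here is a minimal sketch in Python with NumPy of a deep feedforward classifier of the kind used for acoustic modeling. Acoustic feature vectors pass through a stack of fully connected layers before a softmax over output classes. The layer widths, the 440-dimensional input, and the 3,000 output classes are illustrative assumptions, not the configuration of the system described in the talk.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes: 440-dim acoustic input (stacked frames), a stack of
# hidden layers, and a softmax over 3,000 output classes (e.g., tied HMM states).
layer_sizes = [440] + [2048] * 7 + [3000]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(features):
    """Propagate acoustic features bottom-up through the layer stack."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                           # each layer re-represents the one below
    return softmax(h @ weights[-1] + biases[-1])      # class posteriors at the top

frame = rng.normal(size=(1, 440))                     # one (random) acoustic feature vector
posteriors = forward(frame)
print(posteriors.shape, posteriors.sum())             # (1, 3000), sums to 1
```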
>>: Can you give a basic definition of DNN. Is deep learning the same as deep neural
networks?
>> Li Deng: So for this talk, maybe you can think about it that way. So deep neural networks,
in speech recognition, maybe 90% of the work is done by deep neural networks, and there are
some additional new advancements of deep learning, like recurrent neural networks, like
memory network, like deep generative model, like also convolution neural net, which are a little
different from DNN. So think about DNN to be maybe 70%, 60% of deep learning, and the
other ones I don't have time to go through. Well, for image recognition, I'll probably say a
few words about a somewhat different type of network, but they follow a similar structure: you have a
layer-by-layer representation that carries the information from the bottom up all the way to the high
level, to allow you to do very effective discrimination and pattern recognition, which is the same idea.
So now, for ImageNet, which I'm going to go through later on, the
number of layers goes to about 60 layers. It's amazing. People just keep getting better and better
results, and for speech, roughly 10 to 15 layers are good enough, but with one single layer,
you hardly get any good results. So those are the main advances achieved by Microsoft
Research during those years. And that led to -- so in 2013, MIT Technology Review, and
this is a very reputable technology journal, they ranked deep learning number one of the 10
breakthrough technologies, and one of the reasons they put deep learning there is precisely
because they cited this real-time speech translation from our boss Rick Rashid's demo. So
that's very influential -- that's 2012 and 2013. So now, I want to tell you a little bit about how
this invention gets spread out into the entire industry. Now, MSR is doing that. MSR is pretty
late. They hired a few people from our group. Google, actually -- so Professor Geoff Hinton is
one person I have to credit a great deal. He's really a pioneer from the university side. Now, Google
acquired his company, which was hardly a company at all, just himself and two students. That
happened in about 2013, and we were trying to do that as well. We didn't bid high enough. But,
anyways, so there are a lot of interesting stories. He is the one who actually got together
all the 10 authors, including IBM people, Google people, his graduate students, myself and my
colleagues, to write up this paper. This paper, from just about two years ago, has about 600 citations,
a very highly cited paper. Essentially, it remains the state of the art reference for this, and he
managed to get the shared views of four research groups, including Google. So this is the only time
competitors got together without a patent fight and without any argument. We all agreed on every
single word we wrote. It's about 20 pages, so we coordinated. Actually, he and I spent a lot of time
coordinating all this, and everybody is happy with the conclusions. So this one became a very
interesting kind of original paper that many
people follow up. And, of course, once you demonstrate that at large scale successfully, everybody
follows. And that includes Google now. There's our Skype Translator, Windows Phone, Xbox -- all
the voice features here use the same technology I showed you. Siri is using it -- my former boss got
hired by Apple about two years ago and started doing all of this. And then there is Baidu. Basically,
every single major company is using the same technology, so this is very, very impactful. And also
university labs and DARPA programs -- now every single team in those groups uses this technology.
And I'm very proud to say that it all originally started in this building, right on the third floor here. Okay,
so this is an article from Businessweek. Everybody knows about it. So the race to buy human
brains behind deep learning machines, and that was about one year ago. So our boss, new boss,
Peter Lee, actually made some very interesting remarks here. This is a fabulous article that I
use to recruit people. So according to him, my boss -- I actually checked with him whether he really
said that, and he told me, kind of -- companies including Microsoft and Google are in this race; Google
actually spent about $400 million to purchase a company about one year ago, and we didn't do
that. He said that, with Google, we find ourselves in a battle for the talent. Microsoft has gone from
about four full-time employees -- about four people who were working together at the time I talked
about -- to about 70 one year ago, so now it's more, maybe 100-some people in the company.
Maybe some of you are here. And we would have more if the talent were to be had. And he said
last year the cost of top deep learning experts was about the same as top NFL quarterback
prospects. Do you know how much they make? So I did
ask. I didn't get an answer from this. Well, that's what our boss said, so I'm not sure. So this is a
very good recruiting statement there. He put it in Businessweek, so I can quote that. Okay, so
now I'm going to go -- so this is not just speech recognition. Speech is the beginning, so now we
are moving from recognition to do translation, to do understanding, so I'm going to spend three
minutes talking about each of the projects at a high level. Okay, now, for the speech translation
project, Wired covered this story and CNN covered this story, all within one year.
They talk about how Skype uses artificial intelligence and deep learning to build the new
language translation system -- I do have a video, so if I have time, I'm going to show you some of the video of how
this technology works. Now, we also have this speech understanding task. Recognition is not
the whole story: for Siri to be able to talk to you, and for Cortana and Google Now to be able to
talk to you, they all use the technology called speech understanding. So this is one of the very
popular models. It is a little bit different from a deep neural network: it has a recurrent
characteristic, so time plays the role of the layers in this type of architecture, and it's very useful
for any time series problem, like speech translation, understanding, and also something like time
series analysis. I read some papers where people are using this model to do stock market prediction,
but I haven't seen the results, and even if they use it successfully, I don't think they're going to
publish anything, because it's kind of competitive. But recognition is not competitive. Everybody can
get a [indiscernible], so they have to be recurrent, so we wrote some papers on this. So the next topic
I'm going to do very quick. There are so many projects I want to do, so I'm going to selectively
go through some of those. So the next -- this is a big subject called the deep image recognition at
Microsoft, and Microsoft is a latecomer for this area. So I have some very interesting stories to
tell you about how this whole thing started. So in the fall of 2012, the same thing happened for image
recognition. It's very similar to speech, except the history is a bit shorter, and also, it's not the
government that sponsors the competition, but Stanford University that actually runs it. They
don't fund the participants; they create the database to evaluate the different teams in the competition.
So before 2012, people used this test called ImageNet, which has about 1,000 categories of
images. The idea is to give you about 1 million training samples; you train on them, and everybody
trains their own system. Before 2012, they were all shallow systems. Essentially, they do
handcrafted feature engineering: they use SVMs, they use sparse classifiers -- just a single
classifier built on top of the handcrafted features -- and everybody was doing that, and
these are the typical error rates, about 26%. And that stayed about the same for the last several
years, ever since the competition was created, about maybe five years ago.
In the first two or three years, the error kept dropping, up until this time. I think in the previous
year, 2011, the error rate was also about 27%. Unlike speech, where you got 10 years of
stable error, here they got about 26% or 27% error for two or three years. And in
the same year over here, this is the Toronto group of the same professor, Geoff Hinton, and
they actually knew their speech work had succeeded, so this was two years after that.
Essentially, he had the belief that the technology was good, and he put the best people on his team
to apply the same technology, using convolutional networks. It's a bit different from a plain deep net,
but a similar kind of concept, with many layers. At that time, they used seven layers. And all of a
sudden, they dropped the error from 26% to 15%, just like in speech recognition. And that created a huge amount of
excitement. So what I want to say over here is that during the fall of 2012, around October 20th or
so, the announcement of the results was made public. Typically, the way they do the competition is
that people submit their results and Stanford University evaluates them. You have no idea how well
you did, because the test set is withheld. So when the result was actually announced, it was so good
that many people didn't believe it. I think on the day the result was announced, I was actually here in
this very room, giving a lecture on machine learning. Chris Bishop was there too, one of the very
famous scientists in our company. We were actually doing a company-wide machine learning lecture
series, and I was one of the lecturers. So right before I gave the lecture, I got an email on my cell
phone. Geoff Hinton sent an email to me, saying, Li, take a look. Look at how big the margin is
compared to before. And I wasn't sure how genuine it was, so I sent it to the people in the company
doing [indiscernible]. And they came back saying, there must be something wrong with
this one. It's too good to be true. Most people didn't quite believe all this, but anyway, so
actually, that lecture was also recorded, if you want to watch it. I put this number up
there. I said, I don't know whether we can believe it or not, but this is just a wonderful number
over here. It turned out that the following year, all the computer vision groups actually duplicated it
-- not only duplicated it, but improved on it, including our Microsoft Research group, so
we were actually a latecomer in a sense. Okay, but anyway, in that particular year, just one
single year, it made this huge impact on image recognition, and I think the University of Toronto
took all the credit for this. So the shallow models are somewhere around here, and in just
the first year of deep learning, the error dropped down to about 15%. The second year, that
includes New York University and also a couple of companies, including Microsoft. They're all
participants in this competition. The best result is about 11%. It dropped error down a little bit
more. Now, for this year, 2014, the results came three or four months ago, last October. When the
results were announced, the error had dropped down to 6.6%, by Google. They put their best
people there and built the network up to about 20-some layers with parameters; counting the layers
without parameters, it's about 60 layers, because a lot of layers have no parameters to change. So
between 20 and 60 layers. It's just getting bigger and bigger, deeper and deeper, and the error is
down to about 6.6%. And then Microsoft has a very nice group in Beijing, Microsoft Research in
Beijing. Around February 5th this year, about three months after the competition results, they
published a paper dropping the error down to 4.8%, 4.9% -- this is right here, this paper, by our
colleagues at Microsoft Research Asia. And it's just wonderful, so this is their number down here,
and this is the number from Google in the competition. Yes?
>>: You said 1K?
>> Li Deng: 1,000.
>>: That's the number of nodes?
>> Li Deng: No. That's the number of classes at the output. How many categories you can
assign to the image. Okay, so this number -- I think one day after this number was published and
publicized, there was huge media coverage, and the title of the media coverage said that now
Microsoft beats Google. And then the day after that, Google published a paper in the same venue --
no peer review, but just posted -- and the media coverage came back, because Google also
immediately published a paper, not completely written. I think they looked at this and then reported
4.8%, and then the media title was that now Google regains the lead -- everything happened within
about one month, about 1.5 months ago. But anyway, this is very, very interesting. That's the end of
the image part. Yes?
>>: What's the method for classification? In other words, you're saying there's --
>> Li Deng: Oh, this is very simple. So you build a layer-by-layer representation.
>>: How are they building those layers? In other words, how are they building the classifiers?
In an automated way, or is it still --
>> Li Deng: Everything is automated. The whole point of deep learning is that you have an
algorithm called back propagation. It gives you end-to-end learning: you provide the target at the
top layer, and then the learning goes all the way down to the bottom, to the raw features.
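As a minimal illustration of that end-to-end idea (a sketch with assumed sizes, not the speaker's code), the NumPy example below provides targets only at the top layer and lets back propagation carry the error signal down through every layer to the raw input features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative problem: 4-dim raw features, 2 classes, one hidden layer.
X = rng.normal(size=(32, 4))                 # raw input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # made-up target provided at the top
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (8, 2)), np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(200):
    # Forward pass: raw features -> hidden layer -> class probabilities.
    h = np.maximum(0.0, X @ W1 + b1)
    p = softmax(h @ W2 + b2)

    # Backward pass: the error signal defined at the top layer...
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)

    # ...propagates back through every layer, all the way to the input weights.
    gW2 = h.T @ grad_logits
    gb2 = grad_logits.sum(axis=0)
    grad_h = (grad_logits @ W2.T) * (h > 0)
    gW1 = X.T @ grad_h
    gb1 = grad_h.sum(axis=0)

    # Gradient descent update of all layers jointly (end-to-end).
    for param, grad in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        param -= 0.5 * grad

print("final training accuracy:", (p.argmax(axis=1) == y).mean())
```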
>>: After it is learned, or is it --
>> Li Deng: No, the structure is not learned. Correct. The structure has to be built separately. So for
images, if you just directly use a plain deep neural network, it's not going to work well. You have to
use a convolutional structure, and that structure is well known. Okay, so next -- my time is almost up,
and this is actually just my introduction to everything. So the real work that came after all
this is what's called semantic modeling, and that's getting much, much more
interesting now, and I think Microsoft is really taking the lead in this area, although our company
doesn't actually get a lot of publicity outside, because a lot of the work has business value to
Microsoft. So we do publish some of the work, but I'm not going to talk in detail about all the
business implications of this, just to tell you what kind of technology we have developed. You
can imagine what kind of business applications you may have. So we have -- in order to do this
kind of modeling, we established this technology center, so I'm actually managing the
researchers and developers doing a whole bunch of applications that you're going to see down the
road, in the next 20 minutes. So the model we built is called the deep structured semantic model,
DSSM -- think about it as a deep semantic model -- and we do have a whole bunch of publications
on it. I'm going to talk strictly from the publications here, and at the end I do have a list of
publications if you want to know more. And, of course, about applications we don't talk too much,
but I'm going to give a little hint of what kind of applications we have been doing. So I'm not going
to go through all the technical details, except to let you
know that -- okay, the whole point is that when you talk about a semantic model, that means you have
a sentence coming in, or some business activity coming in, and you ask: what is the underlying
semantics? What does it mean? If you know what it means, you can do a lot of things for email
analysis, for web search, for advertisement. All these things can be done if the computer understands
what the information means. So the meaning -- the structured meaning enabled by this model -- is
what's called the semantic model. So the whole point of this model is that we -- yes, there are a few
sort of technical points
here, so it's that we train this model using various signals. I'm going to show you what kind of
signal this is, but that signal comes from different applications. But most of them, we acquire the
signal without human labeling effort. Otherwise, it's too expensive. So for example, one of
these types of signals that doesn't require human labeling is the click signal for Bing. We happen to
have Bing in our company, so every time you search on Bing or on Google -- if you search on
Google, Google gets your signal and uses it to do whatever they do; if you use Bing, Microsoft has
that signal, and we use it. You actually donate that signal to Microsoft or to Google -- you may not
think of it as a benefit to them, but you donate everything to them. And that signal tells you that this
piece of information and that one are related, which is extremely valuable, and from the company's
viewpoint you don't have to pay any money to get it, so we take advantage of that signal. So I'm
going to show you some examples here. So I'll
show you -- so rather than giving you all the mathematical description of this model, I'm going to
show you an animation to show you exactly how this model works, and then I'm going to show
you a whole bunch of applications for this model. So suppose that you have a phrase called a
racing car, and you have to understand the meaning of 'racing car.' It's a car, and it's used
for racing, but when you put the words together, how does the computer know what you mean? The
way we do it is that we build a deep neural network -- many layers, matrices, nonlinearities -- and at
the end you get a 300-dimensional vector, and that vector, you don't know anything about it. In the
beginning, you randomize everything here. You code the input in terms of symbols, and coding
information as symbols is very simple: you just look at the dictionary. But the semantic meaning
doesn't get captured when you do this. If you just multiply through -- if you don't train all these
parameters -- there is no meaning extracted. They're just random numbers. So when you initialize
them, you get these 300-dimensional vectors, arbitrary ones. So 'racing car' gives you this vector.
Now, another phrase gives you something like 'Formula 1.' Are they related to each other? They're
related to each other, although the words have nothing in common. If you use a symbolic coding
with nothing in common, you don't learn anything. Now, if you do deep learning, you come up with
the connection. I'll show you how. But at model initialization, nothing is going to happen, right? This is
random, so this one can be here, it can be here. They don't have to be the same, even if the
meaning is the same. Now, when you get another phrase, say 'racing towards me,' it's random as
well. It has the same word, 'racing' -- are they related to each other? Not really, right? Even if the
words are the same, they don't mean the same thing at all. And these two are very similar to each
other in concept, but they don't have any words in common. So if you use symbols to code the
information, like most natural language processing people do, you really don't know much about
their connection, right? So that's the initialization. Nothing is learned yet. Now, the next step is the
training, so this is a learning
process, and that really is the secret sauce of how everything works. So in the learning process,
we compute what's called cosine similarity between semantic vectors. So when we initialize
this, this goes to here, this goes to here. They are random, right? They are not the same, even if
the meanings are similar, and these two are random as well. Now, when we learn the weights in the
deep learning network -- okay, sorry. When we learn the semantic vectors, we actually know that
when the user types 'racing car,' the Formula 1 website may show up and get clicked. We know
that. So if we know that, we
know that they are related to each other. Therefore, during the learning process, we force the
cosine distance between this vector and this vector to be small. We make the distance as small as
possible, and that's the objective function. And since this other document almost never gets clicked
when the user types 'racing car,' we want to push them far apart. So we want these to be close to
each other and these to be apart from each other. So we developed this formula. It's just a function
saying that for a positive, clicked example we want the numerator to be big -- that means the cosine
similarity needs to be big, the distance small -- and the terms for everything else, the non-clicked
documents in the denominator, we want to be relatively small. The numerator is the pair that has
the click information, which means the computer knows they're close to each other, and the
documents that are not clicked go into the denominator. So when you take the ratio between the
two and optimize it, you force the clicked pair to be close to each other and force the non-clicked
pairs to be far apart. And then you do back propagation, and I don't have time to
go through all the detailed algorithms, for those of you who know them. Now, after everything is
trained, you run the same inputs again. See, for 'racing car,' these vectors are now trained, learned.
Once they are learned, 'Formula 1' -- wow, you can see what is happening: by the end of training we
have forced them together, so they are similar. Therefore, you can actually rank them. And 'racing
towards me' -- when we train, we push it apart, as far apart as possible. So when you have many
other documents, you can use the cosine distance to rank documents for a specific query. Therefore,
when you type something into the search box, the right page will show up, even if the right page has
no words in common with your query. It's a very, very intuitive idea, and we use it as the basis for
many applications.
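Here is a minimal NumPy sketch of the training objective just described -- not the actual DSSM implementation. Two small randomly initialized networks map a query and candidate documents to 300-dimensional semantic vectors; the cosine similarities go through a softmax-style ratio with the clicked document in the numerator and the non-clicked documents in the denominator. The feature vectors, layer sizes, and the smoothing factor gamma are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM_IN, DIM_SEM = 500, 300          # illustrative input and semantic dimensions

def embed(x, W1, W2):
    """Two-layer nonlinear projection to a semantic vector (a stand-in for the deep model)."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Randomly initialized parameters for the query side and the document side.
Wq1, Wq2 = rng.normal(0, 0.1, (DIM_IN, 400)), rng.normal(0, 0.1, (400, DIM_SEM))
Wd1, Wd2 = rng.normal(0, 0.1, (DIM_IN, 400)), rng.normal(0, 0.1, (400, DIM_SEM))

# One training case from click logs: a query, the clicked document, and
# a few non-clicked documents (all random stand-ins for real feature vectors).
query = rng.normal(size=DIM_IN)          # e.g., "racing car"
doc_clicked = rng.normal(size=DIM_IN)    # e.g., the Formula 1 page
docs_unclicked = [rng.normal(size=DIM_IN) for _ in range(4)]

def clicked_probability(gamma=10.0):
    """Softmax over cosine similarities: clicked pair in the numerator,
    all candidate documents in the denominator."""
    q = embed(query, Wq1, Wq2)
    sims = [cosine(q, embed(doc_clicked, Wd1, Wd2))]
    sims += [cosine(q, embed(d, Wd1, Wd2)) for d in docs_unclicked]
    exps = np.exp(gamma * np.array(sims))
    return exps[0] / exps.sum()

# Training would maximize log(clicked_probability()) by back propagation through
# both networks, pulling the clicked pair together and pushing the others apart;
# here we just evaluate the objective once at initialization.
print("P(clicked doc | query) before training:", clicked_probability())
```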
>>: Are you actively forcing those apart, or just bringing those together --
>> Li Deng: Both. So if you optimize that ratio of cosines, you do both. Yes. It does both.
Okay, so I'm not going to go through -- well, except to say that this is all done on GPUs. Okay, so
this is just one example, for web search. There are so many other things that we have done.
Question answering can be done this way. Automatic application recommendation can be done this
way. Natural user interfaces can be done this way. A whole bunch of things can be done this way.
Our center explores many, many of these applications. I'm not going to go through all the detail,
except that we published a lot of them -- but usually, we don't publish with big fanfare. We publish
in the smaller venues so people don't notice. But now I've said it. People can
obviously -- but anyway, I'm going to show you two examples, given the limited amount of time.
One is automatic image captioning. So that's a very interesting task. So not only do you
recognize the image. You want -- let me go through this. Image captioning simply means that
if you have an image here, you want the system to write down a sentence the way a person would.
A person would write down 'a stop sign at the intersection on a city street.' So it's not a recognition
task anymore. It's called caption generation. It tells the story of an image, or of a video, or whatever,
a movie or something.
>>: Self-driving cars will need these.
>> Li Deng: Yes, eventually, it might. So I'm not going to go through -- so the model that I
showed you earlier actually was used in this system. Again, we actually published this paper in
CVPR. We have about 10 people working together on this in the summer with four interns.
We've worked hard on this, and Google at the same time has a paper on this, and then Stanford
has a paper there. So that paper submission was about two months ago, and the minute they
submitted, they called the New York Times, and the New York Times published a big article about
their work on the weekend. And we saw that and said, wow, we should have called them as well. So
we called them, too, on a Monday. They said, that's old news already. But anyway, all of these got
accepted. Six papers got accepted, because six groups were doing this simultaneously. Stanford
has a paper, Toronto has a paper, Microsoft has a paper, and it's a big race over here.
And they don't even -- typically, academics submitting to CVPR and IEEE wait until the conference
to present. No, they already publish in these open archives, and they call the media and the media
covers them. It's kind of a very strange world now, and all because of deep learning. Without deep
learning, people wouldn't do that. But, anyway, I'll show you, but I don't
want to go through all this, except to show you some results. Yes, all these are -- but these are
the most interesting results to show there. So this is a photo, and then you ask the machine to
describe what it is, and the machine will say, 'a group of motorcycles parked next to a
motorcycle.' It's okay, not too bad. And this is the human annotation: 'two girls are wearing short
skirts and one of them sits on the motorcycle.' And do you know why the difference? People say,
well, machines don't care about girls. And for other examples, we get some very interesting results.
So of course, if you compare this to this, do you prefer this one or that one? Maybe you prefer this
one. Some people may prefer the other. Some people say they are equally good, so we actually did
crowdsourcing to ask people. I don't have to go through it here. About 25% of the people on
Mechanical Turk say the machine is as good as or better than the human annotation. So for the
example I gave you, maybe it's equally good. Well, maybe not. The majority of them still prefer the
human captions, so we're getting there. We probably expect this to be 35% next year, the following
year maybe 50%, and then you see what I'm saying, so I think this technology is improving very
quickly.
>>: Can you distinguish the human? Like if you give them --
>> Li Deng: Yes, so if I don't tell you which one is the machine and which is the human, are you able
to tell? Which one do you prefer? Well, I can't. Okay, if you can tell, you belong to that group. That's
not bad. But how about the other one? So for this picture, which one would you say? 'A woman in a
kitchen, preparing food.' 'A woman working on a counter near a kitchen sink preparing food.' Maybe
something like that, right? They could be similar. But anyway, we chose good examples here, and
there are worse examples, and that's why it's only 23%. So you get the gist of what's going on here.
So now I'm going to give the next example here, and this example is a real example in Microsoft
Office. We released that. There's an article written up on this. I would like to tell you about
what it is. So this is called contextual entity search. How do you search? I'm going to
show you a very good example that I'd like you to go home and try. So I'm promoting Microsoft
products using deep learning, and the article Microsoft published is called 'Bing Brings the World's
Knowledge Straight to You with Insights for Office.' And the underlying technology, the
machine-learned relevance model, is actually powered by deep learning; our group contributed to
this. So I'll show you how it works, okay? So if you
open up Microsoft Word -- and your kid is probably writing some article, and he will write,
'Lincoln was the 16th President of the United States and he was born in' -- I forget. He wants to
write the essay. Normally, do you know what he does? Typically, you go to Wikipedia. You type it
in. You can even just type in 'Lincoln.' If you type in Lincoln, a movie may show up, you have
Abraham Lincoln, and then you get the information and copy it back. Now, with Office
Insights, a new feature for Office, you don't have to do anything. You just right-click here on
'Lincoln,' and this will show up right inside your Microsoft Word. And then you look at it and copy
the answer back here. But how does it know which Lincoln it is? The
movie Lincoln could show up, the car company called Lincoln could show up, and the town in
Nebraska could show up. You know why they don't? If you just typed 'Lincoln' into Wikipedia, all
three might be there, and you'd probably have to choose one of them. Not here, because our deep
learning takes the context into account. So when you click over here, Microsoft Word automatically
gets the surrounding text. It sees that, oh, they're talking about America, the United States of
America, so it has to be this Lincoln, not the other Lincolns, and that provides a lot of [indiscernible],
so that is going to make using Microsoft Office more productive, and that's the kind of productivity
gain. It's one single example of how this deep semantic model is working. And the reason why --
>>: It could show up as Abe Lincoln. It can be anything, right? It needs some context for it?
>> Li Deng: So the context is here. It's automatic, yes. Exactly. So the whole point of this
work is that the two sides of the semantic model condition on this context when we train that
cosine distance. Therefore, if you see something like this context, these will be close to each other.
Therefore, this Lincoln will show up, because in training, similar kinds of things were put
together. Yes?
>>: Question. Is this just implemented in Office 365, so it's got to --
>> Li Deng: No, it's only in Office. It's only in Word Online. It's not in everything.
>>: What I mean is, it's online, meaning Office 365. That's where you've got the cloud behind
it. It's not running on your PC.
>> Li Deng: Correct. That's correct. It has to be in the cloud. Otherwise, you can't do all the
deep learning computation. That's a good point. But 365 hasn't put that in yet.
>>: I expect they will, though. And now I know what Insight will mean when I see it.
>> Li Deng: Yes, exactly. So keep that in mind. So if Microsoft announced -- it actually was
announced in just a small version of Microsoft Word.
>>: Speed to run it? Is it super-fast?
>> Li Deng: It's super-fast. You don't notice any difference. So I'll show you some negative
examples for deep learning. For this crowd, since I'm talking with experts, I'll show a negative
example, so people know that I'm not talking about flowers only; I'm also talking about something
negative. So one negative example is malware classification. For this example, when we use
deeper neural networks, we don't really gain very much, and we understand why.
Most of the people doing deep learning think about it as a black box. But once you have gone
through so many examples, you already know which tasks will be good for which kind of model.
And the reason why this one is very difficult to improve using more depth is really because the data
we got is adversarial in nature. We actually published a paper on this negative example about two
years ago. The result is still good; it's just that a single-layer neural net is as good as multiple layers,
and in the past people had a very hard time understanding why, because usually the more layers
you add, the better representation you can get. For speech, for language, for images, this is all true.
For some data, it may not be true, so just be aware that not everything is good for deep learning,
and if you want to know why, you really have to have good scientific training to understand it, and
the whole experience helps. So this is the same task that we failed to improve with deep learning.
It has now become a standard task in a Kaggle competition. Does anyone know the Kaggle
competitions? This year, we actually put this task up there, so I heard about it -- does anybody know
about this competition? Can you say a few words? How many groups participated in this?
>>: I am not doing this. I did that sea plankton thing, which came up just before this. I did that,
so I haven't even started doing this.
>> Li Deng: So the result that we published two years ago may be verified or may be disproved
by this competition, and that competition -- how many days to go? 47 days to go. So usually, it's
the same thing: typically you get about 100 or 200 groups submitting. The winner gets about
$12,000 or something, but you have to submit all your source code. This is the company that
actually collects all the data, and its president from the previous years -- last year he quit to become
president of a startup company doing deep learning. Under his leadership, Kaggle ran so many
competitions, and he always told me that in most of the tasks deep learning won, so he wanted to
do it himself rather than just letting other people do it. So he started that company, and he's now a
founder of it. We are going to see the result for this one now, but I would
expect that deep learning probably will not be the top one, if it's consistent with our result. So
the summary of this talk -- I'm going to finish very soon. Microsoft actually started this deep
learning work around 2009. We invented this type of DNN model and revolutionized the entire
speech recognition industry, from the 20-year-old systems to the new deep learning systems now
universally adopted by both industry and academia, as everyone knows. Microsoft created the first
successful example of deep learning at industry scale. Deep learning is very effective now not only
for speech recognition but also for speech translation -- Microsoft did a tremendous amount of work
on this -- and also image recognition, which is now in image tagging for Microsoft OneDrive; we were
late to start this, as I showed you earlier, but we caught up pretty quickly. Image captioning, the one I
showed you, hasn't gone into any product yet, but someday you are going to see that coming. I don't
know whether Microsoft will be the first one to come out with it; maybe not. And for language
understanding, I
showed you a little bit of this semantic modeling. I didn't have time to talk about all this. Some
of those I probably do not feel comfortable talking about. User modeling, a whole bunch of things --
and the enabling factors for all this success are that we have big data sets for training deep models
and we also have very powerful general-purpose GPU computing. That's very important; otherwise,
we couldn't even run all these experiments. And the final important factor, which many people in
this community ignore, is that we have a huge amount of innovation in deep-learning architectures
and algorithms. In the beginning, we got comments saying, is the neural network the only
architecture around? If it is, then there's not much innovation; it's fixed. Now, to do these kinds of
things, semantic modeling, the architectures are so varied. A lot of innovation goes into how to
design the architecture based upon the deep learning principles and also your understanding of the
domain problem, in order to make it effective. So there's a huge innovation opportunity here, and
the innovation from the Microsoft Research standpoint is how to discover some distant supervision
signal that is free from human labeling. If you require human labeling, it's too expensive; it's just not
going to scale up. So in the web search example I showed you, no human, right? Automatic. You,
the user, gave the signal to us, and we used it. If you don't give it to us, it's not going to succeed.
And after we understand how to get this supervision signal, we need to know how to build
deep-learning systems grounded on exploiting such smart signals, and the example is the DSSM I
showed you earlier. I think that's pretty much it -- now, another summary. Yes, this summary
actually is written for experts on deep learning. So
for speech recognition, I would say I'm very confident in saying all low-hanging fruit are taken.
No more. You have to work very hard to get any gain, and for that reason, my colleague
and I wrote a book, about 400 pages, summarizing why this is the case. People have to work very
hard to make progress, so it's good to get that out there, to help others make faster progress. Every
time you go to a conference, a lot of the work you see -- beautiful pieces of work -- doesn't get much
gain. A tiny bit of gain for much more work, not like three or four years ago, when you could take
one GPU, dump in a load of data, do a little innovation in the training, and get a huge amount of
gain. Now the gains are smaller, and this is not to discourage people from working on this. This is
actually to encourage serious researchers and graduate students who want to write PhD theses to
get into this area. The low-hanging fruit is gone, and if you're a startup company, don't do that --
unless you find some good application to use the technology, rather than the technology by itself.
Now, for image recognition, I would say most of the
low-hanging fruit is taken. The error rate just dropped down to 4.7%, which is almost the same as
human capability. So if you get more data and more computing power, you can still get the error
down a little bit more, but not much more. Now, for natural language, I don't think there is much
low-hanging fruit there. Everybody has to work very hard to get any gain, including all the
deep-learning researchers. Now, for business data analytics, there's a new research frontier, and
some low-hanging fruit there; I'm not going to say too much about this. Also, for small data, many
people think that deep learning may not work well, and that's wrong. People always say that for
small data you should use traditional machine learning, not deep learning, and that deep learning is
for big data. That's wrong, and there are so many examples from Kaggle -- we're going to hire
somebody who actually won one of those competitions to join our group to do this. So deep
learning may still win; it's not guaranteed. For big data -- especially perceptual data, which is big --
deep learning always wins, and wins big.
>>: Perceptual meaning?
>> Li Deng: Speech, image, human sensory, and then maybe gesture, maybe touch, that kind of
thing. I think deep learning will get there -- I have some theory as to why that happens. Partially
it's because if you look at the existing human perceptual systems, they do this very well, and the
neural network really simulates many of their properties, so there's no reason it shouldn't do well.
And be careful: if the data has an adversarial nature, or if the data has some very odd variability,
which I won't go through here -- we have tried some examples -- be very careful. It may win or may
not win. So for those of you who are interested in these kinds of problems, security problems, pay
attention to the new Kaggle competition. When the results come out, you'll probably know better. I
could be wrong, okay? I'm actually curious. I worked on this with some very strong people, and on
that task we haven't gotten any gain, and if anybody gets some better results, that will be very
interesting. I could be wrong. I'm very open to it. But so far, I am not wrong yet. That's before that
competition result
came out. Now, issues for near-future deep learning. This is I think my last slide. Now, for
perceptual tasks that I showed you earlier, speech, image, video, gesture, etc., now, the issue
here is a very important one. Everybody in this community asks these questions. None of them has
answers. I'm just throwing them out to make you think, and these are all issues I myself think a lot
about. Every time I go to a conference, I look for answers, for whether they address any of these
issues. If they do, it will be interesting; it could be surprising. So for the supervised data -- speech
has lots of supervised data, and image
has a lot of supervised data. The issue is what will be the limit for the growing accuracy with
respect to increased amounts of labeled data? So far, it hasn't reached the limit. Speech kind of
reached the limit. We can see that if you get more data, you don't get as much gain. Now, for
image, we haven't seen that limit yet. You get more data, you still can get better result. So we
don't know where the limit is. Now, once we reach the limit, we want to ask: beyond this limit,
when labeled data, which are typically very expensive to collect, become uneconomical to collect in
ever larger amounts from a company viewpoint, will novel and effective unsupervised machine
learning emerge, and what will it look like -- a deep generative model or some other kind of model?
Many people are actually working on this from an academic viewpoint. So when you go to an
academic conference and you see people doing unsupervised models, generative models, don't
dismiss them, even though they haven't shown any great results compared with the supervised
learning I showed you earlier. Every time I go to a conference, I go to these sessions; I skip the
sessions on things I already know about, because these might show some good result after the limit
is reached, and
the limit probably will be reached within two or three years of time. That's my expectation.
Now, for cognitive tasks, including natural language, reasoning, knowledge, which I showed you
a little bit about natural language, the issue is whether supervised deep learning -- machine
translation, for example -- will beat the non-deep-learning state-of-the-art systems, the way it did for
speech and image recognition. So far, for machine translation, there's a beautiful piece of work that
came out from Google just about two months ago, from this conference called -- three months ago.
Just a beautiful piece of work. They matched the best result from the competition, but they haven't
managed to beat it yet. So every time we get together at a conference, they keep telling me that
they've got a better result. I never quite trust that, but so far it hasn't been shown to beat the state
of the art, so we don't know whether it will actually do as well. That's because this is a harder task
than the perceptual one: the cognitive tasks are the higher level of brain activity, and the perceptual
tasks are the lower level of brain activity.
Now, another issue is how to distill the distant supervision signal for supervised deep learning. The
example is the DSSM: we've already picked much of the low-hanging fruit there, and the question is
what is the best way to exploit this distilled information -- web search belongs to this class of
problem. You have to think hard about how to get that information into your system, and I gave you
the example of how the DSSM exploits that information; that is the low-hanging fruit. Once you
realize that the collected click information will help you to increase your numerator and decrease
your denominator, and you know how to do that, you get a lot of bang out of everything. That's a
relatively easy problem, and as a matter of fact, our deep learning engines are pretty much the
same; we just use different supervision signals, and the objective function we obtain in the same
way. And the next
question is whether the dense vector embedding, which I showed you earlier -- I didn't use that
word, embedding; the embedding means you take all the symbols and embed them into a vector --
whether the vector embedding, a distributed representation, will be sufficient for language
problems. I personally believe, based upon some of the literature that I have seen, that the answer
is quite likely yes, but the dimension has to be very big. And the alternative question is whether we
need to directly encode or decode the syntactic and semantic structure of the language. Until about
six months ago, my feeling was yes -- we need somehow to be able to recover the structure so that
you can do reasoning. After I read all of these, Google's papers, Facebook's, a few just beautiful sets
of papers, I'm less sure; we haven't gotten a chance to do research in this area ourselves.
>>: Your language translation has to do that, for example.
>> Li Deng: No, no.
>>: You don't have to recover the semantics and then reconvert them?
>> Li Deng: No, just the recurrent network, which simply absorbs all this information, and that's
the one that I mentioned. This is amazing. The first time we read the paper, we said that's too
good to be true. It turned out to be true. But anyway, this is an aside to the question. Now, for big
data analytic tasks, which I don't have time to go through in detail, the question is whether vector
embedding is sufficient, just like in language. It's not clear here. Should the embedding method be
different from the one for language? It's not clear. What are the right distant supervision signals to
exploit for these kinds of tasks? We don't know. There are also data privacy issues, and whether we
need to do all this kind of analysis with embedding encryption -- basically, when I say near future, I
mean two or three years from now. People will go through all these issues. I expect half of these
questions will be
answered within the next two or three years. So that's the end of my talk. I'm open and ready to
answer any questions. I'll give you some lists of selected publications from our group and from
Microsoft -- and this is only a selection, selected publications over here. Thank you very much. Yes,
question.
>>: In one of the slides, where you showed the deep neural network architecture, its
dimensionality is actually continuously decreasing from 100 million to 500K, so it's always
decreasing. From my experience with some image classification work, one -- like the first layer,
if you do an expansion and then go down.
>> Li Deng: That might work. That is called the bottleneck feature, but for images that may be
true. That may be true. But if you read the --
>>: I've seen the current results with that.
>> Li Deng: But if you read the papers from this recent competition, by Google, by our Microsoft
Asia people, it's just a standard convolutional network. You do convolution, you do pooling, you do
convolution and then pooling again, which does give you --
>>: I think the same effect could possibly be replicated by two layers where you end up projecting
up and then go down?
>> Li Deng: I don't believe so, because this task has been done by how many, dozens and
dozens of groups. They have been doing that since 2012, for three years. I have never seen any
architecture looking like what you describe. So I think what you want to do is actually try your
architecture on one of the ImageNet tasks. If it gives you better results, I think you will be very
famous. Yes.
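For reference, this is a minimal sketch of the convolution-pooling pattern just described, written here with PyTorch purely for illustration; the channel counts, kernel sizes, and the 1,000-way output are assumptions in the style of an ImageNet classifier, not any specific published model.

```python
import torch
import torch.nn as nn

# A small stack following the conv -> pool -> conv -> pool pattern,
# ending in a classifier over 1,000 categories (ImageNet-style).
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # convolution over the RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling halves the resolution
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # convolution again
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling again
    nn.AdaptiveAvgPool2d(1),                       # collapse the remaining spatial grid
    nn.Flatten(),
    nn.Linear(128, 1000),                          # 1,000-way class scores
)

scores = model(torch.randn(1, 3, 224, 224))        # one fake 224x224 image
print(scores.shape)                                # torch.Size([1, 1000])
```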
>>: What is the prospect on regression-type problems? All of the things that you mentioned are
more classification type.
>> Li Deng: Classification?
>>: Yes, more regression-type problem.
>> Li Deng: Regression problems? So actually, the DSSM that I talked about is a ranking problem.
Well, it's not really regression; it's called a ranking problem. Now, for regression problems, this
approach is just as good: the only difference to be made is in the top layer, where rather than a
softmax you just use a mean squared error.
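As a small illustration of that remark (a sketch with made-up sizes, not anything from the talk's systems), the same hidden stack can be topped either with class scores trained against a softmax cross-entropy loss or with a single linear output trained with mean squared error:

```python
import torch
import torch.nn as nn

# Shared hidden stack; only the top layer and loss change between the two tasks.
hidden = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())

# Classification head: class scores fed to a softmax-style cross-entropy loss.
classifier = nn.Sequential(hidden, nn.Linear(64, 5))
cls_loss = nn.CrossEntropyLoss()(classifier(torch.randn(8, 20)),
                                 torch.randint(0, 5, (8,)))

# Regression head: a single linear output trained with mean squared error.
regressor = nn.Sequential(hidden, nn.Linear(64, 1))
reg_loss = nn.MSELoss()(regressor(torch.randn(8, 20)),
                        torch.randn(8, 1))

print(float(cls_loss), float(reg_loss))
```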
>>: But are there any dramatically good results?
>> Li Deng: So far, we don't have any competition results for regression. If there were, I'm sure
you would hear about them. Yes.
>>: So the neural network needs a lot of data, so what's a lower bound of the data? Like how
many images?
>> Li Deng: Okay, so for ImageNet, it's about 1.2 million images, which was considered big
before, but compared to speech, it's very small. For speech, the data that most people use goes
up to about a billion samples, and that's typical of the speech systems we have seen. ImageNet is
about 1 million, but there are some other tests with more data, and actually, the more the better.
But the whole point is that at about 1.2 million images, deep learning already showed a dramatic
advantage, so I expect that the limit is lower than 1 million. And in some of the tasks, like the
Kaggle competition for the chemical or drug discovery task that I showed, the number is only about
40,000 or something, and deep learning did better than anything else, but they had to do a little
more work on how to regularize. If you don't have a lot of data, you have to do more work. The
[indiscernible] machine does more work.
>>: Okay, so just one general question. For the training, my understanding is, for images, it can be
rotation, scaling, translation, distortion, all kinds of stuff. For speech, it can be different dialects or
different noise -- how can we know when the model is overfitting, or for example, is --
>> Li Deng: I know. So convolution, for example, in images only deals with shifts.
That's good. You are talking about so many other kinds of variability, and that's why you have many
of what's called feature maps, not just one single map. If you have one single feature map, you can
only deal with one type of distortion. You have many -- typically you get about 20 to 200, somewhere
around there. And for the different feature maps, the hope is that, because you initialize them
differently, they will capture different kinds of variability, and that probably answers that part. As for
overfitting: okay, if you have lots of data, overfitting may not be a problem. If you have small enough
data, overfitting becomes serious, and there are many techniques developed in the deep learning
community that address that overfitting problem, like the dropout technique. If you have a small
amount of data, always use dropout. If you have a small amount of data, always use pretraining, so
those are [indiscernible], using those types of models to initialize the network. Then you have
partially solved the overfitting problem. When you then do back propagation, you tend to have
fewer overfitting problems, and that will be the right answer.
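A minimal sketch of the dropout idea mentioned above, using PyTorch for illustration; the layer sizes and the 0.5 dropout rate are assumptions. During training a random fraction of hidden units is zeroed on each forward pass, which regularizes the model when data is limited; at evaluation time dropout is switched off.

```python
import torch
import torch.nn as nn

# A small network with dropout after each hidden layer.
model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(), nn.Dropout(p=0.5),  # drop half the units while training
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

x = torch.randn(4, 100)

model.train()                 # dropout active: random units are zeroed each forward pass
out_train = model(x)

model.eval()                  # dropout disabled at test time (outputs are deterministic)
out_eval = model(x)

print(out_train.shape, out_eval.shape)
```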
>>: Okay, thank you.
>>: And we have last one.
>> Li Deng: Yes, question.
>>: Yes. So you mentioned GPUs as being kind of one of the enabling factors, and clearly for
training, there's been some huge resources thrown at these problems, but one of the other trends
is we're seeing companies like NVIDIA put more GPU technology in portable devices, and I'm
just wondering, how does that play out? Are portable devices doing the classification, doing the
speech recognition, or do we always have to maintain connections to a larger cloud? So does
GPU power in a portable device buy it --
>> Li Deng: So most of these things -- take voice search as the base case -- the information, all the
computing, is done in the cloud, and the result gets transferred back. So within the device, you
really don't have to put all those things. But also, I have seen a lot of startup companies in speech
recognition, especially in China, that actually build an embedded deep learning model in the device,
and that requires -- they lose some accuracy, but as long as they find the right application scenario,
losing some accuracy is okay. So I think there is a balance, but in the US, most of the devices I have
seen actually use the cloud.
>>: But looking forward, do you see that shifting?
>> Li Deng: I don't see it in this country, no. I really don't see it, because so far the transmission
cost is so low and the bandwidth is large enough.
>>: And we'll have 5G in a few years.
>> Li Deng: Yes, so I personally don't see that.
>> Zhengyou Zhang: Okay, thank you very much.